Feature #6679

Default Ruby source file encoding to utf-8

Added by Clay Trump almost 2 years ago. Updated over 1 year ago.

[ruby-core:46021]
Status: Closed
Priority: Normal
Assignee: Yui NARUSE
Category: -
Target version: 2.0.0

Description

Let's change the default encoding for Ruby source files from US-ASCII
to UTF-8 in Ruby 2.0

• Convention over Configuration
• Ruby 1.9 forced an explicit encoding for code that was not pure ASCII, so
existing codebases already have magic comments.

In Ruby 2.0, "# encoding: utf-8" can be the default.
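
A minimal sketch of what the proposal means in practice, assuming Ruby 2.1 or later (for Tempfile.create). The helper name source_encoding_with and the global $enc_demo are made up for illustration. It loads two tiny scripts, one without a magic comment and one with, and reports the source encoding the parser assigned to each:

```ruby
require "tempfile"

# Hypothetical helper: write a tiny script whose first line is `first_line`,
# load it, and return the source encoding (__ENCODING__) it was parsed with.
def source_encoding_with(first_line)
  Tempfile.create(["enc_demo", ".rb"]) do |f|
    f.write("#{first_line}\n$enc_demo = __ENCODING__\n")
    f.flush
    load f.path
  end
  $enc_demo
end

# No magic comment: the default applies (UTF-8 on Ruby 2.0 and later).
puts source_encoding_with("# plain comment")

# An explicit magic comment still overrides the default.
puts source_encoding_with("# encoding: ISO-8859-1")
```

Files in another encoding keep working as long as they carry their magic comment, which is the compatibility argument made above.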

utf.pdf (36.3 KB) Clay Trump, 07/01/2012 07:23 AM

utf.pdf (37.1 KB) Clay Trump, 07/03/2012 12:23 AM

Associated revisions

Revision 37485
Added by Yui NARUSE over 1 year ago

  • ruby.c (load_file_internal): set default source encoding as
    UTF-8 instead of US-ASCII. [Feature #6679]

  • parse.y (parser_initialize): set default parser encoding as
    UTF-8 instead of US-ASCII.

Revision 37533
Added by Nobuyoshi Nakada over 1 year ago

ruby-additional.el: set encoding

  • misc/ruby-additional.el (ruby-mode-set-encoding): now encoding needs to be set always explicitly actually. [Feature #6679]

History

#1 Updated by Clay Trump almost 2 years ago

Oh, and here's a slide for the feature meetup. It's ugly, I know.

#3 Updated by Yusuke Endoh almost 2 years ago

  • Status changed from Open to Assigned
  • Assignee set to Yui NARUSE

Received. Thank you!

Naruse-san, what do you think?

Yusuke Endoh mame@tsg.ne.jp

#4 Updated by Nobuyoshi Nakada almost 2 years ago

claytrump (Clay Trump) wrote:

• Ruby 1.9 forced encoding for code that was not pure ASCII,

Could you elaborate?

#5 Updated by Martin Dürst almost 2 years ago

I think this is the right direction to go, and doing it for a major
version (2.0) is the right timing.

Maybe we can also on this occasion abolish the -U option and make its
action the default? Matz proposed that quite a long time ago, when we
were moving to 1.9.

Regards, Martin.

#6 Updated by Clay Trump almost 2 years ago

claytrump (Clay Trump) wrote:

• Ruby 1.9 forced encoding for code that was not pure ASCII,

Could you elaborate?

Sure. Ruby 1.9 forced us to specify the encoding for code that was not pure
ASCII.

I'm no expert, but I think that in Ruby 1.8, you could write code using an
encoding compatible with ASCII, like ISO-8859-1. Things would kind of work: it
would output the expected sequence of bytes, etc., at least as long as
you were using and expecting that encoding everywhere.

If Ruby 1.9 had assumed UTF-8, that legacy code would now output the wrong
stuff, and you might not notice right away. Subtle errors, etc. So it's
cool that Ruby 1.9 raises an error instead; you need to state the encoding.

So any code like that has the right # coding comment by now.

Attached a slide with clearer sentence

#7 Updated by Clay Trump almost 2 years ago

On Mon, Jul 2, 2012 at 2:34 AM, "Martin J. Dürst" duerst@it.aoyama.ac.jp wrote:

I think this is the right direction to go, and doing it for a major
version (2.0) is the right timing.

Maybe we can also on this occasion abolish the -U option and make its
action the default? Matz proposed that quite a long time ago, when we were
moving to 1.9.

Cool, sounds like a plan.

#8 Updated by Eric Hodel almost 2 years ago

duerst (Martin Dürst) wrote:

I think this is the right direction to go, and doing it for a major
version (2.0) is the right timing.

Maybe we can also on this occasion abolish the -U option and make its
action the default? Matz proposed that quite a long time ago, when we
were moving to 1.9.

#5206 (make -K warn) may be relevant to removing -U

#9 Updated by Yui NARUSE almost 2 years ago

= Default Ruby source file encoding to utf-8

It can almost keep compatibility, but it breaks:
* escaped bytes in string literals like "a\xff": their encoding changes from ASCII-8BIT to UTF-8.
* escaped bytes in regexp literals, in the same way.
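
A small sketch of the string-literal case, relying on the fact that eval parses a string using that string's own encoding as the source encoding. The helper name eval_with_source_encoding is hypothetical; it stands in for "a file with this source encoding":

```ruby
# Hypothetical helper: evaluate `code` as if it were written in a file whose
# source encoding is `source_encoding` (eval uses the string's own encoding).
def eval_with_source_encoding(code, source_encoding)
  eval(code.dup.force_encoding(source_encoding))
end

literal = '"a\xff"' # single quotes keep the backslash escape for the parser

old_default = eval_with_source_encoding(literal, "US-ASCII")
new_default = eval_with_source_encoding(literal, "UTF-8")

p old_default.encoding        # the binary encoding (ASCII-8BIT)
p new_default.encoding        # UTF-8
p new_default.valid_encoding? # false: 0xFF is not valid in UTF-8
```

Code that relied on such literals being binary can keep the old behavior with a "# coding: US-ASCII" magic comment or by calling force_encoding on the string.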

= -U as default

What is the expected merit of this?

#10 Updated by Rodrigo Rosenfeld Rosas almost 2 years ago

You could at least consider it for 3.0 and yield a deprecation warning for such strings in 2.0... Although I think many more people are currently complaining about UTF-8 not being the default than would complain because they were using ASCII-8BIT-encoded escaped chars in strings.

#11 Updated by Martin Dürst almost 2 years ago

On 2012/07/03 10:33, naruse (Yui NARUSE) wrote:

Issue #6679 has been updated by naruse (Yui NARUSE).

= Default Ruby source file encoding to utf-8

It can almost keep compatibility, but it breaks:
* escaped bytes in string literals like "a\xff": their encoding changes from ASCII-8BIT to UTF-8.
* escaped bytes in regexp literals, in the same way.

Good point. Thinking about it, the rule that \x in strings means these
strings are in the source encoding seems to work well for non-UTF-8
strings. For UTF-8, because we have \u, we could make strings containing
\x be ASCII-8BIT.

But maybe that's too complicated.
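
To illustrate the two escapes being compared, here is a rough sketch of existing behavior, not of the rule floated above. The helper name eval_with_source_encoding is hypothetical and relies on eval using the string's own encoding as the source encoding:

```ruby
# Hypothetical helper: evaluate `code` as if its file had `source_encoding`.
def eval_with_source_encoding(code, source_encoding)
  eval(code.dup.force_encoding(source_encoding))
end

# \x takes the source encoding when the file is in a non-UTF-8 encoding.
latin1 = eval_with_source_encoding('"\xe9"', "ISO-8859-1")

# \u names a code point, so the literal is UTF-8 whatever the source encoding.
via_u = eval_with_source_encoding('"\u00e9"', "US-ASCII")

p latin1.encoding                 # ISO-8859-1
p via_u.encoding                  # UTF-8
p latin1.encode("UTF-8") == via_u # true: both are the character U+00E9
```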

Regards, Martin.

#12 Updated by Yusuke Endoh over 1 year ago

Clay Trump,

I'm happy to inform you that matz has (basically) accepted your
proposal.

But note that the decision may be cancelled if the compatibility
impact is considered serious.
Naruse-san will implement it and experiment with it.

Yusuke Endoh mame@tsg.ne.jp

#13 Updated by Anonymous over 1 year ago

If this seems to have too large a performance impact, please consider making it a configuration option.

Making it a configuration option may be nice anyway.

#14 Updated by Rodrigo Rosenfeld Rosas over 1 year ago

You mean the default would be UTF-8, right?

In Ruby I believe happiness > performance :)

On 23-07-2012 10:57, Perry Smith wrote:

If this seems to have too large a performance impact, please consider making it a configuration option.

Making it a configuration option may be nice anyway.

#15 Updated by Yui NARUSE over 1 year ago

Anonymous wrote:

If this seems to have too large a performance impact, please consider making it a configuration option.

Making it a configuration option may be nice anyway.

Benchmark it yourself, and if it shows a performance impact, please report it.

#16 Updated by Koichi Sasada over 1 year ago

(2012/07/23 22:57), Perry Smith wrote:

Making it a configuration option may be nice anyway.

+1

--
// SASADA Koichi at atdot dot net

#17 Updated by Yui NARUSE over 1 year ago

mame (Yusuke Endoh) wrote:

I'm happy to inform you that matz has (basically) accepted your
proposal.

But note that the decision may be cancelled if the compatibility
impact is considered serious.
Naruse-san will implement it and experiment with it.

diff --git a/lib/rexml/encoding.rb b/lib/rexml/encoding.rb
index d1d5172..23e912f 100644
--- a/lib/rexml/encoding.rb
+++ b/lib/rexml/encoding.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
module REXML
module Encoding
# ID ---> Encoding name
diff --git a/lib/rexml/source.rb b/lib/rexml/source.rb
index 112393c..7ecb98f 100644
--- a/lib/rexml/source.rb
+++ b/lib/rexml/source.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'rexml/encoding'

module REXML
diff --git a/parse.y b/parse.y
index 049e356..00e80a2 100644
--- a/parse.y
+++ b/parse.y
@@ -10558,7 +10558,7 @@ parser_initialize(struct parser_params *parser)
#ifdef YYMALLOC
parser->heap = NULL;
#endif
- parser->enc = rb_usascii_encoding();
+ parser->enc = rb_utf8_encoding();
}

#ifdef RIPPER
diff --git a/ruby.c b/ruby.c
index ab4b674..5ab5ca2 100644
--- a/ruby.c
+++ b/ruby.c
@@ -1630,7 +1630,7 @@ load_file_internal(VALUE arg)
enc = rb_locale_encoding();
}
else {
- enc = rb_usascii_encoding();
+ enc = rb_utf8_encoding();
}
if (NIL_P(f)) {
f = rb_str_new(0, 0);
diff --git a/test/base64/test_base64.rb b/test/base64/test_base64.rb
index 9ae54cb..c5e61b3 100644
--- a/test/base64/test_base64.rb
+++ b/test/base64/test_base64.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require "test/unit"
require "base64"

diff --git a/test/dl/test_import.rb b/test/dl/test_import.rb
index 26b9f3c..41def7c 100644
--- a/test/dl/test_import.rb
+++ b/test/dl/test_import.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require_relative 'test_base'
require 'dl/import'

diff --git a/test/logger/test_logger.rb b/test/logger/test_logger.rb
index 8fc02f8..100c1ea 100644
--- a/test/logger/test_logger.rb
+++ b/test/logger/test_logger.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'logger'
require 'tempfile'
diff --git a/test/net/http/test_http.rb b/test/net/http/test_http.rb
index fc7bfa9..cb8bf44 100644
--- a/test/net/http/test_http.rb
+++ b/test/net/http/test_http.rb
@@ -1,5 +1,4 @@
-# $Id$
-
+# coding: US-ASCII
require 'test/unit'
require 'net/http'
require 'stringio'
diff --git a/test/net/http/test_httpresponse.rb b/test/net/http/test_httpresponse.rb
index d57614b..ccff224 100644
--- a/test/net/http/test_httpresponse.rb
+++ b/test/net/http/test_httpresponse.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'net/http'
require 'test/unit'
require 'stringio'
diff --git a/test/openssl/test_x509name.rb b/test/openssl/test_x509name.rb
index 90c0992..968ad97 100644
--- a/test/openssl/test_x509name.rb
+++ b/test/openssl/test_x509name.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require_relative 'utils'

if defined?(OpenSSL)
diff --git a/test/psych/test_yaml.rb b/test/psych/test_yaml.rb
index 807c058..796a44f 100644
--- a/test/psych/test_yaml.rb
+++ b/test/psych/test_yaml.rb
@@ -1,4 +1,4 @@
-# -*- mode: ruby; ruby-indent-level: 4; tab-width: 4 -*-
+# -*- coding: us-ascii; mode: ruby; ruby-indent-level: 4; tab-width: 4 -*-
# vim:sw=4:ts=4
# $Id$
#
diff --git a/test/psych/visitors/test_to_ruby.rb b/test/psych/visitors/test_to_ruby.rb
index 5b0702c..ee473c9 100644
--- a/test/psych/visitors/test_to_ruby.rb
+++ b/test/psych/visitors/test_to_ruby.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'psych/helper'

module Psych
diff --git a/test/ripper/test_ripper.rb b/test/ripper/test_ripper.rb
index 72dc52d..1d6e893 100644
--- a/test/ripper/test_ripper.rb
+++ b/test/ripper/test_ripper.rb
@@ -17,7 +17,7 @@ class TestRipper::Ripper < Test::Unit::TestCase
end

def test_encoding
- assert_equal Encoding::US_ASCII, @ripper.encoding
+ assert_equal Encoding::UTF_8, @ripper.encoding
end

def test_end_seen_eh
diff --git a/test/ruby/test_array.rb b/test/ruby/test_array.rb
index fff55e1..856a994 100644
--- a/test/ruby/test_array.rb
+++ b/test/ruby/test_array.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require_relative 'envutil'

diff --git a/test/ruby/test_io.rb b/test/ruby/test_io.rb
index d1edaaf..93967c6 100644
--- a/test/ruby/test_io.rb
+++ b/test/ruby/test_io.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'tmpdir'
require "fcntl"
diff --git a/test/ruby/test_io_m17n.rb b/test/ruby/test_io_m17n.rb
index b6358e0..3cc8437 100644
--- a/test/ruby/test_io_m17n.rb
+++ b/test/ruby/test_io_m17n.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'tmpdir'
require 'timeout'
diff --git a/test/ruby/test_m17n.rb b/test/ruby/test_m17n.rb
index dfcaa94..ce94886 100644
--- a/test/ruby/test_m17n.rb
+++ b/test/ruby/test_m17n.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require_relative 'envutil'

diff --git a/test/ruby/test_pack.rb b/test/ruby/test_pack.rb
index c72035c..4810c6e 100644
--- a/test/ruby/test_pack.rb
+++ b/test/ruby/test_pack.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'

class TestPack < Test::Unit::TestCase
diff --git a/test/ruby/test_parse.rb b/test/ruby/test_parse.rb
index 563e2ce..b5d31db 100644
--- a/test/ruby/test_parse.rb
+++ b/test/ruby/test_parse.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'stringio'

diff --git a/test/ruby/test_regexp.rb b/test/ruby/test_regexp.rb
index 7e31e99..781af50 100644
--- a/test/ruby/test_regexp.rb
+++ b/test/ruby/test_regexp.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'envutil'

diff --git a/test/syck/test_yaml.rb b/test/syck/test_yaml.rb
index 132bc92..c286b03 100644
--- a/test/syck/test_yaml.rb
+++ b/test/syck/test_yaml.rb
@@ -1,4 +1,4 @@
-# -*- mode: ruby; ruby-indent-level: 4; tab-width: 4; indent-tabs-mode: t -*-
+# -*- coding: us-ascii; mode: ruby; ruby-indent-level: 4; tab-width: 4; indent-tabs-mode: t -*-
# vim:sw=4:ts=4
# $Id$
#
diff --git a/test/syslog/test_syslog_logger.rb b/test/syslog/test_syslog_logger.rb
index 9224296..d382b4a 100644
--- a/test/syslog/test_syslog_logger.rb
+++ b/test/syslog/test_syslog_logger.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'tempfile'
require 'syslog/logger'
diff --git a/test/webrick/test_cgi.rb b/test/webrick/test_cgi.rb
index d930c26..282183e 100644
--- a/test/webrick/test_cgi.rb
+++ b/test/webrick/test_cgi.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require_relative "utils"
require "webrick"
require "test/unit"

#18 Updated by Martin Dürst over 1 year ago

On 2012/07/24 3:27, naruse (Yui NARUSE) wrote:

Issue #6679 has been updated by naruse (Yui NARUSE).

Anonymous wrote:

If this seems to have too large a performance impact, please consider making it a configuration option.

Making it a configuration option may be nice anyway.

Benchmark it yourself, and if it shows a performance impact, please report it.

I agree. For a file that's ASCII only, I can't imagine that performance
decreases much (but of course I might be wrong). For a file that's
UTF-8, there's no change. Same for a file that's in another encoding
(because that can't use the default).

Regards, Martin.

#19 Updated by Yui NARUSE over 1 year ago

ko1 (Koichi Sasada) wrote:

(2012/07/23 22:57), Perry Smith wrote:

Making it a configuration option may be nice anyway.

+1

diff --git a/ruby.c b/ruby.c
index ab4b674..d6a8a91 100644
--- a/ruby.c
+++ b/ruby.c
@@ -702,6 +702,7 @@ static long
proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
{
long n, argc0 = argc;
+ int opt_Kp = FALSE;
const char *s;

if (argc == 0)
@@ -909,6 +910,7 @@ proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
break;
}
if (enc_name) {
+ opt_Kp = TRUE;
opt->src.enc.name = rb_str_new2(enc_name);
if (!opt->ext.enc.name)
opt->ext.enc.name = opt->src.enc.name;
@@ -1013,10 +1015,8 @@ proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
if (!*(s = ++p)) break;
set_encoding_part(internal);
if (!*(s = ++p)) break;
-#if defined ALLOW_DEFAULT_SOURCE_ENCODING && ALLOW_DEFAULT_SOURCE_ENCODING
set_encoding_part(source);
if (!*(s = ++p)) break;
-#endif
rb_raise(rb_eRuntimeError, "extra argument for %s: %s",
(arg[1] == '-' ? "--encoding" : "-E"), s);
# undef set_encoding_part
@@ -1028,11 +1028,9 @@ proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
else if (is_option_with_arg("external-encoding", Qfalse, Qtrue)) {
set_external_encoding_once(opt, s, 0);
}
-#if defined ALLOW_DEFAULT_SOURCE_ENCODING && ALLOW_DEFAULT_SOURCE_ENCODING
else if (is_option_with_arg("source-encoding", Qfalse, Qtrue)) {
set_source_encoding_once(opt, s, 0);
}
-#endif
else if (strcmp("version", s) == 0) {
if (envopt) goto noenvopt_long;
opt->dump |= DUMP_BIT(version);
@@ -1097,6 +1095,9 @@ proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
}

switch_end:
+ if (opt_Kp)
+ rb_warning("-K is specified; it is for 1.8 compatibility and may cause odd behavior");
+
return argc0 - argc;
}

@@ -1268,9 +1269,6 @@ process_options(int argc, char **argv, struct cmdline_options *opt)
opt->intern.enc.name = int_enc_name;
}

- if (opt->src.enc.name)
- rb_warning("-K is specified; it is for 1.8 compatibility and may cause odd behavior");
-
if (opt->dump & DUMP_BIT(version)) {
ruby_show_version();
return Qtrue;

#20 Updated by Yui NARUSE over 1 year ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r37485.
Clay, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • ruby.c (load_file_internal): set default source encoding as
    UTF-8 instead of US-ASCII. [Feature #6679]

  • parse.y (parser_initialize): set default parser encoding as
    UTF-8 instead of US-ASCII.

#21 Updated by Yusuke Endoh over 1 year ago

  • Target version set to 2.0.0
