Feature #6679

Default Ruby source file encoding to utf-8

Added by Clay Trump about 3 years ago. Updated over 2 years ago.

[ruby-core:46021]
Status:Closed
Priority:Normal
Assignee:Yui NARUSE

Description

Let's change the default encoding for Ruby source files from US-ASCII
to UTF-8 in Ruby 2.0

• Convention over Configuration
• Ruby 1.9 forced encoding for code that was not pure ASCII, so
existing codebase already has magic comments.

In Ruby 2.0, "# encoding: utf-8" can be the default.
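
As a rough illustration of what the proposal changes, the sketch below reports the source encoding the parser sees. It assumes that eval takes its source encoding from the encoding of the string it is given, so one script can imitate both a US-ASCII and a UTF-8 default. The helper name source_encoding_for is hypothetical and not part of Ruby.

```ruby
# Hypothetical helper (not part of Ruby): report the source encoding the
# parser uses for a snippet handed to eval in the given encoding.
def source_encoding_for(encoding)
  eval("__ENCODING__".encode(encoding))
end

# Imitates the old default (no magic comment on 1.9).
us_ascii_source = source_encoding_for(Encoding::US_ASCII)
# Imitates the proposed default (same as writing "# encoding: utf-8").
utf8_source = source_encoding_for(Encoding::UTF_8)

puts "US-ASCII source: #{us_ascii_source}"
puts "UTF-8 source:    #{utf8_source}"
```

In a real file the same check is just `p __ENCODING__` with no magic comment at the top.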

utf.pdf (36.3 KB) Clay Trump, 07/01/2012 07:23 AM

utf.pdf (37.1 KB) Clay Trump, 07/03/2012 12:23 AM

Associated revisions

Revision 37485
Added by Yui NARUSE over 2 years ago

  • ruby.c (load_file_internal): set default source encoding as
    UTF-8 instead of US-ASCII. [Feature #6679]

  • parse.y (parser_initialize): set default parser encoding as
    UTF-8 instead of US-ASCII.

Revision 37533
Added by Nobuyoshi Nakada over 2 years ago

ruby-additional.el: set encoding

  • misc/ruby-additional.el (ruby-mode-set-encoding): now encoding needs to be set always explicitly actually. [Feature #6679]

History

#1 Updated by Clay Trump about 3 years ago

Oh, and here's a slide for the feature meetup. It's ugly, I know.

#3 Updated by Yusuke Endoh about 3 years ago

  • Status changed from Open to Assigned
  • Assignee set to Yui NARUSE

Received. Thank you!

Naruse-san, what do you think?

Yusuke Endoh mame@tsg.ne.jp

#4 Updated by Nobuyoshi Nakada about 3 years ago

claytrump (Clay Trump) wrote:

• Ruby 1.9 forced encoding for code that was not pure ASCII,

Could you elaborate?

#5 Updated by Martin Dürst about 3 years ago

I think this is the right direction to go, and doing it for a major
version (2.0) is the right timing.

Maybe we can also on this occasion abolish the -U option and make its
action the default? Matz proposed that quite a long time ago, when we
were moving to 1.9.

Regards, Martin.

#6 Updated by Clay Trump about 3 years ago

claytrump (Clay Trump) wrote:

• Ruby 1.9 forced encoding for code that was not pure ASCII,

Could you elaborate?

Sure. Ruby 1.9 forced us to specify the encoding for code that was not pure
ASCII.

I'm no expert, but I think that in Ruby 1.8, you could write code using an
encoding compatible with ASCII like 8859-1. Things would kind of work, it
would output the expected sequence of bytes, etc., at least as long as
you're using and expecting that encoding everywhere.

If Ruby 1.9 had assumed utf-8, that legacy code would now output the wrong
stuff, and you might not notice right away. Subtle errors, etc. So it's
cool that in Ruby 1.9 it produces an error; you need to put the encoding.

So any code like that has the right # coding comment by now.

Attached a slide with a clearer sentence.

#7 Updated by Clay Trump about 3 years ago

On Mon, Jul 2, 2012 at 2:34 AM, "Martin J. Dürst" duerst@it.aoyama.ac.jp wrote:

I think this is the right direction to go, and doing it for a major
version (2.0) is the right timing.

Maybe we can also on this occasion abolish the -U option and make its
action the default? Matz proposed that quite a long time ago, when we were
moving to 1.9.

Cool, sounds like a plan.

#8 Updated by Eric Hodel about 3 years ago

duerst (Martin Dürst) wrote:

I think this is the right direction to go, and doing it for a major
version (2.0) is the right timing.

Maybe we can also on this occasion abolish the -U option and make its
action the default? Matz proposed that quite a long time ago, when we
were moving to 1.9.

#5206 (make -K warn) may be relevant to removing -U

#9 Updated by Yui NARUSE about 3 years ago

= Default Ruby source file encoding to utf-8

It can almost keep compatibility, but it breaks:
* escaped bytes in string literals like "a\xff": their encoding changes from ASCII-8BIT to UTF-8.
* escaped bytes in regexp literals, as above
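
The string-literal case can be reproduced with a small sketch. It assumes, as an illustration only, that evaluating the literal through eval in a given encoding behaves like a file with that source encoding; it covers the string case and not the regexp one.

```ruby
# Single quotes keep the backslash escape as source text for eval to parse.
LITERAL_SOURCE = '"a\xff"'

# Parsed as if the file were US-ASCII (the default before this change).
old_style = eval(LITERAL_SOURCE.encode(Encoding::US_ASCII))
# Parsed as if the file were UTF-8 (the proposed default).
new_style = eval(LITERAL_SOURCE.encode(Encoding::UTF_8))

puts old_style.encoding          # encoding of the literal under US-ASCII source
puts new_style.encoding          # encoding of the literal under UTF-8 source
puts new_style.valid_encoding?   # whether the bytes form valid UTF-8
```

The bytes are identical in both cases; only the encoding tag differs, so code that compares such a literal against binary data may need an explicit force_encoding.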

= -U as default

What is the expected merit of this?

#10 Updated by Rodrigo Rosenfeld Rosas about 3 years ago

You could at least consider it for 3.0 and emit a deprecation warning for such strings in 2.0... Although I think many more people are currently complaining about UTF-8 not being the default than would complain because they were using ASCII-8BIT-encoded escaped chars in strings.

#11 Updated by Martin Dürst about 3 years ago

On 2012/07/03 10:33, naruse (Yui NARUSE) wrote:

Issue #6679 has been updated by naruse (Yui NARUSE).

= Default Ruby source file encoding to utf-8

It can almost keep compatibility, but it breaks:
* escaped bytes in string literals like "a\xff": their encoding changes from ASCII-8BIT to UTF-8.
* escaped bytes in regexp literals, as above

Good point. Thinking about it, the rule that \x in strings means these
strings are in the source encoding seems to work well for non-UTF-8
strings. For UTF-8, because we have \u, we could make strings containing
\x be ASCII-8BIT.

But maybe that's too complicated.

Regards, Martin.
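
For comparison, here is a minimal sketch of the two escapes as they can be used explicitly. It does not implement the rule suggested above; it only shows that \u yields a UTF-8 string, and that String#b (available from Ruby 2.0) gives an ASCII-8BIT copy when raw bytes are wanted.

```ruby
snowman = "\u2603"               # \u escape: always a UTF-8 string
raw_bytes = "\xE2\x98\x83".b     # String#b: ASCII-8BIT copy of the same bytes

puts snowman.encoding
puts raw_bytes.encoding
puts raw_bytes.bytes == snowman.bytes

# Re-tagging the raw bytes as UTF-8 gives back the same character.
puts raw_bytes.dup.force_encoding(Encoding::UTF_8) == snowman
```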

#12 Updated by Yusuke Endoh about 3 years ago

Clay Trump,

I'm happy to inform you that matz has (basically) accepted your
proposal.

But note that the decision may be cancelled if the compatibility
impact is considered serious.
Naruse-san will implement and experiment with it.

Yusuke Endoh mame@tsg.ne.jp

#13 Updated by Anonymous about 3 years ago

If this seems to be too large of a performance impact, please consider making it a configuration option.

Making it a configuration option may be nice anyway.

#14 Updated by Rodrigo Rosenfeld Rosas about 3 years ago

You mean the default would be UTF-8 right?

In Ruby I believe happiness > performance :)

On 23-07-2012 10:57, Perry Smith wrote:

If this seems to be too large of a performance impact, please consider making it a configuration option.

Making it a configuration option may be nice anyway.

#15 Updated by Yui NARUSE about 3 years ago

Anonymous user wrote:

If this seems to be too large of a performance impact, please consider making it a configuration option.

Making it a configuration option may be nice anyway.

Benchmark it yourself, and if it shows a performance impact, please report it.

#16 Updated by Koichi Sasada about 3 years ago

(2012/07/23 22:57), Perry Smith wrote:

Making it a configuration option may be nice anyway.

+1

--
// SASADA Koichi at atdot dot net

#17 Updated by Yui NARUSE about 3 years ago

mame (Yusuke Endoh) wrote:

I'm happy to inform you that matz has (basically) accepted your
proposal.

But note that the decision may be cancelled if the compatibility
impact is considered serious.
Naruse-san will implement and experiment with it.

diff --git a/lib/rexml/encoding.rb b/lib/rexml/encoding.rb
index d1d5172..23e912f 100644
--- a/lib/rexml/encoding.rb
+++ b/lib/rexml/encoding.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
module REXML
module Encoding
# ID ---> Encoding name
diff --git a/lib/rexml/source.rb b/lib/rexml/source.rb
index 112393c..7ecb98f 100644
--- a/lib/rexml/source.rb
+++ b/lib/rexml/source.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'rexml/encoding'

module REXML
diff --git a/parse.y b/parse.y
index 049e356..00e80a2 100644
--- a/parse.y
+++ b/parse.y
@@ -10558,7 +10558,7 @@ parser_initialize(struct parser_params *parser)
#ifdef YYMALLOC
parser->heap = NULL;
#endif
- parser->enc = rb_usascii_encoding();
+ parser->enc = rb_utf8_encoding();
}

#ifdef RIPPER
diff --git a/ruby.c b/ruby.c
index ab4b674..5ab5ca2 100644
--- a/ruby.c
+++ b/ruby.c
@@ -1630,7 +1630,7 @@ load_file_internal(VALUE arg)
enc = rb_locale_encoding();
}
else {
- enc = rb_usascii_encoding();
+ enc = rb_utf8_encoding();
}
if (NIL_P(f)) {
f = rb_str_new(0, 0);
diff --git a/test/base64/test_base64.rb b/test/base64/test_base64.rb
index 9ae54cb..c5e61b3 100644
--- a/test/base64/test_base64.rb
+++ b/test/base64/test_base64.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require "test/unit"
require "base64"

diff --git a/test/dl/test_import.rb b/test/dl/test_import.rb
index 26b9f3c..41def7c 100644
--- a/test/dl/test_import.rb
+++ b/test/dl/test_import.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require_relative 'test_base'
require 'dl/import'

diff --git a/test/logger/test_logger.rb b/test/logger/test_logger.rb
index 8fc02f8..100c1ea 100644
--- a/test/logger/test_logger.rb
+++ b/test/logger/test_logger.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'logger'
require 'tempfile'
diff --git a/test/net/http/test_http.rb b/test/net/http/test_http.rb
index fc7bfa9..cb8bf44 100644
--- a/test/net/http/test_http.rb
+++ b/test/net/http/test_http.rb
@@ -1,5 +1,4 @@

-# $Id$

+# coding: US-ASCII
require 'test/unit'
require 'net/http'
require 'stringio'
diff --git a/test/net/http/test_httpresponse.rb b/test/net/http/test_httpresponse.rb
index d57614b..ccff224 100644
--- a/test/net/http/test_httpresponse.rb
+++ b/test/net/http/test_httpresponse.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'net/http'
require 'test/unit'
require 'stringio'
diff --git a/test/openssl/test_x509name.rb b/test/openssl/test_x509name.rb
index 90c0992..968ad97 100644
--- a/test/openssl/test_x509name.rb
+++ b/test/openssl/test_x509name.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require_relative 'utils'

if defined?(OpenSSL)
diff --git a/test/psych/test_yaml.rb b/test/psych/test_yaml.rb
index 807c058..796a44f 100644
--- a/test/psych/test_yaml.rb
+++ b/test/psych/test_yaml.rb
@@ -1,4 +1,4 @@
-# -*- mode: ruby; ruby-indent-level: 4; tab-width: 4 -*-
+# -*- coding: us-ascii; mode: ruby; ruby-indent-level: 4; tab-width: 4 -*-
# vim:sw=4:ts=4
# $Id$
#
diff --git a/test/psych/visitors/test_to_ruby.rb b/test/psych/visitors/test_to_ruby.rb
index 5b0702c..ee473c9 100644
--- a/test/psych/visitors/test_to_ruby.rb
+++ b/test/psych/visitors/test_to_ruby.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'psych/helper'

module Psych
diff --git a/test/ripper/test_ripper.rb b/test/ripper/test_ripper.rb
index 72dc52d..1d6e893 100644
--- a/test/ripper/test_ripper.rb
+++ b/test/ripper/test_ripper.rb
@@ -17,7 +17,7 @@ class TestRipper::Ripper < Test::Unit::TestCase
end

def test_encoding
- assert_equal Encoding::US_ASCII, @ripper.encoding
+ assert_equal Encoding::UTF_8, @ripper.encoding
end

def test_end_seen_eh
diff --git a/test/ruby/test_array.rb b/test/ruby/test_array.rb
index fff55e1..856a994 100644
--- a/test/ruby/test_array.rb
+++ b/test/ruby/test_array.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require_relative 'envutil'

diff --git a/test/ruby/test_io.rb b/test/ruby/test_io.rb
index d1edaaf..93967c6 100644
--- a/test/ruby/test_io.rb
+++ b/test/ruby/test_io.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'tmpdir'
require "fcntl"
diff --git a/test/ruby/test_io_m17n.rb b/test/ruby/test_io_m17n.rb
index b6358e0..3cc8437 100644
--- a/test/ruby/test_io_m17n.rb
+++ b/test/ruby/test_io_m17n.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'tmpdir'
require 'timeout'
diff --git a/test/ruby/test_m17n.rb b/test/ruby/test_m17n.rb
index dfcaa94..ce94886 100644
--- a/test/ruby/test_m17n.rb
+++ b/test/ruby/test_m17n.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require_relative 'envutil'

diff --git a/test/ruby/test_pack.rb b/test/ruby/test_pack.rb
index c72035c..4810c6e 100644
--- a/test/ruby/test_pack.rb
+++ b/test/ruby/test_pack.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'

class TestPack < Test::Unit::TestCase
diff --git a/test/ruby/test_parse.rb b/test/ruby/test_parse.rb
index 563e2ce..b5d31db 100644
--- a/test/ruby/test_parse.rb
+++ b/test/ruby/test_parse.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'stringio'

diff --git a/test/ruby/test_regexp.rb b/test/ruby/test_regexp.rb
index 7e31e99..781af50 100644
--- a/test/ruby/test_regexp.rb
+++ b/test/ruby/test_regexp.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'envutil'

diff --git a/test/syck/test_yaml.rb b/test/syck/test_yaml.rb
index 132bc92..c286b03 100644
--- a/test/syck/test_yaml.rb
+++ b/test/syck/test_yaml.rb
@@ -1,4 +1,4 @@
-# -*- mode: ruby; ruby-indent-level: 4; tab-width: 4; indent-tabs-mode: t -*-
+# -*- coding: us-ascii; mode: ruby; ruby-indent-level: 4; tab-width: 4; indent-tabs-mode: t -*-
# vim:sw=4:ts=4
# $Id$
#
diff --git a/test/syslog/test_syslog_logger.rb b/test/syslog/test_syslog_logger.rb
index 9224296..d382b4a 100644
--- a/test/syslog/test_syslog_logger.rb
+++ b/test/syslog/test_syslog_logger.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require 'test/unit'
require 'tempfile'
require 'syslog/logger'
diff --git a/test/webrick/test_cgi.rb b/test/webrick/test_cgi.rb
index d930c26..282183e 100644
--- a/test/webrick/test_cgi.rb
+++ b/test/webrick/test_cgi.rb
@@ -1,3 +1,4 @@
+# coding: US-ASCII
require_relative "utils"
require "webrick"
require "test/unit"

#18 Updated by Martin Dürst about 3 years ago

On 2012/07/24 3:27, naruse (Yui NARUSE) wrote:

Issue #6679 has been updated by naruse (Yui NARUSE).

Anonymous user wrote:

If this seems to be too large of a performance impact, please consider making it a configuration option.

Making it a configuration option may be nice anyway.

Benchmark it yourself, and if it shows a performance impact, please report it.

I agree. For a file that's ASCII only, I can't imagine that performance
decreases much (but of course I might be wrong). For a file that's
UTF-8, there's no change. Same for a file that's in another encoding
(because that can't use the default).

Regards, Martin.

#19 Updated by Yui NARUSE about 3 years ago

ko1 (Koichi Sasada) wrote:

(2012/07/23 22:57), Perry Smith wrote:

Making it a configuration option may be nice anyway.

+1

diff --git a/ruby.c b/ruby.c
index ab4b674..d6a8a91 100644
--- a/ruby.c
+++ b/ruby.c
@@ -702,6 +702,7 @@ static long
proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
{
long n, argc0 = argc;
+ int opt_K_p = FALSE;
const char *s;

if (argc == 0)
@@ -909,6 +910,7 @@ proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
break;
}
if (enc_name) {
+ opt_K_p = TRUE;
opt->src.enc.name = rb_str_new2(enc_name);
if (!opt->ext.enc.name)
opt->ext.enc.name = opt->src.enc.name;
@@ -1013,10 +1015,8 @@ proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
if (!*(s = ++p)) break;
set_encoding_part(internal);
if (!*(s = ++p)) break;
-#if defined ALLOW_DEFAULT_SOURCE_ENCODING && ALLOW_DEFAULT_SOURCE_ENCODING
set_encoding_part(source);
if (!*(s = ++p)) break;
-#endif
rb_raise(rb_eRuntimeError, "extra argument for %s: %s",
(arg[1] == '-' ? "--encoding" : "-E"), s);
# undef set_encoding_part
@@ -1028,11 +1028,9 @@ proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
else if (is_option_with_arg("external-encoding", Qfalse, Qtrue)) {
set_external_encoding_once(opt, s, 0);
}
-#if defined ALLOW_DEFAULT_SOURCE_ENCODING && ALLOW_DEFAULT_SOURCE_ENCODING
else if (is_option_with_arg("source-encoding", Qfalse, Qtrue)) {
set_source_encoding_once(opt, s, 0);
}
-#endif
else if (strcmp("version", s) == 0) {
if (envopt) goto noenvopt_long;
opt->dump |= DUMP_BIT(version);
@@ -1097,6 +1095,9 @@ proc_options(long argc, char **argv, struct cmdline_options *opt, int envopt)
}

switch_end:
+ if (opt_K_p)
+ rb_warning("-K is specified; it is for 1.8 compatibility and may cause odd behavior");
+
return argc0 - argc;
}

@@ -1268,9 +1269,6 @@ process_options(int argc, char **argv, struct cmdline_options *opt)
opt->intern.enc.name = int_enc_name;
}

- if (opt->src.enc.name)
- rb_warning("-K is specified; it is for 1.8 compatibility and may cause odd behavior");
-
if (opt->dump & DUMP_BIT(version)) {
ruby_show_version();
return Qtrue;

#20 Updated by Yui NARUSE over 2 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r37485.
Clay, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


  • ruby.c (load_file_internal): set default source encoding as
    UTF-8 instead of US-ASCII. [Feature #6679]

  • parse.y (parser_initialize): set default parser encoding as
    UTF-8 instead of US-ASCII.

#21 Updated by Yusuke Endoh over 2 years ago

  • Target version set to 2.0.0
