Feature #16604

Set default for Encoding.default_external to UTF-8 on Windows

Added by larskanis (Lars Kanis) 10 months ago. Updated 3 days ago.

Status:
Open
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-core:97049]

Description

This issue is related to https://bugs.ruby-lang.org/issues/13488 where we already discussed the topic and postponed the change for ruby-3. A patch is here: https://github.com/ruby/ruby/pull/2877

What should be changed?

Currently Encoding.default_external is initialized to the local console encoding of the Windows installation unless it is changed with the -E option. For Western Europe this is, for example, cp850. It should be changed to UTF-8.

The above patch only changes the default for Encoding.default_external. It can still be overridden with the command-line option -Elocale or in Ruby code.
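As a minimal sketch of the "in Ruby code" override mentioned above: the snippet switches the default external encoding to the locale encoding and then to UTF-8, and restores the previous value at the end. The variable names are illustrative only and are not part of the patch.

```ruby
# Remember the current default so it can be restored afterwards.
original = Encoding.default_external

# Override in Ruby code: use the locale (console) encoding, like -Elocale.
Encoding.default_external = Encoding.find("locale")
locale_default = Encoding.default_external

# Or select UTF-8 explicitly, like -Eutf-8.
Encoding.default_external = Encoding::UTF_8
utf8_default = Encoding.default_external

# Restore the previous process-wide setting.
Encoding.default_external = original

puts locale_default.name
puts utf8_default.name
```

Because the setting is process-wide, changing it at runtime affects every subsequent read that does not name an encoding explicitly.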

Reasons for the change

Changing to UTF-8 fixes various inconsistencies within ruby and with external tools. A very common case is writing non-ASCII text to a file. Ruby writes the string content as its binary representation, which is usually UTF-8, since this is the default ruby source encoding. But reading the content back tags the string with the wrong encoding, which leads to mojibake.

s = "äöü"
File.write("x", s)   # => 6 bytes
File.read("x") == s  # => true in irb but false in .rb file

As noted in the last line, the result in irb is different from regular .rb files, since irb already sets Encoding.default_external = "utf-8" on its own. This is another inconsistency with the current default.
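The mismatch can be reproduced on any platform by naming the read encoding explicitly instead of relying on the default. This is only a sketch: CP850 stands in for the legacy console encoding, and the temporary path is an assumption made for the example.

```ruby
require "tmpdir"

s = "\u00E4\u00F6\u00FC"                 # "äöü" as a UTF-8 string
path = File.join(Dir.mktmpdir, "x")

bytes_written = File.write(path, s)      # the UTF-8 bytes are written unchanged

# Reading with the legacy encoding keeps the bytes but tags them as CP850.
read_cp850 = File.read(path, encoding: "CP850")
# Reading with UTF-8 yields a string equal to the original.
read_utf8 = File.read(path, encoding: "UTF-8")

puts bytes_written
puts read_cp850 == s                     # same bytes, different encoding
puts read_utf8 == s
puts read_cp850.encode("UTF-8") == s     # transcoding the mis-tagged string
```

The bytes on disk are identical in both reads; only the encoding tag differs, which is why the comparison fails and why transcoding the mis-tagged string produces mojibake.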

Another issue is that many non-Asian regions have distinct legacy encodings for the OEM code page (aka Encoding.find('locale') ) and the ANSI code page (aka Encoding.find('filesystem') ), so that a file written in the current default external encoding Encoding.find('locale') is not properly interpreted by Windows GUI tools like Notepad. It is therefore uncommon to store files in OEM-ANSI encoding and doing so is almost certainly wrong.
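To illustrate why the two legacy code pages do not interoperate, the following sketch encodes the same text once for an OEM code page and once for an ANSI code page and compares the bytes. CP850 is the Western European example from the description; pairing it with Windows-1252 as the ANSI code page is an assumption made for this example.

```ruby
text = "\u00E4\u00F6\u00FC"                       # "äöü"

# Byte representation in the OEM code page (console).
oem_bytes = text.encode("CP850").bytes
# Byte representation in the ANSI code page (GUI tools).
ansi_bytes = text.encode("Windows-1252").bytes

puts oem_bytes.map { |b| format("%02X", b) }.join(" ")
puts ansi_bytes.map { |b| format("%02X", b) }.join(" ")
puts oem_bytes == ansi_bytes
```

Since the byte sequences differ, a tool that assumes the ANSI code page shows different characters for a file written in the OEM code page.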

RubyInstaller ships the MSYS2 environment, which defaults to UTF-8 as well.

PowerShell made the switch to UTF-8 (without BOM) in PowerShell 6.0 and even more in 6.1.

Will it work?

Yes. RubyInstaller has provided a checkbox for RUBYOPT=-Eutf-8 since version 2.4. This checkbox was disabled at first, but since RubyInstaller-2.7.0 it is enabled by default. So UTF-8 as the default external encoding is now what most people on Windows expect.

However, setting RUBYOPT through the installer is obtrusive and doesn't work with a 7z archive distribution. I would like to remove this hack starting with ruby-3.0.

Alternatives

Changing the default of Encoding.default_external to UTF-8 is a trade-off. It doesn't fit every case, but in my experience it is the best overall option. And it's just the default for the default, so it can be overridden in many ways.

There are some alternatives to it:

Changing the Windows console to code page 65001:

  • The Windows implementation of 65001 is buggy in the console. I haven't verified it lately, but chcp 65001 didn't work reliably years ago.
  • It is not the default and input methods like IME are incompatible.
  • It sets locale to UTF-8, so that the native console encoding isn't easily available.

Setting Encoding.default_internal in addition:

  • This triggers transcoding of output strings, which is not enabled on other systems, causing unexpected results and incompatibilities.

Change ruby to use Encoding.find("filesystem") as encoding for file operations:

  • That would fix the compatibility with some builtin Windows tools, but doesn't fix encoding issues due to increased use of UTF-8.

What doesn't change?

Please note that changing Encoding.default_external doesn't affect file or IO output, unless Encoding.default_internal is set as well (which is not the default).

Also, the "locale" and "filesystem" pseudo encodings don't change. Both can still be used explicitly in cases where these encodings are required.
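A small sketch of this behaviour: with Encoding.default_internal left unset, a UTF-8 string is written byte for byte even while Encoding.default_external is temporarily set to a legacy encoding. CP850 and the temporary file name are assumptions chosen for the example.

```ruby
require "tmpdir"

saved = Encoding.default_external
s = "\u00E4\u00F6\u00FC"                       # "äöü" as a UTF-8 string
path = File.join(Dir.mktmpdir, "out.txt")

begin
  # Pretend the process runs with a legacy default external encoding.
  Encoding.default_external = Encoding::CP850
  File.write(path, s)                          # no encoding given for the write
  raw = File.binread(path)                     # read the bytes back untouched
ensure
  Encoding.default_external = saved            # restore the global setting
end

puts Encoding.default_internal.inspect
puts raw.bytes == s.bytes
```

The comparison only holds while Encoding.default_internal is nil, which is the default this sketch relies on.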
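For cases that need the legacy encodings, a hedged sketch of the explicit use: the pseudo encodings are resolved with Encoding.find and passed to the read call, independent of Encoding.default_external. The file name and its ASCII content are placeholders for this example.

```ruby
require "tmpdir"

# Resolve the pseudo encodings to the concrete encodings of this system.
locale_enc = Encoding.find("locale")
filesystem_enc = Encoding.find("filesystem")

path = File.join(Dir.mktmpdir, "legacy.txt")
File.write(path, "hello")                      # ASCII-only placeholder content

# Request the locale encoding explicitly for this read.
content = File.read(path, encoding: locale_enc)

puts locale_enc.name
puts filesystem_enc.name
puts content.encoding == locale_enc
```

Both names are resolved at runtime, so the printed encodings depend on the machine the script runs on.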

The patch is currently about Windows only, because I would like to focus on that question for now. Possibly it's a subsequent question whether Encoding.default_external should default to UTF-8 on all operating systems or at least in case of LANG=C locale (which currently triggers US-ASCII).

#1

Updated by larskanis (Lars Kanis) 10 months ago

  • Description updated (diff)
#2

Updated by larskanis (Lars Kanis) 10 months ago

  • Description updated (diff)
#3

Updated by larskanis (Lars Kanis) 28 days ago

  • Description updated (diff)

Updated by larskanis (Lars Kanis) 28 days ago

usa (Usaku NAKAMURA) nobu (Nobuyoshi Nakada) naruse (Yui NARUSE) Could you please take a look at this request? I pushed an updated patch here: https://github.com/ruby/ruby/pull/2877

Updated by larskanis (Lars Kanis) 3 days ago

Both AppVeyor and GitHub Actions use Encoding.default_external = UTF-8 in their default ruby versions on Windows. AppVeyor sets this via RUBYOPT=-Eutf-8, and GitHub Actions goes a step further and sets it via chcp 65001, which also changes the pseudo encoding locale to UTF-8.

So Encoding.default_external = UTF-8 is already the de facto standard and ruby just needs to follow the standard. 😊
