Misc #20774

Remove remaining locale dependent code from Windows port

Added by larskanis (Lars Kanis) 3 months ago. Updated 18 days ago.

Status:
Open
Assignee:
-
[ruby-core:119380]

Description

The external_encoding of files, file names and ENV on Windows was changed from the locale codepage to UTF-8 in ruby-3.0.
However, locale encoding is still used in several places where there is no need for it.
The Windows port is already fully UTF-16/UTF-8 based, and locale encoding is only used for historical rather than technical reasons.

My proposal is to remove (most of) the locale dependent conversions from the ruby code for Windows.
Before I open pull requests in this regard, I would like to confirm this direction with the ruby core team.

Let me show what I mean:

# täst-locale-enc.rb
def pr(*strs)
  strs.each do |str|
    p [str, IO===str ? str.external_encoding&.name : str.encoding.name]
  end
end

if $0==__FILE__
  pr STDIN      # => [#<IO:<STDIN>>, "CP850"]
  pr $0         # => ["ruby/t\x84st-locale-enc.rb", "CP850"]
  pr __FILE__   # => ["ruby/t\x84st-locale-enc.rb", "CP850"]
  pr __dir__    # => ["C:/Users/kanis/ruby", "CP850"]
  pr 'ä'        # => ["ä", "UTF-8"]
  pr '€'        # => ["€", "UTF-8"]
  pr $:.first   # => ["C:/Users/kanis/t\xE2\x82\xACst", "ASCII-8BIT"]
  pr $:.last    # => ["C:/Ruby33-x64/lib/ruby/3.3.0/x64-mingw-ucrt", "CP850"]

  require "win32/registry"
  pr Win32::Registry::HKEY_CURRENT_USER.open("Environment")['TMP']
    # => ["C:\\Users\\kanis\\AppData\\Local\\Temp", "UTF-8"]
  pr Win32::Registry::HKEY_CURRENT_USER.open("\\").each_key{ break _1 }
    # => ["AppEvents", "CP850"]
end

# execute with: ruby -It€st ruby\täst-locale-enc.rb

The comments in the code show the results on ruby-3.3 x64-mingw-ucrt.
The situation is even worse when the code is passed as a -e script:

$ ruby -It€st -r .\ruby\täst-locale-enc.rb -e "pr STDIN, $0, __FILE__, __dir__, 'ä', '€', $:.first, $:.last"
[#<IO:<STDIN>>, "CP850"]
["-e", "CP850"]
["-e", "UTF-8"]
[".", "US-ASCII"]
["\x84", "CP850"]
["?", "CP850"]
["C:/Users/kanis/t\xE2\x82\xACst", "ASCII-8BIT"]
["C:/Ruby33-x64/lib/ruby/3.3.0/x64-mingw-ucrt", "CP850"]

There are also some inconsistencies: it is possible to require a script whose name contains characters outside of the codepage, but executing such a script directly or loading it with require_relative fails:

$ ruby -r .\t€st-locale-enc.rb -e "pr STDIN"
[#<IO:<STDIN>>, "CP850"]

$ ruby .\t€st-locale-enc.rb
ruby: Invalid argument -- ./t?st-locale-enc.rb (LoadError)
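
The "\x84" and "?" results above are consistent with how CP850 covers these characters: ä has a CP850 byte, € does not. The following sketch reproduces this with Ruby's own transcoder on any platform. It is only an illustration and makes no claim about the exact code path the Windows port uses; the variable names are made up for the example.

```ruby
# Illustration only: no Windows APIs involved, runs on any platform.
a_umlaut = "\u00E4"   # ä
euro     = "\u20AC"   # €

# ä exists in CP850 and maps to the single byte 0x84.
a_cp850 = a_umlaut.encode("CP850")
p a_cp850.bytes   # => [132]

# € has no CP850 code point, so a strict conversion raises ...
begin
  euro.encode("CP850")
rescue Encoding::UndefinedConversionError => e
  puts "strict conversion failed: #{e.class}"
end

# ... and a lossy conversion substitutes "?".
euro_lossy = euro.encode("CP850", undef: :replace)
p euro_lossy   # => "?"
```

Once a character has been replaced by "?", the original file name can no longer be recovered, which matches the LoadError shown above.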

There may be more places that work with the locale codepage; these are only the ones I remember.
I would like to change all the above results to be UTF-8 encoded, as is already the case on Ubuntu.
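
Until such a change lands, scripts that need UTF-8 have to normalize these strings themselves. A minimal sketch of that follows, assuming a hypothetical helper named to_utf8 (not part of Ruby) and assuming that ASCII-8BIT strings such as the $:.first result above already contain UTF-8 bytes:

```ruby
# Hypothetical helper, not part of Ruby: returns a UTF-8 version of +str+.
def to_utf8(str)
  case str.encoding
  when Encoding::UTF_8
    str
  when Encoding::ASCII_8BIT
    # Assumption: the bytes are already UTF-8 and only the label is missing.
    utf8 = str.dup.force_encoding(Encoding::UTF_8)
    raise ArgumentError, "not valid UTF-8: #{str.inspect}" unless utf8.valid_encoding?
    utf8
  else
    # Locale-encoded string (e.g. CP850): transcode the bytes.
    str.encode(Encoding::UTF_8)
  end
end

# Sample values copied from the results above.
script    = "ruby/t\x84st-locale-enc.rb".dup.force_encoding("CP850")
load_path = "C:/Users/kanis/t\xE2\x82\xACst".b

p to_utf8(script)
p to_utf8(load_path)
```

Both calls return UTF-8 strings with ä and € intact. The helper cannot repair strings in which a character was already replaced by "?".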

Compatibility

Changing the encoding of returned strings is of course an API change.
IMHO it is still something we should change in a minor release of ruby.
The reason is that I don't remember even a single issue caused by the change to UTF-8 in ruby-3.0 in the company I work for.
On the contrary, many issues are caused by using the locale codepage, where some non-ASCII characters work and others don't.
Most issues with ruby-3.0 were caused by the keyword argument changes.

Updated by YO4 (Yoshinao Muramatsu) 19 days ago

GitHub PR #11799 currently has two patches:

  • Windows: Change command line interface to UTF-8
  • Windows: Use Unicode aware function to retrieve console inputs

Could the former be split and merged?
The latter is no longer a must, as console input in codepage 65001 (UTF-8) works as intended in the latest Windows Terminal.

Of course, it remains important for environments where Windows Terminal is not available, including Windows Server 2019, or when the Command Prompt window is used.
Differences in behavior due to codepage settings and console selection should be minimized, as they confuse novices.

Updated by larskanis (Lars Kanis) 18 days ago

@YO4 This is a good idea! I opened a PR: https://github.com/ruby/ruby/pull/12377

Updated by larskanis (Lars Kanis) 18 days ago

From the issue description above:

  require "win32/registry"
  pr Win32::Registry::HKEY_CURRENT_USER.open("Environment")['TMP']
    # => ["C:\\Users\\kanis\\AppData\\Local\\Temp", "UTF-8"]
  pr Win32::Registry::HKEY_CURRENT_USER.open("\\").each_key{ break _1 }
    # => ["AppEvents", "CP850"]

The inconsistency in win32-registry has meanwhile been fixed on the ruby master branch by https://github.com/ruby/win32-registry/commit/f5ea80d985dd374c8f1e92d4ddc41b9fb5526257 .
