Bug #19383

open encoding for German display language in Windows is incorrect

Added by stringsn88keys (Thomas Powell) 4 months ago. Updated 3 months ago.

Verified on Windows 10 and Windows Server 2022 with Ruby 2.7.7 through 3.1.3.

Display language:
Verified on German, but this may affect any display language whose time zone names contain characters outside [A-Za-z].

Time zone:
CET (UTC +01:00) Amsterdam, Berlin, ... # => "Mitteleurop\xE4ische Zeit" # => #<Encoding:IBM437>
puts # => "Mitteleurop∑ische Zeit" (should be "Mitteleuropäische Zeit") # => "Mitteleurop∑ische Zeit"

Calling force_encoding with every encoding in Encoding.list reveals that ISO-8859-(1..16) and Windows-125(0,2,4,7) recover the ä from the time zone string: # => "Mitteleurop\xE4ische Zeit"
... but ... #=> "Mitteleuropäische Zeit"
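
The scan over Encoding.list described above can be sketched roughly as follows. This is only an illustration: the raw bytes are taken from the hex dump given later in this thread, the helper name encodings_that_recover is hypothetical, and the set of matching encodings on a given Ruby build may differ from the list quoted above.

```ruby
# Raw bytes of the zone name as reported in this thread (0xE4 stands for "ä").
RAW = "Mitteleurop\xE4ische Zeit".b.freeze
EXPECTED = "Mitteleuropäische Zeit"

# Hypothetical helper: names of all encodings under which `raw`
# decodes to `expected` after transcoding to UTF-8.
def encodings_that_recover(raw, expected)
  Encoding.list.select do |enc|
    next false if enc.dummy?                 # dummy encodings cannot be transcoded
    candidate = raw.dup.force_encoding(enc)  # relabel the bytes, no conversion
    next false unless candidate.valid_encoding?
    begin
      candidate.encode(Encoding::UTF_8) == expected
    rescue EncodingError                     # no converter or unmappable byte
      false
    end
  end.map(&:name)
end

matches = encodings_that_recover(RAW, EXPECTED)
puts matches
```

force_encoding only relabels the bytes, while encode actually converts them, so comparing after encode is what tells a correct label apart from a merely valid one.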

Related issue: This improper encoding/rendering caused Ohai's JSON output to be unparseable. Workaround was forcing to Windows-1252.
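
A minimal sketch of that kind of workaround, assuming the bytes really are Windows-1252 (the ANSI code page on a German Windows install). The helper normalize_zone is hypothetical and is not Ohai's actual code.

```ruby
require "json"

# Hypothetical helper: reinterpret a mislabeled zone name as Windows-1252
# (an assumption about the real byte encoding) and transcode it to UTF-8.
def normalize_zone(zone, assumed: Encoding::Windows_1252)
  return zone if zone.ascii_only?            # plain ASCII needs no fixing
  zone.dup.force_encoding(assumed).encode(Encoding::UTF_8)
end

# Simulate the value described in the report: CP1252 bytes labeled as IBM437.
mislabeled = "Mitteleurop\xE4ische Zeit".b.force_encoding(Encoding::IBM437)

fixed = normalize_zone(mislabeled)
json = JSON.generate("timezone" => fixed)
puts json
```

Hard-coding Windows-1252 only suits systems whose ANSI code page is 1252; other display languages would need a different assumed encoding.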

Updated by austin (Austin Ziegler) 4 months ago

It’s been a long time since I’ve used Windows, but the Windows console is notoriously stuck in 1980s encodings and using codepage 65001 should fix this in general. Otherwise, you’re going to get Windows 1252 encoding as your default input/output encoding even if Ruby is otherwise using UTF-8.

I believe that since Ruby 3.0, Ruby by default uses UTF-8 but the boundaries caused by your console codepage may be a confounding factor.

Updated by stringsn88keys (Thomas Powell) 4 months ago

By "console" do you mean irb, or are you referring to PowerShell or cmd.exe/Command Prompt? Windows Terminal produces the same results as well.

Also, the data in question is passed from one process to another without any user interaction.

Comparing Code Page 437 with Windows-1252, 0xE4 is ∑ in Code Page 437 and ä in Windows-1252.

The byte sequence of "Mitteleuropäische Zeit" as returned (the string reports its encoding as "IBM437") is, in hex:
=> ["4d", "69", "74", "74", "65", "6c", "65", "75", "72", "6f", "70", "e4", "69", "73", "63", "68", "65", "20", "5a", "65", "69", "74"]

70 e4 69 is "päi" in Windows-1252, but "p∑i" in IBM437, the encoding the string reports. If UTF-8 is assumed, e4 is the lead byte of a three-byte sequence (the range that holds many CJK characters), but the bytes that follow (69 73) are not continuation bytes, so e4 never combines with them. That is consistent with the occasional invalid byte sequence errors, depending on how the string is picked up.
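
The three readings of those bytes can be checked with a short sketch like the one below. The hex values are copied from the dump above; the variable names are only illustrative. Note that Ruby's IBM437 table maps 0xE4 to the Greek capital sigma (U+03A3), which this thread renders as ∑.

```ruby
# Hex dump from this thread, turned back into a binary string.
hex = %w[4d 69 74 74 65 6c 65 75 72 6f 70 e4 69 73 63 68 65 20 5a 65 69 74]
raw = hex.map { |h| h.to_i(16) }.pack("C*")

# Reading 1: the label Ruby reports (IBM437 / code page 437).
as_cp437 = raw.dup.force_encoding("IBM437").encode("UTF-8")

# Reading 2: the Windows ANSI code page for German (Windows-1252).
as_cp1252 = raw.dup.force_encoding("Windows-1252").encode("UTF-8")

# Reading 3: UTF-8. 0xE4 opens a three-byte sequence, but 0x69 is not
# a continuation byte, so the string is not valid UTF-8.
utf8_valid = raw.dup.force_encoding("UTF-8").valid_encoding?

puts as_cp437
puts as_cp1252
puts utf8_valid
```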

Updated by austin (Austin Ziegler) 4 months ago

Yes, I mean cmd.exe or any other Windows command line. I repeat that it has been years since I used Windows in any serious manner, but this was the absolute bane of my existence, and process boundaries on Windows were nightmarish when I last dealt with them, primarily because of the emphasis on backwards compatibility at any cost. There does appear to be a bug if the call is not returning the data in the code page expected where it is used (e.g., UTF-8).

Updated by nobu (Nobuyoshi Nakada) 4 months ago

  • Status changed from Open to Feedback

What are:

  • the output from the chcp command
  • Encoding.locale_charmap


Updated by nobu (Nobuyoshi Nakada) 4 months ago

Maybe msvcrt converts timezone names to ACP, not ConsoleCP.
If so, this patch may work, but I have no idea how to test this in a CI.

diff --git i/time.c w/time.c
index 9c4c93939e0..2e1a2dca29b 100644
--- i/time.c
+++ w/time.c
@@ -929,7 +929,7 @@ timegmw_noleapsecond(struct vtm *vtm)
 }
 
 static VALUE
-zone_str(const char *zone)
+zone_str_enc(const char *zone, rb_encoding *enc)
 {
     const char *p;
     int ascii_only = 1;
@@ -950,11 +950,18 @@ zone_str(const char *zone)
         str = rb_usascii_str_new(zone, len);
     }
     else {
-        str = rb_enc_str_new(zone, len, rb_locale_encoding());
+        if (!enc) enc = rb_locale_encoding();
+        str = rb_enc_str_new(zone, len, enc);
     }
     return rb_fstring(str);
 }
 
+static VALUE
+zone_str(const char *zone)
+{
+    return zone_str_enc(zone, NULL);
+}
+
 static void
 gmtimew_noleapsecond(wideval_t timew, struct vtm *vtm)
 {
@@ -1653,12 +1660,18 @@ localtime_with_gmtoff_zone(const time_t *t, struct tm *result, long *gmtoff, VAL
 #if defined(HAVE_TM_ZONE)
             *zone = zone_str(tm.tm_zone);
 #elif defined(HAVE_TZNAME) && defined(HAVE_DAYLIGHT)
+            rb_encoding *enc = NULL;
+# if defined(_WIN32)
+            char cp[(sizeof(UINT) * 8 / 3) + 4];
+            snprintf(cp, sizeof(cp), "CP%u", GetACP());
+            enc = rb_enc_find(cp);
+# endif
 #  define tzname _tzname
 #  define daylight _daylight
 # endif
             /* this needs tzset or localtime, instead of localtime_r */
-            *zone = zone_str(tzname[daylight && tm.tm_isdst]);
+            *zone = zone_str_enc(tzname[daylight && tm.tm_isdst], enc);
                 char buf[64];
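
For readers who do not follow the C, the idea of the patch can be approximated in Ruby as below. This is a sketch under assumptions, not the patch itself: the method names are hypothetical, and the ANSI code page number (what GetACP() returns in the C code) is passed in as a plain argument, with 1252 assumed for a German system.

```ruby
# Map a Windows ANSI code page number to a Ruby Encoding, mirroring the
# "CP%u" name lookup in the patch. Unknown names fall back to the locale
# encoding, as the patch does when no encoding is found.
def encoding_for_code_page(acp, fallback: Encoding.find("locale"))
  Encoding.find(format("CP%u", acp))
rescue ArgumentError                 # raised for unknown encoding names
  fallback
end

# Label a raw zone name: US-ASCII when possible, otherwise the ACP encoding.
def zone_string(raw_bytes, acp)
  return raw_bytes.dup.force_encoding(Encoding::US_ASCII) if raw_bytes.ascii_only?
  raw_bytes.dup.force_encoding(encoding_for_code_page(acp))
end

zone = zone_string("Mitteleurop\xE4ische Zeit".b, 1252)
puts zone.encoding
puts zone.encode("UTF-8")
```

The key difference from the current behavior is the source of the label: the ANSI code page rather than the console code page that the locale encoding reflects.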

Updated by stringsn88keys (Thomas Powell) 4 months ago

nobu (Nobuyoshi Nakada) wrote in #note-4:

What are:

  • the output from the chcp command
  • Encoding.locale_charmap

?

chcp output:
"Aktive Codepage: 437." ("Active code page: 437" on English display language.)

Encoding.locale_charmap # => "CP437" (German and English)
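
The values requested above, plus a few related ones, can be collected in one go with a sketch like this. It assumes the command asked about is chcp (its output format matches the "Active code page" line quoted above) and only runs it on Windows; the hash keys are illustrative names.

```ruby
# Gather the encoding-related settings that matter for this report.
info = {
  locale_charmap:   Encoding.locale_charmap,        # e.g. "CP437" in this report
  locale:           Encoding.find("locale").name,
  default_external: Encoding.default_external.name,
  default_internal: Encoding.default_internal&.name # nil unless configured
}

# chcp exists only on Windows, so guard the call (assumption: it is on PATH).
info[:chcp] = `chcp`.strip if RUBY_PLATFORM =~ /mswin|mingw/

info.each { |key, value| puts "#{key}: #{value.inspect}" }
```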

Updated by stringsn88keys (Thomas Powell) 4 months ago

The top level status of this bug says "Closed" but last updated status says "Feedback". Can anyone clarify?

Updated by nobu (Nobuyoshi Nakada) 4 months ago

  • Status changed from Feedback to Assigned
  • Assignee set to windows

Updated by nobu (Nobuyoshi Nakada) 3 months ago

How about the patch at #note-5?

