Bug #18588

closed

ruby -e 'p gets' with Japanese characters gets additional invalid leading chars and raises Encoding::InvalidByteSequenceError

Added by YO4 (Yoshinao Muramatsu) 6 months ago. Updated 5 months ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
[ruby-dev:51165]

Description

When a line starting with a Japanese character is entered from the console, Ruby almost always receives additional invalid leading characters.

Steps to reproduce

R:\ruby32\bin>ruby -e 'p gets'
あ
-e:1:in `gets': "\\xA0" on Windows-31J (Encoding::InvalidByteSequenceError)
        from -e:1:in `gets'
        from -e:1:in `<main>'

expected result

R:\ruby32\bin>ruby -e 'p gets'
あ
"あ"

your ruby version (ruby -v)

R:\ruby32\bin>ruby -v
ruby 3.2.0dev (2022-02-16T08:57:04Z master 00c7a0d491) [x64-mswin64_140]

R:\ruby32\bin>ver

Microsoft Windows [Version 10.0.19043.1526]

other observations

environment

  • In a Command Prompt window with Legacy Console mode, this issue does NOT occur.
  • In Windows Terminal, this issue occurs.
  • In Windows Sandbox (Japanese locale), this issue occurs.
  • RubyInstaller binaries have the same issue:
C:\src\git>ruby -v
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x64-mingw-ucrt]

C:\src\git>ruby -Eutf-8 -e 'p gets'
あ
-e:1:in `gets': "\\xA0" on Windows-31J (Encoding::InvalidByteSequenceError)
        from -e:1:in `gets'
        from -e:1:in `<main>'

A line starting with single-byte character(s) is read correctly.

R:\ruby32\bin>ruby -e 'p gets'
:あ
":あ\n"  # <= valid

effect of the external encoding

  • With Windows-31J, a second Enter key press is required to complete the line input.
R:\ruby32\bin>ruby -EWindows-31J -e 'p gets'
あ
   # <= Second enter key required
"\xA0\xFFあ\n" # <= \xA0\xFF is additional chars

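The exception from the first transcript can be reproduced without a console by feeding the reported bytes through the same transcoding step. This is a minimal sketch: it assumes the output above corresponds to the bytes `\xA0\xFF`, then あ (`\x82\xA0` in Windows-31J), then a newline, and the variable names are illustrative only.

```ruby
# Byte form of the reported line: two stray bytes, then "あ" (0x82 0xA0), then "\n".
observed = "\xA0\xFF\x82\xA0\n".b

# Tag the bytes with the console encoding, then transcode to UTF-8.
line = observed.dup.force_encoding(Encoding::Windows_31J)

transcode_error = nil
begin
  line.encode(Encoding::UTF_8)
rescue Encoding::InvalidByteSequenceError => e
  transcode_error = e
  puts "#{e.class}: #{e.message}"
end

# Without the two stray leading bytes, the same line transcodes cleanly.
clean = observed.byteslice(2..-1).force_encoding(Encoding::Windows_31J)
p clean.encode(Encoding::UTF_8)
```

The first conversion fails on the stray bytes, while the remaining bytes are a valid Windows-31J line.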
character variations

R:\ruby32\bin>ruby -EWindows-31J -e 'p gets.b'
あ  # <= \x{82A0}

"\xA0\xFF\x82\xA0\n"

R:\ruby32\bin>ruby -EWindows-31J -e 'p gets.b'
   # <= \x{8140} fullwidth space

"@\x00\x81@\n"

R:\ruby32\bin>ruby -EWindows-31J -e 'p gets.b'
、  # <= \x{8141}

"A\x00\x81A\n"

R:\ruby32\bin>ruby -EWindows-31J -e 'p gets.b'
。  # <= \x{8142}

"B\x00\x81B\n"

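One possible reading of the four transcripts above is that the two stray bytes equal the trailing byte of the first character, widened to a signed 16-bit little-endian value. The sketch below only checks that pattern against the reported data. It is an observation about these four samples, not a confirmed description of what Windows does, and `predicted_stray_prefix` is a hypothetical helper name.

```ruby
# Assumption: the last Windows-31J byte of the first character is treated as a
# signed 8-bit value and sign-extended to 16 bits, little-endian.
def predicted_stray_prefix(char)
  trail = char.encode(Encoding::Windows_31J).bytes.last
  signed = trail >= 0x80 ? trail - 0x100 : trail
  [signed].pack("s<")
end

# First character of each input line and the bytes reported by `p gets.b`.
observations = {
  "\u3042" => "\xA0\xFF\x82\xA0\n".b, # あ
  "\u3000" => "@\x00\x81@\n".b,       # fullwidth space
  "\u3001" => "A\x00\x81A\n".b,       # 、
  "\u3002" => "B\x00\x81B\n".b,       # 。
}

observations.each do |char, reported|
  matches = reported.byteslice(0, 2) == predicted_stray_prefix(char)
  puts format("U+%04X matches=%s", char.ord, matches)
end
```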
sysread returns a valid value.

R:\ruby32\bin>ruby -e 'p STDIN.sysread(1024).force_encoding(Encoding::Windows_31J)'
あ
"\x{82A0}\r\n" # <= valid

STDIN.binmode does not resolve this.

R:\ruby32\bin>ruby -e 'STDIN.binmode; p gets.force_encoding(Encoding::Windows_31J)'
あ
   # <= Second enter key required
"\xA0\xFF\x{82A0}\r\r\n" # <= invalid

Ruby 3.0 and earlier versions behave differently. In particular, sysread returns an invalid value.

C:\src\git>ruby -v
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x64-mingw32]

C:\src\git>ruby -Eutf-8 -e 'p gets'
あ
   # <= Second enter key required
"\xA0\xFF\x82\xA0\n"  # <= no exception is raised, but the value is invalid
C:\src\git>ruby -EWindows-31J -e 'p gets'
あ
   # <= Second enter key required
"\xA0\xFFあ\n"  # <= also invalid value
C:\src\git>ruby -e 'p STDIN.sysread(1024).force_encoding(Encoding::Windows_31J)'
あ
"\xA0\xFF\x{82A0}\r"

conclusion

  1. On Ruby 3.1/3.2dev, gets returns an invalid value while sysread returns a valid one.
  2. sysread returns a valid value on Ruby 3.1/3.2dev but an invalid one on 3.0.
  3. The fact that it works fine in the legacy console suggests that Windows has some issue, but from the observations above it looks like Ruby can handle it.
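
Given points 1 and 2, a script on an affected 3.1/3.2dev build could, as a stopgap, read the raw bytes with sysread and transcode them itself. The sketch below is an assumption-laden illustration, not a recommended fix: `read_console_chunk` is a hypothetical helper, the console encoding is assumed to be Windows-31J, a pipe stands in for the console, and it would not help on Ruby 3.0, where sysread itself returned the stray bytes.

```ruby
# Hypothetical helper: read one chunk of raw bytes with sysread, tag it with
# the console encoding, and transcode to UTF-8 (CRLF becomes LF).
# Raises Encoding::InvalidByteSequenceError if the bytes are not valid.
def read_console_chunk(io, console_encoding: Encoding::Windows_31J, max_bytes: 1024)
  raw = io.sysread(max_bytes)
  raw.force_encoding(console_encoding)
     .encode(Encoding::UTF_8, universal_newline: true)
end

# Demonstration with a pipe standing in for the console.
reader, writer = IO.pipe
writer.binmode
writer.write("\x82\xA0\r\n".b) # "あ" in Windows-31J followed by CRLF
writer.close

line = read_console_chunk(reader)
p line
```

sysread bypasses Ruby's read buffer, so it should not be mixed with gets or other buffered reads on the same IO.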

Updated by YO4 (Yoshinao Muramatsu) 5 months ago

It seems that when the ANSI version of PeekConsoleInput reads a multibyte character partially, the subsequent ReadFile returns wrong data on newer Windows 10 versions.
I reported this to microsoft/terminal (https://github.com/microsoft/terminal/issues/12626)

To avoid this behavior, we can use the Unicode version of PeekConsoleInput/ReadConsoleInput.
PR https://github.com/ruby/ruby/pull/5634.
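
To illustrate what reading a multibyte character "partially" means, the sketch below compares the unit counts for あ in the two representations. It only shows the sizes involved and does not call or model the Windows console API. The variable names are illustrative.

```ruby
char = "\u3042" # あ

ansi  = char.encode(Encoding::Windows_31J) # byte-oriented (ANSI) representation
utf16 = char.encode(Encoding::UTF_16LE)    # representation used by the Unicode (W) APIs

ansi_bytes  = ansi.bytes.size
utf16_units = utf16.unpack("v*").size

puts "Windows-31J bytes: #{ansi_bytes}"  # 2
puts "UTF-16 code units: #{utf16_units}" # 1

# A byte-oriented read that stops after one byte holds only half of the character.
partial = ansi.byteslice(0, 1)
puts "first byte alone is valid: #{partial.valid_encoding?}" # false
```

In UTF-16 this character is a single code unit, so a read of whole units cannot split it the way a byte-oriented read can.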

Updated by YO4 (Yoshinao Muramatsu) 5 months ago

  • Status changed from Open to Closed

Applied in changeset git|5d90c6010999ac11d25822f13f0b29d377f81755.


Avoid console input behavior in windows 10 [Bug #18588]

When ANSI versions of PeekConsoleInput read multibyte character
partially, subsequent ReadFile returns wrong data on newer Windows
10 versions (probably since Windows Terminal introduced). To
avoid this, use Unicode version of PeekConsoleInput/ReadConsole.
