Backport #8516

IO#readchar returns wrong codepoints when converting encoding

Added by Xiao Ba about 2 years ago. Updated about 2 years ago.

[ruby-core:55444]
Status: Closed
Priority: Normal
Assignee: Usaku NAKAMURA

Description

I am trying to parse plain text files in various encodings, converting their contents to UTF-8 strings as I read them. Non-ASCII characters work fine when the file is encoded as UTF-8, but problems come up with non-UTF-8 files.

$ file -i utf_8.txt
utf_8.txt: text/plain; charset=utf-8

$ file -i iso_8859_1.txt
iso_8859_1.txt: text/plain; charset=iso-8859-1

Code:
utf_8_file = "utf_8.txt"
iso_file = "iso_8859_1.txt"

puts "Processing #{utf_8_file}"
File.open(utf_8_file) do |io|
  line, char = "", nil

  # Read one character at a time until the end of the first line.
  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "Character #{char} codepoints: #{char.each_codepoint.to_a.join(', ')}"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end

  line
end
puts "\n"
puts "Processing #{iso_file}"
File.open(iso_file) do |io|
  # Read as ISO-8859-1 and transcode to UTF-8.
  io.set_encoding("#{Encoding::ISO_8859_1}:#{Encoding::UTF_8}")
  line, char = "", nil

  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "Character #{char} codepoints: #{char.each_codepoint.to_a.join(', ')}"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end

  line
end

Output:
Processing utf_8.txt
Character á has 1 codepoints
Character á codepoints: 225
Character Á has 1 codepoints
Character Á codepoints: 193
Character ð has 1 codepoints
Character ð codepoints: 240
Character
has 1 codepoints
Character
codepoints: 10

Processing iso_8859_1.txt
Character á has 2 codepoints
Character á codepoints: 195, 161
SLICE FAIL
Character Á has 2 codepoints
Character Á codepoints: 195, 129
SLICE FAIL
Character ð has 2 codepoints
Character ð codepoints: 195, 176
SLICE FAIL
Character
has 1 codepoints
Character
codepoints: 10

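With the ISO-8859-1 encoded file, each_codepoint on the string returned by readchar yields the individual UTF-8 bytes (for example 195, 161 for á), when I would expect the single codepoint (225), as with the UTF-8 file.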
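
For comparison, here is a small sketch of the result I would expect, using String#encode instead of IO so that it does not go through readchar at all. The variable names are only for illustration.

```ruby
# The single byte 0xE1 is "á" in ISO-8859-1.
latin1 = [0xE1].pack("C").force_encoding(Encoding::ISO_8859_1)

# Transcode the string itself, without involving IO#readchar.
utf8 = latin1.encode(Encoding::UTF_8)

p utf8.bytes       # => [195, 161]  (the two UTF-8 bytes)
p utf8.codepoints  # => [225]       (the single codepoint, U+00E1)
puts "SLICE OK" if utf8 == utf8.slice(0, 1)
```

The byte values 195 and 161 are the same numbers that readchar reports as "codepoints" in the output above.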

utf_8.txt (7 Bytes) Xiao Ba, 06/12/2013 04:15 AM

iso_8859_1.txt (4 Bytes) Xiao Ba, 06/12/2013 04:15 AM

Associated revisions

Revision 41250
Added by Nobuyoshi Nakada about 2 years ago

io.c: fix 7bit coderange condition

  • io.c (io_getc): fix 7bit coderange condition, check if ascii read data instead of read length. [Bug #8516]
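
A check for the reported behavior could look roughly like the sketch below. It is not taken from Ruby's test suite; the helper name first_char_via_readchar and the temporary file prefix are hypothetical, and it assumes a Ruby recent enough to provide Tempfile.create.

```ruby
require "tempfile"

# Hypothetical helper: write raw bytes to a temporary file, reopen it with
# ISO-8859-1 -> UTF-8 transcoding, and return the first char from readchar.
def first_char_via_readchar(raw_bytes)
  Tempfile.create("iso_8859_1") do |tmp|
    tmp.binmode
    tmp.write(raw_bytes)
    tmp.flush
    File.open(tmp.path, "r:ISO-8859-1:UTF-8") { |io| io.readchar }
  end
end

char = first_char_via_readchar([0xE1].pack("C"))

# With the fix applied, the character should report one codepoint
# and should compare equal to its own first-character slice.
puts "codepoints: #{char.codepoints.inspect}"
puts "slice ok:   #{char == char.slice(0, 1)}"
```

On an affected build, the same script would be expected to print two values in the codepoints line, matching the report.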

Revision 41644
Added by Usaku NAKAMURA about 2 years ago

merge revision(s) 41250: [Backport #8516]

* io.c (io_getc): fix 7bit coderange condition, check if ascii read
  data instead of read length.  [Bug #8516]

History

#1 Updated by Nobuyoshi Nakada about 2 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r41250.
Xiao, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


#2 Updated by Nobuyoshi Nakada about 2 years ago

  • Backport changed from 1.9.3: UNKNOWN, 2.0.0: UNKNOWN to 1.9.3: REQUIRED, 2.0.0: REQUIRED

#3 Updated by Tomoyuki Chikanaga about 2 years ago

  • Tracker changed from Bug to Backport
  • Project changed from Ruby trunk to Backport200
  • Status changed from Closed to Assigned
  • Assignee set to Tomoyuki Chikanaga

#4 Updated by Tomoyuki Chikanaga about 2 years ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r41260.



#5 Updated by Tomoyuki Chikanaga about 2 years ago

  • Project changed from Backport200 to Backport193
  • Status changed from Closed to Assigned
  • Assignee changed from Tomoyuki Chikanaga to Usaku NAKAMURA

#6 Updated by Usaku NAKAMURA about 2 years ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r41644.


