Backport #8516

IO#readchar returns wrong codepoints when converting encoding

Added by Xiao Ba 11 months ago. Updated 10 months ago.

[ruby-core:55444]
Status:Closed
Priority:Normal
Assignee:Usaku NAKAMURA

Description

I am trying to parse plain text files in various encodings, converting their contents to UTF-8 strings. Non-ASCII characters work fine when the file is encoded as UTF-8, but problems come up with non-UTF-8 files.

$ file -i utf_8.txt
utf_8.txt: text/plain; charset=utf-8

$ file -i iso_8859_1.txt
iso_8859_1.txt: text/plain; charset=iso-8859-1

Code:
utf8_file = "utf_8.txt"
iso_file = "iso_8859_1.txt"

puts "Processing #{utf8_file}"
File.open(utf8_file) do |io|
  line, char = "", nil

  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "Character #{char} codepoints: #{char.each_codepoint.to_a.join}"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end

  line
end
puts "\n"
puts "Processing #{iso_file}"
File.open(iso_file) do |io|
  io.set_encoding("#{Encoding::ISO_8859_1}:#{Encoding::UTF_8}")
  line, char = "", nil

  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "Character #{char} codepoints: #{char.each_codepoint.to_a.join(', ')}"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end

  line
end

Output:
Processing utf_8.txt
Character á has 1 codepoints
Character á codepoints: 225
Character Á has 1 codepoints
Character Á codepoints: 193
Character ð has 1 codepoints
Character ð codepoints: 240
Character
has 1 codepoints
Character
codepoints: 10

Processing iso_8859_1.txt
Character á has 2 codepoints
Character á codepoints: 195, 161
SLICE FAIL
Character Á has 2 codepoints
Character Á codepoints: 195, 129
SLICE FAIL
Character ð has 2 codepoints
Character ð codepoints: 195, 176
SLICE FAIL
Character
has 1 codepoints
Character
codepoints: 10

With the ISO-8859-1 encoded file, readchar returns strings whose codepoints are the individual UTF-8 bytes (for example 195, 161 for á), when I would expect a single UTF-8 codepoint (225).
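For anyone who wants to check their own Ruby build, here is a minimal self-contained sketch of the same reproduction. It is not part of the original report: the helper name read_line_codepoints and the byte sequence E1 C1 F0 0A (presumably what the 4-byte attachment contains, judging by the output above) are assumptions. On a build that includes the fix, each character should report exactly one codepoint.

```ruby
require "tempfile"

# Hypothetical helper (not from the report): reads the first line of an
# ISO-8859-1 file character by character while transcoding to UTF-8, and
# returns [char, codepoints] pairs.
def read_line_codepoints(path)
  result = []
  File.open(path, "r:ISO-8859-1:UTF-8") do |io|
    until io.eof?
      char = io.readchar
      result << [char, char.each_codepoint.to_a]
      break if char == "\n" || char == "\r"
    end
  end
  result
end

# Assumed file content: bytes E1 C1 F0 0A, i.e. "áÁð\n" in ISO-8859-1.
file = Tempfile.new("iso_8859_1")
file.binmode
file.write("\xE1\xC1\xF0\n".b)
file.close

CODEPOINTS = read_line_codepoints(file.path)
CODEPOINTS.each do |char, codepoints|
  puts "#{char.inspect}: #{codepoints.inspect}"
end
# On a fixed Ruby this prints one codepoint per character:
# 225, 193, 240, 10. An affected build would show 195, 161 and so on.
```

Passing the encodings in the open mode ("r:ISO-8859-1:UTF-8") should be equivalent to the set_encoding call used in the report; it is used here only to keep the sketch short.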

utf_8.txt (7 Bytes) Xiao Ba, 06/12/2013 04:15 AM

iso_8859_1.txt (4 Bytes) Xiao Ba, 06/12/2013 04:15 AM

Associated revisions

Revision 41644
Added by Usaku NAKAMURA 10 months ago

merge revision(s) 41250: [Backport #8516]

* io.c (io_getc): fix 7bit coderange condition, check if ascii read
  data instead of read length.  [Bug #8516]
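The distinction the commit message draws (deciding "7bit" from the data that was read rather than from how many bytes were read) can be illustrated at the Ruby level. This is only a rough analogy, not the C code in io.c: both helper methods below are hypothetical names introduced for this sketch, and the length-based one stands in for the kind of check the message says was replaced.

```ruby
# One ISO-8859-1 byte that is not ASCII: 0xE1 is "á".
source = "\xE1".b.force_encoding(Encoding::ISO_8859_1)
converted = source.encode(Encoding::UTF_8)

# Hypothetical length-based guess: "one byte was read, so it must be ASCII".
def seven_bit_by_length?(str)
  str.bytesize == 1
end

# Content-based check: every byte really is below 0x80.
def seven_bit_by_content?(str)
  str.ascii_only?
end

puts seven_bit_by_length?(source)   # true, which is wrong for "á"
puts seven_bit_by_content?(source)  # false
puts converted.bytes.inspect        # [195, 161]
puts converted.codepoints.inspect   # [225]
```

If a two-byte UTF-8 string like this were treated as 7bit, walking its codepoints could yield the raw bytes 195 and 161, which is consistent with the output shown in the report.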

History

#1 Updated by Nobuyoshi Nakada 11 months ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r41250.
Xiao, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


io.c: fix 7bit coderange condition

  • io.c (io_getc): fix 7bit coderange condition, check if ascii read data instead of read length. [Bug #8516]

#2 Updated by Nobuyoshi Nakada 11 months ago

  • Backport changed from 1.9.3: UNKNOWN, 2.0.0: UNKNOWN to 1.9.3: REQUIRED, 2.0.0: REQUIRED

#3 Updated by Tomoyuki Chikanaga 11 months ago

  • Tracker changed from Bug to Backport
  • Project changed from ruby-trunk to Backport200
  • Status changed from Closed to Assigned
  • Assignee set to Tomoyuki Chikanaga

#4 Updated by Tomoyuki Chikanaga 11 months ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r41260.


merge revision(s) 41250: [Backport #8516]

* io.c (io_getc): fix 7bit coderange condition, check if ascii read
  data instead of read length.  [Bug #8516]

#5 Updated by Tomoyuki Chikanaga 11 months ago

  • Project changed from Backport200 to Backport93
  • Status changed from Closed to Assigned
  • Assignee changed from Tomoyuki Chikanaga to Usaku NAKAMURA

#6 Updated by Usaku NAKAMURA 10 months ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r41644.


