Backport #8516

closed

IO#readchar returns wrong codepoints when converting encoding

Added by bbxiao1 (Xiao Ba) almost 11 years ago. Updated almost 11 years ago.

Status:
Closed
[ruby-core:55444]

Description

I am trying to parse plain text files in various encodings, converting their contents to UTF-8 strings as I read. Non-ASCII characters work fine when the file is encoded as UTF-8, but problems come up with non-UTF-8 files.

$ file -i utf_8.txt
utf_8.txt: text/plain; charset=utf-8

$ file -i iso_8859_1.txt
iso_8859_1.txt: text/plain; charset=iso-8859-1

Code:
utf_8_file = "utf_8.txt"
iso_file = "iso_8859_1.txt"

puts "Processing #{utf_8_file}"
File.open(utf_8_file) do |io|
  line, char = "", nil

  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "Character #{char} codepoints: #{char.each_codepoint.to_a.join}"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end

  line
end
puts "\n"
puts "Processing #{iso_file}"
File.open(iso_file) do |io|
  io.set_encoding("#{Encoding::ISO_8859_1}:#{Encoding::UTF_8}")
  line, char = "", nil

  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "Character #{char} codepoints: #{char.each_codepoint.to_a.join(', ')}"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end

  line
end

Output:
Processing utf_8.txt
Character á has 1 codepoints
Character á codepoints: 225
Character Á has 1 codepoints
Character Á codepoints: 193
Character ð has 1 codepoints
Character ð codepoints: 240
Character
has 1 codepoints
Character
codepoints: 10

Processing iso_8859_1.txt
Character á has 2 codepoints
Character á codepoints: 195, 161
SLICE FAIL
Character Á has 2 codepoints
Character Á codepoints: 195, 129
SLICE FAIL
Character ð has 2 codepoints
Character ð codepoints: 195, 176
SLICE FAIL
Character
has 1 codepoints
Character
codepoints: 10

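With the ISO-8859-1 encoded file, readchar returns a string whose codepoints are the individual bytes of the UTF-8 encoding (for example 195, 161 for á), when I would expect a single UTF-8 codepoint (225), as with the UTF-8 file.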
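
For anyone who wants to check their own Ruby build, the report can be condensed into a self-contained script along the following lines. This is only a sketch: the helper name `transcoded_codepoints` and the use of a temporary file are assumptions made for illustration and are not part of the original report. The four bytes written match the attached 4-byte iso_8859_1.txt as shown in the output above (á, Á, ð, newline).

```ruby
require "tempfile"

# Hypothetical helper (not from the report): write raw ISO-8859-1 bytes to a
# temporary file, read it back with readchar while transcoding to UTF-8, and
# return the codepoints of each character read.
def transcoded_codepoints(bytes)
  result = []
  Tempfile.create("iso_8859_1") do |f|
    f.binmode
    f.write(bytes)
    f.flush
    # "r:ISO-8859-1:UTF-8" means external encoding ISO-8859-1, internal UTF-8.
    File.open(f.path, "r:ISO-8859-1:UTF-8") do |io|
      result << io.readchar.codepoints until io.eof?
    end
  end
  result
end

# The bytes for "á", "Á", "ð" and a newline in ISO-8859-1.
p transcoded_codepoints("\xE1\xC1\xF0\n".b)
```

On a Ruby that behaves as expected, every character should report exactly one codepoint (225, 193, 240 and 10). An affected build would instead show pairs such as 195, 161, as in the output above.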


Files

utf_8.txt (7 Bytes) utf_8.txt bbxiao1 (Xiao Ba), 06/12/2013 04:15 AM
iso_8859_1.txt (4 Bytes) iso_8859_1.txt bbxiao1 (Xiao Ba), 06/12/2013 04:15 AM
Actions #1

Updated by nobu (Nobuyoshi Nakada) almost 11 years ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

This issue was solved with changeset r41250.
Xiao, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


io.c: fix 7bit coderange condition

  • io.c (io_getc): fix 7bit coderange condition, check if ascii read
    data instead of read length. [ruby-core:55444] [Bug #8516]
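
The commit message contrasts two ways of deciding whether read data is 7-bit: by its length, or by whether it is actually ASCII. The Ruby snippet below is only an analogy for that distinction, not the C change made in io.c; the helper name `seven_bit?` is hypothetical. It shows that a single-byte read is not necessarily ASCII.

```ruby
# Hypothetical helper for illustration; the real check lives in C inside io.c.
def seven_bit?(data)
  # True only if every byte is in the ASCII range 0x00..0x7F.
  data.ascii_only?
end

one_byte = "\xE1".b  # a single byte, 0xE1 ("á" in ISO-8859-1)

puts one_byte.bytesize == 1  # length-based test says "one byte"
puts seven_bit?(one_byte)    # content-based test: not ASCII
puts seven_bit?("a")         # a genuine ASCII byte
```

A length-based test accepts the 0xE1 byte, while the content-based test rejects it, which is the kind of mismatch the commit message describes.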

Updated by nobu (Nobuyoshi Nakada) almost 11 years ago

  • Backport changed from 1.9.3: UNKNOWN, 2.0.0: UNKNOWN to 1.9.3: REQUIRED, 2.0.0: REQUIRED
Actions #3

Updated by nagachika (Tomoyuki Chikanaga) almost 11 years ago

  • Tracker changed from Bug to Backport
  • Project changed from Ruby master to Backport200
  • Status changed from Closed to Assigned
  • Assignee set to nagachika (Tomoyuki Chikanaga)
Actions #4

Updated by nagachika (Tomoyuki Chikanaga) almost 11 years ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r41260.
Xiao, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


merge revision(s) 41250: [Backport #8516]

* io.c (io_getc): fix 7bit coderange condition, check if ascii read
  data instead of read length. [ruby-core:55444] [Bug #8516]
Actions #5

Updated by nagachika (Tomoyuki Chikanaga) almost 11 years ago

  • Project changed from Backport200 to Backport193
  • Status changed from Closed to Assigned
  • Assignee changed from nagachika (Tomoyuki Chikanaga) to usa (Usaku NAKAMURA)
Actions #6

Updated by usa (Usaku NAKAMURA) almost 11 years ago

  • Status changed from Assigned to Closed

This issue was solved with changeset r41644.
Xiao, thank you for reporting this issue.
Your contribution to Ruby is greatly appreciated.
May Ruby be with you.


merge revision(s) 41250: [Backport #8516]

* io.c (io_getc): fix 7bit coderange condition, check if ascii read
  data instead of read length. [ruby-core:55444] [Bug #8516]