Bug #18238
CSV encoding issue with parsing from Zlib::GzipReader stream (closed)
Status: Third Party's Issue
Assignee: -
Target version: -
ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
Description
Hi,
I found an issue with parsing CSVs directly from a Zlib::GzipReader IO, which I am trying to debug. Unfortunately, I am not at liberty to share the (proprietary) CSV file, and I couldn't recreate the issue with a simplified/obfuscated version, but maybe you can point me in the right direction. Here's what's happening:
require "csv"
require "zlib"

CSV::VERSION # => "3.1.9"
File.open("file.csv.gz", encoding: 'binary') do |io|
  Zlib::GzipReader.wrap(io) do |rio|
    CSV.new(rio).count
  end
end
Results in:
~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:346:in `rescue in parse': Invalid byte sequence in UTF-8 in line 38424. (CSV::MalformedCSVError)
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:329:in `parse'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
...
~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:237:in `read_chunk': CSV::Parser::InvalidEncoding (CSV::Parser::InvalidEncoding)
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:157:in `scan_all'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:1009:in `parse_quoted_column_value'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:962:in `parse_column_value'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:886:in `parse_quotable_robust'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:864:in `block in parse_quotable_loose'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:127:in `block in each_line'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:103:in `each_line'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:103:in `each_line'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:825:in `parse_quotable_loose'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:336:in `parse'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
from (irb):3:in `count'
While the following succeeds:
File.open("file.csv", 'w', encoding: 'binary') do |wio|
  File.open("file.csv.gz", encoding: 'binary') do |io|
    Zlib::GzipReader.wrap(io) do |rio|
      IO.copy_stream rio, wio
    end
  end
end
File.open("file.csv") do |rio|
  CSV.new(rio).count
end
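For comparison, the same "decompress first, parse afterwards" idea can be sketched entirely in memory. This is only an illustration under assumptions: the sample data is made up, the whole file is assumed to fit in memory, and it is not the fix adopted by the csv library.

```ruby
require "csv"
require "zlib"
require "stringio"

# Build a small gzip-compressed CSV in memory (hypothetical sample data
# containing multi-byte UTF-8 characters).
compressed = StringIO.new(String.new)
gz = Zlib::GzipWriter.new(compressed)
gz.write("name,city\nZo\u00EB,Krak\u00F3w\n")
gz.finish # flush the gzip stream without closing the StringIO

# Decompress everything up front, tag the bytes as UTF-8, then parse.
# Because the parser sees one complete string, no multi-byte character
# can be split across read boundaries.
data = Zlib::GzipReader.new(StringIO.new(compressed.string)).read
data.force_encoding(Encoding::UTF_8)
rows = CSV.parse(data)
```

This trades streaming for simplicity, so it is only reasonable when the decompressed file is small enough to hold in memory.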
I have narrowed it down to https://github.com/ruby/csv/blob/v3.1.9/lib/csv/parser.rb#L235-L237. It looks like reading the chunk cuts the string off in the middle of a multi-byte UTF-8 character, so chunk.valid_encoding? returns false.
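The suspected mechanism can be reproduced without the csv library at all. The following is a minimal sketch, assuming a read boundary that happens to fall between the two bytes of "é" (the sample string and the cut position are made up for illustration):

```ruby
# "é" (U+00E9) is encoded in UTF-8 as two bytes: 0xC3 0xA9.
text = "caf\u00E9,1\n"

# Simulate a fixed-size binary read that stops after the first byte of "é".
bytes = text.b
head = bytes.byteslice(0, 4).force_encoding(Encoding::UTF_8)
tail = bytes.byteslice(4, bytes.bytesize - 4).force_encoding(Encoding::UTF_8)

head_valid = head.valid_encoding? # => false (ends with a dangling lead byte)
tail_valid = tail.valid_encoding? # => false (starts with a lone continuation byte)

# Joining the raw bytes again restores a valid UTF-8 string.
rejoined_valid = (head.b + tail.b).force_encoding(Encoding::UTF_8).valid_encoding? # => true
```

Neither half is valid on its own, although the complete data is, which matches a parser that checks each chunk in isolation.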