Bug #18238

closed

CSV encoding issue with parsing from Zlib::GzipReader stream

Added by dim (Dimitrij Denissenko) about 3 years ago. Updated about 3 years ago.

Status: Third Party's Issue
Assignee: -
Target version: -
ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
[ruby-core:105536]

Description

Hi,

I ran into an issue when parsing a CSV directly from a Zlib::GzipReader IO, and I am trying to debug it. Unfortunately, I am not at liberty to share the (proprietary) CSV file, and I couldn't reproduce the issue with a simplified/obfuscated version, but maybe you can point me in the right direction. Here's what's happening:

require "csv"
require "zlib"

CSV::VERSION # => "3.1.9"
File.open("file.csv.gz", encoding: 'binary') do |io|
  Zlib::GzipReader.wrap(io) do |rio|
    CSV.new(rio).count
  end
end

Results in:

~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:346:in `rescue in parse': Invalid byte sequence in UTF-8 in line 38424. (CSV::MalformedCSVError)
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:329:in `parse'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
  ...
~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:237:in `read_chunk': CSV::Parser::InvalidEncoding (CSV::Parser::InvalidEncoding)
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:157:in `scan_all'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:1009:in `parse_quoted_column_value'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:962:in `parse_column_value'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:886:in `parse_quotable_robust'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:864:in `block in parse_quotable_loose'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:127:in `block in each_line'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:103:in `each_line'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:103:in `each_line'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:825:in `parse_quotable_loose'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv/parser.rb:336:in `parse'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
	from ~/.rbenv/versions/3.0.1/lib/ruby/3.0.0/csv.rb:2345:in `each'
	from (irb):3:in `count'

While the following succeeds:

File.open("file.csv", 'w', encoding: 'binary') do |wio|
  File.open("file.csv.gz", encoding: 'binary') do |io|
    Zlib::GzipReader.wrap(io) do |rio|
      IO.copy_stream rio, wio
    end
  end
end

File.open("file.csv") do |rio|
  CSV.new(rio).count
end
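For reference, the same "decompress fully first, then parse" approach can probably be done in memory, without the temporary file. This is only a sketch of a possible workaround, not a fix: the sample data below is made up (I can't share the real file), and it assumes the decompressed content fits comfortably in memory.

```ruby
require "csv"
require "zlib"

# Hypothetical stand-in for the bytes of file.csv.gz.
gz_bytes = Zlib.gzip("name,count\ncaf\u00E9,1\n")

# Decompress everything in one go, so no multi-byte character
# can be cut in half by a partial read.
raw = Zlib.gunzip(gz_bytes)

# Tag the complete byte string as UTF-8 before handing it to CSV.
text = raw.force_encoding(Encoding::UTF_8)

rows = CSV.parse(text)
puts rows.length
```

With the real file, gz_bytes would be something like File.binread("file.csv.gz").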

I have narrowed it down to https://github.com/ruby/csv/blob/v3.1.9/lib/csv/parser.rb#L235-L237. It looks like reading the chunk truncates the string in the middle of a multi-byte UTF-8 character, so chunk.valid_encoding? returns false.
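To illustrate what I think is going on, here is a minimal sketch that does not use CSV or Zlib at all. It only shows how cutting a UTF-8 string at an arbitrary byte offset produces a chunk that fails valid_encoding?, even though the full data is fine. The sample text and the chunk_size value are made up for the illustration; they are not taken from the parser.

```ruby
# "é" is two bytes in UTF-8 (0xC3 0xA9), so "café" occupies bytes 0..4.
text = "caf\u00E9,1\n"

# Hypothetical chunk size that happens to fall between the two bytes of "é".
chunk_size = 4

chunk = text.byteslice(0, chunk_size)              # ends with a lone 0xC3
rest  = text.byteslice(chunk_size, text.bytesize)  # starts with a lone 0xA9

puts text.valid_encoding?   # the complete string
puts chunk.valid_encoding?  # the truncated chunk

# Joining the two pieces restores the original, valid string.
rejoined = chunk + rest
puts rejoined.valid_encoding?
```

If the reader returns a chunk like this, the parser has no way to tell a truncated character from genuinely invalid input.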
