Bug #14804

GzipReader cannot read Freebase dump (but gzcat/zless can)

Added by amadan (Goran Topic) 10 months ago. Updated 10 months ago.

Status: Open
Priority: Normal
Assignee: -
Target version: -
ruby -v: Ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-darwin17]
[ruby-core:87339]

Description

This is likely related to https://stackoverflow.com/questions/35354951/gzipstream-quietly-fails-on-large-file-stream-ends-at-2gb (and its accepted answer).

The file in question: http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz
(watch out, it's 30 GB compressed!)

Steps to reproduce:

require "zlib"
Zlib::GzipReader.open("freebase-rdf-latest.gz") { |f| f.each_line.count }
# => 14374340

However, the correct answer is different:

$ gzcat freebase-rdf-latest.gz | wc -l
3130753066

Another experiment showed that the last f.tell was 1945715682, even though the uncompressed file contains considerably more bytes than that. This fits the C# Stack Overflow report linked above, which states that the first "substream" contains exactly that many bytes.
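For what it's worth, a small diagnostic along these lines confirms that more data simply follows the first gzip member (GzipReader#unused returns whatever was buffered past the end of the member that was read, and a following member should start with the gzip magic bytes \x1f\x8b); this is only a sketch:

require "zlib"

File.open("freebase-rdf-latest.gz", "rb") do |io|
  gz = Zlib::GzipReader.new(io)
  gz.each_line { }                                  # consume the first gzip member only
  puts "uncompressed bytes in first member: #{gz.tell}"
  leftover = gz.unused                              # data buffered past the end of the member
  gz.finish                                         # finish the reader without closing io
  if leftover && leftover.start_with?("\x1f\x8b".b)
    puts "another gzip member follows"
  end
  # (if leftover is nil, the next member would have to be peeked at from io directly)
end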

If this is a hard constraint of the wrapped library (and thus something to be fixed upstream), the documentation should at least mention it.
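In the meantime, the following is a sketch of one possible workaround: open the File yourself, read one gzip member at a time with GzipReader, and use GzipReader#unused and #finish to rewind the IO to the start of the next member before opening a new reader. The helper name each_gzipped_line is only for illustration.

require "zlib"

def each_gzipped_line(path)
  File.open(path, "rb") do |io|
    until io.eof?
      gz = Zlib::GzipReader.new(io)
      gz.each_line { |line| yield line }
      unused = gz.unused                    # bytes buffered past the end of this member
      gz.finish                             # closes the reader, but not io
      io.pos -= unused.bytesize if unused   # rewind to the start of the next member
    end
  end
end

count = 0
each_gzipped_line("freebase-rdf-latest.gz") { count += 1 }
puts count   # should report all lines, not just those in the first member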

History

Updated by amadan (Goran Topic) 10 months ago

(Note that f.each_line.count would return the wrong result anyway, due to https://bugs.ruby-lang.org/issues/14805 , since 3130753066 is outside the int32 range, but it never gets the chance to, because it stops prematurely.)
