Bug #10101

Zlib::GzipReader produces different outputs for different methods applied

Added by Rafael Manzo 7 months ago. Updated 6 months ago.

[ruby-core:64128]
Status:Closed
Priority:Normal
Assignee:-
ruby -v:ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-linux]
Backport:2.0.0: DONE, 2.1: DONE

Description

The methods read, readbyte and each_byte are producing different outputs. Compared with the unzipped file, only the result of readbyte has the correct size, but a byte-by-byte comparison with the original file sometimes shows differences at the same positions.

I couldn't reproduce this part of the differences in a way that I could share on the internet, because the original file is a magnetic resonance image subject to confidentiality.

But fortunately I was able to reproduce the bug on input size. I've attached a script that illustrates the problem and here is the link for the file that I've used for the following sample output:

https://drive.google.com/file/d/0B3O0CbLN-q0TcmhGR0RGeWM2UHM/edit?usp=sharing

Sorry about the size, but I couldn't produce a smaller file.

[manzo@WALL-A gz_debug]$ ruby -v
ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-linux]
[manzo@WALL-A gz_debug]$ ruby test1.rb sample.gz
Size of read: 45102570
Size of each_byte: 4668
Size of readbyte: 45158752

I hope I'm right on this report and thank you a lot for your time!

test1.rb - script that reproduces the errors (316 Bytes) Rafael Manzo, 07/31/2014 12:33 AM
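
The attached test1.rb is not reproduced on this page. A minimal sketch of a comparable check is shown here, assuming the script opens the file once and calls rewind between passes (the fix in gzfile_reset suggests this, but it is an assumption). The helper name `gzip_sizes` is hypothetical, and `bytesize` is used so that string encoding cannot skew the count.

```ruby
require "zlib"

# Hypothetical helper, not the attached test1.rb: counts the decompressed
# bytes seen by read, each_byte and readbyte, rewinding between passes.
def gzip_sizes(path)
  sizes = {}
  Zlib::GzipReader.open(path) do |gz|
    # bytesize, not size, so the encoding of the returned string is irrelevant
    sizes[:read] = gz.read.bytesize

    gz.rewind
    count = 0
    gz.each_byte { count += 1 }
    sizes[:each_byte] = count

    gz.rewind
    count = 0
    begin
      loop do
        gz.readbyte
        count += 1
      end
    rescue EOFError
      # readbyte signals the end of the data by raising EOFError
    end
    sizes[:readbyte] = count
  end
  sizes
end
```

Calling `gzip_sizes("sample.gz")` on a fixed Ruby should return the same number for all three keys. On large inputs the each_byte and readbyte passes are slow, as the thread notes.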

Associated revisions

Revision 47327
Added by normal 6 months ago

zlib: GzipReader#rewind preserves ZSTREAM_FLAG_GZFILE

  • ext/zlib/zlib.c (gzfile_reset): preserve ZSTREAM_FLAG_GZFILE
    [Bug #10101]

  • test/zlib/test_zlib.rb (test_rewind): test each_byte

We must preserve the ZSTREAM_FLAG_GZFILE flag to prevent
zstream_detach_buffer from:

a) returning Qnil and breaking out of the `each_byte' loop
b) yielding a large string to each_byte

Note: the test case in the bug report takes a long time. I found this
bug because I noticed the massive time discrepancy between the
`each_byte' and `readbyte' loops before this patch. With this patch,
`each_byte' and `readbyte' both take very long.
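
The two failure modes listed in the commit message can be checked from Ruby. This is a rough sketch loosely modelled on the each_byte assertion the commit adds to test_rewind, not the actual test code. The helper name `bytes_after_rewind` and the temporary file name are hypothetical.

```ruby
require "zlib"
require "tempfile"

# Hypothetical helper: gzip a payload, read it fully, rewind, and then
# collect everything each_byte yields on the second pass.
def bytes_after_rewind(payload)
  Tempfile.create(["rewind", ".gz"]) do |tmp|
    tmp.close
    Zlib::GzipWriter.open(tmp.path) { |gz| gz.write(payload) }

    Zlib::GzipReader.open(tmp.path) do |gz|
      gz.read     # consume the whole stream once
      gz.rewind   # goes through the reset path the commit touches
      yielded = []
      gz.each_byte { |b| yielded << b }
      yielded
    end
  end
end
```

With the fix applied, every yielded element should be an Integer (not a large String, case b) and the list should cover the whole payload (no early exit, case a).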

Revision 47419
Added by Tomoyuki Chikanaga 6 months ago

merge revision(s) r47327: [Backport #10008]

* ext/zlib/zlib.c (gzfile_reset): preserve ZSTREAM_FLAG_GZFILE
  [Bug #10101]

* test/zlib/test_zlib.rb (test_rewind): test each_byte

Revision 47500
Added by Usaku NAKAMURA 6 months ago

merge revision(s) 47327: [Backport #10101]

* ext/zlib/zlib.c (gzfile_reset): preserve ZSTREAM_FLAG_GZFILE
  [Bug #10101]

* test/zlib/test_zlib.rb (test_rewind): test each_byte

History

#1 Updated by Tomoyuki Chikanaga 6 months ago

  • Backport changed from 2.0.0: UNKNOWN, 2.1: UNKNOWN to 2.0.0: REQUIRED, 2.1: REQUIRED

Hello, Rafael.
Thank you for your report.

I can reproduce this with your sample on 2.0.0p433 and 2.1.2, and a similar case can easily be reproduced with a large gzipped file as follows.

$ dd if=/dev/zero of=foo count=5000
$ gzip foo
$ ruby test1.rb foo.gz
Size of read: 2560000
Size of each_byte: 2097151
Size of readbyte: 2560000
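
For readers without dd or gzip at hand, a comparable input can be produced from Ruby. dd's default block size is 512 bytes, so count=5000 gives 5000 * 512 = 2560000 bytes, which matches the sizes printed above. This is only a sketch: the helper name `write_zero_gzip` is hypothetical, and the resulting file is not byte-identical to the output of the gzip command (the header differs), only equivalent once decompressed.

```ruby
require "zlib"

# dd's default block size in bytes
DD_BLOCK_SIZE = 512

# Hypothetical helper: writes `blocks` blocks of zero bytes into a gzip
# file at `path` and returns the uncompressed size in bytes.
def write_zero_gzip(path, blocks: 5000)
  Zlib::GzipWriter.open(path) do |gz|
    blocks.times { gz.write("\0" * DD_BLOCK_SIZE) }
  end
  blocks * DD_BLOCK_SIZE
end
```

Calling `write_zero_gzip("foo.gz")` creates a file that can then be passed to the reproduction script.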

In this case, only `each_byte' returns a wrong value. I suspect there are several different causes.
I don't have time to investigate this right now.
And zlib has no maintainer according to https://bugs.ruby-lang.org/projects/ruby/wiki/MaintainersStdlib
Is there anyone who can handle this?

#2 Updated by cremno phobia 6 months ago

read returns a string with the external encoding. In your case it seems to be UTF-8. The encodings of the given IO object are ignored. Using Zlib::GzipReader.open doesn't work either, by the way. It still ignores the b, but as a workaround you can change the encoding of the returned string, pass `external_encoding: Encoding::ASCII_8BIT` as a new argument, call `String#bytesize`, etc.
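
The effect of the external encoding on the reported size can be seen with a small sketch. It assumes a Ruby recent enough to provide `Zlib.gzip` (newer than the versions in this report). The helper name `read_lengths` is hypothetical, and the encoding is passed explicitly so the result does not depend on the locale.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: gzip a payload in memory, read it back with the
# given GzipReader options, and report how the result is measured.
def read_lengths(payload, **reader_opts)
  gz = Zlib::GzipReader.new(StringIO.new(Zlib.gzip(payload)), **reader_opts)
  data = gz.read
  gz.close
  { encoding: data.encoding, size: data.size, bytesize: data.bytesize }
end

# Ten two-byte UTF-8 characters, so character and byte counts differ.
text = "\u00e9" * 10
utf8   = read_lengths(text, external_encoding: Encoding::UTF_8)
binary = read_lengths(text, external_encoding: Encoding::ASCII_8BIT)
```

Here `utf8[:size]` counts characters while `utf8[:bytesize]` and `binary[:size]` count bytes, which is one way a read-based size can come out smaller than a byte-by-byte count.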

After some rearranging and duplicating of the remaining two cases, I can't say why each_byte sometimes fails. But with the following lines, [-2048, 1] (2048 looks interesting) is printed by f_gz.rewind when it fails.
~~~ruby
def f.seek(*args)
  p args  # print the arguments of each seek call
  super
end
~~~
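
A self-contained variant of this tracing idea is sketched here. It assumes, as the snippet above suggests, that GzipReader#rewind calls seek on the underlying File. The module `SeekTrace` and the helper `rewind_seek_calls` are hypothetical names, and the calls are collected in an array instead of being printed.

```ruby
require "zlib"

# Hypothetical tracing module: records every seek call on the object it
# extends, then defers to the original File#seek.
module SeekTrace
  def seek_calls
    @seek_calls ||= []
  end

  def seek(*args)
    seek_calls << args
    super
  end
end

# Hypothetical helper: returns the seek arguments seen while rewinding.
def rewind_seek_calls(path)
  File.open(path, "rb") do |f|
    f.extend(SeekTrace)
    gz = Zlib::GzipReader.new(f)
    gz.read(16)   # read a little so there is something to rewind
    gz.rewind
    gz.finish     # release the reader without closing f
    f.seek_calls
  end
end
```

Each recorded entry is an `[offset, whence]` pair, the same shape as the `[-2048, 1]` mentioned above. The exact offset depends on how much input the reader had buffered.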

#3 Updated by Anonymous 6 months ago

  • Status changed from Open to Closed
  • % Done changed from 0 to 100

Applied in changeset r47327.


#4 Updated by Eric Wong 6 months ago

nagachika00@gmail.com wrote:

> I don't have time to investigate this right now.
> And zlib has no maintainer according to
> https://bugs.ruby-lang.org/projects/ruby/wiki/MaintainersStdlib
> Is there anyone who can handle this?

Hi, r47327 should fix this:


r47327 | normal | 2014-08-30 23:53:28 +0000 (Sat, 30 Aug 2014) | 18 lines

zlib: GzipReader#rewind preserves ZSTREAM_FLAG_GZFILE

  • ext/zlib/zlib.c (gzfile_reset): preserve ZSTREAM_FLAG_GZFILE
    [Bug #10101]

  • test/zlib/test_zlib.rb (test_rewind): test each_byte

We must preserve the ZSTREAM_FLAG_GZFILE flag to prevent
zstream_detach_buffer from:

a) returning Qnil and breaking out of the `each_byte' loop
b) yielding a large string to each_byte

Note: the test case in the bug report takes a long time. I found this
bug because I noticed the massive time discrepancy between the
`each_byte' and `readbyte' loops before this patch. With this patch,
`each_byte' and `readbyte' both take very long.


I should be able to help out on zlib in the future (and on many bugs
that are reproducible without graphical or proprietary dependencies).

#5 Updated by Tomoyuki Chikanaga 6 months ago

  • Backport changed from 2.0.0: REQUIRED, 2.1: REQUIRED to 2.0.0: REQUIRED, 2.1: DONE

Thank you Eric! It is a great insight.

Backported into ruby_2_1 at r47419.

#6 Updated by Usaku NAKAMURA 6 months ago

  • Backport changed from 2.0.0: REQUIRED, 2.1: DONE to 2.0.0: DONE, 2.1: DONE

Backported into ruby_2_0_0 at r47500.
