Bug #9790: Zlib::GzipReader only decompressed the first of concatenated files - Ruby - Ruby Issue Tracking System

Added by quainjn (Jake Quain) over 11 years ago. Updated about 5 years ago.

Status: Closed
Assignee: drbrain (Eric Hodel)
Target version:
ruby -v: 2.1.1
Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
[ruby-core:62257]
Tags: ext

Description

I came across a similar old issue in Node that perfectly describes the situation in Ruby:

https://github.com/joyent/node/issues/6032

In Ruby, given the following setup:

echo "1" > 1.txt
echo "2" > 2.txt
gzip 1.txt
gzip 2.txt
cat 1.txt.gz 2.txt.gz > 3.txt.gz

Calling:

Zlib::GzipReader.open("3.txt.gz") do |gz|
  print gz.read
end

would just print:

1

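The behaviour described above can also be reproduced entirely in memory, without the shell setup. This is only a sketch: it assumes a Ruby version that provides Zlib.gzip (2.4 or later), and the variable names are made up for illustration. It builds two gzip members, concatenates them, and shows that a single Zlib::GzipReader stops after the first member while the bytes of the second remain available through GzipReader#unused.

```ruby
require "zlib"
require "stringio"

# Two independent gzip members glued together, mirroring
# `cat 1.txt.gz 2.txt.gz > 3.txt.gz` from the report.
concatenated = Zlib.gzip("1\n") + Zlib.gzip("2\n")

reader = Zlib::GzipReader.new(StringIO.new(concatenated))
first_member = reader.read   # decompresses only the first member
leftover = reader.unused     # raw bytes buffered past the first member, or nil
reader.finish                # release the reader without closing the IO

puts first_member.inspect
puts leftover ? "#{leftover.bytesize} compressed bytes were not decoded" : "nothing left over"
```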
Files

zlib-gzreader-each_file-9790.patch (3.47 KB)

jeremyevans0 (Jeremy Evans), 11/27/2019 03:35 PM

Related issues 2 (0 open — 2 closed)

#1 [ruby-core:62271]

Updated by drbrain (Eric Hodel) over 11 years ago

Category set to ext
Status changed from Open to Assigned
Assignee set to drbrain (Eric Hodel)
Target version set to 2.2.0

#2 [ruby-core:68286]

Updated by akostadinov (Aleksandar Kostadinov) over 10 years ago

Because the gzip format allows multiple entries, each with a filename, I'd suggest supporting a method like Java's ZipInputStream getNextEntry() [1]. This way the programmer can choose to read everything as one chunk of data or as multiple chunks, each with its own name. This would allow storing multiple files in one gz and then retrieving them.

On the other hand, the command line gzip utility only supports reading the whole thing as one. So a convenience method to read everything in one go would also be nice.

[1] http://docs.oracle.com/javase/7/docs/api/java/util/zip/ZipInputStream.html

#3 [ruby-core:68303]

Updated by duerst (Martin Dürst) over 10 years ago

Aleksandar Kostadinov wrote:

Because the gzip format allows multiple entries, each with a filename, I'd suggest supporting a method like Java's ZipInputStream getNextEntry() [1]. This way the programmer can choose to read everything as one chunk of data or as multiple chunks, each with its own name. This would allow storing multiple files in one gz and then retrieving them.

Good idea, but it should be more Ruby-like, such as .each_file or so.

#4 [ruby-core:69362]

Updated by exAspArk (Evgeny Li) about 10 years ago

Hey guys, are there any updates?

I created a small gem yesterday that can read multiple files: https://github.com/exAspArk/multiple_files_gzip_reader

> MultipleFilesGzipReader.open("3.txt.gz") do |gz|
>   puts gz.read
> end

# 1
# 2
# => nil

Updated by nagachika (Tomoyuki Chikanaga) about 10 years ago

Has duplicate Bug #11180: Missing lines with Zlib::GzipReader added

Updated by jeremyevans0 (Jeremy Evans) almost 6 years ago

Related to Bug #14804: GzipReader cannot read Freebase dump (but gzcat/zless can) added

#7 [ruby-core:95987]

Updated by jeremyevans0 (Jeremy Evans) over 5 years ago

File zlib-gzreader-each_file-9790.patch added

Attached is a patch that adds Zlib::GzipReader.each_file, which will handle multiple concatenated gzip streams in the same file, similar to common tools that operate on .gz files. Zlib::GzipReader.each_file yields one Zlib::GzipReader instance per gzip stream in the file.
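For readers without the patch applied, per-stream iteration can be approximated with methods that already exist on Zlib::GzipReader. The sketch below is not the patch's implementation: each_gzip_member is a hypothetical helper name, it yields decompressed strings rather than reader instances, and it assumes a seekable IO (responding to pos, pos= and eof?) that contains nothing but gzip streams.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper (not part of zlib): yields the decompressed
# contents of each gzip member found in a seekable IO, in order.
def each_gzip_member(io)
  return enum_for(:each_gzip_member, io) unless block_given?

  until io.eof?
    reader = Zlib::GzipReader.new(io)
    yield reader.read
    unused = reader.unused              # bytes read past the end of this member
    reader.finish                       # detach the reader, leave io open
    io.pos -= unused.bytesize if unused # rewind to the start of the next member
  end
end

data = Zlib.gzip("1\n") + Zlib.gzip("2\n")
members = each_gzip_member(StringIO.new(data)).to_a
members.each_with_index { |text, i| puts "member #{i}: #{text.inspect}" }
```

Rewinding by the size of unused is needed because the reader buffers input in chunks and may consume bytes that belong to the following member.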

#8 [ruby-core:96832]

Updated by ko1 (Koichi Sasada) over 5 years ago

Status changed from Assigned to Feedback

Do you have real (popular) use cases?

#9 [ruby-core:96833]

Updated by jeremyevans0 (Jeremy Evans) over 5 years ago

ko1 (Koichi Sasada) wrote:

Do you have real (popular) use cases?

For a real but not necessarily popular use case, OpenBSD's package format uses this. I'm not sure if package formats for other operating systems use it, though. #14804 pointed out it was used by this file: http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz

There are 3 bug reports for this, and I think while the need isn't common, it's something worth supporting via a new method. However, if we decide we don't want to support this, I'm fine closing the 3 bug reports.

#10 [ruby-core:96898]

Updated by ko1 (Koichi Sasada) over 5 years ago

mame: can each_file return an Enumerator? Seems difficult to implement it
matz: How about always behaving like zcat? Is an option to keep the old behaviors really needed?
akr: The traditional behavior should be kept
akr: gzip(1) describes concatenation of gzip files in ADVANCED USAGE. https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage

Conclusion:

matz: it should behave like zcat. Handling each member should be deleted.

#11 [ruby-core:96910]

Updated by Dan0042 (Daniel DeLorme) over 5 years ago

matz: it should behave like zcat. Handling each member should be deleted.

Really? I agree that Zlib::GzipReader.open should behave like zcat, but each_file is a useful additional feature to have. And it's already implemented.

#12 [ruby-core:96922]

Updated by shyouhei (Shyouhei Urabe) over 5 years ago

Dan0042 (Daniel DeLorme) wrote:

matz: it should behave like zcat. Handling each member should be deleted.

Really? I agree that Zlib::GzipReader.open should behave like zcat, but each_file is a useful additional feature to have. And it's already implemented.

I was at the meeting @ko1 (Koichi Sasada) is talking about. Devs at the meeting were not sure whether your claim that "each_file is a useful additional feature to have" holds in practice.

Do you know if there are cases when taking a random member of such a gzip file is actually useful?

#13 [ruby-core:96927]

Updated by Dan0042 (Daniel DeLorme) over 5 years ago

I can't think of a case for taking a random member, but I can imagine wanting to extract the members to separate files.

Zlib::GzipReader.each_file("input.gz").with_index do |gzip,i|
  File.open("output#{i}","w"){ |f| f.write(gzip.read) }
end

But I admit I haven't needed this personally, so maybe this is not so useful.

#14 [ruby-core:96929]

Updated by jeremyevans0 (Jeremy Evans) over 5 years ago

shyouhei (Shyouhei Urabe) wrote:

Do you know if there are cases when taking a random member of such a gzip file is actually useful?

It's not taking a random member, it's processing the separate gzip streams in order, which is actually useful. For example, in OpenBSD's packages, the first gzip stream in the package contains the package metadata. That is read completely, and after processing the metadata, the programs that handle packages determine whether they need to read the actual package data (stored in the second gzip stream). If they don't need to process the package data, they can stop at that point without having read any more than they need to.

#15 [ruby-core:96933]

Updated by nobu (Nobuyoshi Nakada) over 5 years ago

jeremyevans0 (Jeremy Evans) wrote:

For example, in OpenBSD's packages, the first gzip stream in the package contains the package metadata. That is read completely, and after processing the metadata, the programs that handle packages determine whether it needs to read the actual package data (stored in the second gzip stream). If it doesn't need to process the package data, it can then stop at that point without having read any more than it needs to.

It is an interesting usage.
Does that metadata need to be a separate member, not the leading part of a large stream?

#16 [ruby-core:96934]

Updated by jeremyevans0 (Jeremy Evans) over 5 years ago

nobu (Nobuyoshi Nakada) wrote:

It is an interesting usage.
Does that metadata need to be a separate member, not the leading part of a large stream?

It's easier and more efficient if the metadata is a separate member. I'm guessing it could be changed to use a single large stream, but there is no reason to do so, and doing so would break existing tooling.

#17 [ruby-core:98625]

Updated by jeremyevans0 (Jeremy Evans) about 5 years ago

ko1 (Koichi Sasada) wrote in #note-10:

mame: can each_file return an Enumerator? Seems difficult to implement it

matz: How about always behaving like zcat? Is an option to keep the old behaviors really needed?

akr: The traditional behavior should be kept

akr: gzip(1) describes concatenation of gzip files in ADVANCED USAGE. https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage

Conclusion:

matz: it should behave like zcat. Handling each member should be deleted.

I'm not sure, but it seems the proposal here would be to make all Zlib::GzipReader methods transparently handle multiple streams. There are two issues with doing that:

1. It's very invasive: all methods would need to change, some in fairly complex ways.
2. More importantly, it would break cases where non-gzip data was stored after gzip data. Currently you can use GzipReader in such cases, and the io pointer after the gzip processing will be directly after where the gzip stream ends, at which point you can use the io normally. Basically, if you make this change, you could no longer embed a gzip stream in the middle of a file and read just that stream.
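The second point can be illustrated with a gzip stream that is followed by plain bytes in the same IO. This is a sketch under assumptions: the data and variable names are invented for illustration, and the position is restored explicitly via GzipReader#unused because the reader may have buffered bytes beyond the end of the stream.

```ruby
require "zlib"
require "stringio"

# A gzip stream followed by unrelated, uncompressed bytes.
io = StringIO.new(Zlib.gzip("compressed part\n") + "PLAIN TRAILER")

reader = Zlib::GzipReader.new(io)
payload = reader.read               # stops at the end of the gzip stream
unused = reader.unused              # trailer bytes the reader buffered, if any
reader.finish                       # detach without closing io
io.pos -= unused.bytesize if unused # put io back just after the gzip stream

trailer = io.read                   # continue using the io normally
puts payload.inspect
puts trailer.inspect
```

If every GzipReader method silently moved on to the next stream, the trailer here would be parsed as a gzip header instead of being left for the caller.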

If we don't want to add Zlib::GzipReader.each_file but we want to add something like zcat, here's a pull request that implements Zlib::GzipReader.zcat: https://github.com/ruby/zlib/pull/13. I think Zlib::GzipReader.each_file is a more useful and flexible method than Zlib::GzipReader.zcat. We could certainly have both, though.

#18 [ruby-core:99267]

Updated by jeremyevans0 (Jeremy Evans) about 5 years ago

Status changed from Feedback to Closed

matz approved Zlib::GzipReader.zcat, so I merged the pull request into zlib.
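A short usage sketch of the merged method, assuming a zlib version that includes Zlib::GzipReader.zcat (it ships with the zlib bundled in Ruby 3.0 and later, as far as I know). Without a block it returns the concatenated output of every gzip stream in the io; the io should not contain non-gzip data after the streams.

```ruby
require "zlib"
require "stringio"

# Same layout as 3.txt.gz in the original report: two concatenated members.
data = Zlib.gzip("1\n") + Zlib.gzip("2\n")

# zcat reads every gzip stream until the end of the io.
combined = Zlib::GzipReader.zcat(StringIO.new(data))
print combined
```

With a real file, the io argument would be an open File, for example File.open("3.txt.gz", "rb") { |f| Zlib::GzipReader.zcat(f) }.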

Related to Ruby - Bug #14804: GzipReader cannot read Freebase dump (but gzcat/zless can) (Closed)
Has duplicate Ruby - Bug #11180: Missing lines with Zlib::GzipReader (Closed)