Bug #9790

closed

Zlib::GzipReader only decompressed the first of concatenated files

Added by quainjn (Jake Quain) over 10 years ago. Updated over 4 years ago.

Status: Closed
Target version: -
ruby -v: 2.1.1
[ruby-core:62257]

Description

I came across a similar old issue in Node that perfectly describes the situation in Ruby:

https://github.com/joyent/node/issues/6032

In Ruby, given the following setup:

echo "1" > 1.txt
echo "2" > 2.txt
gzip 1.txt
gzip 2.txt
cat 1.txt.gz 2.txt.gz > 3.txt.gz

Calling:

Zlib::GzipReader.open("3.txt.gz") do |gz|
  print gz.read
end

would just print:

1
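The same behavior can be reproduced without touching the filesystem. The following is a minimal sketch, assuming a Ruby with `Zlib.gzip` available; the variable names are illustrative only.

```ruby
require "zlib"
require "stringio"

# Two independent gzip members back to back, like `cat 1.txt.gz 2.txt.gz`.
concatenated = Zlib.gzip("1\n") + Zlib.gzip("2\n")

# GzipReader#read stops at the end of the first member.
first_only = Zlib::GzipReader.new(StringIO.new(concatenated)).read
print first_only
```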

Files

zlib-gzreader-each_file-9790.patch (3.47 KB), jeremyevans0 (Jeremy Evans), 11/27/2019 03:35 PM

Related issues 2 (0 open, 2 closed)

Related to Ruby master - Bug #14804: GzipReader cannot read Freebase dump (but gzcat/zless can) (Closed)
Has duplicate Ruby master - Bug #11180: Missing lines with Zlib::GzipReader (Closed)

Updated by drbrain (Eric Hodel) over 10 years ago

  • Category set to ext
  • Status changed from Open to Assigned
  • Assignee set to drbrain (Eric Hodel)
  • Target version set to 2.2.0

Updated by akostadinov (Aleksandar Kostadinov) over 9 years ago

Because the gzip format allows multiple entries, each with a filename, I'd suggest supporting a method like Java's ZipInputStream getNextEntry() [1]. This way the programmer can choose to read everything as one chunk of data or as multiple chunks, each with its own name. This would allow storing and then retrieving multiple files in/from one gz.

On the other hand, the command line gzip utility only supports reading the whole thing as one. So a convenience method to read everything in one go would also be nice.

[1] http://docs.oracle.com/javase/7/docs/api/java/util/zip/ZipInputStream.html

Updated by duerst (Martin Dürst) over 9 years ago

Aleksandar Kostadinov wrote:

Because the gzip format allows multiple entries, each with a filename, I'd suggest supporting a method like Java's ZipInputStream getNextEntry() [1]. This way the programmer can choose to read everything as one chunk of data or as multiple chunks, each with its own name. This would allow storing and then retrieving multiple files in/from one gz.

Good idea, but it should be more Ruby-like, such as .each_file or so.

Updated by exAspArk (Evgeny Li) over 9 years ago

Hey guys, are there any updates?

I created a small gem yesterday that can read multiple files: https://github.com/exAspArk/multiple_files_gzip_reader

> MultipleFilesGzipReader.open("3.txt.gz") do |gz|
>   puts gz.read
> end

# 1
# 2
# => nil

Updated by nagachika (Tomoyuki Chikanaga) over 9 years ago

  • Has duplicate Bug #11180: Missing lines with Zlib::GzipReader added

Updated by jeremyevans0 (Jeremy Evans) about 5 years ago

  • Related to Bug #14804: GzipReader cannot read Freebase dump (but gzcat/zless can) added

Updated by jeremyevans0 (Jeremy Evans) almost 5 years ago

Attached is a patch that adds Zlib::GzipReader.each_file, which will handle multiple concatenated gzip streams in the same file, similar to common tools that operate on .gz files. Zlib::GzipReader.each_file yields one Zlib::GzipReader instance per gzip stream in the file.

Updated by ko1 (Koichi Sasada) almost 5 years ago

  • Status changed from Assigned to Feedback

Do you have real (popular) use cases?

Updated by jeremyevans0 (Jeremy Evans) almost 5 years ago

ko1 (Koichi Sasada) wrote:

Do you have real (popular) use cases?

There are real, though not necessarily popular, use cases: at least OpenBSD's package format uses this. I'm not sure whether package formats for other operating systems use it, though. #14804 pointed out that it is used by this file: http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz

There are 3 bug reports for this, and I think while the need isn't common, it's something worth supporting via a new method. However, if we decide we don't want to support this, I'm fine closing the 3 bug reports.

Updated by ko1 (Koichi Sasada) almost 5 years ago

  • mame: can each_file return an Enumerator? Seems difficult to implement it
  • matz: How about always behaving like zcat? Is an option to keep the old behaviors really needed?
  • akr: The traditional behavior should be kept
  • akr: gzip(1) describes concatenation of gzip files in ADVANCED USAGE. https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage

Conclusion:

  • matz: it should behave like zcat. Handling each member should be deleted.

Updated by Dan0042 (Daniel DeLorme) almost 5 years ago

matz: it should behave like zcat. Handling each member should be deleted.

Really? I agree that Zlib::GzipReader.open should behave like zcat, but each_file is a useful additional feature to have. And it's already implemented.

Updated by shyouhei (Shyouhei Urabe) almost 5 years ago

Dan0042 (Daniel DeLorme) wrote:

matz: it should behave like zcat. Handling each member should be deleted.

Really? I agree that Zlib::GzipReader.open should behave like zcat, but each_file is a useful additional feature to have. And it's already implemented.

I was at the meeting @ko1 (Koichi Sasada) is talking about. Devs at the meeting were not sure whether each_file really is "a useful additional feature to have", as you say.

Do you know if there are cases when taking a random member of such a gzip file is actually useful?

Updated by Dan0042 (Daniel DeLorme) almost 5 years ago

I can't think of a case for taking a random member, but I can imagine wanting to extract the members to separate files.

Zlib::GzipReader.each_file("input.gz").with_index do |gzip,i|
  File.open("output#{i}","w"){ |f| f.write(gzip.read) }
end

But I admit I haven't needed this personally, so maybe this is not so useful.

Updated by jeremyevans0 (Jeremy Evans) almost 5 years ago

shyouhei (Shyouhei Urabe) wrote:

Do you know if there are cases when taking a random member of such a gzip file is actually useful?

It's not taking a random member, it's processing the separate gzip streams in order, which is actually useful. For example, in OpenBSD's packages, the first gzip stream in the package contains the package metadata. That is read completely, and after processing the metadata, the programs that handle packages determine whether they need to read the actual package data (stored in the second gzip stream). If they don't need to process the package data, they can stop at that point without having read any more than they need to.
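The in-order processing described here can be sketched with the existing API. The helper below is hypothetical: `each_gzip_member` is not part of Zlib and is not the `each_file` from the attached patch. It relies on `Zlib::GzipReader#unused` to find where the next member starts and assumes a seekable io (one that supports `pos=` and `eof?`). The sample data is made up.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper (not part of Zlib): yields one GzipReader per
# gzip member found in a seekable io, in order.
def each_gzip_member(io)
  loop do
    gz = Zlib::GzipReader.new(io)
    begin
      yield gz
      gz.read                 # drain anything the caller left unread
      leftover = gz.unused    # bytes read ahead past the end of this member
    ensure
      gz.finish               # release the reader without closing io
    end
    # Rewind over the read-ahead so the next reader starts at the next member.
    io.pos -= leftover.bytesize if leftover
    break if io.eof?
  end
end

# Made-up stand-in for a package: metadata member, then payload member.
package = Zlib.gzip("meta\n") + Zlib.gzip("payload\n")

# Read only the first member (the metadata) and stop.
metadata = nil
each_gzip_member(StringIO.new(package)) do |gz|
  metadata = gz.read
  break
end

# Or walk every member in order.
members = []
each_gzip_member(StringIO.new(package)) { |gz| members << gz.read }
```

Breaking out of the block after the first member means the second member is never decompressed, although the reader may already have pulled some of its compressed bytes from the io.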

Updated by nobu (Nobuyoshi Nakada) almost 5 years ago

jeremyevans0 (Jeremy Evans) wrote:

For example, in OpenBSD's packages, the first gzip stream in the package contains the package metadata. That is read completely, and after processing the metadata, the programs that handle packages determine whether they need to read the actual package data (stored in the second gzip stream). If they don't need to process the package data, they can stop at that point without having read any more than they need to.

It is an interesting usage.
Does that metadata need to be a separate member, not the leading part of a large stream?

Updated by jeremyevans0 (Jeremy Evans) almost 5 years ago

nobu (Nobuyoshi Nakada) wrote:

It is an interesting usage.
Does that metadata need to be a separate member, not the leading part of a large stream?

It's easier and more efficient if the metadata is a separate member. I'm guessing it could be changed to use a single large stream, but there is no reason to do so, and doing so would break existing tooling.

Updated by jeremyevans0 (Jeremy Evans) over 4 years ago

ko1 (Koichi Sasada) wrote in #note-10:

  • mame: can each_file return an Enumerator? Seems difficult to implement it
  • matz: How about always behaving like zcat? Is an option to keep the old behaviors really needed?
  • akr: The traditional behavior should be kept
  • akr: gzip(1) describes concatenation of gzip files in ADVANCED USAGE. https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage

Conclusion:

  • matz: it should behave like zcat. Handling each member should be deleted.

I'm not sure, but it seems the proposal here would be to make all Zlib::GzipReader methods transparently handle multiple streams. There are two issues with doing that:

  1. It's very invasive: all methods would need to change, some in fairly complex ways.

  2. More importantly, it would break cases where non-gzip data was stored after gzip data. Currently you can use GzipReader in such cases, and the io pointer after the gzip processing will be directly after where the gzip stream ends, at which point you can use the io normally. Basically, if you make this change, you could no longer embed a gzip stream in the middle of a file and read just that stream.
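The second point can be illustrated with a gzip stream followed by plain bytes. This is a sketch under assumptions: the sample data is invented, and the io position is adjusted explicitly with `Zlib::GzipReader#unused`, following the pattern shown in the Zlib::GzipReader class documentation, rather than relying on the reader to reposition the io.

```ruby
require "zlib"
require "stringio"

# A gzip stream followed by plain, non-gzip bytes (made-up sample data).
io = StringIO.new(Zlib.gzip("compressed part\n") + "PLAIN TRAILER")

gz = Zlib::GzipReader.new(io)
body = gz.read            # decompresses only the gzip stream
leftover = gz.unused      # bytes the reader consumed past the gzip footer
gz.finish                 # finish the reader but keep io open

# Step back over the read-ahead, then keep using io normally.
io.pos -= leftover.bytesize if leftover
trailer = io.read
```

If every GzipReader method transparently continued into the following bytes, the reader would try to parse "PLAIN TRAILER" as another gzip header instead of stopping cleanly.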

If we don't want to add Zlib::GzipReader.each_file but we want to add something like zcat, here's a pull request that implements Zlib::GzipReader.zcat: https://github.com/ruby/zlib/pull/13. I think Zlib::GzipReader.each_file is a more useful and flexible method than Zlib::GzipReader.zcat. We could certainly have both, though.

Updated by jeremyevans0 (Jeremy Evans) over 4 years ago

  • Status changed from Feedback to Closed

matz approved Zlib::GzipReader.zcat, so I merged the pull request into zlib.
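For reference, a short sketch of how the merged method can be used, assuming a zlib version that includes Zlib::GzipReader.zcat. Without a block it returns all uncompressed data as one string; with a block it yields strings of uncompressed data and returns nil. The in-memory data here is illustrative.

```ruby
require "zlib"
require "stringio"

# Same shape as 3.txt.gz from the description: two concatenated members.
concatenated = Zlib.gzip("1\n") + Zlib.gzip("2\n")

# Without a block: one string covering every member.
everything = Zlib::GzipReader.zcat(StringIO.new(concatenated))

# With a block: pieces of uncompressed data are yielded as they are read.
pieces = []
Zlib::GzipReader.zcat(StringIO.new(concatenated)) { |chunk| pieces << chunk }
```

For a file on disk, the io would typically be opened in binary mode, e.g. File.open("3.txt.gz", "rb") { |f| Zlib::GzipReader.zcat(f) }.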
