Bug #9790

Zlib::GzipReader only decompressed the first of concatenated files

Added by quainjn (Jake Quain) almost 6 years ago. Updated 3 months ago.

Status:
Feedback
Priority:
Normal
Target version:
-
ruby -v:
2.1.1
[ruby-core:62257]
Description

There is a similar old issue in Node that I came across that perfectly describes the situation in ruby:

https://github.com/joyent/node/issues/6032

In ruby given the following setup:

echo "1" > 1.txt
echo "2" > 2.txt
gzip 1.txt
gzip 2.txt
cat 1.txt.gz 2.txt.gz > 3.txt.gz

Calling:

Zlib::GzipReader.open("3.txt.gz") do |gz|
  print gz.read
end

would just print:

1
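A workaround sketch for the behavior described above, assuming `Zlib::GzipReader#unused` behaves as documented (it returns the raw bytes the reader consumed past the end of the current gzip member); the helper name `read_concatenated_gzip` is made up for illustration:

```ruby
require 'zlib'

# Read every gzip member in a concatenated .gz file by looping over
# Zlib::GzipReader instances.  After each member, GzipReader#unused
# returns the raw compressed bytes that were read past the end of that
# member, so we rewind the underlying File to the start of the next
# member before continuing.
def read_concatenated_gzip(path)
  data = String.new
  File.open(path, "rb") do |file|
    until file.eof?
      gz = Zlib::GzipReader.new(file)
      data << gz.read
      unused = gz.unused               # bytes over-read from the next member
      gz.finish                        # end this stream, keep `file` open
      file.pos -= unused.bytesize if unused
    end
  end
  data
end
```

With the `3.txt.gz` from the setup above, this returns `"1\n2\n"` rather than stopping after the first member.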

Files

zlib-gzreader-each_file-9790.patch (3.47 KB) jeremyevans0 (Jeremy Evans), 11/27/2019 03:35 PM

Related issues

Related to Ruby master - Bug #14804: GzipReader cannot read Freebase dump (but gzcat/zless can) (Open)
Has duplicate Ruby master - Bug #11180: Missing lines with Zlib::GzipReader (Open)

Updated by drbrain (Eric Hodel) almost 6 years ago

  • Category set to ext
  • Status changed from Open to Assigned
  • Assignee set to drbrain (Eric Hodel)
  • Target version set to 2.2.0

Updated by akostadinov (Aleksandar Kostadinov) about 5 years ago

Because the gzip format allows multiple entries, each with a filename, I'd suggest supporting a method like Java's ZipInputStream getNextEntry() [1]. That way the programmer can choose to read everything as one chunk of data, or as multiple chunks, each with its own name. This would allow storing and then retrieving multiple files in/from one gz.

On the other hand, the command-line gzip utility only supports reading the whole thing as one. So a convenience method to read everything in one go would also be nice.

[1] http://docs.oracle.com/javase/7/docs/api/java/util/zip/ZipInputStream.html

Updated by duerst (Martin Dürst) about 5 years ago

Aleksandar Kostadinov wrote:

Because the gzip format allows multiple entries, each with a filename, I'd suggest supporting a method like Java's ZipInputStream getNextEntry() [1]. That way the programmer can choose to read everything as one chunk of data, or as multiple chunks, each with its own name. This would allow storing and then retrieving multiple files in/from one gz.

Good idea, but it should be more Ruby-like, such as .each_file or so.

Updated by exAspArk (Evgeny Li) almost 5 years ago

Hey guys, are there any updates?

I created a small gem yesterday that makes it possible to read multiple files: https://github.com/exAspArk/multiple_files_gzip_reader

MultipleFilesGzipReader.open("3.txt.gz") do |gz|
  puts gz.read
end

# 1
# 2
# => nil

Updated by nagachika (Tomoyuki Chikanaga) almost 5 years ago

  • Has duplicate Bug #11180: Missing lines with Zlib::GzipReader added

Updated by jeremyevans0 (Jeremy Evans) 5 months ago

  • Related to Bug #14804: GzipReader cannot read Freebase dump (but gzcat/zless can) added

Updated by jeremyevans0 (Jeremy Evans) 4 months ago

Attached is a patch that adds Zlib::GzipReader.each_file, which will handle multiple concatenated gzip streams in the same file, similar to common tools that operate on .gz files. Zlib::GzipReader.each_file yields one Zlib::GzipReader instance per gzip stream in the file.
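The attached patch implements this in C; the following pure-Ruby sketch (with a made-up name, `each_gzip_member`) only approximates the intended behaviour, using `GzipReader#unused` to find each member boundary:

```ruby
require 'zlib'

# Rough pure-Ruby approximation of the proposed
# Zlib::GzipReader.each_file: yield one GzipReader per gzip member
# in a concatenated file.
def each_gzip_member(path)
  File.open(path, "rb") do |file|
    until file.eof?
      gz = Zlib::GzipReader.new(file)
      yield gz
      gz.read                          # drain the member if the block did not
      unused = gz.unused               # raw bytes from the next member
      gz.finish                        # end this stream, keep `file` open
      file.pos -= unused.bytesize if unused
    end
  end
end

# Hypothetical usage, mirroring the proposed API:
#   each_gzip_member("3.txt.gz") { |gz| print gz.read }
```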

Updated by ko1 (Koichi Sasada) 3 months ago

  • Status changed from Assigned to Feedback

Do you have real (popular) use cases?

Updated by jeremyevans0 (Jeremy Evans) 3 months ago

ko1 (Koichi Sasada) wrote:

Do you have real (popular) use cases?

Real, though not necessarily popular: OpenBSD's package format uses this. I'm not sure whether the package formats of other operating systems use it, though. #14804 pointed out it is used by this file: http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz

There are 3 bug reports for this, and while the need isn't common, I think it's worth supporting via a new method. However, if we decide we don't want to support this, I'm fine with closing the 3 bug reports.

Updated by ko1 (Koichi Sasada) 3 months ago

  • mame: can each_file return an Enumerator? It seems difficult to implement.
  • matz: How about always behaving like zcat? Is an option to keep the old behaviors really needed?
  • akr: The traditional behavior should be kept
  • akr: gzip(1) describes concatenation of gzip files in ADVANCED USAGE. https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage

Conclusion:

  • matz: it should behave like zcat. Handling each member should be deleted.

Updated by Dan0042 (Daniel DeLorme) 3 months ago

matz: it should behave like zcat. Handling each member should be deleted.

Really? I agree that Zlib::GzipReader.open should behave like zcat, but each_file is a useful additional feature to have. And it's already implemented.

Updated by shyouhei (Shyouhei Urabe) 3 months ago

Dan0042 (Daniel DeLorme) wrote:

matz: it should behave like zcat. Handling each member should be deleted.

Really? I agree that Zlib::GzipReader.open should behave like zcat, but each_file is a useful additional feature to have. And it's already implemented.

I was at the meeting ko1 (Koichi Sasada) is talking about. The devs at the meeting were not sure whether the usefulness you describe ("each_file is a useful additional feature to have") is real or not.

Do you know if there are cases when taking a random member of such gzip file is actually useful?

Updated by Dan0042 (Daniel DeLorme) 3 months ago

I can't think of a use for taking a random member, but I can imagine wanting to extract the members to separate files:

Zlib::GzipReader.each_file("input.gz").with_index do |gzip, i|
  File.open("output#{i}", "wb") { |f| f.write(gzip.read) }
end

But I admit I haven't needed this personally, so maybe this is not so useful.

Updated by jeremyevans0 (Jeremy Evans) 3 months ago

shyouhei (Shyouhei Urabe) wrote:

Do you know if there are cases when taking a random member of such gzip file is actually useful?

It's not about taking a random member; it's about processing the separate gzip streams in order, which is actually useful. For example, in OpenBSD's packages, the first gzip stream in the package contains the package metadata. That is read completely, and after processing the metadata, the programs that handle packages determine whether they need to read the actual package data (stored in the second gzip stream). If they don't need to process the package data, they can stop at that point without having read any more than necessary.
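A sketch of that early-stop pattern (the file name, layout, and helper name here are hypothetical, not OpenBSD's actual format): only the first member is inflated, and the second is never touched.

```ruby
require 'zlib'

# Read only the first gzip member of a concatenated file (e.g. the
# package metadata) and stop, leaving later members uninflated.
def read_first_member(path)
  File.open(path, "rb") do |file|
    gz = Zlib::GzipReader.new(file)
    metadata = gz.read    # first member only; this is the reported bug's
    gz.finish             # behaviour, used here deliberately to stop early
    metadata
  end
end
```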

Updated by nobu (Nobuyoshi Nakada) 3 months ago

jeremyevans0 (Jeremy Evans) wrote:

For example, in OpenBSD's packages, the first gzip stream in the package contains the package metadata. That is read completely, and after processing the metadata, the programs that handle packages determine whether it needs to read the actual package data (stored in the second gzip stream). If it doesn't need to process the package data, it can then stop at that point without having read any more than it needs to.

It is an interesting usage.
Does that metadata need to be a separate member, not the leading part of a large stream?

Updated by jeremyevans0 (Jeremy Evans) 3 months ago

nobu (Nobuyoshi Nakada) wrote:

It is an interesting usage.
Does that metadata need to be a separate member, not the leading part of a large stream?

It's easier and more efficient if the metadata is a separate member. I'm guessing it could be changed to use a single large stream, but there is no reason to do so, and doing so would break existing tooling.
