Bug #9790
Zlib::GzipReader only decompressed the first of concatenated files
Status: Closed
Added by quainjn (Jake Quain) over 10 years ago. Updated over 4 years ago.
Description
I came across a similar old issue in Node that perfectly describes the situation in Ruby:
https://github.com/joyent/node/issues/6032
In ruby given the following setup:
echo "1" > 1.txt
echo "2" > 2.txt
gzip 1.txt
gzip 2.txt
cat 1.txt.gz 2.txt.gz > 3.txt.gz
Calling:
Zlib::GzipReader.open("3.txt.gz") do |gz|
  print gz.read
end
would just print:
1
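A workaround that predates any fix is to loop over the streams manually with Zlib::GzipReader#unused, which returns the bytes the reader consumed past the end of the current stream. The sketch below is my illustration, not part of the report; Zlib.gzip and StringIO stand in for the files created above:

```ruby
require "zlib"
require "stringio"

# Two gzip streams back to back, equivalent to `cat 1.txt.gz 2.txt.gz`.
data = Zlib.gzip("1\n") + Zlib.gzip("2\n")

io = StringIO.new(data)
result = +""
loop do
  gz = Zlib::GzipReader.new(io)
  result << gz.read                 # reads only the current stream
  unused = gz.unused                # bytes over-read past this stream's end
  gz.finish                        # close the reader, leave `io` usable
  break if unused.nil? || unused.empty?
  io = StringIO.new(unused + io.read.to_s)  # remainder = next stream(s)
end
print result  # "1" and "2", one per line
```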
Files
zlib-gzreader-each_file-9790.patch (3.47 KB) | jeremyevans0 (Jeremy Evans), 11/27/2019 03:35 PM
Updated by drbrain (Eric Hodel) over 10 years ago
- Category set to ext
- Status changed from Open to Assigned
- Assignee set to drbrain (Eric Hodel)
- Target version set to 2.2.0
Updated by akostadinov (Aleksandar Kostadinov) over 9 years ago
Because the gzip format allows multiple entries, each with its own filename, I'd suggest supporting a method like Java's ZipInputStream getNextEntry() [1]. That way the programmer can choose to read everything as one chunk of data or as multiple chunks, each with its own name. This would allow storing and then retrieving multiple files in/from one gz.
On the other hand, the command line gzip utility only supports reading the whole thing as one, so a convenience method to read everything in one go would also be nice.
[1] http://docs.oracle.com/javase/7/docs/api/java/util/zip/ZipInputStream.html
Updated by duerst (Martin Dürst) over 9 years ago
Aleksandar Kostadinov wrote:
Because the gzip format allows multiple entries, each with its own filename, I'd suggest supporting a method like Java's ZipInputStream getNextEntry() [1]. That way the programmer can choose to read everything as one chunk of data or as multiple chunks, each with its own name. This would allow storing and then retrieving multiple files in/from one gz.
Good idea, but it should be more Ruby-like, such as .each_file or so.
Updated by exAspArk (Evgeny Li) over 9 years ago
Hey guys, are there any updates?
I created a small gem yesterday that can read multiple files: https://github.com/exAspArk/multiple_files_gzip_reader
> MultipleFilesGzipReader.open("3.txt.gz") do |gz|
>   puts gz.read
> end
# 1
# 2
# => nil
Updated by nagachika (Tomoyuki Chikanaga) over 9 years ago
- Has duplicate Bug #11180: Missing lines with Zlib::GzipReader added
Updated by jeremyevans0 (Jeremy Evans) about 5 years ago
- Related to Bug #14804: GzipReader cannot read Freebase dump (but gzcat/zless can) added
Updated by jeremyevans0 (Jeremy Evans) almost 5 years ago
Attached is a patch that adds Zlib::GzipReader.each_file, which will handle multiple concatenated gzip streams in the same file, similar to common tools that operate on .gz files. Zlib::GzipReader.each_file yields one Zlib::GzipReader instance per gzip stream in the file.
Updated by ko1 (Koichi Sasada) almost 5 years ago
- Status changed from Assigned to Feedback
Do you have real (popular) use cases?
Updated by jeremyevans0 (Jeremy Evans) almost 5 years ago
ko1 (Koichi Sasada) wrote:
Do you have real (popular) use cases?
They're real, though not necessarily popular: at least OpenBSD's package format uses this. I'm not sure if package formats for other operating systems use it, though. #14804 pointed out it was used by this file: http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz
There are 3 bug reports for this, and while the need isn't common, I think it's something worth supporting via a new method. However, if we decide we don't want to support this, I'm fine closing the 3 bug reports.
Updated by ko1 (Koichi Sasada) almost 5 years ago
- mame: can each_file return an Enumerator? Seems difficult to implement it
- matz: How about always behaving like zcat? Is an option to keep the old behaviors really needed?
- akr: The traditional behavior should be kept
- akr: gzip(1) describes concatenation of gzip files in ADVANCED USAGE. https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
Conclusion:
- matz: it should behave like zcat. Handling each member should be deleted.
Updated by Dan0042 (Daniel DeLorme) almost 5 years ago
matz: it should behave like zcat. Handling each member should be deleted.
Really? I agree that Zlib::GzipReader.open should behave like zcat, but each_file is a useful additional feature to have. And it's already implemented.
Updated by shyouhei (Shyouhei Urabe) almost 5 years ago
Dan0042 (Daniel DeLorme) wrote:
matz: it should behave like zcat. Handling each member should be deleted.
Really? I agree that Zlib::GzipReader.open should behave like zcat, but each_file is a useful additional feature to have. And it's already implemented.
I was at the meeting @ko1 (Koichi Sasada) is talking about. Devs at the meeting were not sure whether the "each_file is a useful additional feature to have" you mention is real or not.
Do you know if there are cases where taking a random member of such a gzip file is actually useful?
Updated by Dan0042 (Daniel DeLorme) almost 5 years ago
I can't think of taking a random member, but I can imagine wanting to extract to separate files:
Zlib::GzipReader.each_file("input.gz").with_index do |gzip, i|
  File.open("output#{i}", "wb") { |f| f.write(gzip.read) }
end
But I admit I haven't needed this personally, so maybe this is not so useful.
Updated by jeremyevans0 (Jeremy Evans) almost 5 years ago
shyouhei (Shyouhei Urabe) wrote:
Do you know if there are cases when taking a random member of such gzip file is actually useful?
It's not taking a random member, it's processing the separate gzip streams in order, which is actually useful. For example, in OpenBSD's packages, the first gzip stream in the package contains the package metadata. That is read completely, and after processing the metadata, the programs that handle packages determine whether it needs to read the actual package data (stored in the second gzip stream). If it doesn't need to process the package data, it can then stop at that point without having read any more than it needs to.
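The read-the-metadata-then-maybe-stop pattern described here can be sketched with plain GzipReader plus #unused. This is my illustration of the idea, not OpenBSD's actual tooling; the two-stream "package" and its contents are made up:

```ruby
require "zlib"
require "stringio"

# Hypothetical package layout: stream 1 = metadata, stream 2 = payload.
package = Zlib.gzip("name: demo\n") + Zlib.gzip("payload bytes")

io = StringIO.new(package)
gz = Zlib::GzipReader.new(io)
metadata = gz.read                      # decompress only the first stream
rest = gz.unused.to_s + io.read.to_s    # raw bytes of the second stream
gz.finish

# Only decompress the payload if the metadata says we need it;
# otherwise we stop here, never having inflated the second stream.
payload = Zlib.gunzip(rest) if metadata.start_with?("name:")
```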
Updated by nobu (Nobuyoshi Nakada) almost 5 years ago
jeremyevans0 (Jeremy Evans) wrote:
For example, in OpenBSD's packages, the first gzip stream in the package contains the package metadata. That is read completely, and after processing the metadata, the programs that handle packages determine whether it needs to read the actual package data (stored in the second gzip stream). If it doesn't need to process the package data, it can then stop at that point without having read any more than it needs to.
It is an interesting usage.
Does that metadata need to be a separate member, not the leading part of a large stream?
Updated by jeremyevans0 (Jeremy Evans) almost 5 years ago
nobu (Nobuyoshi Nakada) wrote:
It is an interesting usage.
Does that metadata need to be a separate member, not the leading part of a large stream?
It's easier and more efficient if the metadata is a separate member. I'm guessing it could be changed to use a single large stream, but there is no reason to do so, and doing so would break existing tooling.
Updated by jeremyevans0 (Jeremy Evans) over 4 years ago
ko1 (Koichi Sasada) wrote in #note-10:
- mame: can each_file return an Enumerator? Seems difficult to implement it
- matz: How about always behaving like zcat? Is an option to keep the old behaviors really needed?
- akr: The traditional behavior should be kept
- akr: gzip(1) describes concatenation of gzip files in ADVANCED USAGE. https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
Conclusion:
- matz: it should behave like zcat. Handling each member should be deleted.
I'm not sure, but it seems the proposal here would be to make all Zlib::GzipReader methods transparently handle multiple streams. There are two issues with doing that:
- It's very invasive: all methods would need to change, some in fairly complex ways.
- More importantly, it would break cases where non-gzip data is stored after gzip data. Currently you can use GzipReader in such cases, and the io pointer after the gzip processing will be directly after where the gzip stream ends, at which point you can use the io normally. Basically, if you make this change, you could no longer embed a gzip stream in the middle of a file and read just that stream.
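The embedding case in the second point can be sketched like this (my illustration: a gzip stream followed by arbitrary non-gzip bytes). Reading stops at the stream's end, and #unused plus the io's remainder is exactly the trailing data:

```ruby
require "zlib"
require "stringio"

# A gzip stream with non-gzip data after it.
blob = Zlib.gzip("compressed part") + "TRAILER"

io = StringIO.new(blob)
gz = Zlib::GzipReader.new(io)
body = gz.read                           # stops at the end of the gzip stream
trailer = gz.unused.to_s + io.read.to_s  # the non-gzip bytes, untouched
gz.finish

# A reader that transparently consumed concatenated streams would instead
# try to parse "TRAILER" as another gzip header at this point.
```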
If we don't want to add Zlib::GzipReader.each_file but we want to add something like zcat, here's a pull request that implements Zlib::GzipReader.zcat: https://github.com/ruby/zlib/pull/13. I think Zlib::GzipReader.each_file is a more useful and flexible method than Zlib::GzipReader.zcat. We could certainly have both, though.
Updated by jeremyevans0 (Jeremy Evans) over 4 years ago
- Status changed from Feedback to Closed
matz approved Zlib::GzipReader.zcat, so I merged the pull request into zlib.
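For reference, a minimal use of the merged method; to the best of my knowledge Zlib::GzipReader.zcat ships with the zlib gem bundled in Ruby 3.0 and later:

```ruby
require "zlib"
require "stringio"

# Concatenated streams, as in the 3.txt.gz example at the top of the report.
data = Zlib.gzip("1\n") + Zlib.gzip("2\n")

# Decompresses every member, like zcat(1).
out = Zlib::GzipReader.zcat(StringIO.new(data))
print out  # "1" and "2", one per line
```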