RFR: 4890732: GZIPOutputStream doesn't support optional GZIP fields [v3]

Sun Apr 11 03:16:31 UTC 2021

On Thu, 8 Apr 2021 08:54:06 GMT, Alan Bateman <alanb at openjdk.org> wrote:

>> Dear All,
>> May I ask your help to review this change? Thanks!
>> 
>> BRs,
>> Lin
>
>> Dear All,
>> May I ask your help to review this change? Thanks!
> 
> @LanceAndersen Do you have cycles to help Lin? This proposal will require discussion, they may be case for the header to be a record for example. My personal view is that the PR should be set aside until there is at least at least some agreement on the API.

Dear @AlanBateman  and @LanceAndersen,

Thanks a lot for your review and comments!

> 
> We should look to see if it makes sense to use some of the more recent java features such as Record.
> 
> If we are adding a specific class to hold the header fields, perhaps using the builder pattern should be considered vs constructors.

I agree, we should finalize the API before moving forward. I am not quite fimilar with Record, I will do some investigation, trying to use it in this PR and discuss with you later. 

> If we are writing out these additional fields, what should we do with them when reading a gzip file back?

IMO, generally there should be check of header crc, and some other checks like the header format, magic number etc. May be we could implement it like the gnu gzip tool, which ingore contents of most header fields but verified the general header info.

> Have you looked at other gzip api implementations to see what they provide in this area?

I have looked at the gzip implementation at https://github.com/linzang/gzip/blob/distrotech-gzip/gzip.c#L1281, this method is use to generate header crc value for check. And there is some legal header check in this method.  And `file name` is used to save original file name in gzip when it is truncated. refer to https://github.com/linzang/gzip/blob/distrotech-gzip/zip.c#L37

In JDK, there is a usage of `file comment` in the gzip heap dump file generated by jcmd `jmap -dump` command.  at https://github.com/openjdk/jdk/blob/ff52f2989fd60ec8251eaf76f4c4b78f10d3e048/src/hotspot/share/services/heapDumperCompression.cpp#L127 , and this info is used in testing like an hint for uncompression, please refer https://github.com/openjdk/jdk/blob/ff52f2989fd60ec8251eaf76f4c4b78f10d3e048/test/lib/jdk/test/lib/hprof/parser/GzipRandomAccess.java#L280.

And IMHO the behavior of adding information in `file comments` seems like  a general way to transfer extra information between compressor and uncompressor. 

> Is there anyone who relies on this additional header info? As I mentioned in an earlier comment, we should validate the changes against another implementation to ensure that we can read the data back correctly and that the additional header data that we write can be read by other tools.

May be we could have similar behavior? which just do some general header info check, calculate CRC of the header and ignore the real contents of most header fields. Do you think it is reasonable? 

Thanks!
Lin

-------------

PR: https://git.openjdk.java.net/jdk/pull/3072