RFR 8080640: Reduce copying when reading JAR/ZIP entries

Mon Jun 29 20:49:53 UTC 2015

Hi,

Discovered a late bug where signed JARs would not be verified using the 
faster path as the verification is done on the returned stream.
The JarFile.getInputStream will return different stream types depending 
on if the entry should be verified or not.

Also added a size check of the entry to make sure we do not try to 
allocate too large arrays when reading an entry, mainly a safeguard for 
corrupt entries.

webrev: http://cr.openjdk.java.net/~sfriberg/JDK-8080640/webrev.5/

Not sure if the added complexity is worth it in this case, but still see 
a 1-2% improvement when starting a large server application.

Thanks,
Staffan

On 06/23/2015 12:56 PM, Xueming Shen wrote:
> looks good!
>
> -Sherman
>
> On 06/23/2015 12:47 PM, Staffan Friberg wrote:
>> Hi Sherman,
>>
>> Removed the check, 
>> http://cr.openjdk.java.net/~sfriberg/JDK-8080640/webrev.4
>>
>> Cheers,
>> Staffan
>>
>> On 06/23/2015 11:22 AM, Xueming Shen wrote:
>>> Hi Staffan,
>>>
>>> #527 check is probably unnecessary. The size and csize are 32-bit 
>>> unsigned integer, they
>>> should never be < 0.
>>>
>>> The rest looks good.
>>>
>>> Thanks,
>>> -Sherman
>>>
>>> On 06/23/2015 10:54 AM, Staffan Friberg wrote:
>>>> Hi Sherman,
>>>>
>>>> Thanks for the review. I removed the unused import and the changes 
>>>> to reduce the lock region.
>>>>
>>>> New webrev, http://cr.openjdk.java.net/~sfriberg/JDK-8080640/webrev.3
>>>>
>>>> Thanks,
>>>> Staffan
>>>>
>>>>
>>>> On 06/08/2015 02:37 PM, Xueming Shen wrote:
>>>>> Staffan,
>>>>>
>>>>> (1) ByteArrayInputSteram is no longer needed in ZipFile
>>>>> (2) You've changed the lock region in ZipFile.getInputSteram. 
>>>>> Given we are not
>>>>>      doing ByteArrayInpusteram for this method, can we just not 
>>>>> touch this method
>>>>>      and the class ZipFileInputSteram()?
>>>>>      The concern is that we did some changes in that area back to 
>>>>> 2010 and triggered
>>>>>      a complicated race condition regression [1], it was finally 
>>>>> fixed after lot of rounds
>>>>>      of discussion. I still have all those emails in my inbox. It 
>>>>> would be better to keep
>>>>>      whatever works for now, instead of re-fresh all the memory 
>>>>> (read all those emails)
>>>>>      to figure out if the latest change might have a negative impact.
>>>>>
>>>>> The getBytes() implementation looks good to me.
>>>>>
>>>>> Thanks!
>>>>> -Sherman
>>>>>
>>>>> [1] 
>>>>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-March/006355.html
>>>>>
>>>>> On 06/05/2015 11:09 AM, Staffan Friberg wrote:
>>>>>> Hi Sherman,
>>>>>>
>>>>>> I have a new webrev which reverts that part, 
>>>>>> http://cr.openjdk.java.net/~sfriberg/JDK-8080640/webrev.2/
>>>>>>
>>>>>> Summary of changes
>>>>>>     Reduce lock region in ZipFile.getInputstream
>>>>>>     Add private ZipFile.getBytes that can be used in select 
>>>>>> places in the JDK where all bytes will be read
>>>>>>
>>>>>> Could you sponsor this change once it has been reviewed?
>>>>>>
>>>>>> Thanks,
>>>>>> Staffan
>>>>>>
>>>>>> On 06/03/2015 10:45 AM, Xueming Shen wrote:
>>>>>>> Staffan,
>>>>>>>
>>>>>>> I'm not convinced that the benefit here is significant enough to 
>>>>>>> change the
>>>>>>> getInputStream() to return a ByteArrayInputStream, given this 
>>>>>>> can be easily
>>>>>>> achieved by wrapping the returned byte[] from getBytes(ZipEntry) 
>>>>>>> at user's
>>>>>>> site. I would suggest to file a separate rfe on this 
>>>>>>> disagreement and move on
>>>>>>> with the agreed getBytes() for now.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Sherman
>>>>>>>
>>>>>>> On 06/02/2015 10:27 AM, Staffan Friberg wrote:
>>>>>>>>
>>>>>>>> On 05/22/2015 01:15 PM, Staffan Friberg wrote:
>>>>>>>>> On 05/22/2015 11:51 AM, Xueming Shen wrote:
>>>>>>>>>> On 05/22/2015 11:41 AM, Staffan Friberg wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 05/21/2015 11:00 AM, Staffan Friberg wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 05/21/2015 09:48 AM, Staffan Friberg wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 05/20/2015 10:57 AM, Xueming Shen wrote:
>>>>>>>>>>>>>> On 05/18/2015 06:44 PM, Staffan Friberg wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Wanted to get reviews and feedback on this performance 
>>>>>>>>>>>>>>> improvement for reading from JAR/ZIP files during 
>>>>>>>>>>>>>>> classloading by reducing unnecessary copying and reading 
>>>>>>>>>>>>>>> the entry in one go instead of in small portions. This 
>>>>>>>>>>>>>>> shows a significant improvement when reading a single 
>>>>>>>>>>>>>>> entry and for a large application with 10k classes and 
>>>>>>>>>>>>>>> 500+ JAR files it improved the startup time by 4%.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For more details on the background and performance 
>>>>>>>>>>>>>>> results please see the RFE entry.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> RFE - https://bugs.openjdk.java.net/browse/JDK-8080640
>>>>>>>>>>>>>>> WEBREV - 
>>>>>>>>>>>>>>> http://cr.openjdk.java.net/~sfriberg/JDK-8080640/webrev.0
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Staffan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Staffan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I did not miss something here, from your use scenario 
>>>>>>>>>>>>>> it appears to me the only thing you really
>>>>>>>>>>>>>> need here to help boost your performance is
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     byte[] ZipFile.getAllBytes(ZipEntry ze);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You are allocating a byte[] at use side and wrapping it 
>>>>>>>>>>>>>> with a ByteBuffer if the size is small enough,
>>>>>>>>>>>>>> otherwise, you letting the ZipFile to allocate a big 
>>>>>>>>>>>>>> enough one for you. It does not look like you
>>>>>>>>>>>>>> can re-use that byte[] (has to be wrapped by the 
>>>>>>>>>>>>>> ByteArrayInputStream and return), why do you
>>>>>>>>>>>>>> need two different methods here? The logic would be much 
>>>>>>>>>>>>>> easier to simply let the ZipFile to allocate
>>>>>>>>>>>>>> the needed buffer with appropriate size, fill the bytes 
>>>>>>>>>>>>>> and return, with a "OOME" if the entry size
>>>>>>>>>>>>>> is bigger than 2g.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The only thing we use from the input ze is its name, get 
>>>>>>>>>>>>>> the size/csize from the jzentry, I don't think
>>>>>>>>>>>>>> jzentry.csize/size can be "unknown", they are from the 
>>>>>>>>>>>>>> "cen" table.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If the real/final use of the bytes is to wrap it with a 
>>>>>>>>>>>>>> ByteArrayInputStream,why bother using ByteBuffer
>>>>>>>>>>>>>> here? Shouldn't a direct byte[] with exactly the size of 
>>>>>>>>>>>>>> the entry server better.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Sherman
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Sherman,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the comments. I agree, was starting out with 
>>>>>>>>>>>>> bytebuffer because I was hoping to be able to cache things 
>>>>>>>>>>>>> where the buffer was being used, but since the buffer is 
>>>>>>>>>>>>> past along further I couldn't figure out a clean way to do 
>>>>>>>>>>>>> it.
>>>>>>>>>>>>> Will rewrite it to simply just return a buffer, and only 
>>>>>>>>>>>>> wrap it in the Resource class getByteBuffer.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What would be your thought on updating the 
>>>>>>>>>>>>> ZipFile.getInputStream to return ByteArrayInputStream for 
>>>>>>>>>>>>> small entries? Currently I do that work outside in two 
>>>>>>>>>>>>> places and moving it would potentially speed up others 
>>>>>>>>>>>>> reading small entries as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Staffan
>>>>>>>>>>>> Just realized that my use of ByteArrayInputStream would 
>>>>>>>>>>>> miss Jar verification if enabled so the way to go hear 
>>>>>>>>>>>> would be to add it if possible to the ZipFile.getInputStream.
>>>>>>>>>>>>
>>>>>>>>>>>> //Staffan
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Here is an updated webrev which uses a byte[] directly and 
>>>>>>>>>>> also uses ByteArrayInputStream in ZipFile for small entries 
>>>>>>>>>>> below 128k.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'm not sure about the benefit of doing the 
>>>>>>>>>> ByteArrayInputStream in ZipFile.getInputStream. It has
>>>>>>>>>> the consequence of changing the "expected" behavior of 
>>>>>>>>>> getInputStream() (instead of return an
>>>>>>>>>> input stream waiting for reading, it now reads all bytes in 
>>>>>>>>>> advance), something we might not want
>>>>>>>>>> to do in a performance tuning. Though it might be reasonable 
>>>>>>>>>> to guess everyone get an input stream
>>>>>>>>>> is to read all bytes from it later.
>>>>>>>>>>
>>>>>>>>>> -Sherman
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> http://cr.openjdk.java.net/~sfriberg/JDK-8080640/webrev.1
>>>>>>>>>>>
>>>>>>>>>>> //Staffan
>>>>>>>>>>
>>>>>>>>> Agree that it will change the behavior slightly, but as you 
>>>>>>>>> said it is probably expected that some one will read the 
>>>>>>>>> stream eventually.
>>>>>>>>> We could reduce the size further if that makes a difference, 
>>>>>>>>> if the size is below 65k we would not use more memory than the 
>>>>>>>>> buffer allocated for the InflaterStream today.
>>>>>>>>> The total allocation would be slightly larger for deflated 
>>>>>>>>> entries as we would allocate a byte[] for the compressed 
>>>>>>>>> bytes, but it would be GC:able and not kept alive. So from a 
>>>>>>>>> memory perspective the difference is very limited.
>>>>>>>>>
>>>>>>>>> //Staffan
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Bumping this thread to get some more comments on the concern 
>>>>>>>> about changing the ZipFile.getInputStream behavior. The benefit 
>>>>>>>> of doing this change is that any read of small entries from ZIP 
>>>>>>>> and JAR files will be much faster and less resources will be 
>>>>>>>> held, including native resources normally held by the 
>>>>>>>> ZipInputStream.
>>>>>>>>
>>>>>>>> The behavior change that will occur is that the full entry will 
>>>>>>>> be read as part of creating the stream and not lazily as might 
>>>>>>>> be expected. However when getting a today InputStream zip file 
>>>>>>>> will be accessed to read information about the size of the 
>>>>>>>> entry, so the zip file is already touched when getting an 
>>>>>>>> InputStream, but not the compressed bytes.
>>>>>>>>
>>>>>>>> I'm fine with removing this part of the change and just push 
>>>>>>>> the private getBytes function and the updates to the JDK 
>>>>>>>> libraries to use it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Staffan
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>