RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Mon Apr 5 20:10:27 UTC 2021

On Mon, 29 Mar 2021 09:22:43 GMT, Kim Barrett <kbarrett at openjdk.org> wrote:

>> Thank you Kim for your reply and please see my inline comments.
>>> 
>>> 
>>> > [https://bugs.openjdk.java.net/browse/JDK-8254699](JDK-8254699) contains test results in XL file to show PreTouchParallelChunkSize was recently changed from 1GB to 4MB on Linux after testing various sizes. I have downloaded the same XL file and same is updated for Oldgen case during resize and it gives some rough idea about the improvement for this fix and follow up fix. Please check "PretouchOldgenDuringResize" sheet for "Co-operative Fix" and "Adaptive Resize Fix" columns.
>>> > [PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)
>>> 
>>> I'm not sure how to interpret this. In particular, it says "Test done using November 27rd Openjdk build". That predates a number of recent changes in this area that seem relevant. Is that correct? Or is that stale and the data has been updated for a newer baseline.
>> 
>> Sorry for the confusion. I didnt test again and pre-touch time taken with different chunk size per thread was already recorded in the spreadsheet and thought to use it for reference to reply to David feedback "A change like this needs a lot more testing than that, both functionally and performance.".
>> 
>>> 
>>> Also, can you provide details about how this information is generated and collected? I would like to reproduce the results and do some other experiments.
>> 
>> JDK-8254699 was fixed by me and test was done using SPECJbb composite with following command along with JVM flag "PreTouchParallelChunkSize" with values 1G (earlier default on x86) and 4M (current default on x86). To generate this info,  "Ticks::now()" function was called to record the time taken for pretouch operation. The first table (please ignore resize related) in the sheet "SPECjbb_Summary"  was created from the sheet "SPECjbb_128C256T_GCPretouchTime" that contains pretouch log dumps for SPECjbb composite tests. 
>> 
>> Command (also mentioned in SPECjbb_128C256T_GCPretouchTime sheet):
>> -Xms24g -Xmx970g -Xmn960g -server -XX:+UseParallelGC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -XX:MaxTenuringThreshold=15 -XX:+ScavengeBeforeFullGC -XX:+UseAdaptiveSizePolicy -XX:ParallelGCThreads=256 -XX:-UseBiasedLocking -XX:SurvivorRatio=200 -XX:TargetSurvivorRatio=95 -XX:+UseNUMA -XX:-UseNUMAInterleaving -XX:+UseTransparentHugePages -XX:-UseAdaptiveNUMAChunkSizing
>> 
>>> 
>>> I think that, as proposed, the change is making already messy code still more messy, and some refactoring is needed. I don't know what the details of that might look like yet; I'm still exploring and poking at the code. I also need to look at the corresponding part of G1 that you mentioned.
>> 
>> First, I thought to push this change for ParallelGC and next G1GC. Earlier ParallelGC pretouch was done with single threaded and later this was fixed by moving multithreaded pretouch common code from G1GC and ParallelGC to PretouchTask class. I thought to do it similarly for this case too.
>> 
>>> 
>>> > The "Adaptive Resize Fix" column in the sheet is for next suggested fix and may possibly help to improve further. For server JVM, expansion size of 512KB, 2MB (hugepages) and 64MB looks good for first resize but later needs some attention I think. JVM flag "MinHeapDeltaBytes" needs to be known by the user and need to set it upfront. I think this can be consider for first resize in every GC and later dynamically go for higher size like double the previous size to adopt to application nature. This way it may help to reduce the GC pause time during the expansion. I thought to share my observation and my understanding could be wrong. So please check and suggest.
>>> 
>>> I think something like this has a serious risk of growing the oldgen a lot more than needed, which may have serious downsides.
>> 
>> I believe this depends on application runtime nature and it may expand as needed. If you still think that this change is not useful then please suggest. Will withdraw this PR.
>> 
>> Thanks,
>> Amit
>
>> > > [PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)
>> > 
>> > 
>> > I'm not sure how to interpret this. In particular, it says "Test done using November 27rd Openjdk build". That predates a number of recent changes in this area that seem relevant. Is that correct? Or is that stale and the data has been updated for a newer baseline.
>> 
>> Sorry for the confusion. I didnt test again and pre-touch time taken with different chunk size per thread was already recorded in the spreadsheet and thought to use it for reference to reply to David feedback "A change like this needs a lot more testing than that, both functionally and performance.".
> 
> Getting any benefit from parallel pretouch here suggests we have many threads piling up on the expand mutex. Before JDK-8260045 (pushed together with JDK-8260044 on 2/12/2021) that would lead to "expansion storms", where several threads piled up on that mutex and serially entered and did a new expansion (with associated pretouch, and possibly of significant size). Getting rid of the expansion storms may alleviate the serial pretouch quite a bit. There may still be some benefit to cooperative parallization, but new measurements are needed to determine how much of a problem still exists.
> 
> If it is still worth addressing, I think the proposed approach has problems, as it makes messy complicated code messier and more complicated.  I think some preliminary refactoring is needed.  For example, splitting MutableSpace::initialize into a setup-pages part and set-the-range part.  I've not explored in detail what's needed yet, pending new measurements.  (I haven't had time to do those measurements yet myself.)

Thanks Kim for your feedback.

I have updated the code and it looks good now. Please use "Full" diff to see all the changes.

Summary of changes: added new class in pretouchTask.hpp file for synchronization. Now co-operative pretouch is enabled for both ParallelGC and G1GC. For testing purpose, added new flag "UseMultithreadedPretouchForOldGen" and timing is enabled for old-gen pretouch case for measurements. Enabling this flag makes try_lock to be used instead of lock (which is current default).

This change will be useful only if PreTouchParallelChunkSize value is not very huge compared to old-gen expansion size. For Linux it is set to 4MB and for other OS it is still 1GB.

Testing SPECJbb composite on AMD EPYC machine shows following improvement.

- Parallel GC: with UseNUMA, minimum expansion size is 64MB
-- with 2MB hugepages, master branch time taken is ~9-10ms and current fix time taken is ~1-2ms`
-- without hugepages, master branch time taken is ~14-17ms and current fix time taken is ~2-3ms

- Parallel GC: without UseNUMA, minimum expansion size is 512KB so this change is not useful for this size. Sometime expansion goes upto 20MB then pretouch takes less time for such large size.

- G1 GC:  with UseNUMA, maximum expansion size is 32MB
-- with 2MB hugepages, master branch time taken is ~5-7ms and current fix time taken is ~1-7ms` (higher time due to single thread activity as there was no other thread waiting to acquire the lock)

- G1 GC:  without UseNUMA, minimum expansion size 64KB and maximum expansion size is 32MB
-- without hugepages, master branch time taken = ~9-11ms and current fix time taken ~2-9ms (same here too)

Please check.

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976