RFR: JDK-8260332: ParallelGC: Cooperative pretouch for oldgen expansion [v2]

Amit Pawar github.com+71302734+amitdpawar at openjdk.java.net
Mon Mar 22 17:42:40 UTC 2021


On Mon, 22 Mar 2021 04:39:48 GMT, Kim Barrett <kbarrett at openjdk.org> wrote:

>>> 
>>> 
>>> _Mailing list message from [Kim Barrett](mailto:kim.barrett at oracle.com) on [hotspot-dev](mailto:hotspot-dev at openjdk.java.net):_
>>> 
>>> > On Mar 12, 2021, at 3:01 PM, Amit Pawar <github.com+71302734+amitdpawar at openjdk.java.net> wrote:
>>> > In case of ParallelGC, oldgen expansion can happen during promotion. Expanding thread will touch the pages and can't request for task execution as this GC thread is already executing a task. The expanding thread holds the lock on "ExpandHeap_lock" to resize the oldgen and other threads may wait for their turn. This is a blocking call.
>>> > This patch changes this behavior by adding another constructor in "MutexLocker" class to enable non blocking or try_lock operation. This way one thread will acquire the lock and other threads can join pretouch work. Threads failed to acquire the lock will join pretouch only when task is marked ready by expanding thread.
>>> > Following minimum expansion size are seen during expansion.
>>> > 1. 512KB without largepages and without UseNUMA.
>>> > 2. 64MB without largepages and with UseNUMA,
>>> > 3. 2MB (on x86)  with large pages and without UseNUMA,
>>> > 4. 64MB without large pages and with UseNUMA.
>>> > When Oldgen is expanding repeatedly with smaller size then this change wont help. For such cases, resize size should adapt to application demand to make use of this change. For example if application nature triggers 100 expansion with smaller sizes in same GC then it is better to increase the expansion size during each resize to reduce the number of resizes. If this patch is accepted then will plan to fix this case in another patch.
>>> 
>>> Sorry, but a change like this needs better motivation. What you say
>>> above suggests this change doesn't actually help.
>>> 
>>> It's intentional that oldgen expansions aren't generally large, as the
>>> oldgen shouldn't be grown unnecessarily. There are already parameters
>>> such as MinHeapDeltaBytes to control and manipulate this.
>>> 
>>> It is also preferable to complete an expansion request quickly to make
>>> the additional space available to other threads in the main allocation
>>> path, rather than making them go to the expand path. Making expansions
>>> larger could force more threads to take the slower expand path, which
>>> doesn't seem like a win even if they then help with the pretouch part
>>> of another thread's expansion. (And that also assumes UsePreTouch is
>>> even enabled.)
>>> 
>>> So the followup change that you say is needed to make this one
>>> profitable seems questionable.
>>> 
>>> The proposed change is also surprisingly large and intrusive for
>>> something that seems like it should be very localized.
>>> 
>>> > Jtreg all test passed.
>>> 
>>> A change like this needs a lot more testing than that, both functionally
>>> and performance.
>> 
>> [https://bugs.openjdk.java.net/browse/JDK-8254699](JDK-8254699) contains test results in XL file to show PreTouchParallelChunkSize was recently changed from 1GB to 4MB on Linux after testing various sizes. I have downloaded the same XL file and same is updated for Oldgen case during resize and it gives some rough idea about the improvement for this fix and follow up fix. Please check "PretouchOldgenDuringResize" sheet for "Co-operative Fix" and "Adaptive Resize Fix" columns.
>> 
>> [PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)
>> 
>> Running SPECJbb composite shows 30-40% reduction in GC pause time when old-gen expands upto ~1GB for UseNUMA case (I tested for minimum oldgen size to trigger the resize). Non UseNUMA case will show improvement only when resize expands more than minimum "PreTouchParallelChunkSize" size to let other thread participate in pretouch work. Two cases 1 & 3 (without UseNUMA and with/without hugpages) above didnt shows any improvement because of "PreTouchParallelChunkSize" limit and that is the reason why I suggested another fix. 
>> 
>> The "Adaptive Resize Fix" column in the sheet is for next suggested fix and may possibly help to improve further. For server JVM, expansion size of 512KB,  2MB (hugepages) and 64MB looks good for first resize but later needs some attention I think. JVM flag "MinHeapDeltaBytes" needs to be known by the user and need to set it upfront. I think this can be consider for first resize in every GC and later dynamically go for higher size like double the previous size to adopt to application nature. This way it may help to reduce the GC pause time during the expansion. I thought to share my observation and my understanding could be wrong. So please check and suggest. 
>> 
>> Again, thanks for your feedback.
>
>> [https://bugs.openjdk.java.net/browse/JDK-8254699](JDK-8254699) contains test results in XL file to show PreTouchParallelChunkSize was recently changed from 1GB to 4MB on Linux after testing various sizes. I have downloaded the same XL file and same is updated for Oldgen case during resize and it gives some rough idea about the improvement for this fix and follow up fix. Please check "PretouchOldgenDuringResize" sheet for "Co-operative Fix" and "Adaptive Resize Fix" columns.
>> 
>> [PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)
> 
> I'm not sure how to interpret this. In particular, it says "Test done using November 27rd Openjdk build". That predates a number of recent changes in this area that seem relevant. Is that correct? Or is that stale and the data has been updated for a newer baseline.
> 
> Also, can you provide details about how this information is generated and collected?  I would like to reproduce the results and do some other experiments.
> 
> I think that, as proposed, the change is making already messy code still more messy, and some refactoring is needed. I don't know what the details of that might look like yet; I'm still exploring and poking at the code. I also need to look at the corresponding part of G1 that you mentioned.
> 
>> The "Adaptive Resize Fix" column in the sheet is for next suggested fix and may possibly help to improve further. For server JVM, expansion size of 512KB, 2MB (hugepages) and 64MB looks good for first resize but later needs some attention I think. JVM flag "MinHeapDeltaBytes" needs to be known by the user and need to set it upfront. I think this can be consider for first resize in every GC and later dynamically go for higher size like double the previous size to adopt to application nature. This way it may help to reduce the GC pause time during the expansion. I thought to share my observation and my understanding could be wrong. So please check and suggest.
> 
> I think something like this has a serious risk of growing the oldgen a lot more than needed, which may have serious downsides.

Thank you Kim for your reply and please see my inline comments.
> 
> 
> > [https://bugs.openjdk.java.net/browse/JDK-8254699](JDK-8254699) contains test results in XL file to show PreTouchParallelChunkSize was recently changed from 1GB to 4MB on Linux after testing various sizes. I have downloaded the same XL file and same is updated for Oldgen case during resize and it gives some rough idea about the improvement for this fix and follow up fix. Please check "PretouchOldgenDuringResize" sheet for "Co-operative Fix" and "Adaptive Resize Fix" columns.
> > [PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx](https://github.com/openjdk/jdk/files/6147180/PreTouchParallelChunkSize_TestResults_UpdatedForOldGenCase.xlsx)
> 
> I'm not sure how to interpret this. In particular, it says "Test done using November 27rd Openjdk build". That predates a number of recent changes in this area that seem relevant. Is that correct? Or is that stale and the data has been updated for a newer baseline.

Sorry for the confusion. I didnt test again and pre-touch time taken with different chunk size per thread was already recorded in the spreadsheet and thought to use it for reference to reply to David feedback "A change like this needs a lot more testing than that, both functionally and performance.".

> 
> Also, can you provide details about how this information is generated and collected? I would like to reproduce the results and do some other experiments.

JDK-8254699 was fixed by me and test was done using SPECJbb composite with following command along with JVM flag "PreTouchParallelChunkSize" with values 1G (earlier default on x86) and 4M (current default on x86). To generate this info,  "Ticks::now()" function was called to record the time taken for pretouch operation. The first table (please ignore resize related) in the sheet "SPECjbb_Summary"  was created from the sheet "SPECjbb_128C256T_GCPretouchTime" that contains pretouch log dumps for SPECjbb composite tests. 

Command (also mentioned in SPECjbb_128C256T_GCPretouchTime sheet):
-Xms24g -Xmx970g -Xmn960g -server -XX:+UseParallelGC -XX:+AlwaysPreTouch -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -XX:MaxTenuringThreshold=15 -XX:+ScavengeBeforeFullGC -XX:+UseAdaptiveSizePolicy -XX:ParallelGCThreads=256 -XX:-UseBiasedLocking -XX:SurvivorRatio=200 -XX:TargetSurvivorRatio=95 -XX:+UseNUMA -XX:-UseNUMAInterleaving -XX:+UseTransparentHugePages -XX:-UseAdaptiveNUMAChunkSizing

> 
> I think that, as proposed, the change is making already messy code still more messy, and some refactoring is needed. I don't know what the details of that might look like yet; I'm still exploring and poking at the code. I also need to look at the corresponding part of G1 that you mentioned.

First, I thought to push this change for ParallelGC and next G1GC. Earlier ParallelGC pretouch was done with single threaded and later this was fixed by moving multithreaded pretouch common code from G1GC and ParallelGC to PretouchTask class. I thought to do it similarly for this case too.

> 
> > The "Adaptive Resize Fix" column in the sheet is for next suggested fix and may possibly help to improve further. For server JVM, expansion size of 512KB, 2MB (hugepages) and 64MB looks good for first resize but later needs some attention I think. JVM flag "MinHeapDeltaBytes" needs to be known by the user and need to set it upfront. I think this can be consider for first resize in every GC and later dynamically go for higher size like double the previous size to adopt to application nature. This way it may help to reduce the GC pause time during the expansion. I thought to share my observation and my understanding could be wrong. So please check and suggest.
> 
> I think something like this has a serious risk of growing the oldgen a lot more than needed, which may have serious downsides.

I believe this depends on application runtime nature and it may expand as needed. If you still think that this change is not useful then please suggest. Will withdraw this PR.

Thanks,
Amit

-------------

PR: https://git.openjdk.java.net/jdk/pull/2976


More information about the hotspot-dev mailing list