RFR: JDK-8306441: Two phase segmented heap dump [v17]
Yi Yang
yyang at openjdk.org
Mon Jul 17 02:16:19 UTC 2023
On Wed, 12 Jul 2023 07:58:47 GMT, Yi Yang <yyang at openjdk.org> wrote:
>> ### Motivation and proposal
>> Hi, heap dumps pause the application's execution (stop-the-world, STW), which is a well-known pain point. JDK-8252842 added parallel support to heap dumping in an attempt to alleviate this issue. However, all concurrent threads compete to write heap data to the same file, and extra memory is required to maintain the concurrent buffer queue. In our experiments, we did not observe a significant performance improvement from that.
>>
>> The minor-pause solution, which is presented in this PR, is a two-phase segmented heap dump:
>>
>> - Phase 1 (STW): Concurrent threads directly write data to multiple heap files.
>> - Phase 2 (Non-STW): Merge the multiple heap files into one complete heap dump file. This process can happen outside the safepoint.
>>
>> Now concurrent worker threads no longer need to maintain a buffer queue, which would incur extra memory overhead, nor do they need to compete for locks. The changes to the overall design are as follows:
>>
>> ![image](https://github.com/openjdk/jdk/assets/5010047/77e4764a-62b5-4336-8b45-fc880ba14c4a)
>> <p align="center">Fig1. Before</p>
>>
>> ![image](https://github.com/openjdk/jdk/assets/5010047/931ab874-64d1-4337-ae32-3066eed809fc)
>> <p align="center">Fig2. After this patch</p>
>>
>> ### Performance evaluation
>> | Memory | Threads | Compression mode | STW (s) | Total (s) |
>> | ------ | ------- | ---------------- | ------- | --------- |
>> | 8g | 1 T | N | 15.612 | 15.612 |
>> | 8g | 32 T | N | 2.561725 | 14.498 |
>> | 8g | 32 T | C1 | 2.3084878 | 14.198 |
>> | 8g | 32 T | C2 | 10.9355128 | 21.882 |
>> | 8g | 96 T | N | 2.6790452 | 14.012 |
>> | 8g | 96 T | C1 | 2.3044796 | 3.589 |
>> | 8g | 96 T | C2 | 9.7585151 | 20.219 |
>> | 16g | 1 T | N | 26.278 | 26.278 |
>> | 16g | 32 T | N | 5.231374 | 26.417 |
>> | 16g | 32 T | C1 | 5.6946983 | 6.538 |
>> | 16g | 32 T | C2 | 21.8211105 | 41.133 |
>> | 16g | 96 T | N | 6.2445556 | 27.141 |
>> | 16g | 96 T | C1 | 4.6007096 | 6.259 |
>> | 16g | 96 T | C2 | 19.2965783 | 39.007 |
>> | 32g | 1 T | N | 48.149 | 48.149 |
>> | 32g | 32 T | N | 10.7734677 | 61.643 |
>> | 32g | 32 T | C1 | 10.1642097 | 10.903 |
>> | 32g | 32 T | C2 | 43.8407607 | 88.152 |
>> | 32g | 96 T | N | 13.1522042 | 61.432 |
>> | 32g | 96 T | C1 | 9.0954641 | 9.885 |
>> | 32g | 96 T | C2 | 38.9900931 | 80.574 |
>> | 64g | 1 T | N | 100.583 | 100.583 |
>> | 64g | 32 T | N | 20.9233744 | 134.701 |
>> | 64g | 32 T | C1 | 18.5023784 | 19.358 |
>> | 64g | 32 T | C2 | 86.4748377 | 172.707 |
>> | 64g | 96 T | N | 26.7374116 | 126.08 |
>> | 64g | ...
>
> Yi Yang has updated the pull request incrementally with one additional commit since the last revision:
>
> fix test compilation failure
Thanks for the reviews!
---------
@kevinjwalls
> if we specify a large number for -parallel=, is it limited to the number of safepoint_workers?
No, it limits how many active workers are used.
> Should we feedback to the user how many threads were really used?
Yes, I think it makes sense and I will output it in the upcoming changes.
> Should the user really have to always say how many threads? Do they just want to say "use all the worker threads"?
The main purpose of adding the parallel option is that the number of parallel threads can have a noticeable impact on the dump time, and using all worker threads does not always produce the best results. Please see the table above, where 96 threads perform worse than 32 threads in many test groups. Additionally, parallel also controls whether a parallel or serial dump is used.
It is also worth noting that even if the user specifies parallel, the VM will try to use that number of threads to perform the parallel dump, but this is not guaranteed: the VM caps it at the maximum number of active workers, chosen based on the size and utilization of the heap.
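The clamping described above can be sketched as follows. This is a minimal illustration with hypothetical names; the real logic lives in HotSpot's C++ heap-dump code and also factors in heap size and utilization:

```java
public class DumpThreadClamp {
    // requested: the value the user passed via -parallel=
    // activeWorkers: the number of workers the VM decided to activate
    static int clampParallel(int requested, int activeWorkers) {
        // Never fewer than 1 thread; never more than the active workers.
        return Math.max(1, Math.min(requested, activeWorkers));
    }

    public static void main(String[] args) {
        System.out.println(clampParallel(96, 32)); // a large request is capped
        System.out.println(clampParallel(0, 32));  // floor of one thread
    }
}
```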
> Does this work with Serial GC? Maybe it doesn't maybe that's unfortunate (Serial may be appropriate for some large heaps, and you still want a faster heap dump?)
Unfortunately, it does not currently work with Serial GC. Because Serial GC has no worker-gang concept, the VM always uses a serial dump for it. We have hardly seen any cases of Serial GC in production environments, so I think support for Serial GC can be added in a follow-up PR.
-------
@dholmes-ora
> I'm unclear why we need an explicit AttachListenerThread type to support this.
Because I need the overridden check is_AttachListener_thread(), which lets us avoid using the VM thread to execute the dump-file merge as much as possible.
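For illustration, the non-STW merge phase can be thought of as concatenating the per-thread segment files into the final dump file. This is only a sketch under that assumption, with hypothetical names; the actual HotSpot merge is implemented in C++ and deals with HPROF segment details:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class SegmentMerge {
    // Append each segment file to the destination dump, then delete it.
    static void merge(Path dest, List<Path> segments) throws IOException {
        try (var out = Files.newOutputStream(dest, StandardOpenOption.CREATE,
                                             StandardOpenOption.APPEND)) {
            for (Path seg : segments) {
                Files.copy(seg, out);   // stream segment bytes into the dump
                Files.delete(seg);      // segment no longer needed
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("dump");
        Path a = Files.writeString(dir.resolve("seg1"), "AAA");
        Path b = Files.writeString(dir.resolve("seg2"), "BBB");
        Path dest = dir.resolve("heap.hprof");
        merge(dest, List.of(a, b));
        System.out.println(Files.readString(dest));
    }
}
```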
-------
@tstuefe
> I don't understand the performance numbers. I assume they correlate with the Y axis of the diagrams? Are these seconds? Of what, the new solution? Or are these percentages? You summarize with "reduces 71~83% application pause time". How do you reckon that? You also talk about parallel and serial dump. Does that mean parallel = your patch, serial = stock?
Yes, these are all in seconds. 1 T means a serial dump, which writes to the file with only one thread; its result is used as the baseline. The other rows, such as 32 T, are results obtained with my patch. The reduction percentage is computed by comparing the baseline STW time against the new STW time.
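As a worked example of that calculation, using two rows from the table above and assuming reduction = 1 - newSTW / baselineSTW:

```java
import java.util.Locale;

public class StwReduction {
    // Percentage reduction of STW time relative to the serial (1 T) baseline.
    static double reductionPercent(double baselineStw, double newStw) {
        return (1.0 - newStw / baselineStw) * 100.0;
    }

    public static void main(String[] args) {
        // 8g heap: serial baseline 15.612 s vs 32-thread STW 2.561725 s
        System.out.printf(Locale.ROOT, "%.1f%%%n",
                          reductionPercent(15.612, 2.561725));
        // 64g heap: serial baseline 100.583 s vs 32-thread STW 20.9233744 s
        System.out.printf(Locale.ROOT, "%.1f%%%n",
                          reductionPercent(100.583, 20.9233744));
    }
}
```

These rows come out around 79-84%, in line with the "reduces 71~83% application pause time" summary quoted above.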
-------------
PR Comment: https://git.openjdk.org/jdk/pull/13667#issuecomment-1637287623
More information about the serviceability-dev
mailing list