RFR: JDK-8306441: Segmented heap dump [v6]
Kevin Walls
kevinw at openjdk.org
Thu May 18 09:38:52 UTC 2023
On Wed, 17 May 2023 09:23:17 GMT, Yi Yang <yyang at openjdk.org> wrote:
>> Hi, heap dumps cause application pauses (STW), which is a well-known pain point. JDK-8252842 added parallel support to heap dump in an attempt to alleviate this issue. However, all concurrent threads competitively write heap data to the same file, and extra memory is required to maintain the concurrent buffer queue. In experiments, we did not see a significant performance improvement from that approach.
>>
>> The low-pause solution presented in this PR is a two-stage segmented heap dump:
>>
>> 1. Stage One (STW): concurrent worker threads write data directly to multiple heap files.
>> 2. Stage Two (non-STW): the multiple heap files are merged into one complete heap dump file.
>>
>> Now the concurrent worker threads neither maintain a buffer queue (which would cost extra memory) nor compete for locks. This reduces application pause time by 73-80%.
>>
>> | Memory | Threads | STW | Total |
>> | --- | --- | --- | --- |
>> | 8g | 1 | 15.612 secs | 15.612 secs |
>> | 8g | 32 | 2.5617250 secs | 14.498 secs |
>> | 8g | 96 | 2.6790452 secs | 14.012 secs |
>> | 16g | 1 | 26.278 secs | 26.278 secs |
>> | 16g | 32 | 5.2313740 secs | 26.417 secs |
>> | 16g | 96 | 6.2445556 secs | 27.141 secs |
>> | 32g | 1 | 48.149 secs | 48.149 secs |
>> | 32g | 32 | 10.7734677 secs | 61.643 secs |
>> | 32g | 96 | 13.1522042 secs | 61.432 secs |
>> | 64g | 1 | 100.583 secs | 100.583 secs |
>> | 64g | 32 | 20.9233744 secs | 134.701 secs |
>> | 64g | 96 | 26.7374116 secs | 126.080 secs |
>> | 128g | 1 | 233.843 secs | 233.843 secs |
>> | 128g | 32 | 72.9945768 secs | 207.060 secs |
>> | 128g | 96 | 67.6815929 secs | 336.345 secs |
>>
>>> **Total** is the time for the whole heap dump, covering both phases.
>>> **STW** is the first (pause) phase only.
>>> For the parallel dump, **Total** = **STW** + **Merge**; for the serial dump, **Total** = **STW**.
>>
>> 
>>
>> In actual testing, the two-stage solution can increase the overall heap dump time (see the table above). However, considering the reduction in STW time, I think this is an acceptable trade-off. Furthermore, there is still room for optimization in the second (merge) stage, e.g. using sendfile/splice/copy_file_range instead of a read+write combination. Since number of...
>
> Yi Yang has updated the pull request incrementally with one additional commit since the last revision:
>
> rename back to heapdumpCompression
Just a few notes for now:
I was misled by the title, as we already have the concept of segmented heap dumps (HPROF_HEAP_DUMP_SEGMENT). This is an enhanced parallel heap dump; could we use that (or something similar) for the bug and PR title?
I'm interested in whether the STW phase of the parallel write is faster with compression or without. (Does compressing cost time, or does compressing mean we write fewer bytes, so it ends up being faster?)
The idea of forking a child to do the write is appealing, although it is a non-Windows solution. Isn't the point of a fork that the parent can keep running without any pause at all? The child gets a copy of all memory; since the fork happens at a safepoint, the Java heap is in a consistent state, and the child can take its time writing the dump, then exit without affecting the parent. The child probably must not make any Java calls or interact with other existing JVM threads.
(fork, not vfork. The child makes no writes to the Java heap, so copy-on-write should not cause the entire Java heap to be duplicated in memory. That overhead might still be large, though, particularly if the application is actively making changes...)
-------------
PR Comment: https://git.openjdk.org/jdk/pull/13667#issuecomment-1552794521