RFD: 8252768: Fast, asynchronous heap dumps

Thu Sep 3 17:41:34 UTC 2020

Hi Thomas,

Thanks for sharing your concerns. Please find my answers inline:

On Thu, Sep 3, 2020 at 6:58 PM Thomas Stüfe <thomas.stuefe at gmail.com> wrote:
>
> Hi Volker,
>
> hah, that is a cool idea :)
>
> Let's see what could go wrong:
>
> So, child process is forked, but in the child only the forking thread survives, right? The forking thread then proceeds to dump. What happens when, during dumping, it needs a lock owned by one of the non-running threads? Could also be something non-obvious, like a lock drawn by one of the lower level subsystems, e.g. memory allocation (and then NMT), UL etc.

I haven't completely analyzed all the possible code paths yet, but the
good thing is that the child process is at a safepoint and, as you
correctly mentioned, it only has a single running thread. I think if a
VM operation at a safepoint required a lock owned by another thread,
it would deadlock anyway, even without forking, wouldn't it?
Normally, VM_HeapDumper::doit() runs at a safepoint without blocking,
so I currently don't see why it shouldn't be able to run without
blocking in a forked process.

>
> --
>
> In the child, I would be afraid of running any kind of cleanup code when exiting, since that may somehow modify state in the parent (e.g. via explicitly shared memory, or whatever third party native code may be up to). So I would use _exit(), not exit(), to avoid running any stray onexit()/atexit() handlers.
>

Thanks, that's a good point. I'll change that.

> Of course, then you need to make sure the dump is flushed and the file handle is closed before exiting.
>
> --
>
> Depending on the overcommit settings fork() may fail with ENOMEM, regardless of copy-on-write.
>

I'm not sure this is related to the overcommit settings. According to
the man page, fork() only fails with ENOMEM if it "failed to allocate
the necessary kernel structures because memory is tight". But a
failing fork() is actually no problem at all. Currently, if fork()
fails, I just fall back to normal, synchronous dumping. Of course this
could be made configurable, so that a failed asynchronous dump skips
the dump altogether instead of falling back to a synchronous dump with
its long safepoint pause.
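
The fork-with-fallback pattern discussed here can be sketched in plain
C as follows. This is only an illustration under stated assumptions,
not code from the webrev: write_dump() and dump_async_or_sync() are
hypothetical names, and the "dump" is a placeholder string. The child
uses only async-signal-safe calls, makes sure the data is flushed and
the file is closed, and then leaves via _exit() so that no
atexit()/on_exit() handlers run.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the real dump writer: writes a placeholder,
 * flushes it to disk and closes the file. Returns 0 on success, -1 on error. */
static int write_dump(const char *path) {
    static const char payload[] = "HEAP DUMP PLACEHOLDER\n";
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return -1;
    }
    ssize_t n = write(fd, payload, sizeof(payload) - 1);
    int ok = (n == (ssize_t)(sizeof(payload) - 1)) && fsync(fd) == 0;
    if (close(fd) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}

/* Tries to dump asynchronously in a forked child.
 * Returns the child's pid (> 0) if the dump was handed off to a child,
 * 0 if fork() failed and the synchronous fallback succeeded,
 * -1 if fork() failed and the synchronous fallback failed as well. */
static pid_t dump_async_or_sync(const char *path) {
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: dump, then exit without running any exit handlers. */
        int rc = write_dump(path);
        _exit(rc == 0 ? 0 : 1);
    }
    if (pid > 0) {
        /* Parent: the child does the work, the caller can continue. */
        return pid;
    }
    /* fork() failed (e.g. ENOMEM or EAGAIN): dump synchronously instead. */
    return write_dump(path) == 0 ? 0 : -1;
}
```

A caller would invoke dump_async_or_sync("/some/path.hprof") and, for a
positive return value, eventually reap the child with waitpid() to
avoid leaving a zombie process behind.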

> --
>
> If the parent process is, at the time of the fork, touching a lot of pages, and the child takes its sweet time writing the dump, total memory usage will go up, right? Compared to the original, non-async variant.
>

Yes, that's right. The child itself doesn't modify more than a few kB
of memory, but the total memory consumption of the system will
increase by the number of shared pages which the parent modifies while
the child is dumping.
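
The copy-on-write behaviour behind this can be demonstrated with a
small, self-contained C sketch (an illustration only, unrelated to the
HotSpot code; the function name is made up). The parent modifies a
page after fork(), and the child, which reads the page only after that
modification, still sees the value from the time of the fork. The
kernel has to keep a private copy of that page for as long as the
child lives, which is exactly the extra memory mentioned above.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks, lets the parent overwrite buf[0] with 'B', and only then lets the
 * child read buf[0]. Returns the byte the child observed, or -1 on error. */
static int child_view_after_parent_write(void) {
    static char buf[4096];
    int go[2];    /* parent -> child: "the page has been modified" */
    int back[2];  /* child -> parent: the byte the child observed  */

    memset(buf, 'A', sizeof(buf));
    if (pipe(go) != 0) {
        return -1;
    }
    if (pipe(back) != 0) {
        close(go[0]);
        close(go[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(go[0]);
        close(go[1]);
        close(back[0]);
        close(back[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: wait until the parent has written, then report our view. */
        char token;
        close(go[1]);
        close(back[0]);
        if (read(go[0], &token, 1) != 1) {
            _exit(1);
        }
        if (write(back[1], &buf[0], 1) != 1) {
            _exit(1);
        }
        _exit(0);
    }

    /* Parent: dirty the shared page; the child keeps the original copy. */
    close(go[0]);
    close(back[1]);
    buf[0] = 'B';

    char seen = 0;
    int ok = write(go[1], "x", 1) == 1 && read(back[0], &seen, 1) == 1;
    close(go[1]);
    close(back[0]);

    int status = 0;
    waitpid(pid, &status, 0);
    return ok ? (unsigned char)seen : -1;
}
```

Although the parent has already stored 'B' by the time the child reads
the buffer, the function returns 'A', the content at the time of the
fork.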

> --
>
> We will now have a second java process popping up, existing for some seconds, then vanishing. Outside tooling might be confused. OTOH the same happens when forking via Runtime.exec, but there this state only persists for some microseconds, until the first exec() call.

That's also right. But the new process is at a safepoint and will exit
right from there. So at least other tools won't be able to attach to
the new process. It won't even be detectable by tools like jps.

>
> --
>
> UL in child: this log output now gets mixed in asynchronously with the parent's log? I would probably avoid logging in the child process. Also, as stated above, I am not sure if UL uses locks internally, which may hang.

As I wrote in my mail, the UL in the child is currently only there for
debugging purposes. It would be removed in a final version.

>
> --
>
> Just some quick first remarks. I find this idea cool, but I am not yet sure it is practical.

Thanks for your thoughts. I'm not sure either, that's why I wanted to
discuss the approach. I was actually surprised how easy it was to
implement and how smoothly it has worked so far :)

Regards,
Volker

>
> Cheers, Thomas
>
> On Thu, Sep 3, 2020 at 6:03 PM Volker Simonis <volker.simonis at gmail.com> wrote:
>>
>> Hi,
>>
>> I'd like to get your opinion on a POC I've done in order to speed up
>> heap dumps on Linux:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8252768
>> http://cr.openjdk.java.net/~simonis/webrevs/2020/8252768/
>>
>> Currently, heap dumps can be taken by the SA tools from a frozen
>> process or core file or directly from a running process with jcmd,
>> jconsole & JMX, jmap, etc. If the heap of a running process is dumped,
>> this happens at a safepoint (see VM_HeapDumper). Because the time to
>> produce a heap dump is roughly proportional to the size and fill ratio
>> of the heap, this leads to safepoint times which can range from ~100ms
>> for a 100mb heap to ~1s for a 1gb heap up to 15s and more for an 8gb
>> heap (measured on my Core i7 laptop with SSD).
>>
>> One possibility to decrease the safepoint time is to offload the
>> dumping work to an asynchronous process. On Linux (and probably any
>> other OS which supports fork()) this can be achieved by forking and
>> offloading the heap dumping to the child process. Forking still needs
>> to happen at a safepoint, but forking is considerably faster compared
>> to the dumping process itself. The fork performance is still
>> proportional to the size of the original Java process because although
>> fork won't copy any memory pages, the kernel still needs to duplicate
>> the page table entries of the process.
>>
>> Linux uses a "copy-on-write" technique for the creation of a forked
>> child process. This means that right after creation, the child process
>> will have exactly the same memory image as its parent process. But
>> at the same time, the child process won't use any additional physical
>> memory, as long as it doesn't change (i.e. write into) its memory.
>> Since heap dumping only reads the child process's memory and then
>> exits immediately, this technique can be applied even if the Java
>> process already uses almost all of the available physical memory.
>>
>> The POC I've created (see
>> http://cr.openjdk.java.net/~simonis/webrevs/2020/8252768/) decreases
>> the aforementioned safepoint times of ~100ms, ~1s and 15s for a 100mb,
>> 1gb and 8gb heap to ~3ms, ~15ms and ~60ms on my laptop, which I think
>> is significant.
>> You can try it out by using the new "-async" or "-async=true" option
>> of the "GC.heap_dump" jcmd command.
>>
>> Of course this change will require a CSR for the additional jcmd
>> GC.heap_dump "-async" option which I'll be happy to create if there's
>> any interest in this enhancement. Also, logging in the child process
>> might potentially interfere with logging in the parent VM and probably
>> will have to be removed in the final version, but I've left it in for
>> now to better illustrate what's happening. Finally, we can't output
>> the size of the created dump any more if we are using asynchronous
>> dumping but from my point of view that's not such a big problem. Apart
>> from that, the POC works surprisingly well :)
>>
>> Please let me know what you think and whether there's something I've overlooked.
>>
>> Best regards,
>> Volker
>>
>> PS: by the way, asynchronous dumping combines just fine with
>> compressed dumps. So you can easily use "GC.heap_dump -async=true
>> -gz=6"