RFD: 8252768: Fast, asynchronous heap dumps

Thomas Stüfe thomas.stuefe at gmail.com
Fri Sep 4 10:01:32 UTC 2020


On Fri, Sep 4, 2020 at 11:23 AM Volker Simonis <volker.simonis at gmail.com>
wrote:

> On Thu, Sep 3, 2020 at 8:26 PM Thomas Stüfe <thomas.stuefe at gmail.com>
> wrote:
> >
> > Hi Volker,
> >
> > On Thu, Sep 3, 2020 at 7:41 PM Volker Simonis <volker.simonis at gmail.com>
> wrote:
> >>
> >> Hi Thomas,
> >>
> >> Thanks for sharing your concerns. Please find my anserwers inline:
> >>
> >> On Thu, Sep 3, 2020 at 6:58 PM Thomas Stüfe <thomas.stuefe at gmail.com>
> wrote:
> >> >
> >> > Hi Volker,
> >> >
> >> > hah, that is a cool idea :)
> >> >
> >> > Lets see what could go wrong:
> >> >
> >> > So, child process is forked, but in the child only the forking thread
> survives, right? The forking thread then proceeds to dump. What happens
> when, during dumping, it needs a lock owned by one of the non-running
> threads? Could also be something non-obvious, like a lock drawn by one of
> the lower level subsystems, e.g. memory allocation (and then NMT), UL etc.
> >>
> >> I haven't completely analyzed all the possible code paths yet but the
> >> good thing is that the child process is at a safepoint and as you
> >> correctly mentioned it only has a single running thread. I think if a
> >> VM operation at a safepoint would require a lock owned by another
> >> thread, it would dead-lock anyway, even without forking, wouldn't it?
> >> Normally, VM_HeapDumper::doit() runs at a safepoint without blocking
> >> so I currently don't see why it shouldn't be able to run without
> >> blocking in a forked process?
> >>
> >
> > You are right, I overlooked that the parent VM is at a safepoint at fork
> time.
> >
> >>
> >> >
> >> > --
> >> >
> >> > In the child, I would be afraid of running any kind of cleanup code
> when exiting, since that may somehow modify state in the parent (e.g. via
> explicitly shared memory, or whatever third party native code may be up
> to). So I would use _exit(), not exit(), to avoid running any stray
> onexit()/atexit() handlers.
> >> >
> >>
> >> Thanks, that's a good point. I'll change that.
> >
> >
> >>
> >>
> >>
> >> > Of course, then you need to make sure the dump is flushed and the
> file handle is closed before exiting.
> >> >
> >> > --
> >> >
> >> > Depending on the overcommit settings fork() may fail with ENOMEM,
> regardless of copy-on-write.
> >> >
> >>
> >> Not sure this is related to overcommit settings. According to the
> >> man-page, fork() only fails with "ENOMEM" if fork() "failed to
> >> allocate the necessary kernel structures because memory is tight". But
> >> a failing fork is actually no problem at all. Currently, if fork()
> >> fails, I just fall back to normal, synchronous dumping. Of course this
> >> could be made configurable such that a failing asynchronous dump
> >> wouldn't result in a synchronous dump with its long safepoint timeout
> >> but instead completely skip the dump altogether.
> >>
> >> > --
> >> >
> >> > If the parent process is, at the time of the fork, touching a lot of
> pages, and the child takes its sweet time writing the dump, total memory
> usage will go up, right? Compared to the original, non-async variant.
> >> >
> >>
> >> Yes, that's right. The child is not writing more than a few kb, but
> >> the total memory consumption of the system will increase by the amount
> >> of common pages which are changed by the parent while the child is
> >> dumping.
> >>
> >
> > Plus costs for page tables and vma structures; even with copy-on-write
> this is not free. But I do not think that would be a reason not to do it.
>
> Yes, you're actually trading memory for speed :)
>
> >
> >>
> >> > --
> >> >
> >> > We will now have a second java process popping up, existing for some
> seconds, then vanishing. Outside tooling might be confused. OTOH the same
> happens when forking via Runtime.exec, but there this state only persists
> for some microseconds, until the first exec() call.
> >>
> >> That's also right. But the new process is at a safepoint and will exit
> >> right from there. So at least other tools won't be able to attach to
> >> the new process. It won't be even detectable by tools like jps.
> >>
> >> >
> >> > --
> >> >
> >> > UL in child: this log output now gets mixed in asynchronously with
> the parent's log? I would probably avoid logging in the child process.
> Also, as stated above, I am not sure if UL uses locks internally, which may
> hang.
> >>
> >> As I wrote in my mail, the UL in the child is currently only there for
> >> debugging purposes. It would be removed in a final version.
> >
> >
> > Sorry, I missed that. Note that UL may still be active though, we may
> inherit active logging from the parent and the code you call may contain UL
> logging points. To be safe you may just deactivate logging at the start of
> the child's life.
>
> I thought about that already but didn't know how to achieve that. I've
> found "LogConfiguration::disable_logging()" now. Do you know if it is
> safe to call that? I'll give it a try...
>

I wonder whether a more generic approach in addition to switching off UL
may be safer:

We want to prevent writing to any descriptor we inherited from the parent.
It is unlikely that this happens since we kind of control what code runs in
the child, but there may be accidental writes, UL is one example. Another
example may be if someone instrumented the VM, so we run with some code
"under us" without knowing.

We could just close all file descriptors at the beginning of the child's
life. Like we do in the jspawnhelper. Worst thing that happens is that
trips off coding in the child which really wanted to write to the file.
Either the code is able to handle this gracefully, or it will assert, but
then we'll know. In any case we have not disturbed the parent, we only risk
an error in the child process.

Lets look at UL: if we close the underlying file descriptor, I believe it
would not even notice. Looks like it just soldiers on,
e.g. LogFileOutput::write(), just takes the negative return of
jio_fprintf() and counts that. I think this is actually a bug, this should
be fixed.

..Thomas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20200904/8072e6ef/attachment.htm>


More information about the serviceability-dev mailing list