RFR: 8306580: Propagate CDS dumping errors instead of directly exiting the VM
Ioi Lam
iklam at openjdk.org
Thu May 23 21:48:01 UTC 2024
On Thu, 23 May 2024 15:52:28 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:
> Hi Matias,
>
> I wondered why we would need this, but the JVM crashing because we dump via jcmd is a compelling argument :)
>
> However, I am not sure how many of these things would work. E.g. when encountering an IO error on open, do we now continue with invalid FILE now? Same for memory allocation.
>
> I think there must be some way to jump out of dumping.
There are two possible ways of "jumping out":
- Returning a status to indicate failure
- Using the TRAPS/CHECK macro
The first option is just too tedious and prone to mistakes. The second option is also tedious (you need to add TRAPS to every function that can fail), and it doesn't work inside the `VM_PopulateDumpSharedSpace` safepoint, where a lot of the CDS operations happen.
Anyway, I don't think we have a quick fix for this problem, so I would suggest a more substantial rewrite (that hopefully will not involve a lot of line changes).
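To illustrate why the first option is tedious, here is a minimal standalone sketch of status propagation. This is not HotSpot code: `DumpStatus`, `open_archive`, `write_header` and `dump_archive` are hypothetical names made up for the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical status type for illustration; not a HotSpot class.
struct DumpStatus {
  bool ok;
  std::string message;
};

// Hypothetical leaf step that can fail.
inline DumpStatus open_archive(bool simulate_io_error) {
  if (simulate_io_error) {
    return {false, "cannot open archive file"};
  }
  return {true, ""};
}

// Hypothetical intermediate step: it has to check and forward the status.
inline DumpStatus write_header(bool simulate_io_error) {
  DumpStatus s = open_archive(simulate_io_error);
  if (!s.ok) {
    return s;  // forgetting this check means continuing with an invalid file
  }
  return {true, ""};
}

// Hypothetical top-level entry point: the same check is repeated again.
inline DumpStatus dump_archive(bool simulate_io_error) {
  DumpStatus s = write_header(simulate_io_error);
  if (!s.ok) {
    return s;
  }
  return {true, ""};
}
```

Every function between the failure and the caller needs the same check, and a single missed check silently continues with bad state.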
1. Move the file I/O operations outside of the safepoint. They don't need to be there.
2. The safepoint is needed for gathering and copying the metadata -- we need the safepoint to stop the metadata from mutating. There are actually very few reasons why the copying operations can fail -- namely, the only failure is when we cannot allocate the buffer memory to store the copy of the metadata.
So my suggestion would be like this:
start safepoint;
    (1) Gather all metadata that are eligible for copying
        allocate the buffers needed to copy these metadata
        if allocation fails -> report failure and exit safepoint
    (2) Otherwise copy the metadata to the allocated buffers
        (we need buffers for the metadata objects, as well as some bitmaps)
    (3) Similarly, when copying the heap objects, allocate the entire buffer
        before making the copies. Stop and report failure if allocation fails.
end safepoint
if buffer allocation has failed
    report error by throwing an exception
open the output file and write the buffer into the file
on any IO failure, stop writing, delete the file, and report error by throwing an exception
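The outline above could be sketched roughly as follows. This is a standalone illustration, not HotSpot code: all function names are hypothetical, `std::vector` stands in for the dump buffers, and C++ exceptions stand in for the Java exception that the real code would throw after leaving the safepoint.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <new>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the work done inside the safepoint (hypothetical name).
// The whole buffer is allocated before any copying starts; on allocation
// failure we return false instead of exiting.
inline bool copy_metadata_in_safepoint(const std::vector<char>& metadata,
                                       std::vector<char>& buffer) {
  try {
    buffer.resize(metadata.size());
  } catch (const std::bad_alloc&) {
    return false;
  }
  std::copy(metadata.begin(), metadata.end(), buffer.begin());
  return true;
}

// Runs after the safepoint (hypothetical name). On any I/O failure it stops
// writing, deletes the partial file, and reports the error by throwing.
inline void write_archive(const std::string& path,
                          const std::vector<char>& buffer) {
  std::FILE* f = std::fopen(path.c_str(), "wb");
  if (f == nullptr) {
    throw std::runtime_error("cannot open " + path);
  }
  std::size_t written = std::fwrite(buffer.data(), 1, buffer.size(), f);
  bool ok = (written == buffer.size());
  if (std::fclose(f) != 0) {
    ok = false;
  }
  if (!ok) {
    std::remove(path.c_str());
    throw std::runtime_error("I/O error while writing " + path);
  }
}

// Overall flow: safepoint work first, error reporting and file I/O afterwards.
inline void dump(const std::vector<char>& metadata, const std::string& path) {
  std::vector<char> buffer;
  bool allocated = copy_metadata_in_safepoint(metadata, buffer);
  // The "safepoint" has ended here, so it is safe to report the error.
  if (!allocated) {
    throw std::runtime_error("buffer allocation failed");
  }
  write_archive(path, buffer);
}
```

The point of the split is that nothing inside the safepoint part needs to report an error other than through its return value.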
Note: buffer allocation is the biggest source of dynamic memory allocation. We might need tens of MBs or more. Therefore, we should handle the failure gracefully.
However, we have smaller allocations throughout the dumping process (adding values to various hash tables, etc) that can also fail. In HotSpot, we usually do not explicitly handle such failures. We just exit the VM (see `AllocFailStrategy::EXIT_OOM` in allocation.hpp). The reason is that when the VM process cannot allocate even a very small amount of memory, the program is probably running on the edge of memory exhaustion and will soon fail anyway (inside native libraries, etc). Therefore, there's not much benefit in spending a great deal of effort trying to recover from such failures.
I think the CDS code should also take the `AllocFailStrategy::EXIT_OOM` strategy for small allocations.
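As a rough standalone illustration of the two policies (exit on failure for small allocations, report failure for the big buffers), assuming a hypothetical `AllocFailMode` enum and `allocate_or_fail` helper that are not the HotSpot definitions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for an allocation failure policy; not HotSpot's enum.
enum class AllocFailMode { ExitOnOom, ReturnNull };

// Hypothetical helper: small allocations would use ExitOnOom, while the large
// dump buffers would use ReturnNull so the caller can report the failure.
inline void* allocate_or_fail(std::size_t size, AllocFailMode mode) {
  void* p = std::malloc(size);
  if (p == nullptr && mode == AllocFailMode::ExitOnOom) {
    std::fprintf(stderr, "out of memory allocating %zu bytes\n", size);
    std::exit(1);
  }
  return p;  // may be nullptr only in ReturnNull mode
}
```

Callers that pass `ExitOnOom` never need a null check, which is what keeps the small-allocation call sites simple.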
--------
In terms of implementation, I think we can first move the CDS file writing outside of the safepoint (we already did that in the Leyden repo).
A second step would be to change the buffer allocation strategy. Today, we know exactly (*) how much memory is required, but we only reserve a memory region of the required size, and incrementally commit the memory as we write into the buffer. This means we have a chance of failure every time we copy an object into the buffer. This incremental commit is not necessary. We should just commit all the required memory at the very beginning.
(*) see [`ArchiveBuilder::gather_klasses_and_symbols()`](https://github.com/openjdk/jdk/blob/0a9d1f8c89e946d99f01549515f6044e53992168/src/hotspot/share/cds/archiveBuilder.cpp#L281-L283) -- there are still a few things for which we don't have a precise size estimate yet.
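To make the difference between incremental commit and up-front commit concrete, here is a small standalone model. It is only a sketch: `ReservedRegion` is a hypothetical class, and the "commit budget" simulates the OS refusing to commit more memory.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a reserved region; not the HotSpot implementation.
class ReservedRegion {
 public:
  ReservedRegion(std::size_t reserved, std::size_t commit_budget)
    : _reserved(reserved), _budget(commit_budget), _committed(0), _top(0) {}

  // Commit 'bytes' more memory. Fails if this would exceed the reservation
  // or the simulated budget.
  bool commit(std::size_t bytes) {
    if (_committed + bytes > _reserved || _committed + bytes > _budget) {
      return false;
    }
    _committed += bytes;
    return true;
  }

  // Incremental style: commit on demand, so every single allocation can fail.
  bool allocate_incremental(std::size_t bytes) {
    if (_top + bytes > _committed && !commit(_top + bytes - _committed)) {
      return false;
    }
    _top += bytes;
    return true;
  }

  // Up-front style: the caller committed everything already, so allocating
  // within the committed size has no failure path.
  void allocate_precommitted(std::size_t bytes) {
    assert(_top + bytes <= _committed);
    _top += bytes;
  }

  std::size_t used() const { return _top; }

 private:
  std::size_t _reserved;
  std::size_t _budget;
  std::size_t _committed;
  std::size_t _top;
};
```

With the incremental style the failure shows up in the middle of copying; with a single up-front `commit()` it shows up once, before any object has been copied.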
-------------
PR Comment: https://git.openjdk.org/jdk/pull/19370#issuecomment-2128070017
More information about the hotspot-runtime-dev
mailing list