Reliability of JVM in face of "recoverable" Errors, e.g. out of code cache space

Fri Aug 2 19:11:15 UTC 2024

Hi hotspot-dev,

Please let me know if this is not an appropriate place to raise this kind of question;
I'm happy to move to a more appropriate list.

We run the JVM (22.0.1) with many different application contexts loaded into one JVM, like an application server.
This places a rather high demand on the code cache. In the past we've observed warning messages about the
code cache being full and the compiler being disabled, so we increased our ReservedCodeCacheSize to make some space
and moved on with life.
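
For anyone who wants to watch code cache headroom from inside the application, here is a minimal sketch using the standard java.lang.management API. The class name CodeCacheWatch is made up for illustration, and matching pools by the names "CodeCache" / "CodeHeap" is an assumption about how HotSpot labels them, not something I have checked against every JVM version.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

public class CodeCacheWatch {
    /** Returns one line per non-heap memory pool whose name looks like a code cache segment. */
    static List<String> codeCacheReport() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Assumption: HotSpot names these pools "CodeCache" or "CodeHeap '...'".
            boolean looksLikeCodeCache = name.contains("CodeCache") || name.contains("CodeHeap");
            if (pool.getType() != MemoryType.NON_HEAP || !looksLikeCodeCache) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            long max = usage.getMax(); // -1 means the maximum is undefined
            String maxText = max < 0 ? "undefined" : (max / 1024) + " KiB";
            lines.add(String.format("%s used=%d KiB max=%s", name, usage.getUsed() / 1024, maxText));
        }
        return lines;
    }

    public static void main(String[] args) {
        codeCacheReport().forEach(System.out::println);
    }
}
```

This only reports usage; it would not by itself prevent the failure described below.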

This week, we ran into a new type of failure that is much more serious than a warning, yet non-fatal to the JVM:

Caused by: java.lang.ExceptionInInitializerError:
Caused by: java.lang.NoClassDefFoundError: Could not initialize class java.time.temporal.WeekFields
Caused by: Exception java.lang.VirtualMachineError: Out of space in CodeCache for adapters
 at java.base/java.time.format.DateTimeFormatterBuilder$WeekBasedFieldPrinterParser.printerParser(DateTimeFormatterBuilder.java:5264)
 at java.base/java.time.format.DateTimeFormatterBuilder$WeekBasedFieldPrinterParser.format(DateTimeFormatterBuilder.java:5248)
 at java.base/java.time.format.DateTimeFormatterBuilder$CompositePrinterParser.format(DateTimeFormatterBuilder.java:2529)
 at java.base/java.time.format.DateTimeFormatter.formatTo(DateTimeFormatter.java:1905)
 at java.base/java.time.format.DateTimeFormatter.format(DateTimeFormatter.java:1879) 
 at java.base/java.time.LocalDate.format(LocalDate.java:1797)
...

Once this happens, the affected classes (in this case the java.time infrastructure) are effectively dead for the remainder of the JVM's lifetime.
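
To illustrate the "dead forever" behaviour without needing a full code cache, here is a small sketch that simulates it. The names DeadClassDemo and Fragile are made up, and the InternalError is thrown by hand as a stand-in for the real VirtualMachineError; the point is only that a class whose static initialization fails once is never retried.

```java
public class DeadClassDemo {
    /** Hypothetical class whose static initialization always fails. */
    static class Fragile {
        // Not a compile-time constant, so reading it triggers class initialization.
        static final String VALUE = init();

        private static String init() {
            // Stand-in for the real failure; InternalError is a VirtualMachineError subclass.
            throw new InternalError("simulated: out of space in CodeCache");
        }
    }

    /** Touches Fragile and returns whatever Throwable that produces. */
    static Throwable touch() {
        try {
            return new IllegalStateException("unexpectedly initialized: " + Fragile.VALUE);
        } catch (Throwable t) {
            return t;
        }
    }

    public static void main(String[] args) {
        // The first access runs the initializer and fails with the simulated error.
        System.out.println("first:  " + touch());
        // Later accesses do not re-run the initializer; the class stays unusable.
        System.out.println("second: " + touch());
        System.out.println("third:  " + touch());
    }
}
```

Every access after the first fails with a NoClassDefFoundError, no matter how much code cache space has been freed in the meantime.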

As part of our JVM reliability configuration, we attempt to set
-XX:OnError=/bin/gather-debuginfo-then-kill -9 %p
to ensure that unexpected errors terminate the JVM, rather than leave it in an uncertain state.
However, this particular VirtualMachineError does not seem to trigger the OnError logic. Reading the docs, it seems
that OnError only fires for 'irrecoverable' errors, which I guess this does not qualify as, since it surfaces as a
userland exception rather than a HotSpot crash dump.

However, trying to imagine how we would recover from such a situation, it's not clear at all what to do.
At this point some arbitrary subset of classes is no longer usable, forever. Even logging a date could fail.
Arguably, user code shouldn't have to think about VirtualMachineError as a possibility at all, as what can
you even trust to work afterward?

The exception could be thrown in an arbitrary thread - maybe it's one we control, but maybe it's thrown in a background
thread like a Jetty server or Redis client I/O thread. Where it is thrown is not predictable either, making it very hard to
add a "catch" clause and terminate the JVM, since nearly any statement could fail.
Most such threads are careful to have a top-level catch-and-log, so the error never reaches the uncaught exception
handler, which means that handler does not seem reliable either.
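
A small sketch of why the uncaught exception handler is not enough: the same simulated error reaches the handler when the thread body lets it escape, but not when the body has the usual top-level catch. All names here (SwallowedErrorDemo, failingWork, handlerCallsFor) are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SwallowedErrorDemo {
    /** Stand-in for any statement that happens to hit the VM error. */
    static void failingWork() {
        throw new InternalError("simulated VM error");
    }

    /** Runs body on a new thread and counts how often its uncaught exception handler fires. */
    static int handlerCallsFor(Runnable body) {
        AtomicInteger calls = new AtomicInteger();
        Thread thread = new Thread(body);
        thread.setUncaughtExceptionHandler((t, error) -> calls.incrementAndGet());
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return calls.get();
    }

    public static void main(String[] args) {
        // The error escapes the thread body, so the handler sees it.
        System.out.println("escaping:  " + handlerCallsFor(SwallowedErrorDemo::failingWork));

        // Typical library thread: catch everything at the top level and "log" it.
        Runnable carefulBody = () -> {
            try {
                failingWork();
            } catch (Throwable t) {
                // a real library would log here and carry on
            }
        };
        System.out.println("swallowed: " + handlerCallsFor(carefulBody));
    }
}
```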

Ideally, I would turn on some (hypothetical) VM option like '-XX:VMErrorIsAlwaysFatal' to trigger an hs_err dump, rather
than ever seeing this sort of failure in userland.

How can a user application recover from such an error happening? (I think it cannot.) If we cannot recover, how can we reliably
configure the JVM to crash completely if such an error happens? I suppose a debugger-like tool could breakpoint
on throwing VirtualMachineError, or maybe an agent could transform the VirtualMachineError constructor, but this doesn't
feel "production-ready".
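
For the places where we do control the top-level catch, the best userland workaround I can think of is roughly the sketch below. FatalErrorGuard, the depth limit of 32, and the exit status 70 are all arbitrary choices of mine, not an established convention, and this does nothing for threads whose catch blocks we do not own.

```java
public class FatalErrorGuard {
    private static final int MAX_DEPTH = 32; // arbitrary bound, guards against cyclic cause chains

    /** Returns true if t or any of its causes is a VirtualMachineError. */
    static boolean isVmError(Throwable t) {
        Throwable current = t;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            if (current instanceof VirtualMachineError) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    /** Halts the JVM immediately, without running shutdown hooks, if t looks like a VM error. */
    static void haltIfVmError(Throwable t) {
        if (isVmError(t)) {
            Runtime.getRuntime().halt(70);
        }
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException("request failed", new InternalError("simulated VM error"));
        System.out.println(isVmError(wrapped));
        System.out.println(isVmError(new RuntimeException("ordinary failure")));
    }
}
```

Note that the follow-on NoClassDefFoundErrors are a different type, so a real check might need to consider them too; I have left that policy question out.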

Thank you for any advice!
Steven Schlansker