Bits.reserveMemory OutOfMemoryError does not seem to trigger -XX:OnOutOfMemoryError

Tue Dec 17 23:52:22 UTC 2024

Hi hotspot-dev,

In our continuing mission to explore strange new VM memory limits
(aren't containers fun?), we have encountered a situation where a
un-serviceable direct memory allocation request leaves the running
application in a live but unusable state. In the container world, we
expect to run into resource misconfigurations from time to time, it
seems to be a fact of life for the moment. But having the app unable
to recover sucks.

We run:
openjdk 23.0.1+11
netty 4.1.115

A big workload spike comes in, and suddenly we allocate a lot of
memory, and run out:
java.lang.OutOfMemoryError: Cannot reserve 4194304 bytes of direct
buffer memory (allocated: 804069357, limit: 805306368) at
java.base/java.nio.Bits.reserveMemory(Bits.java:178) at
java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:111)
at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:363)
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:718)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:693)

We configure our jvm with
'-XX:OnError=bin/crasher %p' '-XX:OnOutOfMemoryError=bin/crasher %p'

where crasher is a shell script that dumps various things and runs
kill -9 on the process to ensure recovery by kubernetes starting a new
container.

The java help page says,

> -XX:OnOutOfMemoryError=string Sets a custom command or a series of semicolon-separated commands to run when an OutOfMemoryError exception is first thrown. ...

>From a simple reading of this, it sounds like this Bits.reserveMemory
OOM should trigger the OnOutOfMemoryError handler. We would expect the
behavior to invoke the error handler, leading to a kill signal.

Instead, the program proceeds. Something quickly goes wrong with
reference counting inside of the Lettuce Redis client:

Caused by: java.lang.NullPointerException: Cannot invoke
"io.netty.buffer.ByteBuf.refCnt()" because "this.buffer" is null at
io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:597)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)

and then the whole application wedges. I reported this separately
(https://github.com/redis/lettuce/issues/3087).

While it will be nice to improve the Lettuce / Netty handling of OOME,
we felt like our configuration of OnOutOfMemoryError should have
covered this case - the help message doesn't qualify a particular type
of OOME, like "out of Java heap memory" - and a clean kill of the
process would have reduced a multi-hour outage (until a human could
notice + respond) to moments while the process restarts.

Is this an appropriate expectation? I'm happy to file an issue if this
would be considered a bug.

Thank you for your consideration.