Bits.reserveMemory OutOfMemoryError does not seem to trigger -XX:OnOutOfMemoryError
Steven Schlansker
stevenschlansker at gmail.com
Wed Dec 18 19:27:50 UTC 2024
> On Dec 17, 2024, at 5:31 PM, David Holmes <david.holmes at oracle.com> wrote:
>
> Hi Steven,
>
> The -XX OOM relating flags are only for OOM conditions directly detected by the VM itself - please also see:
>
> https://bugs.openjdk.org/browse/JDK-8257790
>
Thank you David, this explains what we see.
> Unfortunately the java man page documentation didn't get updated to make this clear, but I will address that. We did clarify in the source [1]:
>
> product(ccstrlist, OnOutOfMemoryError, "", \
> "Run user-defined commands on first java.lang.OutOfMemoryError " \
> "thrown from JVM") \
>
> You need to handle Java triggered OOM conditions in the Java code.
We'll endeavor to do this. I hope the Lettuce / Netty maintainers can help us out here, as none of our code is directly implicated.
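For anyone else hitting this, "handle it in the Java code" presumably means wrapping the allocation itself and deciding to die fast rather than limp along. A minimal sketch of that idea (the wrapper name and exit code are our own invention, not an existing Netty/Lettuce API):

```java
import java.nio.ByteBuffer;

public class DirectAllocGuard {

    // Hypothetical wrapper: allocate a direct buffer, or terminate
    // the JVM immediately if the direct-memory limit is exhausted.
    static ByteBuffer allocateOrDie(int size) {
        try {
            return ByteBuffer.allocateDirect(size);
        } catch (OutOfMemoryError e) {
            e.printStackTrace();
            // halt() skips shutdown hooks, which might themselves
            // try to allocate; mirrors our OnOutOfMemoryError kill -9.
            Runtime.getRuntime().halt(137);
            throw e; // unreachable, satisfies the compiler
        }
    }

    public static void main(String[] args) {
        // A tiny allocation succeeds, so this demo just prints the capacity.
        ByteBuffer buf = allocateOrDie(16);
        System.out.println(buf.capacity());
    }
}
```

Using Runtime.halt rather than System.exit is deliberate here: exit runs shutdown hooks, which can block or allocate while the process is already in a bad state.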
I am sure this is not a high priority feature request, but it would be nice to have a more general "on any user-visible Error: stop, dump, and crash".
So far, every time we have seen an Error, regardless of whether it came from the VM or the Java standard library,
we would rather crash and analyze offline than try to continue in an unknown state and end up somewhere even worse!
That said, I see you already stated your position as "strongly opposed" in JDK-8257790, so I suppose this is how the world is at least for now :)
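In the meantime, the closest application-level approximation of "crash on any Error" that we know of is a default uncaught-exception handler. Note the important caveat: it only fires for Errors that actually escape a thread, so it would not have caught this Netty case, where the OOME is swallowed internally. A sketch (the handler body is illustrative; a real one would dump diagnostics and then call Runtime.getRuntime().halt(1)):

```java
public class ErrorTrap {
    public static void main(String[] args) throws Exception {
        // Fires for any Throwable that propagates out of a thread's run().
        Thread.setDefaultUncaughtExceptionHandler((thread, e) -> {
            if (e instanceof Error) {
                System.out.println("fatal: " + e.getMessage());
                // Production version: dump state, then Runtime.getRuntime().halt(1)
            }
        });

        // Simulate an Error escaping a worker thread.
        Thread worker = new Thread(() -> {
            throw new OutOfMemoryError("simulated");
        });
        worker.start();
        worker.join(); // handler runs on the worker before join returns

        System.out.println("main continues");
    }
}
```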
Thanks for your time and for everyone's efforts making Java robust.
> David
> -----
>
> [1] https://bugs.openjdk.org/browse/JDK-8258058
>
> On 18/12/2024 9:52 am, Steven Schlansker wrote:
>> Hi hotspot-dev,
>> In our continuing mission to explore strange new VM memory limits
>> (aren't containers fun?), we have encountered a situation where an
>> unserviceable direct memory allocation request leaves the running
>> application in a live but unusable state. In the container world, we
>> expect to run into resource misconfigurations from time to time; it
>> seems to be a fact of life for the moment. But having the app unable
>> to recover sucks.
>> We run:
>> openjdk 23.0.1+11
>> netty 4.1.115
>> A big workload spike comes in, and suddenly we allocate a lot of
>> memory, and run out:
>> java.lang.OutOfMemoryError: Cannot reserve 4194304 bytes of direct buffer memory (allocated: 804069357, limit: 805306368)
>>     at java.base/java.nio.Bits.reserveMemory(Bits.java:178)
>>     at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:111)
>>     at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:363)
>>     at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:718)
>>     at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:693)
>> We configure our jvm with
>> '-XX:OnError=bin/crasher %p' '-XX:OnOutOfMemoryError=bin/crasher %p'
>> where crasher is a shell script that dumps various things and runs
>> kill -9 on the process to ensure recovery by kubernetes starting a new
>> container.
>> The java help page says,
>>> -XX:OnOutOfMemoryError=string Sets a custom command or a series of semicolon-separated commands to run when an OutOfMemoryError exception is first thrown. ...
>> From a plain reading of this, the Bits.reserveMemory OOM should
>> trigger the OnOutOfMemoryError handler and invoke our crash script,
>> leading to a kill signal.
>> Instead, the program proceeds. Something quickly goes wrong with
>> reference counting inside of the Lettuce Redis client:
>> Caused by: java.lang.NullPointerException: Cannot invoke "io.netty.buffer.ByteBuf.refCnt()" because "this.buffer" is null
>>     at io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:597)
>>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
>> and then the whole application wedges. I reported this separately
>> (https://github.com/redis/lettuce/issues/3087).
>> While it would be nice to improve the Lettuce / Netty handling of OOME,
>> we felt like our configuration of OnOutOfMemoryError should have
>> covered this case - the help message doesn't qualify a particular type
>> of OOME, like "out of Java heap memory" - and a clean kill of the
>> process would have reduced a multi-hour outage (until a human could
>> notice + respond) to the moments it takes the process to restart.
>> Is this an appropriate expectation? I'm happy to file an issue if this
>> would be considered a bug.
>> Thank you for your consideration.
>