SIGBUS on linux in perfMemory_init

Fri Apr 29 13:44:00 UTC 2022

Hi all,

We've been seeing intermittent SIGBUS failures on Linux with JDK 11.  They
all have this distinctive backtrace:

C  [libc.so.6+0x12944d]
V  [libjvm.so+0xcca542]  perfMemory_init()+0x72
V  [libjvm.so+0x8a3242]  vm_init_globals()+0x22
V  [libjvm.so+0xedc31d]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x1ed
V  [libjvm.so+0x9615b2]  JNI_CreateJavaVM+0x52
C  [libjli.so+0x49af]  JavaMain+0x8f
C  [libjli.so+0x9149]  ThreadJavaMain+0x9

Initially, we suspected that /tmp was full but that turned out to not be
the case.  After a few more instances of the crash and investigation, we
believe we know the root cause.

The crashing applications are all running in a Kubernetes (K8s) pod, with
each JVM in a separate container:

container_type: cgroupv1 (from the hs_err file)

/tmp is mounted such that it's shared by multiple containers.  Since these
JVMs are running in containers, we believe the namespaced (i.e.
per-container) PIDs overlap between different containers: two JVMs, in
separate containers, can end up with the same namespaced PID.  Since /tmp
is shared, they can now "contend" on the same perfMemory file, because those
file names are PID-based.
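
To make the collision concrete, here is a small sketch.  It assumes the
usual hsperfdata_<user>/<pid> layout under /tmp; perf_data_path is a
hypothetical helper written for illustration (not HotSpot code), and the
user name "app" and PID 7 are made up.

```python
def perf_data_path(user: str, pid: int, tmpdir: str = "/tmp") -> str:
    """Build a path in the hsperfdata_<user>/<pid> layout (illustrative only)."""
    return f"{tmpdir}/hsperfdata_{user}/{pid}"


# Two JVMs in different containers of the same pod.  Each one only sees its
# own namespaced PID, and both happen to be PID 7 (made-up value).
jvm_a = perf_data_path("app", 7)
jvm_b = perf_data_path("app", 7)

# With /tmp shared between the containers, both names refer to the same file.
print(jvm_a, jvm_b, jvm_a == jvm_b)
```

With a per-container /tmp the two identical names would still refer to
different files; it is the shared mount that turns the name clash into a
real one.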

Once multiple JVMs can contend on the same file, a SIGBUS can arise if one
JVM has mmap'd the file and another calls ftruncate() on it out from under
the first, e.g. here:
https://github.com/openjdk/jdk11/blob/37115c8ea4aff13a8148ee2b8832b20888a5d880/src/hotspot/os/linux/perfMemory_linux.cpp#L909
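
The mechanism can be reproduced outside the JVM.  The sketch below is not
the HotSpot code path, just a minimal stand-in: a child process maps a
4096-byte file, truncates it to zero through a second descriptor (playing
the role of the other JVM), and then touches the mapping.  It runs in a
child process because the fault kills the interpreter.  The file and
function names are made up for the example, and it assumes Linux.

```python
import signal
import subprocess
import sys
import tempfile

# Code run in the child process; it is expected to die from the fault.
CHILD = """
import mmap, os, sys
path = sys.argv[1]
fd = os.open(path, os.O_RDWR)
os.ftruncate(fd, 4096)           # give the file one page of backing store
m = mmap.mmap(fd, 4096)          # "JVM A" maps the file (shared mapping)
fd2 = os.open(path, os.O_RDWR)   # "JVM B" opens the same path
os.ftruncate(fd2, 0)             # and truncates it under the mapping
m[0]                             # the page is now past EOF: access faults
"""


def reproduce() -> int:
    """Run the child and return its exit status (negative = killed by signal)."""
    with tempfile.NamedTemporaryFile() as f:
        proc = subprocess.run([sys.executable, "-c", CHILD, f.name])
        return proc.returncode


if __name__ == "__main__":
    rc = reproduce()
    print("child killed by SIGBUS:", rc == -signal.SIGBUS)
```

Note that the access does not have to happen right after the truncate; any
later touch of a page that no longer has backing store would fault the same
way.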

Is this a known issue? I couldn't find any existing JBS entries or mailing
list discussions around this specific circumstance.

As for possible solutions, would it be possible to use the global PID
instead of the namespaced PID to "regain" the uniqueness invariant of the
PID? Also, might it make sense to flock() the file to prevent another
process from mucking with it?
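
As a rough illustration of the flock() idea (a sketch only, not a proposed
patch): whoever creates the file takes an exclusive, non-blocking advisory
lock and holds it for its lifetime, and anyone who fails to get the lock
backs off instead of truncating.  try_claim and demo are hypothetical names,
and the file name "7" stands in for a PID-based name.

```python
import fcntl
import os
import tempfile


def try_claim(path: str):
    """Open path and take an exclusive advisory lock.

    Returns the open fd on success, or None if someone else holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)  # lock is held elsewhere: do not truncate, back off
        return None
    return fd


def demo():
    """Two claimants on the same PID-named file; only the first should win."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "7")
        first = try_claim(path)    # plays "JVM A"
        second = try_claim(path)   # plays "JVM B", separate open of same file
        result = (first is not None, second is None)
        if first is not None:
            os.close(first)        # closing the fd releases the lock
        return result


if __name__ == "__main__":
    print(demo())
```

Since flock() is advisory, this would only help if every JVM sharing the
directory takes the lock before touching the file.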

Happy to provide more info if needed.

Thanks