[aarch64-port-dev] Sporadic crashes on aarch64 after switching from OpenJDK 9 to 10
Stephan Bergmann
sbergman at redhat.com
Tue Jun 19 15:44:48 UTC 2018
My setup is unfortunately somewhat complex, but maybe somebody has an
idea of how to debug this further:
I am doing Flatpak builds of LibreOffice. Those builds used to use
OpenJDK 9. When I tried to switch to OpenJDK 10, builds for aarch64
started to fail in a LibreOffice test that instantiates a JVM in a
(C++) process. Builds for other platforms (32-bit arm, 32- and 64-bit
x86) did not fail.
I tried unsuccessfully to reproduce the failure on various aarch64
machines (with both 4K and 64K PAGE_SIZE). The only kind of machine on
which I could reproduce it (not fully reliably, but around 50% of the
time) is a massively parallel 64-core machine, the kind routinely used
for those Flatpak builds.
The symptom is always a SIGSEGV in a thread whose gdb backtrace shows
just a single frame of apparently JIT-generated code (i.e., outside any
.so). A typical case is
> (gdb) disas 0x0000ffff8a1d6d80,+300
> Dump of assembler code from 0xffff8a1d6d80 to 0xffff8a1d6eac:
> 0x0000ffff8a1d6d80: .inst 0xffffffff ; undefined
> 0x0000ffff8a1d6d84: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6d88: adrp x12, 0x100003add7000
> 0x0000ffff8a1d6d8c: .inst 0x00386024 ; NYI
> 0x0000ffff8a1d6d90: .inst 0x60806000 ; undefined
> 0x0000ffff8a1d6d94: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6d98: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6d9c: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6da0: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6da4: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6da8: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6dac: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6db0: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6db4: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6db8: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6dbc: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6dc0: ldr w8, [x1,#8]
> 0x0000ffff8a1d6dc4: cmp w9, w8
> 0x0000ffff8a1d6dc8: b.eq 0xffff8a1d6e00
> 0x0000ffff8a1d6dcc: adrp x8, 0xffff8263c000
> 0x0000ffff8a1d6dd0: add x8, x8, #0x700
> 0x0000ffff8a1d6dd4: br x8
> 0x0000ffff8a1d6dd8: nop
> 0x0000ffff8a1d6ddc: nop
> 0x0000ffff8a1d6de0: nop
> 0x0000ffff8a1d6de4: nop
> 0x0000ffff8a1d6de8: nop
> 0x0000ffff8a1d6dec: nop
> 0x0000ffff8a1d6df0: nop
> 0x0000ffff8a1d6df4: nop
> 0x0000ffff8a1d6df8: nop
> 0x0000ffff8a1d6dfc: nop
> 0x0000ffff8a1d6e00: nop
> 0x0000ffff8a1d6e04: sub x9, sp, #0x14, lsl #12
> 0x0000ffff8a1d6e08: str xzr, [x9]
> 0x0000ffff8a1d6e0c: sub sp, sp, #0x40
> 0x0000ffff8a1d6e10: stp x29, x30, [sp,#48]
> 0x0000ffff8a1d6e14: ldr w0, [x1,#28]
> 0x0000ffff8a1d6e18: ldp x29, x30, [sp,#48]
> 0x0000ffff8a1d6e1c: add sp, sp, #0x40
> 0x0000ffff8a1d6e20: ldr x8, [x28,#112]
> => 0x0000ffff8a1d6e24: ldr wzr, [x8]
> 0x0000ffff8a1d6e28: ret
> 0x0000ffff8a1d6e2c: nop
> 0x0000ffff8a1d6e30: nop
> 0x0000ffff8a1d6e34: ldr x0, [x28,#736]
> 0x0000ffff8a1d6e38: str xzr, [x28,#736]
> 0x0000ffff8a1d6e3c: str xzr, [x28,#744]
> 0x0000ffff8a1d6e40: ldp x29, x30, [sp,#48]
> 0x0000ffff8a1d6e44: add sp, sp, #0x40
> 0x0000ffff8a1d6e48: adrp x8, 0xffff8266e000
> 0x0000ffff8a1d6e4c: add x8, x8, #0x200
> 0x0000ffff8a1d6e50: br x8
> 0x0000ffff8a1d6e54: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6e58: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6e5c: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6e60: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6e64: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6e68: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6e6c: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6e70: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6e74: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6e78: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6e7c: .inst 0x00000000 ; undefined
> 0x0000ffff8a1d6e80: adrp x8, 0xffff82670000
> 0x0000ffff8a1d6e84: add x8, x8, #0x900
> 0x0000ffff8a1d6e88: blr x8
> 0x0000ffff8a1d6e8c: stp x0, x1, [sp,#-256]!
> 0x0000ffff8a1d6e90: stp x2, x3, [sp,#16]
> 0x0000ffff8a1d6e94: stp x4, x5, [sp,#32]
> 0x0000ffff8a1d6e98: stp x6, x7, [sp,#48]
> 0x0000ffff8a1d6e9c: stp x8, x9, [sp,#64]
> 0x0000ffff8a1d6ea0: stp x10, x11, [sp,#80]
> 0x0000ffff8a1d6ea4: stp x12, x13, [sp,#96]
> 0x0000ffff8a1d6ea8: stp x14, x15, [sp,#112]
> End of assembler dump.
where x8 points to unmapped memory (0xffff99d52008 in this case). The
details of the code differ across crashes, but the SIGSEGV always appears
to be triggered by an "ldr wzr, [x8]" instruction. There are more than
100 threads, most of which appear to be JVM housekeeping threads
(compilation, GC). I have lengthy gdb "thread apply all backtrace full"
output that I could provide.
I could no longer reproduce the failure when I either made LibreOffice
instantiate the in-process JVM with -Xint to force interpreted mode, or
built OpenJDK with --with-debug-level=fastdebug instead of
--with-debug-level=release. (I tried only a handful of times for each;
as the failure isn't reliably reproducible, that may of course just have
been luck.)
The OpenJDK in the Flatpak build environment is
http://hg.openjdk.java.net/jdk-updates/jdk10u tag jdk-10.0.1+10 (via
<https://github.com/flathub/org.freedesktop.Sdk.Extension.openjdk10>,
which in turn uses the sources packaged by
<https://src.fedoraproject.org/rpms/java-openjdk/branch/jdk-10>). I
also tried replacing that with the current tip of that branch, but it
didn't make a difference (the failure seemed to happen less often, maybe
only 10% of the time, but again, that may just have been luck).
I have only limited access to that 64-core machine, and the only viable
way for me to test the issue is via the Flatpak build environment (e.g.,
I cannot easily download another OpenJDK 10 build to test against). The
failing LibreOffice test itself is also somewhat complex, and it would
likely not be easy to strip it down to a small reproducer.
Thoughts, anyone?