[aarch64-port-dev ] Sporadic crashes on aarch64 after switching from OpenJDK 9 to 10
Stephan Bergmann
sbergman at redhat.com
Wed Jun 20 06:16:51 UTC 2018
On 19/06/18 18:19, Andrew Haley wrote:
> On 06/19/2018 04:44 PM, Stephan Bergmann wrote:
>> I unfortunately have a somewhat complex setup, but maybe somebody has
>> an idea how to debug this further:
>>
>> I am doing Flatpak builds of LibreOffice. Those builds used to use
>> OpenJDK 9. When I tried to switch to OpenJDK 10, builds for aarch64
>> started to fail, in one of LibreOffice's tests that instantiates a JVM
>> in a (C++) process. Builds for other platforms (arm 32-bit, x86 32- and
>> 64-bit) did not fail.
>>
>> I tried without success to reproduce the failure on various aarch64
>> machines (with both 4K and 64K PAGE_SIZE); the only kind of machine I
>> could reproduce it on (not fully reliably, but around 50% of the time)
>> is a massively parallel 64-core machine (the kind routinely used for
>> those Flatpak builds).
>>
>> The symptom is always a SIGSEGV in a thread whose gdb backtrace shows
>> just a single frame of apparently JIT-generated code (i.e., outside any
>> .so). A typical such case is
>>
>>> (gdb) disas 0x0000ffff8a1d6d80,+300
>>> Dump of assembler code from 0xffff8a1d6d80 to 0xffff8a1d6eac:
>>> 0x0000ffff8a1d6d80: .inst 0xffffffff ; undefined
>>> 0x0000ffff8a1d6d84: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6d88: adrp x12, 0x100003add7000
>>> 0x0000ffff8a1d6d8c: .inst 0x00386024 ; NYI
>>> 0x0000ffff8a1d6d90: .inst 0x60806000 ; undefined
>>> 0x0000ffff8a1d6d94: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6d98: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6d9c: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6da0: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6da4: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6da8: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6dac: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6db0: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6db4: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6db8: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6dbc: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6dc0: ldr w8, [x1,#8]
>>> 0x0000ffff8a1d6dc4: cmp w9, w8
>>> 0x0000ffff8a1d6dc8: b.eq 0xffff8a1d6e00
>>> 0x0000ffff8a1d6dcc: adrp x8, 0xffff8263c000
>>> 0x0000ffff8a1d6dd0: add x8, x8, #0x700
>>> 0x0000ffff8a1d6dd4: br x8
>>> 0x0000ffff8a1d6dd8: nop
>>> 0x0000ffff8a1d6ddc: nop
>>> 0x0000ffff8a1d6de0: nop
>>> 0x0000ffff8a1d6de4: nop
>>> 0x0000ffff8a1d6de8: nop
>>> 0x0000ffff8a1d6dec: nop
>>> 0x0000ffff8a1d6df0: nop
>>> 0x0000ffff8a1d6df4: nop
>>> 0x0000ffff8a1d6df8: nop
>>> 0x0000ffff8a1d6dfc: nop
>>> 0x0000ffff8a1d6e00: nop
>>> 0x0000ffff8a1d6e04: sub x9, sp, #0x14, lsl #12
>>> 0x0000ffff8a1d6e08: str xzr, [x9]
>>> 0x0000ffff8a1d6e0c: sub sp, sp, #0x40
>>> 0x0000ffff8a1d6e10: stp x29, x30, [sp,#48]
>>> 0x0000ffff8a1d6e14: ldr w0, [x1,#28]
>>> 0x0000ffff8a1d6e18: ldp x29, x30, [sp,#48]
>>> 0x0000ffff8a1d6e1c: add sp, sp, #0x40
>>> 0x0000ffff8a1d6e20: ldr x8, [x28,#112]
>>> => 0x0000ffff8a1d6e24: ldr wzr, [x8]
>>> 0x0000ffff8a1d6e28: ret
>>> 0x0000ffff8a1d6e2c: nop
>>> 0x0000ffff8a1d6e30: nop
>>> 0x0000ffff8a1d6e34: ldr x0, [x28,#736]
>>> 0x0000ffff8a1d6e38: str xzr, [x28,#736]
>>> 0x0000ffff8a1d6e3c: str xzr, [x28,#744]
>>> 0x0000ffff8a1d6e40: ldp x29, x30, [sp,#48]
>>> 0x0000ffff8a1d6e44: add sp, sp, #0x40
>>> 0x0000ffff8a1d6e48: adrp x8, 0xffff8266e000
>>> 0x0000ffff8a1d6e4c: add x8, x8, #0x200
>>> 0x0000ffff8a1d6e50: br x8
>>> 0x0000ffff8a1d6e54: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6e58: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6e5c: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6e60: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6e64: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6e68: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6e6c: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6e70: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6e74: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6e78: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6e7c: .inst 0x00000000 ; undefined
>>> 0x0000ffff8a1d6e80: adrp x8, 0xffff82670000
>>> 0x0000ffff8a1d6e84: add x8, x8, #0x900
>>> 0x0000ffff8a1d6e88: blr x8
>>> 0x0000ffff8a1d6e8c: stp x0, x1, [sp,#-256]!
>>> 0x0000ffff8a1d6e90: stp x2, x3, [sp,#16]
>>> 0x0000ffff8a1d6e94: stp x4, x5, [sp,#32]
>>> 0x0000ffff8a1d6e98: stp x6, x7, [sp,#48]
>>> 0x0000ffff8a1d6e9c: stp x8, x9, [sp,#64]
>>> 0x0000ffff8a1d6ea0: stp x10, x11, [sp,#80]
>>> 0x0000ffff8a1d6ea4: stp x12, x13, [sp,#96]
>>> 0x0000ffff8a1d6ea8: stp x14, x15, [sp,#112]
>>> End of assembler dump.
>>
>> where x8 points at no memory (0xffff99d52008 in this case). The details
>> of the code differ across crashes, but it appears to always be a "ldr
>> wzr, [x8]" that triggers the SIGSEGV. There are more than 100 threads,
>> most of which appear to be JVM housekeeping ones (compilation, gc; I
>> have lengthy gdb "thread apply all backtrace full" output that I could
>> provide.)
>
> That's a safepoint SEGV. It's deliberate. If you step at that
> point you'll enter the safepoint code.
>
>> I could no longer reproduce the failure when I either made LibreOffice
>> instantiate the in-process JVM with -Xint to force interpreted mode, or
>> built OpenJDK with --with-debug-level=fastdebug instead of
>> --with-debug-level=release. (I tried a handful of times each; but as
>> the failure isn't reliably reproducible, that might of course also have
>> just been luck.)
>>
>> The OpenJDK in the Flatpak build environment is
>> http://hg.openjdk.java.net/jdk-updates/jdk10u tag jdk-10.0.1+10 (at
>> <https://github.com/flathub/org.freedesktop.Sdk.Extension.openjdk10>,
>> which in turn uses the sources packaged by
>> <https://src.fedoraproject.org/rpms/java-openjdk/branch/jdk-10>). I
>> also tried replacing that with current tip of that branch, but that
>> didn't make a difference (it felt like the failure happened less often,
>> like only 10% of the time, but again, that might just have been luck).
>>
>> I have only restricted access to that 64-core machine, and the only
>> viable way for me to test the issue is via the Flatpak build environment
>> (e.g., I cannot easily download another OpenJDK 10 build to test
>> against). The failing LibreOffice test itself is also somewhat complex,
>> and it would likely not be easy to strip it down to a small reproducer.
>>
>> Thoughts, anyone?
>
> What exactly is the failure?

The failure is that the process (running a C++ cppunittester executable
with an in-process instantiated JVM) terminates due to a SIGSEGV. When I
inspect the generated core file with gdb, it reports the above thread as
the one that caused that "fatal" SIGSEGV.

> You should have an error message and an error log file.

There is no stdout/stderr output from the JVM (only some routine
cppunittester output). And I cannot find any log file (you mean one of
those .hs-err-pid* files or whatever they are called exactly, right?).
Where should I look for it ($HOME, cwd, ...)?
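
For reference: by default HotSpot names its fatal error log
hs_err_pid<pid>.log and writes it to the working directory of the
crashing process, falling back to the system temporary directory if it
cannot write there; the -XX:ErrorFile option overrides the location.
Below is a minimal Python sketch for searching those default places. The
function name find_hs_err_logs and the choice of directories are
assumptions made for illustration only, and the cwd of the shell running
the script is not necessarily the cwd the test process had.

```python
import glob
import os
import tempfile


def find_hs_err_logs(directories):
    """Return hs_err_pid*.log files found directly in the given
    directories, newest first (hypothetical helper, not part of any JDK tool)."""
    found = set()
    for directory in directories:
        # glob.escape keeps characters such as '[' in a directory name literal.
        pattern = os.path.join(glob.escape(directory), "hs_err_pid*.log")
        found.update(glob.glob(pattern))
    return sorted(found, key=os.path.getmtime, reverse=True)


if __name__ == "__main__":
    # Assumption: the crashed process ran with this cwd; adjust as needed.
    for path in find_hs_err_logs([os.getcwd(), tempfile.gettempdir()]):
        print(path)
```

If nothing turns up in either place, that in itself is a data point: it
suggests the JVM's own fatal error handling never ran for that SIGSEGV.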