PPC64 JVM crashes when RTM is enabled
Lindenmaier, Goetz
goetz.lindenmaier at sap.com
Fri Jul 22 06:44:13 UTC 2016
Hi Gustavo,
very neat analysis! I opened
https://bugs.openjdk.java.net/browse/JDK-8162369
Does AIX require a similar fix?
Best regards,
Goetz.
> -----Original Message-----
> From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces at openjdk.java.net] On Behalf Of Gustavo Romero
> Sent: Thursday, July 21, 2016 20:30
> To: ppc-aix-port-dev at openjdk.java.net; hotspot-dev at openjdk.java.net
> Cc: Breno Leitao <brenohl at br.ibm.com>
> Subject: PPC64 JVM crashes when RTM is enabled
> Importance: High
>
> Hi
>
> As of now (jdk9/hs-comp, tip f3c27d6d4ad1), the JVM crashes on PPC64 when a
> signal is delivered in the middle of an HTM transaction. (On x64 this feature
> is called RTM; on POWER it is called HTM, for Hardware Transactional Memory.)
>
> A SIGTRAP or a SIGILL is generated when a `trap` instruction or an illegal
> instruction is executed at the beginning of a not entrant or zombie method.
> If that happens in the middle of an HTM transaction, the transaction fails.
>
> As a consequence, the Linux kernel sets up two different ucontext_t structs.
> One context describes the HTM block that failed; the other describes where
> the offending instruction was executed, i.e. the method containing the `trap`
> or illegal instruction. Currently the JVM signal handler for Linux/PPC64
> inspects only the context of the failed HTM block. When it reads nip (the
> Next Instruction Pointer, at uc->uc_mcontext.regs->nip) by calling
> os::Linux::ucontext_get_pc(uc), it does not find the offending instruction
> but the instruction at tbegin+4, which is a branch to the HTM failure
> handler, as explained in [1].
>
> A simple test case is:
> java -XX:+UnlockExperimentalVMOptions -XX:+UseRTMForStackLocks -XX:+UseRTMLocking
>
> The issue first appeared in the
> compiler/rtm/cli/TestUseRTMForStackLocksOptionOnSupportedConfig.java
> jtreg test:
> http://hastebin.com/raw/ufodiduqeh
>
> Please refer to the following hs_err log:
> http://hastebin.com/raw/zucifaxoce
>
> In this log, si_addr=0x00003fff60460c10 (where the trap instruction is) but
> pc=0x00003fff60455ec4 (which points to tbegin+4, i.e. a beq instruction to
> the HTM failure handler, and not to a trap instruction).
>
> 0x00003fff60455ec0: .long 0x7c00051d (tbegin.)
> 0x00003fff60455ec4: beq- 0x00003fff60455ee0 <======= pc = HTM failure handler
> 0x00003fff60455ec8: ld r14,0(r3)                     and not trap (or illegal) instr.
> 0x00003fff60455ecc: clrldi r0,r14,61
> 0x00003fff60455ed0: cmpwi cr5,r0,1
> 0x00003fff60455ed4: beq- cr5,0x00003fff60455ff4
> 0x00003fff60455ed8: .long 0x7c00055d (tend.)
>
> Once in the signal handler, pc is normally equal to si_addr, so pc should
> point to the trap instruction at the start of the method marked not entrant
> (or zombie). But when the JVM handler inspects pc, it finds neither a trap
> instruction nor (if the -XX:-UseSIGTRAP flag is used) an illegal instruction.
> The handler treats this as an invalid condition and ends up in
> report_and_die.
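
The values quoted above can be checked mechanically. The following is a minimal sketch, not code from HotSpot: it decodes the two `.long` words from the listing using the Power ISA X-form layout (primary opcode 31, extended opcode 654 for tbegin. and 686 for tend.) and confirms that the reported pc is tbegin+4 and differs from si_addr. All function and constant names are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>

// Power ISA X-form fields: the primary opcode is the top 6 bits, and the
// 10-bit extended opcode sits just above the record (Rc) bit.
constexpr uint32_t primary_opcode(uint32_t insn)  { return insn >> 26; }
constexpr uint32_t extended_opcode(uint32_t insn) { return (insn >> 1) & 0x3ff; }
constexpr bool     record_bit(uint32_t insn)      { return (insn & 1) != 0; }

// Hypothetical helpers: tbegin. is 31/654 and tend. is 31/686, both with Rc=1.
constexpr bool is_tbegin(uint32_t insn) {
  return primary_opcode(insn) == 31 && extended_opcode(insn) == 654 && record_bit(insn);
}
constexpr bool is_tend(uint32_t insn) {
  return primary_opcode(insn) == 31 && extended_opcode(insn) == 686 && record_bit(insn);
}

// PPC64 instructions are 4 bytes wide, so the instruction after tbegin.
// is at tbegin+4.
constexpr bool pc_is_after_tbegin(uint64_t pc, uint64_t tbegin_addr) {
  return pc == tbegin_addr + 4;
}

// Values copied from the hs_err log and the disassembly quoted above.
constexpr uint64_t kTbeginAddr = 0x00003fff60455ec0ULL;
constexpr uint64_t kReportedPc = 0x00003fff60455ec4ULL;
constexpr uint64_t kSiAddr     = 0x00003fff60460c10ULL;

static_assert(is_tbegin(0x7c00051dU), "first .long in the listing is tbegin.");
static_assert(is_tend(0x7c00055dU), "last .long in the listing is tend.");
static_assert(pc_is_after_tbegin(kReportedPc, kTbeginAddr), "pc is tbegin+4");
static_assert(kSiAddr != kReportedPc, "pc does not match the faulting address");
```

The checks are written as static_assert so the compiler evaluates them; a mismatch in any of the quoted values would fail the build rather than go unnoticed.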
>
> Here are two examples, one using a trap instruction to mark a not entrant
> method and the other using an illegal instruction for the same purpose:
> http://hastebin.com/raw/avahoyadik
>
> It's important to mention that the crash is intermittent: some runs do not
> crash the JVM at all. The issue seems to get worse as the number of threads
> increases.
>
> The solution I found is to restore the right context, the one that points to
> the not entrant method. The kernel stores it in a second ucontext_t struct
> when a signal is caught in the middle of an HTM transaction, as explained in
> [2].
>
> The following patch is proposed to solve the issue; with it,
> compiler/rtm/cli/TestUseRTMForStackLocksOptionOnSupportedConfig.java
> always passes:
>
>
> diff -r adc8c84b7cf8 src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp
> --- a/src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp Fri Jul 01 11:29:55 2016 +0200
> +++ b/src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp Wed Jul 20 21:52:08 2016 -0400
> @@ -219,6 +219,28 @@
> int abort_if_unrecognized) {
> ucontext_t* uc = (ucontext_t*) ucVoid;
>
> + // A second thread context exists if the signal is delivered during a
> + // transaction. Please see kernel doc transactional_memory.txt, L99-101:
> + // https://goo.gl/E1xbxZ
> + ucontext_t* transaction_uc = uc->uc_link;
> +
> + // If uc->uc_link != NULL, then the signal happened during a transaction, as
> + // pointed out in L106-107 (ibidem). MSR.TS bit must be checked for future
> + // compatibility, but for now just checking uc->uc_link is ok.
> + //
> + // The JVM signal handler expects the context where a `trap` or
> + // an illegal instruction occurs (i.e. at the beginning of a method marked as
> + // not entrant or zombie), but if the first context `uc` is used it contains
> + // the context of the HTM block, thus uc->uc_mcontext.regs->nip points to
> + // tbegin+4, as explained in L103-104 (ibidem). Hence it's necessary to
> + // restore the context where the `trap` or the illegal instruction are, which
> + // is the second context in uc->uc_link.
> + if (transaction_uc) {
> + uc = transaction_uc;
> + uc->uc_link = NULL;
> + ucVoid = (void*) uc;
> + }
> +
> Thread* t = Thread::current_or_null_safe();
>
> SignalHandlerMark shm(t);
>
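
The patch comment notes that MSR.TS should eventually be checked instead of relying on uc->uc_link alone. A minimal sketch of such a check follows. It assumes the MSR[TS] field occupies bits 33 (suspended) and 34 (transactional), as defined in the Linux kernel's arch/powerpc/include/asm/reg.h; the helper names are hypothetical, and which saved context the MSR value should be read from is left open here.

```cpp
#include <cstdint>

// Assumption: MSR[TS] bit positions as in Linux's arch/powerpc/include/asm/reg.h
// (MSR_TS_S_LG = 33, MSR_TS_T_LG = 34).
constexpr uint64_t kMsrTsSuspended     = 1ULL << 33;
constexpr uint64_t kMsrTsTransactional = 1ULL << 34;
constexpr uint64_t kMsrTsMask          = kMsrTsSuspended | kMsrTsTransactional;

// Hypothetical helper: true if the saved MSR value says the thread was in a
// transaction (either transactional or suspended state) when interrupted.
constexpr bool msr_in_transaction(uint64_t msr) {
  return (msr & kMsrTsMask) != 0;
}

// Hypothetical helper: switch to the second context only if it exists and
// the saved MSR confirms a transaction was active.
constexpr bool should_use_transaction_context(bool has_uc_link, uint64_t msr) {
  return has_uc_link && msr_in_transaction(msr);
}

static_assert(kMsrTsMask == 0x600000000ULL, "TS field covers bits 33 and 34");
```

On Linux/PPC64 the msr argument would come from a saved register set such as uc->uc_mcontext.regs->msr; the helpers take a plain integer so the logic can be compiled and checked on any platform.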
> Is it possible to open a bug for this issue?
>
> Thank you and best regards,
> Gustavo
>
> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/transactional_memory.txt#L96-L105
> [2] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/transactional_memory.txt#L106-L107