PPC64 JVM crashes when RTM is enabled

Fri Jul 22 06:44:13 UTC 2016

Hi Gustavo,

very neat analysis! I opened
https://bugs.openjdk.java.net/browse/JDK-8162369

Does AIX require a similar fix?

Best regards,
  Goetz.

> -----Original Message-----
> From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces at openjdk.java.net] On Behalf Of Gustavo Romero
> Sent: Thursday, 21 July 2016 20:30
> To: ppc-aix-port-dev at openjdk.java.net; hotspot-dev at openjdk.java.net
> Cc: Breno Leitao <brenohl at br.ibm.com>
> Subject: PPC64 JVM crashes when RTM is enabled
> Importance: High
> 
> Hi
> 
> As of now (jdk9/hs-comp, f3c27d6d4ad1 tip), the JVM crashes due to the
> delivery of a signal in the middle of an HTM transaction on PPC64 (on x64
> this feature is called RTM, but on POWER it's called HTM, standing for
> Hardware Transactional Memory).
> 
> When a SIGTRAP or a SIGILL is generated by the execution of a `trap`
> instruction or an illegal instruction at the beginning of a not entrant or
> zombie method, and this happens in the middle of an HTM transaction, the
> transaction fails.
> 
> As a consequence, two different ucontext_t structs are set up by the Linux
> kernel. One context is related to the HTM block that failed, while the other
> is related to where the offending instruction was executed, i.e. the method
> containing the `trap` or illegal instruction. Currently the JVM signal
> handler for Linux/PPC64 inspects only the context related to the failed HTM
> block. When it checks the value of nip, i.e. the Next Instruction Pointer
> stored at uc->uc_mcontext.regs->nip, by calling
> os::Linux::ucontext_get_pc(uc), it does not find the offending instruction
> but instead the instruction located at tbegin+4, which is a branch to the
> HTM failure handler, as explained here [1].
> 
> A simple test case is:
> java -XX:+UnlockExperimentalVMOptions -XX:+UseRTMForStackLocks -XX:+UseRTMLocking
> 
> The issue first appeared in the
> compiler/rtm/cli/TestUseRTMForStackLocksOptionOnSupportedConfig.java
> jtreg test:
> http://hastebin.com/raw/ufodiduqeh
> 
> Please refer to the following hs_err log:
> http://hastebin.com/raw/zucifaxoce
> 
> In this log, si_addr=0x00003fff60460c10 (where the trap instruction is) but
> pc=0x00003fff60455ec4 (which points to tbegin+4, i.e. a beq instruction to
> the HTM failure handler, and not to a trap instruction).
> 
> 0x00003fff60455ec0: .long 0x7c00051d (tbegin.)
> 0x00003fff60455ec4: beq-    0x00003fff60455ee0 <======= pc = branch to HTM failure handler
> 0x00003fff60455ec8: ld      r14,0(r3)           and not trap (or illegal) instr.
> 0x00003fff60455ecc: clrldi  r0,r14,61
> 0x00003fff60455ed0: cmpwi   cr5,r0,1
> 0x00003fff60455ed4: beq-    cr5,0x00003fff60455ff4
> 0x00003fff60455ed8: .long 0x7c00055d (tend.)
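
The two `.long` words in the listing above can be checked by hand. The sketch below decodes them, assuming the usual X-form layout and the extended opcodes 654 (tbegin.) and 686 (tend.) under primary opcode 31; these encodings should be verified against the Power ISA. The helper names are made up for this illustration and are not HotSpot functions.

```cpp
#include <cassert>
#include <cstdint>

// X-form field extraction: primary opcode in the top 6 bits, a 10-bit
// extended opcode just above the lowest bit, and the record bit (Rc) last.
constexpr uint32_t primary_opcode(uint32_t insn)  { return insn >> 26; }
constexpr uint32_t extended_opcode(uint32_t insn) { return (insn >> 1) & 0x3ffu; }
constexpr bool     record_bit(uint32_t insn)      { return (insn & 1u) != 0; }

// Assumed encodings (check against the Power ISA): both instructions use
// primary opcode 31; tbegin. has extended opcode 654 and tend. has 686.
// Operand fields are ignored; they are zero in the words from the listing.
constexpr bool is_tbegin(uint32_t insn) {
  return primary_opcode(insn) == 31 && extended_opcode(insn) == 654 && record_bit(insn);
}
constexpr bool is_tend(uint32_t insn) {
  return primary_opcode(insn) == 31 && extended_opcode(insn) == 686 && record_bit(insn);
}

// The two raw words from the listing above.
static_assert(is_tbegin(0x7c00051du), "0x7c00051d decodes as tbegin.");
static_assert(is_tend(0x7c00055du),   "0x7c00055d decodes as tend.");
```

A check like this would let a handler recognize that pc sits right after a tbegin. (pc - 4) rather than on a trap, which is the situation in the hs_err log.
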
> 
> Once in the signal handler, pc is normally equal to si_addr, so pc is
> expected to point to the trap instruction located in the method marked not
> entrant (or zombie). But when the JVM handler inspects pc, it can't find a
> trap instruction (or an illegal instruction, if the -XX:-UseSIGTRAP flag is
> used). This is an invalid condition for the JVM signal handler, and the
> handler ends up in report_and_die.
> 
> Here are two examples of it, one using a trap instruction to mark a not
> entrant method and another using an illegal instruction for the same
> purpose:
> http://hastebin.com/raw/avahoyadik
> 
> It's important to mention that the crash is intermittent, so some runs will
> not crash the JVM (it seems that the issue gets worse as the number of
> threads increases).
> 
> The solution I found consists of restoring the right context, the one that
> points to the not entrant method, which the kernel stores in a second
> ucontext_t struct when a signal is caught in the middle of an HTM
> transaction, as explained in [2].
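
To make the selection rule concrete, here is a minimal model of it, assuming a stand-in struct instead of the real ucontext_t. `FakeContext` and both helpers are hypothetical names for illustration only; unlike the proposed patch, this sketch does not modify uc_link.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ucontext_t, keeping only the two fields the
// argument needs.
struct FakeContext {
  uint64_t     nip;   // models uc->uc_mcontext.regs->nip
  FakeContext* link;  // models uc->uc_link
};

// Prefer the linked (second) context when the kernel supplied one;
// otherwise keep the context that was passed in.
inline const FakeContext* context_for_handler(const FakeContext* uc) {
  return uc->link != nullptr ? uc->link : uc;
}

// True if the selected context's pc agrees with the faulting address that
// the kernel reported in si_addr.
inline bool pc_matches_fault_address(const FakeContext* uc, uint64_t si_addr) {
  return context_for_handler(uc)->nip == si_addr;
}
```

With the values from the hs_err log, a first context whose nip is 0x00003fff60455ec4 and whose link points to a context with nip 0x00003fff60460c10 selects the second context, and its pc then agrees with si_addr.
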
> 
> The following patch is proposed to solve the issue. With it,
> compiler/rtm/cli/TestUseRTMForStackLocksOptionOnSupportedConfig.java
> always passes:
> 
> 
> diff -r adc8c84b7cf8 src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp
> --- a/src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp	Fri Jul 01 11:29:55 2016 +0200
> +++ b/src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp	Wed Jul 20 21:52:08 2016 -0400
> @@ -219,6 +219,28 @@
>                          int abort_if_unrecognized) {
>    ucontext_t* uc = (ucontext_t*) ucVoid;
> 
> +  // A second thread context exists if the signal is delivered during a
> +  // transaction. Please see kernel doc transactional_memory.txt, L99-101:
> +  // https://goo.gl/E1xbxZ
> +  ucontext_t* transaction_uc = uc->uc_link;
> +
> +  // If uc->uc_link != NULL, then the signal happened during a transaction, as
> +  // pointed out in L106-107 (ibidem). MSR.TS bit must be checked for future
> +  // compatibility, but for now just checking uc->uc_link is ok.
> +  //
> +  // The JVM signal handler expects the context where a `trap` or
> +  // an illegal instruction occurs (i.e. at the beginning of a method marked as
> +  // not entrant or zombie), but if the first context `uc` is used it contains
> +  // the context of the HTM block, thus uc->uc_mcontext.regs->nip points to
> +  // tbegin+4, as explained in L103-104 (ibidem). Hence it's necessary to
> +  // restore the context where the `trap` or the illegal instruction is, which
> +  // is the second context in uc->uc_link.
> +  if (transaction_uc) {
> +    uc = transaction_uc;
> +    uc->uc_link = NULL;
> +    ucVoid = (void*) uc;
> +  }
> +
>    Thread* t = Thread::current_or_null_safe();
> 
>    SignalHandlerMark shm(t);
> 
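
Regarding the patch comment that MSR.TS should be checked for future compatibility, a minimal sketch of such a check could look as follows. It assumes the TS field occupies bits 33 (suspended) and 34 (transactional) of the 64-bit MSR, based on MSR_TS_S_LG and MSR_TS_T_LG in the kernel's asm/reg.h; this should be verified against the kernel headers. The helper names are hypothetical, and the msr argument is assumed to be read from uc->uc_mcontext.regs->msr of a 64-bit process.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit positions of the MSR.TS field (verify in the kernel's
// arch/powerpc/include/asm/reg.h): bit 33 = suspended, bit 34 = transactional.
constexpr uint64_t MSR_TS_SUSPENDED     = uint64_t(1) << 33;
constexpr uint64_t MSR_TS_TRANSACTIONAL = uint64_t(1) << 34;
constexpr uint64_t MSR_TS_FIELD = MSR_TS_SUSPENDED | MSR_TS_TRANSACTIONAL;

// True if the saved MSR says the thread was in a transaction (active or
// suspended) when the signal was delivered.
constexpr bool msr_in_transaction(uint64_t msr) {
  return (msr & MSR_TS_FIELD) != 0;
}

// Hypothetical stricter rule than the patch: switch to the linked context
// only if it exists and the MSR confirms a transaction was in progress.
constexpr bool should_use_linked_context(bool has_link, uint64_t msr) {
  return has_link && msr_in_transaction(msr);
}
```

This keeps the current behaviour whenever uc_link is set by a transaction, and falls back to the first context if uc_link were ever set for some other reason.
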
> Is it possible to open a bug for this issue?
> 
> Thank you and best regards,
> Gustavo
> 
> [1]
> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/transactional_memory.txt#L96-L105
> [2]
> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/transactional_memory.txt#L106-L107