PPC64 JVM crashes when RTM is enabled

Tue Jul 26 14:24:20 UTC 2016

Hi Gustavo and all,

thanks for investigating. We have only tested RTM on AIX, where it seems to work fine.
We have inserted the following code into JVM_handle_aix_signal:

#if INCLUDE_RTM_OPT
    {
      int *inst_ptr = (int*)(pc - BytesPerInstWord);
      if (CodeCache::contains((address)inst_ptr) && MacroAssembler::is_tbegin(*inst_ptr)) {
        // Ignore transaction abort due to signal. Will jump to abort handler.
        if (TraceTraps) {
          tty->print_cr("Caught signal %d in transaction. Ignoring to jump to abort handler.", sig);
        }
        return true;
      }
    }
#endif

The transaction always got aborted and we never ran into this code.
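
A minimal standalone sketch of the check above, for experimenting outside HotSpot. The function names and the mask are assumptions for illustration, not the MacroAssembler::is_tbegin implementation; the two instruction words are the ones shown in the disassembly quoted further down in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: keep the primary opcode (top 6 bits) and the extended
// opcode plus Rc bit (low 11 bits); ignore the operand fields in between.
constexpr uint32_t kOpcodeMask = 0xFC0007FFu;

// "tbegin." instruction word as shown in the quoted hs_err disassembly.
constexpr uint32_t kTbeginWord = 0x7C00051Du;

// Hypothetical helper: does this instruction word look like tbegin.?
constexpr bool looks_like_tbegin(uint32_t inst) {
  return (inst & kOpcodeMask) == (kTbeginWord & kOpcodeMask);
}

// After an abort the pc is at tbegin+4, so inspect the word just before it.
inline bool pc_follows_tbegin(const uint32_t* pc) {
  return looks_like_tbegin(pc[-1]);
}

// Small self-check using the tbegin. and tend. words from the disassembly.
inline void self_check_tbegin() {
  const uint32_t code[] = {0x7C00051Du, 0x7C00055Du};
  assert(pc_follows_tbegin(&code[1]));   // word before code[1] is tbegin.
  assert(!looks_like_tbegin(code[1]));   // tend. must not match
}
```

The real handler additionally checks CodeCache::contains() before dereferencing; the sketch leaves that out because it has no code cache.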

I ran the same experiment on Linux in JVM_handle_linux_signal, where we did run into it. But RTM didn't work as expected because I was using an old Linux version which doesn't treat system calls as required.

Maybe you would like to use this detection and figure out the context of the signaling instruction.

Please keep in mind that the C/C++ code must be compilable on old Linux versions (at least for big endian). But I guess this shouldn't be an issue as uc_link isn't new.

Can you provide a webrev? I can sponsor the fix.

Best regards,
Martin

-----Original Message-----
From: hotspot-dev [mailto:hotspot-dev-bounces at openjdk.java.net] On Behalf Of Lindenmaier, Goetz
Sent: Freitag, 22. Juli 2016 08:44
To: Gustavo Romero <gromero at linux.vnet.ibm.com>; ppc-aix-port-dev at openjdk.java.net; hotspot-dev at openjdk.java.net
Cc: Breno Leitao <brenohl at br.ibm.com>
Subject: RE: PPC64 JVM crashes when RTM is enabled

Hi Gustavo,

very neat analysis!  I opened 
https://bugs.openjdk.java.net/browse/JDK-8162369

Does AIX require a similar fix?

Best regards,
  Goetz.

> -----Original Message-----
> From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces at openjdk.java.net]
> On Behalf Of Gustavo Romero
> Sent: Donnerstag, 21. Juli 2016 20:30
> To: ppc-aix-port-dev at openjdk.java.net; hotspot-dev at openjdk.java.net
> Cc: Breno Leitao <brenohl at br.ibm.com>
> Subject: PPC64 JVM crashes when RTM is enabled
> Importance: High
> 
> Hi
> 
> As of now (jdk9/hs-comp, f3c27d6d4ad1 tip), JVM crashes due to the delivery of
> a signal in the middle of an HTM transaction on PPC64 (on x64 this feature is
> called RTM but on POWER it's called HTM, standing for Hardware Transactional
> Memory).
> 
> When a SIGTRAP or a SIGILL is generated by the execution of a `trap` instruction
> or an illegal instruction at the beginning of a not entrant or zombie method and
> it happens in the middle of an HTM transaction, it fails the HTM transaction.
> 
> As a consequence two different ucontext_t structs are set by the Linux kernel.
> One context is related to the HTM block that failed while the other context is
> related to where the offending instruction was executed, i.e. the method
> containing the `trap` or illegal instruction. Currently the JVM signal handler for
> Linux/PPC64 just inspects the context related to the failed HTM block and when
> it verifies the value of nip, i.e. the Next Instruction Pointer set at
> uc->uc_mcontext.regs->nip, by calling os::Linux::ucontext_get_pc(uc), the signal
> handler does not find the offending instruction but instead the instruction
> located at tbegin+4, which is a branch to the HTM failure handler, as
> explained here [1].
> 
> A simple test case is:
> java -XX:+UnlockExperimentalVMOptions -XX:+UseRTMForStackLocks -XX:+UseRTMLocking
> 
> The issue first appeared in the
> compiler/rtm/cli/TestUseRTMForStackLocksOptionOnSupportedConfig.java
> jtreg test:
> http://hastebin.com/raw/ufodiduqeh
> 
> Please, refer to the following hs_err log:
> http://hastebin.com/raw/zucifaxoce
> 
> In this log, si_addr=0x00003fff60460c10 (where the trap instruction is) but
> pc=0x00003fff60455ec4 (which points to tbegin+4, i.e. a beq instruction to the
> HTM failure handler, and not to a trap instruction).
> 
> 0x00003fff60455ec0: .long 0x7c00051d (tbegin.)
> 0x00003fff60455ec4: beq-    0x00003fff60455ee0 <======= pc = HTM failure handler
> 0x00003fff60455ec8: ld      r14,0(r3)                  and not trap (or illegal) instr.
> 0x00003fff60455ecc: clrldi  r0,r14,61
> 0x00003fff60455ed0: cmpwi   cr5,r0,1
> 0x00003fff60455ed4: beq-    cr5,0x00003fff60455ff4
> 0x00003fff60455ed8: .long 0x7c00055d (tend.)
> 
> Once in the signal handler, the pc is normally equal to si_addr, thus pc must
> point to the trap instruction located in the marked not entrant (or zombie)
> method. But when the JVM handler inspects pc it can't find a trap instruction
> (or otherwise an illegal instruction if -XX:-UseSIGTRAP flag is used). So it's
> an invalid condition for the JVM signal handler and the handler hits the
> report_and_die.
> 
> Here are two examples of it, one using a trap instruction to mark a not entrant
> method and another using an illegal instruction for the same purpose:
> http://hastebin.com/raw/avahoyadik
> 
> It's important to mention that the crash is indeed intermittent, so a few times
> a run will just not crash the JVM (it seems that the issue gets worse if the
> number of threads increases).
> 
> The solution I found consists in restoring the right context that points to the
> not entrant method, which is stored by the kernel in a second ucontext_t struct
> in case a signal is caught in the middle of an HTM transaction, as explained
> here [2].
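
The context selection described above can be sketched in isolation as follows. This is only an illustration under stated assumptions: SketchContext is a hypothetical stand-in for ucontext_t (its `link` field models uc->uc_link and `nip` models uc->uc_mcontext.regs->nip), and the helper name is invented; the two addresses are the pc and si_addr values from the hs_err log above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ucontext_t, reduced to the two fields that
// matter here. Not the real kernel or glibc layout.
struct SketchContext {
  SketchContext* link;  // models uc->uc_link (second context, may be null)
  uint64_t nip;         // models uc->uc_mcontext.regs->nip
};

// If a second context is linked, the signal arrived inside a transaction
// and the linked context describes the faulting instruction; otherwise
// the first context is already the right one.
inline const SketchContext* context_of_faulting_instruction(
    const SketchContext* uc) {
  return uc->link != nullptr ? uc->link : uc;
}

// Self-check using the pc and si_addr values from the hs_err log.
inline void self_check_context() {
  SketchContext transactional = {nullptr, 0x00003fff60460c10ull};  // trap
  SketchContext checkpointed = {&transactional, 0x00003fff60455ec4ull};  // tbegin+4
  assert(context_of_faulting_instruction(&checkpointed)->nip ==
         0x00003fff60460c10ull);
  assert(context_of_faulting_instruction(&transactional) == &transactional);
}
```

Unlike the proposed patch, the sketch does not modify the linked context; it only picks which one to inspect.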
> 
> The following patch is proposed to solve the issue, i.e. now
> compiler/rtm/cli/TestUseRTMForStackLocksOptionOnSupportedConfig.java always
> passes:
> 
> 
> diff -r adc8c84b7cf8 src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp
> --- a/src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp	Fri Jul 01 11:29:55 2016 +0200
> +++ b/src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp	Wed Jul 20 21:52:08 2016 -0400
> @@ -219,6 +219,28 @@
>                          int abort_if_unrecognized) {
>    ucontext_t* uc = (ucontext_t*) ucVoid;
> 
> +  // A second thread context exists if the signal is delivered during a
> +  // transaction. Please see kernel doc transactional_memory.txt, L99-101:
> +  // https://goo.gl/E1xbxZ
> +  ucontext_t* transaction_uc = uc->uc_link;
> +
> +  // If uc->uc_link != NULL, then the signal happened during a transaction, as
> +  // pointed out in L106-107 (ibidem). MSR.TS bit must be checked for future
> +  // compatibility, but for now just checking uc->uc_link is ok.
> +  //
> +  // The JVM signal handler expects the context where a `trap` or
> +  // an illegal instruction occurs (i.e. at the beginning of a method marked as
> +  // not entrant or zombie), but if the first context `uc` is used it contains
> +  // the context of the HTM block, thus uc->uc_mcontext.regs->nip points to
> +  // tbegin+4, as explained in L103-104 (ibidem). Hence it's necessary to
> +  // restore the context where the `trap` or the illegal instruction are, which
> +  // is the second context in uc->uc_link.
> +  if (transaction_uc) {
> +    uc = transaction_uc;
> +    uc->uc_link = NULL;
> +    ucVoid = (void*) uc;
> +  }
> +
>    Thread* t = Thread::current_or_null_safe();
> 
>    SignalHandlerMark shm(t);
> 
> Is it possible to open a bug for this issue?
> 
> Thank you and best regards,
> Gustavo
> 
> [1]
> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/transactional_memory.txt#L96-L105
> [2]
> https://github.com/torvalds/linux/blob/master/Documentation/powerpc/transactional_memory.txt#L106-L107