PPC64 JVM crashes when RTM is enabled

Thu Jul 21 18:30:04 UTC 2016

Hi,

As of now (jdk9/hs-comp, f3c27d6d4ad1 tip), the JVM crashes when a signal is
delivered in the middle of an HTM transaction on PPC64 (on x64 this feature is
called RTM, but on POWER it is called HTM, which stands for Hardware
Transactional Memory).

When a SIGTRAP or a SIGILL is raised by a `trap` instruction or an illegal
instruction at the beginning of a not entrant or zombie method, and this
happens in the middle of an HTM transaction, the transaction fails.

As a consequence, the Linux kernel sets up two different ucontext_t structs.
One context describes the HTM block that failed, while the other describes the
point where the offending instruction was executed, i.e. the method containing
the `trap` or illegal instruction. Currently the JVM signal handler for
Linux/PPC64 inspects only the context of the failed HTM block. When it reads
nip, i.e. the Next Instruction Pointer stored at uc->uc_mcontext.regs->nip, by
calling os::Linux::ucontext_get_pc(uc), it does not find the offending
instruction but the instruction located at tbegin+4, which is a branch to the
HTM failure handler, as explained in [1].

A simple test case is:
java -XX:+UnlockExperimentalVMOptions -XX:+UseRTMForStackLocks -XX:+UseRTMLocking

The issue first appeared in the
compiler/rtm/cli/TestUseRTMForStackLocksOptionOnSupportedConfig.java jtreg test:
http://hastebin.com/raw/ufodiduqeh

Please, refer to the following hs_err log: http://hastebin.com/raw/zucifaxoce

In this log, si_addr=0x00003fff60460c10 (where the trap instruction is) but
pc=0x00003fff60455ec4 (which points to tbegin+4, i.e. a beq instruction to the
HTM failure handler, and not to a trap instruction).

0x00003fff60455ec0: .long 0x7c00051d (tbegin.)
0x00003fff60455ec4: beq-    0x00003fff60455ee0 <======= pc = HTM failure handler
0x00003fff60455ec8: ld      r14,0(r3)           and not trap (or illegal) instr.
0x00003fff60455ecc: clrldi  r0,r14,61
0x00003fff60455ed0: cmpwi   cr5,r0,1
0x00003fff60455ed4: beq-    cr5,0x00003fff60455ff4
0x00003fff60455ed8: .long 0x7c00055d (tend.)
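
As a sanity check on the listing above, here is a minimal sketch that decodes
the two raw words and confirms that the reported pc is exactly tbegin+4. The
helper names (is_tbegin_dot, is_pc_right_after_tbegin) are hypothetical and are
not part of HotSpot; the encodings are derived from the words shown in the
listing (primary opcode 31, extended opcodes 654 and 686, Rc bit set).

```cpp
#include <cstdint>

// Encodings of the two transactional instructions seen in the listing.
// Both are X-form instructions with primary opcode 31; the trailing 1 is
// the Rc bit that the "." suffix denotes.
constexpr uint32_t kPrimaryOpcode31 = 31u << 26;
constexpr uint32_t kTbeginDot = kPrimaryOpcode31 | (654u << 1) | 1u;
constexpr uint32_t kTendDot   = kPrimaryOpcode31 | (686u << 1) | 1u;

static_assert(kTbeginDot == 0x7c00051du, "matches the .long in the listing");
static_assert(kTendDot   == 0x7c00055du, "matches the .long in the listing");

// Hypothetical helper: true if the word is a tbegin. instruction. Only the
// primary opcode, the extended opcode and the Rc bit are compared, so the
// operand fields in between are ignored.
inline bool is_tbegin_dot(uint32_t insn) {
  constexpr uint32_t kOpcodeMask = 0xfc0007ffu;
  return (insn & kOpcodeMask) == kTbeginDot;
}

// Hypothetical helper: true if pc is the instruction that immediately
// follows the tbegin. located at tbegin_addr (instructions are 4 bytes).
inline bool is_pc_right_after_tbegin(uint64_t pc, uint64_t tbegin_addr) {
  return pc == tbegin_addr + 4;
}
```

With the values from the hs_err log, is_tbegin_dot(0x7c00051d) holds and
is_pc_right_after_tbegin(0x00003fff60455ec4, 0x00003fff60455ec0) holds, while
si_addr (0x00003fff60460c10) is a different address altogether.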

In the signal handler, pc is normally equal to si_addr, so pc is expected to
point to the trap instruction located in the method marked not entrant (or
zombie). But when the JVM handler inspects pc, it does not find a trap
instruction (or an illegal instruction, if the -XX:-UseSIGTRAP flag is used).
The handler treats this as an invalid condition and ends up in report_and_die.

Here are two examples of it, one using a trap instruction to mark a not entrant
method and another using an illegal instruction for the same purpose:
http://hastebin.com/raw/avahoyadik

It's important to mention that the crash is intermittent, so some runs do not
crash the JVM (the issue seems to get worse as the number of threads
increases).

The solution I found consists of restoring the right context, the one that
points to the not entrant method. The kernel stores it in a second ucontext_t
struct when a signal is caught in the middle of an HTM transaction, as
explained in [2].
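
To make the idea concrete, here is a minimal standalone sketch of the context
selection. It assumes that the second context is reachable through
uc->uc_link and that a non-NULL uc_link means the signal arrived during a
transaction. The helper name context_of_faulting_instruction is hypothetical
and only illustrates the check; it is not HotSpot code.

```cpp
#include <ucontext.h>

// Hypothetical helper: return the context the signal handler should inspect.
// Assumption: if the signal was delivered during an HTM transaction,
// uc->uc_link points to a second context describing the place where the
// trap or illegal instruction was actually executed. Otherwise uc_link is
// NULL and the first context is already the right one.
inline ucontext_t* context_of_faulting_instruction(ucontext_t* uc) {
  if (uc != nullptr && uc->uc_link != nullptr) {
    return uc->uc_link;  // signal taken inside a transaction
  }
  return uc;             // no transaction involved
}
```

Unlike the patch below, this sketch does not modify either context; it only
picks which one to read nip from.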

The following patch is proposed to solve the issue. With it,
compiler/rtm/cli/TestUseRTMForStackLocksOptionOnSupportedConfig.java always
passes:


diff -r adc8c84b7cf8 src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp

--- a/src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp	Fri Jul 01 11:29:55 2016 +0200
+++ b/src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp	Wed Jul 20 21:52:08 2016 -0400
@@ -219,6 +219,28 @@
                         int abort_if_unrecognized) {
   ucontext_t* uc = (ucontext_t*) ucVoid;

+  // A second thread context exists if the signal is delivered during a
+  // transaction. Please see kernel doc transactional_memory.txt, L99-101:
+  // https://goo.gl/E1xbxZ
+  ucontext_t* transaction_uc = uc->uc_link;
+
+  // If uc->uc_link != NULL, then the signal happened during a transaction, as
+  // pointed out in L106-107 (ibidem). MSR.TS bit must be checked for future
+  // compatibility, but for now just checking uc->uc_link is ok.
+  //
+  // The JVM signal handler expects the context where a `trap` or
+  // an illegal instruction occurs (i.e. at the beginning of a method marked as
+  // not entrant or zombie), but if the first context `uc` is used it contains
+  // the context of the HTM block, thus uc->uc_mcontext.regs->nip points to
+  // tbegin+4, as explained in L103-104 (ibidem). Hence it's necessary to
+  // restore the context where the `trap` or the illegal instruction are, which
+  // is the second context in uc->uc_link.
+  if (transaction_uc) {
+    uc = transaction_uc;
+    uc->uc_link = NULL;
+    ucVoid = (void*) uc;
+  }
+
   Thread* t = Thread::current_or_null_safe();

   SignalHandlerMark shm(t);

Is it possible to open a bug for this issue?

Thank you and best regards,
Gustavo

[1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/transactional_memory.txt#L96-L105
[2] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/transactional_memory.txt#L106-L107