PPC64 JVM crashes when RTM is enabled

Thomas Stüfe thomas.stuefe at gmail.com
Fri Jul 22 08:30:50 UTC 2016


I agree, this is a very good analysis!

Short question though, does this mean the original context is of no value
at all? So, for the purpose of error reporting, which register set should
we print, the original one or the one from uc_link?
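
To make the question concrete, here is a minimal sketch (not HotSpot code) that keeps both contexts available instead of overwriting `uc`, so that error reporting could print either register set. The names `SignalContexts` and `split_contexts` are hypothetical, and the sketch assumes, as the quoted patch comment does, that a non-NULL `uc_link` means the signal was taken inside a transaction.

```cpp
#include <cassert>
#include <ucontext.h>

// Hypothetical helper type, not part of HotSpot: holds both contexts so a
// caller can decide later which register set to print.
struct SignalContexts {
  ucontext_t* delivered;   // context handed to the signal handler (HTM block)
  ucontext_t* faulting;    // context where the trap/illegal instruction is
  bool in_transaction;     // assumption: true iff uc_link was non-NULL
};

// Splits the handler argument into the two contexts without modifying it.
inline SignalContexts split_contexts(void* ucVoid) {
  assert(ucVoid != nullptr);
  ucontext_t* uc = static_cast<ucontext_t*>(ucVoid);
  ucontext_t* linked = uc->uc_link;

  SignalContexts c;
  c.delivered = uc;
  c.faulting = (linked != nullptr) ? linked : uc;
  c.in_transaction = (linked != nullptr);
  return c;
}
```

Unlike the quoted patch, this variant leaves `uc->uc_link` untouched, so the original context is still reachable when the error report is written.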

Kind Regards, Thomas

On Fri, Jul 22, 2016 at 8:44 AM, Lindenmaier, Goetz <
goetz.lindenmaier at sap.com> wrote:

> Hi Gustavo,
>
> very neat analysis!  I opened
> https://bugs.openjdk.java.net/browse/JDK-8162369
>
> Does AIX require a similar fix?
>
> Best regards,
>   Goetz.
>
>
> > -----Original Message-----
> > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-bounces at openjdk.java.net] On Behalf Of Gustavo Romero
> > Sent: Thursday, July 21, 2016 20:30
> > To: ppc-aix-port-dev at openjdk.java.net; hotspot-dev at openjdk.java.net
> > Cc: Breno Leitao <brenohl at br.ibm.com>
> > Subject: PPC64 JVM crashes when RTM is enabled
> > Importance: High
> >
> > Hi
> >
> > As of now (jdk9/hs-comp, f3c27d6d4ad1 tip), JVM crashes due to the
> > delivery of a signal in the middle of an HTM transaction on PPC64 (on
> > x64 this feature is called RTM but on POWER it's called HTM, standing
> > for Hardware Transactional Memory).
> >
> > When a SIGTRAP or a SIGILL is generated by the execution of a `trap`
> > instruction or an illegal instruction at the beginning of a not entrant
> > or zombie method, and it happens in the middle of an HTM transaction,
> > it fails the HTM transaction.
> >
> > As a consequence, two different ucontext_t structs are set by the Linux
> > kernel. One context is related to the HTM block that failed, while the
> > other context is related to where the offending instruction was
> > executed, i.e. the method containing the `trap` or illegal instruction.
> > Currently the JVM signal handler for Linux/PPC64 just inspects the
> > context related to the failed HTM block, and when it verifies the value
> > of nip, i.e. the Next Instruction Pointer set at
> > uc->uc_mcontext.regs->nip, by calling os::Linux::ucontext_get_pc(uc),
> > the signal handler does not find the offending instruction but instead
> > the instruction located at tbegin+4, which is a branch to the HTM
> > failure handler, as explained here [1].
> >
> > A simple test case is:
> > java -XX:+UnlockExperimentalVMOptions -XX:+UseRTMForStackLocks -XX:+UseRTMLocking
> >
> > The issue first appeared in the
> > compiler/rtm/cli/TestUseRTMForStackLocksOptionOnSupportedConfig.java
> > jtreg test:
> > http://hastebin.com/raw/ufodiduqeh
> >
> > Please, refer to the following hs_err log:
> > http://hastebin.com/raw/zucifaxoce
> >
> > In this log, si_addr=0x00003fff60460c10 (where the trap instruction is)
> > but pc=0x00003fff60455ec4 (which points to tbegin+4, i.e. a beq
> > instruction to the HTM failure handler, and not to a trap instruction).
> >
> > 0x00003fff60455ec0: .long 0x7c00051d (tbegin.)
> > 0x00003fff60455ec4: beq-    0x00003fff60455ee0 <======= pc = HTM failure handler
> > 0x00003fff60455ec8: ld      r14,0(r3)                   and not trap (or illegal) instr.
> > 0x00003fff60455ecc: clrldi  r0,r14,61
> > 0x00003fff60455ed0: cmpwi   cr5,r0,1
> > 0x00003fff60455ed4: beq-    cr5,0x00003fff60455ff4
> > 0x00003fff60455ed8: .long 0x7c00055d (tend.)
> >
> > Once in the signal handler, the pc is normally equal to si_addr, thus pc
> > must point to the trap instruction located in the method marked not
> > entrant (or zombie). But when the JVM handler inspects pc it can't find
> > a trap instruction (or otherwise an illegal instruction if the
> > -XX:-UseSIGTRAP flag is used). So it's an invalid condition for the JVM
> > signal handler and the handler hits report_and_die.
> >
> > Here are two examples of it, one using a trap instruction to mark a not
> > entrant method and another using an illegal instruction for the same
> > purpose:
> > http://hastebin.com/raw/avahoyadik
> >
> > It's important to mention that the crash is indeed intermittent, so
> > sometimes a run will not crash the JVM (it seems that the issue gets
> > worse as the number of threads increases).
> >
> > The solution I found consists in restoring the right context that points
> > to the not entrant method, which is stored by the kernel in a second
> > ucontext_t struct in case a signal is caught in the middle of an HTM
> > transaction, as explained here [2].
> >
> > The following patch is proposed to solve the issue, i.e. now
> > compiler/rtm/cli/TestUseRTMForStackLocksOptionOnSupportedConfig.java
> > always passes:
> >
> >
> > diff -r adc8c84b7cf8 src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp
> > --- a/src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp        Fri Jul 01 11:29:55 2016 +0200
> > +++ b/src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp        Wed Jul 20 21:52:08 2016 -0400
> > @@ -219,6 +219,28 @@
> >                          int abort_if_unrecognized) {
> >    ucontext_t* uc = (ucontext_t*) ucVoid;
> >
> > +  // A second thread context exists if the signal is delivered during a
> > +  // transaction. Please see kernel doc transactional_memory.txt, L99-101:
> > +  // https://goo.gl/E1xbxZ
> > +  ucontext_t* transaction_uc = uc->uc_link;
> > +
> > +  // If uc->uc_link != NULL, then the signal happened during a transaction, as
> > +  // pointed out in L106-107 (ibidem). MSR.TS bit must be checked for future
> > +  // compatibility, but for now just checking uc->uc_link is ok.
> > +  //
> > +  // The JVM signal handler expects the context where a `trap` or
> > +  // an illegal instruction occurs (i.e. at the beginning of a method marked as
> > +  // not entrant or zombie), but if the first context `uc` is used it contains
> > +  // the context of the HTM block, thus uc->uc_mcontext.regs->nip points to
> > +  // tbegin+4, as explained in L103-104 (ibidem). Hence it's necessary to
> > +  // restore the context where the `trap` or the illegal instruction are, which
> > +  // is the second context in uc->uc_link.
> > +  if (transaction_uc) {
> > +    uc = transaction_uc;
> > +    uc->uc_link = NULL;
> > +    ucVoid = (void*) uc;
> > +  }
> > +
> >    Thread* t = Thread::current_or_null_safe();
> >
> >    SignalHandlerMark shm(t);
> >
> > Is it possible to open a bug for this issue?
> >
> > Thank you and best regards,
> > Gustavo
> >
> > [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/transactional_memory.txt#L96-L105
> > [2] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/transactional_memory.txt#L106-L107
>
>
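
The pc/si_addr mismatch described in the quoted analysis can be sketched as a standalone check. This is an illustration under stated assumptions, not HotSpot code: the function names are hypothetical, the `tbegin.` encoding is the exact word shown in the quoted disassembly (other operand variants of the instruction are not handled), and the caller is assumed to guarantee that the four bytes before `pc` are readable.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Encoding copied from the disassembly in the quoted message.
constexpr std::uint32_t kTbeginInsn = 0x7c00051d;  // tbegin.

// Returns true if the 4-byte instruction word immediately before `pc` is the
// tbegin. encoding above, i.e. pc == tbegin+4. The word is read in host byte
// order; the caller must ensure pc - 4 is readable.
inline bool pc_is_after_tbegin(const unsigned char* pc) {
  assert(pc != nullptr);
  std::uint32_t insn;
  std::memcpy(&insn, pc - 4, sizeof insn);
  return insn == kTbeginInsn;
}

// Hypothetical heuristic: the fault address differs from pc, and pc sits
// right after a tbegin., which is the pattern seen in the hs_err log.
inline bool looks_like_failed_transaction(const void* si_addr,
                                          const unsigned char* pc) {
  return si_addr != static_cast<const void*>(pc) && pc_is_after_tbegin(pc);
}
```

With the values from the log (si_addr=0x00003fff60460c10, pc=0x00003fff60455ec4), the word before pc is the `tbegin.` at 0x00003fff60455ec0, so such a check would flag the context as belonging to the failed HTM block.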