PPC64 JVM crashes when RTM is enabled

Wed Jul 27 00:48:06 UTC 2016

Hi Goetz, Thomas, Martin

Goetz, my knowledge of the AIX kernel is close to zero, so I could not say
for sure, before testing, whether the fix is also required on AIX.
Unfortunately I have so far been unable to set up an AIX environment to run
additional tests of the HTM behavior on AIX.

That said, judging by the AIX docs [1], AIX seems to have a quite similar
mechanism for handling an HTM abort caused by a signal, i.e. it sets up a
second context (or extended context). The only differences I see are the
member names, like __extctx instead of uc_link, and the way to check whether
such an extended context exists: on AIX there is a member called
__extctx_magic that must be checked against __EXTCTX_MAGIC, and if they are
equal, that indicates we got a signal in the middle of an HTM transaction.

    unsigned long long __extctx;  /* address of extended context    */
    int __extctx_magic;           /* if set to __EXTCTX_MAGIC, then */
                                  /* extended context is present    */
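A minimal sketch of that check, assuming nothing beyond the two members
quoted above. MockAixContext, kMockExtctxMagic and has_extended_context are
hypothetical names made up for illustration (the real members and the real
__EXTCTX_MAGIC value live in the AIX headers and are not reproduced here):

```cpp
#include <cassert>

// Placeholder value only: the real __EXTCTX_MAGIC is defined by the AIX
// headers. This constant is an assumption made for the sketch.
constexpr int kMockExtctxMagic = 0x1234;

// Hypothetical stand-in for the part of the AIX context that carries the
// two members quoted above (__extctx and __extctx_magic).
struct MockAixContext {
  unsigned long long extctx = 0;  // address of extended context
  int extctx_magic = 0;           // equals the magic if extended context is present
};

// Returns true if the context says an extended context is present, which,
// per the reading of the AIX docs above, is how a signal taken in the
// middle of an HTM transaction would be recognized.
inline bool has_extended_context(const MockAixContext& uc) {
  return uc.extctx_magic == kMockExtctxMagic;
}
```

On Linux the equivalent test in my patch is simply uc->uc_link != NULL.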

Nonetheless, Martin proposes a different solution, which consists of not
switching to the context where the trap/illegal instruction occurred, but
instead using the current one and letting it fall into the HTM failure
handler, so the second (extended) context is not used.

Martin, I've tested on Linux the solution you tried on AIX, and it works.
But I've got a question: if we detect that the signal happened in the
middle of an HTM transaction just by checking for the presence of tbegin. at
pc-4, aren't we missing a crash in case an illegal instruction is
generated (and executed) /not/ intentionally in the middle of an HTM
transaction, due to, let's say, a bug in the JIT compiler?
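To make the concern concrete, here is a rough, self-contained sketch of the
pc-4 check. It is not the HotSpot code: is_tbegin_word and pc_follows_tbegin
are hypothetical helpers, the code buffer is a plain array indexed in
instruction words, and the tbegin. encoding is simply the word shown in the
hs_err disassembly quoted further down in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Encoding of tbegin. as it appears in the hs_err disassembly quoted in
// this thread. Other operand variants are deliberately not handled here.
constexpr std::uint32_t kTbeginWord = 0x7c00051d;

// Hypothetical helper: does this instruction word encode tbegin.?
inline bool is_tbegin_word(std::uint32_t inst) {
  return inst == kTbeginWord;
}

// Hypothetical helper: pc_index is the index (in instruction words) of the
// instruction the reported pc points at. The check looks only at the word
// right before it, i.e. at pc-4 in byte terms.
inline bool pc_follows_tbegin(const std::uint32_t* code, std::size_t n_words,
                              std::size_t pc_index) {
  if (pc_index == 0 || pc_index > n_words) {
    return false;  // no preceding instruction inside the buffer
  }
  return is_tbegin_word(code[pc_index - 1]);
}
```

Note that the check never looks at si_addr or at the instruction that
actually raised the signal, so an intended trap and an unintended illegal
instruction inside the transaction look exactly the same to it.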

Anyway, I think that managing to fall into the HTM failure handler is
the correct thing to do, instead of switching to the second context (my
first approach to the problem). For instance, in a nested HTM:

tbegin.
beq failure_handler
tbegin.
beq failure_handler
<valid instruction>
...
<trap/illegal instruction>
<valid instruction>
...
tend.
tend.

it seems wrong to return control to the second context (the inner
trap/illegal instruction) and not to the outermost failure_handler (just as
a curiosity: no matter how many levels of nested HTM blocks there are, if
any of them fails, pc will be set to the outermost HTM failure handler).
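The nesting behavior can be pictured with a toy model like the one below.
It is only an illustration, not a description of the hardware or of any
JVM code: ToyTransactionState and its members are made-up names, and the
resume address is modeled as tbegin+4 of the outermost tbegin., matching
what the hs_err log in this thread shows for pc.

```cpp
#include <cassert>
#include <cstddef>

// Toy model (an assumption for illustration): nested transactions are
// tracked by a depth counter, and only the outermost tbegin. records
// where execution resumes after a failure.
struct ToyTransactionState {
  int depth = 0;
  std::size_t resume_pc = 0;  // address of the instruction after the outermost tbegin.

  void tbegin(std::size_t tbegin_pc) {
    if (depth == 0) {
      resume_pc = tbegin_pc + 4;  // the beq to the failure handler
    }
    ++depth;
  }

  void tend() {
    if (depth > 0) {
      --depth;
    }
  }

  // A failure at any nesting level ends the whole transaction and
  // returns the pc at which execution continues.
  std::size_t fail() {
    depth = 0;
    return resume_pc;
  }

  bool active() const { return depth > 0; }
};
```

In this model a failure inside the inner block still resumes at the beq
that follows the outer tbegin., never at anything belonging to the inner
block.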

Thomas, in any case both contexts are valid and we can choose either (or
both) to include in the report. There are also other registers, like
TEXASR, that can indicate precisely why the HTM transaction failed. But
I'm not sure if we should include them in the crash report.

Should we address the case when an unintentional trap/illegal instruction is
caught in the middle of an HTM transaction by including additional checks?

Should we add any additional information to the report to indicate that it
happened in the middle of an HTM transaction?

Goetz, thanks for opening the bug.

Thank you and kind regards,
Gustavo

[1] https://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.genprogc/transactional_memory.htm

On 26-07-2016 11:24, Doerr, Martin wrote:
> Hi Gustavo and all,
> 
> thanks for investigating. We have only tested RTM on AIX which seems to work fine.
> We have inserted the following code into JVM_handle_aix_signal:
> 
> #if INCLUDE_RTM_OPT
>     {
>       int *inst_ptr = (int*)(pc - BytesPerInstWord);
>       if (CodeCache::contains((address)inst_ptr) && MacroAssembler::is_tbegin(*inst_ptr)) {
>         // Ignore transaction abort due to signal. Will jump to abort handler.
>         if (TraceTraps) {
>           tty->print_cr("Caught signal %d in transaction. Ignoring to jump to abort handler.", sig);
>         }
>         return true;
>       }
>     }
> #endif
> 
> The transaction always got aborted and we never ran into this code.
> 
> I had made the same experiment on linux in JVM_handle_linux_signal where we ran into it. But RTM didn't work as expected because I was using an old linux version which doesn't treat system calls as required.
> 
> Maybe you would like to use this detection and figure out the context of the signaling instruction.
> 
> Please keep in mind that the C/C++ code must be compilable on old linux versions (at least for big endian). But I guess this shouldn't be an issue as the uc_link isn't new.
> 
> Can you provide a webrev? I can sponsor the fix.
> 
> Best regards,
> Martin
> 
> -----Original Message-----
> From: hotspot-dev [mailto:hotspot-dev-bounces at openjdk.java.net] On Behalf Of Lindenmaier, Goetz
> Sent: Freitag, 22. Juli 2016 08:44
> To: Gustavo Romero <gromero at linux.vnet.ibm.com>; ppc-aix-port-dev at openjdk.java.net; hotspot-dev at openjdk.java.net
> Cc: Breno Leitao <brenohl at br.ibm.com>
> Subject: RE: PPC64 JVM crashes when RTM is enabled
> 
> Hi Gustavo,
> 
> very neat analysis!  I opened 
> https://bugs.openjdk.java.net/browse/JDK-8162369
> 
> Does AIX require a similar fix?
> 
> Best regards,
>   Goetz.
> 
> 
>> -----Original Message-----
>> From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-
>> bounces at openjdk.java.net] On Behalf Of Gustavo Romero
>> Sent: Donnerstag, 21. Juli 2016 20:30
>> To: ppc-aix-port-dev at openjdk.java.net; hotspot-dev at openjdk.java.net
>> Cc: Breno Leitao <brenohl at br.ibm.com>
>> Subject: PPC64 JVM crashes when RTM is enabled
>> Importance: High
>>
>> Hi
>>
>> As of now (jdk9/hs-comp, f3c27d6d4ad1 tip), JVM crashes due to the
>> delivery of
>> a signal in the middle of an HTM transaction on PPC64 (on x64 this feature is
>> called RTM but on POWER it's called HTM, standing for Hardware
>> Transactional
>> Memory).
>>
>> When a SIGTRAP or a SIGILL is generated by the execution of a `trap`
>> instruction
>> or an illegal instruction at the beginning of a not entrant or zombie method
>> and
>> it happens in the middle of an HTM transaction, it fails the HTM transaction.
>>
>> As a consequence two different ucontext_t structs are set by the Linux
>> kernel.
>> One context is related to the HTM block that failed while the other context is
>> related to where the offending instruction was executed, i.e. the method
>> containing the `trap` or illegal instruction. Currently the JVM signal handler for
>> Linux/PPC64 just inspects the context related to the failed HTM block and
>> when
>> it verifies the value of nip, i.e. the Next Instruction Pointer set at
>> uc->uc_mcontext.regs->nip, by calling os::Linux::ucontext_get_pc(uc), the
>> signal
>> handler does not find the offending instruction but instead the instruction
>> located at tbegin+4, that consists in a branch to the HTM failure handler, as
>> explained here [1].
>>
>> A simple test case is:
>> java -XX:+UnlockExperimentalVMOptions -XX:+UseRTMForStackLocks -XX:+UseRTMLocking
>>
>> The issue first appeared in the
>> compiler/rtm/cli/TestUseRTMForStackLocksOptionOnSupportedConfig.java
>> jtreg test:
>> http://hastebin.com/raw/ufodiduqeh
>>
>> Please, refer to the following hs_err log:
>> http://hastebin.com/raw/zucifaxoce
>>
>> In this log, si_addr=0x00003fff60460c10 (where the trap instruction is) but
>> pc=0x00003fff60455ec4 (which points to tbegin+4, i.e. a beq instruction to
>> the
>> HTM failure handler, and not to a trap instruction).
>>
>> 0x00003fff60455ec0: .long 0x7c00051d (tbegin.)
>> 0x00003fff60455ec4: beq-    0x00003fff60455ee0 <======= pc = HTM failure
>> handler
>> 0x00003fff60455ec8: ld      r14,0(r3)           and not trap (or illegal) instr.
>> 0x00003fff60455ecc: clrldi  r0,r14,61
>> 0x00003fff60455ed0: cmpwi   cr5,r0,1
>> 0x00003fff60455ed4: beq-    cr5,0x00003fff60455ff4
>> 0x00003fff60455ed8: .long 0x7c00055d (tend.)
>>
>> Once in the signal handler, the pc is normally equal to si_addr, thus pc must
>> point to the trap instruction located in the marked not entrant (or zombie)
>> method. But when the JVM handler inspects pc it can't find a trap instruction
>> (or otherwise an illegal instruction if -XX:-UseSIGTRAP flag is used). So it's
>> an invalid condition for the JVM signal handler and the handler hits the
>> report_and_die.
>>
>> Here are two examples of it, one using a trap instruction to mark a not
>> entrant
>> method and another using a illegal instruction for the same purpose:
>> http://hastebin.com/raw/avahoyadik It's important to mention that the
>> crash is
>> indeed intermittent, so a few times a run will just not crash the JVM (it
>> seems
>> that the issue gets worse if the number of threads increase).
>>
>> The solution I found consists in restoring the right context that points to the
>> not entrant method, which is stored by the kernel in a second ucontext_t
>> struct
>> in case a signal is caught in the middle of an HTM transaction, as explained in
>> here [2].
>>
>> The following patch is proposed to solve the issue, i.e. now
>> compiler/rtm/cli/TestUseRTMForStackLocksOptionOnSupportedConfig.java
>> always
>> passes:
>>
>>
>> diff -r adc8c84b7cf8 src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp
>> --- a/src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp	Fri Jul 01 11:29:55 2016 +0200
>> +++ b/src/os_cpu/linux_ppc/vm/os_linux_ppc.cpp	Wed Jul 20 21:52:08 2016 -0400
>> @@ -219,6 +219,28 @@
>>                          int abort_if_unrecognized) {
>>    ucontext_t* uc = (ucontext_t*) ucVoid;
>>
>> +  // A second thread context exists if the signal is delivered during a
>> +  // transaction. Please see kernel doc transactional_memory.txt, L99-101:
>> +  // https://goo.gl/E1xbxZ
>> +  ucontext_t* transaction_uc = uc->uc_link;
>> +
>> +  // If uc->uc_link != NULL, then the signal happened during a transaction, as
>> +  // pointed out in L106-107 (ibidem). MSR.TS bit must be checked for future
>> +  // compatibility, but for now just checking uc->uc_link is ok.
>> +  //
>> +  // The JVM signal handler expects the context where a `trap` or
>> +  // an illegal instruction occurs (i.e. at the beginning of a method marked as
>> +  // not entrant or zombie), but if the first context `uc` is used it contains
>> +  // the context of the HTM block, thus uc->uc_mcontext.regs->nip points to
>> +  // tbegin+4, as explained in L103-104 (ibidem). Hence it's necessary to
>> +  // restore the context where the `trap` or the illegal instruction are, which
>> +  // is the second context in uc->uc_link.
>> +  if (transaction_uc) {
>> +    uc = transaction_uc;
>> +    uc->uc_link = NULL;
>> +    ucVoid = (void*) uc;
>> +  }
>> +
>>    Thread* t = Thread::current_or_null_safe();
>>
>>    SignalHandlerMark shm(t);
>>
>> Is it possible to open a bug for this issue?
>>
>> Thank you and best regards,
>> Gustavo
>>
>> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/transactional_memory.txt#L96-L105
>> [2] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/transactional_memory.txt#L106-L107
>