SIGILL crashes JVM on PPC64 LE
Gustavo Romero
gromero at linux.vnet.ibm.com
Wed Jun 1 18:21:55 UTC 2016
Hi Volker
You are right, Cassandra's upstream code does not contain arch.equals("ppc64le").
The following patch http://hastebin.com/raw/zusomadace was applied to this
commit: http://cassci.datastax.com/job/trunk_utest/1344 This is how I first
reproduced the issue; maybe Hiroshi is using a more recent commit.
Except for VMX/Altivec instructions, whose operands are assumed to always be
aligned, PPC64 supports unaligned storage access. Since I've been told that
the patch fixed all failing tests on z (which is BE, BTW), the same change
(adding ppc64le) was tried, and that is when the issue emerged.
The initial error was thus seen in the "ant test" suite. This is an example of
a test that fails due to the illegal instruction (there are others):
ant testsome -Dtest.name=org.apache.cassandra.db.NativeCellTest -Dtest.methods=testCells
The following problematic snippet that uses MemoryUtil.getInt has been traced:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/rows/NativeCell.java#L136-L137
This is where the test case was derived from (Hiroshi, please correct me if
I'm wrong).
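
For reference, here is a minimal sketch of the kind of guard under discussion:
an architecture allow-list that chooses between a direct (possibly unaligned)
read and a byte-by-byte fallback. The class and method names are hypothetical,
it reads from a byte[] through java.nio.ByteBuffer rather than through
sun.misc.Unsafe, and the little-endian byte order is an assumption made only
for this example. It is not Cassandra's actual code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not Cassandra's MemoryUtil.
final class UnalignedReadSketch {

    // Assumed allow-list: the upstream architectures plus "ppc64le",
    // mirroring the locally applied patch described in this message.
    static boolean allowsUnaligned(String arch) {
        return arch.equals("i386") || arch.equals("x86")
            || arch.equals("amd64") || arch.equals("x86_64")
            || arch.equals("s390x") || arch.equals("ppc64le");
    }

    // Byte-by-byte fallback: never issues a multi-byte load, so the
    // offset may have any alignment. Little-endian order is assumed.
    static int getIntByByte(byte[] buf, int offset) {
        return (buf[offset] & 0xFF)
             | ((buf[offset + 1] & 0xFF) << 8)
             | ((buf[offset + 2] & 0xFF) << 16)
             | ((buf[offset + 3] & 0xFF) << 24);
    }

    // Direct read at an arbitrary offset, standing in for unsafe.getInt().
    static int getIntDirect(byte[] buf, int offset) {
        return ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN).getInt(offset);
    }

    static int getInt(byte[] buf, int offset) {
        String arch = System.getProperty("os.arch");
        return allowsUnaligned(arch) ? getIntDirect(buf, offset)
                                     : getIntByByte(buf, offset);
    }

    public static void main(String[] args) {
        byte[] buf = {0x10, 0x11, 0x22, 0x33, 0x44, 0x55};
        // Offset 1 is deliberately not a multiple of 4.
        System.out.printf("0x%08X%n", getInt(buf, 1));
    }
}
```

Both paths return the same value; the difference is only whether the JIT is
asked to compile a single load at a possibly unaligned address, which is the
case that triggers the crash on PPC64.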
Thanks a lot for your help!
Gustavo
On 01-06-2016 14:06, Volker Simonis wrote:
> Hi Hiroshi, Gustavo,
>
> I'm currently trying to better understand the cause of the crash.
> When looking at the Cassandra sources [1] I can see that on ppc we
> should actually not call Unsafe.getInt() at all:
>
> UNALIGNED = arch.equals("i386") || arch.equals("x86")
> || arch.equals("amd64") || arch.equals("x86_64") || arch.equals("s390x");
>
> public static int getInt(long address)
> {
> return UNALIGNED ? unsafe.getInt(address) : getIntByByte(address);
> }
>
> Is this behavior different in the version of Cassandra which you have
> used for your tests?
>
> I just want to make sure that the problem we reproduce with your
> stand-alone test case is the same as the one we are seeing in the
> initial Cassandra crash.
>
> Could you please provide the exact versions of Cassandra you have used
> and a description of the tests and the way you have executed them when
> you saw the initial error?
>
> Thanks a lot for your help,
> Volker
>
> [1] https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/memory/MemoryUtil.java
>
> On Wed, Jun 1, 2016 at 10:51 AM, Hiroshi H Horii <HORII at jp.ibm.com> wrote:
>> Hi Volker,
>>
>> Thank you for reviewing our fix.
>>
>> To avoid generating illegal instructions when Idisp is not 4-byte aligned,
>> I changed ppc.ad to always generate two instructions for each ld and lwa, as
>> follows.
>> That is, when Idisp is 4-byte aligned, a redundant nop() is generated.
>>
>> // Operand 'ds' requires 4-alignment.
>> if (Idisp & 0x3) {
>> __ addi($dst$$Register, $mem$$base$$Register, Idisp);
>> __ ld($dst$$Register, 0, $dst$$Register);
>> } else {
>> __ ld($dst$$Register, Idisp, $mem$$base$$Register);
>> __ nop();
>> }
>>
>> I'm not sure whether this fix is elegant.
>>
>> In my understanding, an argument of size(n) in ADL must be constant.
>> Correct?
>> If the number can be dynamic, we can avoid generating nop()...
>> Also, we may be able to fix this bug at a higher level (such as IR
>> generation).
>>
>> Regards,
>> Hiroshi
>> -----------------------
>> Hiroshi Horii, Ph.D.
>> IBM Research - Tokyo
>>
>>
>> Volker Simonis <volker.simonis at gmail.com> wrote on 06/01/2016 15:37:21:
>>
>>> From: Volker Simonis <volker.simonis at gmail.com>
>>> To: Gustavo Romero <gromero at linux.vnet.ibm.com>
>>> Cc: "ppc-aix-port-dev at openjdk.java.net"
>>> <ppc-aix-port-dev at openjdk.java.net>, "hotspot-dev at openjdk.java.net"
>>> <hotspot-dev at openjdk.java.net>, Breno Leitao <brenohl at br.ibm.com>,
>>> Hiroshi H Horii/Japan/IBM at IBMJP
>>> Date: 06/01/2016 15:38
>>> Subject: Re: SIGILL crashes JVM on PPC64 LE
>>
>>>
>>> Hi Gustavo, Hiroshi,
>>>
>>> thanks a lot for the great analysis and the nice stand-alone test
>>> case. This is indeed a problem, and it also occurs on ppc64
>>> big-endian.
>>>
>>> I've opened "8158260: PPC64: unaligned Unsafe.getInt can lead to the
>>> generation of illegal instructions"
>>> (https://bugs.openjdk.java.net/browse/JDK-8158260) for this issue.
>>>
>>> I'm currently looking at your proposed fix and will come back with a
>>> new webrev soon.
>>>
>>> Thanks a lot and best regards,
>>> Volker
>>>
>>>
>>> On Tue, May 31, 2016 at 3:31 AM, Gustavo Romero
>>> <gromero at linux.vnet.ibm.com> wrote:
>>>> Hi Volker
>>>>
>>>> The following test case has been isolated by Hiroshi Horii and generates
>>>> the illegal instruction, crashing the JVM on PPC64 LE:
>>>>
>>>> UnalignedUnsafeAccess.java:
>>>> http://hastebin.com/raw/uqegukific
>>>>
>>>> $ javac UnalignedUnsafeAccess.java
>>>> $ java -Xcomp -Xbatch UnalignedUnsafeAccess
>>>>
>>>> The issue can be reproduced on OpenJDK 8 downstream, OpenJDK 8, and
>>>> OpenJDK 9 - hs_err logs:
>>>>
>>>> OpenJDK 9, tag 0be6f4f5d186 jdk-9+120:
>>>> http://hastebin.com/raw/ecuhukutur
>>>>
>>>> OpenJDK 8, tag 5aaa43d91c73 tip:
>>>> http://hastebin.com/raw/ipohoyafos
>>>>
>>>> OpenJDK 8 downstream:
>>>>
>>>> Ubuntu 16.04 LTS
>>>> build 1.8.0_91-8u91-b14-0ubuntu4~16.04.1-b14
>>>> http://hastebin.com/raw/yetizebofo
>>>>
>>>> RHEL 7.2:
>>>> build 1.8.0_91-b14
>>>> http://hastebin.com/raw/irequfawaw
>>>>
>>>> The crash happens when an illegal instruction - 0xea2f0013 - is
>>>> executed.
>>>>
>>>> The backtrace shows:
>>>>
>>>> Stack: [0x00003fff56030000,0x00003fff56430000], sp=0x00003fff5642b8d0, free space=4078k
>>>> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
>>>> V [libjvm.so+0x162104] loadI2LNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x194
>>>> V [libjvm.so+0x8ece28] Compile::fill_buffer(CodeBuffer*, unsigned int*)+0x4e8
>>>> V [libjvm.so+0x368e08] Compile::Code_Gen()+0x3c8
>>>> V [libjvm.so+0x369e04] Compile::Compile(ciEnv*, C2Compiler*, ciMethod*, int, bool, bool, bool)+0xf64
>>>> V [libjvm.so+0x271380] C2Compiler::compile_method(ciEnv*, ciMethod*, int)+0x1f0
>>>> V [libjvm.so+0x3785a4] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xd54
>>>> V [libjvm.so+0x379dc8] CompileBroker::compiler_thread_loop()+0x488
>>>> V [libjvm.so+0xa5de90] compiler_thread_entry(JavaThread*, Thread*)+0x20
>>>> V [libjvm.so+0xa690c8] JavaThread::thread_main_inner()+0x178
>>>> V [libjvm.so+0x8c8c10] java_start(Thread*)+0x170
>>>> C [libpthread.so.0+0x833c] start_thread+0xfc
>>>> C [libc.so.6+0x12b014] clone+0xe4
>>>>
>>>> The loadI2LNode class is generated from the following ADL code in the
>>>> ppc.ad file:
>>>>
>>>> instruct loadI2L(iRegLdst dst, memory mem) %{
>>>> match(Set dst (ConvI2L (LoadI mem)));
>>>> predicate(_kids[0]->_leaf->as_Load()->is_unordered());
>>>> ins_cost(MEMORY_REF_COST);
>>>>
>>>> format %{ "LWA $dst, $mem \t// loadI2L" %}
>>>> size(4);
>>>> ins_encode %{
>>>> // TODO: PPC port $archOpcode(ppc64Opcode_lwa);
>>>> int Idisp = $mem$$disp + frame_slots_bias($mem$$base, ra_);
>>>> __ lwa($dst$$Register, Idisp, $mem$$base$$Register);
>>>> %}
>>>> ins_pipe(pipe_class_memory);
>>>> %}
>>>>
>>>> So the generated illegal instruction comes from:
>>>> lwa 17,17,15 (DS-form: lwa RT, DS, RA)
>>>>
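>>>> As a displacement encoded in the DS field must always be 4-byte aligned
>>>> (i.e. the DS field is always concatenated with 0b00), 17 as the
>>>> displacement (the middle 17 value) cannot be encoded: its two low bits
>>>> overlap the extended opcode bits of the instruction word, generating the
>>>> illegal instruction in question: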
>>>>
>>>> 11101000000000000000000000000010: LWA
>>>> 00000010001000000000000000000000: 17
>>>> 00000000000000000000000000010001: 17
>>>> 00000000000011110000000000000000: 15
>>>> --------------------------------
>>>> 11101010001011110000000000010011: 0xEA2F0013 => Illegal instruction
>>>>
>>>> The following change is proposed to fix the issue and deals with the
>>>> unaligned displacements:
>>>>
>>>> OpenJDK 9 webrev:
>>>> 81.de.7a9f.ip4.static.sl-reverse.com./illegal/9
>>>>
>>>> OpenJDK 8 webrev:
>>>> 81.de.7a9f.ip4.static.sl-reverse.com./illegal/8
>>>>
>>>> Could we open a JIRA ticket regarding this issue in order to include it
>>>> in the webrev?
>>>>
>>>> Thank you!
>>>>
>>>> Best regards,
>>>> Gustavo
>>>>
>>>> On 12-05-2016 09:39, Volker Simonis wrote:
>>>>> And I forgot to mention: I've checked and we don't emit vsel
>>>>> instructions in jdk8 on ppc. So it must be a coincidence that changing
>>>>> the endianness of the offending instruction yields a valid 'vsel'
>>>>> instruction.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 12, 2016 at 2:26 PM, Volker Simonis
>>>>> <volker.simonis at gmail.com> wrote:
>>>>>> Hi Gustavo,
>>>>>>
>>>>>> thanks for the bug report. The hs_err file you provided indicates that
>>>>>> this crash happened with Ubuntu's openjdk 8 version. Can you still
>>>>>> reproduce this with the newest jdk9 builds?
>>>>>>
>>>>>> Also, I can see from the hs_err file that the crash happened in the C2
>>>>>> compiled method java.util.TimSort.countRunAndMakeAscending which
>>>>>> doesn't seem to be related to nio and unsafe.
>>>>>>
>>>>>> Ideally, you could post an easy test case to reproduce the problem. If
>>>>>> that's not possible, it would be helpful if you could post the output
>>>>>> of a failing run with
>>>>>> "-XX:CompileCommand=print,java.util.TimSort::countRunAndMakeAscending
>>>>>> -XX:CompileCommand=option,java.util.TimSort::countRunAndMakeAscending,PrintOptoAssembly".
>>>>>> In order to get the disassembly output for compiled methods you have
>>>>>> to build the hsdis library from hotspot/src/share/tools/hsdis (it has
>>>>>> a README with build instructions).
>>>>>>
>>>>>> Regards,
>>>>>> Volker
>>>>>>
>>>>>>
>>>>>> On Thu, May 12, 2016 at 12:32 AM, Gustavo Romero
>>>>>> <gromero at linux.vnet.ibm.com> wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> I'm getting a nasty SIGILL that crashes the JVM on PPC64 LE.
>>>>>>>
>>>>>>> hs_err log:
>>>>>>> http://hastebin.com/raw/fovagunaci
>>>>>>>
>>>>>>> The application employs methods from both java.nio.ByteBuffer and
>>>>>>> sun.misc.Unsafe classes in order to write and read from an
>>>>>>> allocated buffer.
>>>>>>>
>>>>>>> An interesting thing is that after debugging the instruction that
>>>>>>> caused the said SIGILL:
>>>>>>>
>>>>>>> 0x3fff902839a4: cmpwi cr6,r17,0
>>>>>>> 0x3fff902839a8: beq cr6,0x3fff90283ae4
>>>>>>> 0x3fff902839ac: .long 0xea2f0013 <============ illegal instruction
>>>>>>> 0x3fff902839b0: add r15,r15,r17
>>>>>>> 0x3fff902839b4: add r14,r17,r14
>>>>>>>
>>>>>>> I found that when its endianness is changed it turns out to be a
>>>>>>> valid instruction: vsel v24,v0,v5,v31
>>>>>>>
>>>>>>> However, I'm still unable to determine if it's an application issue,
>>>>>>> something with JVM unsafe interface code, or something else.
>>>>>>>
>>>>>>> Any clue on how to narrow down this SIGILL?
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> Regards,
>>>>>>> Gustavo
>>>>>>>
>>>>>
>>>>
>>>
>>
>
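
The DS-form encoding problem analysed in the quoted thread can be reproduced
with plain integer arithmetic. The following is a minimal sketch, assuming an
assembler that ORs the raw displacement into the instruction word without
checking its alignment, which is consistent with the bit table quoted above.
The class and method names are hypothetical and are not HotSpot code.

```java
// Hypothetical illustration only; not the HotSpot assembler.
final class DsFormSketch {

    // lwa: primary opcode 58 in the top 6 bits, extended opcode 2 in the
    // low 2 bits of the instruction word (0xE8000002).
    private static final int LWA_OPCODE = (58 << 26) | 2;

    // A DS-form displacement must be a multiple of 4 and fit in 16 signed bits.
    static boolean isDsEncodable(int disp) {
        return (disp & 0x3) == 0 && disp >= -32768 && disp <= 32767;
    }

    // Builds "lwa RT, disp(RA)" without validating disp, which is the
    // behaviour assumed here to have produced the illegal word.
    static int encodeLwaUnchecked(int rt, int disp, int ra) {
        return LWA_OPCODE | (rt << 21) | (ra << 16) | (disp & 0xFFFF);
    }

    // The low 2 bits select the instruction within primary opcode 58.
    static int extendedOpcode(int insn) {
        return insn & 0x3;
    }

    public static void main(String[] args) {
        int bad = encodeLwaUnchecked(17, 17, 15);   // displacement 17: unaligned
        int good = encodeLwaUnchecked(17, 16, 15);  // displacement 16: aligned
        System.out.printf("disp=17 encodable=%b word=0x%08X xo=%d%n",
                isDsEncodable(17), bad, extendedOpcode(bad));
        System.out.printf("disp=16 encodable=%b word=0x%08X xo=%d%n",
                isDsEncodable(16), good, extendedOpcode(good));
    }
}
```

With displacement 17, the low bits of the displacement merge with the extended
opcode, so the word no longer decodes as lwa. A check like isDsEncodable() is
the same test as the "Idisp & 0x3" condition in the proposed ppc.ad change,
which falls back to addi followed by a load with displacement 0 when the
check fails.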