RFR(M): 8216060: [PPC64] Vector CRC implementation should be used by interpreter and be faster for short arrays

Thu Jan 17 16:18:39 UTC 2019

Hi Martin,

On 01/17/2019 11:18 AM, Doerr, Martin wrote:
> Hi,
> 
> the rebased webrev.01 applies on jdk/jdk, now (after JDK-8216376). So the issue Gustavo had observed does not longer exist.
> http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.01/
> 
> I have updated copyrights and retested it.

I tested it when JDK-8216376 was submitted for review, but I retested
this rebase from webrev.01 again on top of the most recent changes
again (just in case) and all looks fine to me.

Thank you.

Best regards,
Gustavo

> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> Sent: Montag, 7. Januar 2019 14:52
> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should be used by interpreter and be faster for short arrays
> 
> Hi Martin,
> 
> On 01/07/2019 11:49 AM, Doerr, Martin wrote:
>> I want to check all places where we use "mr(R1_SP, R21_sender_SP)". There may be more issues with that. I'll probably handle that in a separate change and push this CRC change afterwards.
> 
> I see. Thanks for letting me know.
> 
> Best regards,
> Gustavo
> 
>> Best regards,
>> Martin
>>
>>
>> -----Original Message-----
>> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
>> Sent: Freitag, 4. Januar 2019 19:55
>> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
>> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should be used by interpreter and be faster for short arrays
>>
>> Hi Martin,
>>
>> On 01/04/2019 02:13 PM, Doerr, Martin wrote:
>>> Hi Gustavo,
>>>
>>> when called from the interpreter (the scenario you observed), R21 is set before resizing the frame to avoid wasted stack space (InterpreterMacroAssembler::call_from_interpreter).
>>
>> Got it. Thanks a lot for the explanations.
>>
>> I think it doesn't currently matter in practice, but I'm wondering if to be
>> consistent we should cut back the stack back earlier also in
>> TemplateInterpreterGenerator::generate_CRC32_update_entry()?
>>
>> diff -r a35f8c35d8c9 src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp
>> --- a/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp  Fri Jan 04 10:09:00 2019 +0100
>> +++ b/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp  Fri Jan 04 13:44:37 2019 -0500
>> @@ -1840,11 +1840,12 @@
>>     #endif
>>         __ lwz(crc,  2*wordSize, argP);    // Current crc state, zero extend to 64 bit to have a clean register.
>>     
>> +    // Restore caller sp for c2i case and return.
>> +    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the caller started.
>> +
>>         StubRoutines::ppc64::generate_load_crc_table_addr(_masm, table);
>>         __ kernel_crc32_singleByte(crc, data, dataLen, table, tmp, true);
>>     
>> -    // Restore caller sp for c2i case and return.
>> -    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the caller started.
>>         __ blr();
>>     
>>         // Generate a vanilla native entry as the slow path.
>>
>> Currently there is no issue probably because generated code is simpler and does
>> no spills.
>>
>> Best regards,
>> Gustavo
>>
>>> When called from compiled methods, R21 is set by a c2i adapter which extends the compiled frame by space for arguments (gen_c2i_adapter).
>>>
>>> "mr(R1_SP, R21_sender_SP)" is more error-prone than "resize_frame_absolute" so I think the latter would be better (though it takes more registers and instructions), but I don't want to replace that as part of this CRC change.
>>>
>>> Best regards,
>>> Martin
>>>
>>>
>>> -----Original Message-----
>>> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
>>> Sent: Freitag, 4. Januar 2019 14:44
>>> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
>>> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should be used by interpreter and be faster for short arrays
>>>
>>> Hi Martin,
>>>
>>> On 01/04/2019 07:30 AM, Doerr, Martin wrote:
>>>> thank you very much for confirming. This makes sense. We use different frame headers depending on whether the frame is the top Java frame or not (and on whether it's a debug build or not). Setting R1_SP to sender_SP is a shortcut for leaf calls which relies on having an unmodified stack until this point. So the patch fixes the issue.
>>>
>>> Glad to help! Thanks for the additional information, I was not aware that the
>>> selection of different frame headers could be done at compile time. One last
>>> question only for my education: what exactly advanced (incremented) R1_SP so it
>>> has to be cut back using sender_SP value, i.e. sender_SP tracks the frame for
>>> which function exactly or "who" is the caller exactly here?
>>>
>>> Thank you.
>>>
>>> Best regards,
>>> Gustavo
>>>
>>>> New webrev:
>>>> http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.01/
>>>>
>>>> Best regards,
>>>> Martin
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
>>>> Sent: Donnerstag, 3. Januar 2019 19:36
>>>> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
>>>> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should be used by interpreter and be faster for short arrays
>>>>
>>>> Hi Martin,
>>>>
>>>> On 01/03/2019 03:34 PM, Doerr, Martin wrote:
>>>>> Unfortunately, I can't reproduce the crash. TestCRC32C works stable on our machine (with fastdbg build).
>>>>> I guess that the frameless spills mess up the stack. Can you check if the patch below helps?
>>>>
>>>> Thanks for providing a fix so I can try it.
>>>> Yes, I confirm the patch below indeed fixes the sigsegv crash when CRC32C update() method is used.
>>>> I also confirm that I don't observe the crash on the fastdebug build, only on the release build.
>>>> It also only affects the Interpreter mode, so passing -Xcomp avoids the crash on the release build.
>>>>
>>>> Just as reference, I can reproduce it on the release build with the following trivial code:
>>>>
>>>> import java.util.zip.CRC32C;
>>>>
>>>> class CRC32C_v1 {
>>>>        public static void main(String[] arg) {
>>>>          byte[] b = new byte[1024];
>>>>        
>>>>          CRC32C crc32c = new CRC32C();
>>>>          crc32c.update(b, 0, b.length);
>>>>
>>>>          System.out.println(crc32c.getValue());
>>>>        }
>>>> }
>>>>
>>>> Thanks for fixing the typos.
>>>>
>>>>
>>>> Best regards,
>>>> Gustavo
>>>>       
>>>>> Best regards,
>>>>> Martin
>>>>>
>>>>>
>>>>> diff -r a33f49d5998c src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp
>>>>> --- a/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp  Thu Jan 03 17:30:03 2019 +0100
>>>>> +++ b/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp  Thu Jan 03 18:33:16 2019 +0100
>>>>> @@ -1924,6 +1924,9 @@
>>>>>             __ addi(data, data, arrayOopDesc::base_offset_in_bytes(T_BYTE));
>>>>>           }
>>>>>
>>>>> +    // Restore caller sp for c2i case.
>>>>> +    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the caller started.
>>>>> +
>>>>>           StubRoutines::ppc64::generate_load_crc_table_addr(_masm, table);
>>>>>
>>>>>           if (!VM_Version::has_vpmsumb()) {
>>>>> @@ -1933,8 +1936,6 @@
>>>>>             __ kernel_crc32_vpmsum(crc, data, dataLen, table, t0, t1, t2, t3, tc0, tc1, tc2, true);
>>>>>           }
>>>>>
>>>>> -    // Restore caller sp for c2i case and return.
>>>>> -    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the caller started.
>>>>>           __ blr();
>>>>>
>>>>>           // Generate a vanilla native entry as the slow path.
>>>>> @@ -2014,6 +2015,9 @@
>>>>>             __ addi(data, data, arrayOopDesc::base_offset_in_bytes(T_BYTE));
>>>>>           }
>>>>>
>>>>> +    // Restore caller sp for c2i case.
>>>>> +    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the caller started.
>>>>> +
>>>>>           StubRoutines::ppc64::generate_load_crc32c_table_addr(_masm, table);
>>>>>
>>>>>           if (!VM_Version::has_vpmsumb()) {
>>>>> @@ -2023,8 +2027,6 @@
>>>>>             __ kernel_crc32_vpmsum(crc, data, dataLen, table, t0, t1, t2, t3, tc0, tc1, tc2, false);
>>>>>           }
>>>>>
>>>>> -    // Restore caller sp for c2i case and return.
>>>>> -    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the caller started.
>>>>>           __ blr();
>>>>>
>>>>>           BLOCK_COMMENT("} CRC32C_update{Bytes|DirectByteBuffer}");
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
>>>>> Sent: Donnerstag, 3. Januar 2019 17:13
>>>>> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
>>>>> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should be used by interpreter and be faster for short arrays
>>>>>
>>>>> Hi Martin,
>>>>>
>>>>> oh that's nice. You removed the 512-byte block constraint and also wired it up to the Interpreter :)
>>>>>
>>>>> For the worst case, unaligned 512 byte array, I see the gap to aligned 512 byte array reduced by about ~5.7x.
>>>>>
>>>>> On the Interpreter I see an improvement of at least 50% for 1024 bytes.
>>>>>
>>>>> This is all for the CRC32 class.
>>>>>
>>>>> On CRC32C I'm getting a SIGSEV that can be reproduced running against ./test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java.
>>>>>
>>>>> I've upload a full log into http://cr.openjdk.java.net/~gromero/logs/crc32c_sigsegv/
>>>>>
>>>>> I'm leaving for the lunch and I'll take a closer look when back. But probably you will figure it out before I sit to appreciate the meal :)
>>>>>
>>>>> Finally, since the change does some cleanup, I wonder if it would be worth fixing the following typos:
>>>>>
>>>>> I think it's Barrett const., not Barret. Probably 'barret' is used in the code as a short version
>>>>> for Barrett but it should be changed in
>>>>>
>>>>> +  // Point to Barret constants
>>>>> +  add_const_optimized(cur_const, constants, outer_consts_size + inner_consts_size);
>>>>> +
>>>>>
>>>>> ?
>>>>>
>>>>> s/not/note/ in:
>>>>> cpu/ppc/macroAssembler_ppc.cpp:3977:// A not on the lookup table address(es):
>>>>>
>>>>> d/lives/ in:
>>>>> cpu/ppc/macroAssembler_ppc.cpp:4265:  mtvrwz(VCRC, crc); // crc lives lives in VCRC, now
>>>>>
>>>>> Best regards,
>>>>> Gustavo
>>>>>
>>>>> On 01/03/2019 12:17 PM, Doerr, Martin wrote:
>>>>>> Hi,
>>>>>>
>>>>>> the JVM on PPC64 currently misses usage of the fast vector implementation in the interpreter code.
>>>>>>
>>>>>> In addition, performance is not good for short arrays (unaligned 512 byte arrays or shorter arrays) because the current vector implementation needs at least 512 bytes.
>>>>>>
>>>>>> Bug:
>>>>>>
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8216060
>>>>>>
>>>>>> I have addressed these 2 issues + some cleanup with the following webrev:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.00/ <http://cr.openjdk.java.net/%7Emdoerr/8216060_PPC64_CRC/webrev.00/>
>>>>>>
>>>>>> Please review.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>
>>>>
>>>
>>
>