RFR(M): 8216060: [PPC64] Vector CRC implementation should be used by interpreter and be faster for short arrays

Fri Jan 18 14:32:45 UTC 2019

Hi Götz,

that's a good proposal. I've moved the common functionality into macroAssembler_ppc. This makes interpreter and stubGenerator code shorter.

I've also moved the vector constants computation to stubGenerator such that we only do it when the intrinsics are enabled and the vector version is supported by the processor.

New webrev:
http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.02/

@Gustavo: Thanks for testing and confirming the issue (JDK-8216376) is fixed.

Best regards,
Martin

-----Original Message-----
From: Lindenmaier, Goetz 
Sent: Freitag, 18. Januar 2019 12:03
To: Doerr, Martin <martin.doerr at sap.com>; Gustavo Romero <gromero at linux.vnet.ibm.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
Subject: RE: RFR(M): 8216060: [PPC64] Vector CRC implementation should be used by interpreter and be faster for short arrays

Hi Martin, 

I had a look at your change. 
Overall looks good. According to Gustavos mail a nice improvement!

I think though that the way to select the algorithm is quite
messy:
In templateInterpreter vpmsumb is checked and the methods are
called directly.
In stubGenerator, generate_CRC32...()
  vpmsumb is tested to decide on vector_constants = R2.
  and generic generate_CRC_updateBytes is called, which 
again checks whether verctor_constants == R2.

I think generate_CRC_updateBytes() or some other generic
function should be located in macroAssembler_ppc and
be called from both locations.

What do you think?

Best regards,
  Goetz

> -----Original Message-----
> From: Doerr, Martin
> Sent: Donnerstag, 17. Januar 2019 14:18
> To: Gustavo Romero <gromero at linux.vnet.ibm.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>;
> Lindenmaier, Goetz <goetz.lindenmaier at sap.com>
> Subject: RE: RFR(M): 8216060: [PPC64] Vector CRC implementation should be
> used by interpreter and be faster for short arrays
> 
> Hi,
> 
> the rebased webrev.01 applies on jdk/jdk, now (after JDK-8216376). So the
> issue Gustavo had observed does not longer exist.
> http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.01/
> 
> I have updated copyrights and retested it.
> 
> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> Sent: Montag, 7. Januar 2019 14:52
> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should be
> used by interpreter and be faster for short arrays
> 
> Hi Martin,
> 
> On 01/07/2019 11:49 AM, Doerr, Martin wrote:
> > I want to check all places where we use "mr(R1_SP, R21_sender_SP)".
> There may be more issues with that. I'll probably handle that in a separate
> change and push this CRC change afterwards.
> 
> I see. Thanks for letting me know.
> 
> Best regards,
> Gustavo
> 
> > Best regards,
> > Martin
> >
> >
> > -----Original Message-----
> > From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> > Sent: Freitag, 4. Januar 2019 19:55
> > To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should
> be used by interpreter and be faster for short arrays
> >
> > Hi Martin,
> >
> > On 01/04/2019 02:13 PM, Doerr, Martin wrote:
> >> Hi Gustavo,
> >>
> >> when called from the interpreter (the scenario you observed), R21 is set
> before resizing the frame to avoid wasted stack space
> (InterpreterMacroAssembler::call_from_interpreter).
> >
> > Got it. Thanks a lot for the explanations.
> >
> > I think it doesn't currently matter in practice, but I'm wondering if to be
> > consistent we should cut back the stack back earlier also in
> > TemplateInterpreterGenerator::generate_CRC32_update_entry()?
> >
> > diff -r a35f8c35d8c9
> src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp
> > --- a/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp  Fri Jan
> 04 10:09:00 2019 +0100
> > +++ b/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp  Fri Jan
> 04 13:44:37 2019 -0500
> > @@ -1840,11 +1840,12 @@
> >    #endif
> >        __ lwz(crc,  2*wordSize, argP);    // Current crc state, zero extend to 64
> bit to have a clean register.
> >
> > +    // Restore caller sp for c2i case and return.
> > +    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the caller
> started.
> > +
> >        StubRoutines::ppc64::generate_load_crc_table_addr(_masm, table);
> >        __ kernel_crc32_singleByte(crc, data, dataLen, table, tmp, true);
> >
> > -    // Restore caller sp for c2i case and return.
> > -    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the caller
> started.
> >        __ blr();
> >
> >        // Generate a vanilla native entry as the slow path.
> >
> > Currently there is no issue probably because generated code is simpler and
> does
> > no spills.
> >
> > Best regards,
> > Gustavo
> >
> >> When called from compiled methods, R21 is set by a c2i adapter which
> extends the compiled frame by space for arguments (gen_c2i_adapter).
> >>
> >> "mr(R1_SP, R21_sender_SP)" is more error-prone than
> "resize_frame_absolute" so I think the latter would be better (though it takes
> more registers and instructions), but I don't want to replace that as part of
> this CRC change.
> >>
> >> Best regards,
> >> Martin
> >>
> >>
> >> -----Original Message-----
> >> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> >> Sent: Freitag, 4. Januar 2019 14:44
> >> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> >> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should
> be used by interpreter and be faster for short arrays
> >>
> >> Hi Martin,
> >>
> >> On 01/04/2019 07:30 AM, Doerr, Martin wrote:
> >>> thank you very much for confirming. This makes sense. We use different
> frame headers depending on whether the frame is the top Java frame or not
> (and on whether it's a debug build or not). Setting R1_SP to sender_SP is a
> shortcut for leaf calls which relies on having an unmodified stack until this
> point. So the patch fixes the issue.
> >>
> >> Glad to help! Thanks for the additional information, I was not aware that
> the
> >> selection of different frame headers could be done at compile time. One
> last
> >> question only for my education: what exactly advanced (incremented)
> R1_SP so it
> >> has to be cut back using sender_SP value, i.e. sender_SP tracks the frame
> for
> >> which function exactly or "who" is the caller exactly here?
> >>
> >> Thank you.
> >>
> >> Best regards,
> >> Gustavo
> >>
> >>> New webrev:
> >>> http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.01/
> >>>
> >>> Best regards,
> >>> Martin
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> >>> Sent: Donnerstag, 3. Januar 2019 19:36
> >>> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> >>> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation
> should be used by interpreter and be faster for short arrays
> >>>
> >>> Hi Martin,
> >>>
> >>> On 01/03/2019 03:34 PM, Doerr, Martin wrote:
> >>>> Unfortunately, I can't reproduce the crash. TestCRC32C works stable on
> our machine (with fastdbg build).
> >>>> I guess that the frameless spills mess up the stack. Can you check if the
> patch below helps?
> >>>
> >>> Thanks for providing a fix so I can try it.
> >>> Yes, I confirm the patch below indeed fixes the sigsegv crash when
> CRC32C update() method is used.
> >>> I also confirm that I don't observe the crash on the fastdebug build, only
> on the release build.
> >>> It also only affects the Interpreter mode, so passing -Xcomp avoids the
> crash on the release build.
> >>>
> >>> Just as reference, I can reproduce it on the release build with the
> following trivial code:
> >>>
> >>> import java.util.zip.CRC32C;
> >>>
> >>> class CRC32C_v1 {
> >>>       public static void main(String[] arg) {
> >>>         byte[] b = new byte[1024];
> >>>
> >>>         CRC32C crc32c = new CRC32C();
> >>>         crc32c.update(b, 0, b.length);
> >>>
> >>>         System.out.println(crc32c.getValue());
> >>>       }
> >>> }
> >>>
> >>> Thanks for fixing the typos.
> >>>
> >>>
> >>> Best regards,
> >>> Gustavo
> >>>
> >>>> Best regards,
> >>>> Martin
> >>>>
> >>>>
> >>>> diff -r a33f49d5998c
> src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp
> >>>> --- a/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp  Thu
> Jan 03 17:30:03 2019 +0100
> >>>> +++ b/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp
> Thu Jan 03 18:33:16 2019 +0100
> >>>> @@ -1924,6 +1924,9 @@
> >>>>            __ addi(data, data,
> arrayOopDesc::base_offset_in_bytes(T_BYTE));
> >>>>          }
> >>>>
> >>>> +    // Restore caller sp for c2i case.
> >>>> +    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the
> caller started.
> >>>> +
> >>>>          StubRoutines::ppc64::generate_load_crc_table_addr(_masm,
> table);
> >>>>
> >>>>          if (!VM_Version::has_vpmsumb()) {
> >>>> @@ -1933,8 +1936,6 @@
> >>>>            __ kernel_crc32_vpmsum(crc, data, dataLen, table, t0, t1, t2, t3,
> tc0, tc1, tc2, true);
> >>>>          }
> >>>>
> >>>> -    // Restore caller sp for c2i case and return.
> >>>> -    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the
> caller started.
> >>>>          __ blr();
> >>>>
> >>>>          // Generate a vanilla native entry as the slow path.
> >>>> @@ -2014,6 +2015,9 @@
> >>>>            __ addi(data, data,
> arrayOopDesc::base_offset_in_bytes(T_BYTE));
> >>>>          }
> >>>>
> >>>> +    // Restore caller sp for c2i case.
> >>>> +    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the
> caller started.
> >>>> +
> >>>>          StubRoutines::ppc64::generate_load_crc32c_table_addr(_masm,
> table);
> >>>>
> >>>>          if (!VM_Version::has_vpmsumb()) {
> >>>> @@ -2023,8 +2027,6 @@
> >>>>            __ kernel_crc32_vpmsum(crc, data, dataLen, table, t0, t1, t2, t3,
> tc0, tc1, tc2, false);
> >>>>          }
> >>>>
> >>>> -    // Restore caller sp for c2i case and return.
> >>>> -    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the
> caller started.
> >>>>          __ blr();
> >>>>
> >>>>          BLOCK_COMMENT("} CRC32C_update{Bytes|DirectByteBuffer}");
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> >>>> Sent: Donnerstag, 3. Januar 2019 17:13
> >>>> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> >>>> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation
> should be used by interpreter and be faster for short arrays
> >>>>
> >>>> Hi Martin,
> >>>>
> >>>> oh that's nice. You removed the 512-byte block constraint and also
> wired it up to the Interpreter :)
> >>>>
> >>>> For the worst case, unaligned 512 byte array, I see the gap to aligned
> 512 byte array reduced by about ~5.7x.
> >>>>
> >>>> On the Interpreter I see an improvement of at least 50% for 1024 bytes.
> >>>>
> >>>> This is all for the CRC32 class.
> >>>>
> >>>> On CRC32C I'm getting a SIGSEV that can be reproduced running against
> ./test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java.
> >>>>
> >>>> I've upload a full log into
> http://cr.openjdk.java.net/~gromero/logs/crc32c_sigsegv/
> >>>>
> >>>> I'm leaving for the lunch and I'll take a closer look when back. But
> probably you will figure it out before I sit to appreciate the meal :)
> >>>>
> >>>> Finally, since the change does some cleanup, I wonder if it would be
> worth fixing the following typos:
> >>>>
> >>>> I think it's Barrett const., not Barret. Probably 'barret' is used in the
> code as a short version
> >>>> for Barrett but it should be changed in
> >>>>
> >>>> +  // Point to Barret constants
> >>>> +  add_const_optimized(cur_const, constants, outer_consts_size +
> inner_consts_size);
> >>>> +
> >>>>
> >>>> ?
> >>>>
> >>>> s/not/note/ in:
> >>>> cpu/ppc/macroAssembler_ppc.cpp:3977:// A not on the lookup table
> address(es):
> >>>>
> >>>> d/lives/ in:
> >>>> cpu/ppc/macroAssembler_ppc.cpp:4265:  mtvrwz(VCRC, crc); // crc
> lives lives in VCRC, now
> >>>>
> >>>> Best regards,
> >>>> Gustavo
> >>>>
> >>>> On 01/03/2019 12:17 PM, Doerr, Martin wrote:
> >>>>> Hi,
> >>>>>
> >>>>> the JVM on PPC64 currently misses usage of the fast vector
> implementation in the interpreter code.
> >>>>>
> >>>>> In addition, performance is not good for short arrays (unaligned 512
> byte arrays or shorter arrays) because the current vector implementation
> needs at least 512 bytes.
> >>>>>
> >>>>> Bug:
> >>>>>
> >>>>> https://bugs.openjdk.java.net/browse/JDK-8216060
> >>>>>
> >>>>> I have addressed these 2 issues + some cleanup with the following
> webrev:
> >>>>>
> >>>>> http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.00/
> <http://cr.openjdk.java.net/%7Emdoerr/8216060_PPC64_CRC/webrev.00/>
> >>>>>
> >>>>> Please review.
> >>>>>
> >>>>> Best regards,
> >>>>>
> >>>>> Martin
> >>>>>
> >>>>
> >>>
> >>
> >
>