RFR(M): 8216060: [PPC64] Vector CRC implementation should be used by interpreter and be faster for short arrays

Lindenmaier, Goetz goetz.lindenmaier at sap.com
Fri Jan 18 14:42:14 UTC 2019


Hi Martin, 

thanks for improving this, looks good now!
Actually, this is much more cleanup than I expected :)

Best regards,
  Goetz.

> -----Original Message-----
> From: Doerr, Martin
> Sent: Freitag, 18. Januar 2019 15:33
> To: Lindenmaier, Goetz <goetz.lindenmaier at sap.com>; Gustavo Romero
> <gromero at linux.vnet.ibm.com>; 'hotspot-compiler-dev at openjdk.java.net'
> <hotspot-compiler-dev at openjdk.java.net>
> Subject: RE: RFR(M): 8216060: [PPC64] Vector CRC implementation should be
> used by interpreter and be faster for short arrays
> 
> Hi Götz,
> 
> that's a good proposal. I've moved the common functionality into
> macroAssembler_ppc. This makes interpreter and stubGenerator code shorter.
> 
> I've also moved the vector constants computation to stubGenerator such that
> we only do it when the intrinsics are enabled and the vector version is
> supported by the processor.
> 
> New webrev:
> http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.02/
> 
> @Gustavo: Thanks for testing and confirming the issue (JDK-8216376) is fixed.
> 
> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: Lindenmaier, Goetz
> Sent: Freitag, 18. Januar 2019 12:03
> To: Doerr, Martin <martin.doerr at sap.com>; Gustavo Romero
> <gromero at linux.vnet.ibm.com>; 'hotspot-compiler-dev at openjdk.java.net'
> <hotspot-compiler-dev at openjdk.java.net>
> Subject: RE: RFR(M): 8216060: [PPC64] Vector CRC implementation should be
> used by interpreter and be faster for short arrays
> 
> Hi Martin,
> 
> I had a look at your change.
> Overall looks good. According to Gustavos mail a nice improvement!
> 
> I think though that the way to select the algorithm is quite
> messy:
> In templateInterpreter vpmsumb is checked and the methods are
> called directly.
> In stubGenerator, generate_CRC32...()
>   vpmsumb is tested to decide on vector_constants = R2.
>   and generic generate_CRC_updateBytes is called, which
> again checks whether verctor_constants == R2.
> 
> I think generate_CRC_updateBytes() or some other generic
> function should be located in macroAssembler_ppc and
> be called from both locations.
> 
> What do you think?
> 
> Best regards,
>   Goetz
> 
> 
> 
> > -----Original Message-----
> > From: Doerr, Martin
> > Sent: Donnerstag, 17. Januar 2019 14:18
> > To: Gustavo Romero <gromero at linux.vnet.ibm.com>; 'hotspot-compiler-
> > dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>;
> > Lindenmaier, Goetz <goetz.lindenmaier at sap.com>
> > Subject: RE: RFR(M): 8216060: [PPC64] Vector CRC implementation should
> be
> > used by interpreter and be faster for short arrays
> >
> > Hi,
> >
> > the rebased webrev.01 applies on jdk/jdk, now (after JDK-8216376). So the
> > issue Gustavo had observed does not longer exist.
> > http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.01/
> >
> > I have updated copyrights and retested it.
> >
> > Best regards,
> > Martin
> >
> >
> > -----Original Message-----
> > From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> > Sent: Montag, 7. Januar 2019 14:52
> > To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should
> be
> > used by interpreter and be faster for short arrays
> >
> > Hi Martin,
> >
> > On 01/07/2019 11:49 AM, Doerr, Martin wrote:
> > > I want to check all places where we use "mr(R1_SP, R21_sender_SP)".
> > There may be more issues with that. I'll probably handle that in a separate
> > change and push this CRC change afterwards.
> >
> > I see. Thanks for letting me know.
> >
> > Best regards,
> > Gustavo
> >
> > > Best regards,
> > > Martin
> > >
> > >
> > > -----Original Message-----
> > > From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> > > Sent: Freitag, 4. Januar 2019 19:55
> > > To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > > Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should
> > be used by interpreter and be faster for short arrays
> > >
> > > Hi Martin,
> > >
> > > On 01/04/2019 02:13 PM, Doerr, Martin wrote:
> > >> Hi Gustavo,
> > >>
> > >> when called from the interpreter (the scenario you observed), R21 is set
> > before resizing the frame to avoid wasted stack space
> > (InterpreterMacroAssembler::call_from_interpreter).
> > >
> > > Got it. Thanks a lot for the explanations.
> > >
> > > I think it doesn't currently matter in practice, but I'm wondering if to be
> > > consistent we should cut back the stack back earlier also in
> > > TemplateInterpreterGenerator::generate_CRC32_update_entry()?
> > >
> > > diff -r a35f8c35d8c9
> > src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp
> > > --- a/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp  Fri Jan
> > 04 10:09:00 2019 +0100
> > > +++ b/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp  Fri Jan
> > 04 13:44:37 2019 -0500
> > > @@ -1840,11 +1840,12 @@
> > >    #endif
> > >        __ lwz(crc,  2*wordSize, argP);    // Current crc state, zero extend to 64
> > bit to have a clean register.
> > >
> > > +    // Restore caller sp for c2i case and return.
> > > +    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the caller
> > started.
> > > +
> > >        StubRoutines::ppc64::generate_load_crc_table_addr(_masm, table);
> > >        __ kernel_crc32_singleByte(crc, data, dataLen, table, tmp, true);
> > >
> > > -    // Restore caller sp for c2i case and return.
> > > -    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the caller
> > started.
> > >        __ blr();
> > >
> > >        // Generate a vanilla native entry as the slow path.
> > >
> > > Currently there is no issue probably because generated code is simpler and
> > does
> > > no spills.
> > >
> > > Best regards,
> > > Gustavo
> > >
> > >> When called from compiled methods, R21 is set by a c2i adapter which
> > extends the compiled frame by space for arguments (gen_c2i_adapter).
> > >>
> > >> "mr(R1_SP, R21_sender_SP)" is more error-prone than
> > "resize_frame_absolute" so I think the latter would be better (though it takes
> > more registers and instructions), but I don't want to replace that as part of
> > this CRC change.
> > >>
> > >> Best regards,
> > >> Martin
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> > >> Sent: Freitag, 4. Januar 2019 14:44
> > >> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > >> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation
> should
> > be used by interpreter and be faster for short arrays
> > >>
> > >> Hi Martin,
> > >>
> > >> On 01/04/2019 07:30 AM, Doerr, Martin wrote:
> > >>> thank you very much for confirming. This makes sense. We use different
> > frame headers depending on whether the frame is the top Java frame or not
> > (and on whether it's a debug build or not). Setting R1_SP to sender_SP is a
> > shortcut for leaf calls which relies on having an unmodified stack until this
> > point. So the patch fixes the issue.
> > >>
> > >> Glad to help! Thanks for the additional information, I was not aware that
> > the
> > >> selection of different frame headers could be done at compile time. One
> > last
> > >> question only for my education: what exactly advanced (incremented)
> > R1_SP so it
> > >> has to be cut back using sender_SP value, i.e. sender_SP tracks the frame
> > for
> > >> which function exactly or "who" is the caller exactly here?
> > >>
> > >> Thank you.
> > >>
> > >> Best regards,
> > >> Gustavo
> > >>
> > >>> New webrev:
> > >>> http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.01/
> > >>>
> > >>> Best regards,
> > >>> Martin
> > >>>
> > >>>
> > >>> -----Original Message-----
> > >>> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> > >>> Sent: Donnerstag, 3. Januar 2019 19:36
> > >>> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > >>> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation
> > should be used by interpreter and be faster for short arrays
> > >>>
> > >>> Hi Martin,
> > >>>
> > >>> On 01/03/2019 03:34 PM, Doerr, Martin wrote:
> > >>>> Unfortunately, I can't reproduce the crash. TestCRC32C works stable on
> > our machine (with fastdbg build).
> > >>>> I guess that the frameless spills mess up the stack. Can you check if the
> > patch below helps?
> > >>>
> > >>> Thanks for providing a fix so I can try it.
> > >>> Yes, I confirm the patch below indeed fixes the sigsegv crash when
> > CRC32C update() method is used.
> > >>> I also confirm that I don't observe the crash on the fastdebug build, only
> > on the release build.
> > >>> It also only affects the Interpreter mode, so passing -Xcomp avoids the
> > crash on the release build.
> > >>>
> > >>> Just as reference, I can reproduce it on the release build with the
> > following trivial code:
> > >>>
> > >>> import java.util.zip.CRC32C;
> > >>>
> > >>> class CRC32C_v1 {
> > >>>       public static void main(String[] arg) {
> > >>>         byte[] b = new byte[1024];
> > >>>
> > >>>         CRC32C crc32c = new CRC32C();
> > >>>         crc32c.update(b, 0, b.length);
> > >>>
> > >>>         System.out.println(crc32c.getValue());
> > >>>       }
> > >>> }
> > >>>
> > >>> Thanks for fixing the typos.
> > >>>
> > >>>
> > >>> Best regards,
> > >>> Gustavo
> > >>>
> > >>>> Best regards,
> > >>>> Martin
> > >>>>
> > >>>>
> > >>>> diff -r a33f49d5998c
> > src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp
> > >>>> --- a/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp  Thu
> > Jan 03 17:30:03 2019 +0100
> > >>>> +++ b/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp
> > Thu Jan 03 18:33:16 2019 +0100
> > >>>> @@ -1924,6 +1924,9 @@
> > >>>>            __ addi(data, data,
> > arrayOopDesc::base_offset_in_bytes(T_BYTE));
> > >>>>          }
> > >>>>
> > >>>> +    // Restore caller sp for c2i case.
> > >>>> +    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the
> > caller started.
> > >>>> +
> > >>>>          StubRoutines::ppc64::generate_load_crc_table_addr(_masm,
> > table);
> > >>>>
> > >>>>          if (!VM_Version::has_vpmsumb()) {
> > >>>> @@ -1933,8 +1936,6 @@
> > >>>>            __ kernel_crc32_vpmsum(crc, data, dataLen, table, t0, t1, t2, t3,
> > tc0, tc1, tc2, true);
> > >>>>          }
> > >>>>
> > >>>> -    // Restore caller sp for c2i case and return.
> > >>>> -    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the
> > caller started.
> > >>>>          __ blr();
> > >>>>
> > >>>>          // Generate a vanilla native entry as the slow path.
> > >>>> @@ -2014,6 +2015,9 @@
> > >>>>            __ addi(data, data,
> > arrayOopDesc::base_offset_in_bytes(T_BYTE));
> > >>>>          }
> > >>>>
> > >>>> +    // Restore caller sp for c2i case.
> > >>>> +    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the
> > caller started.
> > >>>> +
> > >>>>          StubRoutines::ppc64::generate_load_crc32c_table_addr(_masm,
> > table);
> > >>>>
> > >>>>          if (!VM_Version::has_vpmsumb()) {
> > >>>> @@ -2023,8 +2027,6 @@
> > >>>>            __ kernel_crc32_vpmsum(crc, data, dataLen, table, t0, t1, t2, t3,
> > tc0, tc1, tc2, false);
> > >>>>          }
> > >>>>
> > >>>> -    // Restore caller sp for c2i case and return.
> > >>>> -    __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the
> > caller started.
> > >>>>          __ blr();
> > >>>>
> > >>>>          BLOCK_COMMENT("} CRC32C_update{Bytes|DirectByteBuffer}");
> > >>>>
> > >>>>
> > >>>> -----Original Message-----
> > >>>> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> > >>>> Sent: Donnerstag, 3. Januar 2019 17:13
> > >>>> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > >>>> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation
> > should be used by interpreter and be faster for short arrays
> > >>>>
> > >>>> Hi Martin,
> > >>>>
> > >>>> oh that's nice. You removed the 512-byte block constraint and also
> > wired it up to the Interpreter :)
> > >>>>
> > >>>> For the worst case, unaligned 512 byte array, I see the gap to aligned
> > 512 byte array reduced by about ~5.7x.
> > >>>>
> > >>>> On the Interpreter I see an improvement of at least 50% for 1024 bytes.
> > >>>>
> > >>>> This is all for the CRC32 class.
> > >>>>
> > >>>> On CRC32C I'm getting a SIGSEV that can be reproduced running
> against
> > ./test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java.
> > >>>>
> > >>>> I've upload a full log into
> > http://cr.openjdk.java.net/~gromero/logs/crc32c_sigsegv/
> > >>>>
> > >>>> I'm leaving for the lunch and I'll take a closer look when back. But
> > probably you will figure it out before I sit to appreciate the meal :)
> > >>>>
> > >>>> Finally, since the change does some cleanup, I wonder if it would be
> > worth fixing the following typos:
> > >>>>
> > >>>> I think it's Barrett const., not Barret. Probably 'barret' is used in the
> > code as a short version
> > >>>> for Barrett but it should be changed in
> > >>>>
> > >>>> +  // Point to Barret constants
> > >>>> +  add_const_optimized(cur_const, constants, outer_consts_size +
> > inner_consts_size);
> > >>>> +
> > >>>>
> > >>>> ?
> > >>>>
> > >>>> s/not/note/ in:
> > >>>> cpu/ppc/macroAssembler_ppc.cpp:3977:// A not on the lookup table
> > address(es):
> > >>>>
> > >>>> d/lives/ in:
> > >>>> cpu/ppc/macroAssembler_ppc.cpp:4265:  mtvrwz(VCRC, crc); // crc
> > lives lives in VCRC, now
> > >>>>
> > >>>> Best regards,
> > >>>> Gustavo
> > >>>>
> > >>>> On 01/03/2019 12:17 PM, Doerr, Martin wrote:
> > >>>>> Hi,
> > >>>>>
> > >>>>> the JVM on PPC64 currently misses usage of the fast vector
> > implementation in the interpreter code.
> > >>>>>
> > >>>>> In addition, performance is not good for short arrays (unaligned 512
> > byte arrays or shorter arrays) because the current vector implementation
> > needs at least 512 bytes.
> > >>>>>
> > >>>>> Bug:
> > >>>>>
> > >>>>> https://bugs.openjdk.java.net/browse/JDK-8216060
> > >>>>>
> > >>>>> I have addressed these 2 issues + some cleanup with the following
> > webrev:
> > >>>>>
> > >>>>>
> http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.00/
> > <http://cr.openjdk.java.net/%7Emdoerr/8216060_PPC64_CRC/webrev.00/>
> > >>>>>
> > >>>>> Please review.
> > >>>>>
> > >>>>> Best regards,
> > >>>>>
> > >>>>> Martin
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >
> >
> 
> 



More information about the hotspot-compiler-dev mailing list