RFR(M): 8216060: [PPC64] Vector CRC implementation should be used by interpreter and be faster for short arrays
Lindenmaier, Goetz
goetz.lindenmaier at sap.com
Fri Jan 18 11:03:15 UTC 2019
Hi Martin,
I had a look at your change.
Overall looks good. According to Gustavos mail a nice improvement!
I think though that the way to select the algorithm is quite
messy:
In templateInterpreter vpmsumb is checked and the methods are
called directly.
In stubGenerator, generate_CRC32...()
vpmsumb is tested to decide on vector_constants = R2.
and generic generate_CRC_updateBytes is called, which
again checks whether verctor_constants == R2.
I think generate_CRC_updateBytes() or some other generic
function should be located in macroAssembler_ppc and
be called from both locations.
What do you think?
Best regards,
Goetz
> -----Original Message-----
> From: Doerr, Martin
> Sent: Donnerstag, 17. Januar 2019 14:18
> To: Gustavo Romero <gromero at linux.vnet.ibm.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>;
> Lindenmaier, Goetz <goetz.lindenmaier at sap.com>
> Subject: RE: RFR(M): 8216060: [PPC64] Vector CRC implementation should be
> used by interpreter and be faster for short arrays
>
> Hi,
>
> the rebased webrev.01 applies on jdk/jdk, now (after JDK-8216376). So the
> issue Gustavo had observed does not longer exist.
> http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.01/
>
> I have updated copyrights and retested it.
>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> Sent: Montag, 7. Januar 2019 14:52
> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should be
> used by interpreter and be faster for short arrays
>
> Hi Martin,
>
> On 01/07/2019 11:49 AM, Doerr, Martin wrote:
> > I want to check all places where we use "mr(R1_SP, R21_sender_SP)".
> There may be more issues with that. I'll probably handle that in a separate
> change and push this CRC change afterwards.
>
> I see. Thanks for letting me know.
>
> Best regards,
> Gustavo
>
> > Best regards,
> > Martin
> >
> >
> > -----Original Message-----
> > From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> > Sent: Freitag, 4. Januar 2019 19:55
> > To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should
> be used by interpreter and be faster for short arrays
> >
> > Hi Martin,
> >
> > On 01/04/2019 02:13 PM, Doerr, Martin wrote:
> >> Hi Gustavo,
> >>
> >> when called from the interpreter (the scenario you observed), R21 is set
> before resizing the frame to avoid wasted stack space
> (InterpreterMacroAssembler::call_from_interpreter).
> >
> > Got it. Thanks a lot for the explanations.
> >
> > I think it doesn't currently matter in practice, but I'm wondering if to be
> > consistent we should cut back the stack back earlier also in
> > TemplateInterpreterGenerator::generate_CRC32_update_entry()?
> >
> > diff -r a35f8c35d8c9
> src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp
> > --- a/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp Fri Jan
> 04 10:09:00 2019 +0100
> > +++ b/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp Fri Jan
> 04 13:44:37 2019 -0500
> > @@ -1840,11 +1840,12 @@
> > #endif
> > __ lwz(crc, 2*wordSize, argP); // Current crc state, zero extend to 64
> bit to have a clean register.
> >
> > + // Restore caller sp for c2i case and return.
> > + __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the caller
> started.
> > +
> > StubRoutines::ppc64::generate_load_crc_table_addr(_masm, table);
> > __ kernel_crc32_singleByte(crc, data, dataLen, table, tmp, true);
> >
> > - // Restore caller sp for c2i case and return.
> > - __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the caller
> started.
> > __ blr();
> >
> > // Generate a vanilla native entry as the slow path.
> >
> > Currently there is no issue probably because generated code is simpler and
> does
> > no spills.
> >
> > Best regards,
> > Gustavo
> >
> >> When called from compiled methods, R21 is set by a c2i adapter which
> extends the compiled frame by space for arguments (gen_c2i_adapter).
> >>
> >> "mr(R1_SP, R21_sender_SP)" is more error-prone than
> "resize_frame_absolute" so I think the latter would be better (though it takes
> more registers and instructions), but I don't want to replace that as part of
> this CRC change.
> >>
> >> Best regards,
> >> Martin
> >>
> >>
> >> -----Original Message-----
> >> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> >> Sent: Freitag, 4. Januar 2019 14:44
> >> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> >> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation should
> be used by interpreter and be faster for short arrays
> >>
> >> Hi Martin,
> >>
> >> On 01/04/2019 07:30 AM, Doerr, Martin wrote:
> >>> thank you very much for confirming. This makes sense. We use different
> frame headers depending on whether the frame is the top Java frame or not
> (and on whether it's a debug build or not). Setting R1_SP to sender_SP is a
> shortcut for leaf calls which relies on having an unmodified stack until this
> point. So the patch fixes the issue.
> >>
> >> Glad to help! Thanks for the additional information, I was not aware that
> the
> >> selection of different frame headers could be done at compile time. One
> last
> >> question only for my education: what exactly advanced (incremented)
> R1_SP so it
> >> has to be cut back using sender_SP value, i.e. sender_SP tracks the frame
> for
> >> which function exactly or "who" is the caller exactly here?
> >>
> >> Thank you.
> >>
> >> Best regards,
> >> Gustavo
> >>
> >>> New webrev:
> >>> http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.01/
> >>>
> >>> Best regards,
> >>> Martin
> >>>
> >>>
> >>> -----Original Message-----
> >>> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> >>> Sent: Donnerstag, 3. Januar 2019 19:36
> >>> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> >>> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation
> should be used by interpreter and be faster for short arrays
> >>>
> >>> Hi Martin,
> >>>
> >>> On 01/03/2019 03:34 PM, Doerr, Martin wrote:
> >>>> Unfortunately, I can't reproduce the crash. TestCRC32C works stable on
> our machine (with fastdbg build).
> >>>> I guess that the frameless spills mess up the stack. Can you check if the
> patch below helps?
> >>>
> >>> Thanks for providing a fix so I can try it.
> >>> Yes, I confirm the patch below indeed fixes the sigsegv crash when
> CRC32C update() method is used.
> >>> I also confirm that I don't observe the crash on the fastdebug build, only
> on the release build.
> >>> It also only affects the Interpreter mode, so passing -Xcomp avoids the
> crash on the release build.
> >>>
> >>> Just as reference, I can reproduce it on the release build with the
> following trivial code:
> >>>
> >>> import java.util.zip.CRC32C;
> >>>
> >>> class CRC32C_v1 {
> >>> public static void main(String[] arg) {
> >>> byte[] b = new byte[1024];
> >>>
> >>> CRC32C crc32c = new CRC32C();
> >>> crc32c.update(b, 0, b.length);
> >>>
> >>> System.out.println(crc32c.getValue());
> >>> }
> >>> }
> >>>
> >>> Thanks for fixing the typos.
> >>>
> >>>
> >>> Best regards,
> >>> Gustavo
> >>>
> >>>> Best regards,
> >>>> Martin
> >>>>
> >>>>
> >>>> diff -r a33f49d5998c
> src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp
> >>>> --- a/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp Thu
> Jan 03 17:30:03 2019 +0100
> >>>> +++ b/src/hotspot/cpu/ppc/templateInterpreterGenerator_ppc.cpp
> Thu Jan 03 18:33:16 2019 +0100
> >>>> @@ -1924,6 +1924,9 @@
> >>>> __ addi(data, data,
> arrayOopDesc::base_offset_in_bytes(T_BYTE));
> >>>> }
> >>>>
> >>>> + // Restore caller sp for c2i case.
> >>>> + __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the
> caller started.
> >>>> +
> >>>> StubRoutines::ppc64::generate_load_crc_table_addr(_masm,
> table);
> >>>>
> >>>> if (!VM_Version::has_vpmsumb()) {
> >>>> @@ -1933,8 +1936,6 @@
> >>>> __ kernel_crc32_vpmsum(crc, data, dataLen, table, t0, t1, t2, t3,
> tc0, tc1, tc2, true);
> >>>> }
> >>>>
> >>>> - // Restore caller sp for c2i case and return.
> >>>> - __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the
> caller started.
> >>>> __ blr();
> >>>>
> >>>> // Generate a vanilla native entry as the slow path.
> >>>> @@ -2014,6 +2015,9 @@
> >>>> __ addi(data, data,
> arrayOopDesc::base_offset_in_bytes(T_BYTE));
> >>>> }
> >>>>
> >>>> + // Restore caller sp for c2i case.
> >>>> + __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the
> caller started.
> >>>> +
> >>>> StubRoutines::ppc64::generate_load_crc32c_table_addr(_masm,
> table);
> >>>>
> >>>> if (!VM_Version::has_vpmsumb()) {
> >>>> @@ -2023,8 +2027,6 @@
> >>>> __ kernel_crc32_vpmsum(crc, data, dataLen, table, t0, t1, t2, t3,
> tc0, tc1, tc2, false);
> >>>> }
> >>>>
> >>>> - // Restore caller sp for c2i case and return.
> >>>> - __ mr(R1_SP, R21_sender_SP); // Cut the stack back to where the
> caller started.
> >>>> __ blr();
> >>>>
> >>>> BLOCK_COMMENT("} CRC32C_update{Bytes|DirectByteBuffer}");
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: Gustavo Romero <gromero at linux.vnet.ibm.com>
> >>>> Sent: Donnerstag, 3. Januar 2019 17:13
> >>>> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> >>>> Subject: Re: RFR(M): 8216060: [PPC64] Vector CRC implementation
> should be used by interpreter and be faster for short arrays
> >>>>
> >>>> Hi Martin,
> >>>>
> >>>> oh that's nice. You removed the 512-byte block constraint and also
> wired it up to the Interpreter :)
> >>>>
> >>>> For the worst case, unaligned 512 byte array, I see the gap to aligned
> 512 byte array reduced by about ~5.7x.
> >>>>
> >>>> On the Interpreter I see an improvement of at least 50% for 1024 bytes.
> >>>>
> >>>> This is all for the CRC32 class.
> >>>>
> >>>> On CRC32C I'm getting a SIGSEV that can be reproduced running against
> ./test/hotspot/jtreg/compiler/intrinsics/zip/TestCRC32C.java.
> >>>>
> >>>> I've upload a full log into
> http://cr.openjdk.java.net/~gromero/logs/crc32c_sigsegv/
> >>>>
> >>>> I'm leaving for the lunch and I'll take a closer look when back. But
> probably you will figure it out before I sit to appreciate the meal :)
> >>>>
> >>>> Finally, since the change does some cleanup, I wonder if it would be
> worth fixing the following typos:
> >>>>
> >>>> I think it's Barrett const., not Barret. Probably 'barret' is used in the
> code as a short version
> >>>> for Barrett but it should be changed in
> >>>>
> >>>> + // Point to Barret constants
> >>>> + add_const_optimized(cur_const, constants, outer_consts_size +
> inner_consts_size);
> >>>> +
> >>>>
> >>>> ?
> >>>>
> >>>> s/not/note/ in:
> >>>> cpu/ppc/macroAssembler_ppc.cpp:3977:// A not on the lookup table
> address(es):
> >>>>
> >>>> d/lives/ in:
> >>>> cpu/ppc/macroAssembler_ppc.cpp:4265: mtvrwz(VCRC, crc); // crc
> lives lives in VCRC, now
> >>>>
> >>>> Best regards,
> >>>> Gustavo
> >>>>
> >>>> On 01/03/2019 12:17 PM, Doerr, Martin wrote:
> >>>>> Hi,
> >>>>>
> >>>>> the JVM on PPC64 currently misses usage of the fast vector
> implementation in the interpreter code.
> >>>>>
> >>>>> In addition, performance is not good for short arrays (unaligned 512
> byte arrays or shorter arrays) because the current vector implementation
> needs at least 512 bytes.
> >>>>>
> >>>>> Bug:
> >>>>>
> >>>>> https://bugs.openjdk.java.net/browse/JDK-8216060
> >>>>>
> >>>>> I have addressed these 2 issues + some cleanup with the following
> webrev:
> >>>>>
> >>>>> http://cr.openjdk.java.net/~mdoerr/8216060_PPC64_CRC/webrev.00/
> <http://cr.openjdk.java.net/%7Emdoerr/8216060_PPC64_CRC/webrev.00/>
> >>>>>
> >>>>> Please review.
> >>>>>
> >>>>> Best regards,
> >>>>>
> >>>>> Martin
> >>>>>
> >>>>
> >>>
> >>
> >
>
More information about the hotspot-compiler-dev
mailing list