[10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics
Gustavo Serra Scalet
gustavo.scalet at eldorado.org.br
Wed Sep 6 12:32:20 UTC 2017
Thanks Goetz.
Could somebody sponsor this change?
THanks
> -----Original Message-----
> From: Lindenmaier, Goetz [mailto:goetz.lindenmaier at sap.com]
> Sent: quarta-feira, 6 de setembro de 2017 03:30
> To: Gustavo Serra Scalet <gustavo.scalet at eldorado.org.br>; Doerr, Martin
> <martin.doerr at sap.com>; 'hotspot-compiler-dev at openjdk.java.net'
> <hotspot-compiler-dev at openjdk.java.net>
> Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> SquareToLen intrinsics
>
> Hi,
>
> I had a look at this change and tested it. Reviewed.
>
> Best regards,
> Goetz.
>
> > -----Original Message-----
> > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> > Sent: Freitag, 1. September 2017 19:12
> > To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > SquareToLen intrinsics
> >
> > Hi Martin,
> >
> > > -----Original Message-----
> > > From: Doerr, Martin
> > > your first webrev already works on Big Endian. So the only required
> > > change is to fix your new code by this trivial patch:
> > > --- a/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 17:47:45
> 2017
> > > +0200
> > > +++ b/src/cpu/ppc/vm/stubGenerator_ppc.cpp Fri Sep 01 17:55:08
> 2017
> > > +0200
> > > @@ -3426,7 +3426,9 @@
> > > __ srdi (product, product, 1);
> > > // join them to the same register and store it as Little Endian
> > > __ orr (product, lplw_s, product);
> > > +#ifdef VM_LITTLE_ENDIAN
> > > __ rldicl (product, product, 32, 0);
> > > +#endif
> > > __ stdu (product, 8, out_aux);
> > > __ bdnz (LOOP_SQUARE);
> > >
> > > So please enable it again for Big Endian in vm_version_ppc. Besides
> > > that, it looks good to me. We also need a 2nd review.
> >
> > Great! Thanks for checking it and suggesting the diff.
> >
> > I changed these things. You can find it below:
> > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.04/
> >
> > I wonder who could be a 2nd reviewer... Anybody in mind that we may
> ping?
> > Maybe Goetz Lindenmaier?
> >
> > Best Regards,
> > Gustavo Serra Scalet
> >
> > >
> > > Best regards,
> > > Martin
> > >
> > >
> > > -----Original Message-----
> > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> > > Sent: Mittwoch, 30. August 2017 19:03
> > > To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > > dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > SquareToLen intrinsics
> > >
> > > Hi Martin,
> > >
> > > (webrev at the end)
> > >
> > > > -----Original Message-----
> > > > From: Doerr, Martin
> > > >
> > > > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't
> > > > > seem to need further changes as it's being cleared with clrldi,
> > > > > which is the same as rldic with no shift. Therefore it's treated
> > > > > appropriately as requested for "offset" parameter. Do you agree?
> > > >
> > > > No, I didn't find clrldi for len in generate_mulAdd(). Only for k.
> > >
> > > I'm sorry. I was thinking about "offset" and "k", which are both
> > > cleaned on generate_mulAdd(). "len" was not cleaned and it was being
> > > used on
> > > muladd() directly with cmpdi, which could lead to problems.
> > >
> > > That is being changed.
> > >
> > > > Where are in_len and out_len fixed up in generate_squareToLen()?
> > >
> > > They are not. According to your suggestions, I agree it also needs
> > > to be done for the same reason.
> > >
> > > > > You are right. The way I'm building the 64 bits of the register
> > > > > depends on which kind of endianness it is run. For now it works
> > > > > only on little endian so I'm adding a switch (just like I did
> > > > > for SHA) to make it available only on little endian systems.
> > > >
> > > > It shouldn't be that hard to get it working on big endian ;-)
> > > > Btw., my point was not to replace the 2 4-byte store instructions
> > > > by an 8-byte one (though I'm also ok with that). It was that 2
> > > > stwu which update the same pointer doesn't make sense from
> performance point of view.
> > > > Please keep something which works on big endian, too.
> > >
> > > I see. The 2x stwu was being used like that because it was the
> > > trivial approach when considering the original java update:
> > > z[i++] = (lastProductLowWord << 31) | (int)(product >>> 33); z[i++]
> > > = (int)(product >>> 1);
> > >
> > > As you pointed out, that might cause some stall on the pipeline so I
> > > made it with 1s stdu (and could improve code by reducing 1
> > > instruction)
> > >
> > > Now about having a big endian version: I'm not confident in doing so
> > > as I don't have access to such a machine at the moment. You were
> > > kind on offering test support but I don't know if it'd work like
> > > that. I may support you in checking out which places are
> > > endianness-related but I'm not comfortable in sending you untested
> code.
> > >
> > > Would you be interested in doing such a changes for making it work
> > > on Big Endian? For this patch, I provided an interesting test that
> > > might help you to verify if it worked.
> > >
> > > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6
> > > > > 14:15:31 2017). The reported performance speedup was calculated
> > > > > by running the following test (TestSquareToLen.java):
> > > >
> > > > Seems like JDK-8145913 has not been backported, yet. Sorry for not
> > > > checking this earlier. So if you want to make RSA really fast, it
> > > > should be so much better to backport that one. But I can still
> > > > sponsor this change as it may be used elsewhere.
> > >
> > > No problem. It's nice to know that I may not need to request a
> > > backport of this patch for performance reasons.
> > >
> > > And at last, but not least, the new webrev with these clrldi
> changes:
> > > https://gut.github.io/openjdk/webrev/JDK-
> > 8185976/webrev.03/index.html
> > >
> > > Thank you once again,
> > > Gustavo Serra Scalet
> > >
> > > > Best regards,
> > > > Martin
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> > > > Sent: Dienstag, 29. August 2017 22:37
> > > > To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > > > dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > > SquareToLen intrinsics
> > > >
> > > > Hi Martin,
> > > >
> > > > New changes:
> > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/
> > > >
> > > > Check comments below, please.
> > > >
> > > > > -----Original Message-----
> > > > > From: Doerr, Martin
> > > > >
> > > > > 1. Sign extending offset and len Right, sign and zero extending
> > > > > is equivalent for offset and len because they are guaranteed to
> > > > > be >=0 (by checks in Java). But you can only rely on bit 32 (IBM
> > > > > notation) to be 0. Bit 0-31 may contain
> > > > garbage.
> > > > > rldicl was incorrect. My mistake, sorry for that. Correct would
> > > > > be rldic which also clears the least significant bits.
> > > > > len should also get fixed e.g. by replacing cmpdi by extsw_ in
> > > muladd.
> > > >
> > > > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem
> > > > to need further changes as it's being cleared with clrldi, which
> > > > is the same as rldic with no shift. Therefore it's treated
> > > > appropriately as requested for "offset" parameter. Do you agree?
> > > >
> > > > > 2. Using 8 byte instructions for int The code which feeds stdu
> > > > > is endianess specific. Doesn't work on all
> > > > > PPC64 platforms.
> > > >
> > > > You are right. The way I'm building the 64 bits of the register
> > > > depends on which kind of endianness it is run. For now it works
> > > > only on little endian so I'm adding a switch (just like I did for
> > > > SHA) to make it available only on little endian systems.
> > > >
> > > > > 3.Regarding Andrew's point: Superseded by Montgomery?
> > > > > The Montgomery change got backported to jdk8u (JDK-8150152 in
> > > 8u102).
> > > > > I'd expect the performance improvement of these intrinsics to be
> > > > > irrelevant for crypto.rsa. Did you measure with an older jdk8
> > > release?
> > > >
> > > > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6
> > > > 14:15:31 2017). The reported performance speedup was calculated by
> > > > running the following test (TestSquareToLen.java):
> > > > import java.math.BigInteger;
> > > >
> > > > public class TestSquareToLen {
> > > >
> > > > public static void main(String args[]) throws Exception {
> > > >
> > > > int n = 10000000;
> > > > if (args.length >=1) {
> > > > n = Integer.parseInt(args[0]);
> > > > }
> > > >
> > > > BigInteger b1 = new
> > > >
> > BigInteger("34893980923557359086350514982082503920002298311877320859
> > 99
> > > > 36
> > > >
> > 7395594183801021468843071391756049207873137016631559837931214754926
> > 092
> > > > 22
> > > >
> > 3780292110207609223272184808289336630057735969423726808520641030118
> > 116
> > > > 51
> > > >
> > 6440180488338234823908199478965242076358579845520899779963131131540
> > 166
> > > > 68 718795349783157384006672542605760392289645528307");
> > > > BigInteger b2 = BigInteger.valueOf(0);
> > > > BigInteger check = BigInteger.valueOf(1);
> > > > for (int i = 0; i < n; i++) {
> > > > b2 = b1.multiply(b1);
> > > > if (i == 0)
> > > > // Didn't JIT yet. Comparing against interpreted mode
> > > > check = b2;
> > > > }
> > > > if (b2.compareTo(check) == 0)
> > > > System.out.println("Check ok!");
> > > > else
> > > > System.out.println("Check failed!");
> > > > }
> > > > }
> > > >
> > > >
> > > > I got these results on JDK8 on my POWER8 machine:
> > > > $ ./javac TestSquareToLen.java
> > > > $ sudo perf stat -r 5 ./java -XX:-UseMulAddIntrinsic -XX:-
> > > > UseSquareToLenIntrinsic TestSquareToLen Check ok!
> > > > Check ok!
> > > > Check ok!
> > > > Check ok!
> > > > Check ok!
> > > >
> > > > Performance counter stats for './java -XX:-UseMulAddIntrinsic
> > > > -XX:- UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
> > > >
> > > > 15148.009557 task-clock (msec) # 1.053 CPUs
> > > > utilized ( +- 0.48% )
> > > > 2,425 context-switches # 0.160 K/sec
> > > > ( +- 5.84% )
> > > > 356 cpu-migrations # 0.023 K/sec
> > > > ( +- 3.01% )
> > > > 5,153 page-faults # 0.340 K/sec
> > > > ( +- 5.22% )
> > > > 54,536,889,909 cycles # 3.600 GHz
> > > > ( +- 0.56% ) (66.68%)
> > > > 239,554,105 stalled-cycles-frontend # 0.44%
> frontend
> > > > cycles idle ( +- 4.87% ) (49.90%)
> > > > 27,683,316,001 stalled-cycles-backend # 50.76%
> backend
> > > > cycles idle ( +- 0.56% ) (50.17%)
> > > > 102,020,229,733 instructions # 1.87 insn
> per
> > > > cycle
> > > > # 0.27
> stalled
> > > > cycles per insn ( +- 0.14% ) (66.94%)
> > > > 7,706,072,218 branches # 508.718 M/sec
> > > > ( +- 0.23% ) (50.20%)
> > > > 456,051,162 branch-misses # 5.92% of
> all
> > > > branches ( +- 0.09% ) (50.07%)
> > > >
> > > > 14.390840733 seconds time elapsed ( +- 0.09% )
> > > >
> > > > $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic -
> > > > XX:+UseSquareToLenIntrinsic TestSquareToLen Check ok!
> > > > Check ok!
> > > > Check ok!
> > > > Check ok!
> > > > Check ok!
> > > >
> > > > Performance counter stats for './java -XX:+UseMulAddIntrinsic -
> > > > XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
> > > >
> > > > 11368.141410 task-clock (msec) # 1.045 CPUs
> > > > utilized ( +- 0.64% )
> > > > 1,964 context-switches # 0.173 K/sec
> > > > ( +- 8.93% )
> > > > 338 cpu-migrations # 0.030 K/sec
> > > > ( +- 7.65% )
> > > > 5,627 page-faults # 0.495 K/sec
> > > > ( +- 6.15% )
> > > > 41,100,168,967 cycles # 3.615 GHz
> > > > ( +- 0.50% ) (66.36%)
> > > > 309,052,316 stalled-cycles-frontend # 0.75%
> frontend
> > > > cycles idle ( +- 2.84% ) (49.89%)
> > > > 14,188,581,685 stalled-cycles-backend # 34.52%
> backend
> > > > cycles idle ( +- 0.99% ) (50.34%)
> > > > 77,846,029,829 instructions # 1.89 insn
> per
> > > > cycle
> > > > # 0.18
> stalled
> > > > cycles per insn ( +- 0.29% ) (66.96%)
> > > > 8,435,216,989 branches # 742.005 M/sec
> > > > ( +- 0.28% ) (50.17%)
> > > > 339,903,936 branch-misses # 4.03% of
> all
> > > > branches ( +- 0.27% ) (49.90%)
> > > >
> > > > 10.882357546 seconds time elapsed ( +- 0.24% )
> > > >
> > > >
> > > > (out of curiosity, these numbers are 15.19s (+- 0.32%) and 13.42s
> > > > (+-
> > > > 0.53%) on JDK10)
> > > >
> > > > I may run for SpecJVM2008's crypto.rsa if you are interested.
> > > >
> > > > Thank you once again for reviewing this.
> > > >
> > > > Best regards,
> > > > Gustavo
> > > >
> > > > > (I think the change is still acceptable as the intrinsics could
> > > > > be used elsewhere and the implementation also exists on other
> > > > > platforms.)
> > > > >
> > > > > Best regards,
> > > > > Martin
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Gustavo Serra Scalet
> > > > > [mailto:gustavo.scalet at eldorado.org.br]
> > > > > Sent: Mittwoch, 16. August 2017 18:50
> > > > > To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > > > > dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > > > SquareToLen intrinsics
> > > > >
> > > > > Hi Martin,
> > > > >
> > > > > Thanks for dedicated review. It took me a while to be able to
> > > > > work on this but I hope to have your points solved. Please check
> > > > > below the review as well as my comments quoting your email:
> > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/
> > > > >
> > > > > > -----Original Message-----
> > > > > > First of all, C2 does not perform sign extend when calling
> stubs.
> > > > > > The int parms need to get zero/sign extended. (Could even be
> > > > > > done without extra instructions by replacing sldi -> rldicl,
> > > > > > cmpdi -> extsw_ in some
> > > > > > cases.)
> > > > >
> > > > > Does it make a difference on my case?
> > > > >
> > > > > I guess you are talking about mulAdd preparation code. The only
> > > > > aspect I found about him is to force the cast from 32 bits -> 64
> > > > > bits by cleaning higher bits. Offset is a signed integer but it
> > > > > can't be
> > > > negative anyway.
> > > > >
> > > > > So I changed from:
> > > > > sldi (R5_ARG3, R5_ARG3, 2);
> > > > >
> > > > > to:
> > > > > rldicl (R5_ARG3, R5_ARG3, 2, 32); // always positive
> > > > >
> > > > >
> > > > > > macroAssembler_ppc.cpp:
> > > > > > - Indentation should be 2 spaces.
> > > > >
> > > > > Done
> > > > >
> > > > >
> > > > > > stubGenerator_ppc:cpp:
> > > > > > - or_, addi_ should get replaced by orr, addi when CR0 result
> > > > > > is not needed.
> > > > >
> > > > > Done
> > > > >
> > > > > > - Where is lplw initialized?
> > > > >
> > > > > It should be initialized with 0, I missed that...
> > > > >
> > > > > > - I believe that the updating load/store instructions e.g.
> > > > > > lwzu don't perform well on some processors. At least using
> > > > > > stwu 2 times in the loop doesn't make sense.
> > > > >
> > > > > You are right. I could manipulate the bits differently and ended
> > > > > up with a single stdu in the loop. Neat! Although I could not
> > > > > reduce the total number of instructions.
> > > > >
> > > > > > - Note: It should be possible to use 8 byte instead of 4 byte
> > > > > > instructions: MacroAssembler::multiply64, addc, adde. But I'm
> > > > > > not requesting to change that because I guess it would make
> > > > > > the code very complicated, especially when supporting both
> > > > > > endianess
> > > > versions.
> > > > >
> > > > > Yes, that would require a new analysis on this code. May we
> > > > > consider it next? As you said, I prefer having an initial
> > > > > version that looks as simple as the original java code.
> > > > >
> > > > > > - The squareToLen stub implementation is very close the Java
> > > > > > implementation. So it'd be interesting to understand what C2
> > > > > > doesn't do as well as the hand written assembly code. Do you
> > > > > > know that? (Not absolutely necessary for accepting this change
> > > > > > as long as the stub is measurably faster.)
> > > > >
> > > > > I don't know either. Basically I chose doing it because I
> > > > > noticed some performance gain on SpecJVM2008 when analyzing X64.
> > > > > Then, taking a closer look, I didn't notice any AVX or some
> > > > > special instructions on
> > > > > X64 so I decided to try it on ppc64 by using some basic
> assembly.
> > > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > Best regards,
> > > > > > Martin
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> > > > > > Sent: Donnerstag, 10. August 2017 19:22
> > > > > > To: 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-
> > > > > > dev at openjdk.java.net>
> > > > > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > > > > SquareToLen intrinsics
> > > > > >
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-
> > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> > > > > > Sent: terça-feira, 8 de agosto de 2017 17:19
> > > > > > To: ppc-aix-port-dev at openjdk.java.net
> > > > > > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > > > > SquareToLen intrinsics
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Could you please review this specific PPC64 change to hotspot?
> > > > > > By implementing these intrinsics I noticed a small improvement
> > > > > > with microbenchmarks analysis. On SpecJVM2008's crypto.rsa
> > > > > > benchmark, only when backporting to JDK8 an improvement was
> noticed.
> > > > > >
> > > > > > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
> > > > > > Webrev: https://gut.github.io/openjdk/webrev/JDK-
> > 8185976/webrev/
> > > > > >
> > > > > > Motivation for this implementation:
> > > > > > https://twitter.com/ijuma/status/698309312498835457
> > > > > >
> > > > > > Best regards,
> > > > > > Gustavo Serra Scalet
More information about the hotspot-compiler-dev
mailing list