[10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Gustavo Serra Scalet gustavo.scalet at eldorado.org.br
Wed Aug 30 17:02:54 UTC 2017


Hi Martin,

(webrev at the end)

> -----Original Message-----
> From: Doerr, Martin
> 
> > The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem to
> > need further changes as it's being cleared with clrldi, which is the
> > same as rldic with no shift. Therefore it's treated appropriately as
> > requested for "offset" parameter. Do you agree?
> 
> No, I didn't find clrldi for len in generate_mulAdd(). Only for k.

I'm sorry. I was thinking about "offset" and "k", which are both cleaned on generate_mulAdd(). "len" was not cleaned and it was being used on muladd() directly with cmpdi, which could lead to problems. 

That is being changed.

> Where are in_len and out_len fixed up in generate_squareToLen()?

They are not. According to your suggestions, I agree it also needs to be done for the same reason.

> > You are right. The way I'm building the 64 bits of the register
> > depends on which kind of endianness it is run. For now it works only on
> > little endian so I'm adding a switch (just like I did for SHA) to make
> > it available only on little endian systems.
> 
> It shouldn't be that hard to get it working on big endian ;-) Btw., my
> point was not to replace the 2 4-byte store instructions by an 8-byte
> one (though I'm also ok with that). It was that 2 stwu which update the
> same pointer doesn't make sense from performance point of view. Please
> keep something which works on big endian, too.

I see. The 2x stwu was being used like that because it was the trivial approach when considering the original java update:
z[i++] = (lastProductLowWord << 31) | (int)(product >>> 33);
z[i++] = (int)(product >>> 1);

As you pointed out, that might cause some stall on the pipeline so I made it with 1s stdu (and could improve code by reducing 1 instruction)

Now about having a big endian version: I'm not confident in doing so as I don't have access to such a machine at the moment. You were kind on offering test support but I don't know if it'd work like that. I may support you in checking out which places are endianness-related but I'm not comfortable in sending you untested code.

Would you be interested in doing such a changes for making it work on Big Endian? For this patch, I provided an interesting test that might help you to verify if it worked.

> > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 14:15:31
> > 2017). The reported performance speedup was calculated by running the
> > following test (TestSquareToLen.java):
> 
> Seems like JDK-8145913 has not been backported, yet. Sorry for not
> checking this earlier. So if you want to make RSA really fast, it should
> be so much better to backport that one. But I can still sponsor this
> change as it may be used elsewhere.

No problem. It's nice to know that I may not need to request a backport of this patch for performance reasons.

And at last, but not least, the new webrev with these clrldi changes:
https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.03/index.html

Thank you once again,
Gustavo Serra Scalet

> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> Sent: Dienstag, 29. August 2017 22:37
> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> SquareToLen intrinsics
> 
> Hi Martin,
> 
> New changes:
> https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/
> 
> Check comments below, please.
> 
> > -----Original Message-----
> > From: Doerr, Martin
> >
> > 1. Sign extending offset and len
> > Right, sign and zero extending is equivalent for offset and len
> > because they are guaranteed to be >=0 (by checks in Java). But you can
> > only rely on bit 32 (IBM notation) to be 0. Bit 0-31 may contain
> garbage.
> > rldicl was incorrect. My mistake, sorry for that. Correct would be
> > rldic which also clears the least significant bits.
> > len should also get fixed e.g. by replacing cmpdi by extsw_ in muladd.
> 
> The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem to
> need further changes as it's being cleared with clrldi, which is the
> same as rldic with no shift. Therefore it's treated appropriately as
> requested for "offset" parameter. Do you agree?
> 
> > 2. Using 8 byte instructions for int
> > The code which feeds stdu is endianess specific. Doesn't work on all
> > PPC64 platforms.
> 
> You are right. The way I'm building the 64 bits of the register depends
> on which kind of endianness it is run. For now it works only on little
> endian so I'm adding a switch (just like I did for SHA) to make it
> available only on little endian systems.
> 
> > 3.Regarding Andrew's point: Superseded by Montgomery?
> > The Montgomery change got backported to jdk8u (JDK-8150152  in 8u102).
> > I'd expect the performance improvement of these intrinsics to be
> > irrelevant for crypto.rsa. Did you measure with an older jdk8 release?
> 
> No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 14:15:31
> 2017). The reported performance speedup was calculated by running the
> following test (TestSquareToLen.java):
> import java.math.BigInteger;
> 
> public class TestSquareToLen {
> 
>     public static void main(String args[]) throws Exception {
> 
>       int n = 10000000;
>       if (args.length >=1) {
>         n = Integer.parseInt(args[0]);
>       }
> 
>       BigInteger b1 = new
> BigInteger("348939809235573590863505149820825039200022983118773208599936
> 739559418380102146884307139175604920787313701663155983793121475492609222
> 378029211020760922327218480828933663005773596942372680852064103011811651
> 644018048833823482390819947896524207635857984552089977996313113154016668
> 718795349783157384006672542605760392289645528307");
>       BigInteger b2 = BigInteger.valueOf(0);
>       BigInteger check = BigInteger.valueOf(1);
>       for (int i = 0; i < n; i++) {
>         b2 = b1.multiply(b1);
>         if (i == 0)
>           // Didn't JIT yet. Comparing against interpreted mode
>           check = b2;
>       }
>       if (b2.compareTo(check) == 0)
>         System.out.println("Check ok!");
>       else
>         System.out.println("Check failed!");
>    }
> }
> 
> 
> I got these results on JDK8 on my POWER8 machine:
> $ ./javac TestSquareToLen.java
> $ sudo perf stat -r 5 ./java -XX:-UseMulAddIntrinsic -XX:-
> UseSquareToLenIntrinsic TestSquareToLen Check ok!
> Check ok!
> Check ok!
> Check ok!
> Check ok!
> 
>  Performance counter stats for './java -XX:-UseMulAddIntrinsic -XX:-
> UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
> 
>       15148.009557      task-clock (msec)         #    1.053 CPUs
> utilized            ( +-  0.48% )
>              2,425      context-switches          #    0.160 K/sec
> ( +-  5.84% )
>                356      cpu-migrations            #    0.023 K/sec
> ( +-  3.01% )
>              5,153      page-faults               #    0.340 K/sec
> ( +-  5.22% )
>     54,536,889,909      cycles                    #    3.600 GHz
> ( +-  0.56% )  (66.68%)
>        239,554,105      stalled-cycles-frontend   #    0.44% frontend
> cycles idle     ( +-  4.87% )  (49.90%)
>     27,683,316,001      stalled-cycles-backend    #   50.76% backend
> cycles idle      ( +-  0.56% )  (50.17%)
>    102,020,229,733      instructions              #    1.87  insn per
> cycle
>                                                   #    0.27  stalled
> cycles per insn  ( +-  0.14% )  (66.94%)
>      7,706,072,218      branches                  #  508.718 M/sec
> ( +-  0.23% )  (50.20%)
>        456,051,162      branch-misses             #    5.92% of all
> branches          ( +-  0.09% )  (50.07%)
> 
>       14.390840733 seconds time elapsed
> ( +-  0.09% )
> 
> $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic -
> XX:+UseSquareToLenIntrinsic TestSquareToLen Check ok!
> Check ok!
> Check ok!
> Check ok!
> Check ok!
> 
>  Performance counter stats for './java -XX:+UseMulAddIntrinsic -
> XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
> 
>       11368.141410      task-clock (msec)         #    1.045 CPUs
> utilized            ( +-  0.64% )
>              1,964      context-switches          #    0.173 K/sec
> ( +-  8.93% )
>                338      cpu-migrations            #    0.030 K/sec
> ( +-  7.65% )
>              5,627      page-faults               #    0.495 K/sec
> ( +-  6.15% )
>     41,100,168,967      cycles                    #    3.615 GHz
> ( +-  0.50% )  (66.36%)
>        309,052,316      stalled-cycles-frontend   #    0.75% frontend
> cycles idle     ( +-  2.84% )  (49.89%)
>     14,188,581,685      stalled-cycles-backend    #   34.52% backend
> cycles idle      ( +-  0.99% )  (50.34%)
>     77,846,029,829      instructions              #    1.89  insn per
> cycle
>                                                   #    0.18  stalled
> cycles per insn  ( +-  0.29% )  (66.96%)
>      8,435,216,989      branches                  #  742.005 M/sec
> ( +-  0.28% )  (50.17%)
>        339,903,936      branch-misses             #    4.03% of all
> branches          ( +-  0.27% )  (49.90%)
> 
>       10.882357546 seconds time elapsed
> ( +-  0.24% )
> 
> 
> (out of curiosity, these numbers are 15.19s (+- 0.32%) and 13.42s (+-
> 0.53%) on JDK10)
> 
> I may run for SpecJVM2008's crypto.rsa if you are interested.
> 
> Thank you once again for reviewing this.
> 
> Best regards,
> Gustavo
> 
> > (I think the change is still acceptable as the intrinsics could be
> > used elsewhere and the implementation also exists on other platforms.)
> >
> > Best regards,
> > Martin
> >
> >
> > -----Original Message-----
> > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> > Sent: Mittwoch, 16. August 2017 18:50
> > To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > SquareToLen intrinsics
> >
> > Hi Martin,
> >
> > Thanks for dedicated review. It took me a while to be able to work on
> > this but I hope to have your points solved. Please check below the
> > review as well as my comments quoting your email:
> > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/
> >
> > > -----Original Message-----
> > > First of all, C2 does not perform sign extend when calling stubs.
> > > The int parms need to get zero/sign extended. (Could even be done
> > > without extra instructions by replacing sldi -> rldicl, cmpdi ->
> > > extsw_ in some
> > > cases.)
> >
> > Does it make a difference on my case?
> >
> > I guess you are talking about mulAdd preparation code. The only aspect
> > I found about him is to force the cast from 32 bits -> 64 bits by
> > cleaning higher bits. Offset is a signed integer but it can't be
> negative anyway.
> >
> > So I changed from:
> > sldi   (R5_ARG3, R5_ARG3, 2);
> >
> > to:
> > rldicl (R5_ARG3, R5_ARG3, 2, 32);  // always positive
> >
> >
> > > macroAssembler_ppc.cpp:
> > > - Indentation should be 2 spaces.
> >
> > Done
> >
> >
> > > stubGenerator_ppc:cpp:
> > > - or_, addi_ should get replaced by orr, addi when CR0 result is not
> > > needed.
> >
> > Done
> >
> > > - Where is lplw initialized?
> >
> > It should be initialized with 0, I missed that...
> >
> > > - I believe that the updating load/store instructions e.g. lwzu
> > > don't perform well on some processors. At least using stwu 2 times
> > > in the loop doesn't make sense.
> >
> > You are right. I could manipulate the bits differently and ended up
> > with a single stdu in the loop. Neat! Although I could not reduce the
> > total number of instructions.
> >
> > > - Note: It should be possible to use 8 byte instead of 4 byte
> > > instructions: MacroAssembler::multiply64, addc, adde. But I'm not
> > > requesting to change that because I guess it would make the code
> > > very complicated, especially when supporting both endianess
> versions.
> >
> > Yes, that would require a new analysis on this code. May we consider
> > it next? As you said, I prefer having an initial version that looks as
> > simple as the original java code.
> >
> > > - The squareToLen stub implementation is very close the Java
> > > implementation. So it'd be interesting to understand what C2 doesn't
> > > do as well as the hand written assembly code. Do you know that? (Not
> > > absolutely necessary for accepting this change as long as the stub
> > > is measurably faster.)
> >
> > I don't know either. Basically I chose doing it because I noticed some
> > performance gain on SpecJVM2008 when analyzing X64. Then, taking a
> > closer look, I didn't notice any AVX or some special instructions on
> > X64 so I decided to try it on ppc64 by using some basic assembly.
> >
> > Thanks
> >
> > >
> > > Best regards,
> > > Martin
> > >
> > >
> > > -----Original Message-----
> > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> > > Sent: Donnerstag, 10. August 2017 19:22
> > > To: 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-
> > > dev at openjdk.java.net>
> > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > SquareToLen intrinsics
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-
> > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> > > Sent: terça-feira, 8 de agosto de 2017 17:19
> > > To: ppc-aix-port-dev at openjdk.java.net
> > > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > SquareToLen intrinsics
> > >
> > > Hi,
> > >
> > > Could you please review this specific PPC64 change to hotspot? By
> > > implementing these intrinsics I noticed a small improvement with
> > > microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark,
> > > only when backporting to JDK8 an improvement was noticed.
> > >
> > > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
> > > Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/
> > >
> > > Motivation for this implementation:
> > > https://twitter.com/ijuma/status/698309312498835457
> > >
> > > Best regards,
> > > Gustavo Serra Scalet


More information about the hotspot-compiler-dev mailing list