[10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Tue Aug 29 20:36:34 UTC 2017

Hi Martin,

New changes:
https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/

Check comments below, please.

> -----Original Message-----
> From: Doerr, Martin
> 
> 1. Sign extending offset and len
> Right, sign and zero extending is equivalent for offset and len because
> they are guaranteed to be >=0 (by checks in Java). But you can only rely
> on bit 32 (IBM notation) to be 0. Bit 0-31 may contain garbage.
> rldicl was incorrect. My mistake, sorry for that. Correct would be rldic
> which also clears the least significant bits.
> len should also get fixed e.g. by replacing cmpdi by extsw_ in muladd.

The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem to need further changes as it's being cleared with clrldi, which is the same as rldic with no shift. Therefore it's treated appropriately as requested for "offset" parameter. Do you agree?

> 2. Using 8 byte instructions for int
> The code which feeds stdu is endianess specific. Doesn't work on all
> PPC64 platforms.

You are right. The way I'm building the 64 bits of the register depends on which kind of endianness it is run. For now it works only on little endian so I'm adding a switch (just like I did for SHA) to make it available only on little endian systems.

> 3.Regarding Andrew's point: Superseded by Montgomery?
> The Montgomery change got backported to jdk8u (JDK-8150152  in 8u102).
> I'd expect the performance improvement of these intrinsics to be
> irrelevant for crypto.rsa. Did you measure with an older jdk8 release?

No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 14:15:31 2017). The reported performance speedup was calculated by running the following test (TestSquareToLen.java):
import java.math.BigInteger;

public class TestSquareToLen {

    public static void main(String args[]) throws Exception {

      int n = 10000000;
      if (args.length >=1) {
        n = Integer.parseInt(args[0]);
      }

      BigInteger b1 = new BigInteger("348939809235573590863505149820825039200022983118773208599936739559418380102146884307139175604920787313701663155983793121475492609222378029211020760922327218480828933663005773596942372680852064103011811651644018048833823482390819947896524207635857984552089977996313113154016668718795349783157384006672542605760392289645528307");
      BigInteger b2 = BigInteger.valueOf(0);
      BigInteger check = BigInteger.valueOf(1);
      for (int i = 0; i < n; i++) {
        b2 = b1.multiply(b1);
        if (i == 0)
          // Didn't JIT yet. Comparing against interpreted mode
          check = b2;
      }
      if (b2.compareTo(check) == 0)
        System.out.println("Check ok!");
      else
        System.out.println("Check failed!");
   }
}

I got these results on JDK8 on my POWER8 machine:
$ ./javac TestSquareToLen.java
$ sudo perf stat -r 5 ./java -XX:-UseMulAddIntrinsic -XX:-UseSquareToLenIntrinsic TestSquareToLen
Check ok!
Check ok!
Check ok!
Check ok!
Check ok!

 Performance counter stats for './java -XX:-UseMulAddIntrinsic -XX:-UseSquareToLenIntrinsic TestSquareToLen' (5 runs):

      15148.009557      task-clock (msec)         #    1.053 CPUs utilized            ( +-  0.48% )
             2,425      context-switches          #    0.160 K/sec                    ( +-  5.84% )
               356      cpu-migrations            #    0.023 K/sec                    ( +-  3.01% )
             5,153      page-faults               #    0.340 K/sec                    ( +-  5.22% )
    54,536,889,909      cycles                    #    3.600 GHz                      ( +-  0.56% )  (66.68%)
       239,554,105      stalled-cycles-frontend   #    0.44% frontend cycles idle     ( +-  4.87% )  (49.90%)
    27,683,316,001      stalled-cycles-backend    #   50.76% backend cycles idle      ( +-  0.56% )  (50.17%)
   102,020,229,733      instructions              #    1.87  insn per cycle
                                                  #    0.27  stalled cycles per insn  ( +-  0.14% )  (66.94%)
     7,706,072,218      branches                  #  508.718 M/sec                    ( +-  0.23% )  (50.20%)
       456,051,162      branch-misses             #    5.92% of all branches          ( +-  0.09% )  (50.07%)

      14.390840733 seconds time elapsed                                          ( +-  0.09% )

$ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic -XX:+UseSquareToLenIntrinsic TestSquareToLen
Check ok!
Check ok!
Check ok!
Check ok!
Check ok!                                                                                                                                                                     

 Performance counter stats for './java -XX:+UseMulAddIntrinsic -XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs):

      11368.141410      task-clock (msec)         #    1.045 CPUs utilized            ( +-  0.64% )
             1,964      context-switches          #    0.173 K/sec                    ( +-  8.93% )
               338      cpu-migrations            #    0.030 K/sec                    ( +-  7.65% )
             5,627      page-faults               #    0.495 K/sec                    ( +-  6.15% )
    41,100,168,967      cycles                    #    3.615 GHz                      ( +-  0.50% )  (66.36%)
       309,052,316      stalled-cycles-frontend   #    0.75% frontend cycles idle     ( +-  2.84% )  (49.89%)
    14,188,581,685      stalled-cycles-backend    #   34.52% backend cycles idle      ( +-  0.99% )  (50.34%)
    77,846,029,829      instructions              #    1.89  insn per cycle
                                                  #    0.18  stalled cycles per insn  ( +-  0.29% )  (66.96%)
     8,435,216,989      branches                  #  742.005 M/sec                    ( +-  0.28% )  (50.17%)
       339,903,936      branch-misses             #    4.03% of all branches          ( +-  0.27% )  (49.90%)

      10.882357546 seconds time elapsed                                          ( +-  0.24% )

(out of curiosity, these numbers are 15.19s (+- 0.32%) and 13.42s (+- 0.53%) on JDK10)

I may run for SpecJVM2008's crypto.rsa if you are interested.

Thank you once again for reviewing this.

Best regards,
Gustavo

> (I think the change is still acceptable as the intrinsics could be used
> elsewhere and the implementation also exists on other platforms.)
> 
> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> Sent: Mittwoch, 16. August 2017 18:50
> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> SquareToLen intrinsics
> 
> Hi Martin,
> 
> Thanks for dedicated review. It took me a while to be able to work on
> this but I hope to have your points solved. Please check below the
> review as well as my comments quoting your email:
> https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/
> 
> > -----Original Message-----
> > First of all, C2 does not perform sign extend when calling stubs. The
> > int parms need to get zero/sign extended. (Could even be done without
> > extra instructions by replacing sldi -> rldicl, cmpdi -> extsw_ in
> > some
> > cases.)
> 
> Does it make a difference on my case?
> 
> I guess you are talking about mulAdd preparation code. The only aspect I
> found about him is to force the cast from 32 bits -> 64 bits by cleaning
> higher bits. Offset is a signed integer but it can't be negative anyway.
> 
> So I changed from:
> sldi   (R5_ARG3, R5_ARG3, 2);
> 
> to:
> rldicl (R5_ARG3, R5_ARG3, 2, 32);  // always positive
> 
> 
> > macroAssembler_ppc.cpp:
> > - Indentation should be 2 spaces.
> 
> Done
> 
> 
> > stubGenerator_ppc:cpp:
> > - or_, addi_ should get replaced by orr, addi when CR0 result is not
> > needed.
> 
> Done
> 
> > - Where is lplw initialized?
> 
> It should be initialized with 0, I missed that...
> 
> > - I believe that the updating load/store instructions e.g. lwzu don't
> > perform well on some processors. At least using stwu 2 times in the
> > loop doesn't make sense.
> 
> You are right. I could manipulate the bits differently and ended up with
> a single stdu in the loop. Neat! Although I could not reduce the total
> number of instructions.
> 
> > - Note: It should be possible to use 8 byte instead of 4 byte
> > instructions: MacroAssembler::multiply64, addc, adde. But I'm not
> > requesting to change that because I guess it would make the code very
> > complicated, especially when supporting both endianess versions.
> 
> Yes, that would require a new analysis on this code. May we consider it
> next? As you said, I prefer having an initial version that looks as
> simple as the original java code.
> 
> > - The squareToLen stub implementation is very close the Java
> > implementation. So it'd be interesting to understand what C2 doesn't
> > do as well as the hand written assembly code. Do you know that? (Not
> > absolutely necessary for accepting this change as long as the stub is
> > measurably faster.)
> 
> I don't know either. Basically I chose doing it because I noticed some
> performance gain on SpecJVM2008 when analyzing X64. Then, taking a
> closer look, I didn't notice any AVX or some special instructions on X64
> so I decided to try it on ppc64 by using some basic assembly.
> 
> Thanks
> 
> >
> > Best regards,
> > Martin
> >
> >
> > -----Original Message-----
> > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> > Sent: Donnerstag, 10. August 2017 19:22
> > To: 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-
> > dev at openjdk.java.net>
> > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > SquareToLen intrinsics
> >
> >
> >
> > -----Original Message-----
> > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-
> > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> > Sent: terça-feira, 8 de agosto de 2017 17:19
> > To: ppc-aix-port-dev at openjdk.java.net
> > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen
> > intrinsics
> >
> > Hi,
> >
> > Could you please review this specific PPC64 change to hotspot? By
> > implementing these intrinsics I noticed a small improvement with
> > microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark, only
> > when backporting to JDK8 an improvement was noticed.
> >
> > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
> > Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/
> >
> > Motivation for this implementation:
> > https://twitter.com/ijuma/status/698309312498835457
> >
> > Best regards,
> > Gustavo Serra Scalet