[10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Gustavo Serra Scalet gustavo.scalet at eldorado.org.br
Mon Oct 2 10:53:32 UTC 2017


Sorry, I didn't notice that.

Thanks, have a great week

> -----Original Message-----
> From: Lindenmaier, Goetz [mailto:goetz.lindenmaier at sap.com]
> Sent: sexta-feira, 29 de setembro de 2017 19:48
> To: Gustavo Serra Scalet <gustavo.scalet at eldorado.org.br>; Doerr, Martin
> <martin.doerr at sap.com>
> Cc: 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-
> dev at openjdk.java.net>
> Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> SquareToLen intrinsics
> 
> Hi,
> 
> I pushed it a few days ago:
> http://hg.openjdk.java.net/jdk10/hs/rev/122833427b36
> 
> Cheers,
>   Goetz.
> 
> > -----Original Message-----
> > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> > Sent: Friday, September 29, 2017 11:26 PM
> > To: Doerr, Martin <martin.doerr at sap.com>; Lindenmaier, Goetz
> > <goetz.lindenmaier at sap.com>
> > Cc: 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-
> > dev at openjdk.java.net>
> > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > SquareToLen intrinsics
> >
> > Hi Martin and Goetz,
> >
> > A new webrev updated to the new repo structure was requested and can
> > be viewed below:
> > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.05/
> >
> > PS: changes applied cleanly from old hotspot to new one.
> >
> > Can it be sponsored now?
> >
> > Thanks.
> >
> > > -----Original Message-----
> > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> > > Sent: quarta-feira, 6 de setembro de 2017 09:45
> > > To: Lindenmaier, Goetz <goetz.lindenmaier at sap.com>; Doerr, Martin
> > > <martin.doerr at sap.com>; 'hotspot-compiler-dev at openjdk.java.net'
> > > <hotspot-compiler-dev at openjdk.java.net>
> > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > SquareToLen intrinsics
> > >
> > > Alright, thanks for the instructions. I'll keep that in mind.
> > >
> > > > -----Original Message-----
> > > > From: Lindenmaier, Goetz [mailto:goetz.lindenmaier at sap.com]
> > > > Sent: quarta-feira, 6 de setembro de 2017 09:44
> > > > To: Gustavo Serra Scalet <gustavo.scalet at eldorado.org.br>; Doerr,
> > > > Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > dev at openjdk.java.net'
> > > > <hotspot-compiler-dev at openjdk.java.net>
> > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > > SquareToLen intrinsics
> > > >
> > > > Hi Gustavo,
> > > >
> > > > the repos are all closed. Once they are opened again, you will
> > > > have to merge your change into the new repo structure, post a new
> > > > webrev and only then it can be sponsored.  Me or Martin will
> sponsor it then.
> > > >
> > > > Best regards,
> > > >   Goetz.
> > > >
> > > > > -----Original Message-----
> > > > > From: Gustavo Serra Scalet
> > > > > [mailto:gustavo.scalet at eldorado.org.br]
> > > > > Sent: Mittwoch, 6. September 2017 14:32
> > > > > To: Lindenmaier, Goetz <goetz.lindenmaier at sap.com>; Doerr,
> > > > > Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net'
> > > > > <hotspot-compiler-dev at openjdk.java.net>
> > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > > > SquareToLen intrinsics
> > > > >
> > > > > Thanks Goetz.
> > > > >
> > > > > Could somebody sponsor this change?
> > > > >
> > > > > THanks
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Lindenmaier, Goetz [mailto:goetz.lindenmaier at sap.com]
> > > > > > Sent: quarta-feira, 6 de setembro de 2017 03:30
> > > > > > To: Gustavo Serra Scalet <gustavo.scalet at eldorado.org.br>;
> > > > > > Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > > > dev at openjdk.java.net'
> > > > > > <hotspot-compiler-dev at openjdk.java.net>
> > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > > > > SquareToLen intrinsics
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I had a look at this change and tested it.  Reviewed.
> > > > > >
> > > > > > Best regards,
> > > > > >   Goetz.
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> > > > > > > Sent: Freitag, 1. September 2017 19:12
> > > > > > > To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > > > > > > dev at openjdk.java.net'
> > > > > > > <hotspot-compiler-dev at openjdk.java.net>
> > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd
> > > > > > > and SquareToLen intrinsics
> > > > > > >
> > > > > > > Hi Martin,
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Doerr, Martin
> > > > > > > > your first webrev already works on Big Endian. So the only
> > > > > > > > required change is to fix your new code by this trivial
> patch:
> > > > > > > > --- a/src/cpu/ppc/vm/stubGenerator_ppc.cpp      Fri Sep 01
> > > > 17:47:45
> > > > > > 2017
> > > > > > > > +0200
> > > > > > > > +++ b/src/cpu/ppc/vm/stubGenerator_ppc.cpp      Fri Sep 01
> > > > 17:55:08
> > > > > > 2017
> > > > > > > > +0200
> > > > > > > > @@ -3426,7 +3426,9 @@
> > > > > > > >      __ srdi   (product,   product,   1);
> > > > > > > >      // join them to the same register and store it as
> > > > > > > > Little
> > > > Endian
> > > > > > > >      __ orr    (product,   lplw_s,    product);
> > > > > > > > +#ifdef VM_LITTLE_ENDIAN
> > > > > > > >      __ rldicl (product,   product,   32, 0);
> > > > > > > > +#endif
> > > > > > > >      __ stdu   (product,   8,         out_aux);
> > > > > > > >      __ bdnz   (LOOP_SQUARE);
> > > > > > > >
> > > > > > > > So please enable it again for Big Endian in
> vm_version_ppc.
> > > > > > > > Besides that, it looks good to me. We also need a 2nd
> review.
> > > > > > >
> > > > > > > Great! Thanks for checking it and suggesting the diff.
> > > > > > >
> > > > > > > I changed these things. You can find it below:
> > > > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.04/
> > > > > > >
> > > > > > > I wonder who could be a 2nd reviewer... Anybody in mind that
> > > > > > > we may
> > > > > > ping?
> > > > > > > Maybe Goetz Lindenmaier?
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > Gustavo  Serra Scalet
> > > > > > >
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Martin
> > > > > > > >
> > > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Gustavo Serra Scalet
> > > > > > > > [mailto:gustavo.scalet at eldorado.org.br]
> > > > > > > > Sent: Mittwoch, 30. August 2017 19:03
> > > > > > > > To: Doerr, Martin <martin.doerr at sap.com>;
> > > > > > > > 'hotspot-compiler- dev at openjdk.java.net'
> > > > > > > > <hotspot-compiler-dev at openjdk.java.net>
> > > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd
> > > > > > > > and SquareToLen intrinsics
> > > > > > > >
> > > > > > > > Hi Martin,
> > > > > > > >
> > > > > > > > (webrev at the end)
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Doerr, Martin
> > > > > > > > >
> > > > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len"
> > > > > > > > > > doesn't seem to need further changes as it's being
> > > > > > > > > > cleared with clrldi, which is the same as rldic with
> no shift.
> > > > > > > > > > Therefore it's treated appropriately as requested for
> > > > "offset" parameter. Do you agree?
> > > > > > > > >
> > > > > > > > > No, I didn't find clrldi for len in generate_mulAdd().
> > > > > > > > > Only
> > > > for k.
> > > > > > > >
> > > > > > > > I'm sorry. I was thinking about "offset" and "k", which
> > > > > > > > are both cleaned on generate_mulAdd(). "len" was not
> > > > > > > > cleaned and it was being used on
> > > > > > > > muladd() directly with cmpdi, which could lead to
> problems.
> > > > > > > >
> > > > > > > > That is being changed.
> > > > > > > >
> > > > > > > > > Where are in_len and out_len fixed up in
> > > > generate_squareToLen()?
> > > > > > > >
> > > > > > > > They are not. According to your suggestions, I agree it
> > > > > > > > also needs to be done for the same reason.
> > > > > > > >
> > > > > > > > > > You are right. The way I'm building the 64 bits of the
> > > > > > > > > > register depends on which kind of endianness it is
> run.
> > > > > > > > > > For now it works only on little endian so I'm adding a
> > > > > > > > > > switch (just like I did for SHA) to make it available
> > > > > > > > > > only on
> > > > little endian systems.
> > > > > > > > >
> > > > > > > > > It shouldn't be that hard to get it working on big
> > > > > > > > > endian
> > > > > > > > > ;-) Btw., my point was not to replace the 2 4-byte store
> > > > > > > > > instructions by an 8-byte one (though I'm also ok with
> > > that).
> > > > > > > > > It was that 2 stwu which update the same pointer doesn't
> > > > > > > > > make sense from
> > > > > > performance point of view.
> > > > > > > > > Please keep something which works on big endian, too.
> > > > > > > >
> > > > > > > > I see. The 2x stwu was being used like that because it was
> > > > > > > > the trivial approach when considering the original java
> update:
> > > > > > > > z[i++] = (lastProductLowWord << 31) | (int)(product >>>
> > > > > > > > 33); z[i++] = (int)(product >>> 1);
> > > > > > > >
> > > > > > > > As you pointed out, that might cause some stall on the
> > > > > > > > pipeline so I made it with 1s stdu (and could improve code
> > > > > > > > by reducing 1
> > > > > > > > instruction)
> > > > > > > >
> > > > > > > > Now about having a big endian version: I'm not confident
> > > > > > > > in doing so as I don't have access to such a machine at
> > > > > > > > the
> > > moment.
> > > > > > > > You were kind on offering test support but I don't know if
> > > > > > > > it'd work like that. I may support you in checking out
> > > > > > > > which places are endianness-related but I'm not
> > > > > > > > comfortable in sending you untested
> > > > > > code.
> > > > > > > >
> > > > > > > > Would you be interested in doing such a changes for making
> > > > > > > > it work on Big Endian? For this patch, I provided an
> > > > > > > > interesting test that might help you to verify if it
> worked.
> > > > > > > >
> > > > > > > > > > No, I used the jdk8u152-b01 (State of repository at
> > > > > > > > > > Thu Apr
> > > > > > > > > > 6
> > > > > > > > > > 14:15:31 2017). The reported performance speedup was
> > > > > > > > > > calculated by running the following test
> > > > (TestSquareToLen.java):
> > > > > > > > >
> > > > > > > > > Seems like JDK-8145913 has not been backported, yet.
> > > > > > > > > Sorry for not checking this earlier. So if you want to
> > > > > > > > > make RSA really fast, it should be so much better to
> > > > > > > > > backport that one. But I can still sponsor this change
> > > > > > > > > as it may be used
> > > elsewhere.
> > > > > > > >
> > > > > > > > No problem. It's nice to know that I may not need to
> > > > > > > > request a backport of this patch for performance reasons.
> > > > > > > >
> > > > > > > > And at last, but not least, the new webrev with these
> > > > > > > > clrldi
> > > > > > changes:
> > > > > > > > https://gut.github.io/openjdk/webrev/JDK-
> > > > > > > 8185976/webrev.03/index.html
> > > > > > > >
> > > > > > > > Thank you once again,
> > > > > > > > Gustavo Serra Scalet
> > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Martin
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Gustavo Serra Scalet
> > > > > > > > > [mailto:gustavo.scalet at eldorado.org.br]
> > > > > > > > > Sent: Dienstag, 29. August 2017 22:37
> > > > > > > > > To: Doerr, Martin <martin.doerr at sap.com>;
> > > > > > > > > 'hotspot-compiler- dev at openjdk.java.net'
> > > > > > > > > <hotspot-compiler-dev at openjdk.java.net>
> > > > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement
> > > > > > > > > MulAdd and SquareToLen intrinsics
> > > > > > > > >
> > > > > > > > > Hi Martin,
> > > > > > > > >
> > > > > > > > > New changes:
> > > > > > > > > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.
> > > > > > > > > 02/
> > > > > > > > >
> > > > > > > > > Check comments below, please.
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Doerr, Martin
> > > > > > > > > >
> > > > > > > > > > 1. Sign extending offset and len Right, sign and zero
> > > > > > > > > > extending is equivalent for offset and len because
> > > > > > > > > > they are guaranteed to be >=0 (by checks in Java). But
> > > > > > > > > > you can only rely on bit 32 (IBM
> > > > > > > > > > notation) to be 0. Bit 0-31 may contain
> > > > > > > > > garbage.
> > > > > > > > > > rldicl was incorrect. My mistake, sorry for that.
> > > > > > > > > > Correct would be rldic which also clears the least
> > > > > > > > > > significant
> > > bits.
> > > > > > > > > > len should also get fixed e.g. by replacing cmpdi by
> > > > > > > > > > extsw_ in
> > > > > > > > muladd.
> > > > > > > > >
> > > > > > > > > The s/rldicl/rldic/ was fixed for "offset", but "len"
> > > > > > > > > doesn't seem to need further changes as it's being
> > > > > > > > > cleared with clrldi, which is the same as rldic with no
> shift.
> > > > > > > > > Therefore it's treated appropriately as requested for
> > > "offset"
> > > > parameter. Do you agree?
> > > > > > > > >
> > > > > > > > > > 2. Using 8 byte instructions for int The code which
> > > > > > > > > > feeds stdu is endianess specific. Doesn't work on all
> > > > > > > > > > PPC64 platforms.
> > > > > > > > >
> > > > > > > > > You are right. The way I'm building the 64 bits of the
> > > > > > > > > register depends on which kind of endianness it is run.
> > > > > > > > > For now it works only on little endian so I'm adding a
> > > > > > > > > switch (just like I did for
> > > > > > > > > SHA) to make it available only on little endian systems.
> > > > > > > > >
> > > > > > > > > > 3.Regarding Andrew's point: Superseded by Montgomery?
> > > > > > > > > > The Montgomery change got backported to jdk8u (JDK-
> > 8150152
> > > > > > > > > > in
> > > > > > > > 8u102).
> > > > > > > > > > I'd expect the performance improvement of these
> > > > > > > > > > intrinsics to be irrelevant for crypto.rsa. Did you
> > > > > > > > > > measure with an older jdk8
> > > > > > > > release?
> > > > > > > > >
> > > > > > > > > No, I used the jdk8u152-b01 (State of repository at Thu
> > > > > > > > > Apr
> > > > > > > > > 6
> > > > > > > > > 14:15:31 2017). The reported performance speedup was
> > > > > > > > > calculated by running the following test
> > > > (TestSquareToLen.java):
> > > > > > > > > import java.math.BigInteger;
> > > > > > > > >
> > > > > > > > > public class TestSquareToLen {
> > > > > > > > >
> > > > > > > > >     public static void main(String args[]) throws
> > > > > > > > > Exception {
> > > > > > > > >
> > > > > > > > >       int n = 10000000;
> > > > > > > > >       if (args.length >=1) {
> > > > > > > > >         n = Integer.parseInt(args[0]);
> > > > > > > > >       }
> > > > > > > > >
> > > > > > > > >       BigInteger b1 = new
> > > > > > > > >
> > > > > > >
> > > > >
> > BigInteger("3489398092355735908635051498208250392000229831187732
> > 0859
> > > > > > > 99
> > > > > > > > > 36
> > > > > > > > >
> > > > > > >
> > > > >
> > 73955941838010214688430713917560492078731370166315598379312147
> > 54926
> > > > > > > 092
> > > > > > > > > 22
> > > > > > > > >
> > > > > > >
> > > > >
> > 37802921102076092232721848082893366300577359694237268085206410
> > 30118
> > > > > > > 116
> > > > > > > > > 51
> > > > > > > > >
> > > > > > >
> > > > >
> > 64401804883382348239081994789652420763585798455208997799631311
> > 31540
> > > > > > > 166
> > > > > > > > > 68 718795349783157384006672542605760392289645528307");
> > > > > > > > >       BigInteger b2 = BigInteger.valueOf(0);
> > > > > > > > >       BigInteger check = BigInteger.valueOf(1);
> > > > > > > > >       for (int i = 0; i < n; i++) {
> > > > > > > > >         b2 = b1.multiply(b1);
> > > > > > > > >         if (i == 0)
> > > > > > > > >           // Didn't JIT yet. Comparing against
> > > > > > > > > interpreted
> > > > mode
> > > > > > > > >           check = b2;
> > > > > > > > >       }
> > > > > > > > >       if (b2.compareTo(check) == 0)
> > > > > > > > >         System.out.println("Check ok!");
> > > > > > > > >       else
> > > > > > > > >         System.out.println("Check failed!");
> > > > > > > > >    }
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I got these results on JDK8 on my POWER8 machine:
> > > > > > > > > $ ./javac TestSquareToLen.java $ sudo perf stat -r 5
> > > > > > > > > ./java -XX:-UseMulAddIntrinsic -XX:-
> > > > > > > > > UseSquareToLenIntrinsic TestSquareToLen Check ok!
> > > > > > > > > Check ok!
> > > > > > > > > Check ok!
> > > > > > > > > Check ok!
> > > > > > > > > Check ok!
> > > > > > > > >
> > > > > > > > >  Performance counter stats for './java
> > > > > > > > > -XX:-UseMulAddIntrinsic
> > > > > > > > > -XX:- UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
> > > > > > > > >
> > > > > > > > >       15148.009557      task-clock (msec)         #
> 1.053
> > > > CPUs
> > > > > > > > > utilized            ( +-  0.48% )
> > > > > > > > >              2,425      context-switches          #
> 0.160
> > > > K/sec
> > > > > > > > > ( +-  5.84% )
> > > > > > > > >                356      cpu-migrations            #
> 0.023
> > > > K/sec
> > > > > > > > > ( +-  3.01% )
> > > > > > > > >              5,153      page-faults               #
> 0.340
> > > > K/sec
> > > > > > > > > ( +-  5.22% )
> > > > > > > > >     54,536,889,909      cycles                    #
> 3.600
> > > > GHz
> > > > > > > > > ( +-  0.56% )  (66.68%)
> > > > > > > > >        239,554,105      stalled-cycles-frontend   #
> 0.44%
> > > > > > frontend
> > > > > > > > > cycles idle     ( +-  4.87% )  (49.90%)
> > > > > > > > >     27,683,316,001      stalled-cycles-backend    #
> 50.76%
> > > > > > backend
> > > > > > > > > cycles idle      ( +-  0.56% )  (50.17%)
> > > > > > > > >    102,020,229,733      instructions              #
> 1.87
> > > > insn
> > > > > > per
> > > > > > > > > cycle
> > > > > > > > >                                                   #
> 0.27
> > > > > > stalled
> > > > > > > > > cycles per insn  ( +-  0.14% )  (66.94%)
> > > > > > > > >      7,706,072,218      branches                  #
> 508.718
> > > > M/sec
> > > > > > > > > ( +-  0.23% )  (50.20%)
> > > > > > > > >        456,051,162      branch-misses             #
> 5.92%
> > > > of
> > > > > > all
> > > > > > > > > branches          ( +-  0.09% )  (50.07%)
> > > > > > > > >
> > > > > > > > >       14.390840733 seconds time elapsed ( +-  0.09% )
> > > > > > > > >
> > > > > > > > > $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic -
> > > > > > > > > XX:+UseSquareToLenIntrinsic TestSquareToLen Check ok!
> > > > > > > > > Check ok!
> > > > > > > > > Check ok!
> > > > > > > > > Check ok!
> > > > > > > > > Check ok!
> > > > > > > > >
> > > > > > > > >  Performance counter stats for './java
> > > > > > > > > -XX:+UseMulAddIntrinsic
> > > > > > > > > - XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
> > > > > > > > >
> > > > > > > > >       11368.141410      task-clock (msec)         #
> 1.045
> > > > CPUs
> > > > > > > > > utilized            ( +-  0.64% )
> > > > > > > > >              1,964      context-switches          #
> 0.173
> > > > K/sec
> > > > > > > > > ( +-  8.93% )
> > > > > > > > >                338      cpu-migrations            #
> 0.030
> > > > K/sec
> > > > > > > > > ( +-  7.65% )
> > > > > > > > >              5,627      page-faults               #
> 0.495
> > > > K/sec
> > > > > > > > > ( +-  6.15% )
> > > > > > > > >     41,100,168,967      cycles                    #
> 3.615
> > > > GHz
> > > > > > > > > ( +-  0.50% )  (66.36%)
> > > > > > > > >        309,052,316      stalled-cycles-frontend   #
> 0.75%
> > > > > > frontend
> > > > > > > > > cycles idle     ( +-  2.84% )  (49.89%)
> > > > > > > > >     14,188,581,685      stalled-cycles-backend    #
> 34.52%
> > > > > > backend
> > > > > > > > > cycles idle      ( +-  0.99% )  (50.34%)
> > > > > > > > >     77,846,029,829      instructions              #
> 1.89
> > > > insn
> > > > > > per
> > > > > > > > > cycle
> > > > > > > > >                                                   #
> 0.18
> > > > > > stalled
> > > > > > > > > cycles per insn  ( +-  0.29% )  (66.96%)
> > > > > > > > >      8,435,216,989      branches                  #
> 742.005
> > > > M/sec
> > > > > > > > > ( +-  0.28% )  (50.17%)
> > > > > > > > >        339,903,936      branch-misses             #
> 4.03%
> > > > of
> > > > > > all
> > > > > > > > > branches          ( +-  0.27% )  (49.90%)
> > > > > > > > >
> > > > > > > > >       10.882357546 seconds time elapsed ( +-  0.24% )
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > (out of curiosity, these numbers are 15.19s (+- 0.32%)
> > > > > > > > > and 13.42s
> > > > > > > > > (+-
> > > > > > > > > 0.53%) on JDK10)
> > > > > > > > >
> > > > > > > > > I may run for SpecJVM2008's crypto.rsa if you are
> > > interested.
> > > > > > > > >
> > > > > > > > > Thank you once again for reviewing this.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Gustavo
> > > > > > > > >
> > > > > > > > > > (I think the change is still acceptable as the
> > > > > > > > > > intrinsics could be used elsewhere and the
> > > > > > > > > > implementation also exists on other
> > > > > > > > > > platforms.)
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Martin
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Gustavo Serra Scalet
> > > > > > > > > > [mailto:gustavo.scalet at eldorado.org.br]
> > > > > > > > > > Sent: Mittwoch, 16. August 2017 18:50
> > > > > > > > > > To: Doerr, Martin <martin.doerr at sap.com>;
> > > > > > > > > > 'hotspot-compiler- dev at openjdk.java.net'
> > > > > > > > > > <hotspot-compiler-dev at openjdk.java.net>
> > > > > > > > > > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement
> > > > > > > > > > MulAdd and SquareToLen intrinsics
> > > > > > > > > >
> > > > > > > > > > Hi Martin,
> > > > > > > > > >
> > > > > > > > > > Thanks for dedicated review. It took me a while to be
> > > > > > > > > > able to work on this but I hope to have your points
> solved.
> > > > > > > > > > Please check below the review as well as my comments
> > > > > > > > > > quoting
> > > > your email:
> > > > > > > > > > https://gut.github.io/openjdk/webrev/JDK-
> > 8185976/webrev.01
> > > > > > > > > > /
> > > > > > > > > >
> > > > > > > > > > > -----Original Message----- First of all, C2 does not
> > > > > > > > > > > perform sign extend when calling
> > > > > > stubs.
> > > > > > > > > > > The int parms need to get zero/sign extended. (Could
> > > > > > > > > > > even be done without extra instructions by replacing
> > > > > > > > > > > sldi -> rldicl, cmpdi -> extsw_ in some
> > > > > > > > > > > cases.)
> > > > > > > > > >
> > > > > > > > > > Does it make a difference on my case?
> > > > > > > > > >
> > > > > > > > > > I guess you are talking about mulAdd preparation code.
> > > > > > > > > > The only aspect I found about him is to force the cast
> > > > > > > > > > from 32 bits -> 64 bits by cleaning higher bits.
> > > > > > > > > > Offset is a signed integer but it can't be
> > > > > > > > > negative anyway.
> > > > > > > > > >
> > > > > > > > > > So I changed from:
> > > > > > > > > > sldi   (R5_ARG3, R5_ARG3, 2);
> > > > > > > > > >
> > > > > > > > > > to:
> > > > > > > > > > rldicl (R5_ARG3, R5_ARG3, 2, 32);  // always positive
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > macroAssembler_ppc.cpp:
> > > > > > > > > > > - Indentation should be 2 spaces.
> > > > > > > > > >
> > > > > > > > > > Done
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > stubGenerator_ppc:cpp:
> > > > > > > > > > > - or_, addi_ should get replaced by orr, addi when
> > > > > > > > > > > CR0 result is not needed.
> > > > > > > > > >
> > > > > > > > > > Done
> > > > > > > > > >
> > > > > > > > > > > - Where is lplw initialized?
> > > > > > > > > >
> > > > > > > > > > It should be initialized with 0, I missed that...
> > > > > > > > > >
> > > > > > > > > > > - I believe that the updating load/store
> > > > > > > > > > > instructions
> > > e.g.
> > > > > > > > > > > lwzu don't perform well on some processors. At least
> > > > > > > > > > > using stwu 2 times in the loop doesn't make sense.
> > > > > > > > > >
> > > > > > > > > > You are right. I could manipulate the bits differently
> > > > > > > > > > and ended up with a single stdu in the loop. Neat!
> > > > > > > > > > Although I could not reduce the total number of
> instructions.
> > > > > > > > > >
> > > > > > > > > > > - Note: It should be possible to use 8 byte instead
> > > > > > > > > > > of 4 byte
> > > > > > > > > > > instructions: MacroAssembler::multiply64, addc,
> adde.
> > > > > > > > > > > But I'm not requesting to change that because I
> > > > > > > > > > > guess it would make the code very complicated,
> > > > > > > > > > > especially when supporting both endianess
> > > > > > > > > versions.
> > > > > > > > > >
> > > > > > > > > > Yes, that would require a new analysis on this code.
> > > > > > > > > > May we consider it next? As you said, I prefer having
> > > > > > > > > > an initial version that looks as simple as the
> > > > > > > > > > original java
> > > code.
> > > > > > > > > >
> > > > > > > > > > > - The squareToLen stub implementation is very close
> > > > > > > > > > > the Java implementation. So it'd be interesting to
> > > > > > > > > > > understand what C2 doesn't do as well as the hand
> > > > > > > > > > > written assembly code. Do you know that? (Not
> > > > > > > > > > > absolutely necessary for accepting this change as
> > > > > > > > > > > long as the stub is measurably
> > > > > > > > > > > faster.)
> > > > > > > > > >
> > > > > > > > > > I don't know either. Basically I chose doing it
> > > > > > > > > > because I noticed some performance gain on SpecJVM2008
> > > > > > > > > > when analyzing
> > > > > X64.
> > > > > > > > > > Then, taking a closer look, I didn't notice any AVX or
> > > > > > > > > > some special instructions on
> > > > > > > > > > X64 so I decided to try it on ppc64 by using some
> > > > > > > > > > basic
> > > > > > assembly.
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Martin
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: hotspot-compiler-dev
> > > > > > > > > > > [mailto:hotspot-compiler-dev-
> > > > > > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra
> > > > > > > > > > > Scalet
> > > > > > > > > > > Sent: Donnerstag, 10. August 2017 19:22
> > > > > > > > > > > To: 'hotspot-compiler-dev at openjdk.java.net'
> > > > > > > > > > > <hotspot-compiler- dev at openjdk.java.net>
> > > > > > > > > > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement
> > > > > > > > > > > MulAdd
> > > > > and
> > > > > > > > > > > SquareToLen intrinsics
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-
> > > > > > > > > > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra
> > > > > > > > > > > Scalet
> > > > > > > > > > > Sent: terça-feira, 8 de agosto de 2017 17:19
> > > > > > > > > > > To: ppc-aix-port-dev at openjdk.java.net
> > > > > > > > > > > Subject: [10] RFR(M): 8185976: PPC64: Implement
> > > > > > > > > > > MulAdd and SquareToLen intrinsics
> > > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > Could you please review this specific PPC64 change
> > > > > > > > > > > to
> > > > hotspot?
> > > > > > > > > > > By implementing these intrinsics I noticed a small
> > > > > > > > > > > improvement with microbenchmarks analysis. On
> > > > > > > > > > > SpecJVM2008's crypto.rsa benchmark, only when
> > > > > > > > > > > backporting to JDK8 an improvement was
> > > > > > noticed.
> > > > > > > > > > >
> > > > > > > > > > > JBS:
> > > > > > > > > > > https://bugs.openjdk.java.net/browse/JDK-8185976
> > > > > > > > > > > Webrev: https://gut.github.io/openjdk/webrev/JDK-
> > > > > > > 8185976/webrev/
> > > > > > > > > > >
> > > > > > > > > > > Motivation for this implementation:
> > > > > > > > > > > https://twitter.com/ijuma/status/698309312498835457
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Gustavo Serra Scalet


More information about the hotspot-compiler-dev mailing list