"xdual(xdual()) should be identity" error when backporting Montgomery multiplication to JDK8 (PPC64le)
Gustavo Serra Scalet
gustavo.scalet at eldorado.org.br
Mon Sep 25 18:55:01 UTC 2017
Yup:
Node* call = NULL;
if (CCallingConventionRequiresIntsAsLongs) {
Node* len_I2L = ConvI2L(len);
call = make_runtime_call(RC_LEAF,
OptoRuntime::montgomeryMultiply_Type(),
stubAddr, stubName, TypePtr::BOTTOM,
a_start, b_start, n_start, len_I2L XTOP, inv,
top(), m_start);
} else {
call = make_runtime_call(RC_LEAF,
OptoRuntime::montgomeryMultiply_Type(),
stubAddr, stubName, TypePtr::BOTTOM,
a_start, b_start, n_start, len, inv, top(),
m_start);
}
set_result(m);
Any other thing I may be missing?
-----Original Message-----
From: Doerr, Martin [mailto:martin.doerr at sap.com]
Sent: segunda-feira, 25 de setembro de 2017 13:55
To: Gustavo Serra Scalet <gustavo.scalet at eldorado.org.br>
Cc: jdk8u-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net
Subject: RE: "xdual(xdual()) should be identity" error when backporting Montgomery multiplication to JDK8 (PPC64le)
Hi Gustavo,
Did you also change library_call?
Should be something like (example for multiplyToLen):
Node* call = NULL;
if (CCallingConventionRequiresIntsAsLongs) {
Node* xlen_I2L = ConvI2L(xlen);
Node* ylen_I2L = ConvI2L(ylen);
Node* zlen_I2L = ConvI2L(zlen);
call = make_runtime_call(RC_LEAF|RC_NO_FP,
OptoRuntime::multiplyToLen_Type(),
stubAddr, stubName, TypePtr::BOTTOM,
x_start, xlen_I2L XTOP, y_start,
ylen_I2L XTOP, z_start, zlen_I2L XTOP);
}
else {
call = make_runtime_call(RC_LEAF|RC_NO_FP,
OptoRuntime::multiplyToLen_Type(),
stubAddr, stubName, TypePtr::BOTTOM,
x_start, xlen, y_start, ylen,
z_start, zlen);
}
Best regards,
Martin
-----Original Message-----
From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
Sent: Montag, 25. September 2017 16:03
To: Doerr, Martin <martin.doerr at sap.com>
Cc: jdk8u-dev at openjdk.java.net; ppc-aix-port-dev at openjdk.java.net
Subject: "xdual(xdual()) should be identity" error when backporting Montgomery multiplication to JDK8 (PPC64le)
Hi,
Previously I faced some issues when backporting intrinsics to JDK8[1] but after some help, manage to make it work.
Despite following those rules related to CCallingConvention and Int -> Long conversion, I still face issues not previously noticed on JDK9 when running SpecJVM2008's crypto.rsa bechmark.
[release build]:
Benchmark: crypto.rsa
Run mode: timed run
Test type: multi
Threads: 24
Warmup: 120s
Iterations: 1
Run length: 240s
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00003fffa0cc6f84, pid=3698, tid=0x00003fff7a34f1a0 # # JRE version: OpenJDK Runtime Environment (8.0) (build 1.8.0-internal-gut_2017_04_12_09_19-b00)
# Java VM: OpenJDK 64-Bit Server VM (25.71-b00 mixed mode linux-ppc64 compressed oops) # Problematic frame:
# V [libjvm.so+0x446f84] ConnectionGraph::process_call_arguments(CallNode*)+0x1c4
[slowdebug build]:
Benchmark: crypto.rsa
Run mode: timed run
Test type: multi
Threads: 24
Warmup: 120s
Iterations: 1
Run length: 240s
# To suppress the following error report, specify this argument # after -XX: or in .hotspotrc: SuppressErrorAt=/type.cpp:596 # # A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/home/gut/jdk8u-dev/hotspot/src/share/vm/opto/type.cpp:596), pid=6793, tid=0x00003fff5847f1a0 # assert(eq(dual_dual)) failed: xdual(xdual()) should be identity
Any hints? I see the issue is when calling the last line of this code on runtime.cpp:
const TypeFunc* OptoRuntime::montgomeryMultiply_Type() {
// create input type (domain)
int num_args = 7;
int argcnt = num_args;
const Type** fields = TypeTuple::fields(argcnt);
int argp = TypeFunc::Parms;
fields[argp++] = TypePtr::NOTNULL; // a
fields[argp++] = TypePtr::NOTNULL; // b
fields[argp++] = TypePtr::NOTNULL; // n
if (CCallingConventionRequiresIntsAsLongs) {
argcnt += 1; // additional placeholders
fields[argp++] = TypeLong::LONG; // len
fields[argp++] = TypeLong::HALF; // placeholder
} else {
fields[argp++] = TypeInt::INT; // len
}
fields[argp++] = TypeLong::LONG; // inv
fields[argp++] = Type::HALF;
fields[argp++] = TypePtr::NOTNULL; // result
assert(argp == TypeFunc::Parms+argcnt, "correct decoding");
const TypeTuple* domain = TypeTuple::make(TypeFunc::Parms+argcnt, fields);
Thanks
References:
[1] http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2017-April/002970.html
-----Original Message-----
From: Doerr, Martin [mailto:martin.doerr at sap.com]
Sent: sexta-feira, 15 de setembro de 2017 04:39
To: Gustavo Serra Scalet <gustavo.scalet at eldorado.org.br>
Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics (off-list)
Hi Gustavo,
you can request backport by sending a
[8u-dev] Request for backport approval: 8145913: PPC64: add Montgomery multiply intrinsic to jdk8u-dev at openjdk.java.net
But first of all, you need to check if the change applies to 8u and run tests.
The following information is needed in the email:
Original Bug: https://bugs.openjdk.java.net/browse/JDK-8145913
Jdk9 change: http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/ce0dacc26f3d
Jdk9 review thread: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2015-December/020534.html
Statement: Change applies cleanly or describe which changes were necessary You should also mention that it belongs to 8u102 backport of https://bugs.openjdk.java.net/browse/JDK-8150152
Once it's approved, you can ask a jdk8u reviewer or committer (which I'm not) to push it.
Best regards,
Martin
-----Original Message-----
From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
Sent: Donnerstag, 14. September 2017 21:00
To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics
Hi Martin,
Short question about the Montgomery backport to JDK8:
> > No, I used the jdk8u152-b01 (State of repository at Thu Apr 6
> > 14:15:31 2017). The reported performance speedup was calculated by >
> > running the following test (TestSquareToLen.java):
>
> Seems like JDK-8145913 has not been backported, yet. Sorry for not
> checking this earlier. So if you want to make RSA really fast, it
> should be so much better to backport that one. But I can still sponsor
> this change as it may be used elsewhere.
Do we have a bug ID for the Montgomery backport to JDK8 so I can track it? If not, should I request that backport or request this intrinsic to be backported (once it's inside of JDK10)? I'm interested in this kind of performance gain.
Thanks
>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> Sent: Dienstag, 29. August 2017 22:37
> To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> SquareToLen intrinsics
>
> Hi Martin,
>
> New changes:
> https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.02/
>
> Check comments below, please.
>
> > -----Original Message-----
> > From: Doerr, Martin
> >
> > 1. Sign extending offset and len
> > Right, sign and zero extending is equivalent for offset and len
> > because they are guaranteed to be >=0 (by checks in Java). But you
> > can only rely on bit 32 (IBM notation) to be 0. Bit 0-31 may contain
> garbage.
> > rldicl was incorrect. My mistake, sorry for that. Correct would be
> > rldic which also clears the least significant bits.
> > len should also get fixed e.g. by replacing cmpdi by extsw_ in muladd.
>
> The s/rldicl/rldic/ was fixed for "offset", but "len" doesn't seem to
> need further changes as it's being cleared with clrldi, which is the
> same as rldic with no shift. Therefore it's treated appropriately as
> requested for "offset" parameter. Do you agree?
>
> > 2. Using 8 byte instructions for int The code which feeds stdu is
> > endianess specific. Doesn't work on all
> > PPC64 platforms.
>
> You are right. The way I'm building the 64 bits of the register
> depends on which kind of endianness it is run. For now it works only
> on little endian so I'm adding a switch (just like I did for SHA) to
> make it available only on little endian systems.
>
> > 3.Regarding Andrew's point: Superseded by Montgomery?
> > The Montgomery change got backported to jdk8u (JDK-8150152 in 8u102).
> > I'd expect the performance improvement of these intrinsics to be
> > irrelevant for crypto.rsa. Did you measure with an older jdk8 release?
>
> No, I used the jdk8u152-b01 (State of repository at Thu Apr 6 14:15:31
> 2017). The reported performance speedup was calculated by running the
> following test (TestSquareToLen.java):
> import java.math.BigInteger;
>
> public class TestSquareToLen {
>
> public static void main(String args[]) throws Exception {
>
> int n = 10000000;
> if (args.length >=1) {
> n = Integer.parseInt(args[0]);
> }
>
> BigInteger b1 = new
> BigInteger("3489398092355735908635051498208250392000229831187732085999
> 36
> 7395594183801021468843071391756049207873137016631559837931214754926092
> 22
> 3780292110207609223272184808289336630057735969423726808520641030118116
> 51
> 6440180488338234823908199478965242076358579845520899779963131131540166
> 68 718795349783157384006672542605760392289645528307");
> BigInteger b2 = BigInteger.valueOf(0);
> BigInteger check = BigInteger.valueOf(1);
> for (int i = 0; i < n; i++) {
> b2 = b1.multiply(b1);
> if (i == 0)
> // Didn't JIT yet. Comparing against interpreted mode
> check = b2;
> }
> if (b2.compareTo(check) == 0)
> System.out.println("Check ok!");
> else
> System.out.println("Check failed!");
> }
> }
>
>
> I got these results on JDK8 on my POWER8 machine:
> $ ./javac TestSquareToLen.java
> $ sudo perf stat -r 5 ./java -XX:-UseMulAddIntrinsic -XX:-
> UseSquareToLenIntrinsic TestSquareToLen Check ok!
> Check ok!
> Check ok!
> Check ok!
> Check ok!
>
> Performance counter stats for './java -XX:-UseMulAddIntrinsic -XX:-
> UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
>
> 15148.009557 task-clock (msec) # 1.053 CPUs
> utilized ( +- 0.48% )
> 2,425 context-switches # 0.160 K/sec
> ( +- 5.84% )
> 356 cpu-migrations # 0.023 K/sec
> ( +- 3.01% )
> 5,153 page-faults # 0.340 K/sec
> ( +- 5.22% )
> 54,536,889,909 cycles # 3.600 GHz
> ( +- 0.56% ) (66.68%)
> 239,554,105 stalled-cycles-frontend # 0.44% frontend
> cycles idle ( +- 4.87% ) (49.90%)
> 27,683,316,001 stalled-cycles-backend # 50.76% backend
> cycles idle ( +- 0.56% ) (50.17%)
> 102,020,229,733 instructions # 1.87 insn per
> cycle
> # 0.27 stalled
> cycles per insn ( +- 0.14% ) (66.94%)
> 7,706,072,218 branches # 508.718 M/sec
> ( +- 0.23% ) (50.20%)
> 456,051,162 branch-misses # 5.92% of all
> branches ( +- 0.09% ) (50.07%)
>
> 14.390840733 seconds time elapsed ( +- 0.09% )
>
> $ sudo perf stat -r 5 ./java -XX:+UseMulAddIntrinsic -
> XX:+UseSquareToLenIntrinsic TestSquareToLen Check ok!
> Check ok!
> Check ok!
> Check ok!
> Check ok!
>
> Performance counter stats for './java -XX:+UseMulAddIntrinsic -
> XX:+UseSquareToLenIntrinsic TestSquareToLen' (5 runs):
>
> 11368.141410 task-clock (msec) # 1.045 CPUs
> utilized ( +- 0.64% )
> 1,964 context-switches # 0.173 K/sec
> ( +- 8.93% )
> 338 cpu-migrations # 0.030 K/sec
> ( +- 7.65% )
> 5,627 page-faults # 0.495 K/sec
> ( +- 6.15% )
> 41,100,168,967 cycles # 3.615 GHz
> ( +- 0.50% ) (66.36%)
> 309,052,316 stalled-cycles-frontend # 0.75% frontend
> cycles idle ( +- 2.84% ) (49.89%)
> 14,188,581,685 stalled-cycles-backend # 34.52% backend
> cycles idle ( +- 0.99% ) (50.34%)
> 77,846,029,829 instructions # 1.89 insn per
> cycle
> # 0.18 stalled
> cycles per insn ( +- 0.29% ) (66.96%)
> 8,435,216,989 branches # 742.005 M/sec
> ( +- 0.28% ) (50.17%)
> 339,903,936 branch-misses # 4.03% of all
> branches ( +- 0.27% ) (49.90%)
>
> 10.882357546 seconds time elapsed ( +- 0.24% )
>
>
> (out of curiosity, these numbers are 15.19s (+- 0.32%) and 13.42s (+-
> 0.53%) on JDK10)
>
> I may run for SpecJVM2008's crypto.rsa if you are interested.
>
> Thank you once again for reviewing this.
>
> Best regards,
> Gustavo
>
> > (I think the change is still acceptable as the intrinsics could be
> > used elsewhere and the implementation also exists on other
> > platforms.)
> >
> > Best regards,
> > Martin
> >
> >
> > -----Original Message-----
> > From: Gustavo Serra Scalet [mailto:gustavo.scalet at eldorado.org.br]
> > Sent: Mittwoch, 16. August 2017 18:50
> > To: Doerr, Martin <martin.doerr at sap.com>; 'hotspot-compiler-
> > dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> > Subject: RE: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > SquareToLen intrinsics
> >
> > Hi Martin,
> >
> > Thanks for dedicated review. It took me a while to be able to work
> > on this but I hope to have your points solved. Please check below
> > the review as well as my comments quoting your email:
> > https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/
> >
> > > -----Original Message-----
> > > First of all, C2 does not perform sign extend when calling stubs.
> > > The int parms need to get zero/sign extended. (Could even be done
> > > without extra instructions by replacing sldi -> rldicl, cmpdi ->
> > > extsw_ in some
> > > cases.)
> >
> > Does it make a difference on my case?
> >
> > I guess you are talking about mulAdd preparation code. The only
> > aspect I found about him is to force the cast from 32 bits -> 64
> > bits by cleaning higher bits. Offset is a signed integer but it
> > can't be
> negative anyway.
> >
> > So I changed from:
> > sldi (R5_ARG3, R5_ARG3, 2);
> >
> > to:
> > rldicl (R5_ARG3, R5_ARG3, 2, 32); // always positive
> >
> >
> > > macroAssembler_ppc.cpp:
> > > - Indentation should be 2 spaces.
> >
> > Done
> >
> >
> > > stubGenerator_ppc:cpp:
> > > - or_, addi_ should get replaced by orr, addi when CR0 result is
> > > not needed.
> >
> > Done
> >
> > > - Where is lplw initialized?
> >
> > It should be initialized with 0, I missed that...
> >
> > > - I believe that the updating load/store instructions e.g. lwzu
> > > don't perform well on some processors. At least using stwu 2 times
> > > in the loop doesn't make sense.
> >
> > You are right. I could manipulate the bits differently and ended up
> > with a single stdu in the loop. Neat! Although I could not reduce
> > the total number of instructions.
> >
> > > - Note: It should be possible to use 8 byte instead of 4 byte
> > > instructions: MacroAssembler::multiply64, addc, adde. But I'm not
> > > requesting to change that because I guess it would make the code
> > > very complicated, especially when supporting both endianess
> versions.
> >
> > Yes, that would require a new analysis on this code. May we consider
> > it next? As you said, I prefer having an initial version that looks
> > as simple as the original java code.
> >
> > > - The squareToLen stub implementation is very close the Java
> > > implementation. So it'd be interesting to understand what C2
> > > doesn't do as well as the hand written assembly code. Do you know
> > > that? (Not absolutely necessary for accepting this change as long
> > > as the stub is measurably faster.)
> >
> > I don't know either. Basically I chose doing it because I noticed
> > some performance gain on SpecJVM2008 when analyzing X64. Then,
> > taking a closer look, I didn't notice any AVX or some special
> > instructions on
> > X64 so I decided to try it on ppc64 by using some basic assembly.
> >
> > Thanks
> >
> > >
> > > Best regards,
> > > Martin
> > >
> > >
> > > -----Original Message-----
> > > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> > > Sent: Donnerstag, 10. August 2017 19:22
> > > To: 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-
> > > dev at openjdk.java.net>
> > > Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > SquareToLen intrinsics
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-
> > > bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> > > Sent: terça-feira, 8 de agosto de 2017 17:19
> > > To: ppc-aix-port-dev at openjdk.java.net
> > > Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> > > SquareToLen intrinsics
> > >
> > > Hi,
> > >
> > > Could you please review this specific PPC64 change to hotspot? By
> > > implementing these intrinsics I noticed a small improvement with
> > > microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark,
> > > only when backporting to JDK8 an improvement was noticed.
> > >
> > > JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
> > > Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/
> > >
> > > Motivation for this implementation:
> > > https://twitter.com/ijuma/status/698309312498835457
> > >
> > > Best regards,
> > > Gustavo Serra Scalet
More information about the jdk8u-dev
mailing list