[10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen intrinsics

Wed Aug 16 16:49:32 UTC 2017

Hi Martin,

Thanks for dedicated review. It took me a while to be able to work on this but I hope to have your points solved. Please check below the review as well as my comments quoting your email:
https://gut.github.io/openjdk/webrev/JDK-8185976/webrev.01/

> -----Original Message-----
> First of all, C2 does not perform sign extend when calling stubs. The
> int parms need to get zero/sign extended. (Could even be done without
> extra instructions by replacing sldi -> rldicl, cmpdi -> extsw_ in some
> cases.)

Does it make a difference on my case?

I guess you are talking about mulAdd preparation code. The only aspect I found about him is to force the cast from 32 bits -> 64 bits by cleaning higher bits. Offset is a signed integer but it can't be negative anyway. 

So I changed from:
sldi   (R5_ARG3, R5_ARG3, 2);

to:
rldicl (R5_ARG3, R5_ARG3, 2, 32);  // always positive

> macroAssembler_ppc.cpp:
> - Indentation should be 2 spaces.

Done

> stubGenerator_ppc:cpp:
> - or_, addi_ should get replaced by orr, addi when CR0 result is not
> needed.

Done

> - Where is lplw initialized?

It should be initialized with 0, I missed that...

> - I believe that the updating load/store instructions e.g. lwzu don't
> perform well on some processors. At least using stwu 2 times in the loop
> doesn't make sense.

You are right. I could manipulate the bits differently and ended up with a single stdu in the loop. Neat! Although I could not reduce the total number of instructions.

> - Note: It should be possible to use 8 byte instead of 4 byte
> instructions: MacroAssembler::multiply64, addc, adde. But I'm not
> requesting to change that because I guess it would make the code very
> complicated, especially when supporting both endianess versions.

Yes, that would require a new analysis on this code. May we consider it next? As you said, I prefer having an initial version that looks as simple as the original java code.

> - The squareToLen stub implementation is very close the Java
> implementation. So it'd be interesting to understand what C2 doesn't do
> as well as the hand written assembly code. Do you know that? (Not
> absolutely necessary for accepting this change as long as the stub is
> measurably faster.)

I don't know either. Basically I chose doing it because I noticed some performance gain on SpecJVM2008 when analyzing X64. Then, taking a closer look, I didn't notice any AVX or some special instructions on X64 so I decided to try it on ppc64 by using some basic assembly.

Thanks

> 
> Best regards,
> Martin
> 
> 
> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> Sent: Donnerstag, 10. August 2017 19:22
> To: 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-
> dev at openjdk.java.net>
> Subject: FW: [10] RFR(M): 8185976: PPC64: Implement MulAdd and
> SquareToLen intrinsics
> 
> 
> 
> -----Original Message-----
> From: ppc-aix-port-dev [mailto:ppc-aix-port-dev-
> bounces at openjdk.java.net] On Behalf Of Gustavo Serra Scalet
> Sent: terça-feira, 8 de agosto de 2017 17:19
> To: ppc-aix-port-dev at openjdk.java.net
> Subject: [10] RFR(M): 8185976: PPC64: Implement MulAdd and SquareToLen
> intrinsics
> 
> Hi,
> 
> Could you please review this specific PPC64 change to hotspot? By
> implementing these intrinsics I noticed a small improvement with
> microbenchmarks analysis. On SpecJVM2008's crypto.rsa benchmark, only
> when backporting to JDK8 an improvement was noticed.
> 
> JBS: https://bugs.openjdk.java.net/browse/JDK-8185976
> Webrev: https://gut.github.io/openjdk/webrev/JDK-8185976/webrev/
> 
> Motivation for this implementation:
> https://twitter.com/ijuma/status/698309312498835457
> 
> Best regards,
> Gustavo Serra Scalet