RFR(S): 8187964: [s390][ppc]: Intrinsify Math.multiplyHigh(long, long)

Schmidt, Lutz lutz.schmidt at sap.com
Fri Oct 6 14:14:38 UTC 2017


Hi Martin,

thanks for your review!

I have removed the use of the tmp2 register. That was easy.
I would rather keep the tmp1 register, though. Removing it would mean using a scratch register instead, and I try to avoid scratch registers where I can easily get a temp register from the register allocator.

Please find the updated webrev at http://cr.openjdk.java.net/~lucy/webrevs/8187964.01/index.html

Long division benefits quite a bit from multiplyHigh: with a simple microbenchmark, I see a 4x to 5x improvement. Only the latest processor generation benefits less; on z13, I see only a 1.5x improvement.
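Why long division profits from a fast multiply-high: a compiler can replace division by a constant with a multiply-high by a precomputed "magic" number plus a small correction. The sketch below shows that standard division-by-invariant-integers technique for the divisor 3 in plain Java. The class and constant names are hypothetical, and the exact sequence C2 emits for a given divisor may differ (extra shifts or add-back steps for other divisors).

```java
public class DivByThree {
    // Hypothetical magic multiplier for signed 64-bit division by 3:
    // ceil(2^64 / 3) = (2^64 + 2) / 3.
    private static final long MAGIC_3 = 0x5555555555555556L;

    // Computes n / 3 with Java's truncating semantics, without a divide:
    // the high 64 bits of n * MAGIC_3 give floor(n / 3) (up to rounding),
    // and adding 1 for negative n turns floor into truncation toward zero.
    static long divBy3(long n) {
        long q = Math.multiplyHigh(n, MAGIC_3);
        return q + (n >>> 63); // n >>> 63 is 1 if n < 0, else 0
    }

    public static void main(String[] args) {
        long[] samples = {0, 1, 2, 3, 7, -1, -2, -3, -7,
                          Long.MAX_VALUE, Long.MIN_VALUE, -Long.MAX_VALUE};
        for (long n : samples) {
            if (divBy3(n) != n / 3) {
                throw new AssertionError("mismatch for " + n);
            }
        }
        System.out.println("ok");
    }
}
```

Each division then costs one multiply-high, a shift and an add, so any speedup of the multiply-high instruction sequence carries over directly.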

There is an easy explanation for the z13 “anomaly”: the superscalar layout of a z13 core is twice as wide as that of a z196 core. A z13 needs fairly complex loop bodies with independent data streams to reach its full potential, and my simple benchmark obviously does not provide that.
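To illustrate what "independent data streams" can mean in a benchmark kernel, here is a hypothetical sketch (not the benchmark used above) that computes the same sum of multiplyHigh results once with a single accumulator and once with four independent accumulators. The four-way version splits the work into separate dependency chains that a wide out-of-order core may be able to overlap; whether and how much this helps on z13 is an assumption, not something measured in this thread.

```java
import java.util.Random;

public class MulHiStreams {
    // One accumulator: every addition depends on the previous one.
    static long sumSingle(long[] a, long[] b) {
        long acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc += Math.multiplyHigh(a[i], b[i]);
        }
        return acc;
    }

    // Four accumulators: four independent dependency chains.
    static long sumFourWay(long[] a, long[] b) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            s0 += Math.multiplyHigh(a[i], b[i]);
            s1 += Math.multiplyHigh(a[i + 1], b[i + 1]);
            s2 += Math.multiplyHigh(a[i + 2], b[i + 2]);
            s3 += Math.multiplyHigh(a[i + 3], b[i + 3]);
        }
        for (; i < a.length; i++) { // leftover elements
            s0 += Math.multiplyHigh(a[i], b[i]);
        }
        // Long addition wraps modulo 2^64, so the order of additions
        // does not change the result.
        return s0 + s1 + s2 + s3;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 1003; // deliberately not a multiple of 4
        long[] a = new long[n];
        long[] b = new long[n];
        for (int i = 0; i < n; i++) {
            a[i] = rnd.nextLong();
            b[i] = rnd.nextLong();
        }
        if (sumSingle(a, b) != sumFourWay(a, b)) {
            throw new AssertionError("results differ");
        }
        System.out.println("ok");
    }
}
```

For real measurements, such kernels would normally be run under a harness like JMH rather than timed by hand.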

Best Regards,
Lutz


On 06.10.2017, 15:03, "Doerr, Martin" <martin.doerr at sap.com> wrote:

Hi Lutz,

looks good. If you like, you can get rid of one or both tmp registers to save them.

Did you also check whether it improves long division, which also uses multiply-high nodes?

I can sponsor this change.

Best regards,
Martin


From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Schmidt, Lutz
Sent: Friday, 6 October 2017 11:10
To: hotspot-compiler-dev at openjdk.java.net
Subject: RFR(S): 8187964: [s390][ppc]: Intrinsify Math.multiplyHigh(long, long)

Dear all,

I would like to request reviews for this s390-only enhancement (ppc support was already implemented for another purpose):

Bug:    https://bugs.openjdk.java.net/browse/JDK-8187964
Webrev: http://cr.openjdk.java.net/~lucy/webrevs/8187964.00/index.html

This change provides platform-specific implementations of the Math.multiplyHigh method, exploiting the 64bit x 64bit => 128bit multiply instructions available on these platforms. Microbenchmarks show an improvement of 4x to 5x for [s390] and 10x to 15x for [ppc].
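For reference, Math.multiplyHigh(x, y) returns the upper 64 bits of the exact, signed 128-bit product x * y, which is exactly what a single 64bit x 64bit => 128bit multiply instruction yields. The sketch below checks that semantics against a portable BigInteger computation; the class and helper names are hypothetical and serve only as an illustration.

```java
import java.math.BigInteger;

public class MultiplyHighDemo {
    // Reference: exact signed product via BigInteger, then an arithmetic
    // shift right by 64 keeps only the high half (always fits in a long).
    static long multiplyHighRef(long x, long y) {
        return BigInteger.valueOf(x)
                .multiply(BigInteger.valueOf(y))
                .shiftRight(64)
                .longValue();
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, -1L, 2L, Long.MAX_VALUE, Long.MIN_VALUE,
                          0x123456789ABCDEFL, -987654321987L};
        for (long x : samples) {
            for (long y : samples) {
                if (Math.multiplyHigh(x, y) != multiplyHighRef(x, y)) {
                    throw new AssertionError(x + " * " + y);
                }
            }
        }
        System.out.println("ok");
    }
}
```

Without an intrinsic, the JDK computes the same value in Java code from several 32-bit partial products, which is the work the platform instruction replaces.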

Thank you!
Lutz


