RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions
Schmidt, Lutz
lutz.schmidt at sap.com
Fri Oct 27 11:06:50 UTC 2017
Hi Martin,
Thanks for reviewing my change!
This is a preliminary response just to let you know I’m working on the change. I’m putting a lot of effort into producing reliable performance measurement data. It turns out this is not easy (to be honest: almost impossible).
s390.ad:
You are absolutely right, the sequence load_const/string_compress makes no sense at all. But it does not hurt either – I could not find one match in all tests I ran. -> Match rule deleted.
macroAssembler_s390:
prefetch: did not see impact, neither positive nor negative. Artificial micro benchmarks will not benefit (data is in cache anyway). More complex benchmarks show measurement noise which covers the possible prefetch benefit. -> prefetch deleted.
Hardcoded vector registers: you are right. There are some design decisions pending, e.g. how many vector scratch registers?
Vperm instruction: using it is just another implementation variant that could save the vn vector instruction. On the other hand, loading the index vector requires a memory access, which is costly compared to generating the mask with vgmh. Since we mostly deal with short strings, the initialization effort is relevant.
Code size vs. performance: the old, well-known, often-discussed tradeoff. Starting from the existing implementation, I invested quite some time in optimizing the (len <= 8) cases. With every refinement step I either saw some improvement (or, given the measurement noise, believed I saw one) or discarded the refinement. Is the overall improvement worth the larger code size? -> tradeoff, discussion.
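For readers following the vo+vn and short-string discussion, a scalar Java sketch of the general idea may help. It is not code from the webrev. It assumes that the vector loop ORs a block of characters together and then ANDs the result with a 0xFF00-per-halfword mask to test whether any character needs more than one byte, and it assumes a return convention of "number of characters compressed". The class name, method name, and block size of 8 are hypothetical.

```java
// Illustrative only: a scalar emulation of a block-wise "does everything fit
// in one byte?" check, followed by a scalar loop for short inputs and tails.
final class BlockCompressSketch {
    private static final int BLOCK = 8;               // assumed: 8 chars per 128-bit vector
    private static final int HIGH_BYTE_MASK = 0xFF00; // high byte of each 16-bit char

    /** Returns how many chars of src[0..len) were narrowed into dst (assumed convention). */
    static int compress(char[] src, byte[] dst, int len) {
        int i = 0;
        // Block loop: entered only when at least one full block remains.
        while (len - i >= BLOCK) {
            int acc = 0;
            for (int k = 0; k < BLOCK; k++) {
                acc |= src[i + k];                    // OR step: accumulate all chars
            }
            if ((acc & HIGH_BYTE_MASK) != 0) {        // AND step: any high byte set?
                break;                                // let the scalar loop find the exact char
            }
            for (int k = 0; k < BLOCK; k++) {
                dst[i + k] = (byte) src[i + k];       // narrow the whole block
            }
            i += BLOCK;
        }
        // Scalar loop: handles short strings, the tail, and the failing block.
        while (i < len) {
            char c = src[i];
            if (c > 0xFF) {
                break;                                // first char that does not fit
            }
            dst[i] = (byte) c;
            i++;
        }
        return i;
    }
}
```

In this sketch the block loop costs nothing for inputs shorter than one block, which is the property the short-string tuning is concerned with; every additional special case for small lengths would add code on top of the two loops shown.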
Best Regards,
Lutz
On 25.10.2017, 21:08, "Doerr, Martin" <martin.doerr at sap.com> wrote:
Hi Lutz,
thanks for working on vector-based enhancements and for providing this webrev.
assembler_s390:
-The changes in the assembler look good.
s390.ad:
-It doesn't make sense to load a constant len into a register, generate complex compare instructions for it, and still emit code for all cases. I assume that e.g. the 4-character cases usually have a constant length. If so, much better code could be generated for them by omitting everything around the simple instructions. (ppc64.ad already contains nodes for constant length of needle in indexOf rules.)
macroAssembler_s390:
-Are you sure the prefetch instructions improve performance?
I remember that we had them in other String intrinsics but removed them again as they showed absolutely no performance gain.
-Comment: Using hardcoded vector registers is ok for now, but may need to get changed e.g. when using them for C2's SuperWord optimization.
-Comment: You could use the vperm instruction instead of vo+vn, but I'm ok with the current implementation because loading a mask is much more convenient than getting the permutation vector loaded (e.g. from constant pool or pc relative).
-So the new vector loop looks good to me.
-In my opinion, the size of all the generated cases should be proportionate to their performance benefit.
As intrinsics are not like stubs and may get inlined often, I can't get rid of the impression that generating such large code wastes valuable code cache space for a questionable performance gain in real-world scenarios.
Best regards,
Martin
From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Schmidt, Lutz
Sent: Wednesday, 25 October 2017 12:02
To: hotspot-compiler-dev at openjdk.java.net
Subject: RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions
Dear all,
I would like to request reviews for this s390-only enhancement:
Bug: https://bugs.openjdk.java.net/browse/JDK-8189793
Webrev: http://cr.openjdk.java.net/~lucy/webrevs/8189793.00/index.html
Vector instructions, which have been available on System z for a while (since z13), promise noticeable performance improvements. This enhancement improves the String Compress and String Inflate intrinsics by exploiting vector instructions, when available. For long strings, up to 2x performance improvement has been observed in micro-benchmarks.
Special care was taken to preserve good performance for short strings. All examined workloads showed a high ratio of short and very short strings.
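As background for reviewers less familiar with the two operations, the following scalar Java sketch shows what compress and inflate compute. It is an illustration only, not the intrinsic and not the JDK library code; the class and method names are hypothetical, and the return convention of compress (number of characters processed before the first one that does not fit) is an assumption made for this sketch.

```java
// Illustrative scalar reference for the two operations discussed in this RFR.
final class Latin1Sketch {

    /**
     * Narrows 16-bit chars to bytes. Stops at the first char above 0xFF and
     * returns the number of chars written to dst (assumed convention).
     */
    static int compress(char[] src, byte[] dst, int len) {
        for (int i = 0; i < len; i++) {
            char c = src[i];
            if (c > 0xFF) {
                return i;              // this char needs two bytes; give up here
            }
            dst[i] = (byte) c;
        }
        return len;
    }

    /** Widens each byte to a 16-bit char by zero extension; cannot fail. */
    static void inflate(byte[] src, char[] dst, int len) {
        for (int i = 0; i < len; i++) {
            dst[i] = (char) (src[i] & 0xFF);   // mask avoids sign extension
        }
    }
}
```

Both loops touch one element per iteration, which is why a vector implementation handling many characters per instruction can pay off for long strings, while for short strings the setup cost dominates.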
Thank you!
Lutz
Dr. Lutz Schmidt | SAP JVM | PI SAP CP Core | T: +49 (6227) 7-42834