RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions
Schmidt, Lutz
lutz.schmidt at sap.com
Fri Nov 3 16:45:39 UTC 2017
Hi Goetz,
I agree. Knowing the string length greatly helps to optimize the generated code, both in terms of size and performance. There are no compress calls with constant length, though. Constant strings are stored compressed at compile time.
Here are some counters I found when running (a subset of) the SPECjvm2008 suite:
string_compress match count: 11
string_inflate match count: 171
string_inflate_const:
len = 1: 10 matches
len = 2: 2 matches
len = 4: 3 matches
len = 6: 9 matches
len = 7: 3 matches
len = 9: 4 matches
len = 10: 5 matches
len = 11: 15 matches
len = 15: 1 matches
len = 17: 1 matches
len = 18: 2 matches
len = 19: 2 matches
len = 29: 1 matches
len = 31: 1 matches
These (rather few) matches handle a lot of compress/inflate operations:
n          #compress      #inflate
<16        673 million    2895 million
<256       207 million     704 million
<4096      0.7 million     1.8 million
>=4096     1.1 million     0.3 million
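To put the table in proportion, the share of calls per length bucket can be computed directly from the quoted counts. The following is only a sketch: it assumes the four rows are disjoint length ranges (not cumulative), and the class and method names are made up for illustration.

```java
import java.util.Locale;

/** Share of calls per length bucket, using the counts quoted above (in millions). */
final class LengthBuckets {
    static final String[] LABELS = {"<16", "<256", "<4096", ">=4096"};
    // Assumption: each row counts only the calls not covered by the previous row.
    static final double[] COMPRESS = {673, 207, 0.7, 1.1};
    static final double[] INFLATE = {2895, 704, 1.8, 0.3};

    /** Sum of all bucket counts. */
    static double total(double[] counts) {
        double sum = 0;
        for (double c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Percentage of all calls that fall into bucket i. */
    static double sharePercent(double[] counts, int i) {
        return 100.0 * counts[i] / total(counts);
    }

    /** One-line summary such as "compress: <16=...% <256=...% ...". */
    static String report(String name, double[] counts) {
        StringBuilder sb = new StringBuilder(name).append(':');
        for (int i = 0; i < counts.length; i++) {
            sb.append(String.format(Locale.ROOT, " %s=%.1f%%", LABELS[i], sharePercent(counts, i)));
        }
        return sb.toString();
    }
}
```

Under the disjoint-bucket assumption, roughly 76% of the compress calls and 80% of the inflate calls handle fewer than 16 characters, and strings of 256 characters or more account for well under 1% of either operation.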
A short note on performance gains:
I have done a lot of performance tests in different settings. With complex tests, like SPECjvm2008, the positive (or negative) effect of such low-level optimizations disappears in measurement noise. With a micro benchmark, just compressing and inflating a string, some effect is visible:
My new, improved implementation of the intrinsics shows a slight performance advantage of 1..4% for short strings. Once the vector instructions kick in (at len >= 32), performance improves by 50..70% for string_compress and by 50..150% for string_inflate. The measurements show high variance, even though testing was done on a system with dedicated CPU resources and no concurrent load.
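For readers who want to try something similar, a micro benchmark of this kind might look like the sketch below. It is not the benchmark used for the numbers above. It assumes a JDK with compact strings enabled, where `new String(char[])` goes through the compress path and `String.toCharArray()` on a Latin-1 string goes through the inflate path. The class and method names are hypothetical, and the timing loop is deliberately naive (no JMH, a single warm-up pass).

```java
/** Crude round-trip timing for string compress/inflate; a sketch, not a rigorous benchmark. */
final class CompressInflateBench {
    /** Builds a char array of the given length that contains only Latin-1 characters. */
    static char[] latin1Chars(int len) {
        char[] chars = new char[len];
        for (int i = 0; i < len; i++) {
            chars[i] = (char) ('a' + (i % 26));
        }
        return chars;
    }

    /** One round trip: char[] -> String (assumed compress) -> char[] (assumed inflate). */
    static char[] roundTrip(char[] chars) {
        String s = new String(chars);
        return s.toCharArray();
    }

    /** Average nanoseconds per round trip for strings of length len (len must be >= 1). */
    static double nanosPerRoundTrip(int len, int iterations) {
        char[] chars = latin1Chars(len);
        long sink = 0;
        // Warm-up pass so the JIT has a chance to compile the loop body.
        for (int i = 0; i < iterations; i++) {
            sink += roundTrip(chars)[len - 1];
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += roundTrip(chars)[len - 1];
        }
        long elapsed = System.nanoTime() - start;
        // Use the accumulated value so the loops cannot be optimized away entirely.
        if (sink == 42) {
            System.out.println(sink);
        }
        return (double) elapsed / iterations;
    }
}
```

Calling `nanosPerRoundTrip` for lengths on both sides of 32 (for example 8, 31, 32 and 256) would be one way to look for the threshold effect described above. Given the variance mentioned, individual runs should be repeated before drawing conclusions.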
BTW, there is a new webrev at http://cr.openjdk.java.net/~lucy/webrevs/8189793.01/index.html
In addition to the changes mentioned below, it contains two minor, nevertheless important fixes to the compress intrinsic:
1) z_bru(skipShortcut); is changed to z_brh(skipShortcut);
2) Code is added after label ScalarShortcut to check for zero length strings.
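As a plain-Java model of the control flow discussed here (a block-wise path for longer strings, a scalar path for short ones, and an explicit zero-length check), consider the following sketch. It is not the s390 intrinsic and does not reproduce its instructions. The block size of 16 chars, the return convention (len on success, 0 on failure) and all names are assumptions made for illustration. The threshold of 32 is taken from the len >= 32 remark above.

```java
/** Scalar model of a compress/inflate pair with a block-wise fast path; illustrative only. */
final class CompressModel {
    static final int BLOCK = 16;          // chars per simulated vector step (assumption)
    static final int VECTOR_MIN_LEN = 32; // below this length only the scalar path runs

    /**
     * Narrows len chars from src into dst, one byte per char.
     * Returns len if every char fits into one byte, otherwise 0 (assumed convention).
     */
    static int compress(char[] src, byte[] dst, int len) {
        if (len == 0) {
            // Redundant in this model (the loops below would not run anyway);
            // kept to mirror the explicit zero-length check mentioned above.
            return 0;
        }
        int i = 0;
        if (len >= VECTOR_MIN_LEN) {
            for (; i + BLOCK <= len; i += BLOCK) {
                // OR all chars of the block, then test the high bytes with one mask.
                int or = 0;
                for (int j = 0; j < BLOCK; j++) {
                    or |= src[i + j];
                }
                if ((or & 0xFF00) != 0) {
                    break; // the scalar loop below locates the offending char
                }
                for (int j = 0; j < BLOCK; j++) {
                    dst[i + j] = (byte) src[i + j];
                }
            }
        }
        // Scalar path: short strings, the tail of long strings, or a failed block.
        for (; i < len; i++) {
            char c = src[i];
            if (c > 0xFF) {
                return 0;
            }
            dst[i] = (byte) c;
        }
        return len;
    }

    /** Widens len bytes from src into dst, treating each byte as unsigned. */
    static void inflate(byte[] src, char[] dst, int len) {
        for (int i = 0; i < len; i++) {
            dst[i] = (char) (src[i] & 0xFF);
        }
    }
}
```

The OR-then-mask test in the block loop is the scalar analogue of checking a whole vector for non-Latin-1 characters at once. It keeps the per-character branch out of the fast path and leaves the exact failure position to the scalar loop.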
Regards,
Lutz
On 03.11.2017, 12:44, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com> wrote:
Hi Lutz,
I have been looking at your change. I think it's a good idea to match
for constant string length. I did this for the ppc string intrinsics in the
past.
I remember that distribution of constant strings was quite uneven.
StrIndexOf appeared with constant lengths a lot, StrEquals and StrComp
didn't.
Do you have any data on how often the new match rules match?
Actually, if there is a constant string deflated, a platform independent
optimization could compute that at compile time, but that's a different
issue ...
Best regards,
Goetz.
> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> bounces at openjdk.java.net] On Behalf Of Schmidt, Lutz
> Sent: Friday, 27 October 2017 13:07
> To: Doerr, Martin <martin.doerr at sap.com>; hotspot-compiler-
> dev at openjdk.java.net
> Subject: Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by
> exploiting vector instructions
>
> Hi Martin,
>
>
>
> Thanks for reviewing my change!
>
>
>
> This is a preliminary response just to let you know I’m working on the
> change. I’m putting a lot of effort into producing reliable performance
> measurement data. It turns out this is not easy (to be honest: almost
> impossible).
>
>
>
> s390.ad:
>
> You are absolutely right, the sequence load_const/string_compress makes
> no sense at all. But it does not hurt either – I could not find one match in all
> tests I ran. -> Match rule deleted.
>
>
>
> macroAssembler_s390:
>
> prefetch: did not see impact, neither positive nor negative. Artificial micro
> benchmarks will not benefit (data is in cache anyway). More complex
> benchmarks show measurement noise which covers the possible prefetch
> benefit. -> prefetch deleted.
>
> Hardcoded vector registers: you are right. There are some design decisions
> pending, e.g. how many vector scratch registers?
>
> Vperm instruction: using that is just another implementation variant that
> could save the vn vector instruction. On the other hand, loading the index
> vector is a (compared to vgmh) costly memory access. Given the fact that we
> mostly deal with short strings, initialization effort is relevant.
>
> Code size vs. performance: the old, well-known, often-discussed tradeoff.
> Starting from the existing implementation, I invested quite some time in
> optimizing the (len <= 8) cases. With every refinement step I either saw
> some improvement (or believed I saw one, given the measurement noise) or
> discarded the step. Is the overall improvement worth the larger code
> size? -> tradeoff, discussion.
>
>
>
> Best Regards,
>
> Lutz
>
>
>
>
>
>
>
> On 25.10.2017, 21:08, "Doerr, Martin" <martin.doerr at sap.com> wrote:
>
>
>
> Hi Lutz,
>
>
>
> thanks for working on vector-based enhancements and for providing this
> webrev.
>
>
>
> assembler_s390:
>
> -The changes in the assembler look good.
>
>
>
> s390.ad:
>
> -It doesn't make sense to load constant len to a register and generate
> complex compare instructions for it and still to emit code for all cases. I
> assume that e.g. the 4 characters cases usually have a constant length. If so,
> much better code could be generated for them by omitting all the stuff
> around the simple instructions. (ppc64.ad already contains nodes for
> constant length of needle in indexOf rules.)
>
>
>
> macroAssembler_s390:
>
> -Are you sure the prefetch instructions improve performance?
>
> I remember that we had them in other String intrinsics but removed them
> again as they showed absolutely no performance gain.
>
> -Comment: Using hardcoded vector registers is ok for now, but may need to
> get changed e.g. when using them for C2's SuperWord optimization.
>
> -Comment: You could use the vperm instruction instead of vo+vn, but I'm ok
> with the current implementation because loading a mask is much more
> convenient than getting the permutation vector loaded (e.g. from constant
> pool or pc relative).
>
> -So the new vector loop looks good to me.
>
> -In my opinion, the size of all the generated cases should be in proportion to
> their performance benefit.
>
> As intrinsics are not like stubs and may get inlined often, I can't get rid of the
> impression that generating so large code wastes valuable code cache space
> with questionable performance gain in real world scenarios.
>
>
>
> Best regards,
>
> Martin
>
>
>
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> bounces at openjdk.java.net] On Behalf Of Schmidt, Lutz
> Sent: Wednesday, 25 October 2017 12:02
> To: hotspot-compiler-dev at openjdk.java.net
> Subject: RFR(L): 8189793: [s390]: Improve String compress/inflate by
> exploiting vector instructions
>
>
>
> Dear all,
>
>
>
> I would like to request reviews for this s390-only enhancement:
>
>
>
> Bug: https://bugs.openjdk.java.net/browse/JDK-8189793
>
> Webrev: http://cr.openjdk.java.net/~lucy/webrevs/8189793.00/index.html
>
>
>
> Vector instructions, which have been available on System z for a while (since
> z13), promise noticeable performance improvements. This enhancement
> improves the String Compress and String Inflate intrinsics by exploiting vector
> instructions, when available. For long strings, up to 2x performance
> improvement has been observed in micro-benchmarks.
>
>
>
> Special care was taken to preserve good performance for short strings. All
> examined workloads showed a high ratio of short and very short strings.
>
>
>
> Thank you!
>
> Lutz
>
>
>
>
>
>
>
>
>
> Dr. Lutz Schmidt | SAP JVM | PI SAP CP Core | T: +49 (6227) 7-42834
>
>