RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions

Fri Nov 3 11:44:58 UTC 2017

Hi Lutz, 

I have been looking at your change. I think matching on constant string
lengths is a good idea. I did this for the ppc string intrinsics in the
past.
As I remember, constant lengths were distributed quite unevenly:
StrIndexOf appeared with constant lengths a lot, StrEquals and StrComp
didn't.

Do you have any data on how often the new match rules match?

Actually, if a constant string is compressed, a platform-independent
optimization could compute the result at compile time, but that's a
different issue ...
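
To illustrate the compile-time idea, here is a minimal Java sketch. It is
not the JDK's internal code: the class, the compress() helper and the
FOLDED_ABCD constant are hypothetical names made up for this example, and
the return convention (count of chars narrowed before the first one that
does not fit into a byte) is an assumption. It only shows that for a
constant input the result is known up front.

```java
import java.util.Arrays;

public class ConstantCompressSketch {
    // Hypothetical scalar reference, not the JDK intrinsic: narrows UTF-16
    // chars to Latin-1 bytes and returns how many chars were narrowed
    // before the first one that does not fit into a single byte.
    static int compress(char[] src, byte[] dst, int len) {
        for (int i = 0; i < len; i++) {
            char c = src[i];
            if (c > 0xFF) {
                return i;
            }
            dst[i] = (byte) c;
        }
        return len;
    }

    // What a compiler could emit once it has folded compress() for the
    // constant "abcd" (assumed result, written out by hand here).
    static final byte[] FOLDED_ABCD = {0x61, 0x62, 0x63, 0x64};

    // Runs the scalar loop and trims the result to the compressed prefix.
    static byte[] compressAtRuntime(String s) {
        char[] src = s.toCharArray();
        byte[] dst = new byte[src.length];
        int n = compress(src, dst, src.length);
        return Arrays.copyOf(dst, n);
    }

    public static void main(String[] args) {
        byte[] runtime = compressAtRuntime("abcd");
        // Reports whether the runtime loop and the folded constant agree.
        System.out.println(Arrays.equals(runtime, FOLDED_ABCD));
    }
}
```

With a constant source, the loop in compress() does no work that could not
have been done once, ahead of time.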

Best regards,
  Goetz.

> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> bounces at openjdk.java.net] On Behalf Of Schmidt, Lutz
> Sent: Friday, 27 October 2017 13:07
> To: Doerr, Martin <martin.doerr at sap.com>; hotspot-compiler-
> dev at openjdk.java.net
> Subject: Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by
> exploiting vector instructions
> 
> Hi Martin,
> 
> 
> 
> Thanks for reviewing my change!
> 
> 
> 
> This is a preliminary response just to let you know I’m working on the
> change. I’m putting a lot of effort into producing reliable performance
> measurement data. It turns out this is not easy (to be honest: almost
> impossible).
> 
> 
> 
> s390.ad:
> 
> You are absolutely right, the sequence load_const/string_compress makes
> no sense at all. But it does not hurt either – I could not find one match in all
> tests I ran. -> Match rule deleted.
> 
> 
> 
> macroAssembler_s390:
> 
> prefetch: did not see impact, neither positive nor negative. Artificial micro
> benchmarks will not benefit (data is in cache anyway). More complex
> benchmarks show measurement noise which covers the possible prefetch
> benefit. -> prefetch deleted.
> 
> Hardcoded vector registers: you are right. There are some design decisions
> pending, e.g. how many vector scratch registers?
> 
> Vperm instruction: using that is just another implementation variant that
> could save the vn vector instruction. On the other hand, loading the index
> vector requires a memory access, which is costly compared to vgmh. Given
> that we mostly deal with short strings, initialization effort is relevant.
> 
> Code size vs. performance: the old, well known, often discussed tradeoff.
> Starting from the existing implementation, I invested quite some time in
> optimizing the (len <= 8) cases. With every refinement step I either saw
> (or believed I saw, given the measurement noise) some improvement, or I
> discarded the step.
> Is the overall improvement worth the larger code size? -> tradeoff,
> discussion.
> 
> 
> 
> Best Regards,
> 
> Lutz
> 
> On 25.10.2017, 21:08, "Doerr, Martin" <martin.doerr at sap.com> wrote:
> 
> 
> 
> Hi Lutz,
> 
> 
> 
> thanks for working on vector-based enhancements and for providing this
> webrev.
> 
> 
> 
> assembler_s390:
> 
> -The changes in the assembler look good.
> 
> 
> 
> s390.ad:
> 
> -It doesn't make sense to load a constant len into a register, generate
> complex compare instructions for it, and still emit code for all cases. I
> assume that e.g. the 4-character cases usually have a constant length. If
> so, much better code could be generated for them by omitting all the stuff
> around the simple instructions. (ppc64.ad already contains nodes for
> constant length of needle in indexOf rules.)
> 
> 
> 
> macroAssembler_s390:
> 
> -Are you sure the prefetch instructions improve performance?
> 
> I remember that we had them in other String intrinsics but removed them
> again as they showed absolutely no performance gain.
> 
> -Comment: Using hardcoded vector registers is ok for now, but this may need
> to be changed, e.g. when using them for C2's SuperWord optimization.
> 
> -Comment: You could use the vperm instruction instead of vo+vn, but I'm ok
> with the current implementation because loading a mask is much more
> convenient than getting the permutation vector loaded (e.g. from constant
> pool or pc relative).
> 
> -So the new vector loop looks good to me.
> 
> -In my opinion, the size of all the generated cases should be in proportion
> to their performance benefit.
> 
> As intrinsics are not like stubs and may get inlined often, I can't get rid
> of the impression that generating such large code wastes valuable code
> cache space with questionable performance gain in real world scenarios.
> 
> 
> 
> Best regards,
> 
> Martin
> 
> 
> 
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
> bounces at openjdk.java.net] On Behalf Of Schmidt, Lutz
> Sent: Wednesday, 25 October 2017 12:02
> To: hotspot-compiler-dev at openjdk.java.net
> Subject: RFR(L): 8189793: [s390]: Improve String compress/inflate by
> exploiting vector instructions
> 
> 
> 
> Dear all,
> 
> 
> 
> I would like to request reviews for this s390-only enhancement:
> 
> 
> 
> Bug:    https://bugs.openjdk.java.net/browse/JDK-8189793
> 
> Webrev: http://cr.openjdk.java.net/~lucy/webrevs/8189793.00/index.html
> 
> 
> 
> Vector instructions, which have been available on System z for a while (since
> z13), promise noticeable performance improvements. This enhancement
> improves the String Compress and String Inflate intrinsics by exploiting vector
> instructions, when available. For long strings, up to 2x performance
> improvement has been observed in micro-benchmarks.
> 
> 
> 
> Special care was taken to preserve good performance for short strings. All
> examined workloads showed a high ratio of short and very short strings.
> 
> 
> 
> Thank you!
> 
> Lutz
> 
> Dr. Lutz Schmidt | SAP JVM | PI  SAP CP Core | T: +49 (6227) 7-42834
> 
>