RFR(L): 8189793: [s390]: Improve String compress/inflate by exploiting vector instructions

Mon Nov 13 11:52:58 UTC 2017

Thank you, Goetz!
Best Regards, 
Lutz

On 13.11.2017, 12:48, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com> wrote:

    Hi Lutz, 

    thanks for the numbers and for the two fixes. 
    Change looks good now. 

    The numbers indicate that deflation of constant strings
    at compile time would make sense, as well as optimizing
    compress for large strings.  (if jvm2008 is representative,
    but I assume it's good enough).

    Best regards,
      Goetz.

    > -----Original Message-----
    > From: Schmidt, Lutz
    > Sent: Friday, 3 November 2017 17:46
    > To: Lindenmaier, Goetz <goetz.lindenmaier at sap.com>; Doerr, Martin
    > <martin.doerr at sap.com>; hotspot-compiler-dev at openjdk.java.net
    > Subject: Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by
    > exploiting vector instructions
    > 
    > Hi Goetz,
    > 
    > I agree. Knowing the string length greatly helps to optimize the generated
    > code, both in terms of size and performance. There are no compress calls
    > with constant length, though. Constant strings are stored compressed at
    > compile time.
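
To illustrate why a known length helps, here is a minimal Java sketch. It is not the s390 code from the webrev. The class and method names are hypothetical, and the byte order (first character in the most significant byte) is an assumption made for this example. The generic method needs a loop with a bound test per character. The length-4 variant needs neither.

```java
final class ConstLenInflate {
    // Generic path: the length is only known at run time, so the code
    // needs a loop and a bound test per character.
    static void inflate(byte[] src, char[] dst, int len) {
        for (int i = 0; i < len; i++) {
            dst[i] = (char) (src[i] & 0xFF);   // zero-extend Latin-1 byte
        }
    }

    // Specialized path for len == 4 (hypothetical helper): four Latin-1
    // bytes packed in an int are spread into four 16-bit chars packed in
    // a long. No loop and no length test are required.
    static long inflate4(int packed) {
        long x = packed & 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;  // separate the two halves
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;  // separate the single bytes
        return x;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(inflate4(0x41424344)));
    }
}
```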
    > 
    > Here are some counters I found when running (a subset of) the SPECjvm2008
    > suite:
    >  string_compress match count:  11
    >  string_inflate  match count: 171
    >  string_inflate_const:
    >    len =  1: 10 matches
    >    len =  2:  2 matches
    >    len =  4:  3 matches
    >    len =  6:  9 matches
    >    len =  7:  3 matches
    >    len =  9:  4 matches
    >    len = 10:  5 matches
    >    len = 11: 15 matches
    >    len = 15:  1 matches
    >    len = 17:  1 matches
    >    len = 18:  2 matches
    >    len = 19:  2 matches
    >    len = 29:  1 matches
    >    len = 31:  1 matches
    > 
    > These (rather few) matches handle a lot of compress/inflate operations:
    >       n     #compress        #inflate
    >     <16       673 Mio        2895 Mio
    >    <256       207 Mio         704 Mio
    >   <4096       0.7 Mio         1.8 Mio
    >  >=4096       1.1 Mio         0.3 Mio
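
The per-class shares follow directly from the counts in the table. The sketch below only redoes that arithmetic. It assumes the four rows are disjoint length buckets (not cumulative counts), and the class and method names are made up for illustration.

```java
import java.util.Locale;

final class LengthShares {
    // Operation counts in millions, copied from the table above.
    // Rows: n < 16, n < 256, n < 4096, n >= 4096 (assumed disjoint buckets).
    static final double[] COMPRESS = {673.0, 207.0, 0.7, 1.1};
    static final double[] INFLATE  = {2895.0, 704.0, 1.8, 0.3};

    // Returns the share (0..1) of bucket idx in the sum of all buckets.
    static double share(double[] counts, int idx) {
        double total = 0.0;
        for (double c : counts) {
            total += c;
        }
        return counts[idx] / total;
    }

    public static void main(String[] args) {
        String[] labels = {"<16", "<256", "<4096", ">=4096"};
        for (int i = 0; i < labels.length; i++) {
            System.out.printf(Locale.ROOT, "%7s  compress %5.1f%%  inflate %5.1f%%%n",
                labels[i], 100 * share(COMPRESS, i), 100 * share(INFLATE, i));
        }
    }
}
```

Under that reading, roughly three quarters of all compress operations and about four fifths of all inflate operations handle fewer than 16 characters.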
    > 
    > 
    > A short note on performance gains:
    > I have done a lot of performance tests in different settings. With complex
    > tests, like SPECjvm2008, the positive (or negative) effect of such low-level
    > optimizations disappears in measurement noise. With a micro benchmark,
    > just compressing and inflating a string, some effect is visible:
    > 
    > My new, improved implementation of the intrinsics shows a slight
    > performance advantage of 1..4% for short strings. Once the vector
    > instructions kick in (at len >= 32), performance improves by 50..70% for
    > string_compress and by 50..150% for string_inflate. Measurements show a
    > high variance, even though testing was done on a system with dedicated CPU
    > resources and with no concurrent load.
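
For readers unfamiliar with the two operations, the following plain Java sketch shows what compress and inflate compute, and why handling a block of characters at once can pay off for longer strings. It is not the s390 intrinsic. The class name, the block size of 16 and the boolean return contract are assumptions made for this example. The block loop ORs all characters together and tests the high bytes once per block, which is the general accumulate-then-mask idea. The scalar tail tests every character on its own.

```java
final class Latin1Codec {
    static final int BLOCK = 16;  // chars per block, chosen for this sketch only

    // Copies src[0..len) into dst as single bytes if every char is <= 0xFF.
    // Returns true on success; on false the content of dst is unspecified.
    static boolean compress(char[] src, byte[] dst, int len) {
        int i = 0;
        // Block loop: one combined range test per BLOCK characters.
        for (; i + BLOCK <= len; i += BLOCK) {
            int acc = 0;
            for (int j = 0; j < BLOCK; j++) {
                char c = src[i + j];
                acc |= c;                 // accumulate all bits of the block
                dst[i + j] = (byte) c;    // store the low byte speculatively
            }
            if ((acc & 0xFF00) != 0) {
                return false;             // some char in the block needs 16 bits
            }
        }
        // Scalar tail: one test per character.
        for (; i < len; i++) {
            char c = src[i];
            if (c > 0xFF) {
                return false;
            }
            dst[i] = (byte) c;
        }
        return true;
    }

    // Expands src[0..len) to 16-bit chars by zero-extending each byte.
    static void inflate(byte[] src, char[] dst, int len) {
        for (int i = 0; i < len; i++) {
            dst[i] = (char) (src[i] & 0xFF);
        }
    }

    public static void main(String[] args) {
        char[] in = "vector instructions kick in at len >= 32".toCharArray();
        byte[] packed = new byte[in.length];
        char[] out = new char[in.length];
        boolean ok = compress(in, packed, in.length);
        inflate(packed, out, in.length);
        System.out.println(ok + " " + new String(out));
    }
}
```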
    > 
    > BTW, there is a new webrev at
    > http://cr.openjdk.java.net/~lucy/webrevs/8189793.01/index.html
    > 
    > In addition to the changes mentioned below, it contains two minor,
    > nevertheless important fixes to the compress intrinsic:
    > 1) z_bru(skipShortcut); is changed to z_brh(skipShortcut);
    > 2) Code is added after label ScalarShortcut to check for zero length strings.
    > 
    > Regards,
    > Lutz
    > 
    > 
    > 
    > 
    > On 03.11.2017, 12:44, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com>
    > wrote:
    > 
    >     Hi Lutz,
    > 
    >     I have been looking at your change. I think it's a good idea to match
    >     for constant string length. I did this for the ppc string intrinsics in the
    >     past.
    >     I remember that distribution of constant strings was quite uneven.
    >     StrIndexOf appeared with constant lengths a lot, StrEquals and StrComp
    >     didn't.
    > 
    >     Do you have any data on how often the new match rules match?
    > 
    >     Actually, if there is a constant string deflated, a platform independent
    >     optimization could compute that at compile time, but that's a different
    >     issue ...
    > 
    >     Best regards,
    >       Goetz.
    > 
    > 
    >     > -----Original Message-----
    >     > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
    >     > bounces at openjdk.java.net] On Behalf Of Schmidt, Lutz
    >     > Sent: Friday, 27 October 2017 13:07
    >     > To: Doerr, Martin <martin.doerr at sap.com>; hotspot-compiler-
    >     > dev at openjdk.java.net
    >     > Subject: Re: RFR(L): 8189793: [s390]: Improve String compress/inflate by
    >     > exploiting vector instructions
    >     >
    >     > Hi Martin,
    >     >
    >     >
    >     >
    >     > Thanks for reviewing my change!
    >     >
    >     >
    >     >
    >     > This is a preliminary response just to let you know I’m working on the
    >     > change. I’m putting a lot of effort into producing reliable performance
    >     > measurement data. Turns out this is not easy (to be more honest: almost
    >     > impossible).
    >     >
    >     >
    >     >
    >     > s390.ad:
    >     >
    >     > You are absolutely right, the sequence load_const/string_compress makes
    >     > no sense at all. But it does not hurt either – I could not find one match in
    >     > all tests I ran. -> Match rule deleted.
    >     >
    >     >
    >     >
    >     > macroAssembler_s390:
    >     >
    >     > prefetch: did not see impact, neither positive nor negative. Artificial micro
    >     > benchmarks will not benefit (data is in cache anyway). More complex
    >     > benchmarks show measurement noise which covers the possible prefetch
    >     > benefit. -> prefetch deleted.
    >     >
    >     > Hardcoded vector registers: you are right. There are some design decisions
    >     > pending, e.g. how many vector scratch registers?
    >     >
    >     > Vperm instruction: using that is just another implementation variant that
    >     > could save the vn vector instruction. On the other hand, loading the index
    >     > vector is a (compared to vgmh) costly memory access. Given the fact that we
    >     > mostly deal with short strings, initialization effort is relevant.
    >     >
    >     > Code size vs. performance: the old, well known, often discussed tradeoff.
    >     > Starting from the existing implementation, I invested quite some time in
    >     > optimizing the (len <= 8) cases. With every refinement step I saw (or
    >     > believed I saw, given the measurement noise) some improvement – or
    >     > discarded it.
    >     > Is the overall improvement worth the larger code size? -> tradeoff,
    >     > discussion.
    >     >
    >     >
    >     >
    >     > Best Regards,
    >     >
    >     > Lutz
    >     >
    >     > On 25.10.2017, 21:08, "Doerr, Martin" <martin.doerr at sap.com> wrote:
    >     >
    >     >
    >     >
    >     > Hi Lutz,
    >     >
    >     >
    >     >
    >     > thanks for working on vector-based enhancements and for providing this
    >     > webrev.
    >     >
    >     >
    >     >
    >     > assembler_s390:
    >     >
    >     > -The changes in the assembler look good.
    >     >
    >     >
    >     >
    >     > s390.ad:
    >     >
    >     > -It doesn't make sense to load constant len to a register and generate
    >     > complex compare instructions for it and still to emit code for all cases. I
    >     > assume that e.g. the 4 characters cases usually have a constant length. If so,
    >     > much better code could be generated for them by omitting all the stuff
    >     > around the simple instructions. (ppc64.ad already contains nodes for
    >     > constant length of needle in indexOf rules.)
    >     >
    >     >
    >     >
    >     > macroAssembler_s390:
    >     >
    >     > -Are you sure the prefetch instructions improve performance?
    >     >
    >     > I remember that we had them in other String intrinsics but removed them
    >     > again as they showed absolutely no performance gain.
    >     >
    >     > -Comment: Using hardcoded vector registers is ok for now, but may need to
    >     > get changed e.g. when using them for C2's SuperWord optimization.
    >     >
    >     > -Comment: You could use the vperm instruction instead of vo+vn, but I'm ok
    >     > with the current implementation because loading a mask is much more
    >     > convenient than getting the permutation vector loaded (e.g. from constant
    >     > pool or pc relative).
    >     >
    >     > -So the new vector loop looks good to me.
    >     >
    >     > -In my opinion, the size of all the generated cases should be in proportion to
    >     > their performance benefit.
    >     >
    >     > As intrinsics are not like stubs and may get inlined often, I can't get rid of the
    >     > impression that generating such large code wastes valuable code cache space
    >     > with questionable performance gain in real world scenarios.
    >     >
    >     >
    >     >
    >     > Best regards,
    >     >
    >     > Martin
    >     >
    >     >
    >     >
    >     > From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-
    >     > bounces at openjdk.java.net] On Behalf Of Schmidt, Lutz
    >     > Sent: Wednesday, 25 October 2017 12:02
    >     > To: hotspot-compiler-dev at openjdk.java.net
    >     > Subject: RFR(L): 8189793: [s390]: Improve String compress/inflate by
    >     > exploiting vector instructions
    >     >
    >     >
    >     >
    >     > Dear all,
    >     >
    >     >
    >     >
    >     > I would like to request reviews for this s390-only enhancement:
    >     >
    >     >
    >     >
    >     > Bug:    https://bugs.openjdk.java.net/browse/JDK-8189793
    >     >
    >     > Webrev: http://cr.openjdk.java.net/~lucy/webrevs/8189793.00/index.html
    >     >
    >     >
    >     >
    >     > Vector instructions, which have been available on System z for a while (since
    >     > z13), promise noticeable performance improvements. This enhancement
    >     > improves the String Compress and String Inflate intrinsics by exploiting vector
    >     > instructions, when available. For long strings, up to 2x performance
    >     > improvement has been observed in micro-benchmarks.
    >     >
    >     >
    >     >
    >     > Special care was taken to preserve good performance for short strings. All
    >     > examined workloads showed a high ratio of short and very short strings.
    >     >
    >     >
    >     >
    >     > Thank you!
    >     >
    >     > Lutz
    >     >
    >     > Dr. Lutz Schmidt | SAP JVM | PI  SAP CP Core | T: +49 (6227) 7-42834
    >     >
    >     >
    > 
    >