[aarch64-port-dev ] RFR(M): 8189112 - AARCH64: optimize StringUTF16 compress intrinsic

Dmitrij Pochepko dmitrij.pochepko at bell-sw.com
Tue May 15 17:13:49 UTC 2018


Thank you for the review.


On 15.05.2018 20:10, Andrew Haley wrote:
> On 05/08/2018 02:26 PM, Dmitrij Pochepko wrote:
>> Hi all,
>>
>> please review patch for 8189112 - AARCH64: optimize StringUTF16 compress
>> intrinsic
>>
>> This patch is based on 3 improvement ideas:
>>
>> - introduction of an additional large loop with prefetch instructions for
>> long strings
>> - a different compression implementation, using uzp1 and uzp2 instructions
>> instead of the more expensive uqxtn and uqxtn2; this also makes it
>> possible to drop direct FPSR register operations, which are very slow on
>> some CPUs
>> - a slightly different code shape, which mostly executes branches and
>> independent operations while loads and stores are in flight (this helps
>> "in-order" CPUs); a scalar sketch of the compress operation itself is below
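>>
>> For reference, here is a rough scalar sketch of the operation the
>> intrinsic implements (the return convention below is only illustrative;
>> the vector code produces the same result, many characters per iteration):
>>
>> static int compress(char[] src, int srcOff, byte[] dst, int dstOff, int len) {
>>     for (int i = 0; i < len; i++) {
>>         char c = src[srcOff + i];
>>         if (c > 0xFF) {
>>             return 0;                // not compressible to latin-1
>>         }
>>         dst[dstOff + i] = (byte) c;
>>     }
>>     return len;                      // every char fit into a single byte
>> }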
>>
>> benchmarks: I created a JMH benchmark with a direct call via reflection:
>> http://cr.openjdk.java.net/~dpochepk/8189112/StrCompressBench.java
> I think this benchmark is misleading because it uses Method.invoke()
> in the inner timing loop.  I rewrote it to use a MethodHandle, and got:
>
> Benchmark                                   (ALL)  (size)  Mode  Cnt    Score    Error  Units
> StrCompressBench.compressDifferent        1000000     256  avgt   10  394.814 ± 69.714  ns/op
> StrCompressBench.compressDifferentHandle  1000000     256  avgt   10  242.431 ±  0.861  ns/op
>
> It's at http://cr.openjdk.java.net/~aph/8189112/StrCompressBench.java
>
> (Note: Method.invoke() has so much jitter because it does a ton
> of work boxing and unboxing the args.  You'll see this if you look
> at the disassembly of StrCompressBench.compressDifferent().)
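>
> For illustration, a sketch of the MethodHandle variant (the real benchmark
> is at the URL above; the class name below is made up, the signature of
> StringUTF16.compress is assumed to be (char[], int, byte[], int, int) -> int,
> and on JDK 9+ the run needs --add-opens java.base/java.lang=ALL-UNNAMED so
> that setAccessible() succeeds):
>
> import java.lang.invoke.MethodHandle;
> import java.lang.invoke.MethodHandles;
> import java.lang.reflect.Method;
> import org.openjdk.jmh.annotations.*;
>
> @State(Scope.Thread)
> public class StrCompressHandleSketch {
>     @Param({"256"})
>     public int size;
>
>     public char[] src;
>     public byte[] dst;
>     public MethodHandle compress;    // resolved once in @Setup, not in the timed loop
>
>     @Setup
>     public void setup() throws Exception {
>         src = new char[size];        // all-latin-1 input, so the fast path is taken
>         dst = new byte[size];
>         Method m = Class.forName("java.lang.StringUTF16")
>                 .getDeclaredMethod("compress", char[].class, int.class,
>                                    byte[].class, int.class, int.class);
>         m.setAccessible(true);
>         compress = MethodHandles.lookup().unreflect(m);
>     }
>
>     @Benchmark
>     public int compressHandle() throws Throwable {
>         // invokeExact does no per-call boxing, unlike Method.invoke()
>         return (int) compress.invokeExact(src, 0, dst, 0, size);
>     }
> }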
>
> With that change, I get (on APM Mustang)
>
> Before your change:
>
> Benchmark                                   (ALL)  (size)  Mode  Cnt      Score   Error  Units
> StrCompressBench.compressDifferentHandle  1000000       4  avgt   10     30.739 ± 0.128  ns/op
> StrCompressBench.compressDifferentHandle  1000000       8  avgt   10     33.451 ± 0.172  ns/op
> StrCompressBench.compressDifferentHandle  1000000      16  avgt   10     42.327 ± 0.058  ns/op
> StrCompressBench.compressDifferentHandle  1000000     256  avgt   10    389.433 ± 1.608  ns/op
> StrCompressBench.compressDifferentHandle  1000000    1024  avgt   10   1028.375 ± 4.364  ns/op
> StrCompressBench.compressDifferentHandle  1000000   32768  avgt   10  15321.996 ± 5.059  ns/op
>
> After:
>
> Benchmark                                   (ALL)  (size)  Mode  Cnt      Score    Error  Units
> StrCompressBench.compressDifferentHandle  1000000       4  avgt   10     30.097 ±  0.071  ns/op
> StrCompressBench.compressDifferentHandle  1000000       8  avgt   10     29.482 ±  0.122  ns/op
> StrCompressBench.compressDifferentHandle  1000000      16  avgt   10     36.548 ±  0.070  ns/op
> StrCompressBench.compressDifferentHandle  1000000     256  avgt   10    240.499 ±  0.446  ns/op
> StrCompressBench.compressDifferentHandle  1000000    1024  avgt   10    603.500 ±  0.829  ns/op
> StrCompressBench.compressDifferentHandle  1000000   32768  avgt   10  14538.528 ± 30.215  ns/op
>
> ... which is a decent-enough speedup for medium-sized strings.
>
> OK.
>


