[aarch64-port-dev ] RFR(M): 8189112 - AARCH64: optimize StringUTF16 compress intrinsic
Dmitrij Pochepko
dmitrij.pochepko at bell-sw.com
Tue May 15 17:13:49 UTC 2018
Thank you for the review.
On 15.05.2018 20:10, Andrew Haley wrote:
> On 05/08/2018 02:26 PM, Dmitrij Pochepko wrote:
>> Hi all,
>>
>> please review patch for 8189112 - AARCH64: optimize StringUTF16 compress
>> intrinsic
>>
>> This patch is based on 3 improvement ideas:
>>
>> - introduction of an additional large loop with a prefetch instruction for
>> long strings
>> - a different compression implementation, using uzp1 and uzp2 instructions
>> instead of uqxtn and uqxtn2, which are more expensive. It also allows
>> dropping direct FPSR register operations, which are very slow on some CPUs.
>> - a slightly different code shape, which mostly executes branches and
>> independent operations while loads and stores are in progress (helps
>> "in-order" CPUs)
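As a rough scalar model of the uzp1/uzp2 idea in the list above: on a little-endian layout, taking the even-indexed bytes of a UTF-16 buffer yields the low byte of every char, and the odd-indexed bytes yield the high bytes, so the compression succeeds exactly when all high bytes are zero. The sketch below is not the patch's code. The class UzpSketch and its helpers are hypothetical, the flat byte array stands in for the two source registers the real instructions take, and the "return len on success, 0 on failure" convention is an assumption.

```java
class UzpSketch {
    // Lay chars out as little-endian bytes, as they would sit in memory on AArch64.
    static byte[] toLittleEndianBytes(char[] src) {
        byte[] raw = new byte[src.length * 2];
        for (int i = 0; i < src.length; i++) {
            raw[2 * i] = (byte) src[i];            // low byte
            raw[2 * i + 1] = (byte) (src[i] >>> 8); // high byte
        }
        return raw;
    }

    // Models uzp1 on byte lanes: keep the even-indexed elements.
    static byte[] uzp1(byte[] v) {
        byte[] r = new byte[v.length / 2];
        for (int i = 0; i < r.length; i++) r[i] = v[2 * i];
        return r;
    }

    // Models uzp2 on byte lanes: keep the odd-indexed elements.
    static byte[] uzp2(byte[] v) {
        byte[] r = new byte[v.length / 2];
        for (int i = 0; i < r.length; i++) r[i] = v[2 * i + 1];
        return r;
    }

    // Assumed convention: returns the number of chars on success, 0 otherwise.
    static int compress(char[] src, byte[] dst) {
        byte[] raw = toLittleEndianBytes(src);
        byte[] low = uzp1(raw);
        byte[] high = uzp2(raw);
        int anyHigh = 0;
        for (byte b : high) anyHigh |= b; // OR-reduce the high bytes
        if (anyHigh != 0) return 0;       // some char does not fit in one byte
        System.arraycopy(low, 0, dst, 0, low.length);
        return low.length;
    }
}
```

The point of the model is that the overflow check becomes an ordinary OR over the high bytes, so no saturation flag has to be read back from FPSR.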
>>
>> benchmarks: I created a JMH benchmark with a direct call via reflection:
>> http://cr.openjdk.java.net/~dpochepk/8189112/StrCompressBench.java
> I think this benchmark is misleading because it uses Method.invoke()
> in the inner timing loop. I rewrote it to use a MethodHandle, and got:
>
> Benchmark                                   (ALL)  (size)  Mode  Cnt    Score    Error  Units
> StrCompressBench.compressDifferent        1000000     256  avgt   10  394.814 ± 69.714  ns/op
> StrCompressBench.compressDifferentHandle  1000000     256  avgt   10  242.431 ±  0.861  ns/op
>
> It's at http://cr.openjdk.java.net/~aph/8189112/StrCompressBench.java
>
> (Note: Method.invoke() has so much jitter because it does a ton
> of work boxing and unboxing the args. You'll see this if you look
> at the disassembly of StrCompressBench.compressDifferent() .)
>
> With that change, I get (on APM Mustang)
>
> Before your change:
>
> Benchmark                                   (ALL)  (size)  Mode  Cnt      Score   Error  Units
> StrCompressBench.compressDifferentHandle  1000000       4  avgt   10     30.739 ± 0.128  ns/op
> StrCompressBench.compressDifferentHandle  1000000       8  avgt   10     33.451 ± 0.172  ns/op
> StrCompressBench.compressDifferentHandle  1000000      16  avgt   10     42.327 ± 0.058  ns/op
> StrCompressBench.compressDifferentHandle  1000000     256  avgt   10    389.433 ± 1.608  ns/op
> StrCompressBench.compressDifferentHandle  1000000    1024  avgt   10   1028.375 ± 4.364  ns/op
> StrCompressBench.compressDifferentHandle  1000000   32768  avgt   10  15321.996 ± 5.059  ns/op
>
> After:
>
> Benchmark                                   (ALL)  (size)  Mode  Cnt      Score    Error  Units
> StrCompressBench.compressDifferentHandle  1000000       4  avgt   10     30.097 ±  0.071  ns/op
> StrCompressBench.compressDifferentHandle  1000000       8  avgt   10     29.482 ±  0.122  ns/op
> StrCompressBench.compressDifferentHandle  1000000      16  avgt   10     36.548 ±  0.070  ns/op
> StrCompressBench.compressDifferentHandle  1000000     256  avgt   10    240.499 ±  0.446  ns/op
> StrCompressBench.compressDifferentHandle  1000000    1024  avgt   10    603.500 ±  0.829  ns/op
> StrCompressBench.compressDifferentHandle  1000000   32768  avgt   10  14538.528 ± 30.215  ns/op
>
> ... which is a decent-enough speedup for medium-sized strings.
>
> OK.
>
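For reference, the difference between the two call paths discussed above can be sketched as follows. This is a minimal illustration, not the benchmark at either URL: the class InvokeSketch and its compress method are hypothetical stand-ins for the method under test, and the "len or 0" return convention is an assumption. Method.invoke() packs its arguments into an Object[] and boxes the int on every call, while a static final MethodHandle called through invokeExact() takes the arguments with their exact static types.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;

class InvokeSketch {
    // Hypothetical stand-in for the method under test (not the JDK intrinsic):
    // narrows chars to bytes, returns len on success or 0 if a char exceeds 0xFF.
    static int compress(char[] src, byte[] dst, int len) {
        for (int i = 0; i < len; i++) {
            if (src[i] > 0xFF) return 0;
            dst[i] = (byte) src[i];
        }
        return len;
    }

    static final Method REFLECTIVE;
    static final MethodHandle HANDLE;
    static {
        try {
            REFLECTIVE = InvokeSketch.class.getDeclaredMethod(
                    "compress", char[].class, byte[].class, int.class);
            HANDLE = MethodHandles.lookup().findStatic(
                    InvokeSketch.class, "compress",
                    MethodType.methodType(int.class, char[].class, byte[].class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Reflective path: varargs Object[] allocation plus boxing of len and the result.
    static int viaReflection(char[] src, byte[] dst, int len) {
        try {
            return (Integer) REFLECTIVE.invoke(null, src, dst, len);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Method handle path: the call site must match (char[], byte[], int) -> int exactly.
    static int viaHandle(char[] src, byte[] dst, int len) {
        try {
            return (int) HANDLE.invokeExact(src, dst, len);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

Keeping the handle in a static final field matters here, since that is the form the JIT can usually treat as a constant and inline through. In a JMH benchmark, each of the two wrappers would be the body of its own @Benchmark method.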