[aarch64-port-dev ] RFR(M): 8189112 - AARCH64: optimize StringUTF16 compress intrinsic

Tue May 15 17:10:46 UTC 2018

On 05/08/2018 02:26 PM, Dmitrij Pochepko wrote:
> Hi all,
> 
> please review the patch for 8189112 - AARCH64: optimize StringUTF16
> compress intrinsic
> 
> This patch is based on 3 improvement ideas:
> 
> - introduction of an additional large loop with a prefetch instruction
> for long strings
> - a different compression implementation, using uzp1 and uzp2
> instructions instead of uqxtn and uqxtn2, which are more expensive. It
> also allows dropping direct FPSR register operations, which are very
> slow on some CPUs.
> - a slightly different code shape, which mostly executes branches and
> independent operations while loads and stores are in progress (helps
> "in-order" CPUs)
> 
> benchmarks: I created a JMH benchmark with a direct call via reflection:
> http://cr.openjdk.java.net/~dpochepk/8189112/StrCompressBench.java

I think this benchmark is misleading because it uses Method.invoke()
in the inner timing loop.  I rewrote it to use a MethodHandle, and got:

Benchmark                                   (ALL)  (size)  Mode  Cnt    Score    Error  Units
StrCompressBench.compressDifferent        1000000     256  avgt   10  394.814 ± 69.714  ns/op
StrCompressBench.compressDifferentHandle  1000000     256  avgt   10  242.431 ±  0.861  ns/op

It's at http://cr.openjdk.java.net/~aph/8189112/StrCompressBench.java

(Note: Method.invoke() has so much jitter because it does a ton
of work boxing and unboxing the args.  You'll see this if you look
at the disassembly of StrCompressBench.compressDifferent().)
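
To make the difference concrete, here is a minimal sketch of the two
call styles.  It is not the code from either benchmark file: the class
InvokeVsHandle and its compress() method are hypothetical stand-ins
with the same argument shape, so that the example runs without access
to JDK internals.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;

final class InvokeVsHandle {
    // Hypothetical stand-in for the method under test (assumption, not JDK code):
    // narrows each char to a byte, giving up if a char does not fit in 8 bits.
    static int compress(char[] src, int srcOff, byte[] dst, int dstOff, int len) {
        for (int i = 0; i < len; i++) {
            char c = src[srcOff + i];
            if (c > 0xFF) {
                return 0;
            }
            dst[dstOff + i] = (byte) c;
        }
        return len;
    }

    private static final Method COMPRESS_M;
    // Keeping the handle in a static final field lets the JIT treat it as a constant.
    private static final MethodHandle COMPRESS_MH;

    static {
        try {
            COMPRESS_M = InvokeVsHandle.class.getDeclaredMethod(
                    "compress", char[].class, int.class, byte[].class, int.class, int.class);
            COMPRESS_MH = MethodHandles.lookup().findStatic(
                    InvokeVsHandle.class, "compress",
                    MethodType.methodType(int.class, char[].class, int.class,
                                          byte[].class, int.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Reflection: the arguments travel in an Object[], so every int is boxed
    // to Integer on the way in and the result is unboxed on the way out.
    static int viaReflection(char[] src, byte[] dst) {
        try {
            return (Integer) COMPRESS_M.invoke(null, src, 0, dst, 0, src.length);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    // Method handle: invokeExact is compiled against the exact signature
    // (char[], int, byte[], int, int) -> int, so no boxing is involved.
    static int viaHandle(char[] src, byte[] dst) {
        try {
            return (int) COMPRESS_MH.invokeExact(src, 0, dst, 0, src.length);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        char[] src = "abcd".toCharArray();
        System.out.println(viaReflection(src, new byte[src.length]));
        System.out.println(viaHandle(src, new byte[src.length]));
    }
}
```

In a JMH benchmark, the body of viaHandle() is what would sit inside
the @Benchmark method, so the timing loop measures mostly the callee
rather than the argument marshalling.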

With that change, I get (on APM Mustang)

Before your change:

Benchmark                                   (ALL)  (size)  Mode  Cnt      Score   Error  Units
StrCompressBench.compressDifferentHandle  1000000       4  avgt   10     30.739 ± 0.128  ns/op
StrCompressBench.compressDifferentHandle  1000000       8  avgt   10     33.451 ± 0.172  ns/op
StrCompressBench.compressDifferentHandle  1000000      16  avgt   10     42.327 ± 0.058  ns/op
StrCompressBench.compressDifferentHandle  1000000     256  avgt   10    389.433 ± 1.608  ns/op
StrCompressBench.compressDifferentHandle  1000000    1024  avgt   10   1028.375 ± 4.364  ns/op
StrCompressBench.compressDifferentHandle  1000000   32768  avgt   10  15321.996 ± 5.059  ns/op

After:

Benchmark                                   (ALL)  (size)  Mode  Cnt      Score    Error  Units
StrCompressBench.compressDifferentHandle  1000000       4  avgt   10     30.097 ±  0.071  ns/op
StrCompressBench.compressDifferentHandle  1000000       8  avgt   10     29.482 ±  0.122  ns/op
StrCompressBench.compressDifferentHandle  1000000      16  avgt   10     36.548 ±  0.070  ns/op
StrCompressBench.compressDifferentHandle  1000000     256  avgt   10    240.499 ±  0.446  ns/op
StrCompressBench.compressDifferentHandle  1000000    1024  avgt   10    603.500 ±  0.829  ns/op
StrCompressBench.compressDifferentHandle  1000000   32768  avgt   10  14538.528 ± 30.215  ns/op

... which is a decent-enough speedup for medium-sized strings: about
1.6x at size 256 and 1.7x at size 1024, versus roughly 5% at size 32768.

OK.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671