RFR: 8268229: Aarch64: Use Neon in intrinsics for String.equals [v3]
Wang Huang
whuang at openjdk.java.net
Mon Jul 5 06:57:54 UTC 2021
On Fri, 2 Jul 2021 14:30:18 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp line 4670:
>>
>>> 4668: __ cbnz(rscratch1, NOT_EQUAL);
>>> 4669: __ br(__ GE, LOOP);
>>> 4670:
>>
>> As I said before, we gain nothing by using Neon here.
>
> Much better:
>
>
> + __ ldp(r5, r6, Address(__ post(a1, wordSize * 2)));
> + __ ldp(rscratch1, rscratch2, Address(__ post(a2, wordSize * 2)));
> + __ cmp(r5, rscratch1);
> + __ ccmp(r6, rscratch2, 0, Assembler::EQ);
> + __ br(__ NE, NOT_EQUAL);
We changed `ld1` to `ldp` and got the following results.
simple:
Benchmark          | (size) | Mode | Cnt |   Score |    Error | Units
-------------------|--------|------|-----|---------|----------|------
StringEquals.equal |     45 | avgt |   5 |   6.105 | ±  0.635 | us/op
StringEquals.equal |     64 | avgt |   5 |   7.226 | ±  0.056 | us/op
StringEquals.equal |     91 | avgt |   5 |  12.010 | ±  0.375 | us/op
StringEquals.equal |    121 | avgt |   5 |  14.772 | ±  0.114 | us/op
StringEquals.equal |    181 | avgt |   5 |  21.468 | ±  0.676 | us/op
StringEquals.equal |    256 | avgt |   5 |  28.942 | ±  4.806 | us/op
StringEquals.equal |    512 | avgt |   5 |  58.479 | ±  5.918 | us/op
StringEquals.equal |   1024 | avgt |   5 | 119.313 | ± 16.661 | us/op
ldp:
Benchmark          | (size) | Mode | Cnt |   Score |    Error | Units
-------------------|--------|------|-----|---------|----------|------
StringEquals.equal |     45 | avgt |   5 |   6.449 | ±  0.202 | us/op
StringEquals.equal |     64 | avgt |   5 |   7.367 | ±  0.055 | us/op
StringEquals.equal |     91 | avgt |   5 |   9.984 | ±  0.065 | us/op
StringEquals.equal |    121 | avgt |   5 |  12.540 | ±  0.545 | us/op
StringEquals.equal |    181 | avgt |   5 |  15.614 | ±  0.280 | us/op
StringEquals.equal |    256 | avgt |   5 |  19.346 | ±  0.243 | us/op
StringEquals.equal |    512 | avgt |   5 |  35.718 | ±  0.599 | us/op
StringEquals.equal |   1024 | avgt |   5 |  67.846 | ±  0.439 | us/op
neon:
Benchmark          | (size) | Mode | Cnt |   Score |    Error | Units
-------------------|--------|------|-----|---------|----------|------
StringEquals.equal |     45 | avgt |   5 |   5.883 | ±  0.173 | us/op
StringEquals.equal |     64 | avgt |   5 |   6.737 | ±  0.035 | us/op
StringEquals.equal |     91 | avgt |   5 |   8.997 | ±  0.215 | us/op
StringEquals.equal |    121 | avgt |   5 |  10.789 | ±  0.386 | us/op
StringEquals.equal |    181 | avgt |   5 |  14.063 | ±  0.253 | us/op
StringEquals.equal |    256 | avgt |   5 |  19.679 | ±  1.419 | us/op
StringEquals.equal |    512 | avgt |   5 |  38.813 | ±  1.378 | us/op
StringEquals.equal |   1024 | avgt |   5 |  77.769 | ±  3.082 | us/op
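
For reference, the three tables above can be cross-checked with a small C++ sketch. The scores are copied verbatim from the tables; the names `Row`, `kRows`, `speedup`, `ldp_beats_neon` and `print_comparison` are made up for this illustration and are not part of the patch or of the benchmark. The error columns are ignored here, so close rows should be read with care.

```cpp
#include <array>
#include <cstdio>

// One row per benchmarked size; scores are in us/op, copied from the tables.
struct Row {
    int size;
    double simple;  // old `simple` version
    double ldp;     // `ldp` version
    double neon;    // `neon`/`ld1` version
};

constexpr std::array<Row, 8> kRows = {{
    {45,     6.105,  6.449,  5.883},
    {64,     7.226,  7.367,  6.737},
    {91,    12.010,  9.984,  8.997},
    {121,   14.772, 12.540, 10.789},
    {181,   21.468, 15.614, 14.063},
    {256,   28.942, 19.346, 19.679},
    {512,   58.479, 35.718, 38.813},
    {1024, 119.313, 67.846, 77.769},
}};

// Speedup of a variant relative to a baseline; a value above 1 means faster.
constexpr double speedup(double baseline, double variant) {
    return baseline / variant;
}

// True when the `ldp` score (lower is better) beats the `neon` score.
constexpr bool ldp_beats_neon(const Row& r) {
    return r.ldp < r.neon;
}

// Prints, per size, the speedup of both variants over `simple` and the winner.
void print_comparison() {
    std::printf("%6s %8s %8s %s\n", "size", "ldp", "neon", "faster");
    for (const Row& r : kRows) {
        std::printf("%6d %8.2f %8.2f %s\n", r.size,
                    speedup(r.simple, r.ldp), speedup(r.simple, r.neon),
                    ldp_beats_neon(r) ? "ldp" : "neon");
    }
}
```

Calling `print_comparison()` from a `main` function prints one line per size.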
From the results, we can see that:
* for small sizes (45~181), the `ldp` version does not perform as well as the `neon`/`ld1` version
* for big sizes (256~1024), the `ldp` version is better than the `neon`/`ld1` version
* all versions (both `ldp` and `ld1`) are better than the old `simple` version.
* I agree with you that the `ldp` version is better than the `neon` version from the **last patch**, because that patch used

      __ ldr(v0, __ Q, Address(__ post(a1, wordSize * 2)));
      __ ldr(v1, __ Q, Address(__ post(a2, wordSize * 2)));

  In the most recent patch, however, I use

      __ ld1(v0, v1, __ T2D, Address(__ post(a1, loopThreshold)));
      __ ld1(v2, v3, __ T2D, Address(__ post(a2, loopThreshold)));

  I think this change has fixed the problem here.
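
To make the comparison concrete, here is a portable C++ model of what the quoted `ldp`/`cmp`/`ccmp` loop body does: it loads two 64-bit words from each array per iteration and bails out if either pair differs. This is only a sketch for illustration. The function name `arrays_equal_16` and the byte-wise tail loop are assumptions of the sketch, not code from the patch or from the stub generator.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Compares `len` bytes of `a` and `b`, 16 bytes per main-loop iteration.
bool arrays_equal_16(const void* a, const void* b, std::size_t len) {
    const unsigned char* p = static_cast<const unsigned char*>(a);
    const unsigned char* q = static_cast<const unsigned char*>(b);

    while (len >= 16) {
        std::uint64_t a0, a1, b0, b1;
        // Two 64-bit loads per array, roughly what one `ldp` does.
        // memcpy keeps the loads well-defined for unaligned input.
        std::memcpy(&a0, p, 8);
        std::memcpy(&a1, p + 8, 8);
        std::memcpy(&b0, q, 8);
        std::memcpy(&b1, q + 8, 8);
        // Models `cmp` + `ccmp` + branch to NOT_EQUAL: fail if either pair differs.
        if (a0 != b0 || a1 != b1) {
            return false;
        }
        p += 16;
        q += 16;
        len -= 16;
    }

    // Hypothetical tail handling: compare the remaining 0..15 bytes one by one.
    for (std::size_t i = 0; i < len; ++i) {
        if (p[i] != q[i]) {
            return false;
        }
    }
    return true;
}
```

The `ld1` variant with two vector registers would correspond to widening the main loop to 32 bytes per array per iteration; the control flow stays the same.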
-------------
PR: https://git.openjdk.java.net/jdk/pull/4423