[aarch64-port-dev ] [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives
Dmitrij Pochepko
dmitrij.pochepko at bell-sw.com
Thu Jul 20 18:27:12 UTC 2017
Thank you.
Interesting results.
I see in general Stuart's version is faster on smaller sizes. I suppose
it's due to single 8-byte load at the start and the same load at the
end, which saves up to 3 loads + few cpu cycles. Also, an aligned access
might help on some platforms. I also have version of my code with
aligned access(attached as alternative implementation to CR quite a time
ago), but it seems like I don't have platform which shows large
difference in this case so, I've put this patch aside.
Btw: I've also considered such unconditional 8-bytes load at start, but
abandoned this idea since I wasn't sure if it's safe. Say, array is
allocated at the border of allocated region(so, last array byte == last
allocated region byte). Then hasNegatives is called with offset ==
array_length - 1 and len = 1 just to check last byte, so, then 8-byte
load is issued at this address?
I also have following results on ThunderX T88(shows significant
improvement on 10000 and 100000 length (about 1.5x and 2.5x) comparing
to Stuart's implementation):
My:
Benchmark (length) Mode Cnt Score
Error Unitsthat
HasNegativesBench.loopingFastMethod 1 avgt 5 7555.169 ?
35.714 ns/op
HasNegativesBench.loopingFastMethod 4 avgt 5 9030.759 ?
7.614 ns/op
HasNegativesBench.loopingFastMethod 31 avgt 5 27586.010 ?
16.815 ns/op
HasNegativesBench.loopingFastMethod 65 avgt 5 40239.515 ?
564.833 ns/op
HasNegativesBench.loopingFastMethod 101 avgt 5 52673.495 ?
176.033 ns/op
HasNegativesBench.loopingFastMethod 256 avgt 5 111487.193 ?
551.301 ns/op
HasNegativesBench.loopingFastMethod 1000 avgt 5 392706.118 ?
1749.139 ns/op
HasNegativesBench.loopingFastMethod 10000 avgt 5 1274876.279 ?
11404.115 ns/op
HasNegativesBench.loopingFastMethod 100000 avgt 5 13627036.757 ?
129977.081 ns/op
Stuart's:
Benchmark (length) Mode Cnt Score
Error Units
HasNegativesBench.loopingFastMethod 1 avgt 5 7535.175 ?
50.769 ns/op
HasNegativesBench.loopingFastMethod 4 avgt 5 7526.599 ?
8.993 ns/op
HasNegativesBench.loopingFastMethod 31 avgt 5 18554.420 ?
1.448 ns/op
HasNegativesBench.loopingFastMethod 65 avgt 5 26607.388 ?
89.429 ns/op
HasNegativesBench.loopingFastMethod 101 avgt 5 32641.349 ?
168.976 ns/op
HasNegativesBench.loopingFastMethod 256 avgt 5 60745.493 ?
362.656 ns/op
HasNegativesBench.loopingFastMethod 1000 avgt 5 202915.691 ?
1103.984 ns/op
HasNegativesBench.loopingFastMethod 10000 avgt 5 1898428.471 ?
10022.381 ns/op
HasNegativesBench.loopingFastMethod 100000 avgt 5 33463429.058 ?
548791.811 ns/op
And on R-Pi 3 (Cortex A53) (about the same improvement on large size):
My:
Benchmark (length) Mode Cnt Score
Error Units
HasNegativesBench.loopingFastMethod 1 avgt 5 15233.213 ±
10299.068 ns/op
HasNegativesBench.loopingFastMethod 4 avgt 5 28372.544 ±
22395.968 ns/op
HasNegativesBench.loopingFastMethod 31 avgt 5 54031.864 ±
41530.777 ns/op
HasNegativesBench.loopingFastMethod 65 avgt 5 60528.950 ±
23216.620 ns/op
HasNegativesBench.loopingFastMethod 101 avgt 5 68123.059 ±
31609.714 ns/op
HasNegativesBench.loopingFastMethod 256 avgt 5 130330.740 ±
109803.722 ns/op
HasNegativesBench.loopingFastMethod 1000 avgt 5 289047.106 ±
197153.259 ns/op
HasNegativesBench.loopingFastMethod 10000 avgt 5 3175862.063 ±
3126363.838 ns/op
HasNegativesBench.loopingFastMethod 100000 avgt 5 28595658.058 ±
15509202.529 ns/op
Stuart's:
Benchmark (length) Mode Cnt Score
Error Units
HasNegativesBench.loopingFastMethod 1 avgt 5 16068.939 ±
13611.338 ns/op
HasNegativesBench.loopingFastMethod 4 avgt 5 22888.871 ±
21902.553 ns/op
HasNegativesBench.loopingFastMethod 31 avgt 5 40784.842 ±
44233.928 ns/op
HasNegativesBench.loopingFastMethod 65 avgt 5 66288.469 ±
65255.857 ns/op
HasNegativesBench.loopingFastMethod 101 avgt 5 89416.174 ±
93875.338 ns/op
HasNegativesBench.loopingFastMethod 256 avgt 5 170013.296 ±
86799.999 ns/op
HasNegativesBench.loopingFastMethod 1000 avgt 5 635557.297 ±
141291.822 ns/op
HasNegativesBench.loopingFastMethod 10000 avgt 5 5368914.966 ±
7607076.827 ns/op
HasNegativesBench.loopingFastMethod 100000 avgt 5 47019213.416 ±
40360305.523 ns/op
Probably best way would be to merge large data loads from my patch and
Stuart's lightning-fast small arrays handling.
I'll be happy to merge these ideas in one intrinsic that works fastest
on small and large arrays if Stuart does not mind. I could use some help
testing the final solution on some of the HW we don't have. I don't mind
if Stuart want to merge it, then we'll help him with testing on h/w he
doesn't have.
Thanks,
Dmitrij
On 20.07.2017 19:32, Andrew Haley wrote:
> On 20/07/17 16:17, Dmitrij Pochepko wrote:
>> can you check large length, like 10000, 100000 (I support this jmh
>> options will do it: -p length=10000,100000)
> stuart:
>
> Benchmark (length) Mode Cnt Score Error Units
> HasNegatives.loopingFastMethod 10000 avgt 5 788432.952 ? 362.183 ns/op
> HasNegatives.loopingFastMethod 100000 avgt 5 12401737.536 ? 17752.545 ns/op
>
> dmitrij:
>
> Benchmark (length) Mode Cnt Score Error Units
> HasNegatives.loopingFastMethod 10000 avgt 5 918447.832 ? 223.858 ns/op
> HasNegatives.loopingFastMethod 100000 avgt 5 11745723.456 ? 7526.962 ns/op
>
More information about the aarch64-port-dev
mailing list