[aarch64-port-dev ] [10] RFR(S): JDK-8184943: AARCH64: Intrinsify hasNegatives

Dmitrij Pochepko dmitrij.pochepko at bell-sw.com
Thu Jul 20 18:27:12 UTC 2017


Thank you.

Interesting results.

I see that, in general, Stuart's version is faster at smaller sizes. I 
suppose that's due to the single 8-byte load at the start and the same 
kind of load at the end, which saves up to 3 loads plus a few CPU 
cycles. An aligned access might also help on some platforms. I also 
have a version of my code with aligned access (attached to the CR as an 
alternative implementation quite a while ago), but it seems I don't 
have a platform that shows a large difference in that case, so I put 
that patch aside.
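
For reference, here is a rough plain-Java model of the word-at-a-time 
idea being discussed (the names, and the use of a VarHandle instead of 
raw loads, are mine for illustration; the real intrinsics are generated 
AArch64 code). The point is that a trailing 8-byte load ending exactly 
at the last byte can replace a byte-by-byte tail loop:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Plain-Java model of the word-at-a-time check: eight bytes are read as
// one long, and any set sign bit (0x80) means a negative byte was found.
public class HasNegativesSketch {
    private static final VarHandle LONGS =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);
    private static final long SIGN_BITS = 0x8080808080808080L;

    static boolean hasNegatives(byte[] a, int off, int len) {
        if (len < 8) {
            // Too short for an in-bounds 8-byte load; fall back to bytes
            // (this is where the safety question below comes in).
            for (int i = off; i < off + len; i++) {
                if (a[i] < 0) return true;
            }
            return false;
        }
        int end = off + len;
        // Full 8-byte words from the start, one load and one test each.
        int i = off;
        for (; i + 8 <= end; i += 8) {
            if (((long) LONGS.get(a, i) & SIGN_BITS) != 0) return true;
        }
        // Tail: a single 8-byte load ending exactly at the last byte; it
        // overlaps the previous word when len is not a multiple of 8.
        return ((long) LONGS.get(a, end - 8) & SIGN_BITS) != 0;
    }
}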

Btw: I also considered such an unconditional 8-byte load at the start, 
but abandoned the idea because I wasn't sure it is safe. Say the array 
is allocated at the border of an allocated region (so the last array 
byte is the last byte of the allocated region). Then hasNegatives is 
called with offset == array_length - 1 and len == 1, just to check the 
last byte; is an 8-byte load then issued at this address?
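
To make the concern concrete, here is the arithmetic only (illustrative 
Java, not code from either patch; the array size is a made-up example): 
with offset == array_length - 1 and len == 1, an unconditional 8-byte 
load at that offset reaches several bytes past the end of the array, 
which is only harmless if that memory is known to be mapped.

// Illustrative arithmetic only: how far an unconditional 8-byte load
// at 'off' reaches past the array in the offset/len case described above.
public class OverrunArithmetic {
    public static void main(String[] args) {
        int arrayLength = 100;        // hypothetical array; its last byte is
                                      // assumed to be the last mapped byte
        int off = arrayLength - 1;    // hasNegatives(array, off, len) with len == 1
        int lastByteRead = off + 7;   // an 8-byte load at 'off' reads [off, off + 8)
        int bytesPastEnd = lastByteRead - (arrayLength - 1);
        System.out.println("bytes read beyond the array: " + bytesPastEnd); // prints 7
    }
}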


I also have the following results on ThunderX T88 (which shows a 
significant improvement at lengths 10000 and 100000, about 1.5x and 
2.5x, compared to Stuart's implementation):

My:

Benchmark                            (length)  Mode  Cnt         Score        Error  Units
HasNegativesBench.loopingFastMethod         1  avgt    5      7555.169 ±     35.714  ns/op
HasNegativesBench.loopingFastMethod         4  avgt    5      9030.759 ±      7.614  ns/op
HasNegativesBench.loopingFastMethod        31  avgt    5     27586.010 ±     16.815  ns/op
HasNegativesBench.loopingFastMethod        65  avgt    5     40239.515 ±    564.833  ns/op
HasNegativesBench.loopingFastMethod       101  avgt    5     52673.495 ±    176.033  ns/op
HasNegativesBench.loopingFastMethod       256  avgt    5    111487.193 ±    551.301  ns/op
HasNegativesBench.loopingFastMethod      1000  avgt    5    392706.118 ±   1749.139  ns/op
HasNegativesBench.loopingFastMethod     10000  avgt    5   1274876.279 ±  11404.115  ns/op
HasNegativesBench.loopingFastMethod    100000  avgt    5  13627036.757 ± 129977.081  ns/op

Stuart's:
Benchmark                            (length)  Mode  Cnt         Score        Error  Units
HasNegativesBench.loopingFastMethod         1  avgt    5      7535.175 ±     50.769  ns/op
HasNegativesBench.loopingFastMethod         4  avgt    5      7526.599 ±      8.993  ns/op
HasNegativesBench.loopingFastMethod        31  avgt    5     18554.420 ±      1.448  ns/op
HasNegativesBench.loopingFastMethod        65  avgt    5     26607.388 ±     89.429  ns/op
HasNegativesBench.loopingFastMethod       101  avgt    5     32641.349 ±    168.976  ns/op
HasNegativesBench.loopingFastMethod       256  avgt    5     60745.493 ±    362.656  ns/op
HasNegativesBench.loopingFastMethod      1000  avgt    5    202915.691 ±   1103.984  ns/op
HasNegativesBench.loopingFastMethod     10000  avgt    5   1898428.471 ±  10022.381  ns/op
HasNegativesBench.loopingFastMethod    100000  avgt    5  33463429.058 ± 548791.811  ns/op


And on Raspberry Pi 3 (Cortex-A53), with about the same improvement at 
large sizes:

My:

Benchmark                            (length)  Mode  Cnt         Score          Error  Units
HasNegativesBench.loopingFastMethod         1  avgt    5     15233.213 ±    10299.068  ns/op
HasNegativesBench.loopingFastMethod         4  avgt    5     28372.544 ±    22395.968  ns/op
HasNegativesBench.loopingFastMethod        31  avgt    5     54031.864 ±    41530.777  ns/op
HasNegativesBench.loopingFastMethod        65  avgt    5     60528.950 ±    23216.620  ns/op
HasNegativesBench.loopingFastMethod       101  avgt    5     68123.059 ±    31609.714  ns/op
HasNegativesBench.loopingFastMethod       256  avgt    5    130330.740 ±   109803.722  ns/op
HasNegativesBench.loopingFastMethod      1000  avgt    5    289047.106 ±   197153.259  ns/op
HasNegativesBench.loopingFastMethod     10000  avgt    5   3175862.063 ±  3126363.838  ns/op
HasNegativesBench.loopingFastMethod    100000  avgt    5  28595658.058 ± 15509202.529  ns/op

Stuart's:

Benchmark                            (length)  Mode  Cnt         Score          Error  Units
HasNegativesBench.loopingFastMethod         1  avgt    5     16068.939 ±    13611.338  ns/op
HasNegativesBench.loopingFastMethod         4  avgt    5     22888.871 ±    21902.553  ns/op
HasNegativesBench.loopingFastMethod        31  avgt    5     40784.842 ±    44233.928  ns/op
HasNegativesBench.loopingFastMethod        65  avgt    5     66288.469 ±    65255.857  ns/op
HasNegativesBench.loopingFastMethod       101  avgt    5     89416.174 ±    93875.338  ns/op
HasNegativesBench.loopingFastMethod       256  avgt    5    170013.296 ±    86799.999  ns/op
HasNegativesBench.loopingFastMethod      1000  avgt    5    635557.297 ±   141291.822  ns/op
HasNegativesBench.loopingFastMethod     10000  avgt    5   5368914.966 ±  7607076.827  ns/op
HasNegativesBench.loopingFastMethod    100000  avgt    5  47019213.416 ± 40360305.523  ns/op


Probably the best approach would be to merge the large block loads from 
my patch with Stuart's very fast small-array handling, roughly along 
the lines of the sketch below.
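
Structurally, the merged intrinsic could look something like this 
(illustrative Java again, not code from either patch; the 32-byte block 
size and the names are assumptions): large lengths go through an 
unrolled loop that folds several words per iteration and tests the sign 
bits once per block, which is roughly what the wide block loads buy on 
large arrays, while small lengths and the remainder take the fast 
small-array path.

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hedged sketch of the proposed combination, structure only.
public class HasNegativesMergedSketch {
    private static final VarHandle LONGS =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);
    private static final long SIGN_BITS = 0x8080808080808080L;

    static boolean hasNegatives(byte[] a, int off, int len) {
        int i = off;
        int end = off + len;
        // Large-array path: 32 bytes per iteration, one sign-bit test per block.
        for (; i + 32 <= end; i += 32) {
            long acc = (long) LONGS.get(a, i)
                     | (long) LONGS.get(a, i + 8)
                     | (long) LONGS.get(a, i + 16)
                     | (long) LONGS.get(a, i + 24);
            if ((acc & SIGN_BITS) != 0) return true;
        }
        // Remainder (< 32 bytes): in the real merged intrinsic this would
        // be the single-8-byte-load small-array path; a byte loop keeps
        // the sketch short.
        for (; i < end; i++) {
            if (a[i] < 0) return true;
        }
        return false;
    }
}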

I'll be happy to merge these ideas into one intrinsic that is fastest 
on both small and large arrays, if Stuart doesn't mind; I could use 
some help testing the final version on some of the hardware we don't 
have. I also don't mind if Stuart wants to do the merge, in which case 
we'll help him with testing on hardware he doesn't have.


Thanks,

Dmitrij


On 20.07.2017 19:32, Andrew Haley wrote:
> On 20/07/17 16:17, Dmitrij Pochepko wrote:
>> can you check large lengths, like 10000, 100000   (I suppose these jmh
>> options will do it: -p length=10000,100000)
> stuart:
>
> Benchmark                       (length)  Mode  Cnt         Score       Error  Units
> HasNegatives.loopingFastMethod     10000  avgt    5    788432.952 ?   362.183  ns/op
> HasNegatives.loopingFastMethod    100000  avgt    5  12401737.536 ? 17752.545  ns/op
>
> dmitrij:
>
> Benchmark                       (length)  Mode  Cnt         Score      Error  Units
> HasNegatives.loopingFastMethod     10000  avgt    5    918447.832 ?  223.858  ns/op
> HasNegatives.loopingFastMethod    100000  avgt    5  11745723.456 ? 7526.962  ns/op
>
