Observations on vectorbenchmarks on ARM
Ludovic Henry
luhenry at microsoft.com
Wed Nov 18 17:29:05 UTC 2020
Hi,
To continue learning about the state of the Vector API, I looked at vectorbenchmarks on ARM.
First, I noticed that many benchmarks are significantly slower (often by two or more orders of magnitude). The reason is pretty simple: many of these benchmarks use vectors of 256 bits or more, but, unfortunately, NEON only supports vectors up to 128 bits. The Vector API then falls back to the pure-Java implementation, without any hardware acceleration. I haven't dug deeper into it, but maybe the Java fallback could be made more autovectorization-friendly so that some of these cases are still accelerated.
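For illustration, here is a minimal sketch (not code from vectorbenchmarks; the species constants are the ones from jdk.incubator.vector, and the class name is made up) of the difference between hard-coding a 256-bit species, which lands on the Java fallback on NEON, and letting the API pick the widest shape the hardware supports:

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesChoice {
    // Hard-coding a 256-bit species: NEON tops out at 128 bits, so on ARM
    // operations on this species go through the pure-Java fallback path.
    static final VectorSpecies<Integer> FIXED_256 = IntVector.SPECIES_256;

    // SPECIES_PREFERRED resolves to the widest shape the hardware supports
    // (128 bits on NEON), so the operations stay on the intrinsified path.
    static final VectorSpecies<Integer> PREFERRED = IntVector.SPECIES_PREFERRED;

    public static void main(String[] args) {
        // Compile and run with --add-modules jdk.incubator.vector.
        System.out.println("preferred species length: " + PREFERRED.length());
    }
}
```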
Then, I looked closer at two of the simplest benchmarks: BitmapLogicals.intersectAutovectorised and BitmapLogicals.intersectPanamaInt. Here, even when modified to use an Int128Vector (and therefore NEON), the vector implementation is still 2x slower (on my machine, I get ~2000 ops/ms for intersectAutovectorised and ~1000 ops/ms for intersectPanamaInt). This slowdown is at least partially due to VectorIntrinsics.checkFromIndexSize: if I set -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0, the vector implementation jumps up to ~2300 ops/ms.
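For reference, the vectorized kernel is roughly of this shape (a sketch from memory rather than the exact vectorbenchmarks source, using SPECIES_128 so it maps onto NEON); each fromArray/intoArray call goes through a bounds check unless the property above disables it:

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class Intersect {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    // Bitwise AND of two bitmaps held as int[] (the "intersect" operation).
    static void intersect(int[] left, int[] right, int[] result) {
        int i = 0;
        int bound = SPECIES.loopBound(left.length);
        for (; i < bound; i += SPECIES.length()) {
            // Every fromArray/intoArray performs a bounds check
            // (VectorIntrinsics.checkFromIndexSize) unless
            // -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0 is set.
            IntVector l = IntVector.fromArray(SPECIES, left, i);
            IntVector r = IntVector.fromArray(SPECIES, right, i);
            l.and(r).intoArray(result, i);
        }
        // Scalar tail for the remaining elements.
        for (; i < left.length; i++) {
            result[i] = left[i] & right[i];
        }
    }
}
```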
Finally, even with -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0, I observed something weird: the throughput of BitmapLogicals.intersectPanamaInt oscillates between ~1500 ops/ms and ~2300 ops/ms, and the oscillation is very regular. See [1] for an example output. I haven't gotten to the bottom of why it happens. One clue is that it mostly affects the smaller arrays (2048 elements vs. 131072 elements): with the bigger array, the bottleneck is memory access (~30% L1d cache misses), whereas with the smaller array the data fits in the L1d cache (<1% L1d cache misses).
My next steps are to 1. figure out how we could improve bounds check elimination to speed up the `fromArray` and `intoArray` operations, and discuss it on the hotspot-compiler-dev mailing list, and 2. understand the root cause of the oscillation in performance.
Any pointers would be greatly appreciated!
Thank you,
Ludovic
[1]
```
# Warmup Iteration 1: 1132.934 ops/ms
# Warmup Iteration 2: 2227.649 ops/ms
# Warmup Iteration 3: 1532.657 ops/ms
# Warmup Iteration 4: 2285.691 ops/ms
# Warmup Iteration 5: 1472.801 ops/ms
# Warmup Iteration 6: 2260.508 ops/ms
# Warmup Iteration 7: 1512.909 ops/ms
# Warmup Iteration 8: 2312.553 ops/ms
# Warmup Iteration 9: 1508.813 ops/ms
# Warmup Iteration 10: 2303.308 ops/ms
# Warmup Iteration 11: 1534.152 ops/ms
# Warmup Iteration 12: 2260.814 ops/ms
# Warmup Iteration 13: 1490.340 ops/ms
# Warmup Iteration 14: 2287.080 ops/ms
# Warmup Iteration 15: 1512.709 ops/ms
# Warmup Iteration 16: 1469.387 ops/ms
# Warmup Iteration 17: 2305.459 ops/ms
# Warmup Iteration 18: 1495.418 ops/ms
# Warmup Iteration 19: 2282.759 ops/ms
# Warmup Iteration 20: 1486.290 ops/ms
Iteration 1: 2279.861 ops/ms
Iteration 2: 1469.178 ops/ms
Iteration 3: 2302.454 ops/ms
Iteration 4: 1507.177 ops/ms
Iteration 5: 2282.104 ops/ms
Iteration 6: 1532.449 ops/ms
Iteration 7: 2273.841 ops/ms
Iteration 8: 1471.526 ops/ms
Iteration 9: 2309.160 ops/ms
Iteration 10: 1495.957 ops/ms
Iteration 11: 2286.309 ops/ms
Iteration 12: 1481.660 ops/ms
Iteration 13: 2280.471 ops/ms
Iteration 14: 1511.154 ops/ms
Iteration 15: 2314.261 ops/ms
Iteration 16: 1472.466 ops/ms
Iteration 17: 2283.490 ops/ms
Iteration 18: 1482.873 ops/ms
Iteration 19: 2306.502 ops/ms
Iteration 20: 1505.751 ops/ms
```