Observations on vectorbenchmarks on ARM
Ludovic Henry
luhenry at microsoft.com
Wed Nov 18 17:29:05 UTC 2020
Hi,
To continue learning about the state of the Vector API, I looked at vectorbenchmarks on ARM.
First, I noticed that many benchmarks are significantly slower (often by two or more orders of magnitude). The reason is pretty simple: many of these benchmarks use vectors of 256 bits or more, but, unfortunately, NEON only supports vectors up to 128 bits. The Vector API then falls back to the pure-Java implementation, without any hardware acceleration. I haven't dug deeper into it, but maybe the Java fallback could be made more autovectorization-friendly so that some of these cases are still accelerated.
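For illustration, here is a minimal sketch (not code from vectorbenchmarks; the species constants are the ones from jdk.incubator.vector, and the class name is made up) of the difference between hard-coding a 256-bit species, which lands on the Java fallback on NEON, and letting the API pick the widest shape the hardware supports:

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesChoice {
    // Hard-coding a 256-bit species: NEON tops out at 128 bits, so on ARM
    // operations on this species go through the pure-Java fallback path.
    static final VectorSpecies<Integer> FIXED_256 = IntVector.SPECIES_256;

    // SPECIES_PREFERRED resolves to the widest shape the hardware supports
    // (128 bits on NEON), so the operations stay on the intrinsified path.
    static final VectorSpecies<Integer> PREFERRED = IntVector.SPECIES_PREFERRED;

    public static void main(String[] args) {
        // Compile and run with --add-modules jdk.incubator.vector.
        System.out.println("preferred species length: " + PREFERRED.length());
    }
}
```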
Then, I looked closer at two of the simplest benchmarks: BitmapLogicals.intersectAutovectorised and BitmapLogicals.intersectPanamaInt. Here, even when modified to use an Int128Vector (and therefore NEON), the vector implementation is still 2x slower (on my machine, I get ~2000 ops/ms for intersectAutovectorised and ~1000 ops/ms for intersectPanamaInt). This slowdown is at least partially due to VectorIntrinsics.checkFromIndexSize: if I set -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0, the vector implementation jumps up to ~2300 ops/ms.
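For reference, the vectorized kernel is roughly of this shape (a sketch from memory rather than the exact vectorbenchmarks source, using SPECIES_128 so it maps onto NEON); each fromArray/intoArray call goes through a bounds check unless the property above disables it:

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class Intersect {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    // Bitwise AND of two bitmaps held as int[] (the "intersect" operation).
    static void intersect(int[] left, int[] right, int[] result) {
        int i = 0;
        int bound = SPECIES.loopBound(left.length);
        for (; i < bound; i += SPECIES.length()) {
            // Every fromArray/intoArray performs a bounds check
            // (VectorIntrinsics.checkFromIndexSize) unless
            // -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0 is set.
            IntVector l = IntVector.fromArray(SPECIES, left, i);
            IntVector r = IntVector.fromArray(SPECIES, right, i);
            l.and(r).intoArray(result, i);
        }
        // Scalar tail for the remaining elements.
        for (; i < left.length; i++) {
            result[i] = left[i] & right[i];
        }
    }
}
```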
Finally, even with -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0, I observed something weird: the throughput of BitmapLogicals.intersectPanamaInt oscillates between ~1500 ops/ms and ~2300 ops/ms, and the oscillation is very regular. See [1] for an example output. I haven't gotten to the bottom of why it happens. One clue is that it mostly affects the smaller arrays (2048 elements vs. 131072 elements): with the bigger array, the bottleneck is memory access (~30% L1d cache misses), whereas with the smaller array the data fits in the L1d cache (<1% L1d cache misses).
My next steps are to 1. figure out how we could improve bounds check elimination to speed up the `fromArray` and `intoArray` operations, and discuss it on the hotspot-compiler-dev mailing list, and 2. understand the root cause of the oscillation in performance.
Any pointers would be greatly appreciated!
Thank you,
Ludovic
[1]
```
# Warmup Iteration 1: 1132.934 ops/ms
# Warmup Iteration 2: 2227.649 ops/ms
# Warmup Iteration 3: 1532.657 ops/ms
# Warmup Iteration 4: 2285.691 ops/ms
# Warmup Iteration 5: 1472.801 ops/ms
# Warmup Iteration 6: 2260.508 ops/ms
# Warmup Iteration 7: 1512.909 ops/ms
# Warmup Iteration 8: 2312.553 ops/ms
# Warmup Iteration 9: 1508.813 ops/ms
# Warmup Iteration 10: 2303.308 ops/ms
# Warmup Iteration 11: 1534.152 ops/ms
# Warmup Iteration 12: 2260.814 ops/ms
# Warmup Iteration 13: 1490.340 ops/ms
# Warmup Iteration 14: 2287.080 ops/ms
# Warmup Iteration 15: 1512.709 ops/ms
# Warmup Iteration 16: 1469.387 ops/ms
# Warmup Iteration 17: 2305.459 ops/ms
# Warmup Iteration 18: 1495.418 ops/ms
# Warmup Iteration 19: 2282.759 ops/ms
# Warmup Iteration 20: 1486.290 ops/ms
Iteration 1: 2279.861 ops/ms
Iteration 2: 1469.178 ops/ms
Iteration 3: 2302.454 ops/ms
Iteration 4: 1507.177 ops/ms
Iteration 5: 2282.104 ops/ms
Iteration 6: 1532.449 ops/ms
Iteration 7: 2273.841 ops/ms
Iteration 8: 1471.526 ops/ms
Iteration 9: 2309.160 ops/ms
Iteration 10: 1495.957 ops/ms
Iteration 11: 2286.309 ops/ms
Iteration 12: 1481.660 ops/ms
Iteration 13: 2280.471 ops/ms
Iteration 14: 1511.154 ops/ms
Iteration 15: 2314.261 ops/ms
Iteration 16: 1472.466 ops/ms
Iteration 17: 2283.490 ops/ms
Iteration 18: 1482.873 ops/ms
Iteration 19: 2306.502 ops/ms
Iteration 20: 1505.751 ops/ms
```