Observations on vectorbenchmarks on ARM
Ludovic Henry
luhenry at microsoft.com
Wed Nov 18 20:27:36 UTC 2020
Hi Paul,
> We have not done much work trying to optimize the fallback cases e.g. composing using smaller vector sizes, or letting the auto-vectorizer have at it (harder for the compiler to see given the use of lambdas). Priority right now is to focus on getting the code gen right when the architecture supports the vector shapes.
Yes, that makes perfect sense.
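For reference, the scalar form the auto-vectorizer gets to see in intersectAutovectorised is roughly the following (a sketch of the general shape, not the benchmark's exact code): a simple element-wise AND over two int[] bitmaps, which HotSpot's auto-vectorizer handles well.

```java
public class IntersectScalarDemo {
    // Roughly the shape of the autovectorised bitmap intersection:
    // an element-wise AND over two int[] bitmaps. A plain counted loop
    // like this is a good candidate for HotSpot's auto-vectorizer.
    public static void intersect(int[] left, int[] right, int[] result) {
        for (int i = 0; i < result.length; i++) {
            result[i] = left[i] & right[i];
        }
    }

    public static void main(String[] args) {
        int[] l = {0b1100, 0b1010};
        int[] r = {0b1010, 0b0110};
        int[] out = new int[2];
        intersect(l, r, out);
        System.out.println(out[0]); // 8 (0b1000)
        System.out.println(out[1]); // 2 (0b0010)
    }
}
```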
> Regarding bounds checks, I am wondering if the Objects.checkIndex method [1] is fully intrinsic on ARM. Can you try on x86 and compare?
Great question. I'll check on that tomorrow morning (it's evening for me right now).
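For context, the semantics of Objects.checkIndex (which delegates to Preconditions.checkIndex) are straightforward; a minimal illustration of what the intrinsic has to implement:

```java
import java.util.Objects;

public class CheckIndexDemo {
    public static void main(String[] args) {
        // Objects.checkIndex(index, length) returns index when
        // 0 <= index < length, otherwise throws IndexOutOfBoundsException.
        int i = Objects.checkIndex(3, 8);
        System.out.println(i); // prints 3

        try {
            Objects.checkIndex(8, 8); // out of bounds: index == length
        } catch (IndexOutOfBoundsException e) {
            System.out.println("out of bounds");
        }
    }
}
```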
> (We also have some work in progress to improve bounds checks when the upper loop bound is calculated from VectorSpecies.loopBound.)
Where could I follow along with this work? And where could I find some documentation or discussions on the topic?
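For what it's worth, my understanding is that VectorSpecies.loopBound(length) rounds length down to a multiple of the species' lane count, so for a power-of-two lane count it amounts to a mask; a plain-Java sketch (the lane count of 4 stands in for IntVector.SPECIES_128 on NEON, i.e. 4 ints per 128-bit vector):

```java
public class LoopBoundDemo {
    // Sketch of VectorSpecies.loopBound(length) for a power-of-two
    // lane count: round length down to a multiple of laneCount.
    public static int loopBound(int length, int laneCount) {
        return length & ~(laneCount - 1);
    }

    public static void main(String[] args) {
        int laneCount = 4; // 4 ints per 128-bit NEON vector
        System.out.println(loopBound(2048, laneCount)); // 2048, already a multiple
        System.out.println(loopBound(2050, laneCount)); // 2048, 2-element tail left for a scalar loop
    }
}
```

The interesting part for bounds check elimination is that the compiler can then prove that every vector access between 0 and this bound stays within the array.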
> The oscillation might be due to alignment, perhaps if the vector loads/stores are misaligned the instruction cost is higher?
That is exactly right. After testing with `-XX:ObjectAlignmentInBytes=16`, the oscillation is gone and the performance is, as expected, stable at ~2300 ops/ms.
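For anyone wanting to reproduce, the run amounts to passing the flag through JMH (a hypothetical invocation; the jar path and benchmark selector are placeholders for whatever vectorbenchmarks produces locally):

```shell
# Placeholder jar path and benchmark name; -jvmArgs forwards the
# alignment and OOB-check flags to the forked benchmark JVM.
java -jar target/benchmarks.jar BitmapLogicals.intersectPanamaInt \
  -jvmArgs "-XX:ObjectAlignmentInBytes=16 -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0"
```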
Thank you,
Ludovic
________________________________________
From: Paul Sandoz <paul.sandoz at oracle.com>
Sent: Wednesday, November 18, 2020 18:51
To: Ludovic Henry
Cc: panama-dev at openjdk.java.net; openjdk-aarch64
Subject: Re: Observations on vectorbenchmarks on ARM
Hi Ludovic,
We have not done much work trying to optimize the fallback cases e.g. composing using smaller vector sizes, or letting the auto-vectorizer have at it (harder for the compiler to see given the use of lambdas). Priority right now is to focus on getting the code gen right when the architecture supports the vector shapes.
Regarding bounds checks, I am wondering if the Objects.checkIndex method [1] is fully intrinsic on ARM. Can you try on x86 and compare? (We also have some work in progress to improve bounds checks when the upper loop bound is calculated from VectorSpecies.loopBound.)
The oscillation might be due to alignment, perhaps if the vector loads/stores are misaligned the instruction cost is higher?
Paul.
[1] More specifically Preconditions.checkIndex
> On Nov 18, 2020, at 9:29 AM, Ludovic Henry <luhenry at microsoft.com> wrote:
>
> Hi,
>
> To continue learning about the state of the Vector API, I looked at vectorbenchmarks on ARM.
>
> First, I noticed that many benchmarks are significantly slower (often by two or more orders of magnitude). The reason is pretty simple: many of these benchmarks use vectors of 256 bits or more but, unfortunately, NEON only supports vectors up to 128 bits. The Vector API then falls back to the pure-Java implementation without any form of hardware acceleration. I haven't dug deeper into it, but maybe the Java code could be made more autovectorization-friendly to allow some of these cases to be accelerated nonetheless.
>
> Then, I looked closer at two of the simplest benchmarks: BitmapLogicals.intersectAutovectorised and BitmapLogicals.intersectPanamaInt. Here, even when modified to use an Int128Vector (thus using NEON), the vector implementation is still 2x slower (on my machine, I get ~2000 ops/ms for intersectAutovectorised and ~1000 ops/ms for intersectPanamaInt). This slowdown is (at least partially) due to VectorIntrinsics.checkFromIndexSize since, if I set -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0, the vector implementation jumps to ~2300 ops/ms.
>
> Finally, even with -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0, I observed something weird: the throughput of BitmapLogicals.intersectPanamaInt oscillates between ~1500 ops/ms and ~2300 ops/ms, and the oscillation is very regular. See [1] for an example output. I have not gotten to the bottom of why this oscillation happens. One clue is that it is particularly pronounced on smaller arrays (2048 elements vs. 131072 elements): with the bigger array the bottleneck is memory access (~30% L1d cache misses), whereas with the smaller array the data fits in the L1d cache (<1% L1d cache misses).
>
> My next steps are to 1. figure out how we could improve bounds check elimination to speed up the `fromArray` and `intoArray` operations, and discuss it on the hotspot-compiler-dev mailing list, and 2. understand the root cause of the performance oscillation.
>
> Any pointers would be greatly appreciated!
>
> Thank you,
> Ludovic
>
> [1]
> ```
> # Warmup Iteration 1: 1132.934 ops/ms
> # Warmup Iteration 2: 2227.649 ops/ms
> # Warmup Iteration 3: 1532.657 ops/ms
> # Warmup Iteration 4: 2285.691 ops/ms
> # Warmup Iteration 5: 1472.801 ops/ms
> # Warmup Iteration 6: 2260.508 ops/ms
> # Warmup Iteration 7: 1512.909 ops/ms
> # Warmup Iteration 8: 2312.553 ops/ms
> # Warmup Iteration 9: 1508.813 ops/ms
> # Warmup Iteration 10: 2303.308 ops/ms
> # Warmup Iteration 11: 1534.152 ops/ms
> # Warmup Iteration 12: 2260.814 ops/ms
> # Warmup Iteration 13: 1490.340 ops/ms
> # Warmup Iteration 14: 2287.080 ops/ms
> # Warmup Iteration 15: 1512.709 ops/ms
> # Warmup Iteration 16: 1469.387 ops/ms
> # Warmup Iteration 17: 2305.459 ops/ms
> # Warmup Iteration 18: 1495.418 ops/ms
> # Warmup Iteration 19: 2282.759 ops/ms
> # Warmup Iteration 20: 1486.290 ops/ms
> Iteration 1: 2279.861 ops/ms
> Iteration 2: 1469.178 ops/ms
> Iteration 3: 2302.454 ops/ms
> Iteration 4: 1507.177 ops/ms
> Iteration 5: 2282.104 ops/ms
> Iteration 6: 1532.449 ops/ms
> Iteration 7: 2273.841 ops/ms
> Iteration 8: 1471.526 ops/ms
> Iteration 9: 2309.160 ops/ms
> Iteration 10: 1495.957 ops/ms
> Iteration 11: 2286.309 ops/ms
> Iteration 12: 1481.660 ops/ms
> Iteration 13: 2280.471 ops/ms
> Iteration 14: 1511.154 ops/ms
> Iteration 15: 2314.261 ops/ms
> Iteration 16: 1472.466 ops/ms
> Iteration 17: 2283.490 ops/ms
> Iteration 18: 1482.873 ops/ms
> Iteration 19: 2306.502 ops/ms
> Iteration 20: 1505.751 ops/ms
> ```