Observations on vectorbenchmarks on ARM

Thu Nov 19 14:43:40 UTC 2020

Hi Paul,

> I had some spare time to look at the intersectPanamaInt benchmark, and there are issues on x86. Try this:
>
> @Benchmark
> public void intersectPanamaInt(Blackhole bh) {
>    for (int i = 0; i <= left.length - I256.length(); i += I256.length()) {
>        IntVector.fromArray(I256, left, i).and(IntVector.fromArray(I256, right, i)).intoArray(result, i);
>    }
>    bh.consume(result);
> }
>
> Or try unmodified with a clone and build of https://github.com/rwestrel/jdk/tree/range_checks_paul

I can confirm that both this snippet and the `range_checks_paul` branch fix the issue on ARM. With either one, I get ~2300 ops/ms. From what I understand of the patch, it has to do with eliminating casting operations, correct?

I’ve cherry-picked [1], [2] and [3] locally on top of `vectorIntrinsics` [4]. With them, I observe the expected ~2300 ops/ms. Would it be okay for me to cherry-pick these commits and submit a PR against the `vectorIntrinsics` branch?

> > Regarding bounds checks, I am wondering if the Objects.checkIndex method [1] is fully intrinsic on ARM. Can you try on x86 and compare?
>
> Great question. I'll check on that tomorrow morning (it's evening for me right now).

Given that cherry-picking [1] accelerates it on ARM, I take it that the Objects.checkIndex you mentioned is indeed intrinsified on ARM as well, correct?
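
To make the bounds check under discussion concrete, here is a minimal sketch of the kind of check a vector load at a given offset needs, written with java.util.Objects.checkIndex. The class and the helper checkVectorAccess are hypothetical names for illustration, and this is not meant to reproduce what the Vector API does internally:

```java
import java.util.Objects;

public class CheckIndexSketch {
    // Hypothetical helper: verifies that the lanes [offset, offset + lanes)
    // all fit inside an array of the given length. Returns offset on success,
    // throws IndexOutOfBoundsException otherwise.
    public static int checkVectorAccess(int offset, int arrayLength, int lanes) {
        // The last valid starting offset is arrayLength - lanes, so the
        // exclusive upper limit for the offset is arrayLength - (lanes - 1).
        return Objects.checkIndex(offset, arrayLength - (lanes - 1));
    }

    public static void main(String[] args) {
        int[] left = new int[20];
        int lanes = 8; // e.g. a 256-bit vector of ints

        // Offset 12 covers elements 12..19, which is still in range.
        System.out.println("accepted offset: " + checkVectorAccess(12, left.length, lanes));

        // Offset 13 would touch element 20, which is out of range.
        try {
            checkVectorAccess(13, left.length, lanes);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected offset 13");
        }
    }
}
```

When a check of this shape is intrinsified, the JIT can reason about it like a regular array range check and hoist or eliminate it in a counted loop, which is why I'm asking about it.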

> Not much documentation, here are some links to Roland’s recent and ongoing work:
>
> 8255150: Add utility methods to check long indexes and ranges #1003
> https://github.com/openjdk/jdk/pull/1003
>
> Experimental work to elide bounds checks when using VectorSpecies.loopBound
> https://github.com/rwestrel/jdk/tree/range_checks_paul

Thank you for sharing that!
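
For anyone following along, here is a small sketch of how an explicit bound like the one in the snippet above relates to a loopBound-style bound. It uses plain int arithmetic rather than the incubator module. The loopBound method below is a hypothetical stand-in that assumes loopBound(length) returns the largest multiple of the vector length that is less than or equal to a non-negative length:

```java
public class LoopBoundSketch {
    // Hypothetical stand-in for a loopBound-style computation: the largest
    // multiple of lanes that is <= length (length assumed non-negative).
    public static int loopBound(int length, int lanes) {
        return length - (length % lanes);
    }

    // Counts iterations of a loop with the explicit bound i <= length - lanes.
    public static int iterationsExplicit(int length, int lanes) {
        int n = 0;
        for (int i = 0; i <= length - lanes; i += lanes) {
            n++;
        }
        return n;
    }

    // Counts iterations of a loop with the bound i < loopBound(length).
    public static int iterationsLoopBound(int length, int lanes) {
        int n = 0;
        int bound = loopBound(length, lanes);
        for (int i = 0; i < bound; i += lanes) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        int lanes = 8;
        // Both loop shapes should visit the same number of full vectors.
        for (int length = 0; length <= 40; length++) {
            if (iterationsExplicit(length, lanes) != iterationsLoopBound(length, lanes)) {
                throw new IllegalStateException("mismatch at length " + length);
            }
        }
        System.out.println("both loop shapes agree for lengths 0..40");
    }
}
```

Both forms cover the same full vectors and leave the same tail, so the difference is only in how easily the compiler can prove that the accesses inside the loop are in range.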

I’ll keep digging into vectorbenchmarks and see if I’m running into any other performance issues specific to ARM.

[1] https://github.com/openjdk/jdk/commit/a7422ac2f4a7dab2951bd42098325384b07b6d29
[2] https://github.com/rwestrel/jdk/commit/65807b9161948b4312c15095cb29b981d9af90e6
[3] https://github.com/rwestrel/jdk/commit/24b110ae78898741d0ee0f7d8183199231451b59
[4] https://github.com/openjdk/panama-vector/commits/vectorIntrinsics

From: Paul Sandoz <paul.sandoz at oracle.com>
Sent: Thursday, 19 November 2020 03:29
To: Ludovic Henry <luhenry at microsoft.com>
Cc: panama-dev at openjdk.java.net; openjdk-aarch64 <openjdk-aarch64 at microsoft.com>
Subject: Re: Observations on vectorbenchmarks on ARM

I had some spare time to look at the intersectPanamaInt benchmark, and there are issues on x86. Try this:

@Benchmark
public void intersectPanamaInt(Blackhole bh) {
    for (int i = 0; i <= left.length - I256.length(); i += I256.length()) {
        IntVector.fromArray(I256, left, i).and(IntVector.fromArray(I256, right, i)).intoArray(result, i);
    }
    bh.consume(result);
}

Or try unmodified with a clone and build of https://github.com/rwestrel/jdk/tree/range_checks_paul

Paul.

On Nov 18, 2020, at 1:26 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:

On Nov 18, 2020, at 12:27 PM, Ludovic Henry <luhenry at microsoft.com> wrote:

Hi Paul,

We have not done much work trying to optimize the fallback cases e.g. composing using smaller vector sizes, or letting the auto-vectorizer have at it (harder for the compiler to see given the use of lambdas). Priority right now is to focus on getting the code gen right when the architecture supports the vector shapes.

Yes, that makes perfect sense.

Regarding bounds checks, I am wondering if the Objects.checkIndex method [1] is fully intrinsic on ARM. Can you try on x86 and compare?

Great question. I'll check on that tomorrow morning (it's evening for me right now).

Ok, thanks.

(We also have some work in progress to improve bounds checks when the upper loop bound is calculated from VectorSpecies.loopBound.)

Where could I follow along this work? Or even where could I find some documentation / discussions on the topic?

Not much documentation, here are some links to Roland’s recent and ongoing work:

8255150: Add utility methods to check long indexes and ranges #1003
https://github.com/openjdk/jdk/pull/1003

Experimental work to elide bounds checks when using VectorSpecies.loopBound
https://github.com/rwestrel/jdk/tree/range_checks_paul

Use git-blame (dreadful name!) to look at the changes for intrinsification of Preconditions.checkIndex.

The oscillation might be due to alignment, perhaps if the vector loads/stores are misaligned the instruction cost is higher?

That is exactly right. After testing with `-XX:ObjectAlignmentInBytes=16`, the oscillation is gone and the performance is, as expected, stable at ~2300 ops/ms.
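
A rough sketch of why the object alignment setting could matter here. This is only arithmetic, under the assumptions that the int[] data starts 16 bytes after the start of the array object (the header size is an assumption, not something measured in this thread) and that objects start at multiples of the object alignment. The class and method names are hypothetical:

```java
public class AlignmentSketch {
    // Returns true if, for every possible object start address, the first
    // array element lands on a multiple of vectorBytes.
    // headerBytes is the assumed distance from object start to element 0.
    public static boolean alwaysAligned(int objectAlignment, int headerBytes, int vectorBytes) {
        // Object starts are multiples of objectAlignment. Scanning one full
        // period of objectAlignment * vectorBytes covers every residue.
        long period = (long) objectAlignment * vectorBytes;
        for (long start = 0; start < period; start += objectAlignment) {
            if ((start + headerBytes) % vectorBytes != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int headerBytes = 16;  // assumption: array header size
        int vectorBytes = 16;  // a 128-bit vector access

        // With 8-byte object alignment, some arrays start at an address that
        // leaves the data 8 bytes off a 16-byte boundary.
        System.out.println("8-byte object alignment:  " + alwaysAligned(8, headerBytes, vectorBytes));

        // With 16-byte object alignment, the data always starts on a
        // 16-byte boundary under the same assumptions.
        System.out.println("16-byte object alignment: " + alwaysAligned(16, headerBytes, vectorBytes));
    }
}
```

Since the loop advances by one full vector per iteration, an aligned first element keeps every subsequent access aligned as well, which would be consistent with the run-to-run oscillation depending on where the arrays happen to be allocated.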

When we support access to MemorySegments, it will be possible to allocate (native, or off-heap) segments with specific alignment characteristics.

Paul.