Observations on vectorbenchmarks on ARM

Thu Nov 19 15:52:48 UTC 2020

> On Nov 19, 2020, at 6:43 AM, Ludovic Henry <luhenry at microsoft.com> wrote:
> 
> Hi Paul,
>  
> > I had some spare time to look at the intersectPanamaInt benchmark, and there are issues on x86. Try this:
> > 
> > @Benchmark
> > public void intersectPanamaInt(Blackhole bh) {
> >    for (int i = 0; i <= left.length - I256.length(); i += I256.length()) {
> >        IntVector.fromArray(I256, left, i).and(IntVector.fromArray(I256, right, i)).intoArray(result, i);
> >    }
> >    bh.consume(result);
> > }
> > 
> > Or try unmodified with a clone and build of https://github.com/rwestrel/jdk/tree/range_checks_paul
>  
> I can confirm that this snippet and the `range_checks_paul` branch do fix the issue on ARM. With these, I get to ~2300 ops/ms. From what I understand of the patch it has to do with eliminating casting operations, correct?
>  

Thanks. I do not fully understand this area… The checkIndex intrinsic adds extra cast nodes, and those get in the way of pattern matching for range checks.
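
As a rough illustration of the kind of bounds check involved, here is a scalar sketch in plain Java. It is not the Vector API implementation or anything from the patch: the class name, the `VLEN` constant (standing in for `I256.length()` with int lanes) and the `andBlock` helper are hypothetical. It only assumes that one `Objects.checkIndex` call can cover a whole vector-wide access, and that the loop bound `left.length - VLEN` is what would let a compiler prove those checks redundant.

```java
import java.util.Objects;

public class RangeCheckSketch {
    // Hypothetical lane count: 256 bits / 32-bit int lanes.
    static final int VLEN = 8;

    // Scalar stand-in for a vector load/and/store at offset i.
    // One checkIndex call per array validates the whole VLEN-wide access:
    // i is valid only if i + VLEN - 1 is still inside the array.
    static void andBlock(int[] left, int[] right, int[] result, int i) {
        Objects.checkIndex(i, left.length - (VLEN - 1));
        Objects.checkIndex(i, right.length - (VLEN - 1));
        Objects.checkIndex(i, result.length - (VLEN - 1));
        for (int lane = 0; lane < VLEN; lane++) {
            result[i + lane] = left[i + lane] & right[i + lane];
        }
    }

    // Same loop shape as the benchmark: i never exceeds left.length - VLEN,
    // so every checkIndex on left above is implied by the loop condition.
    // Elements past the last full block are left untouched (no scalar tail).
    static int[] intersect(int[] left, int[] right) {
        int[] result = new int[left.length];
        for (int i = 0; i <= left.length - VLEN; i += VLEN) {
            andBlock(left, right, result, i);
        }
        return result;
    }
}
```

With nine-element inputs, `intersect` processes one full block of eight and skips the last element, while calling `andBlock` directly with `i = 2` on those arrays throws `IndexOutOfBoundsException`.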

> I’ve cherry-picked locally [1], [2] and [3] on top of `vectorIntrinsics` [4]. With it, I observe the expected ~2300 ops/ms. My question is then whether I can cherry-pick and submit a PR against this `vectorIntrinsics` branch? 
>  

I would prefer we let such fixes roll into jdk then merge from that, otherwise it makes it harder to track what is going on in vectorIntrinsics and manage merge conflicts.

> > > Regarding bounds checks, I am wondering if the Objects.checkIndex method [1] is fully intrinsic on ARM. Can you try on x86 and compare?
> > 
> > Great question. I'll check on that tomorrow morning (it's evening for me right now).
>  
> Given that cherry-picking [1] speeds it up on ARM, I take it that the Objects.checkIndex you mentioned is indeed intrinsified on ARM as well, correct?
>  

Yes, I think so.

Paul.

> > Not much documentation, here are some links to Roland’s recent and ongoing work:
> > 
> > 8255150: Add utility methods to check long indexes and ranges #1003
> > https://github.com/openjdk/jdk/pull/1003
> > 
> > Experimental work to elide bounds checks when using VectorSpecies.loopBound
> > https://github.com/rwestrel/jdk/tree/range_checks_paul
>  
> Thank you for sharing that!
>  
> I’ll keep digging into vectorbenchmarks and see if I’m running into any other performance issues specific to ARM.
>  
> [1] https://github.com/openjdk/jdk/commit/a7422ac2f4a7dab2951bd42098325384b07b6d29
> [2] https://github.com/rwestrel/jdk/commit/65807b9161948b4312c15095cb29b981d9af90e6
> [3] https://github.com/rwestrel/jdk/commit/24b110ae78898741d0ee0f7d8183199231451b59
> [4] https://github.com/openjdk/panama-vector/commits/vectorIntrinsics
>  
> From: Paul Sandoz <paul.sandoz at oracle.com>
> Sent: Thursday, 19 November 2020 03:29
> To: Ludovic Henry <luhenry at microsoft.com>
> Cc: panama-dev at openjdk.java.net; openjdk-aarch64 <openjdk-aarch64 at microsoft.com>
> Subject: Re: Observations on vectorbenchmarks on ARM
>  
> I had some spare time to look at the intersectPanamaInt benchmark, and there are issues on x86. Try this:
>  
> @Benchmark
> public void intersectPanamaInt(Blackhole bh) {
>     for (int i = 0; i <= left.length - I256.length(); i += I256.length()) {
>         IntVector.fromArray(I256, left, i).and(IntVector.fromArray(I256, right, i)).intoArray(result, i);
>     }
>     bh.consume(result);
> }
>  
> Or try unmodified with a clone and build of https://github.com/rwestrel/jdk/tree/range_checks_paul
>  
> Paul.
>  
> 
> 
> On Nov 18, 2020, at 1:26 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>  
> 
> 
> 
> On Nov 18, 2020, at 12:27 PM, Ludovic Henry <luhenry at microsoft.com> wrote:
> 
> Hi Paul,
> 
> 
> We have not done much work trying to optimize the fallback cases e.g. composing using smaller vector sizes, or letting the auto-vectorizer have at it (harder for the compiler to see given the use of lambdas). Priority right now is to focus on getting the code gen right when the architecture supports the vector shapes.
> 
> Yes, that makes perfect sense.
> 
> 
> Regarding bounds checks, I am wondering if the Objects.checkIndex method [1] is fully intrinsic on ARM. Can you try on x86 and compare?
> 
> Great question. I'll check on that tomorrow morning (it's evening for me right now).
> 
> Ok, thanks.
> 
> 
> 
> 
> 
> (We also have some work in progress to improve bounds checks when the upper loop bound is calculated from VectorSpecies.loopBound.)
> 
> Where could I follow along this work? Or even where could I find some documentation / discussions on the topic?
> 
> 
> Not much documentation, here are some links to Roland’s recent and ongoing work:
> 
> 8255150: Add utility methods to check long indexes and ranges #1003
> https://github.com/openjdk/jdk/pull/1003
> 
> Experimental work to elide bounds checks when using VectorSpecies.loopBound
> https://github.com/rwestrel/jdk/tree/range_checks_paul
> 
> Use git-blame (dreadful name!) to look at the changes for intrinsification of Preconditions.checkIndex.
> 
> 
> 
> The oscillation might be due to alignment, perhaps if the vector loads/stores are misaligned the instruction cost is higher?
> 
> That is exactly right. After testing with `-XX:ObjectAlignmentInBytes=16`, the oscillation is gone and the performance is, as expected, stable at ~2300 ops/ms.
> 
> 
> When we support access to MemorySegments, it will be possible to allocate (native, or off-heap) segments with specific alignment characteristics.
> 
> Paul.