[External] : Re: Issues with loop unrolling: better pinned node

Rado Smogura mail at smogura.eu
Wed Aug 18 07:17:54 UTC 2021


Hi Vladimir,


Thank you for the answer.


In fact, it was an attempt to confirm that memory flow can be a reason 
why loop opts do not work. That's a very fair point. I'll think about it 
and maybe I'll be able to come up with an idea for how this can be 
generalized.


Kind regards,

Rado
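
The question asked at the start of this thread (can a Load of ByteBuffer index / hb that takes its memory from a phi be attached to the initial memory?) can be illustrated with a toy model. The sketch below is neither the code from the linked commit nor C2's implementation. The classes Mem, Store and Phi, the method uniqueMemoryFor, and the alias labels "ByteBuffer.hb" and "bytes" are all invented for illustration. It assumes every store is tagged with an alias class and that a store to a different alias class cannot change what a load observes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy memory graph; every name here is hypothetical and not part of HotSpot. */
final class UniqueMemorySketch {

    /** A memory state (used directly for the initial memory). */
    static class Mem {
        final String name;
        Mem(String name) { this.name = name; }
    }

    /** A store to one alias class, layered on a previous memory state. */
    static final class Store extends Mem {
        final String alias;
        final Mem prev;
        Store(String name, String alias, Mem prev) {
            super(name);
            this.alias = alias;
            this.prev = prev;
        }
    }

    /** A merge of memory states, for example at a loop head (entry plus back edge). */
    static final class Phi extends Mem {
        final List<Mem> inputs = new ArrayList<>();
        Phi(String name) { super(name); }
    }

    /** Marker: this path only leads back to a phi that is already being resolved. */
    private static final Mem BACK_EDGE = new Mem("<back-edge>");

    /**
     * Returns the memory state a load from the given alias class depends on when its
     * memory input is mem. A phi is looked through only if all of its inputs resolve
     * to the same state for this alias class; otherwise the phi itself is returned.
     */
    static Mem uniqueMemoryFor(String alias, Mem mem) {
        Mem result = walk(alias, mem, new HashSet<>());
        return result == BACK_EDGE ? mem : result; // degenerate graph: stay conservative
    }

    private static Mem walk(String alias, Mem mem, Set<Phi> inProgress) {
        // Stores to other alias classes cannot affect the load: skip over them.
        while (mem instanceof Store && !((Store) mem).alias.equals(alias)) {
            mem = ((Store) mem).prev;
        }
        if (!(mem instanceof Phi)) {
            return mem; // initial memory, or a store to the same alias class
        }
        Phi phi = (Phi) mem;
        if (!inProgress.add(phi)) {
            return BACK_EDGE; // reached a phi we are already resolving
        }
        Mem unique = BACK_EDGE;
        for (Mem input : phi.inputs) {
            Mem seen = walk(alias, input, inProgress);
            if (seen == BACK_EDGE) {
                continue; // the back edge adds no new state for this alias class
            }
            if (unique != BACK_EDGE && unique != seen) {
                unique = phi; // two different states merge here: the phi is needed
                break;
            }
            unique = seen;
        }
        inProgress.remove(phi);
        return unique;
    }

    public static void main(String[] args) {
        Mem initial = new Mem("initial");
        Phi loopPhi = new Phi("loopPhi");
        // The loop body only stores into buffer contents; the back edge carries that store.
        Store bodyStore = new Store("bodyStore", "bytes", loopPhi);
        loopPhi.inputs.add(initial);
        loopPhi.inputs.add(bodyStore);

        System.out.println(uniqueMemoryFor("ByteBuffer.hb", loopPhi).name);
        System.out.println(uniqueMemoryFor("bytes", loopPhi).name);
    }
}
```

In this toy graph the load of "ByteBuffer.hb" resolves to the initial memory, because the only store in the loop touches a different alias class, while a load of "bytes" has to stay on the loop phi. Whether the same reasoning is safe in the real compiler depends on its alias analysis, which this sketch does not model.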

On 16.08.2021 15:41, Vladimir Ivanov wrote:
>> I wonder what you think about something like this [1] - it's 
>> virtually a small, single-class change
>
> Very interesting experiment, Rado! It's encouraging to hear that loop 
> opts immediately benefit from it.
>
> From an architectural perspective, a separate pass to optimize memory 
> graph brings excessive complexity:
>
>   (1) yet another pass over the graph and susceptible to pass ordering 
> issues;
>
>   (2) separate from GVN: you either have to duplicate GVN-based memory 
> optimizations or run the new pass with IGVN in a loop until it stabilizes.
>
> IMO the problem you noticed illustrates a general weakness in the GVN 
> implementation and that's the place where it should be fixed (ideally).
>
> Best regards,
> Vladimir Ivanov
>
>>
>> This change tries to find a unique memory for a load node. I implemented 
>> it as a separate phase, as the optimization may not run in the Ideal 
>> method. I think it's lighter than a phi split-out.
>>
>> Loops have been transformed. RCE started.
>>
>> Kind regards,
>> Rado
>>
>> [1] - 
>> https://github.com/rsmogura/panama-vector/commit/a44f515890d2c4df3fd0e0ced76545a7664926c3 
>>
>> [2] - 
>> https://github.com/rsmogura/panama-vector/tree/housekeeping-load-memory-optimiziation 
>> (full test case)
>>
>> ------------------------------------------------------------------------
>> *From:* Radosław Smogura on behalf of Radosław Smogura <mail at smogura.eu>
>> *Sent:* Friday, August 6, 2021 22:43
>> *To:* Radosław Smogura <mail at smogura.eu>; Paul Sandoz 
>> <paul.sandoz at oracle.com>; Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> *Cc:* panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>> *Subject:* Re: Issues with loop unrolling: better pinned node
>> Hi all,
>>
>> Now that I have checked it again, it works as expected, and it's the same 
>> code.
>>
>> In the draft code I check whether the buffer is direct by using type 
>> checking to unswitch the loop, as unswitching over ByteBuffer.hb did not 
>> work (the graph was quite similar). However, I thought that this unswitch 
>> actually helped to build correct loops, and that any kind of improvement 
>> around it would be rather for the purpose of better-looking code.
>>
>> But it looks like sometimes (but only sometimes) the loop still cannot 
>> be correctly built, or maybe the full optimization kicks in very, 
>> very late.
>>
>> Kind regards,
>> Rado
>> ------------------------------------------------------------------------
>> *From:* panama-dev <panama-dev-retn at openjdk.java.net> on behalf of 
>> Radosław Smogura <mail at smogura.eu>
>> *Sent:* Friday, August 6, 2021 20:22
>> *To:* Paul Sandoz <paul.sandoz at oracle.com>
>> *Cc:* panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>> *Subject:* Re: Issues with loop unrolling: better pinned node
>> Yes,
>>
>> The normal case looks good. It's all about polluted cases [1].
>>
>> BR,
>> Rado
>>
>> [1] https://github.com/openjdk/panama-vector/pull/109 
>>
>> (Draft) Performance improvements for polluted cases by rsmogura · 
>> Pull Request #109 · openjdk/panama-vector
>>
>> Hi all, I would like to submit this piece of work, for byte buffers 
>> and polluted cases. It resolves some performance issues related to 
>> mem barriers when in scope are both on- and off-heap buffer. T...
>>
>> Comparing openjdk:vectorIntrinsics...rsmogura:vectors-polluted-cases · 
>> openjdk/panama-vector
>> https://github.com/openjdk/panama-vector/compare/vectorIntrinsics...rsmogura:vectors-polluted-cases?expand=1
>>
>> ________________________________
>> From: Paul Sandoz <paul.sandoz at oracle.com>
>> Sent: Friday, August 6, 2021 20:04
>> To: Radosław Smogura <mail at smogura.eu>
>> Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>> Subject: Re: Issues with loop unrolling: better pinned node
>>
>> I am confused as to the case under test. In your initial email of 
>> this thread were you also referring implicitly to polluted cases?
>>
>> Paul.
>>
>>> On Aug 6, 2021, at 10:56 AM, Radosław Smogura <mail at smogura.eu> wrote:
>>>
>>> Hi Paul,
>>>
>>> There's a performance improvement, but I still can't unroll 
>>> polluted cases (I cherry-picked loop unrolling). The graph still has a 
>>> few nodes taking the buffer limit from a phi, and in the IR I don't see 
>>> vector nodes cascading.
>>>
>>> make test TEST='micro:ByteBufferVectorAccess.p' MICRO="OPTIONS=-f 1 -prof perfasm -jvmArgsPrepend=-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0" JOBS=12
>>> Benchmark                                     (size)  Mode  Cnt    Score   Error  Units
>>> ByteBufferVectorAccess.pollutedBuffers2         1024  avgt   30   40.472 ± 1.055  ns/op
>>> ByteBufferVectorAccess.pollutedBuffers2:·asm    1024  avgt            NaN            ---
>>> ByteBufferVectorAccess.pollutedBuffers3         1024  avgt   30   79.251 ± 0.786  ns/op
>>> ByteBufferVectorAccess.pollutedBuffers3:·asm    1024  avgt            NaN            ---
>>> ByteBufferVectorAccess.pollutedBuffers4         1024  avgt   30   83.627 ± 2.140  ns/op
>>> ByteBufferVectorAccess.pollutedBuffers4:·asm    1024  avgt            NaN            ---
>>> ByteBufferVectorAccess.pollutedBuffers5         1024  avgt   30   85.561 ± 1.156  ns/op
>>> ByteBufferVectorAccess.pollutedBuffers5:·asm    1024  avgt            NaN
>>>
>>> make test TEST='micro:ByteBufferVectorAccess.p' MICRO="OPTIONS=-f 1 -prof perfasm"
>>> Benchmark                                     (size)  Mode  Cnt    Score   Error  Units
>>> ByteBufferVectorAccess.pollutedBuffers2         1024  avgt   10   49.326 ± 0.843  ns/op
>>> ByteBufferVectorAccess.pollutedBuffers2:·asm    1024  avgt            NaN            ---
>>> ByteBufferVectorAccess.pollutedBuffers3         1024  avgt   10  100.291 ± 1.271  ns/op
>>> ByteBufferVectorAccess.pollutedBuffers3:·asm    1024  avgt            NaN            ---
>>> ByteBufferVectorAccess.pollutedBuffers4         1024  avgt   10  101.494 ± 1.027  ns/op
>>> ByteBufferVectorAccess.pollutedBuffers4:·asm    1024  avgt            NaN            ---
>>> ByteBufferVectorAccess.pollutedBuffers5         1024  avgt   10   94.606 ± 1.522  ns/op
>>> ByteBufferVectorAccess.pollutedBuffers5:·asm    1024  avgt            NaN
>>>
>>>
>>> BR,
>>> Rado
>>> From: Paul Sandoz <paul.sandoz at oracle.com>
>>> Sent: Friday, August 6, 2021 18:04
>>> To: Radosław Smogura <mail at smogura.eu>
>>> Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>> Subject: Re: Issues with loop unrolling: better pinned node
>>>
>>> Hi Rado,
>>>
>>> It’s good you are looking at the IR.
>>>
>>> Out of curiosity, what happens if you turn off bounds checking [*]?
>>>
>>> Paul.
>>>
>>> [*]
>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>>>
>>> > On Aug 6, 2021, at 8:39 AM, Radosław Smogura <mail at smogura.eu> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > I've found that even if we get rid of the barriers, the loop can't get 
>>> > unrolled, and unneeded code remains inside it.
>>> >
>>> > I've found this graph, and I wonder whether it's optimal. In 
>>> > particular, the Load of ByteBuffer index / hb is from a phi; could it 
>>> > be attached to the initial memory?
>>> >
>>> > Here's a picture (bb_issues.png):
>>> > https://drive.google.com/file/d/1G7ZN0xHOVIVHmZ_5TTIUdm3F30okAzvO/view?usp=sharing
>>> >
>>> >
>>> > And sample code
>>> >
>>> > protected void copyMemory(ByteBuffer in, ByteBuffer out) {
>>> >   // Largest multiple of the vector length that is <= in.limit()
>>> >   var limit = SPECIES.loopBound(in.limit());
>>> >   for (int i = 0; i < limit; i += SPECIES.vectorByteSize()) {
>>> >     // Load one vector from 'in' at byte offset i, then store it to 'out'
>>> >     final var v = ByteVector.fromByteBuffer(SPECIES, in, i,
>>> >         ByteOrder.nativeOrder());
>>> >     v.intoByteBuffer(out, i, ByteOrder.nativeOrder());
>>> >   }
>>> > }
>>> >
>>> > Kind regards,
>>> > Rado
>>
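
The loop unswitching mentioned in this thread (testing once whether a buffer is direct, instead of on every iteration) can be shown at source level with plain ByteBuffer calls. This is only a hand-written analogue of what the JIT's unswitching does, not the draft change itself: the class UnswitchSketch and both method names are hypothetical, hasArray() stands in for the type check described in the thread, and a byte-at-a-time copy stands in for the vector loads and stores.

```java
import java.nio.ByteBuffer;

/** Hypothetical illustration of loop unswitching on the buffer kind. */
final class UnswitchSketch {

    /** The loop-invariant test sits inside the loop and is evaluated on every iteration. */
    static void copySwitched(ByteBuffer in, ByteBuffer out) {
        for (int i = 0; i < in.limit(); i++) {
            if (in.hasArray()) {
                out.put(i, in.array()[in.arrayOffset() + i]); // heap path
            } else {
                out.put(i, in.get(i));                        // direct (or read-only) path
            }
        }
    }

    /** Same work with the test hoisted out: one loop per buffer kind, no test in either body. */
    static void copyUnswitched(ByteBuffer in, ByteBuffer out) {
        int limit = in.limit();
        if (in.hasArray()) {
            byte[] src = in.array();
            int base = in.arrayOffset();
            for (int i = 0; i < limit; i++) {
                out.put(i, src[base + i]);
            }
        } else {
            for (int i = 0; i < limit; i++) {
                out.put(i, in.get(i));
            }
        }
    }
}
```

Both methods copy the same bytes. The second form gives each loop a single access path, which is the shape the thread describes as helping loop construction.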

