[External] : Re: Issues with loop unrolling: better pinned node

Thu Aug 19 20:26:21 UTC 2021

I think I answered this question quite simply... it will not work.

On 19.08.2021 18:39, Rado Smogura wrote:
> Hi all,
>
>
> I hope you have a good day.
>
>
> Since optimizing these loops would still be a good approach, I thought
> about handling mixed access as follows:
>
>
> 1. When mixed access is detected, set a "raw / byte array" mixed-access
> flag.
>
> 2. Bail out and restart compilation (this will happen during the first
> phases, and only for a few methods).
>
> 3. Pass the flag to the compiler.
>
> 4. Modify find_alias_type / flatten_alias_type so that when a byte array
> is queried for its alias, the raw pointer and raw alias are used.
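
For readers unfamiliar with the term, a "mixed" (polluted) access is a single loop whose buffer arguments are sometimes heap-backed (byte[] storage) and sometimes direct (off-heap memory). The sketch below only illustrates that situation in plain Java. The class and method names are hypothetical, and it does not model the compiler internals (find_alias_type / flatten_alias_type) that the steps above refer to.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of a "mixed access" call site; not code from the thread.
final class MixedAccessDemo {

    // One loop body shared by heap and direct buffers. When both kinds reach
    // this method, the same accesses may target a byte[] or raw off-heap memory.
    static void copy(ByteBuffer in, ByteBuffer out) {
        int limit = Math.min(in.limit(), out.limit());
        for (int i = 0; i < limit; i++) {
            out.put(i, in.get(i)); // absolute get/put; buffer positions are untouched
        }
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.wrap(new byte[] {1, 2, 3, 4}); // backed by a byte[]
        ByteBuffer direct = ByteBuffer.allocateDirect(4);           // off-heap memory
        copy(heap, direct);                                         // heap -> direct
        ByteBuffer back = ByteBuffer.allocate(4);
        copy(direct, back);                                         // direct -> heap
        System.out.println(back.get(3));
    }
}
```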
>
>
> Kind regards,
>
> Rado
>
> On 18.08.2021 09:17, Rado Smogura wrote:
>> Hi Vladimir,
>>
>>
>> Thank you for answer.
>>
>>
>> In fact, it was an attempt to confirm that memory flow can be the
>> reason why loop opts do not work. That's a very fair point. I'll think
>> about it, and maybe I'll be able to come up with an idea for how this
>> can be generalized.
>>
>>
>> Kind regards,
>>
>> Rado
>>
>> On 16.08.2021 15:41, Vladimir Ivanov wrote:
>>>> I wonder what you think about something like this [1] - it's
>>>> virtually a small, single-class change
>>>
>>> Very interesting experiment, Rado! It's encouraging to hear that 
>>> loop opts immediately benefit from it.
>>>
>>> From an architectural perspective, a separate pass to optimize the
>>> memory graph brings excessive complexity:
>>>
>>>   (1) yet another pass over the graph and susceptible to pass 
>>> ordering issues;
>>>
>>>   (2) separate from GVN: you either have to duplicate GVN-based 
>>> memory optimizations or run new pass with IGVN in a loop until it 
>>> stabilizes.
>>>
>>> IMO the problem you noticed illustrates a general weakness in GVN 
>>> implementation and that's the place where it should be fixed (ideally).
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>>
>>>> This change tries to find a unique memory for the load node. I
>>>> implemented it as a separate phase, as the optimization may not run
>>>> in the Ideal method. I think it's lighter than phi split out.
>>>>
>>>> Loops have been transformed. RCE started.
>>>>
>>>> Kind regards,
>>>> Rado
>>>>
>>>> [1] - 
>>>> https://github.com/rsmogura/panama-vector/commit/a44f515890d2c4df3fd0e0ced76545a7664926c3 
>>>>
>>>> [2] - 
>>>> https://github.com/rsmogura/panama-vector/tree/housekeeping-load-memory-optimiziation 
>>>> (full test case)
>>>>
>>>> ------------------------------------------------------------------------ 
>>>>
>>>> *From:* Radosław Smogura on behalf of Radosław Smogura 
>>>> <mail at smogura.eu>
>>>> *Sent:* Friday, August 6, 2021 22:43
>>>> *To:* Radosław Smogura <mail at smogura.eu>; Paul Sandoz 
>>>> <paul.sandoz at oracle.com>; Vladimir Ivanov 
>>>> <vladimir.x.ivanov at oracle.com>
>>>> *Cc:* panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>> *Subject:* Re: Issues with loop unrolling: better pinned node
>>>> Hi all,
>>>>
>>>> Now that I have checked it again, it works as expected, and it's the
>>>> same code.
>>>>
>>>> In draft code I check if the buffer is direct by using type 
>>>> checking to unswitch loop, as unswitching over ByteBuffer.hb did 
>>>> not work (the graph was quite similar). However, I thought that 
>>>> this unswitch actually helped to build correct loops, and any kind 
>>>> of improvement around it would be rather for the purpose of 
>>>> better-looking code.
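
The manual unswitching described above can be sketched in plain Java as follows. This is a minimal illustration under assumptions: the class and method names are hypothetical, byte-wise get/put stands in for the vector loads and stores, and `isDirect()` is used here where the draft relies on a type check. The actual draft is in the linked pull request.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of hand-written loop unswitching; not the draft's code.
final class UnswitchedCopy {

    static void copy(ByteBuffer in, ByteBuffer out) {
        int limit = Math.min(in.limit(), out.limit());
        // The loop-invariant test is hoisted out of the loop by hand, so each
        // loop body deals with one kind of input buffer only.
        if (in.isDirect()) {
            for (int i = 0; i < limit; i++) {
                out.put(i, in.get(i)); // input is off-heap memory
            }
        } else {
            for (int i = 0; i < limit; i++) {
                out.put(i, in.get(i)); // input is heap-backed
            }
        }
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(3);
        direct.put(0, (byte) 7).put(1, (byte) 8).put(2, (byte) 9);
        ByteBuffer heap = ByteBuffer.allocate(3);
        copy(direct, heap); // takes the direct branch
        System.out.println(heap.get(2));
    }
}
```

Both branches are textually identical on purpose: the point is only that the buffer-kind check is evaluated once, outside the loop, rather than on every iteration.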
>>>>
>>>> But it looks like sometimes (but only sometimes) the loop still
>>>> cannot be built correctly, or maybe the full optimization kicks in
>>>> very, very late.
>>>>
>>>> Kind regards,
>>>> Rado
>>>> ------------------------------------------------------------------------ 
>>>>
>>>> *From:* panama-dev <panama-dev-retn at openjdk.java.net> on behalf of 
>>>> Radosław Smogura <mail at smogura.eu>
>>>> *Sent:* Friday, August 6, 2021 20:22
>>>> *To:* Paul Sandoz <paul.sandoz at oracle.com>
>>>> *Cc:* panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>> *Subject:* Re: Issues with loop unrolling: better pinned node
>>>> Yes,
>>>>
>>>> The normal case looks good. It's all about polluted cases [1].
>>>>
>>>> BR,
>>>> Rado
>>>>
>>>> [1] https://github.com/openjdk/panama-vector/pull/109 
>>>>
>>>> (Draft) Performance improvements for polluted cases by rsmogura ·
>>>> Pull Request #109 · openjdk/panama-vector
>>>>
>>>> Hi all, I would like to submit this piece of work, for byte buffers
>>>> and polluted cases. It resolves some performance issues related to
>>>> mem barriers when in scope are both on- and off-heap buffer. T...
>>>>
>>>> Comparing openjdk:vectorIntrinsics...rsmogura:vectors-polluted-cases ·
>>>> openjdk/panama-vector
>>>> https://github.com/openjdk/panama-vector/compare/vectorIntrinsics...rsmogura:vectors-polluted-cases?expand=1
>>>>
>>>> ________________________________
>>>> From: Paul Sandoz <paul.sandoz at oracle.com>
>>>> Sent: Friday, August 6, 2021 20:04
>>>> To: Radosław Smogura <mail at smogura.eu>
>>>> Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>> Subject: Re: Issues with loop unrolling: better pinned node
>>>>
>>>> I am confused as to the case under test. In your initial email of 
>>>> this thread were you also referring implicitly to polluted cases?
>>>>
>>>> Paul.
>>>>
>>>>> On Aug 6, 2021, at 10:56 AM, Radosław Smogura <mail at smogura.eu> 
>>>>> wrote:
>>>>>
>>>>> Hi Paul,
>>>>>
>>>>> There's a performance improvement, but I still can't unroll the
>>>>> polluted cases (I cherry-picked loop unrolling). The graph still
>>>>> has a few nodes taking the buffer limit from a phi, and in the IR I
>>>>> don't see vector nodes cascading.
>>>>>
>>>>> make test TEST='micro:ByteBufferVectorAccess.p' MICRO="OPTIONS=-f 1 -prof perfasm -jvmArgsPrepend=-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0" JOBS=12
>>>>> Benchmark                                     (size)  Mode  Cnt    Score   Error  Units
>>>>> ByteBufferVectorAccess.pollutedBuffers2         1024  avgt   30   40.472 ± 1.055  ns/op
>>>>> ByteBufferVectorAccess.pollutedBuffers2:·asm    1024  avgt           NaN              ---
>>>>> ByteBufferVectorAccess.pollutedBuffers3         1024  avgt   30   79.251 ± 0.786  ns/op
>>>>> ByteBufferVectorAccess.pollutedBuffers3:·asm    1024  avgt           NaN              ---
>>>>> ByteBufferVectorAccess.pollutedBuffers4         1024  avgt   30   83.627 ± 2.140  ns/op
>>>>> ByteBufferVectorAccess.pollutedBuffers4:·asm    1024  avgt           NaN              ---
>>>>> ByteBufferVectorAccess.pollutedBuffers5         1024  avgt   30   85.561 ± 1.156  ns/op
>>>>> ByteBufferVectorAccess.pollutedBuffers5:·asm    1024  avgt           NaN
>>>>>
>>>>> make test TEST='micro:ByteBufferVectorAccess.p' MICRO="OPTIONS=-f 1 -prof perfasm"
>>>>> Benchmark                                     (size)  Mode  Cnt    Score   Error  Units
>>>>> ByteBufferVectorAccess.pollutedBuffers2         1024  avgt   10   49.326 ± 0.843  ns/op
>>>>> ByteBufferVectorAccess.pollutedBuffers2:·asm    1024  avgt           NaN              ---
>>>>> ByteBufferVectorAccess.pollutedBuffers3         1024  avgt   10  100.291 ± 1.271  ns/op
>>>>> ByteBufferVectorAccess.pollutedBuffers3:·asm    1024  avgt           NaN              ---
>>>>> ByteBufferVectorAccess.pollutedBuffers4         1024  avgt   10  101.494 ± 1.027  ns/op
>>>>> ByteBufferVectorAccess.pollutedBuffers4:·asm    1024  avgt           NaN              ---
>>>>> ByteBufferVectorAccess.pollutedBuffers5         1024  avgt   10   94.606 ± 1.522  ns/op
>>>>> ByteBufferVectorAccess.pollutedBuffers5:·asm    1024  avgt           NaN
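
As a rough reading of the two runs above, the relative change from disabling the bounds check can be computed directly from the reported scores. The sketch below is only arithmetic on the numbers quoted in this message. The class and method names are hypothetical, and the two runs differ in iteration count (Cnt 30 vs. 10), so the comparison is indicative at best.

```java
// Hypothetical helper; the scores are copied from the two benchmark runs above.
final class ScoreDelta {

    // Relative reduction in ns/op when going from `before` to `after`, in percent.
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        String[] names = {"pollutedBuffers2", "pollutedBuffers3", "pollutedBuffers4", "pollutedBuffers5"};
        double[] defaultRun = {49.326, 100.291, 101.494, 94.606}; // default settings
        double[] noOobCheck = {40.472, 79.251, 83.627, 85.561};   // VECTOR_ACCESS_OOB_CHECK=0
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%s: %.1f%% lower ns/op%n",
                    names[i], reductionPercent(defaultRun[i], noOobCheck[i]));
        }
    }
}
```

With these inputs the reduction works out to roughly 18%, 21%, 18% and 10% for pollutedBuffers2 through pollutedBuffers5.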
>>>>>
>>>>>
>>>>> BR,
>>>>> Rado
>>>>> From: Paul Sandoz <paul.sandoz at oracle.com>
>>>>> Sent: Friday, August 6, 2021 18:04
>>>>> To: Radosław Smogura <mail at smogura.eu>
>>>>> Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>> Subject: Re: Issues with loop unrolling: better pinned node
>>>>>
>>>>> Hi Rado,
>>>>>
>>>>> It’s good you are looking at the IR
>>>>>
>>>>> Out of curiosity, what happens if you turn off bounds checking [*]?
>>>>>
>>>>> Paul.
>>>>>
>>>>> [*]
>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>>>>>
>>>>> > On Aug 6, 2021, at 8:39 AM, Radosław Smogura <mail at smogura.eu>
>>>>> > wrote:
>>>>> >
>>>>> > Hi all,
>>>>> >
>>>>> > I've found that even if we get rid of the barriers, the loop can't
>>>>> > be unrolled, and unneeded code stays inside it.
>>>>> >
>>>>> > I've found this graph and wonder whether it is optimal. In
>>>>> > particular, the Load of ByteBuffer index / hb comes from a phi;
>>>>> > could it be attached to the initial memory instead?
>>>>> >
>>>>> > Here's a picture (bb_issues.png):
>>>>> > https://drive.google.com/file/d/1G7ZN0xHOVIVHmZ_5TTIUdm3F30okAzvO/view?usp=sharing
>>>>> >
>>>>> >
>>>>> > And sample code
>>>>> >
>>>>> > protected void copyMemory(ByteBuffer in, ByteBuffer out) {
>>>>> >   // Largest multiple of the vector length that fits within in.limit()
>>>>> >   var limit = SPECIES.loopBound(in.limit());
>>>>> >   for (int i = 0; i < limit; i += SPECIES.vectorByteSize()) {
>>>>> >     // Load one vector from `in` at byte offset i and store it to `out`
>>>>> >     final var v = ByteVector.fromByteBuffer(SPECIES, in, i, ByteOrder.nativeOrder());
>>>>> >     v.intoByteBuffer(out, i, ByteOrder.nativeOrder());
>>>>> >   }
>>>>> > }
>>>>> >
>>>>> > Kind regards,
>>>>> > Rado
>>>>