Re: Issues with loop unrolling: better pinned node

Mon Aug 30 15:12:59 UTC 2021

Hi all,

I added one missing thing. I want to build something like this. Would it 
make sense?

    STORE

                                addr
                                  │
                                  │
          reset_memory()          │
             │    ┌───────────────┴────────┐
             │    │ CheckCastPP (-> BOT)   │
             │    └──────┬─────────────────┘
             │           │
             ├───────┐   │
             │       │   │
             │       │   │
             │  ┌────┴───┴──────────────────────────┐
             │  │            StoreVector            │
             │  └───┬───────────────────────────┬───┘
             │      │                           │
             │      │                           │
            ┌┴──────┴───────────────────────────┴────────────────────────────┐
            │ BOT  RAW                        byte[]                         │
            │ MergeMem                                                       │
            └────────────────────────────────────────────────────────────────┘

     LOAD

              │
              │
              ├─────────┐
              │         │
              │ ┌───────┴─────────────────────────────────────────────────────┐
              │ │ LoadVector (BOT)                                            │
              │ └───────────────────────┬─────────────────────────┬───────────┘
              │                         │                         │
              │     addr base -> raw    │                         │    addr base -> byte[]
              │                         │                         │
              │           ┌─────────────┴─────────┐   ┌───────────┴───────────┐
              │           │DummyStoreV (raw)      │   │DummyStoreV (byte[])   │ //No-op stores
              │           └──────┬────────────────┘   └──┬────────────────────┘
              │                  │                       │
              │     ┌────────────┘             ┌─────────┘
              │     │                          │
            ┌─┴─────┴──────────────────────────┴──────────────────────────────┐
            │ BOT  RAW                       byte[]                           │
            │ MergeMem                                                        │
            └─────────────────────────────────────────────────────────────────┘

DummyStore is a "virtual" node inserted after the load. It is intended to 
emulate a store and to prevent writes / reads from going around the load 
vector (in fact, it mostly prevents stores / loads from seeing through the 
MergeMem).

I tested it with the following code.

public static void copyMemoryBytes3(ByteBuffer in, ByteBuffer out, ByteBuffer out2, byte[] arr) {
     for (int i = 0; i < SPECIES_BYTE.loopBound(in.limit()); i += SPECIES_BYTE.vectorByteSize()) {
         var v1 = ByteVector.fromByteBuffer(SPECIES_BYTE, in, i, ByteOrder.nativeOrder());
         arr[i] = (byte) i;
         var v2 = ByteVector.fromByteBuffer(SPECIES_BYTE, out, i, ByteOrder.nativeOrder());
         v1.intoByteBuffer(out, i, ByteOrder.nativeOrder());
     }
}

Kind regards,

Rado

On 27.08.2021 20:16, Rado Smogura wrote:
> Hi all,
>
>
> I experimented a little bit, and I wonder if this is reasonable. The 
> outcome on the graphs is as expected, and the operations look properly 
> ordered (but this is only my private opinion).
>
> https://github.com/rsmogura/panama-vector/commit/755b62823aaed0cddf78e8ccfc60c063bb40779a 
>
>
> Kind regards,
>
> Rado
>
> On 19.08.2021 22:26, Rado Smogura wrote:
>> I think I answered this question quite simply... it will not work.
>>
>> On 19.08.2021 18:39, Rado Smogura wrote:
>>> Hi all,
>>>
>>>
>>> I hope you have a good day.
>>>
>>>
>>> Since optimizing loops would still be a good approach, I thought about 
>>> optimizing mixed access like this:
>>>
>>>
>>> 1. When mixed access is detected, set a "raw / byte array" mixed-access 
>>> flag.
>>>
>>> 2. Bail out and restart compilation (this will happen during the first 
>>> phases, and only for a few methods).
>>>
>>> 3. Pass the flag to the compiler.
>>>
>>> 4. Modify find_alias_type / flatten_alias_type so that, if a byte 
>>> array is queried for its alias, the raw ptr and raw alias are used.
>>>
>>>
>>> Kind regards,
>>>
>>> Rado
>>>
>>> On 18.08.2021 09:17, Rado Smogura wrote:
>>>> Hi Vladimir,
>>>>
>>>>
>>>> Thank you for the answer.
>>>>
>>>>
>>>> In fact, it was an attempt to confirm that memory flow can be the 
>>>> reason why loop opts do not work. That's a very fair point. I'll think 
>>>> about it, and maybe I'll be able to come up with an idea of how this can 
>>>> be generalized.
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>> Rado
>>>>
>>>> On 16.08.2021 15:41, Vladimir Ivanov wrote:
>>>>>> I wonder what you think about something like this [1]. It's 
>>>>>> virtually a small, single-class change
>>>>>
>>>>> Very interesting experiment, Rado! It's encouraging to hear that 
>>>>> loop opts immediately benefit from it.
>>>>>
>>>>> From an architectural perspective, a separate pass to optimize 
>>>>> the memory graph brings excessive complexity:
>>>>>
>>>>>   (1) yet another pass over the graph and susceptible to pass 
>>>>> ordering issues;
>>>>>
>>>>>   (2) separate from GVN: you either have to duplicate GVN-based 
>>>>> memory optimizations or run new pass with IGVN in a loop until it 
>>>>> stabilizes.
>>>>>
>>>>> IMO the problem you noticed illustrates a general weakness in GVN 
>>>>> implementation and that's the place where it should be fixed 
>>>>> (ideally).
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>>>
>>>>>> This change tries to find a unique memory for the load node. I 
>>>>>> implemented it as a separate phase, as the optimization may not run in 
>>>>>> the Ideal method. I think it's lighter than a phi split-out.
>>>>>>
>>>>>> Loops have been transformed. RCE started.
>>>>>>
>>>>>> Kind regards,
>>>>>> Rado
>>>>>>
>>>>>> [1] - 
>>>>>> https://github.com/rsmogura/panama-vector/commit/a44f515890d2c4df3fd0e0ced76545a7664926c3 
>>>>>>
>>>>>> [2] - 
>>>>>> https://github.com/rsmogura/panama-vector/tree/housekeeping-load-memory-optimiziation 
>>>>>> (full test case)
>>>>>>
>>>>>> ------------------------------------------------------------------------ 
>>>>>>
>>>>>> *From:* Radosław Smogura on behalf of Radosław Smogura 
>>>>>> <mail at smogura.eu>
>>>>>> *Sent:* Friday, August 6, 2021 22:43
>>>>>> *To:* Radosław Smogura <mail at smogura.eu>; Paul Sandoz 
>>>>>> <paul.sandoz at oracle.com>; Vladimir Ivanov 
>>>>>> <vladimir.x.ivanov at oracle.com>
>>>>>> *Cc:* panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>>> *Subject:* Re: Issues with loop unrolling: better pinned node
>>>>>> Hi all,
>>>>>>
>>>>>> Now that I have checked it again, it works as expected, and it's the 
>>>>>> same code.
>>>>>>
>>>>>> In the draft code I check whether the buffer is direct by using type 
>>>>>> checking to unswitch the loop, as unswitching over ByteBuffer.hb did 
>>>>>> not work (the graph was quite similar). However, I thought that 
>>>>>> this unswitch actually helped to build correct loops, and any 
>>>>>> kind of improvement around it would be rather for the purpose of 
>>>>>> better-looking code.
>>>>>>
>>>>>> But it looks like sometimes (but only sometimes) the loop still 
>>>>>> cannot be built correctly, or maybe the full optimization kicks 
>>>>>> in very, very late.
>>>>>>
>>>>>> Kind regards,
>>>>>> Rado
>>>>>> ------------------------------------------------------------------------ 
>>>>>>
>>>>>> *From:* panama-dev <panama-dev-retn at openjdk.java.net> on behalf 
>>>>>> of Radosław Smogura <mail at smogura.eu>
>>>>>> *Sent:* Friday, August 6, 2021 20:22
>>>>>> *To:* Paul Sandoz <paul.sandoz at oracle.com>
>>>>>> *Cc:* panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>>> *Subject:* Re: Issues with loop unrolling: better pinned node
>>>>>> Yes,
>>>>>>
>>>>>> The normal case looks good. It's all about the polluted cases [1].
>>>>>>
>>>>>> BR,
>>>>>> Rado
>>>>>>
>>>>>> [1] https://github.com/openjdk/panama-vector/pull/109 
>>>>>>
>>>>>> (Draft) Performance improvements for polluted cases by rsmogura · 
>>>>>> Pull Request #109 · openjdk/panama-vector
>>>>>>
>>>>>> Hi all, I would like to submit this piece of work, for byte 
>>>>>> buffers and polluted cases. It resolves some performance issues 
>>>>>> related to mem barriers when in scope are both on- and off-heap 
>>>>>> buffer. T...
>>>>>>
>>>>>> Comparing 
>>>>>> openjdk:vectorIntrinsics...rsmogura:vectors-polluted-cases · 
>>>>>> openjdk/panama-vector 
>>>>>> <https://github.com/openjdk/panama-vector/compare/vectorIntrinsics...rsmogura:vectors-polluted-cases?expand=1>
>>>>>>
>>>>>> ________________________________
>>>>>> From: Paul Sandoz <paul.sandoz at oracle.com>
>>>>>> Sent: Friday, August 6, 2021 20:04
>>>>>> To: Radosław Smogura <mail at smogura.eu>
>>>>>> Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>>> Subject: Re: Issues with loop unrolling: better pinned node
>>>>>>
>>>>>> I am confused as to the case under test. In your initial email of 
>>>>>> this thread were you also referring implicitly to polluted cases?
>>>>>>
>>>>>> Paul.
>>>>>>
>>>>>>> On Aug 6, 2021, at 10:56 AM, Radosław Smogura <mail at smogura.eu> 
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Paul,
>>>>>>>
>>>>>>> There's a performance improvement, but I still can't unroll the 
>>>>>>> polluted cases (I cherry-picked loop unrolling). The graph still 
>>>>>>> has a few nodes taking the buffer limit from a phi, and in the IR I 
>>>>>>> don't see vector nodes cascading.
>>>>>>>
>>>>>>> make test TEST='micro:ByteBufferVectorAccess.p' 
>>>>>>> MICRO="OPTIONS=-f 1 -prof perfasm 
>>>>>>> -jvmArgsPrepend=-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0" 
>>>>>>> JOBS=12
>>>>>>> Benchmark                                     (size)  Mode  Cnt    Score   Error  Units
>>>>>>> ByteBufferVectorAccess.pollutedBuffers2         1024  avgt   30   40.472 ± 1.055  ns/op
>>>>>>> ByteBufferVectorAccess.pollutedBuffers2:·asm    1024  avgt         NaN              ---
>>>>>>> ByteBufferVectorAccess.pollutedBuffers3         1024  avgt   30   79.251 ± 0.786  ns/op
>>>>>>> ByteBufferVectorAccess.pollutedBuffers3:·asm    1024  avgt         NaN              ---
>>>>>>> ByteBufferVectorAccess.pollutedBuffers4         1024  avgt   30   83.627 ± 2.140  ns/op
>>>>>>> ByteBufferVectorAccess.pollutedBuffers4:·asm    1024  avgt         NaN              ---
>>>>>>> ByteBufferVectorAccess.pollutedBuffers5         1024  avgt   30   85.561 ± 1.156  ns/op
>>>>>>> ByteBufferVectorAccess.pollutedBuffers5:·asm    1024  avgt         NaN
>>>>>>>
>>>>>>> make test TEST='micro:ByteBufferVectorAccess.p' 
>>>>>>> MICRO="OPTIONS=-f 1 -prof perfasm"
>>>>>>> Benchmark                                     (size)  Mode  Cnt    Score   Error  Units
>>>>>>> ByteBufferVectorAccess.pollutedBuffers2         1024  avgt   10   49.326 ± 0.843  ns/op
>>>>>>> ByteBufferVectorAccess.pollutedBuffers2:·asm    1024  avgt         NaN              ---
>>>>>>> ByteBufferVectorAccess.pollutedBuffers3         1024  avgt   10  100.291 ± 1.271  ns/op
>>>>>>> ByteBufferVectorAccess.pollutedBuffers3:·asm    1024  avgt         NaN              ---
>>>>>>> ByteBufferVectorAccess.pollutedBuffers4         1024  avgt   10  101.494 ± 1.027  ns/op
>>>>>>> ByteBufferVectorAccess.pollutedBuffers4:·asm    1024  avgt         NaN              ---
>>>>>>> ByteBufferVectorAccess.pollutedBuffers5         1024  avgt   10   94.606 ± 1.522  ns/op
>>>>>>> ByteBufferVectorAccess.pollutedBuffers5:·asm    1024  avgt         NaN
>>>>>>>
>>>>>>>
>>>>>>> BR,
>>>>>>> Rado
>>>>>>> From: Paul Sandoz <paul.sandoz at oracle.com>
>>>>>>> Sent: Friday, August 6, 2021 18:04
>>>>>>> To: Radosław Smogura <mail at smogura.eu>
>>>>>>> Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>>>> Subject: Re: Issues with loop unrolling: better pinned node
>>>>>>>
>>>>>>> Hi Rado,
>>>>>>>
>>>>>>> It's good that you are looking at the IR.
>>>>>>>
>>>>>>> Out of curiosity, what happens if you turn off bounds checking [*]?
>>>>>>>
>>>>>>> Paul.
>>>>>>>
>>>>>>> [*]
>>>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>>>>>>>
>>>>>>> > On Aug 6, 2021, at 8:39 AM, Radosław Smogura <mail at smogura.eu> 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Hi all,
>>>>>>> >
>>>>>>> > I've found that even if we get rid of the barriers, the loop can't 
>>>>>>> get unrolled, and unneeded code stays inside it.
>>>>>>> >
>>>>>>> > I've found this graph, and I wonder if it's optimal. In 
>>>>>>> particular, the Load of the ByteBuffer index / hb comes from a phi; 
>>>>>>> could it be attached to the initial memory?
>>>>>>> >
>>>>>>> > Here's a picture 
>>>>>>> https://drive.google.com/file/d/1G7ZN0xHOVIVHmZ_5TTIUdm3F30okAzvO/view?usp=sharing 
>>>>>>> >
>>>>>>> >
>>>>>>> > And sample code
>>>>>>> >
>>>>>>> > protected void copyMemory(ByteBuffer in, ByteBuffer out) {
>>>>>>> >  var limit = SPECIES.loopBound(in.limit());
>>>>>>> >  for (int i=0; i < limit; i += SPECIES.vectorByteSize()) {
>>>>>>> >    final var v = ByteVector.fromByteBuffer(SPECIES, in, i, 
>>>>>>> ByteOrder.nativeOrder());
>>>>>>> >    v.intoByteBuffer(out, i, ByteOrder.nativeOrder());
>>>>>>> >  }
>>>>>>> > }
>>>>>>> >
>>>>>>> > Kind regards,
>>>>>>> > Rado
>>>>>>