Re: Issues with loop unrolling: better pinned node

Thu Sep 9 16:14:34 UTC 2021

Hi Vladimir,

I'm not an expert here. For the following code:

public static int test2(ByteBuffer in, ByteBuffer out, ByteBuffer out2, byte[] arr) {
    for (int i = 0; i < SPECIES_BYTE.loopBound(in.limit()); i += SPECIES_BYTE.vectorByteSize()) {
        var v1 = ByteVector.fromByteBuffer(SPECIES_BYTE, in, i, ByteOrder.nativeOrder());
        arr[i] = (byte) i;
        v1.intoByteBuffer(out, i, ByteOrder.nativeOrder());
    }

    return 0;
}

I think there's a missing dependency between loadV and storeB; I put the graph here [1].

My understanding is that, initially and for a short period of time, storeB is anchored to the same memory as loadV. However, storeB, being a "normal" store, optimizes its own memory chain, and later the phi split happens.

In the end, storeB and loadV are anchored to different memory, so the anti-dependence search can't find the interference. I'm not sure if there's a way to keep such loads and stores ordered in terms of block precedence.
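
To make the concern concrete, here is a minimal sketch of a slice-based interference check. It is a toy model, not C2 code: the names Slice, MemOp, mayAlias and antiDependences are made up for illustration, and placing loadV on a raw slice and storeB on a byte[] slice is an assumption about the graph above. It only shows why a search restricted to the load's own slice misses a store on another slice, and why a load on the wide (BOT) slice would see every store.

```java
import java.util.ArrayList;
import java.util.List;

class AntiDepSketch {
    // Hypothetical alias classes; BOT plays the role of the wide TypePtr::BOTTOM slice.
    enum Slice { BOT, RAW, BYTE_ARRAY }

    // A memory operation reduced to its name, kind, and the slice it is anchored to.
    static final class MemOp {
        final String name;
        final boolean isStore;
        final Slice slice;

        MemOp(String name, boolean isStore, Slice slice) {
            this.name = name;
            this.isStore = isStore;
            this.slice = slice;
        }
    }

    // Two operations can interfere only if they share a slice or one of them is on BOT.
    static boolean mayAlias(Slice a, Slice b) {
        return a == b || a == Slice.BOT || b == Slice.BOT;
    }

    // Names of the stores that would have to be ordered after the given load.
    static List<String> antiDependences(MemOp load, List<MemOp> ops) {
        List<String> found = new ArrayList<>();
        for (MemOp op : ops) {
            if (op.isStore && mayAlias(load.slice, op.slice)) {
                found.add(op.name);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<MemOp> stores = List.of(
                new MemOp("storeB", true, Slice.BYTE_ARRAY),
                new MemOp("storeV", true, Slice.RAW));

        // loadV anchored to RAW only: storeB on byte[] is not found.
        System.out.println(antiDependences(new MemOp("loadV", false, Slice.RAW), stores));
        // loadV anchored to BOT: both stores are found.
        System.out.println(antiDependences(new MemOp("loadV", false, Slice.BOT), stores));
    }
}
```

In this model the first query reports only storeV, which mirrors the missing loadV/storeB edge, while the BOT query reports both stores.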

Kind regards,
Rado

[1] https://drive.google.com/file/d/1_Q7LUmmQqbVnH9pdJmSkjrRsfDyekrbo/view?usp=sharing

On 07.09.2021 20:01, Vladimir Ivanov wrote:
> Thanks for giving it a try, Rado.
>
> It feels like a lot of complexity comes from attempting to support 
> multiple slices per memory operation.
>
> What would it look like if you gave up on them and used 
> TypePtr::BOTTOM/AliasIdxBot? Such memory operations won't be amenable 
> to further memory-related optimizations (since they alias with any 
> other memory operation), but it should significantly simplify their 
> support, shouldn't it?
>
> Best regards,
> Vladimir Ivanov
>
> On 02.09.2021 22:53, Rado Smogura wrote:
>> Hi Vladimir,
>>
>>
>> Thank you for the feedback.
>>
>>
>> There was one idea I had previously and I added it here (I'm surprised 
>> it works):
>>
>> * add an additional field TypeTuple _multi_load_adr to Node and set it 
>> in mixed mode,
>>
>> * in anti-deps, add an outer loop to run the analysis for every address 
>> from this tuple
>>
>> Minor changes:
>>
>> * pass this field to the mach node;
>>
>> * in anti-deps, the load node has to traverse the memory chain (normally 
>> this is done in Ideal).
>>
>>
>> I checked it with mixed "mode" operating on int and byte vectors and 
>> I see storeV (raw / byte[]) gets an anti-dep to loadV (raw / int[]), and 
>> the same for storeV (raw / byte[]) - so that's good, as there's 
>> interference over raw.
>>
>>
>> https://github.com/openjdk/panama-vector/compare/vectorIntrinsics+mask...rsmogura:mixed-mode-use-bot-mem-opt-antideps?expand=1
>>
>>
>> Kind regards,
>>
>> Rado
>>
>> On 01.09.2021 15:22, Vladimir Ivanov wrote:
>>> Interesting idea, Rado! Representing memory effects of 
>>> mixed/mismatched accesses with TypePtr::BOTTOM does look promising.
>>>
>>> Regarding the preferred IR shapes, I'd try to teach alias analysis 
>>> (Compile::find_alias_type()) and PhaseCFG::insert_anti_dependences() 
>>> about loads/stores on wide memory (TypePtr::BOTTOM) and see what 
>>> kind of problems arise to decide how to proceed. I hope there's a 
>>> way to avoid dummy nodes when representing desired effects.
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> On 30.08.2021 18:12, Rado Smogura wrote:
>>>> Hi all,
>>>>
>>>>
>>>> I added one missing thing. I want to build something like this. 
>>>> Would it make sense?
>>>>
>>>>
>>>>     STORE
>>>>
>>>>
>>>>                                 addr
>>>>                                   │
>>>>                                   │
>>>>           reset_memory()          │
>>>>              │    ┌───────────────┴────────┐
>>>>              │    │ CheckCastPP (-> BOT)   │
>>>>              │    └──────┬─────────────────┘
>>>>              │           │
>>>>              ├───────┐   │
>>>>              │       │   │
>>>>              │       │   │
>>>>              │  ┌────┴───┴──────────────────────────┐
>>>>              │  │            StoreVector            │
>>>>              │  └───┬───────────────────────────┬───┘
>>>>              │      │                           │
>>>>              │      │                           │
>>>>             ┌┴──────┴───────────────────────────┴──────────┐
>>>>             │ BOT  RAW                       byte[]        │
>>>>             │ MergeMem                                     │
>>>>             └──────────────────────────────────────────────┘
>>>>
>>>>
>>>>
>>>>      LOAD
>>>>
>>>>               │
>>>>               │
>>>>               ├─────────┐
>>>>               │         │
>>>>              │ ┌───────┴─────────────────────────────────────────────────────┐
>>>>              │ │ LoadVector (BOT)                                            │
>>>>              │ └───────────────────────┬─────────────────────────┬───────────┘
>>>>              │                         │                         │
>>>>              │     addr base -> raw    │                         │ addr base -> byte[]
>>>>              │                         │                         │
>>>>              │           ┌─────────────┴─────────┐   ┌───────────┴───────────┐
>>>>              │           │DummyStoreV (raw)      │   │DummyStoreV (byte[])   │ // No-op stores
>>>>              │           └──────┬────────────────┘   └──┬────────────────────┘
>>>>              │                  │                       │
>>>>              │     ┌────────────┘             ┌─────────┘
>>>>              │     │                          │
>>>>             ┌┴─────┴──────────────────────────┴──────────────────────────────┐
>>>>             │ BOT RAW                      byte[]                            │
>>>>             │ MergeMem                                                       │
>>>>             └────────────────────────────────────────────────────────────────┘
>>>>
>>>>
>>>> DummyStore is a "virtual" node inserted after the load, intended to 
>>>> emulate a store and prevent writes / reads from going around the side 
>>>> of the load vector (in fact it rather prevents stores / loads from 
>>>> seeing through the MergeMem).
>>>>
>>>> I tested it with the following code.
>>>>
>>>> public static void copyMemoryBytes3(ByteBuffer in, ByteBuffer out,
>>>>         ByteBuffer out2, byte[] arr) {
>>>>     for (int i = 0; i < SPECIES_BYTE.loopBound(in.limit());
>>>>             i += SPECIES_BYTE.vectorByteSize()) {
>>>>         var v1 = ByteVector.fromByteBuffer(SPECIES_BYTE, in, i,
>>>>                 ByteOrder.nativeOrder());
>>>>         arr[i] = (byte) i;
>>>>         var v2 = ByteVector.fromByteBuffer(SPECIES_BYTE, out, i,
>>>>                 ByteOrder.nativeOrder());
>>>>         v1.intoByteBuffer(out, i, ByteOrder.nativeOrder());
>>>>     }
>>>> }
>>>>
>>>> Kind regards,
>>>>
>>>> Rado
>>>>
>>>> On 27.08.2021 20:16, Rado Smogura wrote:
>>>>> Hi all,
>>>>>
>>>>>
>>>>> I experimented a little bit, and I wonder if this is reasonable; 
>>>>> the outcome in the graphs is as expected, and the operations look 
>>>>> properly ordered (but this is just my opinion).
>>>>>
>>>>> https://github.com/rsmogura/panama-vector/commit/755b62823aaed0cddf78e8ccfc60c063bb40779a
>>>>>
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Rado
>>>>>
>>>>> On 19.08.2021 22:26, Rado Smogura wrote:
>>>>>> I think I answered this question quite simply... it will not work.
>>>>>>
>>>>>> On 19.08.2021 18:39, Rado Smogura wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>>
>>>>>>> I hope you have a good day.
>>>>>>>
>>>>>>>
>>>>>>> As optimizing loops would still be a good approach, I thought 
>>>>>>> about optimizing mixed access this way:
>>>>>>>
>>>>>>>
>>>>>>> 1. When mixed access is detected, set a "raw / byte array" 
>>>>>>> mixed-access flag.
>>>>>>>
>>>>>>> 2. Bail out and restart compilation (this will happen during the 
>>>>>>> first phases, and only for a few methods).
>>>>>>>
>>>>>>> 3. Pass the flag to the compiler.
>>>>>>>
>>>>>>> 4. Modify find_alias_type / flatten_alias_type so that if a byte 
>>>>>>> array is queried for its alias, the raw ptr and raw alias are 
>>>>>>> used.
>>>>>>>
>>>>>>>
>>>>>>> Kind regards,
>>>>>>>
>>>>>>> Rado
>>>>>>>
>>>>>>> On 18.08.2021 09:17, Rado Smogura wrote:
>>>>>>>> Hi Vladimir,
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you for the answer.
>>>>>>>>
>>>>>>>>
>>>>>>>> In fact, it was an attempt to confirm that memory flow can 
>>>>>>>> be the reason why loop opts do not work. That's a very fair point. 
>>>>>>>> I'll think about it and maybe I'll be able to come up with an idea 
>>>>>>>> of how this can be generalized.
>>>>>>>>
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>>
>>>>>>>> Rado
>>>>>>>>
>>>>>>>> On 16.08.2021 15:41, Vladimir Ivanov wrote:
>>>>>>>>>> I wonder what you think about something like this [1] - 
>>>>>>>>>> it's virtually a small single-class change
>>>>>>>>>
>>>>>>>>> Very interesting experiment, Rado! It's encouraging to hear 
>>>>>>>>> that loop opts immediately benefit from it.
>>>>>>>>>
>>>>>>>>> From an architectural perspective, a separate pass to optimize 
>>>>>>>>> memory graph brings excessive complexity:
>>>>>>>>>
>>>>>>>>>   (1) yet another pass over the graph and susceptible to pass 
>>>>>>>>> ordering issues;
>>>>>>>>>
>>>>>>>>>   (2) separate from GVN: you either have to duplicate 
>>>>>>>>> GVN-based memory optimizations or run new pass with IGVN in a 
>>>>>>>>> loop until it stabilizes.
>>>>>>>>>
>>>>>>>>> IMO the problem you noticed illustrates a general weakness in 
>>>>>>>>> GVN implementation and that's the place where it should be 
>>>>>>>>> fixed (ideally).
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Vladimir Ivanov
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This change tries to find a unique memory for the load node. I 
>>>>>>>>>> implemented it as a separate phase, as the optimization may not 
>>>>>>>>>> run in the Ideal method. I think it's lighter than phi split out.
>>>>>>>>>>
>>>>>>>>>> Loops have been transformed. RCE started.
>>>>>>>>>>
>>>>>>>>>> Kind regards,
>>>>>>>>>> Rado
>>>>>>>>>>
>>>>>>>>>> [1] - 
>>>>>>>>>> https://github.com/rsmogura/panama-vector/commit/a44f515890d2c4df3fd0e0ced76545a7664926c3
>>>>>>>>>>
>>>>>>>>>> [2] - 
>>>>>>>>>> https://github.com/rsmogura/panama-vector/tree/housekeeping-load-memory-optimiziation
>>>>>>>>>> (full test case)
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------------------------------ 
>>>>>>>>>>
>>>>>>>>>> *From:* Radosław Smogura on behalf of Radosław Smogura 
>>>>>>>>>> <mail at smogura.eu>
>>>>>>>>>> *Sent:* Friday, August 6, 2021 22:43
>>>>>>>>>> *To:* Radosław Smogura <mail at smogura.eu>; Paul Sandoz 
>>>>>>>>>> <paul.sandoz at oracle.com>; Vladimir Ivanov 
>>>>>>>>>> <vladimir.x.ivanov at oracle.com>
>>>>>>>>>> *Cc:* panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>>>>>>> *Subject:* Re: Issues with loop unrolling: better pinned node
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> Now that I have checked it again, it works as expected, and 
>>>>>>>>>> it's the same code.
>>>>>>>>>>
>>>>>>>>>> In draft code I check if the buffer is direct by using type 
>>>>>>>>>> checking to unswitch loop, as unswitching over ByteBuffer.hb 
>>>>>>>>>> did not work (the graph was quite similar). However, I 
>>>>>>>>>> thought that this unswitch actually helped to build correct 
>>>>>>>>>> loops, and any kind of improvement around it would be rather 
>>>>>>>>>> for the purpose of better-looking code.
>>>>>>>>>>
>>>>>>>>>> But it looks like sometimes (but only sometimes) the loop 
>>>>>>>>>> still cannot be built correctly, or maybe the full 
>>>>>>>>>> optimization kicks in very, very late.
>>>>>>>>>>
>>>>>>>>>> Kind regards,
>>>>>>>>>> Rado
>>>>>>>>>> ------------------------------------------------------------------------ 
>>>>>>>>>>
>>>>>>>>>> *From:* panama-dev <panama-dev-retn at openjdk.java.net> on 
>>>>>>>>>> behalf of Radosław Smogura <mail at smogura.eu>
>>>>>>>>>> *Sent:* Friday, August 6, 2021 20:22
>>>>>>>>>> *To:* Paul Sandoz <paul.sandoz at oracle.com>
>>>>>>>>>> *Cc:* panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>>>>>>> *Subject:* Re: Issues with loop unrolling: better pinned node
>>>>>>>>>> Yes,
>>>>>>>>>>
>>>>>>>>>> The normal case looks good. It's all about polluted cases [1].
>>>>>>>>>>
>>>>>>>>>> BR,
>>>>>>>>>> Rado
>>>>>>>>>>
>>>>>>>>>> [1] 
>>>>>>>>>> https://github.com/openjdk/panama-vector/pull/109
>>>>>>>>>> ((Draft) Performance improvements for polluted cases by 
>>>>>>>>>> rsmogura, Pull Request #109, openjdk/panama-vector)
>>>>>>>>>>
>>>>>>>>>> https://github.com/openjdk/panama-vector/compare/vectorIntrinsics...rsmogura:vectors-polluted-cases?expand=1
>>>>>>>>>> (Comparing openjdk:vectorIntrinsics...rsmogura:vectors-polluted-cases)
>>>>>>>>>>
>>>>>>>>>> ________________________________
>>>>>>>>>> From: Paul Sandoz <paul.sandoz at oracle.com>
>>>>>>>>>> Sent: Friday, August 6, 2021 20:04
>>>>>>>>>> To: Radosław Smogura <mail at smogura.eu>
>>>>>>>>>> Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>>>>>>> Subject: Re: Issues with loop unrolling: better pinned node
>>>>>>>>>>
>>>>>>>>>> I am confused as to the case under test. In your initial 
>>>>>>>>>> email of this thread were you also referring implicitly to 
>>>>>>>>>> polluted cases?
>>>>>>>>>>
>>>>>>>>>> Paul.
>>>>>>>>>>
>>>>>>>>>>> On Aug 6, 2021, at 10:56 AM, Radosław Smogura 
>>>>>>>>>>> <mail at smogura.eu> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>
>>>>>>>>>>> There's a performance improvement, but I still can't unroll 
>>>>>>>>>>> polluted cases (I cherry-picked loop unrolling). The graph 
>>>>>>>>>>> still has a few nodes taking the buffer limit from a phi, and in 
>>>>>>>>>>> the IR I don't see vector nodes cascading.
>>>>>>>>>>>
>>>>>>>>>>> make test TEST='micro:ByteBufferVectorAccess.p' 
>>>>>>>>>>> MICRO="OPTIONS=-f 1 -prof perfasm 
>>>>>>>>>>> -jvmArgsPrepend=-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0" 
>>>>>>>>>>> JOBS=12
>>>>>>>>>>> Benchmark                                     (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers2         1024  avgt   30   40.472 ± 1.055  ns/op
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers2:·asm    1024  avgt           NaN            ---
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers3         1024  avgt   30   79.251 ± 0.786  ns/op
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers3:·asm    1024  avgt           NaN            ---
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers4         1024  avgt   30   83.627 ± 2.140  ns/op
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers4:·asm    1024  avgt           NaN            ---
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers5         1024  avgt   30   85.561 ± 1.156  ns/op
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers5:·asm    1024  avgt           NaN
>>>>>>>>>>>
>>>>>>>>>>> make test TEST='micro:ByteBufferVectorAccess.p' 
>>>>>>>>>>> MICRO="OPTIONS=-f 1 -prof perfasm"
>>>>>>>>>>> Benchmark                                     (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers2         1024  avgt   10   49.326 ± 0.843  ns/op
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers2:·asm    1024  avgt           NaN            ---
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers3         1024  avgt   10  100.291 ± 1.271  ns/op
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers3:·asm    1024  avgt           NaN            ---
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers4         1024  avgt   10  101.494 ± 1.027  ns/op
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers4:·asm    1024  avgt           NaN            ---
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers5         1024  avgt   10   94.606 ± 1.522  ns/op
>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers5:·asm    1024  avgt           NaN
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> BR,
>>>>>>>>>>> Rado
>>>>>>>>>>> From: Paul Sandoz <paul.sandoz at oracle.com>
>>>>>>>>>>> Sent: Friday, August 6, 2021 18:04
>>>>>>>>>>> To: Radosław Smogura <mail at smogura.eu>
>>>>>>>>>>> Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>>>>>>>> Subject: Re: Issues with loop unrolling: better pinned node
>>>>>>>>>>>
>>>>>>>>>>> Hi Rado,
>>>>>>>>>>>
>>>>>>>>>>> It’s good you are looking at the IR
>>>>>>>>>>>
>>>>>>>>>>> Out of curiosity, what happens if you turn off bounds 
>>>>>>>>>>> checking [*]?
>>>>>>>>>>>
>>>>>>>>>>> Paul.
>>>>>>>>>>>
>>>>>>>>>>> [*]
>>>>>>>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>>>>>>>>>>>
>>>>>>>>>>> > On Aug 6, 2021, at 8:39 AM, Radosław Smogura 
>>>>>>>>>>> <mail at smogura.eu> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hi all,
>>>>>>>>>>> >
>>>>>>>>>>> > I've found that even if we get rid of barriers, the loop 
>>>>>>>>>>> can't get unrolled, and unneeded code remains inside it.
>>>>>>>>>>> >
>>>>>>>>>>> > I've found this graph, and I wonder if it's optimal. In 
>>>>>>>>>>> particular, the Load of ByteBuffer index / hb comes from a phi; 
>>>>>>>>>>> could it be attached to initial memory?
>>>>>>>>>>> >
>>>>>>>>>>> > Here's a picture 
>>>>>>>>>>> https://drive.google.com/file/d/1G7ZN0xHOVIVHmZ_5TTIUdm3F30okAzvO/view?usp=sharing
>>>>>>>>>>> > (bb_issues.png)
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > And sample code
>>>>>>>>>>> >
>>>>>>>>>>> > protected void copyMemory(ByteBuffer in, ByteBuffer out) {
>>>>>>>>>>> >  var limit = SPECIES.loopBound(in.limit());
>>>>>>>>>>> >  for (int i=0; i < limit; i += SPECIES.vectorByteSize()) {
>>>>>>>>>>> >    final var v = ByteVector.fromByteBuffer(SPECIES, in, i, 
>>>>>>>>>>> ByteOrder.nativeOrder());
>>>>>>>>>>> >    v.intoByteBuffer(out, i, ByteOrder.nativeOrder());
>>>>>>>>>>> >  }
>>>>>>>>>>> > }
>>>>>>>>>>> >
>>>>>>>>>>> > Kind regards,
>>>>>>>>>>> > Rado
>>>>>>>>>>