Re: Issues with loop unrolling: better pinned node

Thu Sep 9 18:54:27 UTC 2021

I'm sorry, I think I attached the wrong screenshot. This one better 
describes my concern [1]. It shows part of the pre-loop.

[1] 
https://drive.google.com/file/d/1dKJL6Dqu8bFluiIz63LxFisQ1i_XT_W_/view?usp=sharing
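
To make the concern about the test2 loop quoted below concrete, here is a plain-Java sketch (not taken from the patch or from C2) of why the order of the vector load and the byte store matters once they touch the same memory. It assumes a hypothetical caller that passes a heap buffer wrapping `arr` as `in`, and it emulates the vector load and store with byte-by-byte copies. The class name, `VLEN`, and all method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative only: class, constant and method names are hypothetical.
final class AliasOrderingSketch {
    // Stand-in for SPECIES_BYTE.vectorByteSize(); 8 is an arbitrary choice.
    static final int VLEN = 8;

    // Emulates a vector load: copies VLEN bytes starting at index i.
    static byte[] loadChunk(ByteBuffer src, int i) {
        byte[] chunk = new byte[VLEN];
        for (int k = 0; k < VLEN; k++) {
            chunk[k] = src.get(i + k);
        }
        return chunk;
    }

    // Emulates a vector store: writes the chunk starting at index i.
    static void storeChunk(ByteBuffer dst, int i, byte[] chunk) {
        for (int k = 0; k < chunk.length; k++) {
            dst.put(i + k, chunk[k]);
        }
    }

    // Program order of test2: vector load, then byte store, then vector store.
    static byte[] loadThenStore(byte[] arr) {
        ByteBuffer in = ByteBuffer.wrap(arr);            // 'in' aliases 'arr'
        ByteBuffer out = ByteBuffer.allocate(arr.length);
        for (int i = 0; i + VLEN <= arr.length; i += VLEN) {
            byte[] v1 = loadChunk(in, i);
            arr[i] = (byte) i;
            storeChunk(out, i, v1);
        }
        return out.array();
    }

    // The order a missing anti-dependence would allow: byte store moved above the load.
    static byte[] storeThenLoad(byte[] arr) {
        ByteBuffer in = ByteBuffer.wrap(arr);            // 'in' aliases 'arr'
        ByteBuffer out = ByteBuffer.allocate(arr.length);
        for (int i = 0; i + VLEN <= arr.length; i += VLEN) {
            arr[i] = (byte) i;
            byte[] v1 = loadChunk(in, i);
            storeChunk(out, i, v1);
        }
        return out.array();
    }

    public static void main(String[] args) {
        byte[] data = new byte[2 * VLEN];
        for (int k = 0; k < data.length; k++) {
            data[k] = (byte) (k + 1);
        }
        byte[] ordered = loadThenStore(data.clone());
        byte[] reordered = storeThenLoad(data.clone());
        System.out.println("program order: " + Arrays.toString(ordered));
        System.out.println("reordered:     " + Arrays.toString(reordered));
        System.out.println("same result:   " + Arrays.equals(ordered, reordered));
    }
}
```

The two methods differ only in where `arr[i] = (byte) i` sits relative to the emulated load, which is the ordering an anti-dependence edge between storeB and loadV is meant to preserve.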

On 09.09.2021 18:14, Rado Smogura wrote:
> Hi Vladimir,
>
> I'm not an expert here. For the following code:
>
> public static int test2(ByteBuffer in, ByteBuffer out, ByteBuffer out2, byte[] arr) {
>     for (int i = 0; i < SPECIES_BYTE.loopBound(in.limit()); i += SPECIES_BYTE.vectorByteSize()) {
>         var v1 = ByteVector.fromByteBuffer(SPECIES_BYTE, in, i, ByteOrder.nativeOrder());
>         arr[i] = (byte) i;
>         v1.intoByteBuffer(out, i, ByteOrder.nativeOrder());
>     }
>
>     return 0;
> }
>
> I think there's a missing dependency between loadV and storeB; I put 
> the graph here [1].
>
> My understanding is that initially, and only for a short time, storeB 
> is anchored to the same memory as loadV. However, storeB, being a 
> "normal" store, optimizes its own memory chain, and later the phi 
> split happens.
>
> In the end storeB and loadV are anchored to different memory, so the 
> anti-dependence analysis can't find the interference. I'm not sure 
> whether such loads and stores can be separated in terms of block 
> precedence.
>
> Kind regards,
> Rado
>
> [1] 
> https://drive.google.com/file/d/1_Q7LUmmQqbVnH9pdJmSkjrRsfDyekrbo/view?usp=sharing
>
> On 07.09.2021 20:01, Vladimir Ivanov wrote:
>> Thanks for giving it a try, Rado.
>>
>> It feels like a lot of complexity comes from attempting to support 
>> multiple slices per memory operation.
>>
>> What would it look like if you gave up on them and used 
>> TypePtr::BOTTOM/AliasIdxBot? Such memory operations won't be amenable 
>> to further memory-related optimizations (since they alias with any 
>> other memory operation), but it should significantly simplify their 
>> support, shouldn't it?
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 02.09.2021 22:53, Rado Smogura wrote:
>>> Hi Vladimir,
>>>
>>>
>>> Thank you for the feedback.
>>>
>>>
>>> There was one idea I had previously, and I added it here (I'm 
>>> surprised it works):
>>>
>>> * add an additional field, TypeTuple _multi_load_adr, to Node and set 
>>> it in mixed mode;
>>>
>>> * in anti-deps, add an outer loop that runs the analysis for every 
>>> address in this tuple.
>>>
>>> Minor changes:
>>>
>>> * pass this field to the mach node;
>>>
>>> * in anti-deps, the load node has to traverse the memory chain 
>>> (normally this is done in Ideal).
>>>
>>>
>>> I checked it with mixed "mode" operating on int and byte vectors, and 
>>> I see that storeV (raw / byte[]) gets an anti-dep to loadV (raw / int[]), 
>>> and the same for storeV (raw / byte[]). That's good, as there's 
>>> interference over raw.
>>>
>>>
>>> https://github.com/openjdk/panama-vector/compare/vectorIntrinsics+mask...rsmogura:mixed-mode-use-bot-mem-opt-antideps?expand=1
>>>
>>>
>>> Kind regards,
>>>
>>> Rado
>>>
>>> On 01.09.2021 15:22, Vladimir Ivanov wrote:
>>>> Interesting idea, Rado! Representing memory effects of 
>>>> mixed/mismatched accesses with TypePtr::BOTTOM does look promising.
>>>>
>>>> Regarding the preferred IR shapes, I'd try to teach alias analysis 
>>>> (Compile::find_alias_type()) and 
>>>> PhaseCFG::insert_anti_dependences() about loads/stores on wide 
>>>> memory (TypePtr::BOTTOM) and see what kind of problems arise to 
>>>> decide how to proceed. I hope there's a way to avoid dummy nodes 
>>>> when representing desired effects.
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>> On 30.08.2021 18:12, Rado Smogura wrote:
>>>>> Hi all,
>>>>>
>>>>>
>>>>> I added one missing thing. I want to build something like this. 
>>>>> Would it make sense?
>>>>>
>>>>>
>>>>>     STORE
>>>>>
>>>>>
>>>>>                                 addr
>>>>>                                   │
>>>>>                                   │
>>>>>           reset_memory()          │
>>>>>              │    ┌───────────────┴────────┐
>>>>>              │    │ CheckCastPP (-> BOT)   │
>>>>>              │    └──────┬─────────────────┘
>>>>>              │           │
>>>>>              ├───────┐   │
>>>>>              │       │   │
>>>>>              │       │   │
>>>>>              │  ┌────┴───┴──────────────────────────┐
>>>>>              │  │            StoreVector            │
>>>>>              │  └───┬───────────────────────────┬───┘
>>>>>              │      │                           │
>>>>>              │      │                           │
>>>>>             ┌┴──────┴───────────────────────────┴────────────────────────────┐
>>>>>             │ BOT  RAW                        byte[]                         │
>>>>>             │ MergeMem                                                       │
>>>>>             └────────────────────────────────────────────────────────────────┘
>>>>>
>>>>>
>>>>>
>>>>>      LOAD
>>>>>
>>>>>               │
>>>>>               │
>>>>>               ├─────────┐
>>>>>               │         │
>>>>>               │ ┌───────┴─────────────────────────────────────────────────────┐
>>>>>               │ │ LoadVector (BOT)                                            │
>>>>>               │ └───────────────────────┬─────────────────────────┬───────────┘
>>>>>               │                         │                         │
>>>>>               │       addr base -> raw  │                         │ addr base -> byte[]
>>>>>               │                         │                         │
>>>>>               │           ┌─────────────┴─────────┐   ┌───────────┴───────────┐
>>>>>               │           │DummyStoreV (raw)      │   │DummyStoreV (byte[])   │ //No-op stores
>>>>>               │           └──────┬────────────────┘   └──┬────────────────────┘
>>>>>               │                  │                       │
>>>>>               │     ┌────────────┘             ┌─────────┘
>>>>>               │     │                          │
>>>>>             ┌─┴─────┴──────────────────────────┴──────────────────────────────┐
>>>>>             │ BOT  RAW                       byte[]                           │
>>>>>             │ MergeMem                                                        │
>>>>>             └─────────────────────────────────────────────────────────────────┘
>>>>>
>>>>>
>>>>> DummyStore is a "virtual" node inserted after the load. It is 
>>>>> intended to emulate a store and to keep writes / reads from going 
>>>>> around the load vector (in fact, it mostly prevents a store / load 
>>>>> from seeing through the MergeMem).
>>>>>
>>>>> I tested it with the following code.
>>>>>
>>>>> public static void copyMemoryBytes3(ByteBuffer in, ByteBuffer out, ByteBuffer out2, byte[] arr) {
>>>>>     for (int i = 0; i < SPECIES_BYTE.loopBound(in.limit()); i += SPECIES_BYTE.vectorByteSize()) {
>>>>>         var v1 = ByteVector.fromByteBuffer(SPECIES_BYTE, in, i, ByteOrder.nativeOrder());
>>>>>         arr[i] = (byte) i;
>>>>>         var v2 = ByteVector.fromByteBuffer(SPECIES_BYTE, out, i, ByteOrder.nativeOrder());
>>>>>         v1.intoByteBuffer(out, i, ByteOrder.nativeOrder());
>>>>>     }
>>>>> }
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Rado
>>>>>
>>>>> On 27.08.2021 20:16, Rado Smogura wrote:
>>>>>> Hi all,
>>>>>>
>>>>>>
>>>>>> I experimented a little and wonder whether this is reasonable. 
>>>>>> The outcome on the graphs is as expected, and the operations look 
>>>>>> properly ordered (but this is just my opinion).
>>>>>>
>>>>>> https://github.com/rsmogura/panama-vector/commit/755b62823aaed0cddf78e8ccfc60c063bb40779a
>>>>>>
>>>>>>
>>>>>> Kind regards,
>>>>>>
>>>>>> Rado
>>>>>>
>>>>>> On 19.08.2021 22:26, Rado Smogura wrote:
>>>>>>> I think I answered this question quite simply... it will not work.
>>>>>>>
>>>>>>> On 19.08.2021 18:39, Rado Smogura wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>>
>>>>>>>> I hope you have a good day.
>>>>>>>>
>>>>>>>>
>>>>>>>> Since optimizing loops would still be a good approach, I thought 
>>>>>>>> about optimizing mixed access this way:
>>>>>>>>
>>>>>>>>
>>>>>>>> 1. When mixed access is detected, set a "raw / byte array" 
>>>>>>>> mixed-access flag.
>>>>>>>>
>>>>>>>> 2. Bail out and restart compilation (this will happen during the 
>>>>>>>> first phases, and only for a few methods).
>>>>>>>>
>>>>>>>> 3. Pass the flag to the compiler.
>>>>>>>>
>>>>>>>> 4. Modify find_alias_type / flatten_alias_type so that when a byte 
>>>>>>>> array is queried for its alias, the raw ptr and raw alias are 
>>>>>>>> used.
>>>>>>>>
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>>
>>>>>>>> Rado
>>>>>>>>
>>>>>>>> On 18.08.2021 09:17, Rado Smogura wrote:
>>>>>>>>> Hi Vladimir,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thank you for the answer.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In fact, it was an attempt to confirm that memory flow can be 
>>>>>>>>> the reason why loop opts do not work. That's a very fair point. 
>>>>>>>>> I'll think about it, and maybe I'll be able to come up with an 
>>>>>>>>> idea for how this can be generalized.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Kind regards,
>>>>>>>>>
>>>>>>>>> Rado
>>>>>>>>>
>>>>>>>>> On 16.08.2021 15:41, Vladimir Ivanov wrote:
>>>>>>>>>>> I wonder what you think about something like this [1] - 
>>>>>>>>>>> it's virtually a small, single-class change
>>>>>>>>>>
>>>>>>>>>> Very interesting experiment, Rado! It's encouraging to hear 
>>>>>>>>>> that loop opts immediately benefit from it.
>>>>>>>>>>
>>>>>>>>>> From an architectural perspective, a separate pass to optimize 
>>>>>>>>>> the memory graph brings excessive complexity:
>>>>>>>>>>
>>>>>>>>>>   (1) yet another pass over the graph and susceptible to pass 
>>>>>>>>>> ordering issues;
>>>>>>>>>>
>>>>>>>>>>   (2) separate from GVN: you either have to duplicate 
>>>>>>>>>> GVN-based memory optimizations or run new pass with IGVN in a 
>>>>>>>>>> loop until it stabilizes.
>>>>>>>>>>
>>>>>>>>>> IMO the problem you noticed illustrates a general weakness in 
>>>>>>>>>> GVN implementation and that's the place where it should be 
>>>>>>>>>> fixed (ideally).
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Vladimir Ivanov
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This change tries to find a unique memory for the load node. I 
>>>>>>>>>>> implemented it as a separate phase, as the optimization may not 
>>>>>>>>>>> run in the Ideal method. I think it's lighter than phi split out.
>>>>>>>>>>>
>>>>>>>>>>> Loops have been transformed. RCE started.
>>>>>>>>>>>
>>>>>>>>>>> Kind regards,
>>>>>>>>>>> Rado
>>>>>>>>>>>
>>>>>>>>>>> [1] - 
>>>>>>>>>>> https://github.com/rsmogura/panama-vector/commit/a44f515890d2c4df3fd0e0ced76545a7664926c3
>>>>>>>>>>>
>>>>>>>>>>> [2] - 
>>>>>>>>>>> https://github.com/rsmogura/panama-vector/tree/housekeeping-load-memory-optimiziation
>>>>>>>>>>> (full test case)
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------------------------ 
>>>>>>>>>>>
>>>>>>>>>>> *From:* Radosław Smogura on behalf of Radosław Smogura 
>>>>>>>>>>> <mail at smogura.eu>
>>>>>>>>>>> *Sent:* Friday, August 6, 2021 22:43
>>>>>>>>>>> *To:* Radosław Smogura <mail at smogura.eu>; Paul Sandoz 
>>>>>>>>>>> <paul.sandoz at oracle.com>; Vladimir Ivanov 
>>>>>>>>>>> <vladimir.x.ivanov at oracle.com>
>>>>>>>>>>> *Cc:* panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>>>>>>>> *Subject:* Re: Issues with loop unrolling: better pinned node
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> Now that I have checked it again, it works as expected, and 
>>>>>>>>>>> it's the same code.
>>>>>>>>>>>
>>>>>>>>>>> In draft code I check if the buffer is direct by using type 
>>>>>>>>>>> checking to unswitch loop, as unswitching over ByteBuffer.hb 
>>>>>>>>>>> did not work (the graph was quite similar). However, I 
>>>>>>>>>>> thought that this unswitch actually helped to build correct 
>>>>>>>>>>> loops, and any kind of improvement around it would be rather 
>>>>>>>>>>> for the purpose of better-looking code.
>>>>>>>>>>>
>>>>>>>>>>> But it looks like sometimes (and only sometimes) the loop 
>>>>>>>>>>> still cannot be built correctly, or maybe the full 
>>>>>>>>>>> optimization kicks in very, very late.
>>>>>>>>>>>
>>>>>>>>>>> Kind regards,
>>>>>>>>>>> Rado
>>>>>>>>>>> ------------------------------------------------------------------------ 
>>>>>>>>>>>
>>>>>>>>>>> *From:* panama-dev <panama-dev-retn at openjdk.java.net> on 
>>>>>>>>>>> behalf of Radosław Smogura <mail at smogura.eu>
>>>>>>>>>>> *Sent:* Friday, August 6, 2021 20:22
>>>>>>>>>>> *To:* Paul Sandoz <paul.sandoz at oracle.com>
>>>>>>>>>>> *Cc:* panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>>>>>>>> *Subject:* Re: Issues with loop unrolling: better pinned node
>>>>>>>>>>> Yes,
>>>>>>>>>>>
>>>>>>>>>>> The normal case looks good. It's all about polluted cases [1].
>>>>>>>>>>>
>>>>>>>>>>> BR,
>>>>>>>>>>> Rado
>>>>>>>>>>>
>>>>>>>>>>> [1] https://github.com/openjdk/panama-vector/pull/109
>>>>>>>>>>> "(Draft) Performance improvements for polluted cases" by rsmogura, 
>>>>>>>>>>> Pull Request #109, openjdk/panama-vector.
>>>>>>>>>>> Compare view: 
>>>>>>>>>>> https://github.com/openjdk/panama-vector/compare/vectorIntrinsics...rsmogura:vectors-polluted-cases?expand=1
>>>>>>>>>>>
>>>>>>>>>>> ________________________________
>>>>>>>>>>> From: Paul Sandoz <paul.sandoz at oracle.com>
>>>>>>>>>>> Sent: Friday, August 6, 2021 20:04
>>>>>>>>>>> To: Radosław Smogura <mail at smogura.eu>
>>>>>>>>>>> Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>>>>>>>> Subject: Re: Issues with loop unrolling: better pinned node
>>>>>>>>>>>
>>>>>>>>>>> I am confused as to the case under test. In your initial 
>>>>>>>>>>> email of this thread were you also referring implicitly to 
>>>>>>>>>>> polluted cases?
>>>>>>>>>>>
>>>>>>>>>>> Paul.
>>>>>>>>>>>
>>>>>>>>>>>> On Aug 6, 2021, at 10:56 AM, Radosław Smogura 
>>>>>>>>>>>> <mail at smogura.eu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Paul,
>>>>>>>>>>>>
>>>>>>>>>>>> There's a performance improvement, but I still can't 
>>>>>>>>>>>> unroll polluted cases (I cherry-picked loop unrolling). The 
>>>>>>>>>>>> graph still has a few nodes taking the buffer limit from a phi, 
>>>>>>>>>>>> and in the IR I don't see vector nodes cascading.
>>>>>>>>>>>>
>>>>>>>>>>>> make test TEST='micro:ByteBufferVectorAccess.p' MICRO="OPTIONS=-f 1 -prof perfasm -jvmArgsPrepend=-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0" JOBS=12
>>>>>>>>>>>> Benchmark                                     (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers2         1024  avgt   30   40.472 ± 1.055  ns/op
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers2:·asm    1024  avgt           NaN          ---
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers3         1024  avgt   30   79.251 ± 0.786  ns/op
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers3:·asm    1024  avgt           NaN          ---
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers4         1024  avgt   30   83.627 ± 2.140  ns/op
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers4:·asm    1024  avgt           NaN          ---
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers5         1024  avgt   30   85.561 ± 1.156  ns/op
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers5:·asm    1024  avgt           NaN
>>>>>>>>>>>>
>>>>>>>>>>>> make test TEST='micro:ByteBufferVectorAccess.p' MICRO="OPTIONS=-f 1 -prof perfasm"
>>>>>>>>>>>> Benchmark                                     (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers2         1024  avgt   10   49.326 ± 0.843  ns/op
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers2:·asm    1024  avgt           NaN          ---
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers3         1024  avgt   10  100.291 ± 1.271  ns/op
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers3:·asm    1024  avgt           NaN          ---
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers4         1024  avgt   10  101.494 ± 1.027  ns/op
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers4:·asm    1024  avgt           NaN          ---
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers5         1024  avgt   10   94.606 ± 1.522  ns/op
>>>>>>>>>>>> ByteBufferVectorAccess.pollutedBuffers5:·asm    1024  avgt           NaN
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> BR,
>>>>>>>>>>>> Rado
>>>>>>>>>>>> From: Paul Sandoz <paul.sandoz at oracle.com>
>>>>>>>>>>>> Sent: Friday, August 6, 2021 18:04
>>>>>>>>>>>> To: Radosław Smogura <mail at smogura.eu>
>>>>>>>>>>>> Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>>>>>>>>>>> Subject: Re: Issues with loop unrolling: better pinned node
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Rado,
>>>>>>>>>>>>
>>>>>>>>>>>> It’s good you are looking at the IR
>>>>>>>>>>>>
>>>>>>>>>>>> Out of curiosity, what happens if you turn off bounds 
>>>>>>>>>>>> checking [*]?
>>>>>>>>>>>>
>>>>>>>>>>>> Paul.
>>>>>>>>>>>>
>>>>>>>>>>>> [*]
>>>>>>>>>>>> -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0
>>>>>>>>>>>>
>>>>>>>>>>>> > On Aug 6, 2021, at 8:39 AM, Radosław Smogura 
>>>>>>>>>>>> <mail at smogura.eu> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Hi all,
>>>>>>>>>>>> >
>>>>>>>>>>>> > I've found that even if we get rid of barriers, the loop 
>>>>>>>>>>>> > can't get unrolled, and unneeded code stays inside it.
>>>>>>>>>>>> >
>>>>>>>>>>>> > I've found this graph, and I wonder if it's optimal. In 
>>>>>>>>>>>> > particular, the Load of ByteBuffer index / hb comes from a phi; 
>>>>>>>>>>>> > could it be attached to the initial memory?
>>>>>>>>>>>> >
>>>>>>>>>>>> > Here's a picture (bb_issues.png): 
>>>>>>>>>>>> > https://drive.google.com/file/d/1G7ZN0xHOVIVHmZ_5TTIUdm3F30okAzvO/view?usp=sharing
>>>>>>>>>>>> >
>>>>>>>>>>>> >
>>>>>>>>>>>> > And sample code
>>>>>>>>>>>> >
>>>>>>>>>>>> > protected void copyMemory(ByteBuffer in, ByteBuffer out) {
>>>>>>>>>>>> >  var limit = SPECIES.loopBound(in.limit());
>>>>>>>>>>>> >  for (int i = 0; i < limit; i += SPECIES.vectorByteSize()) {
>>>>>>>>>>>> >    final var v = ByteVector.fromByteBuffer(SPECIES, in, i, ByteOrder.nativeOrder());
>>>>>>>>>>>> >    v.intoByteBuffer(out, i, ByteOrder.nativeOrder());
>>>>>>>>>>>> >  }
>>>>>>>>>>>> > }
>>>>>>>>>>>> >
>>>>>>>>>>>> > Kind regards,
>>>>>>>>>>>> > Rado
>>>>>>>>>>>
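
For readers less familiar with the loop shape in the copyMemory sample above: `loopBound` rounds the length down to a multiple of the vector length, so the loop only ever touches whole vectors and leaves any tail bytes alone. A scalar sketch in plain Java follows; the class and method names are hypothetical, the 32-byte vector size is an assumption (the real value comes from the species), and the inner byte loop merely stands in for one vector load plus one vector store.

```java
import java.nio.ByteBuffer;

// Illustrative only: names and the vector size are assumptions.
final class LoopBoundSketch {
    // Assumed vector size in bytes; stands in for SPECIES.vectorByteSize().
    static final int VECTOR_BYTES = 32;

    // Emulates the rounding of loopBound(length) for a byte species:
    // the largest multiple of VECTOR_BYTES that is <= length.
    static int loopBound(int length) {
        return length - (length % VECTOR_BYTES);
    }

    // Copies only whole VECTOR_BYTES chunks and returns how many bytes were copied.
    static int copyWholeChunks(ByteBuffer in, ByteBuffer out) {
        int limit = loopBound(in.limit());
        for (int i = 0; i < limit; i += VECTOR_BYTES) {
            // Stand-in for one vector load followed by one vector store.
            for (int k = 0; k < VECTOR_BYTES; k++) {
                out.put(i + k, in.get(i + k));
            }
        }
        return limit;
    }

    public static void main(String[] args) {
        ByteBuffer in = ByteBuffer.allocate(70);
        for (int k = 0; k < 70; k++) {
            in.put(k, (byte) (k + 1));
        }
        ByteBuffer out = ByteBuffer.allocate(70);
        int copied = copyWholeChunks(in, out);
        System.out.println("copied " + copied + " of " + in.limit() + " bytes");
    }
}
```

As in the original sample, nothing handles the remaining tail; a real caller would need a scalar or masked epilogue for those bytes.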