Issues with loop unrolling: better pinned node

Fri Aug 6 17:56:06 UTC 2021

Hi Paul,

There's a performance improvement, but I still can't unroll the polluted cases (I cherry-picked the loop unrolling change). The graph still has a few nodes taking the buffer limit from a phi, and in the IR I don't see the vector nodes cascading.

make test TEST='micro:ByteBufferVectorAccess.p' MICRO="OPTIONS=-f 1 -prof perfasm -jvmArgsPrepend=-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0" JOBS=12
Benchmark                                     (size)  Mode  Cnt   Score   Error  Units
ByteBufferVectorAccess.pollutedBuffers2         1024  avgt   30  40.472 ± 1.055  ns/op
ByteBufferVectorAccess.pollutedBuffers2:·asm    1024  avgt          NaN            ---
ByteBufferVectorAccess.pollutedBuffers3         1024  avgt   30  79.251 ± 0.786  ns/op
ByteBufferVectorAccess.pollutedBuffers3:·asm    1024  avgt          NaN            ---
ByteBufferVectorAccess.pollutedBuffers4         1024  avgt   30  83.627 ± 2.140  ns/op
ByteBufferVectorAccess.pollutedBuffers4:·asm    1024  avgt          NaN            ---
ByteBufferVectorAccess.pollutedBuffers5         1024  avgt   30  85.561 ± 1.156  ns/op
ByteBufferVectorAccess.pollutedBuffers5:·asm    1024  avgt          NaN

make test TEST='micro:ByteBufferVectorAccess.p' MICRO="OPTIONS=-f 1 -prof perfasm"
Benchmark                                     (size)  Mode  Cnt    Score   Error  Units
ByteBufferVectorAccess.pollutedBuffers2         1024  avgt   10   49.326 ± 0.843  ns/op
ByteBufferVectorAccess.pollutedBuffers2:·asm    1024  avgt           NaN            ---
ByteBufferVectorAccess.pollutedBuffers3         1024  avgt   10  100.291 ± 1.271  ns/op
ByteBufferVectorAccess.pollutedBuffers3:·asm    1024  avgt           NaN            ---
ByteBufferVectorAccess.pollutedBuffers4         1024  avgt   10  101.494 ± 1.027  ns/op
ByteBufferVectorAccess.pollutedBuffers4:·asm    1024  avgt           NaN            ---
ByteBufferVectorAccess.pollutedBuffers5         1024  avgt   10   94.606 ± 1.522  ns/op
ByteBufferVectorAccess.pollutedBuffers5:·asm    1024  avgt           NaN
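
To put the two runs side by side, here is a rough sketch that computes how much lower the ns/op score is with the bounds check turned off. The class and method names are made up for illustration; the scores are copied from the two tables above. The runs use different iteration counts (Cnt 30 vs 10), so the percentages are only indicative.

```java
class OobCheckSpeedup {
    // Percentage by which `after` is lower than `before` (both in ns/op).
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        String[] names = {
            "pollutedBuffers2", "pollutedBuffers3", "pollutedBuffers4", "pollutedBuffers5"
        };
        // Scores from the run with default bounds checking.
        double[] withChecks = {49.326, 100.291, 101.494, 94.606};
        // Scores from the run with VECTOR_ACCESS_OOB_CHECK=0.
        double[] withoutChecks = {40.472, 79.251, 83.627, 85.561};

        for (int i = 0; i < names.length; i++) {
            System.out.printf("%s: %.1f%% less time per op%n",
                    names[i], reductionPercent(withChecks[i], withoutChecks[i]));
        }
    }
}
```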

BR,
Rado
________________________________
From: Paul Sandoz <paul.sandoz at oracle.com>
Sent: Friday, August 6, 2021 18:04
To: Radosław Smogura <mail at smogura.eu>
Cc: panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: Re: Issues with loop unrolling: better pinned node

Hi Rado,

It’s good that you are looking at the IR.

Out of curiosity, what happens if you turn off bounds checking [*]?

Paul.

[*]
-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0

> On Aug 6, 2021, at 8:39 AM, Radosław Smogura <mail at smogura.eu> wrote:
>
> Hi all,
>
> I've found that even if we get rid of the barriers, the loop can't be unrolled, and unneeded code remains inside it.
>
> I've found this graph, and I wonder whether it's optimal. In particular, the Load of the ByteBuffer index / hb comes from a phi; could it be attached to the initial memory instead?
>
> Here's a picture https://drive.google.com/file/d/1G7ZN0xHOVIVHmZ_5TTIUdm3F30okAzvO/view?usp=sharing
>
>
> And sample code
>
> protected void copyMemory(ByteBuffer in, ByteBuffer out) {
>  var limit = SPECIES.loopBound(in.limit());
>  for (int i=0; i < limit; i += SPECIES.vectorByteSize()) {
>    final var v = ByteVector.fromByteBuffer(SPECIES, in, i, ByteOrder.nativeOrder());
>    v.intoByteBuffer(out, i, ByteOrder.nativeOrder());
>  }
> }
>
> Kind regards,
> Rado
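
For reference, the loop-bound arithmetic in the quoted copyMemory sample can be sketched in plain Java, without the incubator module. This is only an illustration under assumptions: VECTOR_BYTES stands in for SPECIES.vectorByteSize() (32 is an arbitrary choice), the local loopBound mirrors the documented behaviour of VectorSpecies.loopBound (largest multiple of the vector length not exceeding the given length), and the per-byte inner loop is a scalar stand-in for the vector load and store. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

class LoopBoundSketch {
    // Assumed stand-in for SPECIES.vectorByteSize(); not taken from the thread.
    static final int VECTOR_BYTES = 32;

    // Largest multiple of VECTOR_BYTES that is <= length.
    static int loopBound(int length) {
        return length - (length % VECTOR_BYTES);
    }

    // Copies whole VECTOR_BYTES-sized chunks and returns the number of bytes copied.
    // Bytes between the loop bound and in.limit() are left untouched, as in the sample.
    static int copyChunks(ByteBuffer in, ByteBuffer out) {
        int limit = loopBound(in.limit());
        for (int i = 0; i < limit; i += VECTOR_BYTES) {
            // Scalar stand-in for fromByteBuffer / intoByteBuffer on one chunk.
            for (int j = 0; j < VECTOR_BYTES; j++) {
                out.put(i + j, in.get(i + j));
            }
        }
        return limit;
    }

    public static void main(String[] args) {
        ByteBuffer in = ByteBuffer.allocate(1000);
        ByteBuffer out = ByteBuffer.allocateDirect(1000);
        int copied = copyChunks(in, out);
        System.out.println("copied " + copied + " of " + in.limit() + " bytes");
    }
}
```

With the benchmark size of 1024 the bound equals the buffer limit, so no tail is left over; for other sizes a scalar tail loop would be needed to copy the remaining bytes.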