[vectorIntrinsics] RFR: Optimize mem barriers for ByteBuffer cases

Mon Aug 2 14:29:49 UTC 2021

Hi Andrew,

Thank you for the feedback.

I'm not an M1 expert (I am thinking about testing on one in the cloud). However, looking at the ARM assembly, it seems the C version uses a load instruction (ldp) that fills two registers at once, and the vector size on M1 is 128 bits.

I also noticed that the generated code is not unrolled. This should already be fixed by [1], but it looks like that change is still waiting to be merged into the branch. With that change the GB/s figure will probably go up.

Kind regards,
Rado

[1] https://github.com/openjdk/panama-vector/commit/1f51e13ea763e642dac440142e9cb3a177df7959

________________________________
From: panama-dev <panama-dev-retn at openjdk.java.net> on behalf of Andrew Haley <aph at redhat.com>
Sent: Sunday, August 1, 2021 17:37
To: 'panama-dev at openjdk.java.net' <panama-dev at openjdk.java.net>
Subject: Re: [vectorIntrinsics] RFR: Optimize mem barriers for ByteBuffer cases

This is interesting. I'm looking at Apple M1, where I see

ByteBufferVectorAccess.heapBuffers    1024  avgt   10  86.512 ± 0.041  ns/op

which corresponds to about 23 Gigabytes/sec, when I do:

... -wi 5 -i 10 -p size=524288 -tu s  -bm thrpt

Benchmark                           (size)   Mode  Cnt      Score    Error  Units
ByteBufferVectorAccess.heapBuffers  524288  thrpt   10  22921.655 ± 42.230  ops/s

(We're moving half a megabyte from A to B, so ops/s is megabytes/s)
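
The conversion from these JMH scores to a bandwidth figure can be sketched as below. This is only an illustration: the helper names are hypothetical, and it assumes that "size" is a byte count and that traffic is counted STREAM-style (bytes read plus bytes written, so a copy of N bytes moves 2*N bytes), which is what "ops/s is megabytes/s" appears to imply.

```c
#include <assert.h>

/* Hypothetical helper: JMH "avgt" score (ns/op) to GB/s.
 * Bytes per nanosecond is numerically equal to GB/s (1 GB = 1e9 bytes). */
double gbytes_per_sec_from_ns_per_op(double bytes_per_op, double ns_per_op)
{
    assert(ns_per_op > 0.0);
    return bytes_per_op / ns_per_op;
}

/* Hypothetical helper: JMH "thrpt" score (ops/s) to GB/s. */
double gbytes_per_sec_from_ops_per_sec(double bytes_per_op, double ops_per_sec)
{
    assert(ops_per_sec >= 0.0);
    return bytes_per_op * ops_per_sec / 1e9;
}
```

Under those assumptions, gbytes_per_sec_from_ns_per_op(2 * 1024, 86.512) gives a little under 24 GB/s, and gbytes_per_sec_from_ops_per_sec(2 * 524288, 22921.655) gives about 24 GB/s (roughly 22.9 thousand MiB/s), both in line with the "about 23 Gigabytes/sec" quoted above.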

This is pretty good.

The code HotSpot generates looks like this:

  0.49%        sxtw     x16, w14
 52.13%        ldr       q16, [x13, x16]
  0.39%        sxtw     x16, w14
  0.23%        str      q16, [x15, x16]
 20.37%        add       w14, w14, #0x10
  0.17%        cmp      w14, w18
               b.lt     0x0000fffface568b0  // b.tstop

So, how well can C do with something similar?

Here's a loop from the STREAM benchmark, which copies ten million
doubles:

 for (j = 0; j < 10000000; j++)
     c[j] = a[j];

which generates (something like this code, I've edited it a bit):

LBB0_7:
        ldp     q0, q1, [x8, #-16]
        stp     q0, q1, [x10, #-16]
        add     x8, x8, #32
        add     x10, x10, #32
        subs    x9, x9, #4
        b.ne    LBB0_7

and runs at about 60 Gbytes/sec.

The difference here, I suspect, might be not much more than using 32-byte
accesses rather than 16-byte accesses. We're very close.
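
To make the 16-byte versus 32-byte distinction concrete, here is a minimal C sketch of the two loop shapes. The function names are hypothetical and this is not the STREAM source or HotSpot's output. Whether a given compiler turns the first loop into ldr/str of a q register and the second into ldp/stp of a q register pair is an assumption that depends on compiler and flags; a compiler may also replace either loop with a plain memcpy call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One 16-byte chunk per iteration, like the HotSpot loop above. */
void copy_16(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n % 16 == 0);            /* sketch only: no tail handling */
    for (size_t i = 0; i < n; i += 16)
        memcpy(dst + i, src + i, 16);
}

/* Two 16-byte chunks (32 bytes) per iteration, like the C loop above.
 * Loop overhead (increment, compare, branch) is paid half as often. */
void copy_32(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n % 32 == 0);            /* sketch only: no tail handling */
    for (size_t i = 0; i < n; i += 32) {
        memcpy(dst + i, src + i, 16);
        memcpy(dst + i + 16, src + i + 16, 16);
    }
}
```

Both functions produce the same result for non-overlapping buffers; they differ only in how many bytes each loop iteration moves, which is the difference suspected above.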

--
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671