[vectorIntrinsics] RFR: Optimize mem barriers for ByteBuffer cases

Sun Aug 1 15:37:58 UTC 2021

This is interesting. I'm looking at Apple M1, where I see

ByteBufferVectorAccess.heapBuffers    1024  avgt   10  86.512 ± 0.041  ns/op

which corresponds to about 23 Gigabytes/sec. I get much the same rate
when I do:

... -wi 5 -i 10 -p size=524288 -tu s  -bm thrpt

Benchmark                           (size)   Mode  Cnt      Score    Error  Units
ByteBufferVectorAccess.heapBuffers  524288  thrpt   10  22921.655 ± 42.230  ops/s

(Each op reads half a megabyte from A and writes half a megabyte to B,
a megabyte of traffic in all, so ops/s is megabytes/s.)

This is pretty good.

The code HotSpot generates looks like this:

  0.49%        sxtw	x16, w14
 52.13%        ldr	q16, [x13, x16]
  0.39%        sxtw	x16, w14
  0.23%        str	q16, [x15, x16]
 20.37%        add	w14, w14, #0x10
  0.17%        cmp	w14, w18
               b.lt	0x0000fffface568b0  // b.tstop

So, how well can C do with something similar?

Here's a loop from the STREAM benchmark, which copies ten million
doubles:

 for (j=0; j<10000000; j++)
     c[j] = a[j];

which compiles to something like this (I've edited it a bit):

LBB0_7:
        ldp     q0, q1, [x8, #-16]
        stp     q0, q1, [x10, #-16]
        add     x8, x8, #32
        add     x10, x10, #32
        subs    x9, x9, #4
        b.ne    LBB0_7

and runs at about 60 Gigabytes/sec.

The difference here, I suspect, might be not much more than using 32-byte
accesses rather than 16-byte accesses. We're very close.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671