[vectorIntrinsics] RFR: Optimize mem barriers for ByteBuffer cases
Andrew Haley
aph at redhat.com
Sun Aug 1 15:37:58 UTC 2021
This is interesting. I'm looking at Apple M1, where I see
ByteBufferVectorAccess.heapBuffers 1024 avgt 10 86.512 ± 0.041 ns/op
which corresponds to about 23 Gigabytes/sec, when I do:
... -wi 5 -i 10 -p size=524288 -tu s -bm thrpt
Benchmark                           (size)  Mode   Cnt  Score      Error    Units
ByteBufferVectorAccess.heapBuffers  524288  thrpt  10   22921.655  ± 42.230 ops/s
(We're moving half a megabyte from A to B, so counting the half megabyte
read plus the half megabyte written, ops/s is megabytes/s)
This is pretty good.
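
A minimal sketch of the arithmetic behind those two figures, in C. It assumes that bandwidth counts bytes read plus bytes written, which is how STREAM counts its Copy kernel and is the only way the 1024-byte result comes out near 23 Gbytes/sec. The function names and the TRAFFIC_FACTOR constant are hypothetical, introduced only for illustration.

```c
/* Bytes counted per copied byte: one read plus one write.
   This counting convention is an assumption, not part of the JMH output. */
#define TRAFFIC_FACTOR 2.0

/* Convert a JMH average time (ns/op) for copying `bytes` bytes into GB/s,
   with 1 GB = 1e9 bytes. Bytes per nanosecond is numerically GB/s. */
static double gbps_from_ns_per_op(double bytes, double ns_per_op) {
    return TRAFFIC_FACTOR * bytes / ns_per_op;
}

/* Convert a JMH throughput (ops/s) for copying `bytes` bytes into GB/s. */
static double gbps_from_ops_per_sec(double bytes, double ops_per_sec) {
    return TRAFFIC_FACTOR * bytes * ops_per_sec / 1e9;
}
```

Calling gbps_from_ns_per_op(1024, 86.512) and gbps_from_ops_per_sec(524288, 22921.655) gives values in the 23 to 24 range, so the small and the large buffer agree with each other.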
The code HotSpot generates looks like this:
  0.49%    sxtw  x16, w14
 52.13%    ldr   q16, [x13, x16]
  0.39%    sxtw  x16, w14
  0.23%    str   q16, [x15, x16]
 20.37%    add   w14, w14, #0x10
  0.17%    cmp   w14, w18
           b.lt  0x0000fffface568b0  // b.tstop
So, how well can C do with something similar?
Here's a loop from the STREAM benchmark, which copies ten million
doubles:
    for (j = 0; j < 10000000; j++)
        c[j] = a[j];
which generates something like this (I've edited it a bit):
LBB0_7:
    ldp   q0, q1, [x8, #-16]
    stp   q0, q1, [x10, #-16]
    add   x8, x8, #32
    add   x10, x10, #32
    subs  x9, x9, #4
    b.ne  LBB0_7
and runs at about 60 Gbytes/sec.
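
For anyone wanting to reproduce a figure like that, here is a rough sketch of timing a STREAM-style Copy kernel in C. It is not the STREAM source: the function names are hypothetical, it uses the standard clock() call (CPU time, which for a single-threaded loop should be close to elapsed time), and it counts bytes read plus bytes written.

```c
#include <stddef.h>
#include <time.h>

/* STREAM-style Copy kernel: c[j] = a[j] for n doubles. */
static void copy_kernel(double *restrict c, const double *restrict a, size_t n) {
    for (size_t j = 0; j < n; j++)
        c[j] = a[j];
}

/* Run the kernel `reps` times and return the best rate seen, in GB/s
   (1 GB = 1e9 bytes), counting bytes read plus bytes written.
   Returns 0.0 if every run was too short for clock() to resolve. */
static double best_copy_gbps(double *c, const double *a, size_t n, int reps) {
    double best = 0.0;
    for (int r = 0; r < reps; r++) {
        clock_t t0 = clock();
        copy_kernel(c, a, n);
        clock_t t1 = clock();
        double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
        if (secs > 0.0) {
            double gbps = 2.0 * (double)n * sizeof(double) / secs / 1e9;
            if (gbps > best)
                best = gbps;
        }
    }
    return best;
}
```

Both arrays should be initialised before timing so that page faults are not measured, and the result will vary with the compiler, flags, and array size.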
I suspect the difference is not much more than the C loop using 32-byte
accesses (ldp/stp of two q registers) where the HotSpot loop uses 16-byte
accesses (ldr/str of one). We're very close.
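
One way to probe that suspicion from C is to compare a copy loop that moves 16 bytes per iteration with one that moves 32. The sketch below is only an assumption about how to set that up: copy_16 and copy_32 are hypothetical names, and whether a compiler lowers the fixed-size memcpy calls to single ldr/str or ldp/stp instructions (or replaces the whole loop with a library memcpy call) depends on the compiler and flags, so the generated code needs checking.

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes in 16-byte steps; n is assumed to be a multiple of 16.
   Each fixed-size memcpy stands in for one 16-byte load and store. */
static void copy_16(unsigned char *dst, const unsigned char *src, size_t n) {
    for (size_t i = 0; i < n; i += 16)
        memcpy(dst + i, src + i, 16);
}

/* Copy n bytes in 32-byte steps; n is assumed to be a multiple of 32.
   Each fixed-size memcpy stands in for one 32-byte load and store pair. */
static void copy_32(unsigned char *dst, const unsigned char *src, size_t n) {
    for (size_t i = 0; i < n; i += 32)
        memcpy(dst + i, src + i, 32);
}
```

Timing both over the same buffers would show how much of the gap comes from access width alone.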
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671