RFR: 8354674: AArch64: Intrinsify Unsafe::setMemory [v8]
John R Rose
jrose at openjdk.org
Thu May 22 20:27:01 UTC 2025
On Thu, 15 May 2025 16:03:44 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> This intrinsic is generally faster than the current implementation for Panama segment operations for all writes larger than about 8 bytes, rising to more than 2x the performance on larger memory blocks on Graviton 2. The benchmark below compares "panama" (C2 generated, what we use now) with "unsafe" (this intrinsic).
>>
>>
>> Benchmark                       (aligned)  (size)  Mode  Cnt     Score     Error  Units
>> MemorySegmentFillUnsafe.panama       true  262143  avgt   10  7295.638 ±   0.422  ns/op
>> MemorySegmentFillUnsafe.panama      false  262143  avgt   10  8345.300 ±  80.161  ns/op
>> MemorySegmentFillUnsafe.unsafe       true  262143  avgt   10  2930.594 ±   0.180  ns/op
>> MemorySegmentFillUnsafe.unsafe      false  262143  avgt   10  3136.828 ±   0.232  ns/op
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> Copyright format correction
Nice!
There's a nicely written loop tail that handles power-of-two chunks from 32 bytes (stpq) down to a single byte.
Like many such tails, it is O(lg N), N being the max tail size, and that can be annoying when the loop tail is most or all of the work, as it is for very small fills.
One thing that sometimes helps is a count-leading-zeroes (clz) followed by a multiway switch, placed at the start or just before the tail, so that very small inputs enter the tail (its log-size cascade) at the right step.
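
To make the O(lg N) shape concrete, here is a minimal C++ sketch of a log-size tail cascade. It is an illustration only, not the AArch64 stub from the PR; the name `fill_tail_cascade` and the use of `memset` as a stand-in for fixed-size stores are my assumptions.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper (not from the PR): fill the last n bytes (n < 64)
// starting at p with the byte v. Each power of two gets its own test,
// so a 63-byte tail runs through six tests and six stores.
inline void fill_tail_cascade(unsigned char* p, std::size_t n, unsigned char v) {
    if (n & 32) { std::memset(p, v, 32); p += 32; }  // the stpq-sized chunk
    if (n & 16) { std::memset(p, v, 16); p += 16; }
    if (n & 8)  { std::memset(p, v, 8);  p += 8;  }
    if (n & 4)  { std::memset(p, v, 4);  p += 4;  }
    if (n & 2)  { std::memset(p, v, 2);  p += 2;  }
    if (n & 1)  { *p = v; }
}
```

Even a 1-byte fill falls through all six tests here, which is the cost being discussed.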
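
As a rough sketch of that idea (again in C++ rather than assembly, and not taken from either PR), the switch below jumps into the cascade at the highest set bit of the length. `high_bit_index` and `fill_tail_switch` are hypothetical names; the GCC/Clang builtin stands in for a hardware clz, with a plain loop as a fallback.

```cpp
#include <cstddef>
#include <cstring>

// Index of the highest set bit of n (n != 0). Stand-in for a clz instruction.
inline unsigned high_bit_index(unsigned n) {
#if defined(__GNUC__) || defined(__clang__)
    return 31u - static_cast<unsigned>(__builtin_clz(n));
#else
    unsigned i = 0;
    while (n >>= 1) ++i;
    return i;
#endif
}

// Hypothetical helper: fill n bytes (n < 64 assumed) at p with v, entering
// the cascade at the first step that can store anything for this n.
inline void fill_tail_switch(unsigned char* p, std::size_t n, unsigned char v) {
    if (n == 0) return;
    switch (high_bit_index(static_cast<unsigned>(n))) {
    case 5: std::memset(p, v, 32); p += 32;               // fall through
    case 4: if (n & 16) { std::memset(p, v, 16); p += 16; } // fall through
    case 3: if (n & 8)  { std::memset(p, v, 8);  p += 8;  } // fall through
    case 2: if (n & 4)  { std::memset(p, v, 4);  p += 4;  } // fall through
    case 1: if (n & 2)  { std::memset(p, v, 2);  p += 2;  } // fall through
    case 0: if (n & 1)  { *p = v; }
    }
}
```

A 3-byte fill now starts at `case 1` and skips the 32-, 16-, 8- and 4-byte tests entirely.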
This PR https://github.com/openjdk/jdk/pull/25383 uses clz in that way.
It also uses an overlapping-store technique to reduce an O(lg N) tail to an O(1) tail, which also depends on the clz step.
When atomicity is not an issue, the overlapping-store technique is faster on my MacBook M1. It lets you (say) store 7 bytes in two cycles with no extra branches. The downside is that some bytes get stored twice (in the overlap), so it only works on unshared memory.
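
Here is a minimal C++ sketch of the overlapping-store idea for small fills, under my own assumptions: lengths of 1 to 15 bytes, unshared memory, and hypothetical names `floor_log2` and `fill_small_overlap`. It is not the code from either PR. The chunk size is the largest power of two not exceeding n, so one store at the start and one ending at the last byte always cover the whole range.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Index of the highest set bit of n (n != 0). Stand-in for a clz instruction.
inline unsigned floor_log2(unsigned n) {
#if defined(__GNUC__) || defined(__clang__)
    return 31u - static_cast<unsigned>(__builtin_clz(n));
#else
    unsigned i = 0;
    while (n >>= 1) ++i;
    return i;
#endif
}

// Hypothetical helper: fill n bytes (1 <= n <= 15 assumed) at p with v using
// at most two fixed-size stores. Bytes in the overlap are written twice, so
// this is only suitable when no other thread can observe the memory.
inline void fill_small_overlap(unsigned char* p, std::size_t n, unsigned char v) {
    const std::uint64_t pat = 0x0101010101010101ULL * v;  // v in every byte
    switch (floor_log2(static_cast<unsigned>(n))) {
    case 3: std::memcpy(p, &pat, 8); std::memcpy(p + n - 8, &pat, 8); break;
    case 2: std::memcpy(p, &pat, 4); std::memcpy(p + n - 4, &pat, 4); break;
    case 1: std::memcpy(p, &pat, 2); std::memcpy(p + n - 2, &pat, 2); break;
    case 0: *p = v; break;
    }
}
```

For n = 7 this takes the 4-byte case: one store covers bytes 0 to 3, the other covers bytes 3 to 6, and byte 3 is the one written twice.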
My rough notes on the relative performance of overlapping loads and stores are here FWIW:
https://cr.openjdk.org/~jrose/jvm/PartialMemoryWord.cpp
BTW, overlapping loads (properly bit-masked) are just as atomic as loads of individual bytes, and much faster. But that's not the topic here.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/25147#issuecomment-2902463076
More information about the core-libs-dev mailing list