RFR: 8354674: AArch64: Intrinsify Unsafe::setMemory [v8]

Thu May 22 20:27:01 UTC 2025

On Thu, 15 May 2025 16:03:44 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> This intrinsic is generally faster than the current implementation for Panama segment operations for all writes larger than about 8 bytes in size, increasing to more than 2* the performance on larger memory blocks on Graviton 2, between "panama" (C2 generated, what we use now) and "unsafe" (this intrinsic).
>> 
>> 
>> Benchmark                       (aligned)  (size)  Mode  Cnt     Score    Error  Units
>> MemorySegmentFillUnsafe.panama       true  262143  avgt   10  7295.638 ±  0.422  ns/op
>> MemorySegmentFillUnsafe.panama      false  262143  avgt   10  8345.300 ± 80.161  ns/op
>> MemorySegmentFillUnsafe.unsafe       true  262143  avgt   10  2930.594 ±  0.180  ns/op
>> MemorySegmentFillUnsafe.unsafe      false  262143  avgt   10  3136.828 ±  0.232  ns/op
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Copyright format correction

Nice!

There's a nicely written loop tail that handles power-of-two chunks from 32 bytes (stpq) down to a single byte.

Like many such tails, it is O(lg N), N being the max tail size, and that can be annoying when the loop tail is most or all of the work.
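
As a rough illustration of that shape (plain C, not the PR's AArch64 stub), a log-size tail tests one bit of the remaining count per step, largest chunk first. The helper name `fill_tail_cascade` and the `n < 64` precondition are assumptions made for this sketch; `memset` stands in for the individual store instructions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill the n bytes at p with v, for n < 64.
   One test per power-of-two chunk, from 32 bytes down to 1 byte,
   so the number of steps is fixed by the max tail size, not by n. */
static void fill_tail_cascade(uint8_t *p, size_t n, uint8_t v) {
    for (size_t chunk = 32; chunk >= 1; chunk >>= 1) {
        if (n & chunk) {
            memset(p, v, chunk);   /* stands in for one store of that width */
            p += chunk;
        }
    }
}
```

Even a 1-byte fill walks through all six tests here, which is the cost that becomes visible when the tail is most of the work.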

One thing that sometimes helps for very small inputs is a count-leading-zeros (clz) followed by a multiway switch, placed at the start or just before the tail, so that execution enters the tail's log-size cascade at the right place.
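
A minimal sketch of the idea in C, assuming a GCC or Clang compiler for the `__builtin_clz` builtin. The helper name `fill_small_clz` and the `n < 64` precondition are made up for this example, and the code is not taken from either PR.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill the n bytes at p with v, for n < 64.
   clz finds the highest set bit of n, and the switch jumps straight
   to the matching step of the cascade, skipping the larger chunks. */
static void fill_small_clz(uint8_t *p, size_t n, uint8_t v) {
    if (n == 0) return;                        /* clz of 0 is undefined */
    int top = 31 - __builtin_clz((unsigned)n); /* 0 for n == 1, 5 for n in 32..63 */
    switch (top) {
    case 5: if (n & 32) { memset(p, v, 32); p += 32; } /* fall through */
    case 4: if (n & 16) { memset(p, v, 16); p += 16; } /* fall through */
    case 3: if (n & 8)  { memset(p, v, 8);  p += 8;  } /* fall through */
    case 2: if (n & 4)  { memset(p, v, 4);  p += 4;  } /* fall through */
    case 1: if (n & 2)  { memset(p, v, 2);  p += 2;  } /* fall through */
    case 0: if (n & 1)  { *p = v; }
    }
}
```

With this entry point a 3-byte fill runs only the last two steps instead of all six.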

This PR https://github.com/openjdk/jdk/pull/25383 uses clz in that way.

It also uses an overlapping-store technique to reduce an O(lg N) tail to an O(1) tail, which also depends on the clz step.

When atomicity is not an issue, the overlapping-store technique is faster on my MacBook M1. It lets you (say) store 7 bytes in two cycles with no extra branches. The downside is that some bytes get stored twice (in the overlap), so it only works on unshared memory.
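
To make the overlap concrete, here is a hedged C sketch for fills of up to 16 bytes. The helper name `fill_small_overlap` and the size limit are assumptions for this example. The size class is picked with plain comparisons to keep the sketch short, where a clz could select it instead. Each size class issues two stores of the same width: one at the start and one ending exactly at the last byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill the n bytes at p with v, for n <= 16.
   The two stores in each class may overlap in the middle, so some
   bytes are written twice. Not suitable for shared memory. */
static void fill_small_overlap(uint8_t *p, size_t n, uint8_t v) {
    uint64_t w = 0x0101010101010101ULL * v;  /* v replicated into all 8 bytes */
    if (n >= 8) {                            /* 8..16 bytes */
        memcpy(p, &w, 8);
        memcpy(p + n - 8, &w, 8);
    } else if (n >= 4) {                     /* 4..7 bytes */
        memcpy(p, &w, 4);
        memcpy(p + n - 4, &w, 4);
    } else if (n >= 2) {                     /* 2..3 bytes */
        memcpy(p, &w, 2);
        memcpy(p + n - 2, &w, 2);
    } else if (n == 1) {
        *p = v;
    }
}
```

For n = 7 this is a 4-byte store at offset 0 and a 4-byte store at offset 3, with the byte at offset 3 written twice.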

My rough notes on the relative performance of overlapping loads and stores are here FWIW:
https://cr.openjdk.org/~jrose/jvm/PartialMemoryWord.cpp

BTW, overlapping loads (properly bit-masked) are just as atomic as loads of individual bytes, and much faster.  But that's not the topic here.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25147#issuecomment-2902463076