RFR: 8338967: Improve performance for MemorySegment::fill [v10]

Tue Sep 3 08:41:20 UTC 2024

On Mon, 2 Sep 2024 09:32:56 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>> If I run:
>> 
>> 
>>     @Benchmark
>>     public long shift() {
>>         return ELEM_SIZE << 56 | ELEM_SIZE << 48 | ELEM_SIZE << 40 | ELEM_SIZE << 32 | ELEM_SIZE << 24 | ELEM_SIZE << 16 | ELEM_SIZE << 8 | ELEM_SIZE;
>>     }
>> 
>>     @Benchmark
>>     public long mul() {
>>         return ELEM_SIZE * 0xFFFF_FFFF_FFFFL;
>>     }
>> 
>> Then I get:
>> 
>> Benchmark       (ELEM_SIZE)  Mode  Cnt  Score   Error  Units
>> TestFill.mul             31  avgt   30  0.586 ± 0.045  ns/op
>> TestFill.shift           31  avgt   30  0.938 ± 0.017  ns/op
>> 
>> On my M1 machine.
>
> I found similar small improvements to be had (I wrote about them offline) when replacing the bitwise-based tests (e.g. `(foo & 4) != 0`) with a more explicit check such as `remainingBytes >= 4`. It seems that bitwise operations are not as well optimized (or perhaps the assembly instructions for them are more convoluted overall; I haven't checked).

I've tried:

    final long longValue = Byte.toUnsignedLong(value) * 0x0101010101010101L;

but it had the same performance as explicit bit shifting on M1.
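
For reference, here is a minimal sketch of the two ways of replicating a byte across a `long` that are being compared. The class name `ByteBroadcast` and the helper names `viaShifts` and `viaMultiply` are made up for illustration and are not part of the PR; the sketch only checks that both forms give the same value and says nothing about their relative speed.

```java
// Hypothetical helper class, not taken from the PR.
final class ByteBroadcast {

    // Replicates the unsigned byte into all eight byte lanes using explicit shifts.
    static long viaShifts(byte value) {
        long b = Byte.toUnsignedLong(value);
        return b << 56 | b << 48 | b << 40 | b << 32 | b << 24 | b << 16 | b << 8 | b;
    }

    // Same result with one multiplication: the constant has a 1 in every byte
    // position, and since b <= 0xFF no partial product carries into the next lane.
    static long viaMultiply(byte value) {
        return Byte.toUnsignedLong(value) * 0x0101010101010101L;
    }

    public static void main(String[] args) {
        // Exhaustively compare both forms over all 256 byte values.
        for (int i = 0; i < 256; i++) {
            byte v = (byte) i;
            if (viaShifts(v) != viaMultiply(v)) {
                throw new AssertionError("mismatch for byte value " + i);
            }
        }
        // prints 1f1f1f1f1f1f1f1f
        System.out.println(Long.toHexString(viaMultiply((byte) 31)));
    }
}
```

Going through `Byte.toUnsignedLong` matters in both variants: a negative `byte` widened directly to `long` would be sign-extended and would corrupt the upper lanes.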

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20712#discussion_r1741664877