RFR: 8338967: Improve performance for MemorySegment::fill [v10]

Mon Sep 2 08:59:22 UTC 2024

On Fri, 30 Aug 2024 14:15:24 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>> src/java.base/share/classes/jdk/internal/foreign/AbstractMemorySegmentImpl.java line 208:
>> 
>>> 206:             }
>>> 207:             final long u = Byte.toUnsignedLong(value);
>>> 208:             final long longValue = u << 56 | u << 48 | u << 40 | u << 32 | u << 24 | u << 16 | u << 8 | u;
>> 
>> this can be u * 0xFFFFFFFFFFFFL if value != 0 and just 0L if not: not sure if fast(er), need to measure.
>> 
>> Most of the time filling is happy with 0 since zeroing is the most common case
>
> It's a clever trick. However, I was looking at similar tricks and found that the time spent here is irrelevant (e.g. I tried to always force `0` as the value, and couldn't see any difference).

If I run:

    @Benchmark
    public long shift() {
        return ELEM_SIZE << 56 | ELEM_SIZE << 48 | ELEM_SIZE << 40 | ELEM_SIZE << 32 | ELEM_SIZE << 24 | ELEM_SIZE << 16 | ELEM_SIZE << 8 | ELEM_SIZE;
    }

    @Benchmark
    public long mul() {
        return ELEM_SIZE * 0xFFFF_FFFF_FFFFL;
    }

Then I get:

Benchmark       (ELEM_SIZE)  Mode  Cnt  Score   Error  Units
TestFill.mul             31  avgt   30  0.586 ± 0.045  ns/op
TestFill.shift           31  avgt   30  0.938 ± 0.017  ns/op

This was measured on my M1 machine.
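
As a side note on correctness rather than speed: a multiply only reproduces the shift-and-OR pattern if the multiplier has exactly one bit set per byte. A minimal sketch of that check follows. It assumes the multiplier 0x0101010101010101L, which is not the constant used in the benchmark above, and the class and method names are made up for illustration.

```java
public class ByteBroadcast {

    // Replicates the byte into all eight lanes of a long with shifts and ORs,
    // mirroring the expression under review.
    static long broadcastShift(byte value) {
        final long u = Byte.toUnsignedLong(value);
        return u << 56 | u << 48 | u << 40 | u << 32 | u << 24 | u << 16 | u << 8 | u;
    }

    // Replicates the byte with a single multiply.
    // Assumption: the multiplier is 0x0101010101010101L (one bit per byte lane).
    // Each lane receives u * 1, and since u <= 0xFF no carry crosses a lane.
    static long broadcastMul(byte value) {
        return Byte.toUnsignedLong(value) * 0x0101_0101_0101_0101L;
    }

    public static void main(String[] args) {
        // Exhaustively compare both forms over every possible byte value.
        for (int i = 0; i < 256; i++) {
            byte b = (byte) i;
            if (broadcastShift(b) != broadcastMul(b)) {
                throw new AssertionError("mismatch for byte value " + i);
            }
        }
        System.out.println(Long.toHexString(broadcastMul((byte) 0x1F))); // prints 1f1f1f1f1f1f1f1f
    }
}
```

With that multiplier the zero case needs no special branch, since 0 * anything is 0L. Whether it is faster inside fill itself would still need to be measured.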

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20712#discussion_r1740564110