RFR: 8338967: Improve performance for MemorySegment::fill [v10]

Mon Sep 2 09:39:21 UTC 2024

On Mon, 2 Sep 2024 08:56:47 GMT, Per Minborg <pminborg at openjdk.org> wrote:

>>> this can be u * 0xFFFFFFFFFFFFL if value != 0 and just 0L if not: not sure if fast(er), need to measure.
>>> 
>>> Most of the time filling is happy with 0 since zeroing is the most common case
>> 
>> It's a clever trick. However, I was looking at similar tricks and found that the time spent here is irrelevant (e.g. I tried to always force `0` as the value, and couldn't see any difference).
>
> If I run:
> 
> 
>     @Benchmark
>     public long shift() {
>         return ELEM_SIZE << 56 | ELEM_SIZE << 48 | ELEM_SIZE << 40 | ELEM_SIZE << 32 | ELEM_SIZE << 24 | ELEM_SIZE << 16 | ELEM_SIZE << 8 | ELEM_SIZE;
>     }
> 
>     @Benchmark
>     public long mul() {
>         return ELEM_SIZE * 0xFFFF_FFFF_FFFFL;
>     }
> 
> Then I get:
> 
> Benchmark       (ELEM_SIZE)  Mode  Cnt  Score   Error  Units
> TestFill.mul             31  avgt   30  0.586 ± 0.045  ns/op
> TestFill.shift           31  avgt   30  0.938 ± 0.017  ns/op
> 
> On my M1 machine.
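
For reference, here is a minimal sketch of the two ways to spread one byte across a `long`, assuming the goal is the value the shift chain produces. The class and method names are hypothetical and not taken from the PR. Note that the multiplier matching the shift chain is `0x0101_0101_0101_0101L`; `0xFFFF_FFFF_FFFFL` yields a different value, so the two benchmark methods above do not return the same result, even if the timings still hint at the relative cost of a shift/or chain versus a single multiply.

```java
// Hypothetical helper, not code from the PR: two ways to replicate a byte
// into all eight byte lanes of a long.
class ByteBroadcast {

    // Shift/or chain: place the unsigned byte into each of the 8 lanes.
    static long broadcastShift(byte value) {
        long u = value & 0xFFL; // widen without sign extension
        return u << 56 | u << 48 | u << 40 | u << 32
             | u << 24 | u << 16 | u << 8 | u;
    }

    // Single multiply: each lane receives u * 1, and since u < 256
    // no carry crosses from one lane into the next.
    static long broadcastMul(byte value) {
        return (value & 0xFFL) * 0x0101_0101_0101_0101L;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xAB;
        System.out.println(Long.toHexString(broadcastShift(b)));
        System.out.println(Long.toHexString(broadcastMul(b)));
    }
}
```

Both methods agree for every byte value, and zero maps to `0L` without a special case.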

I found similar small improvements (I wrote about them offline) when replacing the bitwise tests (e.g. `(foo & 4) != 0`) with a more explicit check such as `remainingBytes >= 4`. It seems that bitwise operations are not as well optimized (or perhaps the assembly generated for them is more convoluted overall; I haven't checked).
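
To make the two styles concrete, here is a rough sketch of a tail fill for fewer than 8 remaining bytes. It assumes a plain `ByteBuffer` destination; the class, method, and parameter names are hypothetical, and the actual code in the PR is structured differently.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration, not code from the PR.
class TailFill {

    // Bitwise style: test individual bits of the remaining count (0..7).
    static void tailBitwise(ByteBuffer dst, int offset, int remaining, byte v) {
        int u = v & 0xFF;
        if ((remaining & 4) != 0) {
            dst.putInt(offset, u * 0x0101_0101); // 4 copies of the byte
            offset += 4;
        }
        if ((remaining & 2) != 0) {
            dst.putShort(offset, (short) (u * 0x0101)); // 2 copies
            offset += 2;
        }
        if ((remaining & 1) != 0) {
            dst.put(offset, v);
        }
    }

    // Comparison style: explicit range checks, decrementing the count as we go.
    static void tailCompare(ByteBuffer dst, int offset, int remaining, byte v) {
        int u = v & 0xFF;
        if (remaining >= 4) {
            dst.putInt(offset, u * 0x0101_0101);
            offset += 4;
            remaining -= 4;
        }
        if (remaining >= 2) {
            dst.putShort(offset, (short) (u * 0x0101));
            offset += 2;
            remaining -= 2;
        }
        if (remaining >= 1) {
            dst.put(offset, v);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        tailCompare(buf, 0, 7, (byte) 0x5A);
        System.out.println(java.util.Arrays.toString(buf.array()));
    }
}
```

The two variants write the same bytes for any count below 8; which one the JIT compiles to tighter code would need to be measured per platform.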

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20712#discussion_r1740612559