RFR: 8338967: Improve performance for MemorySegment::fill [v5]

Fri Aug 30 12:18:20 UTC 2024

On Wed, 28 Aug 2024 15:32:40 GMT, Francesco Nigro <duke at openjdk.org> wrote:

>>> How fast do we need to be here given we are measuring in a few nanoseconds per operation?
>>> 
>>> What if the goal is not to regress from say explicitly filling in a small sized segment or a comparable array (e.g., < 8 bytes) then maybe a loop suffices and the code is simple?
>> 
>> Fair question. I have another version (called "patch bits" below) that is based on bit logic (first doing int ops, then short and lastly byte, similar to `ArraySupport::vectorizedMismatch`). This has slightly worse performance but is more scalable and perhaps simpler.
>> 
>> ![image](https://github.com/user-attachments/assets/292c75aa-0df8-4bb7-b45f-426d0f8470d9)
>
> @minborg Hi! I haven't checked the numbers with the benchmark I wrote at https://github.com/openjdk/jdk/pull/20712#discussion_r1732802685 which is meant to stress the branch predictor (without enough `samples` i.e. past 128K on my machine) - can you give it a shot with M1 🙏 ?

@franz1981 Here is what I get if I run your performance test on my M1 Mac (unfortunately no -perf data):

Benchmark                         (samples)  (shuffle)  Mode  Cnt        Score       Error  Units
TestBranchFill.heap_segment_fill       1024      false  avgt   30     3695.815 ±    24.615  ns/op
TestBranchFill.heap_segment_fill       1024       true  avgt   30     3938.582 ±   124.510  ns/op
TestBranchFill.heap_segment_fill     128000      false  avgt   30   420845.301 ±  1605.080  ns/op
TestBranchFill.heap_segment_fill     128000       true  avgt   30  1778362.506 ± 39250.756  ns/op
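
For readers following along, the "patch bits" idea quoted above (int ops first, then short, and lastly byte) could look roughly like the sketch below. This is only an illustration on a plain `byte[]` via `ByteBuffer`, not the code in the PR; the class name `PatchBitsFillSketch` and the method `fill` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not the PR's implementation.
final class PatchBitsFillSketch {

    // Fills dst[offset, offset + length) with value, using the bits of
    // length to decide which tail stores are needed.
    static void fill(byte[] dst, int offset, int length, byte value) {
        ByteBuffer bb = ByteBuffer.wrap(dst).order(ByteOrder.nativeOrder());
        int b = value & 0xFF;
        int intPattern = b * 0x01010101;           // byte replicated 4 times
        short shortPattern = (short) (b * 0x0101); // byte replicated 2 times

        int pos = offset;
        int intEnd = offset + (length & ~3);       // largest multiple of 4
        for (; pos < intEnd; pos += Integer.BYTES) {
            bb.putInt(pos, intPattern);
        }
        if ((length & 2) != 0) {                   // 2 or 3 bytes remain
            bb.putShort(pos, shortPattern);
            pos += Short.BYTES;
        }
        if ((length & 1) != 0) {                   // one last byte remains
            bb.put(pos, value);
        }
    }
}
```

The tail takes at most two conditional stores regardless of length, which is the property that makes this shape less sensitive to unpredictable sizes than a per-byte loop.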

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20712#issuecomment-2321048180