RFR: 8338967: Improve performance for MemorySegment::fill [v4]

Mon Aug 26 14:28:06 UTC 2024

On Mon, 26 Aug 2024 13:29:40 GMT, Per Minborg <pminborg at openjdk.org> wrote:

>> src/java.base/share/classes/jdk/internal/foreign/AbstractMemorySegmentImpl.java line 200:
>> 
>>> 198:             switch ((int) length) {
>>> 199:                 case 0 : checkReadOnly(false); checkValidState(); break; // Explicit tests
>>> 200:                 case 1 : set(JAVA_BYTE, 0, value); break;
>> 
>> Beware of using a switch: if this code is too big to be inlined (or we're unlucky), performance will suffer from branch mispredictions when the different "small fill" sizes are unstable/unpredictable.
>> A test that feeds a different fill size on each iteration and counts branch misses would reveal whether the improvement holds up even in such cases.
>
> It is true that this is a compromise: we give up inline budget and code-cache space and add complexity in exchange for the prospect of better small-size performance. Depending on the workload, this may or may not pay off. In the (presumably common) case where we allocate/fill small segments of constant sizes, this is likely a win. Writing a dynamic performance test sounds like a good idea.
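
To make the trade-off concrete, here is a minimal sketch of the kind of small-size dispatch under discussion. It works on a plain byte[] rather than a MemorySegment, and the names SmallFillSketch and fill are hypothetical; it is not the code in AbstractMemorySegmentImpl, only an illustration of a switch on the length with a bulk fallback.

```java
import java.util.Arrays;

public class SmallFillSketch {

    // Hypothetical helper: fills the first `length` bytes of `dst` with `value`.
    // Small lengths get a dedicated switch case; larger ones use a bulk fill.
    public static void fill(byte[] dst, int length, byte value) {
        if (length < 0 || length > dst.length) {
            throw new IndexOutOfBoundsException("length: " + length);
        }
        switch (length) {
            case 0: break;                                   // nothing to write
            case 1: dst[0] = value; break;
            case 2: dst[0] = value; dst[1] = value; break;
            case 3: dst[0] = value; dst[1] = value; dst[2] = value; break;
            default: Arrays.fill(dst, 0, length, value);     // bulk path
        }
    }

    public static void main(String[] args) {
        byte[] buf = new byte[8];
        fill(buf, 3, (byte) 7);
        System.out.println(Arrays.toString(buf)); // prints [7, 7, 7, 0, 0, 0, 0, 0]
    }
}
```

If the length is the same on every call, the switch always takes the same case; if it varies unpredictably, each call may take a different case, which is the situation the reviewer is worried about.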

Here is a benchmark that fills 16 heap segments of pseudo-random sizes between 0 and 7 bytes:

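import java.lang.foreign.MemorySegment;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)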
@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 3)
public class TestFill {

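    // Number of segments filled per benchmark invocation
    private static final int SIZE = 16;
    // Segment sizes in bytes, drawn from [0, 8) with a fixed seed
    private static final int[] INDICES = new Random(42).ints(0, 8)
            .limit(SIZE)
            .toArray();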

    private MemorySegment[] segments;

    @Setup
    public void setup() {
        segments = IntStream.of(INDICES)
                .mapToObj(i -> MemorySegment.ofArray(new byte[i]))
                .toArray(MemorySegment[]::new);
    }

    @Benchmark
    public void heap_segment_fill() {
        for (int i = 0; i < SIZE; i++) {
            segments[i].fill((byte) 0);
        }
    }

}

This produces the following on my Mac M1:

Benchmark                   Mode  Cnt   Score   Error  Units
TestFill.heap_segment_fill  avgt   30  59.054 ± 3.723  ns/op

With 16 fills per invocation, that averages out to 59/16 ≈ 3.7 ns per fill.
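
Because the sizes come from a Random with a fixed seed, the same 16 sizes are filled in the same order on every invocation. The following stand-alone sketch (the class name FillSizes and its methods are hypothetical, not part of the benchmark) prints those sizes and repeats the per-fill arithmetic from the reported score:

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Random;

public class FillSizes {

    public static final int SIZE = 16;

    // Same generator expression as in the benchmark: 16 values in [0, 8), seed 42.
    public static int[] sizes() {
        return new Random(42).ints(0, 8).limit(SIZE).toArray();
    }

    // Average time per fill, given the JMH score for one invocation of SIZE fills.
    public static double perFillNanos(double scoreNsPerOp) {
        return scoreNsPerOp / SIZE;
    }

    public static void main(String[] args) {
        int[] sizes = sizes();
        long distinct = Arrays.stream(sizes).distinct().count();
        System.out.println("sizes    = " + Arrays.toString(sizes));
        System.out.println("distinct = " + distinct);
        // 59.054 ns/op is the score reported above
        System.out.printf(Locale.ROOT, "per fill = %.2f ns%n", perFillNanos(59.054));
    }
}
```

Looking at the printed sequence shows which sizes, and hence which small-size paths, the benchmark actually exercises.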

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20712#discussion_r1731331461