RFR: 8338967: Improve performance for MemorySegment::fill [v5]

Fri Aug 30 21:23:25 UTC 2024

On Fri, 30 Aug 2024 15:31:26 GMT, Francesco Nigro <duke at openjdk.org> wrote:

> Good point: relative to the baseline, no, because the new version improves regardless, even when the new version incurs many branch misses.

My feeling is that the intrinsic we have under the hood must be doing some similar branching to fix up the tail of the loop. In a way, what you are measuring is the worst possible case: a method that works on segments of different sizes, all of which are too small to benefit much from loop optimizations. Because of that, the cost of branching dominates everything. I think some jitter is unavoidable for small sizes (e.g. even if we could write a single loop using `byte` and C2 auto-vectorized all of it, there would still be a loop tail where we need to fill in the contents for some remainder size, and that logic is going to be [branchy](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/util/ArraysSupport.java#L146)).
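To make the "branchy tail" point concrete, here is a minimal sketch of a fill that writes 8 bytes at a time and then fixes up the remainder. It is not the JDK code and not the intrinsic: the class `TailFillSketch` and its `fill` method are hypothetical, and a heap `ByteBuffer` stands in for a memory segment. The shape is the interesting part: for small sizes the main loop barely runs, so the three data-dependent tail branches are most of the work.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not the JDK implementation.
final class TailFillSketch {

    // Fills the whole buffer (indices 0..capacity-1) with `value`.
    static void fill(ByteBuffer buf, byte value) {
        int len = buf.capacity();
        // Replicate the byte into all 8 lanes of a long. Every byte of the
        // pattern is identical, so the buffer's byte order does not matter.
        long pattern = (value & 0xFFL) * 0x0101010101010101L;

        // Main loop: one 8-byte store per iteration.
        int i = 0;
        for (; i + Long.BYTES <= len; i += Long.BYTES) {
            buf.putLong(i, pattern);
        }

        // Tail fix-up: at most 7 bytes remain, handled by up to three
        // branches whose outcome depends on the (varying) length.
        int rem = len - i;
        if ((rem & 4) != 0) {
            buf.putInt(i, (int) pattern);
            i += Integer.BYTES;
        }
        if ((rem & 2) != 0) {
            buf.putShort(i, (short) pattern);
            i += Short.BYTES;
        }
        if ((rem & 1) != 0) {
            buf.put(i, value);
        }
    }
}
```

When the length changes from call to call, as in the benchmark being discussed, the outcomes of those tail branches change too, which is one plausible source of the branch misses.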

On the other hand, the main point of this PR is to avoid the intrinsic for segments smaller than a certain size, as jumping into the intrinsic seems to have some fixed cost that makes it not worth it for such small segments (a similar situation arises for `copy` and `mismatch`, to an even greater extent).
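The dispatch idea can be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: `ThresholdFillSketch` and `SMALL_FILL_THRESHOLD` are hypothetical names, the value 32 is a placeholder rather than the cutoff the PR uses, and `Arrays.fill` on a `byte[]` merely stands in for the bulk path that would jump into the intrinsic.

```java
import java.util.Arrays;

// Hypothetical illustration only; names and threshold are assumptions.
final class ThresholdFillSketch {

    // Placeholder cutoff; the real value would be chosen by benchmarking.
    static final int SMALL_FILL_THRESHOLD = 32;

    static void fill(byte[] dst, byte value) {
        if (dst.length < SMALL_FILL_THRESHOLD) {
            // Small case: a plain Java loop, avoiding the fixed cost of
            // entering the bulk routine.
            for (int i = 0; i < dst.length; i++) {
                dst[i] = value;
            }
        } else {
            // Large case: delegate to the bulk routine (stand-in for the
            // intrinsic-backed path), where the fixed cost is amortized.
            Arrays.fill(dst, value);
        }
    }
}
```

Both paths produce the same result; only the cost profile differs, so the threshold is purely a performance tuning knob.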

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20712#issuecomment-2322353850