RFR: 8338967: Improve performance for MemorySegment::fill [v10]

Fri Aug 30 22:07:24 UTC 2024

On Fri, 30 Aug 2024 10:51:59 GMT, Per Minborg <pminborg at openjdk.org> wrote:

>> The performance of `MemorySegment::fill` can be improved by replacing the `checkAccess()` method call with a call to `checkReadOnly()` (as the bounds of the segment itself do not need to be checked).
>> 
>> Also, smaller segments can be handled directly by Java code rather than transitioning to native code.
>> 
>> Here is how the `MemorySegment::fill` performance is improved by this PR:
>> 
>> ![image](https://github.com/user-attachments/assets/ee29fdf0-a7cf-4d5b-bb6b-278b01d97e3c)
>> 
>> Operations involving 8 or more bytes are delegated to native code whereas smaller segments are handled via a switch rake.
>> 
>> It should be noted that `Arena::allocate` uses `MemorySegment::fill`. Hence, this PR will also have a positive effect on memory allocation performance.
>
> Per Minborg has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Revert copyright year
>  - Move logic back to AMSI

It is a good analysis; effectively, even fill will likely have to handle the head/tail for the remainder bytes, and this will eventually lead to more or less branchy code: a tight loop, a series of ifs with byte-by-byte writes (7 ifs), or the approach taken in this PR.
All of these strategies are better than what we have now, probably because the existing intrinsic still makes some poor decisions, but I haven't dug into the perfasm output yet to see what it does wrong; maybe it is something that could be fixed in the intrinsic itself?
That said, it could be interesting to check the 3 approaches I have mentioned against both predictable and unpredictable workloads. I see pros and cons in all of them, TBH, although just as an academic exercise.
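
To make the three tail-handling strategies concrete, here is a minimal sketch over a plain `byte[]` rather than a `MemorySegment`. The class and method names are hypothetical, and the switch variant only follows the spirit of the description in the PR; it is not the PR's actual code.

```java
// Hypothetical illustration of three ways to write a tail of 0..7 bytes.
final class TailFillStrategies {

    // Strategy 1: tight loop, one data-dependent loop branch per byte.
    static void fillLoop(byte[] a, int offset, int len, byte v) {
        for (int i = 0; i < len; i++) {
            a[offset + i] = v;
        }
    }

    // Strategy 2: a series of 7 ifs, each guarding a single byte write.
    static void fillIfs(byte[] a, int offset, int len, byte v) {
        if (len > 0) a[offset] = v;
        if (len > 1) a[offset + 1] = v;
        if (len > 2) a[offset + 2] = v;
        if (len > 3) a[offset + 3] = v;
        if (len > 4) a[offset + 4] = v;
        if (len > 5) a[offset + 5] = v;
        if (len > 6) a[offset + 6] = v;
    }

    // Strategy 3: one switch on the length, with deliberate fall-through.
    static void fillSwitch(byte[] a, int offset, int len, byte v) {
        switch (len) {
            case 7: a[offset + 6] = v; // fall through
            case 6: a[offset + 5] = v; // fall through
            case 5: a[offset + 4] = v; // fall through
            case 4: a[offset + 3] = v; // fall through
            case 3: a[offset + 2] = v; // fall through
            case 2: a[offset + 1] = v; // fall through
            case 1: a[offset] = v;     // fall through
            case 0: break;
            default:
                throw new IllegalArgumentException("len must be in [0, 7]: " + len);
        }
    }
}
```

All three write the same bytes; they differ only in the shape of the branches the JIT and the branch predictor get to see, which is why comparing them on predictable and unpredictable lengths would be the interesting part.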

One quick question: reading https://bugs.openjdk.org/browse/JDK-8139457, it appears to me that we could avoid being branchy via some unsafe mechanism.
If a single byte[] still needs to be 8-byte (or 16-byte?) aligned, could we just use a long and write past the end of the array? Is that a safe assumption?
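
Leaving the question of writing past the end open, here is a sketch of a different technique that also avoids a branchy tail while staying in bounds: for lengths of at least 8, the last `long` write is placed so that it ends exactly at the last byte and overlaps bytes that were already filled. It uses `MethodHandles.byteArrayViewVarHandle` on a `byte[]`; the class name is hypothetical and this is not what the PR does.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical illustration: fill a byte[] with long-sized writes and an
// overlapping final write, so no store ever goes outside the array.
final class OverlappingTailFill {

    // Views a byte[] as longs; plain get/set are allowed at unaligned indices.
    private static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    static void fill(byte[] a, byte value) {
        int len = a.length;
        if (len < 8) {
            // Too short for a single long write; fall back to a simple loop.
            for (int i = 0; i < len; i++) {
                a[i] = value;
            }
            return;
        }
        // Replicate the byte into all 8 lanes of a long.
        long pattern = (value & 0xFFL) * 0x0101010101010101L;
        // Whole 8-byte chunks from the start of the array.
        for (int i = 0; i <= len - 8; i += 8) {
            LONG_VIEW.set(a, i, pattern);
        }
        // Final write covers the last 8 bytes; it may overlap the previous
        // chunk (and is redundant when len is a multiple of 8).
        LONG_VIEW.set(a, len - 8, pattern);
    }
}
```

The overlap works only because fill writes the same value everywhere; the sub-8-byte case still needs one of the branchy strategies.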

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20712#issuecomment-2322414069