[External] : Re: MemorySegment.ofAddress(...).reinterpret(...)
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Jul 25 14:27:39 UTC 2023
It occurred to me that, rather than speeding up access using an ALL
segment, we could use a very different tactic.
Instead of having a segment that encompasses the entire native heap, we
could build ad-hoc segments around the offset that needs to be accessed,
like this:
    long offset = < accessed offset >;
    MemorySegment wrapper = MemorySegment.ofAddress(offset).reinterpret(8);
    // 8 is a size that is "big enough" for all accesses. Using another value is probably ok too
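To make the idea concrete, here is a small runnable sketch of the wrapping trick. The Arena allocation below is just a stand-in for whatever native memory the raw address points to, and note that reinterpret is a restricted method, so the launcher may print a native-access warning (JDK 22+):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_LONG;

public class AdHocWrapper {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // stand-in for some native memory we only know by raw address
            MemorySegment mem = arena.allocate(JAVA_LONG);
            mem.set(JAVA_LONG, 0, 42L);

            long offset = mem.address();   // the raw address we want to access
            // build a throwaway segment around the address, big enough
            // for this single access (restricted operation)
            MemorySegment wrapper = MemorySegment.ofAddress(offset).reinterpret(8);
            System.out.println(wrapper.get(JAVA_LONG, 0)); // prints 42
        }
    }
}
```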
One might argue that this code would end up allocating a lot more
objects. While scalarization would typically do a good job at addressing
issues such as this (as the allocated segment is discarded immediately
after use), it is possible that scalarization might fail because the
client code is not inlined. But here’s the trick: we can move all
that mechanical code /inside/ the var handle itself. This way, all the
allocation occurs inside the var handle bubble, which has much more lax
inlining budget (as do method handles). In practice I don’t think it
should be possible to observe allocation when using such an adapted var
handle.
So, how do we adapt a plain memory segment var handle so that we can use
it with any offset (even negative ones!)?
First, let’s define some method that does the above adaptation of a long
address into a memory segment:
    static MemorySegment ofAddressUnsafe(long address) {
        return MemorySegment.ofAddress(address).reinterpret(8);
    }
Then, let’s create a method handle that points there:
    static final MethodHandle OF_ADDRESS_UNSAFE = ...
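One plausible way to fill in that lookup is a findStatic call against the enclosing class (the class name Adaptation below is an assumption for the sketch, not from the original post):

```java
import java.lang.foreign.MemorySegment;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class Adaptation {
    static MemorySegment ofAddressUnsafe(long address) {
        return MemorySegment.ofAddress(address).reinterpret(8);
    }

    // a method handle pointing at ofAddressUnsafe: (long) -> MemorySegment
    static final MethodHandle OF_ADDRESS_UNSAFE;
    static {
        try {
            OF_ADDRESS_UNSAFE = MethodHandles.lookup().findStatic(
                    Adaptation.class, "ofAddressUnsafe",
                    MethodType.methodType(MemorySegment.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(OF_ADDRESS_UNSAFE.type()); // (long)MemorySegment
    }
}
```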
Then, create the memory segment var handle as follows:
    static final VarHandle BYTE_HANDLE = adaptSegmentHandle(JAVA_BYTE.varHandle());
(note: this is not limited to var handles using JAVA_BYTE; you can use
any value layout).
Where |adaptSegmentHandle| is defined as follows:
    static VarHandle adaptSegmentHandle(VarHandle handle) {
        handle = MethodHandles.insertCoordinates(handle, 1, 0L);
        handle = MethodHandles.filterCoordinates(handle, 0, OF_ADDRESS_UNSAFE);
        return handle;
    }
What the above code does is:
1. it injects 0L as the offset used by the var handle
2. it wraps the incoming long address into a segment (using
OF_ADDRESS_UNSAFE), which is then injected into the var handle
In other words, we now have a var handle that takes a simple long
parameter (no more segments). This parameter (either positive, or
negative) will be wrapped in a fresh memory segment with a known big
enough bound. Since all the allocation occurs inside the var handle, we
expect the JIT to reliably scalarize all the memory segment instances we
create in the process.
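Putting the pieces together, the whole adaptation might look like the following self-contained sketch (JDK 22+, where insertCoordinates/filterCoordinates are final API; the class name and the test memory are assumptions, and reinterpret may trigger a native-access warning):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.VarHandle;
import static java.lang.foreign.ValueLayout.JAVA_BYTE;

public class AdaptedHandle {
    static MemorySegment ofAddressUnsafe(long address) {
        return MemorySegment.ofAddress(address).reinterpret(8);
    }

    static final MethodHandle OF_ADDRESS_UNSAFE;
    static {
        try {
            OF_ADDRESS_UNSAFE = MethodHandles.lookup().findStatic(
                    AdaptedHandle.class, "ofAddressUnsafe",
                    MethodType.methodType(MemorySegment.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static VarHandle adaptSegmentHandle(VarHandle handle) {
        handle = MethodHandles.insertCoordinates(handle, 1, 0L);       // inject 0L offset
        handle = MethodHandles.filterCoordinates(handle, 0, OF_ADDRESS_UNSAFE); // long -> segment
        return handle;
    }

    static final VarHandle BYTE_HANDLE = adaptSegmentHandle(JAVA_BYTE.varHandle());

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment mem = arena.allocate(4);
            mem.set(JAVA_BYTE, 3, (byte) 7);
            // the adapted handle takes a plain long address, no segment in sight
            byte b = (byte) BYTE_HANDLE.get(mem.address() + 3);
            System.out.println(b); // prints 7
        }
    }
}
```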
And, indeed, the binary search benchmark does a remarkable jump towards
Unsafe-like performance:
    Benchmark                          Mode  Cnt   Score   Error  Units
    BinarySearch.binarySearch_panama   avgt   30  27.871 ± 0.350  ns/op
    BinarySearch.binarySearch_unsafe   avgt   30  27.470 ± 0.145  ns/op
GC activity is zero (everything is scalarized), and that should happen
quite reliably (i.e. it is not just an artifact of using JMH).
This approach should also scale to memory copy and other static routines
defined in MemorySegment (I have not tried that). E.g. one could obtain
a method handle that points to MemorySegment::copy, and then adapt the
incoming MS parameters in the same way.
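Since the author has not tried the copy case, the following is only a rough sketch of what that adaptation might look like. Note one wrinkle: a copy can span an arbitrary range, so the 8-byte wrapper used for scalar access would fail the bounds check; here a hypothetical unbounded wrapper is used instead. Nothing restricted is actually invoked, since the handle is only built and inspected:

```java
import java.lang.foreign.MemorySegment;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class CopyAdaptation {
    // unlike the 8-byte wrapper used for scalar access, a copy can span
    // an arbitrary range, so this sketch reinterprets with an unbounded size
    static MemorySegment ofAddressUnbounded(long address) {
        return MemorySegment.ofAddress(address).reinterpret(Long.MAX_VALUE);
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle wrap = MethodHandles.lookup().findStatic(
                CopyAdaptation.class, "ofAddressUnbounded",
                MethodType.methodType(MemorySegment.class, long.class));

        // MemorySegment.copy(MemorySegment, long, MemorySegment, long, long)
        MethodHandle copy = MethodHandles.lookup().findStatic(
                MemorySegment.class, "copy",
                MethodType.methodType(void.class,
                        MemorySegment.class, long.class,
                        MemorySegment.class, long.class, long.class));

        // wrap both segment parameters so callers pass raw long addresses
        copy = MethodHandles.filterArguments(copy, 0, wrap);
        copy = MethodHandles.filterArguments(copy, 2, wrap);
        System.out.println(copy.type()); // (long,long,long,long,long)void
    }
}
```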
Cheers
Maurizio
On 21/07/2023 18:08, Maurizio Cimadamore wrote:
Hi Brian, I have been playing some more with the benchmark you
provided here:
https://gist.github.com/broneill/3a39051635e3cb758d0cca5a963c685e
The following patch seems to help quite significantly:
https://cr.openjdk.org/~mcimadamore/panama/segment_normalize_offsets.patch
Here are the benchmark results:
(before)
    Benchmark                          Mode  Cnt   Score   Error  Units
    BinarySearch.binarySearch_panama   avgt   30  61.404 ± 1.315  ns/op
    BinarySearch.binarySearch_unsafe   avgt   30  28.099 ± 0.330  ns/op
(after)
    Benchmark                          Mode  Cnt   Score   Error  Units
    BinarySearch.binarySearch_panama   avgt   30  38.103 ± 0.471  ns/op
    BinarySearch.binarySearch_unsafe   avgt   30  27.556 ± 0.208  ns/op
We believe there must be some C2 issue lurking in there - basically
we’re forcing all offsets to be positive using this:
    @ForceInline
    private static long normalize(long offset) {
        if (offset < 0) {
            throw new IndexOutOfBoundsException("Offset is < 0");
        }
        return offset & Long.MAX_VALUE;
    }
Now, surprisingly, if the "& Long.MAX_VALUE" is dropped, performance
dips again. I believe C2 loses track of the positivity of the offset
there, and the above patch provides some kind of workaround.
I’d be interested to know if this patch helps the situation with
your bigger benchmark, or if things stay the same (in which case, we
would have to conclude that this benchmark, while interesting, is
perhaps not reflective of the issues in your bigger codebase).
Thanks Maurizio
On 06/07/2023 23:19, Brian S O’Neill wrote:
When I use the “ALL” MemorySegment instead of allocating
MemorySegments on the fly for copies, the performance regression
drops to ~2%.
More information about the panama-dev
mailing list