[External] : Re: MemorySegment.ofAddress(...).reinterpret(...)
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Jul 25 14:27:39 UTC 2023
It occurred to me that, rather than speeding up access using an ALL
segment, we could use a very different tactic.
Instead of having a segment that encompasses the entire native heap, we
could build ad-hoc segments around the offset that needs to be accessed,
like this:
    long offset = < accessed offset >;
    MemorySegment wrapper = MemorySegment.ofAddress(offset).reinterpret(8);
    // 8 is a size that is "big enough" for all accesses. Using another value is probably ok too
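To make the idea concrete, here is a small runnable sketch of the wrapping trick. The Arena allocation below is just a stand-in for whatever native memory the raw address points to, and note that reinterpret is a restricted method, so the launcher may print a native-access warning (JDK 22+):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_LONG;

public class AdHocWrapper {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // stand-in for some native memory we only know by raw address
            MemorySegment mem = arena.allocate(JAVA_LONG);
            mem.set(JAVA_LONG, 0, 42L);

            long offset = mem.address();   // the raw address we want to access
            // build a throwaway segment around the address, big enough
            // for this single access (restricted operation)
            MemorySegment wrapper = MemorySegment.ofAddress(offset).reinterpret(8);
            System.out.println(wrapper.get(JAVA_LONG, 0)); // prints 42
        }
    }
}
```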
One might argue that this code would end up allocating a lot more
objects. While scalarization would typically do a good job at addressing
issues such as this (as the allocated segment is discarded immediately
after use), it is possible that scalarization might fail because the
client code is not inlined. But here’s the trick: we can move all
that mechanical code /inside/ the var handle itself. This way, all the
allocation occurs inside the var handle bubble, which has much more lax
inlining budget (as do method handles). In practice I don’t think it
should be possible to observe allocation when using such an adapted var
handle.
So, how do we adapt a plain memory segment var handle so that we can use
it with any offset (even negative ones!)?
First, let’s define some method that does the above adaptation of a long
address into a memory segment:
    static MemorySegment ofAddressUnsafe(long address) {
        return MemorySegment.ofAddress(address).reinterpret(8);
    }
Then, let’s create a method handle that points there:
    static final MethodHandle OF_ADDRESS_UNSAFE = ...
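One plausible way to fill in that lookup is a findStatic call against the enclosing class (the class name Adaptation below is an assumption for the sketch, not from the original post):

```java
import java.lang.foreign.MemorySegment;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class Adaptation {
    static MemorySegment ofAddressUnsafe(long address) {
        return MemorySegment.ofAddress(address).reinterpret(8);
    }

    // a method handle pointing at ofAddressUnsafe: (long) -> MemorySegment
    static final MethodHandle OF_ADDRESS_UNSAFE;
    static {
        try {
            OF_ADDRESS_UNSAFE = MethodHandles.lookup().findStatic(
                    Adaptation.class, "ofAddressUnsafe",
                    MethodType.methodType(MemorySegment.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(OF_ADDRESS_UNSAFE.type()); // (long)MemorySegment
    }
}
```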
Then, create the memory segment var handle as follows:
    static final VarHandle BYTE_HANDLE = adaptSegmentHandle(JAVA_BYTE.varHandle());
(note: this is not limited to var handles using JAVA_BYTE; you can use
any value layout).
Where |adaptSegmentHandle| is defined as follows:
    static VarHandle adaptSegmentHandle(VarHandle handle) {
        handle = MethodHandles.insertCoordinates(handle, 1, 0L);
        handle = MethodHandles.filterCoordinates(handle, 0, OF_ADDRESS_UNSAFE);
        return handle;
    }
What the above code does is:
1. it injects 0L as the offset used by the var handle
2. it wraps the incoming long address into a segment (using
OF_ADDRESS_UNSAFE), which is then injected into the var handle
In other words, we now have a var handle that takes a simple long
parameter (no more segments). This parameter (either positive, or
negative) will be wrapped in a fresh memory segment with a known big
enough bound. Since all the allocation occurs inside the var handle, we
expect the JIT to reliably scalarize all the memory segment instances we
create in the process.
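Putting the pieces together, the whole adaptation might look like the following self-contained sketch (JDK 22+, where insertCoordinates/filterCoordinates are final API; the class name and the test memory are assumptions, and reinterpret may trigger a native-access warning):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.VarHandle;
import static java.lang.foreign.ValueLayout.JAVA_BYTE;

public class AdaptedHandle {
    static MemorySegment ofAddressUnsafe(long address) {
        return MemorySegment.ofAddress(address).reinterpret(8);
    }

    static final MethodHandle OF_ADDRESS_UNSAFE;
    static {
        try {
            OF_ADDRESS_UNSAFE = MethodHandles.lookup().findStatic(
                    AdaptedHandle.class, "ofAddressUnsafe",
                    MethodType.methodType(MemorySegment.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static VarHandle adaptSegmentHandle(VarHandle handle) {
        handle = MethodHandles.insertCoordinates(handle, 1, 0L);       // inject 0L offset
        handle = MethodHandles.filterCoordinates(handle, 0, OF_ADDRESS_UNSAFE); // long -> segment
        return handle;
    }

    static final VarHandle BYTE_HANDLE = adaptSegmentHandle(JAVA_BYTE.varHandle());

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment mem = arena.allocate(4);
            mem.set(JAVA_BYTE, 3, (byte) 7);
            // the adapted handle takes a plain long address, no segment in sight
            byte b = (byte) BYTE_HANDLE.get(mem.address() + 3);
            System.out.println(b); // prints 7
        }
    }
}
```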
And, indeed, the binary search benchmark does a remarkable jump towards
Unsafe-like performance:
    Benchmark                          Mode  Cnt   Score   Error  Units
    BinarySearch.binarySearch_panama   avgt   30  27.871 ± 0.350  ns/op
    BinarySearch.binarySearch_unsafe   avgt   30  27.470 ± 0.145  ns/op
GC activity is zero (everything is scalarized), and that should happen
quite reliably (i.e. it is not just an artifact of using JMH).
This approach should also scale to memory copy and other static routines
defined in MemorySegment (I have not tried that). E.g. one could obtain
a method handle that points to MemorySegment::copy, and then adapt the
incoming MS parameters in the same way.
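Since the author has not tried the copy case, the following is only a rough sketch of what that adaptation might look like. Note one wrinkle: a copy can span an arbitrary range, so the 8-byte wrapper used for scalar access would fail the bounds check; here a hypothetical unbounded wrapper is used instead. Nothing restricted is actually invoked, since the handle is only built and inspected:

```java
import java.lang.foreign.MemorySegment;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class CopyAdaptation {
    // unlike the 8-byte wrapper used for scalar access, a copy can span
    // an arbitrary range, so this sketch reinterprets with an unbounded size
    static MemorySegment ofAddressUnbounded(long address) {
        return MemorySegment.ofAddress(address).reinterpret(Long.MAX_VALUE);
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle wrap = MethodHandles.lookup().findStatic(
                CopyAdaptation.class, "ofAddressUnbounded",
                MethodType.methodType(MemorySegment.class, long.class));

        // MemorySegment.copy(MemorySegment, long, MemorySegment, long, long)
        MethodHandle copy = MethodHandles.lookup().findStatic(
                MemorySegment.class, "copy",
                MethodType.methodType(void.class,
                        MemorySegment.class, long.class,
                        MemorySegment.class, long.class, long.class));

        // wrap both segment parameters so callers pass raw long addresses
        copy = MethodHandles.filterArguments(copy, 0, wrap);
        copy = MethodHandles.filterArguments(copy, 2, wrap);
        System.out.println(copy.type()); // (long,long,long,long,long)void
    }
}
```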
Cheers
Maurizio
On 21/07/2023 18:08, Maurizio Cimadamore wrote:
Hi Brian, I have been playing some more with the benchmark you
provided here:
https://gist.github.com/broneill/3a39051635e3cb758d0cca5a963c685e
The following patch seems to help quite significantly:
https://cr.openjdk.org/~mcimadamore/panama/segment_normalize_offsets.patch
Here are the benchmark results:
(before)
    Benchmark                          Mode  Cnt   Score   Error  Units
    BinarySearch.binarySearch_panama   avgt   30  61.404 ± 1.315  ns/op
    BinarySearch.binarySearch_unsafe   avgt   30  28.099 ± 0.330  ns/op
(after)
    Benchmark                          Mode  Cnt   Score   Error  Units
    BinarySearch.binarySearch_panama   avgt   30  38.103 ± 0.471  ns/op
    BinarySearch.binarySearch_unsafe   avgt   30  27.556 ± 0.208  ns/op
We believe there must be some C2 issue lurking in there - basically
we’re forcing all offsets to be positive using this:
    @ForceInline
    private static long normalize(long offset) {
        if (offset < 0) {
            throw new IndexOutOfBoundsException("Offset is < 0");
        }
        return offset & Long.MAX_VALUE;
    }
Now, surprisingly, if the "& Long.MAX_VALUE" is dropped, performance
dips again. I believe C2 loses track of the positivity of the offset
there, and the above patch provides some kind of workaround.
I’d be interested to know if this patch helps the situation with
your bigger benchmark, or if things stay the same (in which case, we
would have to conclude that this benchmark, while interesting, is
perhaps not reflective of the issues in your bigger codebase).
Thanks Maurizio
On 06/07/2023 23:19, Brian S O’Neill wrote:
When I use the “ALL” MemorySegment instead of allocating
MemorySegments on the fly for copies, the performance regression
drops to ~2%.
More information about the panama-dev
mailing list