Re: MemorySegment.ofAddress(...).reinterpret(...)
Jorn Vernee
jorn.vernee at oracle.com
Thu Jul 6 18:31:29 UTC 2023
I took a look at the generated assembly [1] for the payload method in:
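import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

import static java.lang.foreign.ValueLayout.JAVA_INT;

// Class name is arbitrary; the snippet as posted omitted the enclosing class.
public class CopyRepro {
    static final int ELEM_SIZE = 10;
    static final int CARRIER_SIZE = (int) JAVA_INT.byteSize();
    static final int BYTE_SIZE = ELEM_SIZE * CARRIER_SIZE;

    // A segment spanning the whole address space, used as the copy target.
    static final MemorySegment ALL =
            MemorySegment.NULL.reinterpret(Long.MAX_VALUE);

    public static void main(String[] args) {
        int[] ints = new int[ELEM_SIZE];
        long unsafe_addr = Arena.ofAuto().allocate(BYTE_SIZE).address();
        for (int i = 0; i < ints.length; i++) {
            ints[i] = i;
        }
        State state = new State(ints, unsafe_addr);
        // Call payload often enough for it to get JIT-compiled.
        for (int i = 0; i < 20_000; i++) {
            payload(state);
        }
    }

    record State(int[] ints, long unsafe_addr) {}

    public static void payload(State state) {
        MemorySegment.copy(state.ints(), 0, ALL, JAVA_INT,
                state.unsafe_addr(), ELEM_SIZE);
    }
}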
There are three checks we do to ensure safety:
1. a bounds check, which checks if the accessed address range is in bounds
2. an alignment check to see that the base address is aligned according
to the alignment of JAVA_INT
3. a liveness check to see that the segment being accessed is still alive
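To make the three checks concrete, here is a rough sketch of them in plain Java. This is not the JDK's implementation: the class, the Session type, and the checkAccess signature are all made up for illustration, and only the exception types are chosen to match what the FFM API documents for each kind of failure.

```java
// Illustrative only: none of these names exist in the JDK.
final class AccessChecksSketch {

    /** Hypothetical stand-in for a segment's scope/session. */
    static final class Session {
        volatile boolean alive = true;
    }

    /**
     * Runs the three checks for an access of `length` bytes at `offset`
     * into a segment of `segmentSize` bytes starting at `baseAddress`.
     * `alignment` must be a power of two.
     */
    static void checkAccess(long segmentSize, long baseAddress, long offset,
                            long length, long alignment, Session session) {
        // 1. bounds check: [offset, offset + length) must lie inside the segment
        if (offset < 0 || length < 0 || offset > segmentSize - length) {
            throw new IndexOutOfBoundsException("out of bounds: offset=" + offset);
        }
        // 2. alignment check: the accessed address must be a multiple of `alignment`
        if (((baseAddress + offset) & (alignment - 1)) != 0) {
            throw new IllegalArgumentException("misaligned access");
        }
        // 3. liveness check: the backing session must not have been closed
        if (!session.alive) {
            throw new IllegalStateException("segment is no longer alive");
        }
    }
}
```

With a segment that starts at address 0 and spans Long.MAX_VALUE bytes, as ALL does, the offset passed to the copy is the raw address itself, so the checks apply directly to unsafe_addr.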
#2 can be removed on your end by using the JAVA_INT_UNALIGNED layout
instead. #3 shouldn't really be happening, since the scope of the ALL
segment is the global scope, which is always alive. I have a fix that
overrides the liveness check method in the GlobalSession class to do
nothing [2]. On my machine, addressing those 2 brings the performance of
the Panama-based copy very close to that of Unsafe (within 0.2 ns):
Benchmark                                    Mode  Cnt  Score    Error  Units
MemorySegmentCopy.segment_copy_static_small  avgt   30  7.741 ± 0.017  ns/op
MemorySegmentCopy.unsafe_copy_small          avgt   30  7.534 ± 0.011  ns/op
Also, keep in mind that I'm just copying 10 elements here; for larger
copies the relative overhead would be smaller.
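For reference, a minimal sketch of what switching to JAVA_INT_UNALIGNED looks like at a call site. The class and method names are hypothetical, and the sketch copies into an ordinary arena-allocated segment rather than ALL so it can run without restricted methods. It assumes JDK 22 or later, where java.lang.foreign is final (on JDK 21 it needs --enable-preview).

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

import static java.lang.foreign.ValueLayout.JAVA_INT;
import static java.lang.foreign.ValueLayout.JAVA_INT_UNALIGNED;

// Illustrative only: class and method names are not from the thread.
final class UnalignedCopySketch {

    /** Copies all of `ints` to `dst` at `dstOffset`; no alignment constraint applies. */
    static void copyUnaligned(int[] ints, MemorySegment dst, long dstOffset) {
        MemorySegment.copy(ints, 0, dst, JAVA_INT_UNALIGNED, dstOffset, ints.length);
    }

    /** Returns true if the same copy with the aligned JAVA_INT layout is rejected. */
    static boolean alignedCopyRejected(int[] ints, MemorySegment dst, long dstOffset) {
        try {
            MemorySegment.copy(ints, 0, dst, JAVA_INT, dstOffset, ints.length);
            return false;
        } catch (IllegalArgumentException e) {
            // Thrown when the destination is incompatible with the layout's alignment.
            return true;
        }
    }

    public static void main(String[] args) {
        int[] ints = {1, 2, 3};
        // Request 8-byte alignment so that offset 1 is guaranteed to be misaligned for int.
        MemorySegment dst = Arena.ofAuto().allocate(64, 8);
        copyUnaligned(ints, dst, 1);
        System.out.println("aligned copy rejected: " + alignedCopyRejected(ints, dst, 1));
    }
}
```

The only change at the call site is the layout argument. Whether unaligned access is actually acceptable depends on the platform and on how the addresses are produced, so this trades a check for an assumption the caller has to uphold.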
#1 could technically also be removed by exposing a special memory
segment implementation that doesn't do any bounds checking. I'm not sure
we want to go there though, given the API surface/complexity we'd
introduce, for a very marginal gain. At least I'd like to discuss that
with Maurizio first (who's currently on vacation).
#3 is definitely something we can address, I think. I've filed:
https://bugs.openjdk.org/browse/JDK-8311594
HTH,
Jorn
[1]:
... setup frame, etc.
0x0000022d46bc52ba: mov r10d, dword ptr [rdx + 0xc]             // load 'ints'
                    ; implicit exception: dispatches to 0x0000022d46bc540c
0x0000022d46bc52be: mov r9, qword ptr [rdx + 0x10]              // load 'unsafe_address'
0x0000022d46bc52c2: mov r11d, dword ptr [r12 + r10*8 + 0xc]     // load array length
                    ; implicit exception: dispatches to 0x0000022d46bc5420
0x0000022d46bc52c7: mov ebp, r11d
0x0000022d46bc52ca: or ebp, 0xa
0x0000022d46bc52cd: test ebp, ebp                               // array bounds check part 1 (check for less than 0)
                    // (this also looks redundant, but not directly related to Panama)
0x0000022d46bc52cf: jl 0x22d46bc5378
0x0000022d46bc52d5: cmp r11d, 0xa                               // array bounds check part 2 (check for less than 10)
0x0000022d46bc52d9: jl 0x22d46bc53d0
0x0000022d46bc52df: mov r11, r9
0x0000022d46bc52e2: and r11, 3
0x0000022d46bc52e6: test r11, r11                               // alignment check
0x0000022d46bc52e9: jne 0x22d46bc539c
0x0000022d46bc52ef: movabs r11, 0x7fffffffffffffd8
0x0000022d46bc52f9: cmp r9, r11                                 // bounds check
0x0000022d46bc52fc: jae 0x22d46bc535d
0x0000022d46bc52fe: movabs r11, 0x4408c4cb8
                    ; {oop(a 'jdk/internal/foreign/GlobalSession'{0x00000004408c4cb8})}
0x0000022d46bc5308: mov ebp, dword ptr [r11 + 0xc]              // load 'state' field
0x0000022d46bc530c: test ebp, ebp                               // liveness check
0x0000022d46bc530e: jl 0x22d46bc53f0
0x0000022d46bc5314: mov rdx, r9                                 // arg0 = unsafe_address
0x0000022d46bc5317: lea r11, [r12 + r10*8]
0x0000022d46bc531b: lea rcx, [r12 + r10*8 + 0x10]               // arg1 = ints base address
0x0000022d46bc5320: mov r8d, 0x28                               // arg2 = 40 (bytes)
0x0000022d46bc5326: mov byte ptr [r15 + 0x478], 1               // set JavaThread::_doing_unsafe_access
0x0000022d46bc532e: movabs r10, 0x22d46b443a0
0x0000022d46bc5338: call r10
0x0000022d46bc533b: nop dword ptr [rax + rax]                   ; {other}
0x0000022d46bc5343: mov byte ptr [r15 + 0x478], r12b
0x0000022d46bc534a: add rsp, 0x40
0x0000022d46bc534e: pop rbp
0x0000022d46bc534f: cmp rsp, qword ptr [r15 + 0x450]            ; {poll_return}
0x0000022d46bc5356: ja 0x22d46bc5444
0x0000022d46bc535c: ret
... slow paths
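
As a cross-check on the immediates in the listing, the following sketch derives them from the values in the test program. The class and method names are invented for illustration, and the formulas are only what is consistent with the listing (a 4-byte alignment mask, an unsigned compare against size - length + 1), not a quote of the JDK sources.

```java
// Illustrative only: reconstructs the constants seen in the compiled code.
final class ListingConstants {
    static final int ELEM_SIZE = 10;
    // 10 ints of 4 bytes each: the 0x28 passed as the byte count.
    static final long COPY_BYTES = (long) ELEM_SIZE * Integer.BYTES;
    // Size of the ALL segment.
    static final long SEGMENT_SIZE = Long.MAX_VALUE;

    /** First start offset at which a COPY_BYTES-long range no longer fits. */
    static long boundsLimit() {
        return SEGMENT_SIZE - COPY_BYTES + 1;
    }

    /** Mask applied to the address for a 4-byte alignment constraint. */
    static long alignmentMask() {
        return Integer.BYTES - 1;
    }

    /** Mirrors the 'and r11, 3' / 'cmp r9, r11; jae' pair from the listing. */
    static boolean passesChecks(long address) {
        boolean aligned = (address & alignmentMask()) == 0;
        // 'jae' is an unsigned comparison, so negative longs count as huge addresses.
        boolean inBounds = Long.compareUnsigned(address, boundsLimit()) < 0;
        return aligned && inBounds;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(boundsLimit()));
        System.out.println(Long.toHexString(COPY_BYTES));
    }
}
```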
On 06/07/2023 04:40, Brian S O'Neill wrote:
> When I copy against the "ALL" segment, the overall performance
> regression goes from 3% to 2%. How can I reduce this overhead
> further? What I really want is something like the Unsafe class. It's
> simple, direct, and efficient. Unfortunately, it's going away at some
> point.
>
> On 2023-07-05 06:26 PM, Brian S O'Neill wrote:
>> The application is a high performance database which scales to
>> support very large cache sizes with very little GC overhead. The
>> database can operate in pure "in memory" mode, or be backed by a
>> file, or be memory mapped, or be backed by a block device using
>> O_DIRECT.
>>
>> The bulk of the cache is allocated up front via mmap(anonymous), and
>> page references into the cache are just plain long pointers. The
>> cache can grow over time, and so not all pointers refer to the same
>> region.
>>
>> The critical operations are memory copies and accessing primitive
>> types: MemorySegment.get/set(JAVA_BYTE, etc). Memory copies are
>> performed between pages and also to/from Java byte arrays.
>>
>> I could replace all pointer references with MemorySegment references
>> (with great difficulty), but this would bloat the total heap size and
>> increase GC overhead.
>>
>> Currently I'm seeing a 3% performance regression using the foreign
>> memory API compared to using Unsafe. One trick that I'm using is to
>> refer to the entire address space using a static final field: ALL =
>> MemorySegment.NULL.reinterpret(Long.MAX_VALUE). I don't know if this
>> has any overhead or not. It seems like access to it should be
>> completely unchecked.
>>
>> For memory copies I can try copying to/from the "ALL" segment, in
>> which case I don't need to allocate temporary MemorySegments at all.
>> I'll report back with my findings. Perhaps I just really need a
>> convenient reference to all memory, unchecked, and with low overhead.
>>
>>
>> On 2023-07-05 04:12 PM, Jorn Vernee wrote:
>>> Hello,
>>>
>>> The usual way of attaching a size to a memory segment would be
>>> through a target address layout [1]. MS::reinterpret is meant for
>>> cases where the layout of the pointee is not statically known (and
>>> can not be captured with a memory layout). But, both cases assume
>>> you're reading/writing MemorySegments directly, instead of going
>>> through longs. Are neither of these an option for you? Could you
>>> explain your use case a bit more?
>>>
>>> I think we briefly considered having multiple MS::ofAddress
>>> overloads. IIRC we decided against it, because the current
>>> MS::ofAddress is not a restricted method, but, an overload that sets
>>> the size of the returned segment would need to be restricted since
>>> the segment would be accessible. So, adding more overloads would
>>> create a split between them in terms of restricted-ness, which is
>>> not great.
>>>
>>> I'll also note that, someone could define a utility method that
>>> fuses the two operations, e.g.:
>>>
>>> static MemorySegment longToSizedSegment(long baseAddress, long
>>> byteSize) {
>>> return MemorySegment.ofAddress(baseAddress).reinterpret(byteSize);
>>> }
>>>
>>> That is to say: I think we have the right primitives available in
>>> the API. Though, your experience is a useful data point. We might
>>> want to add some useful methods in the future if we see enough
>>> demand for them.
>>>
>>> Thanks,
>>> Jorn
>>>
>>> [1]:
>>> https://download.java.net/java/early_access/jdk21/docs/api/java.base/java/lang/foreign/AddressLayout.html#withTargetLayout(java.lang.foreign.MemoryLayout)
>>>
>>> On 06/07/2023 00:19, Brian S O'Neill wrote: