Re: MemorySegment.ofAddress(...).reinterpret(...)

Thu Jul 6 18:31:29 UTC 2023

I took a look at the generated assembly [1] for the payload method in:

     static final int ELEM_SIZE = 10;
     static final int CARRIER_SIZE = (int)JAVA_INT.byteSize();
     static final int BYTE_SIZE = ELEM_SIZE * CARRIER_SIZE;
     static final MemorySegment ALL = MemorySegment.NULL.reinterpret(Long.MAX_VALUE);

     public static void main(String[] args) {
         int[] ints = new int[ELEM_SIZE];
         long unsafe_addr = Arena.ofAuto().allocate(BYTE_SIZE).address();
         for (int i = 0; i < ints.length ; i++) {
             ints[i] = i;
         }

         State state = new State(ints, unsafe_addr);
         for (int i = 0; i < 20_000; i++) {
             payload(state);
         }
     }

     record State(int[] ints, long unsafe_addr) {}

     public static void payload(State state) {
         MemorySegment.copy(state.ints(), 0, ALL, JAVA_INT, state.unsafe_addr(), ELEM_SIZE);
     }

There are three checks we do to ensure safety:
1. a bounds check, which checks if the accessed address range is in bounds
2. an alignment check to see that the base address is aligned according 
to the alignment of JAVA_INT
3. a liveness check to see that the segment being accessed is still alive
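
To make the three checks concrete, here is a rough sketch in plain Java of what each one amounts to for this copy. The class and method names (CopyChecks, inBounds, isAligned, isAlive) are made up for illustration and are not the JDK's internal code. Modelling liveness as an int whose negative values mean "closed" is an assumption based on the sign test visible in the assembly in [1].

```java
public class CopyChecks {
    // 1. Bounds check: the range [offset, offset + bytes) must fit in a
    //    segment of segmentSize bytes. Written to avoid overflow.
    static boolean inBounds(long offset, long bytes, long segmentSize) {
        return offset >= 0 && bytes >= 0 && offset <= segmentSize - bytes;
    }

    // 2. Alignment check: the address must be a multiple of the layout's
    //    alignment (assumed to be a power of two, e.g. 4 for JAVA_INT).
    static boolean isAligned(long address, long alignment) {
        return (address & (alignment - 1)) == 0;
    }

    // 3. Liveness check: hypothetical model where a negative state
    //    means the owning scope has been closed.
    static boolean isAlive(int state) {
        return state >= 0;
    }

    public static void main(String[] args) {
        long bytes = 10L * Integer.BYTES; // 40 bytes, as in payload
        System.out.println(inBounds(0x1000, bytes, Long.MAX_VALUE));
        System.out.println(isAligned(0x1000, 4));
        System.out.println(isAligned(0x1001, 4));
        System.out.println(isAlive(0));
    }
}
```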

#2 can be removed on your end by using the JAVA_INT_UNALIGNED layout 
instead. #3 shouldn't really be happening, since the scope of the ALL 
segment is the global scope, which is always alive. I have a fix that 
overrides the liveness check method in the GlobalSession class to do 
nothing [2]. On my machine, addressing those 2 brings the performance of 
the Panama-based copy very close to that of Unsafe (within 0.2 ns):

Benchmark                                    Mode  Cnt  Score   Error  Units
MemorySegmentCopy.segment_copy_static_small  avgt   30  7.741 ± 0.017  ns/op
MemorySegmentCopy.unsafe_copy_small          avgt   30  7.534 ± 0.011  ns/op

Also, keep in mind that I'm copying only 10 elements here; for larger 
copies the relative overhead of the checks would be smaller.
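
As a minimal sketch of the JAVA_INT_UNALIGNED suggestion: the copy below passes the unaligned layout, so a destination offset that is not a multiple of 4 is accepted. To stay away from restricted methods it copies into a small arena-allocated segment rather than the ALL segment, and the class and method names (UnalignedCopy, copyUnaligned) are only illustrative. It assumes a JDK where java.lang.foreign is available (final in JDK 22, preview in JDK 21).

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_INT_UNALIGNED;

public class UnalignedCopy {
    static final int ELEM_SIZE = 10;

    // Copies ELEM_SIZE ints into dst starting at byte offset dstOffset.
    // JAVA_INT_UNALIGNED has a 1-byte alignment constraint, so no
    // 4-byte alignment is required of the destination.
    static void copyUnaligned(int[] ints, MemorySegment dst, long dstOffset) {
        MemorySegment.copy(ints, 0, dst, JAVA_INT_UNALIGNED, dstOffset, ELEM_SIZE);
    }

    public static void main(String[] args) {
        int[] ints = new int[ELEM_SIZE];
        for (int i = 0; i < ints.length; i++) {
            ints[i] = i;
        }
        try (Arena arena = Arena.ofConfined()) {
            // One extra byte so the copy can start at offset 1.
            MemorySegment dst = arena.allocate(ELEM_SIZE * Integer.BYTES + 1);
            copyUnaligned(ints, dst, 1);
            // Read back the last element through the same unaligned layout.
            System.out.println(dst.get(JAVA_INT_UNALIGNED, 1 + 9L * Integer.BYTES));
        }
    }
}
```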

#1 could technically also be removed by exposing a special memory 
segment implementation that doesn't do any bounds checking. I'm not sure 
we want to go there though, given the API surface/complexity we'd 
introduce, for a very marginal gain. At least I'd like to discuss that 
with Maurizio first (who's currently on vacation).

#3 is definitely something we can address, I think. I've filed: 
https://bugs.openjdk.org/browse/JDK-8311594

HTH,
Jorn

[1]:

   ... setup frame, etc.
   0x0000022d46bc52ba:   mov             r10d, dword ptr [rdx + 0xc]   // load 'ints'
                                                             ; implicit exception: dispatches to 0x0000022d46bc540c
   0x0000022d46bc52be:   mov             r9, qword ptr [rdx + 0x10]   // load 'unsafe_addr'
   0x0000022d46bc52c2:   mov             r11d, dword ptr [r12 + r10*8 + 0xc]  // load array length
                                                             ; implicit exception: dispatches to 0x0000022d46bc5420
   0x0000022d46bc52c7:   mov             ebp, r11d
   0x0000022d46bc52ca:   or              ebp, 0xa
   0x0000022d46bc52cd:   test            ebp, ebp    // array bounds check part 1 (check for less than 0) (this also looks redundant, but not directly related to Panama)
   0x0000022d46bc52cf:   jl              0x22d46bc5378
   0x0000022d46bc52d5:   cmp             r11d, 0xa // array bounds check part 2 (check for less than 10)
   0x0000022d46bc52d9:   jl              0x22d46bc53d0
   0x0000022d46bc52df:   mov             r11, r9
   0x0000022d46bc52e2:   and             r11, 3
   0x0000022d46bc52e6:   test            r11, r11   // alignment check
   0x0000022d46bc52e9:   jne             0x22d46bc539c
   0x0000022d46bc52ef:   movabs          r11, 0x7fffffffffffffd8
   0x0000022d46bc52f9:   cmp             r9, r11   // bounds check
   0x0000022d46bc52fc:   jae             0x22d46bc535d
   0x0000022d46bc52fe:   movabs          r11, 0x4408c4cb8    ; {oop(a 'jdk/internal/foreign/GlobalSession'{0x00000004408c4cb8})}
   0x0000022d46bc5308:   mov             ebp, dword ptr [r11 + 0xc]   // load 'state' field
   0x0000022d46bc530c:   test            ebp, ebp   // liveness check
   0x0000022d46bc530e:   jl              0x22d46bc53f0
   0x0000022d46bc5314:   mov             rdx, r9   // arg0 = unsafe_addr
   0x0000022d46bc5317:   lea             r11, [r12 + r10*8]
   0x0000022d46bc531b:   lea             rcx, [r12 + r10*8 + 0x10]    // arg1 = ints base address
   0x0000022d46bc5320:   mov             r8d, 0x28    // arg2 = 40 (bytes)
   0x0000022d46bc5326:   mov             byte ptr [r15 + 0x478], 1    // set JavaThread::_doing_unsafe_access
   0x0000022d46bc532e:   movabs          r10, 0x22d46b443a0
   0x0000022d46bc5338:   call            r10
   0x0000022d46bc533b:   nop             dword ptr [rax + rax]; {other}
   0x0000022d46bc5343:   mov             byte ptr [r15 + 0x478], r12b
   0x0000022d46bc534a:   add             rsp, 0x40
   0x0000022d46bc534e:   pop             rbp
   0x0000022d46bc534f:   cmp             rsp, qword ptr [r15 + 0x450]
                                                             ; {poll_return}
   0x0000022d46bc5356:   ja              0x22d46bc5444
   0x0000022d46bc535c:   ret
   ... slow paths
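
The constant used in the bounds check above can be reproduced by hand. It appears to be the size of the ALL segment (Long.MAX_VALUE) minus the 40 bytes being copied, plus one, and the offset passes if it is below that limit in an unsigned comparison (the jae branch takes the failure path). A small sketch with made-up names (BoundsConstant, passes) that checks this arithmetic:

```java
public class BoundsConstant {
    static final long SEGMENT_SIZE = Long.MAX_VALUE;     // size of the ALL segment
    static final long COPY_BYTES = 10L * Integer.BYTES;  // 40 bytes = 0x28

    // Smallest offset at which a 40-byte copy no longer fits.
    static final long LIMIT = SEGMENT_SIZE - COPY_BYTES + 1;

    // Mirrors 'cmp r9, r11; jae <slow path>': pass only if offset is
    // strictly below LIMIT when both are treated as unsigned.
    static boolean passes(long offset) {
        return Long.compareUnsigned(offset, LIMIT) < 0;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(LIMIT)); // 7fffffffffffffd8
        System.out.println(passes(0x1000));
        System.out.println(passes(-1L)); // negative offsets are huge when unsigned
    }
}
```

Because the comparison is unsigned, a single compare also rejects negative offsets, which is why no separate "less than zero" test shows up for the segment offset.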

On 06/07/2023 04:40, Brian S O'Neill wrote:
> When I copy against the "ALL" segment, the overall performance 
> regression goes from 3% to 2%. How can I reduce this overhead 
> further? What I really want is something like the Unsafe class. It's 
> simple, direct, and efficient. Unfortunately, it's going away at some 
> point.
>
> On 2023-07-05 06:26 PM, Brian S O'Neill wrote:
>> The application is a high performance database which scales to 
>> support very large cache sizes with very little GC overhead. The 
>> database can operate in pure "in memory" mode, or be backed by a 
>> file, or be memory mapped, or be backed by a block device using 
>> O_DIRECT.
>>
>> The bulk of the cache is allocated up front via mmap(anonymous), and 
>> page references into the cache are just plain long pointers. The 
>> cache can grow over time, and so not all pointers refer to the same 
>> region.
>>
>> The critical operations are memory copies and accessing primitive 
>> types: MemorySegment.get/set(JAVA_BYTE, etc). Memory copies are 
>> performed between pages and also to/from Java byte arrays.
>>
>> I could replace all pointer references with MemorySegment references 
>> (with great difficulty), but this would bloat the total heap size and 
>> increase GC overhead.
>>
>> Currently I'm seeing a 3% performance regression using the foreign 
>> memory API compared to using Unsafe. One trick that I'm using is to 
>> refer to the entire address space using a static final field: ALL = 
>> MemorySegment.NULL.reinterpret(Long.MAX_VALUE). I don't know if this 
>> has any overhead or not. It seems like access to it should be 
>> completely unchecked.
>>
>> For memory copies I can try copying to/from the "ALL" segment, in 
>> which case I don't need to allocate temporary MemorySegments at all. 
>> I'll report back with my findings. Perhaps I just really need a 
>> convenient reference to all memory, unchecked, and with low overhead.
>>
>>
>> On 2023-07-05 04:12 PM, Jorn Vernee wrote:
>>> Hello,
>>>
>>> The usual way of attaching a size to a memory segment would be 
>>> through a target address layout [1]. MS::reinterpret is meant for 
>>> cases where the layout of the pointee is not statically known (and 
>>> cannot be captured with a memory layout). But both cases assume 
>>> you're reading/writing MemorySegments directly, instead of going 
>>> through longs. Are neither of these an option for you? Could you 
>>> explain your use case a bit more?
>>>
>>> I think we briefly considered having multiple MS::ofAddress 
>>> overloads. IIRC we decided against it, because the current 
>>> MS::ofAddress is not a restricted method, but, an overload that sets 
>>> the size of the returned segment would need to be restricted since 
>>> the segment would be accessible. So, adding more overloads would 
>>> create a split between them in terms of restricted-ness, which is 
>>> not great.
>>>
>>> I'll also note that, someone could define a utility method that 
>>> fuses the two operations, e.g.:
>>>
>>> static MemorySegment longToSizedSegment(long baseAddress, long byteSize) {
>>>      return MemorySegment.ofAddress(baseAddress).reinterpret(byteSize);
>>> }
>>>
>>> That is to say: I think we have the right primitives available in 
>>> the API. Though, your experience is a useful data point. We might 
>>> want to add some useful methods in the future if we see enough 
>>> demand for them.
>>>
>>> Thanks,
>>> Jorn
>>>
>>> [1]: 
>>> https://download.java.net/java/early_access/jdk21/docs/api/java.base/java/lang/foreign/AddressLayout.html#withTargetLayout(java.lang.foreign.MemoryLayout)
>>>
>>> On 06/07/2023 00:19, Brian S O'Neill wrote: