RFR (14) 8235837: Memory access API refinements

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Fri Jan 17 16:22:10 UTC 2020


On 17/01/2020 16:01, Andrew Haley wrote:
> On 1/16/20 3:15 PM, Maurizio Cimadamore wrote:
>> On 16/01/2020 14:50, Andrew Haley wrote:
>>> On 1/15/20 6:48 PM, Maurizio Cimadamore wrote:
>>>> Maybe this would be best moved on panama-dev?
>>> Here we are.
>>>
>>>> In any case, for obtaining best performances, it is best to use an
>>>> indexed (or strided) var handle - your loop will create a new memory
>>>> address on each new iteration, which will not be a problem once
>>>> MemoryAddress will be an inline type, but in the meantime...
>>> OK. It's rather unfortunate that your example at the very top of the
>>> API description uses an anti-pattern, at least from a performance
>>> point of view. Might I suggest that it should be changed to the best
>>> way to do the job?
>> That's a good suggestion, thanks. The rationale behind the current
>> example was to provide something simple, so that people could familiarize
>> themselves with the API - I'm worried that throwing strided access into
>> the mix in the very first example could be a bit steep, but I'll see what
>> I can do.
> I think that's a big clue itself. Perhaps the API is the problem here
> -- if the simple way to do something isn't the best way to do it, some
> changes may be called for.
>
> People are going to want to replace simple ByteBuffer operations with
> this API, and we'll need a straightforward way to do that.
Possibly - although this API might be slightly lower-level than your 
typical ByteBuffer audience, I think. Anyway, the focus, for now, is on 
the fundamentals - we want to get the primitives right - later on we can 
discuss whether we should throw in *usability candies*, such as indexed 
accessors on the memory segment itself which do the right thing 
VarHandle-wise.
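
For reference, the indexed pattern being discussed - where the layout carries the element stride, instead of a fresh address being computed per element - can be sketched as below. This sketch uses the finalized java.lang.foreign API (JDK 22+) rather than the incubating API this thread refers to, which instead went through MemoryAddress and strided VarHandles (MemoryHandles.withStride); the shape of the loop is the point, not the exact names:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class IndexedAccess {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // Allocate space for 1000 ints off-heap; freed when the arena closes.
            MemorySegment seg = arena.allocate(ValueLayout.JAVA_INT, 1000);
            // Indexed access: the layout supplies the element size (stride),
            // so no per-iteration address object needs to be created.
            for (int i = 0; i < 1000; i++) {
                seg.setAtIndex(ValueLayout.JAVA_INT, i, i);
            }
            int sum = 0;
            for (int i = 0; i < 1000; i++) {
                sum += seg.getAtIndex(ValueLayout.JAVA_INT, i);
            }
            System.out.println(sum); // 0 + 1 + ... + 999 = 499500
        }
    }
}
```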
>
>>>> https://mail.openjdk.java.net/pipermail/panama-dev/2020-January/007081.html
>>> I guess that change still isn't committed to the Panama repo? But
>>> anyway, the loops are all ints, with a constant bound.
>> This change has been pushed to the panama repo. The loops might be int,
>> but internally the memory access machinery uses long operations to sum
>> offsets together and multiply them, which causes BCE and other
>> optimizations to fail (since these optimizations look for IADD and IMUL
>> opcodes).
> OK, more work to be done there.
Yep, indeed.
>
>> The numbers you report above seem consistent with what we have in JDK 14,
>> and in panama before the fix/workaround I mentioned. Are you sure you
>> are testing with the latest bits? Could you be on the panama 'default'
>> branch instead of 'foreign-memaccess' ?
> Oh, why did I not think of that! :-)
>
> I updated to 'foreign-memaccess'.
>
>> Here I get this:
>>
>> Benchmark                 Mode  Cnt  Score   Error  Units
>> LoopOverNew.buffer_loop   avgt   30  0.623 ? 0.003  ms/op
>> LoopOverNew.segment_loop  avgt   30  0.624 ? 0.005  ms/op
>> LoopOverNew.unsafe_loop   avgt   30  0.400 ? 0.002  ms/op

This is normal - Unsafe is faster because it does not zero memory. If 
you run with "-Djdk.internal.foreign.skipZeroMemory=true", Unsafe and 
segments should get the same numbers.
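
(For context: segment allocation zeroes the backing memory up front, much like `new byte[n]` does on-heap, whereas `Unsafe::allocateMemory` returns uninitialized memory - that extra pass is the cost the flag above temporarily disables. A minimal sketch of the zeroing guarantee, using the finalized java.lang.foreign API for illustration:)

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class ZeroingDemo {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // Allocate 1 MiB off-heap; the API guarantees it is zero-filled.
            MemorySegment seg = arena.allocate(ValueLayout.JAVA_BYTE, 1 << 20);
            long nonZero = 0;
            for (long i = 0; i < seg.byteSize(); i++) {
                if (seg.get(ValueLayout.JAVA_BYTE, i) != 0) nonZero++;
            }
            // Unsafe.allocateMemory skips this zeroing, hence its edge in
            // allocation-heavy benchmarks.
            System.out.println("non-zero bytes: " + nonZero);
        }
    }
}
```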


<snip>

>> Sorry, I misread your original email - you said that to get
>> vectorization you updated the benchmark so that it always stored the
>> same value. I indeed do get vectorization in that case - but I also
>> get "vmovdqu" to be generated if I change the memory segment
>> benchmark to do the same thing - and get again similar perf numbers:
>>
>> Benchmark                 Mode  Cnt  Score   Error  Units
>> LoopOverNew.buffer_loop   avgt   30  0.418 ? 0.004  ms/op
>> LoopOverNew.segment_loop  avgt   30  0.415 ? 0.001  ms/op
>> LoopOverNew.unsafe_loop   avgt   30  0.396 ? 0.002  ms/op
> Now I really am confused. Same source tree, same branch, very
> different results. I know that C2 is somewhat probabilistic and uses a
> ton of heuristics, but this is ridiculous.
Perhaps we tweaked the benchmarks in slightly different ways? Can you 
please share your modifications, so that at least we can make sure we're 
running the same thing.
>
>> Also, please do take into account that the bytebuffer benchmark is
>> giving good numbers, but it does so by 'cheating' and using the Unsafe
>> way to force cleanup of the off-heap memory (see the call to
>> Unsafe::invokeCleaner). If we remove that call (so as not to depend on
>> Unsafe), then the numbers are quite different:
> Does anyone repeatedly clean up ByteBuffers? As far as I know they
> open them and leave them open.
Well, if that's the case, then the number below is more representative 
of the "average" use case. Note that one of the main drivers of the 
memory segment API, since we're in Panama-land, is to support the kind 
of allocations that we see frequently when interfacing with native code: 
e.g. allocate a small struct, populate it, pass it to a native function, 
clean it up. The only supported way to do this has been the ByteBuffer 
API, but, as you can see from these numbers, ByteBuffers are really bad 
when it comes to repeatedly allocating (and cleaning up) small structs, 
because of all the Cleaner baggage. This means that if you use 
ByteBuffers for native interop, your best option is to use the Unsafe 
cleaner - but then you are on your own when it comes to ensuring that 
calling the cleaner doesn't break other buffer clients in other threads. 
The big advantage of the memory segment API is that it has been built 
from the ground up with deterministic deallocation in mind, specifically 
to support this kind of use case.
>
>> Benchmark                Mode  Cnt  Score   Error  Units
>> LoopOverNew.buffer_loop  avgt   30  2.120 ? 0.667  ms/op
>>
>
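
To make the allocate/populate/pass/free cycle concrete, here is a minimal sketch of the lifecycle described above. It uses the finalized java.lang.foreign API (the incubating API of this thread spelled things differently, e.g. MemorySegment::allocateNative and an explicit close()); the struct layout and field offsets are purely illustrative, standing in for a small native struct { int x; int y; }:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class StructLifecycle {
    // Hypothetical layout for an illustrative struct { int x; int y; }.
    static final MemoryLayout POINT = MemoryLayout.structLayout(
            ValueLayout.JAVA_INT.withName("x"),
            ValueLayout.JAVA_INT.withName("y"));

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // Allocate the struct off-heap, scoped to this arena.
            MemorySegment point = arena.allocate(POINT);
            point.set(ValueLayout.JAVA_INT, 0, 3);  // field x, at offset 0
            point.set(ValueLayout.JAVA_INT, 4, 4);  // field y, at offset 4
            // ...here the segment would be passed to a native function...
            System.out.println(point.get(ValueLayout.JAVA_INT, 0) + ","
                             + point.get(ValueLayout.JAVA_INT, 4));
        } // memory is freed deterministically here - no Cleaner involved
    }
}
```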
Cheers
Maurizio

