RFR (14) 8235837: Memory access API refinements
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Jan 17 16:22:10 UTC 2020
On 17/01/2020 16:01, Andrew Haley wrote:
> On 1/16/20 3:15 PM, Maurizio Cimadamore wrote:
>> On 16/01/2020 14:50, Andrew Haley wrote:
>>> On 1/15/20 6:48 PM, Maurizio Cimadamore wrote:
>>>> Maybe this would be best moved on panama-dev?
>>> Here we are.
>>>
>>>> In any case, for obtaining the best performance, it is best to use
>>>> an indexed (or strided) var handle - your loop will create a new
>>>> memory address on each iteration, which will not be a problem once
>>>> MemoryAddress is an inline type, but in the meantime...
>>> OK. It's rather unfortunate that your example at the very top of the
>>> API description uses an anti-pattern, at least from a performance
>>> point of view. Might I suggest that it should be changed to the best
>>> way to do the job?
>> That's a good suggestion, thanks. The rationale behind the current
>> example was to provide something simple, so that people could
>> familiarize themselves with the API - I'm worried that throwing
>> strided access into the mix in the very first example could be a bit
>> steep, but I'll see what I can do.
> I think that's a big clue itself. Perhaps the API is the problem here
> -- if the simple way to do something isn't the best way to do it, some
> changes may be called for.
>
> People are going to want to replace simple ByteBuffer operations with
> this API, and we'll need a straightforward way to do that.
Possibly - although this API is probably slightly lower-level than your
typical ByteBuffer audience, I think. Anyway, the focus for now is on
the fundamentals - we want to get the primitive right - later on we can
discuss whether we should throw in *usability candies*, such as indexed
accessors on the memory segment itself which do the right thing
VarHandle-wise.
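For what it's worth, the indexed-access pattern under discussion can be
illustrated with the standard java.lang.invoke machinery (the memory
access API itself is still incubating and its signatures keep shifting,
so this is only an analogue, not the Panama API; the class name
IndexedAccess is mine). A view var handle takes the buffer and a byte
offset as access coordinates, so the loop keeps its offset arithmetic
as plain int operations instead of materializing a new address object
per element:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class IndexedAccess {
    // View var handle over a ByteBuffer: coordinates are (buffer, byteOffset),
    // so each access indexes directly rather than creating a per-element
    // address object.
    static final VarHandle INT_HANDLE =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocateDirect(100 * 4);
        for (int i = 0; i < 100; i++) {
            INT_HANDLE.set(bb, i * 4, i);          // strided store: offset = i * carrier size
        }
        int sum = 0;
        for (int i = 0; i < 100; i++) {
            sum += (int) INT_HANDLE.get(bb, i * 4); // strided load
        }
        System.out.println(sum);                    // 0 + 1 + ... + 99 = 4950
    }
}
```

The strided memory-access var handles play the same role for segments:
the index stays an access coordinate, which is what gives C2 a chance
at BCE and vectorization.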
>
>>>> https://mail.openjdk.java.net/pipermail/panama-dev/2020-January/007081.html
>>> I guess that change still isn't committed to the Panama repo? But
>>> anyway, the loops are all ints, with a constant bound.
>> This change has been pushed to the panama repo. The loops might be
>> int, but internally the memory access machinery uses long operations
>> to sum and multiply offsets, which causes BCE and other optimizations
>> to fail (since those optimizations look for IADD and IMUL opcodes).
> OK, more work to be done there.
Yep, indeed.
>
>> The numbers you report above seem consistent with what we have in JDK
>> 14, and in panama before the fix/workaround I mentioned. Are you sure
>> you are testing with the latest bits? Could you be on the panama
>> 'default' branch instead of 'foreign-memaccess'?
> Oh, why did I not think of that! :-)
>
> I updated to 'foreign-memaccess'.
>
>> Here I get this:
>>
>> Benchmark                 Mode  Cnt  Score    Error  Units
>> LoopOverNew.buffer_loop   avgt   30  0.623 ±  0.003  ms/op
>> LoopOverNew.segment_loop  avgt   30  0.624 ±  0.005  ms/op
>> LoopOverNew.unsafe_loop   avgt   30  0.400 ±  0.002  ms/op
This is normal - unsafe is faster because it does not zero memory. If
you run with "-Djdk.internal.foreign.skipZeroMemory=true", Unsafe and
segments should get the same numbers.
<snip>
>> Sorry, I misread your original email - you said that to get
>> vectorization you updated the benchmark so that it always stored the
>> same value. I do indeed get vectorization in that case - but I also
>> get "vmovdqu" generated if I change the memory segment benchmark to
>> do the same thing - and again get similar perf numbers:
>>
>> Benchmark                 Mode  Cnt  Score    Error  Units
>> LoopOverNew.buffer_loop   avgt   30  0.418 ±  0.004  ms/op
>> LoopOverNew.segment_loop  avgt   30  0.415 ±  0.001  ms/op
>> LoopOverNew.unsafe_loop   avgt   30  0.396 ±  0.002  ms/op
> Now I really am confused. Same source tree, same branch, very
> different results. I know that C2 is somewhat probabilistic and uses a
> ton of heuristics, but this is ridiculous.
Perhaps we tweaked the benchmarks in slightly different ways? Could you
please share your modifications, so that we can at least make sure
we're running the same thing?
>
>> Also, please do take into account that the ByteBuffer benchmark is
>> giving good numbers, but it does so by 'cheating' and using Unsafe to
>> force cleanup of the off-heap memory (see the call to
>> Unsafe::invokeCleaner). If we remove that call (so as not to depend
>> on Unsafe), then the numbers are quite different:
> Does anyone repeatedly clean up ByteBuffers? As far as I know they
> open them and leave them open.
Well, if that's the case, then the number below is more representative
of the "average" use case. Note that one of the main drivers of the
memory segment API, since we're in Panama-land, is to support the kind
of allocation we see frequently when interfacing with native code:
e.g. allocate a small struct, populate it, pass it to a native function,
clean it up. The only supported way to do this has been the ByteBuffer
API, but, as you can see from these numbers, ByteBuffers are really bad
at repeatedly allocating (and cleaning) small structs, because of all
the Cleaner baggage. That means that if you use ByteBuffers for native
interop, your best option is the Unsafe cleaner - but then you are on
your own when it comes to ensuring that calling the cleaner doesn't
break other buffer clients in other threads. The big advantage of the
memory segment API is that it has been built from the ground up with
deterministic deallocation in mind, specifically to support these kinds
of use cases.
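To make that contrast concrete, the "allocate, populate, use, free"
cycle with a ByteBuffer has to lean on the unsupported invokeCleaner
route, roughly like this (the class name EagerCleanup is mine;
everything else is standard, if unsupported, JDK API):

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

public class EagerCleanup {
    public static void main(String[] args) throws Exception {
        // Grab the Unsafe instance reflectively - this is the 'cheating'
        // the buffer_loop benchmark relies on.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        ByteBuffer bb = ByteBuffer.allocateDirect(64); // e.g. a small "struct"
        bb.putInt(0, 42);                              // populate it
        int v = bb.getInt(0);                          // ... use it ...
        unsafe.invokeCleaner(bb);                      // free off-heap memory now,
                                                       // rather than waiting for GC;
                                                       // any other thread still
                                                       // holding bb is now unsafe
        System.out.println(v);
    }
}
```

A memory segment, by contrast, is closed deterministically (e.g. via
try-with-resources), with no Unsafe dependency and with the API itself
policing access after close.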
>
>> Benchmark                 Mode  Cnt  Score    Error  Units
>> LoopOverNew.buffer_loop   avgt   30  2.120 ±  0.667  ms/op
>>
>
Cheers
Maurizio