RFR (14) 8235837: Memory access API refinements

Andrew Haley aph at redhat.com
Fri Jan 17 16:01:12 UTC 2020


On 1/16/20 3:15 PM, Maurizio Cimadamore wrote:
>
> On 16/01/2020 14:50, Andrew Haley wrote:
>> On 1/15/20 6:48 PM, Maurizio Cimadamore wrote:
>>> Maybe this would be best moved to panama-dev?
>> Here we are.
>>
>>> In any case, to obtain the best performance, it is best to use an
>>> indexed (or strided) var handle - your loop creates a new memory
>>> address on each iteration, which will not be a problem once
>>> MemoryAddress becomes an inline type, but in the meantime...
>>
>> OK. It's rather unfortunate that your example at the very top of the
>> API description uses an anti-pattern, at least from a performance
>> point of view. Might I suggest that it should be changed to the best
>> way to do the job?
>
> That's a good suggestion, thanks. The rationale behind the current
> example was to provide something simple, so that people could
> familiarize themselves with the API - I'm worried that throwing strided
> access into the mix in the first example could be a bit steep, but I'll
> see what I can do.

I think that's a big clue in itself. Perhaps the API is the problem here
-- if the simple way to do something isn't the best way to do it, some
changes may be called for.

People are going to want to replace simple ByteBuffer operations with
this API, and we'll need a straightforward way to do that.
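
To make the contrast concrete, here's a minimal sketch of the two
styles. The first loop mirrors the example at the top of the API docs;
the second uses a layout-derived indexed var handle, as Maurizio
suggests. API names are per the current jdk.incubator.foreign javadoc,
so treat them as subject to change:

    import java.lang.invoke.VarHandle;
    import java.nio.ByteOrder;
    import jdk.incubator.foreign.MemoryAddress;
    import jdk.incubator.foreign.MemoryHandles;
    import jdk.incubator.foreign.MemoryLayout;
    import jdk.incubator.foreign.MemoryLayout.PathElement;
    import jdk.incubator.foreign.MemorySegment;
    import jdk.incubator.foreign.SequenceLayout;

    class SimpleVsStrided {
        public static void main(String[] args) {
            // Simple style: compute a fresh MemoryAddress per iteration.
            VarHandle intHandle =
                MemoryHandles.varHandle(int.class, ByteOrder.nativeOrder());
            try (MemorySegment segment = MemorySegment.allocateNative(100)) {
                MemoryAddress base = segment.baseAddress();
                for (int i = 0; i < 25; i++) {
                    intHandle.set(base.addOffset(i * 4), i);
                }
            }

            // Strided style: the element size lives in the var handle,
            // and the loop supplies a logical index, not an address.
            SequenceLayout intArray = MemoryLayout.ofSequence(25,
                    MemoryLayout.ofValueBits(32, ByteOrder.nativeOrder()));
            VarHandle indexedHandle =
                intArray.varHandle(int.class, PathElement.sequenceElement());
            try (MemorySegment segment = MemorySegment.allocateNative(intArray)) {
                MemoryAddress base = segment.baseAddress();
                for (long i = 0; i < 25; i++) {
                    indexedHandle.set(base, i, (int) i);
                }
            }
        }
    }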

>>> https://mail.openjdk.java.net/pipermail/panama-dev/2020-January/007081.html
>>
>> I guess that change still isn't committed to the Panama repo? But
>> anyway, the loops are all ints, with a constant bound.
>
> This change has been pushed to the panama repo. The loops might be int,
> but internally the memory access machinery uses long operations to sum
> offsets together and multiply them, which is causing BCE and other
> optimizations to fail (since those optimizations look for IADD and IMUL
> opcodes).

OK, more work to be done there.

> The numbers you report above seem consistent with what we have in JDK 14,
> and in panama before the fix/workaround I mentioned. Are you sure you
> are testing with the latest bits? Could you be on the panama 'default'
> branch instead of 'foreign-memaccess' ?

Oh, why did I not think of that! :-)

I updated to 'foreign-memaccess'.

> Here I get this:
>
> Benchmark                 Mode  Cnt  Score   Error  Units
> LoopOverNew.buffer_loop   avgt   30  0.623 ± 0.003  ms/op
> LoopOverNew.segment_loop  avgt   30  0.624 ± 0.005  ms/op
> LoopOverNew.unsafe_loop   avgt   30  0.400 ± 0.002  ms/op

And I'm seeing this, with the benchmark changed to store a constant,
using the foreign-memaccess branch:

Benchmark                      Mode  Cnt  Score   Error  Units
LoopOverNew.buffer_loop        avgt   10  0.295 ± 0.007  ms/op
LoopOverNew.segment_loop       avgt   10  0.356 ± 0.006  ms/op
LoopOverNew.unsafe_loop        avgt   10  0.347 ± 0.007  ms/op

But perhaps the benchmark isn't telling the whole story about code
quality. It walks over 4 megabytes of memory, so it's forcing all
accesses out to the L3 cache, and memory access time swamps everything
else. Change that loop from 1 iteration of 4000 kbytes to 10
iterations of 400 kbytes, and I get:

Benchmark                      Mode  Cnt  Score   Error  Units
LoopOverNew.buffer_loop        avgt   10  0.117 ± 0.001  ms/op
   (34 GB/sec!)
LoopOverNew.segment_loop       avgt   10  0.363 ± 0.001  ms/op
LoopOverNew.unsafe_loop        avgt   10  0.345 ± 0.003  ms/op
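
For reference, the reshaped loop looks roughly like this. It's a
sketch, not the actual panama benchmark source; the class name, PASSES
and CHUNK are mine, and the real benchmark sets things up differently:

    import java.lang.reflect.Field;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;
    import sun.misc.Unsafe;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @State(Scope.Thread)
    public class LoopShapeSketch {
        static final int PASSES = 10;      // was 1 pass over the whole buffer
        static final int CHUNK = 100_000;  // ints per pass: 400 kbytes, was 1_000_000

        static final Unsafe UNSAFE;
        static {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                UNSAFE = (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        long addr;

        @Setup
        public void setup() {
            addr = UNSAFE.allocateMemory(4L * CHUNK);
        }

        @TearDown
        public void tearDown() {
            UNSAFE.freeMemory(addr);
        }

        @Benchmark
        public void unsafe_loop() {
            // Same total work (one million constant int stores), but each
            // pass touches only 400 kbytes, so the working set stays in
            // the per-core caches instead of spilling out to L3.
            for (int pass = 0; pass < PASSES; pass++) {
                for (int i = 0; i < CHUNK; i++) {
                    UNSAFE.putInt(addr + 4L * i, 42);
                }
            }
        }
    }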

Vectorization is a really big deal as long as we don't splurge so much
memory that the per-core caches become ineffective. I don't know:
maybe all of the off-heap memory will tend to have ...

I will have to follow up to try to understand why Unsafe doesn't
vectorize this loop.

> Unsafe comes out on top, but that's because of memory zeroing - if you
> disable it using -Djdk.internal.foreign.skipZeroMemory=true then the
> numbers for the memory access API are on par with Unsafe.

OK.
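
For anyone reproducing this, one way to pass that flag under JMH is
per-fork. A sketch with a placeholder body, assuming the stock JMH
annotations:

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Fork;

    public class SkipZeroingSketch {
        // Fork the benchmark JVM with segment zeroing disabled.
        @Fork(value = 3, jvmArgsAppend = "-Djdk.internal.foreign.skipZeroMemory=true")
        @Benchmark
        public void segment_loop() {
            // ... the segment store loop from the benchmark ...
        }
    }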

> > Btw, on my machine I see lots of unrolling, but no vectorization, not
> > even for ByteBuffer.
>
> Sorry, I misread your original email - you said that to get
> vectorization you updated the benchmark so that it always stored the
> same value. I indeed do get vectorization in that case - but I also
> get "vmovdqu" to be generated if I change the memory segment
> benchmark to do the same thing - and get again similar perf numbers:

Excellent, good to hear. Short of updating to the foreign-memaccess
branch and rebuilding, I'm not sure what else I can do, though.

You may be aware that five years or so ago I did a chunk of work
optimizing off-heap ByteBuffer accesses. Ever since, it's been a game
of Whack-A-Mole every time C2 broke something.

On 16/01/2020 15:15, Maurizio Cimadamore wrote:
> > Btw, on my machine I see lots of unrolling, but no vectorization, not
> > even for ByteBuffer.
>
> Sorry, I misread your original email - you said that to get
> vectorization you updated the benchmark so that it always stored the
> same value. I indeed do get vectorization in that case - but I also
> get "vmovdqu" to be generated if I change the memory segment
> benchmark to do the same thing - and get again similar perf numbers:
>
> Benchmark                 Mode  Cnt  Score   Error  Units
> LoopOverNew.buffer_loop   avgt   30  0.418 ± 0.004  ms/op
> LoopOverNew.segment_loop  avgt   30  0.415 ± 0.001  ms/op
> LoopOverNew.unsafe_loop   avgt   30  0.396 ± 0.002  ms/op

Now I really am confused. Same source tree, same branch, very
different results. I know that C2 is somewhat probabilistic and uses a
ton of heuristics, but this is ridiculous.

> Also, please do take into account that the ByteBuffer benchmark is
> giving good numbers, but it does so by 'cheating' and using Unsafe
> to force cleanup of the off-heap memory (see the call to
> Unsafe::invokeCleaner). If we remove that call (so as not to depend
> on Unsafe), then the numbers are quite different:

Does anyone repeatedly clean up ByteBuffers? As far as I know they
open them and leave them open.
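
For context, the 'cheat' is roughly this: sun.misc.Unsafe has exposed
invokeCleaner since JDK 9, and the reflective lookup is the usual
boilerplate.

    import java.lang.reflect.Field;
    import java.nio.ByteBuffer;
    import sun.misc.Unsafe;

    class EagerFree {
        public static void main(String[] args) throws Exception {
            // The usual reflective dance to get at Unsafe.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            ByteBuffer bb = ByteBuffer.allocateDirect(4_000_000);
            // ... use the buffer ...
            unsafe.invokeCleaner(bb);  // free the off-heap memory now,
                                       // instead of waiting for GC
        }
    }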

> Benchmark                Mode  Cnt  Score   Error  Units
> LoopOverNew.buffer_loop  avgt   30  2.120 ± 0.667  ms/op
>
> Which is ~5x worse. Now, I agree with you that we should strive to
> generate the best possible code (since that seems to happen for
> other JDK APIs :-) ), but I think that when evaluating the
> performance of the new memory API we should also factor in other
> considerations (such as the cost of actually allocating a segment
> vs. a buffer).

Maybe, if any real-world program is actually doing that.

Thank you for reading to the end of this email.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


