testing support loops with long (64b) trip counts
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Wed Jun 10 01:12:49 UTC 2020
Hi,
first of all, thanks Roland for working on this; this has been a
considerable bottleneck for the Panama Foreign Memory Access API over
the last year or so.
In the implementation of the API we are actually applying a few tricks, so
that we detect "small" segments at creation (e.g. segments whose size fits
into an int). If that's the case, then we use some logic to perform int
computations instead of long computations, which speeds things up a bit.
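To make the trick concrete, here is a minimal sketch of the idea. It is not the actual Panama implementation: the class name SmallSegmentSketch, its fields, and the inBounds method are hypothetical and exist only for illustration. The "small" flag is computed once at creation, and the bounds check switches to int arithmetic when the flag is set.

```java
public class SmallSegmentSketch {
    private final long length;   // segment size in bytes
    private final boolean small; // true if the size fits into an int

    public SmallSegmentSketch(long length) {
        if (length < 0) {
            throw new IllegalArgumentException("negative length");
        }
        this.length = length;
        // Decided once, at creation time.
        this.small = length <= Integer.MAX_VALUE;
    }

    public boolean isSmall() {
        return small;
    }

    // Returns true if [offset, offset + size) lies inside the segment.
    public boolean inBounds(long offset, long size) {
        if (small) {
            int o = (int) offset;
            int s = (int) size;
            // Reject values that did not survive the narrowing cast.
            if (o != offset || s != size) {
                return false;
            }
            // Pure int arithmetic: length <= Integer.MAX_VALUE and s >= 0,
            // so the subtraction cannot overflow.
            return o >= 0 && s >= 0 && o <= (int) length - s;
        }
        // General path: long arithmetic.
        return offset >= 0 && size >= 0 && offset <= length - size;
    }
}
```

The point of the sketch is only that the int path is selected by a flag that never changes after creation, so the JIT has a chance to treat the check as int-only in hot loops.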
But this trick alone is not enough: it only works if the client writes
the loop using an `int` loop variable; if a `long` variable is used,
performance is significantly degraded. This is a pity, because the
VarHandles used by the Foreign Memory Access API use long coordinates, so
the client has to be very careful to cast the int variable back to
a long (or an inexact var handle call will take place).
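The two loop shapes being compared can be sketched as follows. This is an illustration only: getIntAt is a hypothetical stand-in for an accessor that takes a long offset (here backed by a heap ByteBuffer so the example runs on any JDK), not the Foreign Memory Access API itself. Note the explicit widening of the int induction variable before it is passed as a long.

```java
import java.nio.ByteBuffer;

public class LoopShapes {
    // Hypothetical stand-in for a memory accessor with a long offset.
    public static int getIntAt(ByteBuffer buf, long offset) {
        return buf.getInt(Math.toIntExact(offset));
    }

    // int induction variable, explicitly widened to long at the call site.
    public static long sumIntLoop(ByteBuffer buf, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += getIntAt(buf, (long) i * Integer.BYTES);
        }
        return sum;
    }

    // long induction variable: the loop has a 64-bit trip count.
    public static long sumLongLoop(ByteBuffer buf, long count) {
        long sum = 0;
        for (long i = 0; i < count; i++) {
            sum += getIntAt(buf, i * Integer.BYTES);
        }
        return sum;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(4 * Integer.BYTES);
        for (int i = 0; i < 4; i++) {
            buf.putInt(i * Integer.BYTES, i + 1);
        }
        System.out.println(sumIntLoop(buf, 4));
        System.out.println(sumLongLoop(buf, 4L));
    }
}
```

Both methods compute the same result; the difference lies only in the type of the loop variable, which is what separates the LoopOverNonConstant and LoopOverNonConstantLong benchmark suites.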
I tried several combinations, with and without your patch, to see if any
improvements were possible. For each combination I also ran with and
without the small segment hack, to see the difference. The patch I
applied on top of the Panama foreign-memaccess branch [1] can be found
at [2]. Here are some results:
baseline w/ workaround

Benchmark                                      Mode  Cnt  Score   Error  Units
LoopOverNonConstant.segment_loop               avgt   30  0.330 ± 0.015  ms/op
LoopOverNonConstant.segment_loop_readonly      avgt   30  0.354 ± 0.004  ms/op
LoopOverNonConstant.segment_loop_slice         avgt   30  0.347 ± 0.006  ms/op

Benchmark                                      Mode  Cnt  Score   Error  Units
LoopOverNonConstantLong.segment_loop           avgt   30  1.695 ± 0.061  ms/op
LoopOverNonConstantLong.segment_loop_readonly  avgt   30  1.660 ± 0.089  ms/op
LoopOverNonConstantLong.segment_loop_slice     avgt   30  1.684 ± 0.057  ms/op
baseline w/o workaround

Benchmark                                      Mode  Cnt  Score   Error  Units
LoopOverNonConstant.segment_loop               avgt   30  0.484 ± 0.034  ms/op
LoopOverNonConstant.segment_loop_readonly      avgt   30  0.502 ± 0.012  ms/op
LoopOverNonConstant.segment_loop_slice         avgt   30  0.501 ± 0.012  ms/op

Benchmark                                      Mode  Cnt  Score   Error  Units
LoopOverNonConstantLong.segment_loop           avgt   30  1.377 ± 0.026  ms/op
LoopOverNonConstantLong.segment_loop_readonly  avgt   30  1.173 ± 0.023  ms/op
LoopOverNonConstantLong.segment_loop_slice     avgt   30  1.170 ± 0.029  ms/op
baseline w/o workaround + proposed patch

Benchmark                                      Mode  Cnt  Score   Error  Units
LoopOverNonConstant.segment_loop               avgt   30  0.530 ± 0.042  ms/op
LoopOverNonConstant.segment_loop_readonly      avgt   30  0.508 ± 0.013  ms/op
LoopOverNonConstant.segment_loop_slice         avgt   30  0.520 ± 0.016  ms/op

Benchmark                                      Mode  Cnt  Score   Error  Units
LoopOverNonConstantLong.segment_loop           avgt   30  1.575 ± 0.066  ms/op
LoopOverNonConstantLong.segment_loop_readonly  avgt   30  1.517 ± 0.020  ms/op
LoopOverNonConstantLong.segment_loop_slice     avgt   30  1.496 ± 0.042  ms/op
Overall, unless I made a mistake, it doesn't look like the patch
changes much. The baseline with the small-segment workaround remains the
fastest version, performing on par with Unsafe. If we remove
the workaround, we get some 1.5-2x slower; but if we start looping on
longs (see the LoopOverNonConstantLong suite), then performance gets
much, much worse.
Am I missing something? Is our implementation doing something that is
confusing your optimization?
Cheers
Maurizio
[1] - https://github.com/openjdk/panama-foreign/tree/foreign-memaccess
[2] - http://cr.openjdk.java.net/~mcimadamore/panama/long_loop%2bpanama.patch
More information about the hotspot-compiler-dev mailing list