testing support loops with long (64b) trip counts

Wed Jun 10 01:12:49 UTC 2020

Hi,
first of all, thanks Roland for working on this; this has been a 
considerable bottleneck for the Panama Foreign Memory Access API over 
the last year or so.

In the implementation of the API we are actually applying a few tricks, 
so that we detect "small" segments at creation (i.e. segments whose size 
fits into an int). If that's the case, then we use some logic to perform 
int computations instead of long computations, which speeds things up a 
bit. But this trick alone is not enough: it only works if the client 
writes the loop using an `int` loop variable; if a `long` variable is 
used, performance is significantly degraded. This is a pity, because the 
VarHandles used by the Foreign Memory Access API use long coordinates, so 
the client has to be very careful to cast the int variable back into 
a long (or an inexact var handle call will take place).
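To make the two loop shapes concrete, here is a minimal sketch. It is 
not the Panama implementation: `Segment`, `small`, `sumIntLoop` and 
`sumLongLoop` are hypothetical names, and a plain byte[] stands in for 
the real memory. It only illustrates a "small" flag computed at creation, 
an accessor with a long coordinate, and the int vs long loop variable.

```java
// Hypothetical sketch; Segment is a stand-in, not the Panama implementation.
class SmallSegmentSketch {

    static final class Segment {
        private final byte[] data;   // backing storage, for illustration only
        final long length;           // sizes are longs, as in the API
        private final boolean small; // decided once, at creation

        Segment(int size) {
            data = new byte[size];
            for (int i = 0; i < size; i++) {
                data[i] = (byte) (i % 100);
            }
            length = size;
            // Always true here because of the byte[] backing; a real
            // segment could be larger than Integer.MAX_VALUE.
            small = length <= Integer.MAX_VALUE;
        }

        // The accessor takes a long offset, mirroring long coordinates.
        byte get(long offset) {
            if (small) {
                // Small segment: bounds check in int arithmetic.
                int i = (int) offset;
                if (i != offset || i < 0 || i >= (int) length) {
                    throw new IndexOutOfBoundsException("offset " + offset);
                }
                return data[i];
            }
            // General case: bounds check in long arithmetic
            // (unreachable in this sketch, kept to show the shape).
            if (offset < 0 || offset >= length) {
                throw new IndexOutOfBoundsException("offset " + offset);
            }
            return data[(int) offset];
        }
    }

    // int loop variable, explicitly widened to long at the access.
    static long sumIntLoop(Segment s) {
        long sum = 0;
        int limit = (int) s.length;
        for (int i = 0; i < limit; i++) {
            sum += s.get((long) i);
        }
        return sum;
    }

    // long loop variable, the shape that is slow in the numbers below.
    static long sumLongLoop(Segment s) {
        long sum = 0;
        for (long i = 0; i < s.length; i++) {
            sum += s.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        Segment s = new Segment(100);
        System.out.println(sumIntLoop(s) + " " + sumLongLoop(s));
    }
}
```

Both methods compute the same sum; only the type of the loop variable 
differs. With a real VarHandle the `(long) i` cast matters because the 
call site must match the handle's long coordinate type to stay exact.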

I tried several combinations w/ and w/o your patch, to see if any 
improvements were possible. I also tried, for each combo, to run with 
and without the small segment hack, to see the difference. The patch I 
applied on top of the Panama foreign-memaccess branch [1] can be found 
at [2]. Here are some results:

baseline w/ workaround

Benchmark                                  Mode  Cnt  Score   Error  Units
LoopOverNonConstant.segment_loop           avgt   30  0.330 ± 0.015  ms/op
LoopOverNonConstant.segment_loop_readonly  avgt   30  0.354 ± 0.004  ms/op
LoopOverNonConstant.segment_loop_slice     avgt   30  0.347 ± 0.006  ms/op

Benchmark                                      Mode  Cnt  Score   Error  Units
LoopOverNonConstantLong.segment_loop           avgt   30  1.695 ± 0.061  ms/op
LoopOverNonConstantLong.segment_loop_readonly  avgt   30  1.660 ± 0.089  ms/op
LoopOverNonConstantLong.segment_loop_slice     avgt   30  1.684 ± 0.057  ms/op

baseline w/o workaround

Benchmark                                  Mode  Cnt  Score   Error  Units
LoopOverNonConstant.segment_loop           avgt   30  0.484 ± 0.034  ms/op
LoopOverNonConstant.segment_loop_readonly  avgt   30  0.502 ± 0.012  ms/op
LoopOverNonConstant.segment_loop_slice     avgt   30  0.501 ± 0.012  ms/op

Benchmark                                      Mode  Cnt  Score   Error  Units
LoopOverNonConstantLong.segment_loop           avgt   30  1.377 ± 0.026  ms/op
LoopOverNonConstantLong.segment_loop_readonly  avgt   30  1.173 ± 0.023  ms/op
LoopOverNonConstantLong.segment_loop_slice     avgt   30  1.170 ± 0.029  ms/op

baseline w/o workaround + proposed patch

Benchmark                                  Mode  Cnt  Score   Error  Units
LoopOverNonConstant.segment_loop           avgt   30  0.530 ± 0.042  ms/op
LoopOverNonConstant.segment_loop_readonly  avgt   30  0.508 ± 0.013  ms/op
LoopOverNonConstant.segment_loop_slice     avgt   30  0.520 ± 0.016  ms/op

Benchmark                                      Mode  Cnt  Score   Error  Units
LoopOverNonConstantLong.segment_loop           avgt   30  1.575 ± 0.066  ms/op
LoopOverNonConstantLong.segment_loop_readonly  avgt   30  1.517 ± 0.020  ms/op
LoopOverNonConstantLong.segment_loop_slice     avgt   30  1.496 ± 0.042  ms/op

Overall, unless I made a mistake, it doesn't look like the patch is 
changing much. The baseline + workaround for small segments remains the 
fastest version around, and performs on par with unsafe. If we remove 
the workaround, we get roughly 1.5-2x slower; but if we start looping on 
longs (see the LoopOverNonConstantLong suite), then performance gets 
much, much worse.

Am I missing something? Is our implementation doing something that is 
confusing your optimization?

Cheers
Maurizio

[1] - https://github.com/openjdk/panama-foreign/tree/foreign-memaccess
[2] - http://cr.openjdk.java.net/~mcimadamore/panama/long_loop%2bpanama.patch