RFR 8243491: Implementation of Foreign-Memory Access API (Second Incubator)

Wed Apr 29 19:19:40 UTC 2020

Hi Maurizio,

On 4/29/20 2:41 AM, Maurizio Cimadamore wrote:
> The current implementation has performances that are on par with the 
> previous acquire-based implementation, and also on par with what can 
> be achieved with Unsafe. We do have a micro benchmark in the patch 
> (see ParallelSum (**)) which tests this, and I get _identical_ numbers 
> even if I _comment_ the body of acquire/release - so that no 
> contention can happen; so, I'm a bit skeptical overall that contention 
> on acquire/release is the main factor at play here - but perhaps we 
> need more targeted benchmarks. 

So I modified your benchmark (keeping only the relevant parts) and 
added some benchmarks that exercise Stream.findAny() and 
Stream.findFirst(). As I anticipated, the parallel stream variants of 
these ran so slowly that I had to reduce the number of elements by a 
factor of 16 to get results in reasonable time:

http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/ParallelSum.java
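
For readers who do not want to open the link, here is a minimal sketch of 
the shape of the added benchmarks. It is not the code from 
ParallelSum.java: it uses a plain long[] instead of a MemorySegment, 
leaves out the JMH harness, and the class and method names are made up 
for illustration. The predicate never matches, so the search cannot 
terminate early. As far as I understand the stream implementation, 
short-circuiting operations pull elements one by one through 
Spliterator.tryAdvance(), whereas sum() can use the bulk 
forEachRemaining() path.

```java
import java.util.Arrays;
import java.util.OptionalLong;

public class ShortCircuitSketch {
    // Hypothetical stand-in for the benchmark data (a long[] rather than a MemorySegment).
    static long[] data(int n) {
        long[] a = new long[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        return a;
    }

    // Non-short-circuiting terminal operation.
    static long sumParallel(long[] a) {
        return Arrays.stream(a).parallel().sum();
    }

    // Short-circuiting search whose predicate never matches (all values are >= 0),
    // so every element has to be examined before the stream gives up.
    static OptionalLong findAnyParallel(long[] a) {
        return Arrays.stream(a).parallel().filter(v -> v < 0).findAny();
    }

    // Same worst case, but with encounter-order semantics.
    static OptionalLong findFirstParallel(long[] a) {
        return Arrays.stream(a).parallel().filter(v -> v < 0).findFirst();
    }

    public static void main(String[] args) {
        long[] a = data(1_000_000);
        System.out.println("sum       = " + sumParallel(a));
        System.out.println("findAny   = " + findAnyParallel(a));
        System.out.println("findFirst = " + findFirstParallel(a));
    }
}
```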

I then swapped in the alternative implementation of MemoryScope (note 
that this is not a complete implementation - the dup() method is missing):

http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/MemoryScope.java

...and I got these results:

i7 2600K (4 cores / 8 threads)

with proposed MemoryScope:

Benchmark                               Mode  Cnt     Score     Error  Units
ParallelSum.find_any_stream_parallel    avgt   10  1332.687 ± 733.535  ms/op
ParallelSum.find_any_stream_serial      avgt   10   440.260 ±   3.110  ms/op
ParallelSum.find_first_loop_serial      avgt   10     5.809 ±   0.044  ms/op
ParallelSum.find_first_stream_parallel  avgt   10  2070.318 ±  41.072  ms/op
ParallelSum.find_first_stream_serial    avgt   10   440.034 ±   4.672  ms/op
ParallelSum.sum_loop_serial             avgt   10     5.647 ±   0.055  ms/op
ParallelSum.sum_stream_parallel         avgt   10     5.314 ±   0.294  ms/op
ParallelSum.sum_stream_serial           avgt   10    19.179 ±   0.136  ms/op

with alternative MemoryScope:

Benchmark                               Mode  Cnt    Score    Error  Units
ParallelSum.find_any_stream_parallel    avgt   10   80.280 ± 13.183  ms/op
ParallelSum.find_any_stream_serial      avgt   10  317.388 ±  2.787  ms/op
ParallelSum.find_first_loop_serial      avgt   10    5.790 ±  0.038  ms/op
ParallelSum.find_first_stream_parallel  avgt   10  117.925 ±  1.747  ms/op
ParallelSum.find_first_stream_serial    avgt   10  315.076 ±  5.725  ms/op
ParallelSum.sum_loop_serial             avgt   10    5.652 ±  0.042  ms/op
ParallelSum.sum_stream_parallel         avgt   10    4.881 ±  0.053  ms/op
ParallelSum.sum_stream_serial           avgt   10   19.143 ±  0.035  ms/op

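So here it is. With the alternative MemoryScope the parallel findAny() 
and findFirst() variants run roughly 17 times faster (1332.687 vs. 
80.280 ms/op and 2070.318 vs. 117.925 ms/op), while the sum benchmarks 
are essentially unchanged. That is the proof that contention does occur.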
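
To illustrate the access pattern I am talking about, here is a toy 
sketch. It is neither the proposed MemoryScope nor the alternative one. 
It only assumes a scope that keeps a single shared counter which every 
thread increments on acquire and decrements on release. The class and 
method names are hypothetical. When several threads do this in a tight 
loop, they all write to the same memory location.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SharedCounterScope {
    // Single shared counter: every acquire/release from every thread writes here.
    private final AtomicInteger activeCount = new AtomicInteger();

    void acquire() {
        activeCount.incrementAndGet();
    }

    void release() {
        activeCount.decrementAndGet();
    }

    int active() {
        return activeCount.get();
    }

    // Runs 'threads' workers that each do 'iterations' acquire/release pairs
    // on one shared scope, then returns the final count (0 if balanced).
    static int hammer(int threads, int iterations) {
        SharedCounterScope scope = new SharedCounterScope();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    scope.acquire();
                    scope.release();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return scope.active();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int remaining = hammer(8, 1_000_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("remaining = " + remaining + ", elapsed ms = " + elapsedMs);
    }
}
```

Comparing the elapsed time of hammer(1, n) with hammer(8, n) on a 
multi-core machine gives a rough feel for the cost of the shared counter, 
but this is no substitute for a proper JMH run.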

Regards, Peter