Re: RFR 8243491: Implementation of Foreign-Memory Access API (Second Incubator)

29 Apr 2020

      Many thanks Peter,
my preference would be for adding the benchmark now, and come back to 
fix this post integration. The MemoryScope code is finicky and getting 
confidence that the code is race-free takes time and I'd prefer not to 
change that during the course of this RFR.

Your approach seems promising, so let's keep working and thinking on it 
on the side (and perhaps let's maybe go for it in the panama repo first, 
to make sure we don't add regressions).

Sounds like a plan?

Cheers
Maurizio

On 29/04/2020 20:19, Peter Levart wrote:
...
Hi Maurizio,
On 4/29/20 2:41 AM, Maurizio Cimadamore wrote:
...
The current implementation has performances that are on par with the 
previous acquire-based implementation, and also on par with what can 
be achieved with Unsafe. We do have a micro benchmark in the patch 
(see ParallelSum (**)) which tests this, and I get _identical_ 
numbers even if I _comment_ the body of acquire/release - so that no 
contention can happen; so, I'm a bit skeptical overall that 
contention on acquire/release is the main factor at play here - but 
perhaps we need more targeted benchmarks.
So I modified your benchmark (just took out the relevant parts) and 
added some benchmarks that exhibit Stream.findAny() and 
Stream.findFirst(). As I anticipated, the results for parallel stream 
variants were slowing the benchmark down, so I had to reduce the 
number of elements by a factor of 16 to get results in reasonable time:
http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/ParallelSum....
I then swapped-in the alternative implementation of MemoryScope (note 
that this is not a whole implementation - the dup() method is missing):
http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/MemoryScope....
...and I got these results:
i7 2600K (4 cores / 8 threads)
with proposed MemoryScope:
Benchmark                               Mode  Cnt Score     Error Units
ParallelSum.find_any_stream_parallel    avgt   10  1332.687 ± 733.535  
ms/op
ParallelSum.find_any_stream_serial      avgt   10   440.260 ± 3.110  
ms/op
ParallelSum.find_first_loop_serial      avgt   10     5.809 ± 0.044  
ms/op
ParallelSum.find_first_stream_parallel  avgt   10  2070.318 ± 41.072  
ms/op
ParallelSum.find_first_stream_serial    avgt   10   440.034 ± 4.672  
ms/op
ParallelSum.sum_loop_serial             avgt   10     5.647 ± 0.055  
ms/op
ParallelSum.sum_stream_parallel         avgt   10     5.314 ± 0.294  
ms/op
ParallelSum.sum_stream_serial           avgt   10    19.179 ± 0.136  
ms/op
with alternative MemoryScope:
Benchmark                               Mode  Cnt Score    Error Units
ParallelSum.find_any_stream_parallel    avgt   10   80.280 ± 13.183  
ms/op
ParallelSum.find_any_stream_serial      avgt   10  317.388 ± 2.787  ms/op
ParallelSum.find_first_loop_serial      avgt   10    5.790 ± 0.038  ms/op
ParallelSum.find_first_stream_parallel  avgt   10  117.925 ± 1.747  ms/op
ParallelSum.find_first_stream_serial    avgt   10  315.076 ± 5.725  ms/op
ParallelSum.sum_loop_serial             avgt   10    5.652 ± 0.042  ms/op
ParallelSum.sum_stream_parallel         avgt   10    4.881 ± 0.053  ms/op
ParallelSum.sum_stream_serial           avgt   10   19.143 ± 0.035  ms/op
So here it is. The proof that contention does occur.
Regards, Peter