Hi Maurizio, On 4/29/20 2:41 AM, Maurizio Cimadamore wrote:
The current implementation has performances that are on par with the previous acquire-based implementation, and also on par with what can be achieved with Unsafe. We do have a micro benchmark in the patch (see ParallelSum (**)) which tests this, and I get _identical_ numbers even if I _comment_ the body of acquire/release - so that no contention can happen; so, I'm a bit skeptical overall that contention on acquire/release is the main factor at play here - but perhaps we need more targeted benchmarks.
So I modified your benchmark (just took out the relevant parts) and added some benchmarks that exhibit Stream.findAny() and Stream.findFirst(). As I anticipated, the results for parallel stream variants were slowing the benchmark down, so I had to reduce the number of elements by a factor of 16 to get results in reasonable time: http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/ParallelSum.... I then swapped-in the alternative implementation of MemoryScope (note that this is not a whole implementation - the dup() method is missing): http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/MemoryScope.... ...and I got these results: i7 2600K (4 cores / 8 threads) with proposed MemoryScope: Benchmark Mode Cnt Score Error Units ParallelSum.find_any_stream_parallel avgt 10 1332.687 ± 733.535 ms/op ParallelSum.find_any_stream_serial avgt 10 440.260 ± 3.110 ms/op ParallelSum.find_first_loop_serial avgt 10 5.809 ± 0.044 ms/op ParallelSum.find_first_stream_parallel avgt 10 2070.318 ± 41.072 ms/op ParallelSum.find_first_stream_serial avgt 10 440.034 ± 4.672 ms/op ParallelSum.sum_loop_serial avgt 10 5.647 ± 0.055 ms/op ParallelSum.sum_stream_parallel avgt 10 5.314 ± 0.294 ms/op ParallelSum.sum_stream_serial avgt 10 19.179 ± 0.136 ms/op with alternative MemoryScope: Benchmark Mode Cnt Score Error Units ParallelSum.find_any_stream_parallel avgt 10 80.280 ± 13.183 ms/op ParallelSum.find_any_stream_serial avgt 10 317.388 ± 2.787 ms/op ParallelSum.find_first_loop_serial avgt 10 5.790 ± 0.038 ms/op ParallelSum.find_first_stream_parallel avgt 10 117.925 ± 1.747 ms/op ParallelSum.find_first_stream_serial avgt 10 315.076 ± 5.725 ms/op ParallelSum.sum_loop_serial avgt 10 5.652 ± 0.042 ms/op ParallelSum.sum_stream_parallel avgt 10 4.881 ± 0.053 ms/op ParallelSum.sum_stream_serial avgt 10 19.143 ± 0.035 ms/op So here it is. The proof that contention does occur. Regards, Peter