[foreign-memaccess] [Rev 01] RFR: Alternative scalable MemoryScope

Tue May 5 10:10:10 UTC 2020

On Tue, 5 May 2020 10:09:28 GMT, Peter Levart <plevart at openjdk.org> wrote:

>> This is an alternative MemoryScope which is more scalable when used in a scenario where child scope is frequently
>> acquired and closed concurrently from multiple threads (for example in parallel Stream.findAny())
>
> Peter Levart has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Don't re-use acquires/releases LongAdder(s) for duped scope

My comment here is mostly non-code-related; I've already seen the code and attempted to prove its correctness in this
email:

https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-May/066190.html

My concerns are:

* some of the HB (happens-before) relationships in the code are quite non-obvious, subtle and hard to reason about
  (well, at least they were for me) - this means that, compared to the current approach, there will be a high
  maintenance cost associated with this

* In some of the benchmarks provided by Peter, the stream version of findAny is one to two orders of magnitude
  _slower_ than a serial for/each loop anyway. So, are we sure we're solving the right problem here? I.e. is there a use
  case where, by reducing contention, we get back performance that would be considered *acceptable* by a
  performance-savvy user? Or is this something that just makes something 100x worse as opposed to 1000x worse? Data is
  reported below:

w/o patch
ParallelSum.find_any_stream_parallel    avgt   10  1332.687 ± 733.535  ms/op
ParallelSum.find_any_stream_serial      avgt   10   440.260 ±   3.110  ms/op
ParallelSum.find_first_loop_serial      avgt   10     5.809 ±   0.044  ms/op

w/ patch
ParallelSum.find_any_stream_parallel    avgt   10   80.280 ± 13.183  ms/op
ParallelSum.find_any_stream_serial      avgt   10  317.388 ±  2.787  ms/op
ParallelSum.find_first_loop_serial      avgt   10    5.790 ±  0.038  ms/op
(full email here: https://mail.openjdk.java.net/pipermail/core-libs-dev/2020-April/066136.html)
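
To put the figures above in perspective, the slowdown of each stream variant relative to the serial loop can be
derived directly from the reported averages (error bars ignored). A small sketch; the class and method names are made
up for illustration only:

```java
public class SlowdownFactors {

    // Ratio of a benchmark's average time to the serial-loop baseline.
    static double slowdown(double ms, double baselineMs) {
        return ms / baselineMs;
    }

    public static void main(String[] args) {
        // Averages (ms/op) copied from the two tables above.
        System.out.printf("w/o patch, parallel stream: %.1fx%n", slowdown(1332.687, 5.809));
        System.out.printf("w/o patch, serial stream:   %.1fx%n", slowdown(440.260, 5.809));
        System.out.printf("w/  patch, parallel stream: %.1fx%n", slowdown(80.280, 5.790));
        System.out.printf("w/  patch, serial stream:   %.1fx%n", slowdown(317.388, 5.790));
    }
}
```

That works out to roughly 230x and 76x without the patch, and roughly 14x and 55x with it.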

A few questions come up naturally here: do these numbers reflect some intrinsic problem with the memory segment
spliterator per se, or do they reflect a more general problem with using short-circuiting operations (such as
`findAny`) which consume almost zero CPU time in a stream (and, worse, a parallel stream)?
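
One way to separate the two possibilities would be to run the same kind of search over a plain `int[]`, with no memory
segment or scope involved. The following is only a rough sketch under that assumption: the class, the method names and
the data layout are hypothetical, and the wall-clock timing in `main` is not a substitute for a proper JMH benchmark.

```java
import java.util.stream.IntStream;

public class ShortCircuitSketch {

    // Hypothetical input: all zeros except a single 1 in the last slot,
    // so every search has to scan the whole array before it can stop.
    static int[] makeData(int size) {
        int[] data = new int[size];
        data[size - 1] = 1;
        return data;
    }

    // Baseline: plain serial loop, returns the index of the first match or -1.
    static int findWithLoop(int[] data, int target) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == target) {
                return i;
            }
        }
        return -1;
    }

    // Short-circuiting stream over the array indices; no memory segment involved.
    static int findWithStream(int[] data, int target, boolean parallel) {
        IntStream indices = IntStream.range(0, data.length);
        if (parallel) {
            indices = indices.parallel();
        }
        return indices.filter(i -> data[i] == target).findAny().orElse(-1);
    }

    public static void main(String[] args) {
        int[] data = makeData(10_000_000);

        long t0 = System.nanoTime();
        int viaLoop = findWithLoop(data, 1);
        long t1 = System.nanoTime();
        int viaSerial = findWithStream(data, 1, false);
        long t2 = System.nanoTime();
        int viaParallel = findWithStream(data, 1, true);
        long t3 = System.nanoTime();

        // Crude single-shot timings, in microseconds; no warm-up, indicative only.
        System.out.printf("loop:            index %d, %d us%n", viaLoop, (t1 - t0) / 1_000);
        System.out.printf("serial stream:   index %d, %d us%n", viaSerial, (t2 - t1) / 1_000);
        System.out.printf("parallel stream: index %d, %d us%n", viaParallel, (t3 - t2) / 1_000);
    }
}
```

If a comparison like this showed a similar gap between the stream and the loop, that would point at short-circuiting
streams in general rather than at the segment spliterator.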

Of course, I'm not saying that the patch (and the associated improvement) is unimportant, but I'm wondering
whether the benefit would be worth the added maintenance cost given that, from the numbers above, it still
doesn't look like `findAny` is the way to go for processing a segment efficiently anyway.

-------------

PR: https://git.openjdk.java.net/panama-foreign/pull/142