j.u.s.Stream and Fiber

Mon Sep 23 14:12:27 UTC 2019

I agree that this is a good time to bring this up, but I also want to set expectations that adjusting the stream execution model is a significant project.  The stream implementation is highly biased towards data-parallel computations; there are hand-coded parallel implementations of all the terminal operations that reflect these biases.  There is a significant API design project, as well as an implementation project, to extend streams to workloads that mix data parallelism and IO parallelism.

The opposite extreme from data-parallel is where the computation of each element requires IO, such as:

    Stream.of(document.imagesUrls())
        .map(Url::get)
        .collect(toList());

The current splitting heuristic is aimed at keeping the cores busy, but not oversubscribed; in the extreme IO-bound case, the resource we’re looking to keep busy-but-not-oversubscribed is IO capacity.  And if there is a mix of the two, we will want to mix accordingly.  
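
To make the IO-bound extreme concrete, here is a minimal sketch of what people tend to write today outside of parallel streams: one task per element, on a pool sized for IO capacity rather than core count. The names IoBoundFetch, fetch, mapBlocking and maxConcurrentIo are hypothetical; fetch just sleeps to stand in for a blocking call like Url::get. This is not a proposal for how streams should do it, only an illustration of the resource being managed.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

public class IoBoundFetch {
    // Hypothetical stand-in for Url::get; sleeps to simulate blocking IO.
    static String fetch(String url) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "body of " + url;
    }

    // Runs one task per element on a pool whose size caps concurrent IO
    // (an assumed tuning knob, unrelated to the number of cores).
    static <T, R> List<R> mapBlocking(List<T> inputs, Function<T, R> task, int maxConcurrentIo) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrentIo);
        try {
            // Submit everything first so the requests overlap...
            List<CompletableFuture<R>> futures = inputs.stream()
                    .map(in -> CompletableFuture.supplyAsync(() -> task.apply(in), pool))
                    .collect(Collectors.toList());
            // ...then wait for the results in encounter order.
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of("a.png", "b.png", "c.png");
        System.out.println(mapBlocking(urls, IoBoundFetch::fetch, 8));
    }
}
```

Note that the futures are collected into a list before any join; joining inside the same lazy pipeline would serialize the requests.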

One might think that the answer is to have some sort of pluggable Strategy pattern to select whether to split or compute, but here the physics are not with us; the indirection costs feed into the serial fraction of Amdahl’s law, so we want to keep this computation fast.  
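
As a rough illustration of why the serial fraction matters, the sketch below just evaluates Amdahl's law, speedup = 1 / (s + (1 - s) / n), for a few values of s. The class name and the specific fractions are made up for the example; they are not measurements of the stream implementation.

```java
public class AmdahlSketch {
    // Amdahl's law: speedup on `workers` workers when `serialFraction`
    // of the total work cannot be parallelized.
    static double speedup(double serialFraction, int workers) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / workers);
    }

    public static void main(String[] args) {
        int workers = 16;
        // Illustrative serial fractions only.
        for (double s : new double[] {0.01, 0.02, 0.05}) {
            System.out.printf("s=%.2f speedup=%.2f%n", s, speedup(s, workers));
        }
    }
}
```

With 16 workers, going from a 1% to a 5% serial fraction takes the best-case speedup from about 13.9x down to about 9.1x, which is why extra indirection in the split-or-compute decision is not free.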

> On Sep 23, 2019, at 9:55 AM, Alan Bateman <Alan.Bateman at oracle.com> wrote:
> 
> On 21/09/2019 09:02, Arkadiusz Gasiński wrote:
>> I actually meant (a) when I started this thread, but I think (b) is the next question to ask if we (I mean you :)) want to ever consider running parallel streams in fibers.
>> 
>> The use case that made me start this thread was: what would happen if I start processing a large number of files in a parallel stream in the context of a Fiber? I've finally found some time to check it and now I know that (at least in the current prototype) some processing will be done in the context of the fiber that triggered the terminal operation on the pipeline, but most work will actually be performed by ForkJoinPool.commonPool workers. I have to admit that I was a bit surprised by this - not saying this is good or bad, but for some reason I was expecting all processing to be done in the context of fiber(s).
>> 
> This is a great topic. There hasn't been any discussion in this project to date on this, mostly because everyone has been busy on the bring up of user mode threads and all the issues around that. So for now at least, a parallel stream will submit to the FJ common pool. That will be right for existing code operating on data, it might not be right if there are stream operations doing networking I/O. Whether the choice to use fibers is implicit or explicit isn't clear. There is a potential mini project here, esp. with Brian's comment that there may be different splitting choices.
> 
> Someone else brought up CompletableFuture, where the xxxAsync methods will use the FJ common pool when an Executor isn't specified. A suggestion at one point was to add variants that would schedule a fiber and that would at least be explicit in the API. This is another topic that will need attention at some point.
> 
> -Alan.
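
For anyone who wants to see where a parallel stream's work actually runs, here is a small sketch that records the names of the threads that execute the forEach body. It uses plain threads on a stock JDK, not the fiber prototype, and the class and method names are hypothetical. On a multi-core machine I would expect the set to typically contain the calling thread plus some ForkJoinPool.commonPool-worker-N threads, but the exact contents depend on the machine and the run.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

public class WhoRunsMyStream {
    // Returns the names of all threads that processed at least one element.
    static Set<String> threadsUsed(int elements) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        IntStream.range(0, elements)
                .parallel()
                .forEach(i -> names.add(Thread.currentThread().getName()));
        return names;
    }

    public static void main(String[] args) {
        // Output varies by machine and by run.
        System.out.println(threadsUsed(10_000));
    }
}
```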