Regression (b88): OOM on parallel limit operation

Brian Goetz brian.goetz at oracle.com
Fri May 10 10:43:05 PDT 2013


> public class OOM {
>      public static void main(String... ignored) {
>
>          // prints 200_000
>          System.out.println(java.util.stream.LongStream.iterate(1L, n -> n + 1L)
>                  .filter(l -> l % 100 == 0).limit(200_000).count());
>
>          // prints 100_000
>          System.out.println(java.util.stream.LongStream.iterate(1L, n -> n + 1L)
>                  .parallel()
>                  .filter(l -> l % 100 == 0).limit(100_000).count());
>
>          // immediate OOM
>          System.out.println(java.util.stream.LongStream.iterate(1L, n -> n + 1L)
>                  .parallel()
>                  .filter(l -> l % 100 == 0).limit(200_000).count());
>      }
> }

There are definitely things we need to tweak here.  But realistically, 
there are also things we just need to document, and count on users not 
to do silly things.  The above examples are, sadly, silly, though this 
is in large part our fault for not having enough documentation about 
what is going on to make the silliness more obvious.

There are two interacting factors that make this case silly.  The first 
is generating a stream with iteration, which is sequential and for which 
parallelism can only be reconstructed badly.

If a stream source is iterative, the split operation must be some form 
of (some, rest), and has very little chance of being balanced since you 
have no idea how many elements are on the right side.  This means that 
we generate right-heavy computation trees, which in the worst case look 
like:

  (a, (b, (c, (d, ....))))

which is far, far worse than sequential, because the computation is 
still fundamentally sequential, but now you're additionally multiplying 
the per-element work by a lot (because of the task and splitting 
overhead).  We can do somewhat better by splitting off more than one 
element at a time, but ultimately we're still stuck with heavily 
lopsided, deep trees.
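
To make that split shape concrete, here's a little throwaway demo (the 
class name is made up, and the batch size trySplit() peels off is an 
implementation detail, so treat the printed numbers as illustrative) 
that grabs the spliterator behind an iterate-generated stream and 
splits it once:

   import java.util.Spliterator;
   import java.util.stream.LongStream;

   public class IterateSplitShape {
       public static void main(String... ignored) {
           // The spliterator behind iterate() reports an unknown size
           Spliterator.OfLong rest
               = LongStream.iterate(1L, n -> n + 1L).spliterator();
           System.out.println(rest.estimateSize());   // Long.MAX_VALUE

           // trySplit() can only peel a finite prefix off the front;
           // the remainder is still the unbounded "rest", so repeated
           // splits build the right-heavy (a, (b, (c, ...))) tree
           Spliterator.OfLong prefix = rest.trySplit();
           if (prefix != null)
               System.out.println(prefix.estimateSize()); // small batch
           System.out.println(rest.estimateSize()); // still Long.MAX_VALUE
       }
   }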

On the other hand, if your stream is generated by a range, the source 
splits evenly (generating balanced trees) and preserves sizing 
information, which produces far better behaved parallel computation 
trees.  It would be great if the library could "just handle it" (and 
the library does just handle a lot), but when things start to interact 
with challenged ops like limit(), there's really no answer except "the 
user has to know what he's doing."
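
Here's the same one-split experiment against a ranged source (again a 
made-up demo class; the exact split proportions are an implementation 
detail, but for a sized range of this length you get roughly half on 
each side):

   import java.util.Spliterator;
   import java.util.stream.LongStream;

   public class RangeSplitShape {
       public static void main(String... ignored) {
           // A range knows its exact size and splits near the middle
           // (trySplit() null check elided; a range this big will split)
           Spliterator.OfLong right
               = LongStream.range(0, 1_000_000L).spliterator();
           Spliterator.OfLong left = right.trySplit();
           System.out.println(left.estimateSize());   // ~500000
           System.out.println(right.estimateSize());  // ~500000
           // Each half can split again, yielding a balanced tree
       }
   }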

The second factor is the nature of limit(), which is constrained to 
yield its results in encounter order.  Encounter-order-constrained 
parallel operations are always iffy; you want to use parallelism for 
cases like:

   source.filter(expensiveFn).limit(N)

to do the filtering in parallel, but the constraint to encounter order 
(assuming the source is ordered) is a killer, because it means you have 
to buffer everything until you've found at least N.
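
One escape hatch, when the user doesn't actually need encounter order, 
is to say so with unordered(), which frees limit() to deliver any N 
matching elements instead of the first N.  A sketch (deliberately using 
a range source here, since a lopsided iterate source is the separate 
problem described above):

   import java.util.stream.LongStream;

   public class UnorderedLimit {
       public static void main(String... ignored) {
           // unordered() drops the encounter-order constraint, so a
           // worker can contribute matches immediately instead of
           // buffering while earlier chunks complete
           long count = LongStream.range(1, 20_000_000L)
                   .parallel()
                   .unordered()
                   .filter(l -> l % 100 == 0)
                   .limit(100_000)
                   .count();
           System.out.println(count);  // 100000
       }
   }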

Combining these two -- a pathologically unbalanced source with the need 
to buffer the whole result -- is what leads to the OOME.

What's missing is enough of the above cost-model explanation in the 
docs to make it obvious to users why:

   iterate(0, f).filter(..).limit(BIG)

is silly, and why adding limit(BIG) to a parallel pipeline changes its 
behavior so drastically.
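
In the meantime, the practical advice for the example at the top is to 
feed the pipeline from a sized, well-splitting source when the bounds 
are known up front.  A rewrite of the third pipeline under that 
assumption (the bound 20_000_000 is chosen so that exactly 200_000 
multiples of 100 exist):

   import java.util.stream.LongStream;

   public class NoOOM {
       public static void main(String... ignored) {
           // Same answer as the OOM case, but the range source splits
           // evenly and carries size information: prints 200000
           System.out.println(LongStream.rangeClosed(1, 20_000_000L)
                   .parallel()
                   .filter(l -> l % 100 == 0)
                   .limit(200_000)
                   .count());
       }
   }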


