RFR: 8154387 - Parallel unordered Stream.limit() tries to collect 128 elements even if limit is less
Tagir F. Valeev
amaembo at gmail.com
Sat Apr 16 13:05:43 UTC 2016
Hello!
Please review and sponsor the following patch:
https://bugs.openjdk.java.net/browse/JDK-8154387
http://cr.openjdk.java.net/~tvaleev/webrev/8154387/r1/
The rationale is to speed up parallel processing of unordered
streams with a low limit value. Such a problem occurs when you want to
perform expensive filtering and select at most x elements that pass
the filter (order does not matter). Currently the unordered limit
operation buffers up to 128 elements for each parallel task before it
checks whether the limit is reached. This is harmful when the
requested limit is lower: many more elements are requested from the
upstream than necessary. Here's a simple JMH test which illustrates the
problem:
http://cr.openjdk.java.net/~tvaleev/webrev/8154387/jmh/
It extracts the requested number of probable primes from a list of
10000 BigInteger numbers. The results with 9ea+111:
Benchmark (limit) Mode Cnt Score Error Units
LimitTest.parLimit 2 avgt 30 108,971 ± 0,643 us/op
LimitTest.parLimit 20 avgt 30 934,176 ± 14,003 us/op
LimitTest.parLimit 200 avgt 30 8772,417 ± 190,609 us/op
LimitTest.parLimit 2000 avgt 30 41775,463 ± 1800,537 us/op
LimitTest.parUnorderedLimit 2 avgt 30 2557,798 ± 13,161 us/op
LimitTest.parUnorderedLimit 20 avgt 30 2578,283 ± 23,547 us/op
LimitTest.parUnorderedLimit 200 avgt 30 4577,318 ± 40,793 us/op
LimitTest.parUnorderedLimit 2000 avgt 30 12279,346 ± 523,823 us/op
LimitTest.seqLimit 2 avgt 30 34,831 ± 0,190 us/op
LimitTest.seqLimit 20 avgt 30 369,729 ± 1,427 us/op
LimitTest.seqLimit 200 avgt 30 3690,544 ± 13,907 us/op
LimitTest.seqLimit 2000 avgt 30 36681,637 ± 156,538 us/op
When the limit is 2 or 20, the parallel unordered version is slower
than the parallel ordered one! Even for limit = 200 it is still slower
than the sequential operation.
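For reference, the parUnorderedLimit pipeline being measured can be sketched roughly as follows (without the JMH harness; the data shape, random seed, and the isProbablePrime certainty parameter are my assumptions, not taken from the benchmark source):

```java
import java.math.BigInteger;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class LimitDemo {
    // 10000 random 64-bit BigIntegers, roughly the benchmark's data shape (assumed)
    static List<BigInteger> numbers(int n) {
        Random rnd = new Random(42);
        return IntStream.range(0, n)
                .mapToObj(i -> new BigInteger(64, rnd))
                .collect(Collectors.toList());
    }

    // parUnorderedLimit: expensive filter, order irrelevant, take at most `limit`
    static List<BigInteger> probablePrimes(List<BigInteger> src, int limit) {
        return src.parallelStream()
                .unordered()
                .filter(n -> n.isProbablePrime(50))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // ~227 of 10000 random 64-bit numbers are prime on average,
        // so a limit of 20 is essentially always reached
        System.out.println(probablePrimes(numbers(10_000), 20).size());
    }
}
```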
The idea of the patch is to tweak CHUNK_SIZE using the given limit and
the parallelism level. I used the following formula:
this.chunkSize = limit >= 0 ? (int)Math.min(CHUNK_SIZE,
(skip + limit) / ForkJoinPool.getCommonPoolParallelism() + 1) : CHUNK_SIZE;
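To illustrate the arithmetic, here is the same expression extracted into a standalone helper (the method name and the sample parallelism value of 8 are mine for illustration; CHUNK_SIZE = 128 is the current fixed buffer size mentioned above):

```java
public class ChunkSizeDemo {
    static final int CHUNK_SIZE = 128; // current per-task buffer size

    // hypothetical helper mirroring the patched expression
    static int chunkSize(long skip, long limit, int parallelism) {
        return limit >= 0
                ? (int) Math.min(CHUNK_SIZE, (skip + limit) / parallelism + 1)
                : CHUNK_SIZE;
    }

    public static void main(String[] args) {
        int p = 8; // e.g. common-pool parallelism on an 8-core machine (assumed)
        System.out.println(chunkSize(0, 2, p));    // (0+2)/8 + 1 = 1: tiny buffer
        System.out.println(chunkSize(0, 200, p));  // 200/8 + 1 = 26
        System.out.println(chunkSize(0, 2000, p)); // 251, capped at 128
        System.out.println(chunkSize(0, -1, p));   // no limit (skip mode): 128
    }
}
```

So a small limit shrinks the per-task buffer proportionally, while large limits and skip-only pipelines keep the current behavior.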
This does not affect cases where the limit is big or not set at all (in
skip mode). However, it greatly improves cases where the limit is small:
Benchmark (limit) Mode Cnt Score Error Units
LimitTest.parLimit 2 avgt 30 109,502 ± 0,750 us/op
LimitTest.parLimit 20 avgt 30 954,716 ± 39,276 us/op
LimitTest.parLimit 200 avgt 30 8706,226 ± 184,330 us/op
LimitTest.parLimit 2000 avgt 30 42126,346 ± 3163,444 us/op
LimitTest.parUnorderedLimit 2 avgt 30 39,303 ± 0,177 us/op !!!
LimitTest.parUnorderedLimit 20 avgt 30 266,107 ± 0,492 us/op !!!
LimitTest.parUnorderedLimit 200 avgt 30 2547,177 ± 58,538 us/op !!!
LimitTest.parUnorderedLimit 2000 avgt 30 12216,402 ± 430,574 us/op
LimitTest.seqLimit 2 avgt 30 34,993 ± 0,704 us/op
LimitTest.seqLimit 20 avgt 30 369,497 ± 1,754 us/op
LimitTest.seqLimit 200 avgt 30 3716,059 ± 61,054 us/op
LimitTest.seqLimit 2000 avgt 30 36814,356 ± 161,531 us/op
Here you can see that the unordered cases are significantly improved.
They are now always faster than parallel ordered, and faster than
sequential for limit >= 20.
I could not come up with a way to test this patch, as it does not
change visible behavior, only speed. However, all the existing tests pass.
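One possible observable proxy for a regression test (just a sketch of an idea, not part of the patch) would be to count how many elements the pipeline actually pulls from the upstream via peek(); with the fix, a small limit should pull far fewer elements than limit * CHUNK_SIZE, though the exact count is nondeterministic under parallel splitting:

```java
import java.math.BigInteger;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class UpstreamPullCount {
    // returns {result size, number of elements pulled from upstream}
    static int[] run(int limit) {
        Random rnd = new Random(1);
        List<BigInteger> numbers = IntStream.range(0, 10_000)
                .mapToObj(i -> new BigInteger(64, rnd))
                .collect(Collectors.toList());
        AtomicInteger pulls = new AtomicInteger();
        List<BigInteger> result = numbers.parallelStream()
                .unordered()
                .peek(n -> pulls.incrementAndGet()) // count upstream traffic
                .filter(n -> n.isProbablePrime(50))
                .limit(limit)
                .collect(Collectors.toList());
        return new int[]{result.size(), pulls.get()};
    }

    public static void main(String[] args) {
        int[] r = run(2);
        System.out.println(r[0] + " results, " + r[1] + " elements pulled");
    }
}
```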
What do you think?
With best regards,
Tagir Valeev.