RFR: 8154387 - Parallel unordered Stream.limit() tries to collect 128 elements even if limit is less

Mon Apr 18 11:12:58 UTC 2016

Hi Tagir,

nice catch. I think this optimization is worthwile.

I'm a bit surprised about the JMH results for limits 200 and 2000.

limit = 200 is significantly faster than the unpatched code (with
higher variance, though) and limit = 2000 is about the same, but
with a significantly reduced variance. Maybe you'd need to increase
the number of iterations / forks to get more stable results that
are in line with expectations - or do I miss something here?

Regards,
Stefan

2016-04-16 15:05 GMT+02:00 Tagir F. Valeev <amaembo at gmail.com>:
> Hello!
>
> Please review and sponsor the following patch:
> https://bugs.openjdk.java.net/browse/JDK-8154387
> http://cr.openjdk.java.net/~tvaleev/webrev/8154387/r1/
>
> The rationale is to speed-up the parallel processing for unordered
> streams with low limit value. Such problems occur when you want to
> perform expensive filtering and select at most x elements which pass
> the filter (order does not matter). Currently unordered limit
> operation buffers up to 128 elements for each parallel task before it
> checks whether limit is reached. This is actually harmful when
> requested limit is lower: much more elements are requested from the
> upstream than necessary. Here's simple JMH test which illustrates the
> problem:
>
> http://cr.openjdk.java.net/~tvaleev/webrev/8154387/jmh/
> It extracts the requested number of probable-primes from the list of
> 10000 BigInteger numbers. The results with 9ea+111:
>
> Benchmark                    (limit)  Mode  Cnt      Score      Error  Units
> LimitTest.parLimit                 2  avgt   30    108,971 ±    0,643  us/op
> LimitTest.parLimit                20  avgt   30    934,176 ±   14,003  us/op
> LimitTest.parLimit               200  avgt   30   8772,417 ±  190,609  us/op
> LimitTest.parLimit              2000  avgt   30  41775,463 ± 1800,537  us/op
> LimitTest.parUnorderedLimit        2  avgt   30   2557,798 ±   13,161  us/op
> LimitTest.parUnorderedLimit       20  avgt   30   2578,283 ±   23,547  us/op
> LimitTest.parUnorderedLimit      200  avgt   30   4577,318 ±   40,793  us/op
> LimitTest.parUnorderedLimit     2000  avgt   30  12279,346 ±  523,823  us/op
> LimitTest.seqLimit                 2  avgt   30     34,831 ±    0,190  us/op
> LimitTest.seqLimit                20  avgt   30    369,729 ±    1,427  us/op
> LimitTest.seqLimit               200  avgt   30   3690,544 ±   13,907  us/op
> LimitTest.seqLimit              2000  avgt   30  36681,637 ±  156,538  us/op
>
> When the limit is 2 or 20, parallel unordered version is slower than
> parallel ordered! Even for limit = 200 it's still slower than
> sequential operation.
>
> The idea of the patch is to tweak the CHUNK_SIZE using the given limit and
> parallelism level. I used the following formula:
>
> this.chunkSize = limit >= 0 ? (int)Math.min(CHUNK_SIZE,
>      (skip + limit) / ForkJoinPool.getCommonPoolParallelism() + 1) : CHUNK_SIZE;
>
> This does not affect cases when limit is big or not set at all (in
> skip mode). However it greatly improves cases when limit is small:
>
> Benchmark                    (limit)  Mode  Cnt      Score      Error  Units
> LimitTest.parLimit                 2  avgt   30    109,502 ±    0,750  us/op
> LimitTest.parLimit                20  avgt   30    954,716 ±   39,276  us/op
> LimitTest.parLimit               200  avgt   30   8706,226 ±  184,330  us/op
> LimitTest.parLimit              2000  avgt   30  42126,346 ± 3163,444  us/op
> LimitTest.parUnorderedLimit        2  avgt   30     39,303 ±    0,177  us/op !!!
> LimitTest.parUnorderedLimit       20  avgt   30    266,107 ±    0,492  us/op !!!
> LimitTest.parUnorderedLimit      200  avgt   30   2547,177 ±   58,538  us/op !!!
> LimitTest.parUnorderedLimit     2000  avgt   30  12216,402 ±  430,574  us/op
> LimitTest.seqLimit                 2  avgt   30     34,993 ±    0,704  us/op
> LimitTest.seqLimit                20  avgt   30    369,497 ±    1,754  us/op
> LimitTest.seqLimit               200  avgt   30   3716,059 ±   61,054  us/op
> LimitTest.seqLimit              2000  avgt   30  36814,356 ±  161,531  us/op
>
> Here you can see that unordered cases are significantly improved. Now
> they are always faster than parallel ordered and faster than
> sequential for limit >= 20.
>
> I did not think up how to test this patch as it does not change
> visible behavior, only speed. However all the existing tests pass.
>
> What do you think?
>
> With best regards,
> Tagir Valeev.
>