RFR: 8154387 - Parallel unordered Stream.limit() tries to collect 128 elements even if limit is less
Tagir F. Valeev
amaembo at gmail.com
Sat Apr 16 13:05:43 UTC 2016
Hello!
Please review and sponsor the following patch:
https://bugs.openjdk.java.net/browse/JDK-8154387
http://cr.openjdk.java.net/~tvaleev/webrev/8154387/r1/
The rationale is to speed up parallel processing of unordered
streams with a low limit value. Such a problem occurs when you want to
perform expensive filtering and select at most x elements that pass
the filter (order does not matter). Currently the unordered limit
operation buffers up to 128 elements for each parallel task before it
checks whether the limit is reached. This is harmful when the
requested limit is lower: many more elements are requested from the
upstream than necessary. Here's a simple JMH test which illustrates the
problem:
http://cr.openjdk.java.net/~tvaleev/webrev/8154387/jmh/
It extracts the requested number of probable primes from a list of
10000 BigInteger numbers. The results with 9ea+111:
Benchmark (limit) Mode Cnt Score Error Units
LimitTest.parLimit 2 avgt 30 108,971 ± 0,643 us/op
LimitTest.parLimit 20 avgt 30 934,176 ± 14,003 us/op
LimitTest.parLimit 200 avgt 30 8772,417 ± 190,609 us/op
LimitTest.parLimit 2000 avgt 30 41775,463 ± 1800,537 us/op
LimitTest.parUnorderedLimit 2 avgt 30 2557,798 ± 13,161 us/op
LimitTest.parUnorderedLimit 20 avgt 30 2578,283 ± 23,547 us/op
LimitTest.parUnorderedLimit 200 avgt 30 4577,318 ± 40,793 us/op
LimitTest.parUnorderedLimit 2000 avgt 30 12279,346 ± 523,823 us/op
LimitTest.seqLimit 2 avgt 30 34,831 ± 0,190 us/op
LimitTest.seqLimit 20 avgt 30 369,729 ± 1,427 us/op
LimitTest.seqLimit 200 avgt 30 3690,544 ± 13,907 us/op
LimitTest.seqLimit 2000 avgt 30 36681,637 ± 156,538 us/op
When the limit is 2 or 20, the parallel unordered version is slower
than the parallel ordered one! Even for limit = 200 it is still slower
than the sequential operation.
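For reference, the parUnorderedLimit pipeline being measured can be sketched roughly as follows (without the JMH harness; the data shape, random seed, and the isProbablePrime certainty parameter are my assumptions, not taken from the benchmark source):

```java
import java.math.BigInteger;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class LimitDemo {
    // 10000 random 64-bit BigIntegers, roughly the benchmark's data shape (assumed)
    static List<BigInteger> numbers(int n) {
        Random rnd = new Random(42);
        return IntStream.range(0, n)
                .mapToObj(i -> new BigInteger(64, rnd))
                .collect(Collectors.toList());
    }

    // parUnorderedLimit: expensive filter, order irrelevant, take at most `limit`
    static List<BigInteger> probablePrimes(List<BigInteger> src, int limit) {
        return src.parallelStream()
                .unordered()
                .filter(n -> n.isProbablePrime(50))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // ~227 of 10000 random 64-bit numbers are prime on average,
        // so a limit of 20 is essentially always reached
        System.out.println(probablePrimes(numbers(10_000), 20).size());
    }
}
```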
The idea of the patch is to tweak CHUNK_SIZE using the given limit and
the parallelism level. I used the following formula:
this.chunkSize = limit >= 0 ? (int)Math.min(CHUNK_SIZE,
(skip + limit) / ForkJoinPool.getCommonPoolParallelism() + 1) : CHUNK_SIZE;
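To illustrate the arithmetic, here is the same expression extracted into a standalone helper (the method name and the sample parallelism value of 8 are mine for illustration; CHUNK_SIZE = 128 is the current fixed buffer size mentioned above):

```java
public class ChunkSizeDemo {
    static final int CHUNK_SIZE = 128; // current per-task buffer size

    // hypothetical helper mirroring the patched expression
    static int chunkSize(long skip, long limit, int parallelism) {
        return limit >= 0
                ? (int) Math.min(CHUNK_SIZE, (skip + limit) / parallelism + 1)
                : CHUNK_SIZE;
    }

    public static void main(String[] args) {
        int p = 8; // e.g. common-pool parallelism on an 8-core machine (assumed)
        System.out.println(chunkSize(0, 2, p));    // (0+2)/8 + 1 = 1: tiny buffer
        System.out.println(chunkSize(0, 200, p));  // 200/8 + 1 = 26
        System.out.println(chunkSize(0, 2000, p)); // 251, capped at 128
        System.out.println(chunkSize(0, -1, p));   // no limit (skip mode): 128
    }
}
```

So a small limit shrinks the per-task buffer proportionally, while large limits and skip-only pipelines keep the current behavior.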
This does not affect cases where the limit is big or not set at all (in
skip mode). However, it greatly improves cases where the limit is small:
Benchmark (limit) Mode Cnt Score Error Units
LimitTest.parLimit 2 avgt 30 109,502 ± 0,750 us/op
LimitTest.parLimit 20 avgt 30 954,716 ± 39,276 us/op
LimitTest.parLimit 200 avgt 30 8706,226 ± 184,330 us/op
LimitTest.parLimit 2000 avgt 30 42126,346 ± 3163,444 us/op
LimitTest.parUnorderedLimit 2 avgt 30 39,303 ± 0,177 us/op !!!
LimitTest.parUnorderedLimit 20 avgt 30 266,107 ± 0,492 us/op !!!
LimitTest.parUnorderedLimit 200 avgt 30 2547,177 ± 58,538 us/op !!!
LimitTest.parUnorderedLimit 2000 avgt 30 12216,402 ± 430,574 us/op
LimitTest.seqLimit 2 avgt 30 34,993 ± 0,704 us/op
LimitTest.seqLimit 20 avgt 30 369,497 ± 1,754 us/op
LimitTest.seqLimit 200 avgt 30 3716,059 ± 61,054 us/op
LimitTest.seqLimit 2000 avgt 30 36814,356 ± 161,531 us/op
Here you can see that the unordered cases are significantly improved.
They are now always faster than parallel ordered, and faster than
sequential for limit >= 20.
I could not come up with a way to test this patch, as it does not
change visible behavior, only speed. However, all the existing tests pass.
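One possible observable proxy for a regression test (just a sketch of an idea, not part of the patch) would be to count how many elements the pipeline actually pulls from the upstream via peek(); with the fix, a small limit should pull far fewer elements than limit * CHUNK_SIZE, though the exact count is nondeterministic under parallel splitting:

```java
import java.math.BigInteger;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class UpstreamPullCount {
    // returns {result size, number of elements pulled from upstream}
    static int[] run(int limit) {
        Random rnd = new Random(1);
        List<BigInteger> numbers = IntStream.range(0, 10_000)
                .mapToObj(i -> new BigInteger(64, rnd))
                .collect(Collectors.toList());
        AtomicInteger pulls = new AtomicInteger();
        List<BigInteger> result = numbers.parallelStream()
                .unordered()
                .peek(n -> pulls.incrementAndGet()) // count upstream traffic
                .filter(n -> n.isProbablePrime(50))
                .limit(limit)
                .collect(Collectors.toList());
        return new int[]{result.size(), pulls.get()};
    }

    public static void main(String[] args) {
        int[] r = run(2);
        System.out.println(r[0] + " results, " + r[1] + " elements pulled");
    }
}
```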
What do you think?
With best regards,
Tagir Valeev.