RFR: 8280915: Better parallelization for AbstractSpliterator and IteratorSpliterator when size is unknown [v4]
Tagir F.Valeev
tvaleev at openjdk.java.net
Thu Feb 10 04:30:36 UTC 2022
On Thu, 10 Feb 2022 03:22:42 GMT, Tagir F. Valeev <tvaleev at openjdk.org> wrote:
>> See the bug description for details.
>>
>> I propose a simple solution. Let's allow ArraySpliterator to be non-SIZED and report artificial estimatedSize(), much bigger than the real one. This will allow AbstractSpliterator and IteratorSpliterator to produce prefix whose size is comparable to Long.MAX_VALUE (say, starting with Long.MAX_VALUE/2), and this will enable further splitting of the prefix. This change will drastically improve parallel streaming for affected streams of size <= 1024 and significantly improve for streams of size 1025..20000. The cost is higher-grained splitting for huge streams of unknown size. This might add a minor overhead for such scenarios which, I believe, is completely tolerable.
>>
>> No public API changes are necessary, sequential processing should not be affected, except an extra field in ArraySpliterator which increases a footprint by 8 bytes.
>>
>> I added a simple test using an artificial collector to ensure that at least two non-empty parts are created when parallelizing Stream.iterate source. More testing ideas are welcome.
>
> Tagir F. Valeev has updated the pull request incrementally with one additional commit since the last revision:
>
> Benchmark to demonstrate the patch usefulness
I added a microbenchmark to demonstrate the speed improvement achievable by this patch. For demonstration, I used `Pattern.splitAsStream` source (though as listed in JDK-8280915, many other standard sources are also affected). For downstream operation I selected `BigInteger.pow(1000)` which is moderately CPU-intensive (takes ~10us on my machine for input numbers within 1000..2000). The benchmarking results are as follows (my machine has 8 cores). Here's non-patched version:
Benchmark (size) Mode Cnt Score Error Units
sumOf1000thPowers 10 avgt 30 100.367 ± 0.793 us/op
sumOf1000thPowers 100 avgt 30 963.857 ± 6.252 us/op
sumOf1000thPowers 1000 avgt 30 10012.923 ± 81.091 us/op
sumOf1000thPowers 10000 avgt 30 99546.370 ± 625.105 us/op
sumOf1000thPowersParallel 10 avgt 30 104.561 ± 1.118 us/op
sumOf1000thPowersParallel 100 avgt 30 995.400 ± 26.116 us/op
sumOf1000thPowersParallel 1000 avgt 30 9969.296 ± 85.166 us/op
sumOf1000thPowersParallel 10000 avgt 30 55856.516 ± 2315.530 us/op
We observe that the results depend on the input size linearly for sequential run, which is quite expected. Parallel doesn't help at all for size = 10, 100, and 1000, which validates my claim that these spliterators cannot split at all for sizes <= 1024. For size = 10000, parallel version starts performing somewhat better (1.78x), though it's not nearly close to the available machine parallelism.
Here's patched version:
Benchmark (size) Mode Cnt Score Error Units
sumOf1000thPowers 10 avgt 30 101.380 ± 0.961 us/op
sumOf1000thPowers 100 avgt 30 970.843 ± 8.360 us/op
sumOf1000thPowers 1000 avgt 30 9837.397 ± 93.656 us/op
sumOf1000thPowers 10000 avgt 30 97741.823 ± 1276.116 us/op
sumOf1000thPowersParallel 10 avgt 30 41.597 ± 0.910 us/op
sumOf1000thPowersParallel 100 avgt 30 189.735 ± 1.911 us/op
sumOf1000thPowersParallel 1000 avgt 30 1776.337 ± 31.034 us/op
sumOf1000thPowersParallel 10000 avgt 30 17685.723 ± 127.846 us/op
We observe no regression for sequential run and drastic improvement for parallel. Even for size = 10, we see 2.46x speedup and 41 us total time. For bigger sizes, we see 5.11x..5.54x speedup.
Please review my patch. I'll be happy to discuss any concerns about this optimization you may have. Thanks in advance!
-------------
PR: https://git.openjdk.java.net/jdk/pull/7279
More information about the core-libs-dev
mailing list