RFR: 8280915: Better parallelization for AbstractSpliterator and IteratorSpliterator when size is unknown [v4]
Paul Sandoz
psandoz at openjdk.java.net
Thu Mar 3 17:49:05 UTC 2022
On Thu, 10 Feb 2022 04:22:34 GMT, Tagir F. Valeev <tvaleev at openjdk.org> wrote:
>> Tagir F. Valeev has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Benchmark to demonstrate the patch usefulness
>
> I added a microbenchmark to demonstrate the speed improvement achievable by this patch. For demonstration, I used `Pattern.splitAsStream` source (though as listed in JDK-8280915, many other standard sources are also affected). For downstream operation I selected `BigInteger.pow(1000)` which is moderately CPU-intensive (takes ~10us on my machine for input numbers within 1000..2000). The benchmarking results are as follows (my machine has 8 cores). Here's non-patched version:
>
>
> Benchmark (size) Mode Cnt Score Error Units
> sumOf1000thPowers 10 avgt 30 100.367 ± 0.793 us/op
> sumOf1000thPowers 100 avgt 30 963.857 ± 6.252 us/op
> sumOf1000thPowers 1000 avgt 30 10012.923 ± 81.091 us/op
> sumOf1000thPowers 10000 avgt 30 99546.370 ± 625.105 us/op
> sumOf1000thPowersParallel 10 avgt 30 104.561 ± 1.118 us/op
> sumOf1000thPowersParallel 100 avgt 30 995.400 ± 26.116 us/op
> sumOf1000thPowersParallel 1000 avgt 30 9969.296 ± 85.166 us/op
> sumOf1000thPowersParallel 10000 avgt 30 55856.516 ± 2315.530 us/op
>
>
> We observe that the results depend on the input size linearly for sequential run, which is quite expected. Parallel doesn't help at all for size = 10, 100, and 1000, which validates my claim that these spliterators cannot split at all for sizes <= 1024. For size = 10000, parallel version starts performing somewhat better (1.78x), though it's not nearly close to the available machine parallelism.
>
> Here's patched version:
>
>
> Benchmark (size) Mode Cnt Score Error Units
> sumOf1000thPowers 10 avgt 30 101.380 ± 0.961 us/op
> sumOf1000thPowers 100 avgt 30 970.843 ± 8.360 us/op
> sumOf1000thPowers 1000 avgt 30 9837.397 ± 93.656 us/op
> sumOf1000thPowers 10000 avgt 30 97741.823 ± 1276.116 us/op
> sumOf1000thPowersParallel 10 avgt 30 41.597 ± 0.910 us/op
> sumOf1000thPowersParallel 100 avgt 30 189.735 ± 1.911 us/op
> sumOf1000thPowersParallel 1000 avgt 30 1776.337 ± 31.034 us/op
> sumOf1000thPowersParallel 10000 avgt 30 17685.723 ± 127.846 us/op
>
>
> We observe no regression for sequential run and drastic improvement for parallel. Even for size = 10, we see 2.46x speedup and 41 us total time. For bigger sizes, we see 5.11x..5.54x speedup.
>
> Please review my patch. I'll be happy to discuss any concerns about this optimization you may have. Thanks in advance!
> @amaembo It's on my queue to review, i may get to it tomorrow, otherwise next week.
I took last week off at short notice, since it was the school holidays. Looking today.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7279
More information about the core-libs-dev
mailing list