RFR: 8280915: Better parallelization for AbstractSpliterator and IteratorSpliterator when size is unknown [v4]

Thu Feb 10 17:16:09 UTC 2022

On Thu, 10 Feb 2022 04:22:34 GMT, Tagir F. Valeev <tvaleev at openjdk.org> wrote:

>> Tagir F. Valeev has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Benchmark to demonstrate the patch usefulness
>
> I added a microbenchmark to demonstrate the speed improvement achievable by this patch. For demonstration, I used `Pattern.splitAsStream` source (though as listed in JDK-8280915, many other standard sources are also affected). For downstream operation I selected `BigInteger.pow(1000)` which is moderately CPU-intensive (takes ~10us on my machine for input numbers within 1000..2000). The benchmarking results are as follows (my machine has 8 cores). Here's non-patched version:
> 
> 
> Benchmark                  (size)  Mode  Cnt      Score      Error  Units
> sumOf1000thPowers              10  avgt   30    100.367 ±    0.793  us/op
> sumOf1000thPowers             100  avgt   30    963.857 ±    6.252  us/op
> sumOf1000thPowers            1000  avgt   30  10012.923 ±   81.091  us/op
> sumOf1000thPowers           10000  avgt   30  99546.370 ±  625.105  us/op
> sumOf1000thPowersParallel      10  avgt   30    104.561 ±    1.118  us/op
> sumOf1000thPowersParallel     100  avgt   30    995.400 ±   26.116  us/op
> sumOf1000thPowersParallel    1000  avgt   30   9969.296 ±   85.166  us/op
> sumOf1000thPowersParallel   10000  avgt   30  55856.516 ± 2315.530  us/op
> 
> 
> We observe that the results depend on the input size linearly for sequential run, which is quite expected. Parallel doesn't help at all for size = 10, 100, and 1000, which validates my claim that these spliterators cannot split at all for sizes <= 1024. For size = 10000, parallel version starts performing somewhat better (1.78x), though it's not nearly close to the available machine parallelism.
> 
> Here's patched version:
> 
> 
> Benchmark                  (size)  Mode  Cnt      Score      Error  Units
> sumOf1000thPowers              10  avgt   30    101.380 ±    0.961  us/op
> sumOf1000thPowers             100  avgt   30    970.843 ±    8.360  us/op
> sumOf1000thPowers            1000  avgt   30   9837.397 ±   93.656  us/op
> sumOf1000thPowers           10000  avgt   30  97741.823 ± 1276.116  us/op
> sumOf1000thPowersParallel      10  avgt   30     41.597 ±    0.910  us/op
> sumOf1000thPowersParallel     100  avgt   30    189.735 ±    1.911  us/op
> sumOf1000thPowersParallel    1000  avgt   30   1776.337 ±   31.034  us/op
> sumOf1000thPowersParallel   10000  avgt   30  17685.723 ±  127.846  us/op
> 
> 
> We observe no regression for sequential run and drastic improvement for parallel. Even for size = 10, we see 2.46x speedup and 41 us total time. For bigger sizes, we see 5.11x..5.54x speedup.
> 
> Please review my patch. I'll be happy to discuss any concerns about this optimization you may have. Thanks in advance!

@amaembo It's on my queue to review, i may get to it tomorrow, otherwise next week.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7279