RFR: 8280915: Better parallelization for AbstractSpliterator and IteratorSpliterator when size is unknown [v4]

Thu Feb 10 04:30:36 UTC 2022

On Thu, 10 Feb 2022 03:22:42 GMT, Tagir F. Valeev <tvaleev at openjdk.org> wrote:

>> See the bug description for details.
>> 
>> I propose a simple solution. Let's allow ArraySpliterator to be non-SIZED and report artificial estimatedSize(), much bigger than the real one. This will allow AbstractSpliterator and IteratorSpliterator to produce prefix whose size is comparable to Long.MAX_VALUE (say, starting with Long.MAX_VALUE/2), and this will enable further splitting of the prefix. This change will drastically improve parallel streaming for affected streams of size <= 1024 and significantly improve for streams of size 1025..20000. The cost is higher-grained splitting for huge streams of unknown size. This might add a minor overhead for such scenarios which, I believe, is completely tolerable.
>> 
>> No public API changes are necessary, sequential processing should not be affected, except an extra field in ArraySpliterator which increases a footprint by 8 bytes.
>> 
>> I added a simple test using an artificial collector to ensure that at least two non-empty parts are created when parallelizing Stream.iterate source. More testing ideas are welcome.
>
> Tagir F. Valeev has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Benchmark to demonstrate the patch usefulness

I added a microbenchmark to demonstrate the speed improvement achievable by this patch. For demonstration, I used `Pattern.splitAsStream` source (though as listed in JDK-8280915, many other standard sources are also affected). For downstream operation I selected `BigInteger.pow(1000)` which is moderately CPU-intensive (takes ~10us on my machine for input numbers within 1000..2000). The benchmarking results are as follows (my machine has 8 cores). Here's non-patched version:

Benchmark                  (size)  Mode  Cnt      Score      Error  Units
sumOf1000thPowers              10  avgt   30    100.367 ±    0.793  us/op
sumOf1000thPowers             100  avgt   30    963.857 ±    6.252  us/op
sumOf1000thPowers            1000  avgt   30  10012.923 ±   81.091  us/op
sumOf1000thPowers           10000  avgt   30  99546.370 ±  625.105  us/op
sumOf1000thPowersParallel      10  avgt   30    104.561 ±    1.118  us/op
sumOf1000thPowersParallel     100  avgt   30    995.400 ±   26.116  us/op
sumOf1000thPowersParallel    1000  avgt   30   9969.296 ±   85.166  us/op
sumOf1000thPowersParallel   10000  avgt   30  55856.516 ± 2315.530  us/op

We observe that the results depend on the input size linearly for sequential run, which is quite expected. Parallel doesn't help at all for size = 10, 100, and 1000, which validates my claim that these spliterators cannot split at all for sizes <= 1024. For size = 10000, parallel version starts performing somewhat better (1.78x), though it's not nearly close to the available machine parallelism.

Here's patched version:

Benchmark                  (size)  Mode  Cnt      Score      Error  Units
sumOf1000thPowers              10  avgt   30    101.380 ±    0.961  us/op
sumOf1000thPowers             100  avgt   30    970.843 ±    8.360  us/op
sumOf1000thPowers            1000  avgt   30   9837.397 ±   93.656  us/op
sumOf1000thPowers           10000  avgt   30  97741.823 ± 1276.116  us/op
sumOf1000thPowersParallel      10  avgt   30     41.597 ±    0.910  us/op
sumOf1000thPowersParallel     100  avgt   30    189.735 ±    1.911  us/op
sumOf1000thPowersParallel    1000  avgt   30   1776.337 ±   31.034  us/op
sumOf1000thPowersParallel   10000  avgt   30  17685.723 ±  127.846  us/op

We observe no regression for sequential run and drastic improvement for parallel. Even for size = 10, we see 2.46x speedup and 41 us total time. For bigger sizes, we see 5.11x..5.54x speedup.

Please review my patch. I'll be happy to discuss any concerns about this optimization you may have. Thanks in advance!

-------------

PR: https://git.openjdk.java.net/jdk/pull/7279