RFR: JDK-8277175: Add a parallel multiply method to BigInteger [v5]
kabutz
duke at openjdk.java.net
Wed Nov 24 16:27:05 UTC 2021
On Wed, 24 Nov 2021 15:20:42 GMT, kabutz <duke at openjdk.java.net> wrote:
>> BigInteger currently uses three different algorithms for multiply: the simple quadratic ("schoolbook") algorithm, then the slightly better Karatsuba once the operands exceed a bit-count threshold, and then Toom-Cook 3 once we get into several thousands of bits. Since Toom-Cook 3 is a recursive algorithm, it is trivial to parallelize. I have demonstrated this several times in conference talks. To be consistent with other classes such as Arrays and Collection, I have added a parallelMultiply() method. Internally we have added a parameter to the private multiply method to indicate whether the calculation should be done in parallel.
>>
>> The performance improvements are as expected. Fibonacci of 100 million (using a single-threaded Dijkstra's sum-of-squares version) completes in 9.2 seconds with parallelMultiply() vs 25.3 seconds with the sequential multiply() method. This is on my 1-8-2 laptop. The final multiplications involve very large numbers, which benefit from the parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084-bit number.
>>
>> We have also parallelized the private square() method. Internally, square() defaults to sequential execution.
>>
>> Some benchmark results, run on my 1-6-2 server:
>>
>>
>> Benchmark                                          (n)  Mode  Cnt      Score      Error  Units
>> BigIntegerParallelMultiply.multiply            1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply           10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply          100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   9076.551 ± 1899.444  ms/op
>>
>>
>> We can see that for larger calculations (fib 100m) the execution is 2.7x faster in parallel, for medium sizes (fib 10m) it is 1.873x faster, and for small ones (fib 1m) it is roughly the same. Considering that the Fibonacci algorithm we used is itself sequential, and that the last three multiplications dominate the run time, 2.7x faster should probably be considered quite good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit since the last revision:
>
> Made forkOrInvoke() method protected to avoid strange compiler error
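For context on the quoted benchmark: "Dijkstra's sum of squares" Fibonacci is the fast-doubling scheme, whose two large multiplications per step are exactly what a parallel multiply accelerates. A minimal sketch (class and method names are ours for illustration; it uses the existing multiply(), since parallelMultiply() is what this PR adds):

```java
import java.math.BigInteger;

public class Fib {
    // Fast doubling via Dijkstra's identities:
    //   F(2k)   = F(k) * (2*F(k+1) - F(k))
    //   F(2k+1) = F(k)^2 + F(k+1)^2
    // With this PR applied, the multiply() calls on large operands
    // could be swapped for parallelMultiply().
    static BigInteger fib(long n) {
        BigInteger a = BigInteger.ZERO; // F(0)
        BigInteger b = BigInteger.ONE;  // F(1)
        // Walk the bits of n from most significant to least significant.
        for (int i = 63 - Long.numberOfLeadingZeros(Math.max(n, 1)); i >= 0; i--) {
            BigInteger d = a.multiply(b.shiftLeft(1).subtract(a)); // F(2k)
            BigInteger e = a.multiply(a).add(b.multiply(b));       // F(2k+1)
            a = d;
            b = e;
            if (((n >> i) & 1) == 1) { // odd bit: advance one index
                BigInteger t = a.add(b);
                a = b;
                b = t;
            }
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(fib(100)); // 354224848179261915075
    }
}
```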
Based on the excellent feedback in this thread, I have added a depth threshold for parallelMultiply(). It defaults to log2 of the number of processors:

    (int) Math.ceil(Math.log(Runtime.getRuntime().availableProcessors()) / Math.log(2))

It can be overridden with `-Djava.math.BigInteger.parallelForkThreshold=num`. We can also influence the processor count with `-XX:ActiveProcessorCount=num`, but that has a wider impact.
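Spelled out as a small helper (the method name is ours, purely for illustration), the default depth works out like this:

```java
public class ForkDepth {
    // Default parallel fork depth: ceil(log2(processors)).
    static int defaultForkDepth(int processors) {
        return (int) Math.ceil(Math.log(processors) / Math.log(2));
    }

    public static void main(String[] args) {
        // On a 1-6-2 machine (12 hardware threads): ceil(log2(12)) = 4
        System.out.println(defaultForkDepth(12)); // 4
        // With -XX:ActiveProcessorCount=1: ceil(log2(1)) = 0, i.e. no forking
        System.out.println(defaultForkDepth(1));  // 0
    }
}
```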
Here are results for running the benchmark `BigIntegerParallelMultiply` out of the box on my 1-6-2 server:
Benchmark                                          (n)  Mode  Cnt      Score      Error  Units
BigIntegerParallelMultiply.multiply            1000000    ss    4     77.314 ±   35.690  ms/op
BigIntegerParallelMultiply.multiply           10000000    ss    4    938.650 ±   69.232  ms/op
BigIntegerParallelMultiply.multiply          100000000    ss    4  24811.339 ± 1457.943  ms/op
BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     67.550 ±   39.059  ms/op
BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    489.738 ±  140.965  ms/op
BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   8814.412 ±  542.567  ms/op
Setting ActiveProcessorCount to 1 means that we do not fork at all:
Benchmark                                          (n)  Mode  Cnt      Score      Error  Units
BigIntegerParallelMultiply.multiply            1000000    ss    4     72.752 ±   26.336  ms/op
BigIntegerParallelMultiply.multiply           10000000    ss    4    949.417 ±   19.686  ms/op
BigIntegerParallelMultiply.multiply          100000000    ss    4  24530.183 ± 1319.775  ms/op
BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     71.492 ±   21.860  ms/op
BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    926.025 ±   41.539  ms/op
BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4  23947.256 ±   90.076  ms/op
Similarly, when we set `-Djava.math.BigInteger.parallelForkThreshold=0` we do not fork either:
Benchmark                                          (n)  Mode  Cnt      Score      Error  Units
BigIntegerParallelMultiply.multiply            1000000    ss    4     73.121 ±   30.315  ms/op
BigIntegerParallelMultiply.multiply           10000000    ss    4    931.979 ±   34.988  ms/op
BigIntegerParallelMultiply.multiply          100000000    ss    4  24779.179 ± 5167.362  ms/op
BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     72.280 ±   32.011  ms/op
BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    957.168 ±   37.957  ms/op
BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4  25032.332 ±  994.664  ms/op
On the 1-6-2 machine, the maximum number of tasks forked in parallel with our Fibonacci algorithm is 468 (4 x 9 x 13); the parallel recursion depth is 4 in this case. Note that before this change we could get millions of forks, depending on the size of the numbers being multiplied.
Once we get to Toom-Cook 3, we are multiplying very large numbers. Also, for convenience we create RecursiveTasks whether we fork or invoke directly. This means that object allocation goes up slightly (less than 1%) when doing sequential multiply() on very large numbers.
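The fork-capping scheme can be illustrated with a generic RecursiveTask (a sketch of the idea only, not the JDK code; all names are ours): tasks fork while their depth is below the threshold, and anything deeper computes directly on the current thread, so the total number of forks stays bounded regardless of operand size.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class DepthLimited {
    // Same default as described above: ceil(log2(available processors)).
    static final int THRESHOLD = (int) Math.ceil(
            Math.log(Runtime.getRuntime().availableProcessors()) / Math.log(2));

    // Sums the range [lo, hi), forking only while depth < THRESHOLD.
    static class SumTask extends RecursiveTask<Long> {
        final long lo, hi;
        final int depth;

        SumTask(long lo, long hi, int depth) {
            this.lo = lo; this.hi = hi; this.depth = depth;
        }

        @Override
        protected Long compute() {
            if (hi - lo <= 1_000) {          // small enough: sum sequentially
                long s = 0;
                for (long i = lo; i < hi; i++) s += i;
                return s;
            }
            long mid = (lo + hi) >>> 1;
            SumTask left = new SumTask(lo, mid, depth + 1);
            SumTask right = new SumTask(mid, hi, depth + 1);
            if (depth < THRESHOLD) {         // shallow: fork one half
                left.fork();
                long r = right.compute();
                return left.join() + r;
            } else {                         // too deep: stay on this thread
                return left.compute() + right.compute();
            }
        }
    }

    static long sum(long lo, long hi) {
        return ForkJoinPool.commonPool().invoke(new SumTask(lo, hi, 0));
    }

    public static void main(String[] args) {
        System.out.println(sum(0, 1_000_000)); // 499999500000
    }
}
```

With ActiveProcessorCount=1 the threshold is 0, every task takes the invoke-directly branch, and the whole computation runs sequentially, which matches the benchmark results above.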
-------------
PR: https://git.openjdk.java.net/jdk/pull/6409
More information about the core-libs-dev mailing list