RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v5]
kabutz
duke at openjdk.java.net
Wed Dec 15 20:15:01 UTC 2021
On Wed, 24 Nov 2021 15:20:42 GMT, kabutz <duke at openjdk.java.net> wrote:
>> BigInteger currently uses three different algorithms for multiply. The simple quadratic algorithm, then the slightly better Karatsuba if we exceed a bit count and then Toom Cook 3 once we go into the several thousands of bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to parallelize it. I have demonstrated this several times in conference talks. In order to be consistent with other classes such as Arrays and Collection, I have added a parallelMultiply() method. Internally we have added a parameter to the private multiply method to indicate whether the calculation should be done in parallel.
>>
>> The performance improvements are as should be expected. Fibonacci of 100 million (using a single-threaded Dijkstra's sum of squares version) completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with the sequential multiply() method. This is on my 1-8-2 laptop. The final multiplications are with very large numbers, which then benefit from the parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>>
>> We have also parallelized the private square() method. Internally, the square() method defaults to be sequential.
>>
>> Some benchmark results, run on my 1-6-2 server:
>>
>>
>> Benchmark (n) Mode Cnt Score Error Units
>> BigIntegerParallelMultiply.multiply 1000000 ss 4 51.707 ± 11.194 ms/op
>> BigIntegerParallelMultiply.multiply 10000000 ss 4 988.302 ± 235.977 ms/op
>> BigIntegerParallelMultiply.multiply 100000000 ss 4 24662.063 ± 1123.329 ms/op
>> BigIntegerParallelMultiply.parallelMultiply 1000000 ss 4 49.337 ± 26.611 ms/op
>> BigIntegerParallelMultiply.parallelMultiply 10000000 ss 4 527.560 ± 268.903 ms/op
>> BigIntegerParallelMultiply.parallelMultiply 100000000 ss 4 9076.551 ± 1899.444 ms/op
>>
>>
>> We can see that for larger calculations (fib 100m), the execution is 2.7x faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for small (fib 1m) it is roughly the same. Considering that the fibonacci algorithm that we used was in itself sequential, and that the last 3 calculations would dominate, 2.7x faster should probably be considered quite good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit since the last revision:
>
> Made forkOrInvoke() method protected to avoid strange compiler error
> IIRC you are not overly concerned about the additional object creation of `RecursiveOp` instances?
>
> If that changes the operation method could return `Object` and choose to perform the operation directly returning the `BigInteger` result or returning the particular `RecursiveOp`.
>
> ```java
> private static Object multiply(BigInteger a, BigInteger b, boolean parallel, int depth) {
> if (isParallel(parallel, depth)) {
> return new RecursiveMultiply(a, b, parallel, depth).fork();
> } else {
> // Also called by RecursiveMultiply.compute()
> return RecursiveMultiply.compute(a, b, false, depth);
> }
> }
> ```
>
> Then we could have another method on `RecursiveOp` that pattern matches e.g.:
>
> ```java
> static BigInteger join(Object o) {
> // replace with pattern match switch when it exits preview
> if (o instanceof BigInteger b) {
> return b;
> } else if (o instanceof RecursiveOp r) {
> return r.join();
> } else {
> throw new InternalError("Cannot reach here);
> }
> }
> ```
>
> That seems simple enough it might be worth doing anyway.
If we need a demonstration as to why pattern matching switch is necessary, then I guess we could do this. Each RecursiveOp instance is 40 bytes. To calculate Fibonacci (1_000_000_000) we create 7_324_218 tasks, thus we are allocating an additional 292_968_720 bytes of memory. In addition, I believe that calling invoke() allocates some more bytes. According to the GC logs, we allocate an additional 2.33 GB of memory. That might sound like a lot, but it takes 2.25 TB to calculate Fibonacci of 1 billion using our algorithm. The additional memory allocated is thus roughly 0.1%. The performance of the old sequential multiply and the new one, with the additional object creation, seems equivalent.
I would thus recommend that we keep it the way it is at the moment, with the new RecursiveOp task creation. Considering the volume of objects that we will be allocating once we get to Toom Cook 3, a 0.1% reduction in object allocation will not be noticed.
Old BigInteger#multiply()
Fibonacci memory bytes
100 11.5KB 11808
1k 119.0KB 121856
10k 1.2MB 1238552
100k 13.0MB 13634608
1m 177.5MB 186104688
10m 3.3GB 3574666840
100m 85.6GB 91866740256
1000m 2.3TB 2475468459952
New BigInteger#multiply()
Fibonacci memory bytes increase %
100 11.5KB 11808 0.0
1k 119.0KB 121856 0.0
10k 1.2MB 1238552 0.0
100k 13.0MB 13649176 0.1067
1m 177.6MB 186197472 0.0498
10m 3.3GB 3577835016 0.088
100m 85.6GB 91960170576 0.1017
1000m 2.3TB 2477979788992 0.1014
-------------
PR: https://git.openjdk.java.net/jdk/pull/6409
More information about the core-libs-dev
mailing list