RFR: JDK-8277175 : Add a parallel multiply method to BigInteger [v3]

Fri Nov 19 10:46:47 UTC 2021

On Wed, 17 Nov 2021 19:48:25 GMT, kabutz <duke at openjdk.java.net> wrote:

>> BigInteger currently uses three different algorithms for multiply. The simple quadratic algorithm, then the slightly better Karatsuba if we exceed a bit count and then Toom Cook 3 once we go into the several thousands of bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to parallelize it. I have demonstrated this several times in conference talks. In order to be consistent with other classes such as Arrays and Collection, I have added a parallelMultiply() method. Internally we have added a parameter to the private multiply method to indicate whether the calculation should be done in parallel.
>> 
>> The performance improvements are as should be expected. Fibonacci of 100 million (using a single-threaded Dijkstra's sum of squares version) completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with the sequential multiply() method. This is on my 1-8-2 laptop. The final multiplications are with very large numbers, which then benefit from the parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>> 
>> We have also parallelized the private square() method. Internally, the square() method defaults to be sequential.
>> 
>> Some benchmark results, run on my 1-6-2 server:
>> 
>> 
>> Benchmark                                          (n)  Mode  Cnt      Score      Error  Units
>> BigIntegerParallelMultiply.multiply            1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply           10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply          100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   9076.551 ± 1899.444  ms/op
>> 
>> 
>> We can see that for larger calculations (fib 100m), the execution is 2.7x faster in parallel. For medium size (fib 10m) it is 1.873x faster. And for small (fib 1m) it is roughly the same. Considering that the fibonacci algorithm that we used was in itself sequential, and that the last 3 calculations would dominate, 2.7x faster should probably be considered quite good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Removed JVM flags from benchmark

> I think the functionality in this PR is worth pursuing, but with the JDK 18 rampdown 1 date fast approaching, as a non-urgent issue, I think we shouldn't try to rush it into JDK 18.

> I looked more closely and now understand a bit more about the threshold. It would be useful to have some implementation notes detailing approximately when the parallel execution kicks in. For `a*a` it's `>= TOOM_COOK_SQUARE_THRESHOLD(216)` and for `a*b` it's `>= TOOM_COOK_THRESHOLD(240)`, so we could refer to certain bit lengths.
> 
> The branching factor is 4 (well 3 i guess, but its easier to think in powers of 2). It might be reasonable to assume the problem gets split equally in 4 parts. I don't know in practice what the depth of recursion might, its hard to see this getting completely out of control, but we could always keep track of the depth and cut off in proportion to the # runtime processors if need be.
> 
> Given the existing degree of specialization, the minimal changes to code, and the fact that the creation of recursive tasks is in the noise this PR looks quite reasonable.

I have run some more tests. For my fibonacci algorithm, here are the worst cases for the various calculations.

            n   most_bits     tasks  time_ms
         1000         694         0      1
       10_000       6_942         0      1
      100_000      69_424        18     13
    1_000_000     694_241       468    143
   10_000_000   6_942_418    11_718   1049
  100_000_000  69_424_191   292_968  13695
1_000_000_000 694_241_913 7_324_218 237282

Each data point has 10x the number of bits in the final result and the number of tasks in the final calculation is 25x more. Perhaps I can make the threshold the recursive depth up to which we would run in parallel. And that recursive depth could the availableProcessors() or some multiple thereof.

> 
> I think we can simplify the creation of the recursive tasks (assuming we don't track the depth):
> 
> ```java
>     private static final class RecursiveOp {
>         private static <T> RecursiveTask<T> exec(RecursiveTask<T> op, boolean parallel) {
>             if (parallel) {
>                 op.fork();
>             } else {
>                 op.invoke();
>             }
>             return op;
>         }
> 
>         private static RecursiveTask<BigInteger> multiply(BigInteger a, BigInteger b, boolean parallel) {
>             var op = new RecursiveTask<BigInteger>() {
>                 @Override
>                 protected BigInteger compute() {
>                     return a.multiply(b, true, parallel);
>                 }
>             };
> 
>             return exec(op, parallel);
>         }
> 
>         private static RecursiveTask<BigInteger> square(BigInteger a, boolean parallel) {
>             var op = new RecursiveTask<BigInteger>() {
>                 @Override
>                 protected BigInteger compute() {
>                     return a.square(true, parallel);
>                 }
>             };
> 
>             return exec(op, parallel);
>         }
>     }
> ```

Good idea, will change that.

-------------

PR: https://git.openjdk.java.net/jdk/pull/6409