RFR: JDK-8277175: Add a parallel multiply method to BigInteger [v5]
Paul Sandoz
psandoz at openjdk.java.net
Wed Dec 1 00:10:29 UTC 2021
On Wed, 24 Nov 2021 15:20:42 GMT, kabutz <duke at openjdk.java.net> wrote:
>> BigInteger currently uses three different algorithms for multiply: the simple quadratic algorithm, the slightly better Karatsuba once we exceed a bit count, and Toom Cook 3 once we go into several thousands of bits. Since Toom Cook 3 is a recursive algorithm, it is trivial to parallelize. I have demonstrated this several times in conference talks. In order to be consistent with other classes such as Arrays and Collection, I have added a parallelMultiply() method. Internally we have added a parameter to the private multiply method to indicate whether the calculation should be done in parallel.
>>
>> The performance improvements are as should be expected. Fibonacci of 100 million (using a single-threaded Dijkstra's sum of squares version) completes in 9.2 seconds with the parallelMultiply() vs 25.3 seconds with the sequential multiply() method. This is on my 1-8-2 laptop. The final multiplications are with very large numbers, which then benefit from the parallelization of Toom-Cook 3. Fibonacci 100 million is a 347084 bit number.
>>
>> We have also parallelized the private square() method. Internally, the square() method defaults to sequential execution.
>>
>> Some benchmark results, run on my 1-6-2 server:
>>
>>
>> Benchmark                                          (n)  Mode  Cnt      Score      Error  Units
>> BigIntegerParallelMultiply.multiply            1000000    ss    4     51.707 ±   11.194  ms/op
>> BigIntegerParallelMultiply.multiply           10000000    ss    4    988.302 ±  235.977  ms/op
>> BigIntegerParallelMultiply.multiply          100000000    ss    4  24662.063 ± 1123.329  ms/op
>> BigIntegerParallelMultiply.parallelMultiply    1000000    ss    4     49.337 ±   26.611  ms/op
>> BigIntegerParallelMultiply.parallelMultiply   10000000    ss    4    527.560 ±  268.903  ms/op
>> BigIntegerParallelMultiply.parallelMultiply  100000000    ss    4   9076.551 ± 1899.444  ms/op
>>
>>
>> We can see that for larger calculations (fib 100m), the execution is 2.7x faster in parallel. For medium size (fib 10m) it is 1.873x faster, and for small (fib 1m) it is roughly the same. Considering that the Fibonacci algorithm we used was itself sequential, and that the last 3 calculations dominate, 2.7x faster should probably be considered quite good on a 1-6-2 machine.
>
> kabutz has updated the pull request incrementally with one additional commit since the last revision:
>
> Made forkOrInvoke() method protected to avoid strange compiler error
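For anyone reproducing the Fibonacci figures quoted above, a fast-doubling driver built on the same squaring identities is enough to exercise the new method. The sketch below is illustrative only; the helper name and loop structure are mine, not the benchmark code from the PR:
// Fast doubling: F(2k) = F(k) * (2*F(k+1) - F(k)), F(2k+1) = F(k)^2 + F(k+1)^2.
// The multiplications of very large operands are where parallelMultiply() pays off.
static BigInteger fibonacci(int n) {             // n >= 0
    BigInteger fk = BigInteger.ZERO;             // F(k), starting with k = 0
    BigInteger fk1 = BigInteger.ONE;             // F(k+1)
    for (int bit = 31 - Integer.numberOfLeadingZeros(n); bit >= 0; bit--) {
        BigInteger f2k  = fk.parallelMultiply(fk1.shiftLeft(1).subtract(fk));
        BigInteger f2k1 = fk.parallelMultiply(fk).add(fk1.parallelMultiply(fk1));
        if ((n & (1 << bit)) != 0) {             // consume the bits of n, high to low
            fk = f2k1;
            fk1 = f2k.add(f2k1);
        } else {
            fk = f2k;
            fk1 = f2k1;
        }
    }
    return fk;                                   // F(n)
}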
IIRC you are not overly concerned about the additional object creation of `RecursiveOp` instances?
If that changes, the operation method could return `Object` and choose either to perform the operation directly, returning the `BigInteger` result, or to return the particular `RecursiveOp`:
private static Object multiply(BigInteger a, BigInteger b, boolean parallel, int depth) {
    if (isParallel(parallel, depth)) {
        return new RecursiveMultiply(a, b, parallel, depth).fork();
    } else {
        // Also called by RecursiveMultiply.compute()
        return RecursiveMultiply.compute(a, b, false, depth);
    }
}
Then we could have another method on `RecursiveOp` that pattern matches, e.g.:
static BigInteger join(Object o) {
    // replace with a pattern-matching switch when it exits preview
    if (o instanceof BigInteger b) {
        return b;
    } else if (o instanceof RecursiveOp r) {
        return r.join();
    } else {
        throw new InternalError("Cannot reach here");
    }
}
That seems simple enough that it might be worth doing anyway.
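For illustration, a Toom-Cook call site would then read something like this (a sketch reusing the existing `da1`/`db1`/`a1`/`b1` locals; the `vm1`/`v1` names stand in for whatever the task variables end up being called):
// multiply() either computes directly or forks; join() resolves whichever it returned.
Object vm1 = multiply(da1.subtract(a1), db1.subtract(b1), parallel, depth);
Object v1  = multiply(da1, db1, parallel, depth);
BigInteger rm1 = RecursiveOp.join(vm1);
BigInteger r1  = RecursiveOp.join(v1);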
src/java.base/share/classes/java/math/BigInteger.java line 1874:
> 1872: */
> 1873: private static final int PARALLEL_FORK_THRESHOLD = Integer.getInteger(
> 1874: "java.math.BigInteger.parallelForkThreshold",
I suspect we don't need this system property; there is no such property for streams. We use `ForkJoinPool.getCommonPoolParallelism()`, which is configurable and serves as an escape hatch.
src/java.base/share/classes/java/math/BigInteger.java line 1875:
> 1873: private static final int PARALLEL_FORK_THRESHOLD = Integer.getInteger(
> 1874: "java.math.BigInteger.parallelForkThreshold",
> 1875: (int) Math.ceil(Math.log(Runtime.getRuntime().availableProcessors()) / Math.log(2)));
We can simplify this to `32 - Integer.numberOfLeadingZeros(ForkJoinPool.getCommonPoolParallelism() - 1)`.
`ForkJoinPool.getCommonPoolParallelism()` is guaranteed to return a value `> 0`.
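A minimal check of that expression (illustrative, not code from the PR):
// Computes ceil(log2(p)) exactly, without the floating-point log dance.
static int parallelForkThreshold() {
    int p = ForkJoinPool.getCommonPoolParallelism(); // always > 0
    return 32 - Integer.numberOfLeadingZeros(p - 1);
}
// p = 1 -> 0 (never fork), p = 2 -> 1, p = 4 -> 2, p = 8 -> 3, p = 12 -> 4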
src/java.base/share/classes/java/math/BigInteger.java line 1878:
> 1876:
> 1877: @Serial
> 1878: private static final long serialVersionUID = 0L;
I don't think you need to declare these; the instances are never intended to support serialization. For example, in the stream implementation, `AbstractTask` (which extends `CountedCompleter`) states:
* <p>Serialization is not supported as there is no intention to serialize
* tasks managed by stream ops.
src/java.base/share/classes/java/math/BigInteger.java line 1909:
> 1907: @Override
> 1908: public BigInteger compute() {
> 1909: return a.multiply(b, true, super.parallel, super.depth);
Using `super.` is a little unusual; it's not wrong :-) but it's not really idiomatic in this code base.
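i.e. the inherited fields can simply be referenced directly, e.g.:
@Override
public BigInteger compute() {
    return a.multiply(b, true, parallel, depth); // parallel and depth inherited from RecursiveOp
}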
src/java.base/share/classes/java/math/BigInteger.java line 1930:
> 1928: }
> 1929:
> 1930: private static RecursiveTask<BigInteger> exec(RecursiveOp op) {
Unused.
src/java.base/share/classes/java/math/BigInteger.java line 2000:
> 1998: da1 = a2.add(a0);
> 1999: db1 = b2.add(b0);
> 2000: var vm1_task = RecursiveOp.multiply(da1.subtract(a1), db1.subtract(b1), parallel, depth + 1);
I recommend incrementing the depth in the `RecursiveOp` constructor, thereby reducing the repetition.
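Something along these lines (a sketch, not necessarily the exact constructor shape):
// RecursiveOp records the child's depth itself, so each fork site can pass
// the current depth instead of repeating depth + 1.
private RecursiveOp(boolean parallel, int parentDepth) {
    this.parallel = parallel;
    this.depth = parentDepth + 1;
}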
-------------
PR: https://git.openjdk.java.net/jdk/pull/6409