RFR: JDK-8277175: Add a parallel multiply method to BigInteger

kabutz duke at openjdk.java.net
Wed Nov 17 19:51:49 UTC 2021


On Wed, 17 Nov 2021 17:51:03 GMT, Erik Österlund <eosterlund at openjdk.org> wrote:

> I would be wary to make any API use multiple threads behind the scenes without the user explicitly asking for it. While latency of the given operation might improve in isolation, parallelization always incur some (often significant) additional cost of computation. This might reduce power efficiency, reduce CPU availability for other tasks, and play tricks with scalability.
> 
> Cost: On my system `multiply` burns about half as many cycles as `parallelMultiply`, even though it takes 2.4x longer (measured with `-prof perfnorm`).
> 
> Scalability: To simulate how well the solution scales you could try running the `multiply` and `parallelMultiply` micros with increasingly large `-t` values. (`-t` controls the number of threads running the benchmarks in parallel, default 1). On my system the advantage of `parallelMultiply` diminishes markedly as I saturate the system. On a test where I see a 2.4x speed-up at `-t 1` (`n = 50000000`), the advantage drops to 1.5x at `-t 4` and only gives me a 1.1x speed-up at `-t 8`.

Furthermore, since it uses the common ForkJoinPool by default, any simultaneous parallel work (parallel sort, parallel streams, etc.) would degrade its performance.
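A small sketch of the shared resource in question: parallel streams, Arrays.parallelSort and (under this proposal) parallelMultiply all draw workers from the same common pool, so concurrent use competes for the same threads. The class name here is just for illustration.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class CommonPoolDemo {
    public static void main(String[] args) {
        // All default parallel facilities share ForkJoinPool.commonPool(),
        // so concurrent parallel work competes for the same worker threads.
        System.out.println("common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());

        // A long-running parallel stream occupies those workers, delaying
        // any parallelMultiply submitted to the common pool at the same time.
        long sum = IntStream.range(0, 1_000).parallel().asLongStream().sum();
        System.out.println(sum);
    }
}
```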

> I'd favor a public `parallelMultiply()`. Alternatively a flag to opt-in to parallelization.

The reason I would prefer a public method to a flag is that it would let callers enable parallel multiply on a case-by-case basis, whereas a flag is all-or-nothing. If it has to be a flag, then I agree it should be opt-in.
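To illustrate the case-by-case point, here is a minimal sketch, assuming the proposed public method lands as `BigInteger.parallelMultiply` (it requires JDK 19+); the helper name and the bit-length threshold are hypothetical, chosen only to show a per-call-site decision that a global flag could not express.

```java
import java.math.BigInteger;

public class CaseByCase {
    // Hypothetical cut-off: below this operand size the extra cycles
    // burned by parallelization are unlikely to pay off.
    static final int PARALLEL_BIT_THRESHOLD = 1 << 20;

    static BigInteger product(BigInteger a, BigInteger b) {
        // With a public method, each call site opts in (or not) based on
        // its own operands; a flag would force one choice everywhere.
        if (a.bitLength() > PARALLEL_BIT_THRESHOLD
                && b.bitLength() > PARALLEL_BIT_THRESHOLD) {
            return a.parallelMultiply(b); // JDK 19+
        }
        return a.multiply(b);
    }
}
```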

> (Nit: avoid appending flags to microbenchmarks that aren't strictly necessary for the tests, or such that can be reasonably expected to be within bounds on any test system. I didn't have 16Gb of free RAM.)

Thanks, will change that. Fun fact - the book Optimizing Java has a graph in the introduction that refers to some tests I did on Fibonacci a long time ago. The GC was dominating because the spaces were too small. However, in that case I was calculating Fibonacci of 1 billion. For "just" 100m, we don't need as much memory.

Here is the amount of memory that each of the Fibonacci calculations allocates. For the 100m calculation, the resident set size for multiply() is about 125 MB and for parallelMultiply() about 190 MB.


BigIntegerParallelMultiply.multiply:              1_000_000    177 MB
BigIntegerParallelMultiply.multiply:             10_000_000   3411 MB
BigIntegerParallelMultiply.multiply:            100_000_000  87689 MB
BigIntegerParallelMultiply.parallelMultiply:      1_000_000    177 MB
BigIntegerParallelMultiply.parallelMultiply:     10_000_000   3411 MB
BigIntegerParallelMultiply.parallelMultiply:    100_000_000  87691 MB
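For context, Fibonacci of such large indices is typically computed with the fast-doubling identities F(2k) = F(k)·(2·F(k+1) − F(k)) and F(2k+1) = F(k)² + F(k+1)², where the huge BigInteger multiplications dominate the run time. A minimal sketch (the class and method names are mine, not from the benchmark):

```java
import java.math.BigInteger;

public class Fib {
    // Fast-doubling Fibonacci, processing the bits of n from most to
    // least significant. The BigInteger multiplications here are exactly
    // the operations a parallelMultiply would speed up for huge n.
    static BigInteger fib(long n) {
        BigInteger a = BigInteger.ZERO; // F(k), starting at F(0)
        BigInteger b = BigInteger.ONE;  // F(k+1), starting at F(1)
        for (int bit = 63 - Long.numberOfLeadingZeros(Math.max(n, 1)); bit >= 0; bit--) {
            BigInteger c = a.multiply(b.shiftLeft(1).subtract(a)); // F(2k)
            BigInteger d = a.multiply(a).add(b.multiply(b));       // F(2k+1)
            a = c;
            b = d;
            if (((n >>> bit) & 1) == 1) { // odd index: shift to F(2k+1), F(2k+2)
                BigInteger t = a.add(b);
                a = b;
                b = t;
            }
        }
        return a;
    }
}
```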

-------------

PR: https://git.openjdk.java.net/jdk/pull/6409

