RFR: 8282541: AArch64: Auto-vectorize Math.round API

Wed Apr 13 07:10:09 UTC 2022

On Tue, 12 Apr 2022 13:26:02 GMT, Andrew Haley <aph at openjdk.org> wrote:

> Before, Apple M1:
> 
> +-----------------------------------------+---------------------------------+
> |Benchmark                                | (TESTSIZE) Mode     Score  Units|
> +-----------------------------------------+---------------------------------+
> |FpRoundingBenchmark.test_round_double    |   1024  thrpt    1612.391 ops/ms|
> |FpRoundingBenchmark.test_round_double    |   2048  thrpt     804.291 ops/ms|
> |FpRoundingBenchmark.test_round_float     |   1024  thrpt    1558.202 ops/ms|
> |FpRoundingBenchmark.test_round_float     |   2048  thrpt     775.730 ops/ms|
> +-----------------------------------------+---------------------------------+
> 
> After:
> 
> +-----------------------------------------+----------------------------------+
> |Benchmark                                | (TESTSIZE) Mode      Score  Units|
> +-----------------------------------------+----------------------------------+
> |FpRoundingBenchmark.test_round_double    |    1024  thrpt   2720.153  ops/ms|
> |FpRoundingBenchmark.test_round_double    |    2048  thrpt   1371.750  ops/ms|
> |FpRoundingBenchmark.test_round_float     |    1024  thrpt   5940.263  ops/ms|
> |FpRoundingBenchmark.test_round_float     |    2048  thrpt   3036.201  ops/ms|
> +-----------------------------------------+----------------------------------+
> 
> About the algorithm:
> 
> `Math.round()` is tricky. Its specification corresponds to no standard
> rounding mode: it "returns the closest long to the argument, with ties
> rounding to positive infinity." For positive inputs this is the same
> as IEEE-754's `convertToIntegerTiesToAway` operation, which rounds
> away from zero, but there's no equivalent for negative inputs.
> 
> `Math.round()` used simply to add 0.5 and convert to integer by taking
> the floor of the result, but that wasn't right because it suffers from
> double rounding. This breaks several cases, in particular because
> 
>  `0.4999999... (+) 0.5 == 1.0`
>  
>  (Here, `(+)` indicates an addition followed by roundTiesToEven.)
>  
> There is no corresponding problem with `-0.4999999...` or `-0.9999999...`
>  
> Also, in the 32-bit `float` case,
>  
>   `8388609 (+) 0.5 == 8388610`
>   
> because 8388609 (0x1.000002p+23) as a 32-bit float has no fraction
> bits, so adding 0.5, followed by roundTiesToEven, rounds upwards. This
> problem manifests for every odd integer within the binade from
> 0x1.000002p+23 to 0x1.fffffep+23, whether positive or negative. There
> is a corresponding problem for the `double` range.
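
Both failure cases above can be reproduced from plain Java. This is a minimal sketch, not code from the patch: the class `RoundingPitfalls` and the `naiveRound*` helpers are hypothetical names that only mimic the old "add 0.5, then floor" behaviour so it can be compared with the current `Math.round()`.

```java
class RoundingPitfalls {
    // Hypothetical helper: old-style rounding, add 0.5 (the addition itself
    // is rounded with roundTiesToEven), then take the floor.
    static long naiveRoundDouble(double x) {
        return (long) Math.floor(x + 0.5);
    }

    // Same idea for float; the addition is performed in float precision.
    static int naiveRoundFloat(float x) {
        return (int) Math.floor(x + 0.5f);
    }

    public static void main(String[] args) {
        // Largest double below 0.5: the sum lands exactly halfway between
        // two doubles and rounds up to 1.0.
        double d = Math.nextDown(0.5);
        System.out.println(naiveRoundDouble(d) + " vs " + Math.round(d)); // 1 vs 0

        // 0x1.000002p+23: an odd integer with no fraction bits left in float,
        // so the sum is a tie that rounds to the even neighbour.
        float f = 8388609f;
        System.out.println(naiveRoundFloat(f) + " vs " + Math.round(f)); // 8388610 vs 8388609
    }
}
```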
> 
> The patch for JDK-8279508 handles this by flipping the floating-point
> rounding mode to roundTowardNegative, adding 0.5, and taking the
> floor. While this does work on AArch64, the performance is
> tragic. AArch64 implementations seem to wait for all instructions in
> flight to retire, change the rounding mode, and do the operation. This
> effectively serializes the entire thread.
> 
> This patch takes a different approach. Firstly, we can observe that we
> can use the `frinta` instruction for the entire positive range. The
> negative range is a bit trickier, but we can observe that any x,
> abs(x) >= 0x1.000000p+23, has no fractional bits so it must be an
> integer. For convenience, we can convert that range with the `frinta`
> instruction as well.
> 
> All that remains are x < 0, abs(x) < 0x1.000000p+23. Adding 0.5
> followed by roundTiesToEven doesn't lead to a problem because for
> x < 0 && abs(x) >= 0.5, adding 0.5 only reduces the magnitude of x;
> for all x < 0 && abs(x) < 0.5, adding 0.5 followed by roundTiesToEven
> returns 0.
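
To check my understanding of the case split described above, here is a scalar Java sketch for the `float` case. It is only an illustration under my own assumptions, not the code in this patch: `RoundSketch`, `roundTiesAway` and `javaRound` are hypothetical names, and `roundTiesAway` merely emulates in software what I take `frinta` to compute (round to nearest, ties away from zero).

```java
class RoundSketch {
    // Software stand-in for frinta: round to nearest integral value,
    // ties away from zero. Both the floor and the subtraction are exact.
    static float roundTiesAway(float x) {
        float ax = Math.abs(x);
        float t = (float) Math.floor(ax);
        if (ax - t >= 0.5f) {
            t += 1.0f;                  // exact: only reached when ax < 2^23
        }
        return Math.copySign(t, x);
    }

    // Case split as described in the quoted text.
    static int javaRound(float x) {
        if (x >= 0.0f || Math.abs(x) >= 0x1.0p23f) {
            // Non-negative, or already an integer: ties-away is correct.
            return (int) roundTiesAway(x);
        }
        // x < 0 && abs(x) < 2^23: adding 0.5 is safe here, then floor.
        return (int) Math.floor(x + 0.5f);
    }

    public static void main(String[] args) {
        float[] samples = {
            0.0f, -0.0f, Math.nextDown(0.5f), 0.5f, -0.5f, 1.5f, -1.5f,
            2.5f, -2.5f, 8388609f, -8388609f, 3e9f, -3e9f, Float.NaN
        };
        for (float s : samples) {
            // Compare the sketch against the library implementation.
            System.out.println(s + " -> " + javaRound(s)
                    + " (Math.round: " + Math.round(s) + ")");
        }
    }
}
```

On the samples in `main`, the sketch should agree with `Math.round(float)`; the interesting ones are the negative ties (`-0.5f`, `-1.5f`, `-2.5f`), which take the add-0.5-then-floor path.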

src/hotspot/cpu/aarch64/aarch64_neon_ad.m4 line 374:

> 372: VECTOR_JAVA_FROUND(F, 4F,  I, T4S, 4,  INT, vReg)
> 373: VECTOR_JAVA_FROUND(D, 2D,  L, T2D, 2, LONG, vReg)
> 374: 

I don't know why we need these rules. Shouldn't all "UseSVE > 0" cases go to the rules in the SVE ad file, which call vector_round_sve()?

src/hotspot/cpu/aarch64/aarch64_sve_ad.m4 line 2130:

> 2128: %{
> 2129:   predicate(UseSVE > 0 &&
> 2130:             n->as_Vector()->length() == $5);

Remove `n->as_Vector()->length() == $5`? I think there is no need to limit the vector length for SVE, i.e. we should generate the same code for all SVE vector lengths. For example, you have limited the size to 8F below, which is 256 bits, but then there is no rule for other vector sizes such as 512 bits.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8204