RFR: 8282541: AArch64: Auto-vectorize Math.round API
Ningsheng Jian
njian at openjdk.java.net
Wed Apr 13 07:10:09 UTC 2022
On Tue, 12 Apr 2022 13:26:02 GMT, Andrew Haley <aph at openjdk.org> wrote:
> Before, Apple M1:
>
> +---------------------------------------+------------+-------+----------+--------+
> | Benchmark                             | (TESTSIZE) | Mode  |    Score | Units  |
> +---------------------------------------+------------+-------+----------+--------+
> | FpRoundingBenchmark.test_round_double |       1024 | thrpt | 1612.391 | ops/ms |
> | FpRoundingBenchmark.test_round_double |       2048 | thrpt |  804.291 | ops/ms |
> | FpRoundingBenchmark.test_round_float  |       1024 | thrpt | 1558.202 | ops/ms |
> | FpRoundingBenchmark.test_round_float  |       2048 | thrpt |  775.730 | ops/ms |
> +---------------------------------------+------------+-------+----------+--------+
>
> After:
>
> +---------------------------------------+------------+-------+----------+--------+
> | Benchmark                             | (TESTSIZE) | Mode  |    Score | Units  |
> +---------------------------------------+------------+-------+----------+--------+
> | FpRoundingBenchmark.test_round_double |       1024 | thrpt | 2720.153 | ops/ms |
> | FpRoundingBenchmark.test_round_double |       2048 | thrpt | 1371.750 | ops/ms |
> | FpRoundingBenchmark.test_round_float  |       1024 | thrpt | 5940.263 | ops/ms |
> | FpRoundingBenchmark.test_round_float  |       2048 | thrpt | 3036.201 | ops/ms |
> +---------------------------------------+------------+-------+----------+--------+
>
> About the algorithm:
>
> `Math.round()` is tricky. Its specification corresponds to no standard
> rounding mode: it "returns the closest long to the argument, with ties
> rounding to positive infinity." For positive inputs this is the same
> as IEEE-754's `convertToIntegerTiesToAway` operation, which rounds
> away from zero, but there's no equivalent for negative inputs.
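
To make the asymmetry in the quoted specification concrete, here is a small sketch using only `java.lang.Math`. The class name `RoundTies` is made up for illustration. Positive ties round away from zero, while negative ties round toward zero, because both move toward positive infinity.

```java
final class RoundTies {
    // Ties go toward positive infinity:
    // 2.5 rounds to 3, but -2.5 rounds to -2 and -3.5 to -3.
    static long[] demo() {
        return new long[] { Math.round(2.5), Math.round(-2.5), Math.round(-3.5) };
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo())); // prints [3, -2, -3]
    }
}
```

A ties-away operation would instead give -3 and -4 for the two negative inputs, which is why it only matches `Math.round()` on the positive side.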
>
> `Math.round()` used simply to add 0.5 and convert to integer by taking
> the floor of the result, but that wasn't right because it suffers from
> double rounding. This breaks several cases, in particular because
>
> `0.4999999... (+) 0.5 == 1.0`
>
> (Here, `(+)` indicates an addition followed by roundTiesToEven.)
>
> There is no corresponding problem with `-0.4999999...` or `-0.9999999...`
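
The double-rounding case above can be reproduced in plain Java. This is only an illustrative sketch: `naiveRound` is a hypothetical name for the old "add 0.5 and floor" approach, not a JDK method, and the comparison assumes a JDK whose `Math.round()` already has the corrected behaviour.

```java
final class DoubleRoundingDemo {
    // Largest double below 0.5, i.e. the 0.4999999... from the text.
    static final double JUST_BELOW_HALF = Math.nextDown(0.5);

    // Hypothetical model of the old approach: add 0.5, then take the floor.
    static long naiveRound(double x) {
        return (long) Math.floor(x + 0.5);
    }

    public static void main(String[] args) {
        // The sum is exactly halfway between two doubles and rounds up to 1.0.
        System.out.println(JUST_BELOW_HALF + 0.5 == 1.0);
        System.out.println(naiveRound(JUST_BELOW_HALF) + " vs " + Math.round(JUST_BELOW_HALF));
        // The negative counterparts are unaffected.
        System.out.println(naiveRound(-JUST_BELOW_HALF) + " vs " + Math.round(-JUST_BELOW_HALF));
    }
}
```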
>
> Also, in the 32-bit `float` case,
>
> `8388609 (+) 0.5 == 8388610`
>
> because 8388609 (0x1.000002p+23) as a 32-bit float has no fraction
> bits, so adding 0.5, followed by roundTiesToEven, rounds upwards. This
> problem manifests for every odd integer within the binade from
> 0x1.000002p+23 to 0x1.fffffep+23, whether positive or negative. There
> is a corresponding problem for the `double` range.
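
The `float` case can be checked the same way. Again a sketch only: `naiveRound` is a hypothetical helper that models "add 0.5 and floor" in `float` arithmetic, compared against `Math.round(float)` on a JDK with the corrected behaviour.

```java
final class FloatOddIntegerDemo {
    // Hypothetical model of the old approach, with the addition done in float precision.
    static int naiveRound(float x) {
        return (int) Math.floor(x + 0.5f);
    }

    public static void main(String[] args) {
        float odd = 8388609f; // 0x1.000002p+23, an odd integer with no fraction bits
        // odd + 0.5f is a tie between 8388609 and 8388610 and rounds to the even one.
        System.out.println(odd + 0.5f == 8388610f);
        System.out.println(naiveRound(odd) + " vs " + Math.round(odd));
        // The negative odd integer is off by one as well, in the other direction.
        System.out.println(naiveRound(-odd) + " vs " + Math.round(-odd));
    }
}
```

Even integers in the same binade are not affected, since the tie then resolves back to the original value.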
>
> The patch for JDK-8279508 handles this by flipping the floating-point
> rounding mode to roundTowardNegative, adding 0.5, and taking the
> floor. While this does work on AArch64, the performance is
> tragic. AArch64 implementations seem to wait for all instructions in
> flight to retire, change the rounding mode, and do the operation. This
> effectively serializes the entire thread.
>
> This patch takes a different approach. Firstly, we can observe that we
> can use the `frinta` instruction for the entire positive range. The
> negative range is a bit trickier, but we can observe that any x with
> abs(x) >= 0x1.000000p+23 has no fractional bits, so it must be an
> integer. For convenience, we can convert that range with the `frinta`
> instruction as well.
>
> All that remains are x < 0, abs(x) < 0x1.000000p+23. Adding 0.5
> followed by roundTiesToEven doesn't lead to a problem because for
> x < 0 && abs(x) >= 0.5, adding 0.5 only reduces the magnitude of x;
> for all x < 0 && abs(x) < 0.5, adding 0.5 followed by roundTiesToEven
> and taking the floor returns 0.
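
As a reading aid, the case split described above can be modelled in scalar Java for the `float` case. This is a sketch under stated assumptions, not the HotSpot code: `RoundSketch`, `tiesAway` and `NO_FRACTION` are hypothetical names, `tiesAway` only emulates the ties-away rounding that the description attributes to `frinta`, and the Java `(int)` cast stands in for the final float-to-int conversion.

```java
final class RoundSketch {
    // 2^23: from this magnitude on, every float is already an integer.
    static final float NO_FRACTION = 0x1.0p23f;

    // Hypothetical stand-in for frinta: round to an integral value, ties away from zero.
    static float tiesAway(float x) {
        float a = Math.abs(x);
        if (!(a < NO_FRACTION)) {
            return x; // already integral, or NaN / infinity
        }
        float f = (float) Math.floor(a);
        if (a - f >= 0.5f) { // exact: the fractional part of a float is representable
            f += 1.0f;
        }
        return Math.copySign(f, x);
    }

    static int round(float x) {
        // Positive range, plus any magnitude that has no fraction bits.
        if (x >= 0.0f || Math.abs(x) >= NO_FRACTION) {
            return (int) tiesAway(x);
        }
        // Remaining case: x < 0 with abs(x) < 2^23. Adding 0.5f here never
        // rounds in a harmful direction. NaN also lands here and converts to 0.
        return (int) Math.floor(x + 0.5f);
    }

    public static void main(String[] args) {
        float[] samples = { 0.49999997f, 0.5f, -0.5f, -1.5f, 8388609f, -8388609f };
        for (float v : samples) {
            System.out.println(v + " -> " + round(v) + " (Math.round: " + Math.round(v) + ")");
        }
    }
}
```

The `double` case would follow the same shape with 0x1.0p52 as the threshold and `long` as the result type.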
src/hotspot/cpu/aarch64/aarch64_neon_ad.m4 line 374:
> 372: VECTOR_JAVA_FROUND(F, 4F, I, T4S, 4, INT, vReg)
> 373: VECTOR_JAVA_FROUND(D, 2D, L, T2D, 2, LONG, vReg)
> 374:
I don't see why we need these rules. Shouldn't all "UseSVE > 0" cases go to the rules in the SVE ad file, which call vector_round_sve()?
src/hotspot/cpu/aarch64/aarch64_sve_ad.m4 line 2130:
> 2128: %{
> 2129: predicate(UseSVE > 0 &&
> 2130: n->as_Vector()->length() == $5);
Remove `n->as_Vector()->length() == $5`? I think there is no need to limit the vector length for SVE, i.e. we should generate the same code for all SVE vector lengths. For example, you have limited the size to 8F below, which is 256 bits, but then there is no rule for other vector sizes such as 512 bits.
-------------
PR: https://git.openjdk.java.net/jdk/pull/8204