RFR: 8282541: AArch64: Auto-vectorize Math.round API

Wed Apr 13 10:14:15 UTC 2022

On Tue, 12 Apr 2022 13:26:02 GMT, Andrew Haley <aph at openjdk.org> wrote:

> Before, Apple M1:
> 
> +-----------------------------------------+---------------------------------+
> |Benchmark                                | (TESTSIZE) Mode     Score  Units|
> +-----------------------------------------+---------------------------------+
> |FpRoundingBenchmark.test_round_double    |   1024  thrpt    1612.391 ops/ms|
> |FpRoundingBenchmark.test_round_double    |   2048  thrpt     804.291 ops/ms|
> |FpRoundingBenchmark.test_round_float     |   1024  thrpt    1558.202 ops/ms|
> |FpRoundingBenchmark.test_round_float     |   2048  thrpt     775.730 ops/ms|
> +-----------------------------------------+---------------------------------+
> 
> After:
> 
> +-----------------------------------------+----------------------------------+
> |Benchmark                                | (TESTSIZE) Mode      Score  Units|
> +-----------------------------------------+----------------------------------+
> |FpRoundingBenchmark.test_round_double    |    1024  thrpt   2720.153  ops/ms|
> |FpRoundingBenchmark.test_round_double    |    2048  thrpt   1371.750  ops/ms|
> |FpRoundingBenchmark.test_round_float     |    1024  thrpt   5940.263  ops/ms|
> |FpRoundingBenchmark.test_round_float     |    2048  thrpt   3036.201  ops/ms|
> +-----------------------------------------+----------------------------------+
> 
> About the algorithm:
> 
> `Math.round()` is tricky. Its specification corresponds to no standard
> rounding mode: it "returns the closest long to the argument, with ties
> rounding to positive infinity." For positive inputs this is the same
> as IEEE-754's `convertToIntegerTiesToAway` operation, which rounds
> away from zero, but there's no equivalent for negative inputs.
> 
> `Math.round()` used simply to add 0.5 and convert to integer by taking
> the floor of the result, but that wasn't right because it suffers from
> double rounding. This breaks several cases, in particular because
> 
>  `0.4999999... (+) 0.5 == 1.0`
>  
>  (Here, `(+)` indicates an addition followed by roundTiesToEven.)
>  
> There is no corresponding problem with `-0.4999999...` or `-0.9999999...`
>  
> Also, in the 32-bit `float` case,
>  
>   `8388609 (+) 0.5 == 8388610`
>   
> because 8388609 (0x1.000002p+23) as a 32-bit float has no fraction
> bits, so adding 0.5, followed by roundTiesToEven, rounds upwards. This
> problem manifests for every odd integer within the binade from
> 0x1.000002p+23 to 0x1.fffffep+23, whether positive or negative. There
> is a corresponding problem for the `double` range.
> 
> The patch for JDK-8279508 handles this by flipping the floating-point
> rounding mode to roundTowardNegative, adding 0.5, and taking the
> floor. While this does work on AArch64, the performance is
> tragic. AArch64 implementations seem to wait for all instructions in
> flight to retire, change the rounding mode, and do the operation. This
> effectively serializes the entire thread.
> 
> This patch takes a different approach. Firstly, we can observe that we
> can use the `frinta` instruction for the entire positive range. The
> negative range is a bit trickier, but we can observe that any x with
> abs(x) >= 0x1.000000p+23 has no fractional bits, so it must be an
> integer. For convenience, we can convert that range with the `frinta`
> instruction as well.
> 
> All that remains are x < 0, abs(x) < 0x1.000000p+23. Adding 0.5
> followed by roundTiesToEven doesn't lead to a problem because for
> x < 0 && abs(x) >= 0.5, adding 0.5 only reduces the magnitude of x;
> for all x < 0 && abs(x) < 0.5, adding 0.5 followed by roundTiesToEven
> gives a value in (0, 0.5], whose floor is 0.
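
For anyone following along, here is a rough scalar model of the description quoted above, for the `float` case. It is only a sketch, not the HotSpot code: `naive_round`, `round_sketch` and `check_examples` are hypothetical names, `std::round` (which rounds halfway cases away from zero) stands in for `frinta`, and the final conversion to an integer type, with its NaN and overflow handling, is left out. It assumes the default round-to-nearest-even mode and plain IEEE-754 single-precision arithmetic.

```cpp
#include <cassert>
#include <cmath>

// Old approach: add 0.5 (rounded ties-to-even by the hardware), then floor.
// This is the version that suffers from double rounding.
inline float naive_round(float x) {
    return std::floor(x + 0.5f);
}

// Scalar model of the approach in the quoted description (hypothetical helper).
inline float round_sketch(float x) {
    const float two23 = 8388608.0f;  // 0x1.000000p+23
    if (x >= 0.0f || std::fabs(x) >= two23) {
        // Whole positive range, plus negative values that are already integers.
        return std::round(x);        // ties away from zero, like frinta
    }
    // x < 0 and abs(x) < 2^23 (a NaN also ends up here and stays NaN).
    return std::floor(x + 0.5f);
}

// Spot checks against the results Math.round is specified to give.
inline void check_examples() {
    const float below_half = std::nextafter(0.5f, 0.0f);  // largest float < 0.5

    // The naive version gets these wrong.
    assert(naive_round(below_half) == 1.0f);       // should be 0
    assert(naive_round(8388609.0f) == 8388610.0f); // should be 8388609

    // The sketch handles them, including ties going toward positive infinity.
    assert(round_sketch(below_half) == 0.0f);
    assert(round_sketch(8388609.0f) == 8388609.0f);
    assert(round_sketch(-8388609.0f) == -8388609.0f);
    assert(round_sketch(2.5f) == 3.0f);
    assert(round_sketch(-2.5f) == -2.0f);
    assert(round_sketch(-0.5f) == 0.0f);
}
```

The `double` case would look the same with 0x1.0p52 as the threshold, which matches the comment in the code quoted below.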

src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5198:

> 5196:   fcvtasd(dst, src);
> 5197:   // Test if src >= 0 || abs(src) >= 0x1.0p52
> 5198:   eor(rscratch1, rscratch1, 1ul << 63); // flip sign bit

This doesn't compile on Windows AArch64:

d:\a\jdk\jdk\jdk\src\hotspot\cpu\aarch64\macroAssembler_aarch64.cpp(5198): error C2220: the following warning is treated as an error
d:\a\jdk\jdk\jdk\src\hotspot\cpu\aarch64\macroAssembler_aarch64.cpp(5198): warning C4293: '<<': shift count negative or too big, undefined behavior

Windows is LLP64, isn't it? So you probably want `1ull` or `UCONST64(1)` here.
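
To illustrate the point, a minimal standalone sketch (not the HotSpot code; `kSignBit` and `flip_sign_bit` are made-up names for this example). It relies only on the standard guarantee that `unsigned long long` is at least 64 bits wide, whereas `unsigned long` is 32 bits under LLP64, which is why `1ul << 63` trips C4293 there.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// unsigned long long is at least 64 bits on every conforming platform,
// so shifting by 63 is well defined here. unsigned long is only 32 bits
// on LLP64 (Windows), so `1ul << 63` is undefined behavior there.
static_assert(sizeof(unsigned long long) * CHAR_BIT >= 64,
              "unsigned long long must hold 64 bits");

// Either spelling gives a portable 64-bit sign-bit mask.
constexpr std::uint64_t kSignBit = 1ull << 63;
static_assert(kSignBit == (std::uint64_t{1} << 63), "same mask either way");

// Hypothetical helper: flip the sign bit of a raw 64-bit pattern,
// mirroring what the eor in the quoted code is meant to do.
inline std::uint64_t flip_sign_bit(std::uint64_t bits) {
    return bits ^ kSignBit;
}
```

I am assuming `UCONST64(1)` behaves like the `std::uint64_t{1}` spelling here, that is, that it produces a 64-bit unsigned constant regardless of the data model.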

-------------

PR: https://git.openjdk.java.net/jdk/pull/8204