RFR: 8312569: RISC-V: Missing intrinsics for Math.ceil, floor, rint

Wed Jul 26 09:58:52 UTC 2023

On Wed, 26 Jul 2023 08:59:47 GMT, Ludovic Henry <luhenry at openjdk.org> wrote:

>> Please review this changes into risc-v double rounding intrinsic.
>> 
>> On risc-v intrinsics for rounding doubles with mode (like Math.ceil/floor/rint) were missing. On risc-v we don`t have special instruction for such conversion, so two times conversion was used: double -> long int -> double (using fcvt.l.d, fcvt.d.l).
>> 
>> Also, we should provide some rounding mode to fcvt.x.x instruction.
>> 
>> Rounding mode selection on ceil (similar for floor and rint): according to Math.ceil requirements: 
>> 
>>> Returns the smallest (closest to negative infinity) value that is greater than or equal to the argument and is equal to a mathematical integer (Math.java:475).
>> 
>> For double -> long int we choose rup (round towards +inf) mode to get the integer that more than or equal to the input value. 
>> For long int -> double we choose rdn (rounds towards -inf) mode to get the smallest (closest to -inf) representation of integer that we got after conversion.
>> 
>> For cases when we got inf, nan, or value more than 2^63 return input value (double value which more than 2^63 is guaranteed integer).
>> As well when we store result we copy sign from input value (need for cases when for (-1.0, 0.0) ceil need to return -0.0).
>> 
>> We have observed significant improvement on hifive and thead boards.
>> 
>> testing: tier1, tier2 and hotspot:tier3 on hifive
>> 
>> Performance results on hifive (FpRoundingBenchmark.testceil/floor/rint):
>> 
>> Without intrinsic:
>> 
>> Benchmark                      (TESTSIZE)   Mode  Cnt   Score   Error   Units
>> FpRoundingBenchmark.testceil         1024  thrpt   25  39.297 ± 0.037  ops/ms
>> FpRoundingBenchmark.testfloor        1024  thrpt   25  39.398 ± 0.018  ops/ms
>> FpRoundingBenchmark.testrint         1024  thrpt   25  36.388 ± 0.844  ops/ms
>> 
>> With intrinsic:
>> 
>> Benchmark                      (TESTSIZE)   Mode  Cnt   Score   Error   Units
>> FpRoundingBenchmark.testceil         1024  thrpt   25  80.560 ± 0.053  ops/ms
>> FpRoundingBenchmark.testfloor        1024  thrpt   25  80.541 ± 0.081  ops/ms
>> FpRoundingBenchmark.testrint         1024  thrpt   25  80.603 ± 0.071  ops/ms
>
> src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 4283:
> 
>> 4281:   // generating constant (tmp2)
>> 4282:   // tmp2 = 100...0000
>> 4283:   addi(mask, zr, 1);
> 
> There are other ways to [implement these functions with less instructions](https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1AB9U8lJL6yAngGVG6AMKpaAVxYM9DgDJ4GmADl3ACNMYhAAVgBmUgAHVAVCWwZnNw89eMSbAV9/IJZQ8OiLTCtshiECJmICVPdPLhKy5MrqglzAkLDImIUqmrr0xr62jvzCnoBKC1RXYmR2DgBSACYov2Q3LABqJajHPvxBADoEPewljQBBVfWGTdcdvccWJgIEU/PLm%2BuCAE9YpgsFRtolgP50NtkAhqttgqgXHsAELfb7oWbBejfMyYOjbCDo1yYzDbVSTVEAdhR12220JxO2r2AuyiABFtlQmMEFPiycjUTTtngQRAmSzHPiuHJvN5xc9tqZWUjvMYALJXAIAFWMrIAkgBxUyTcmC2lLKnfWlWoWCABskmMBCF/NNVvp9G2/gA7sZVC6fldrbSmAoWNsAG4uN50EkQVYrKjIcMEY60Y6Q1YRDSkXYrCKNbbEVyxePbEC5lZ7VnEUsQPCTMsVqi1snkqLUwNBkNhyMGGweuMrBNJlPoVO5rM5zMFosloeN%2BNV5vziDe30N8vxmsr%2BttjuWoMAegAVFDcbQIABaDQNqtV7bXnOGSF%2BbbARhhMTbL2YMAcWi0NsYhekwfw8jC4Yku8JIKKwmACgAnKCeDgkBPKzMQNqxK4Ka0seh4HtaqgsuyaCxGBKEMPia6qDmfLtoRuwUqyAqdsQmAEHMVF%2Bgx1zmixfHXO68HXKYVC0AimEEhiHp8nxFqCsJjJMMy96ctyvJ7qxtLCviYryhAUoynK%2BwKjqypqhq2p6oaxjGox5odkGNo
 EPajrOrxnbWkpNH%2BoxwahhGUb9rG8aJsmqbphO2YTjO6AMKWm5DlW24rPi9YLkOy5pRArb%2Bs53ZBX2Mb4mFI7puOmYxdOObEPFiUVkutY0RuFapel9mef52wnhyEkkFeN4kfej5AQwL5Ue%2B/jEF%2BP5/gBQG0CBYHbBBUEIDBcGIchqEhtsGFYThxx4QRrq0sR95kRRqGrpgPq0aSWmuvx2mFhxXGkn5zGsWiMkiVcpjELM434kpck3ApnZKfpbIclyPK5U9AY6SKMMSoZ0qys88qKhZ6pajqBpGiaXlMU5QZ%2BK5DpOng%2BVBj5d2%2Bn5Z1oT2wUlYOw4RWmGZ5tVeYziwLANYubLtXWrVhS2nXk9ahW9tGA5lRFY6AVVU4C7VQsi8lbLZdRjNkplKzi7uzOk7SvVA6442DbebIjTFz42m%2BH4zYBc3/oBwGgeBTCQds0GgltglIWCVF7QdfjYbh2z4d1F2w1d4cG/ddFI9aL2CWx73ENxX0CVcHDTLQnARLwngcFopCoJwEoKBhCy5lEPCkAQmjF9MADWkTZqXHCSLwLASBo2aV9XtccLwCggNm7dV8XpBwLASBoCwsQxmQFAQGvG/0OExDhval7IJshjAKQWDhngCwAGp4HdADygKV63NC0AQYQzxAwQd6QwR%2BGqH8TgrcAHMGIH8R%2BwRtCYGsCA3ga82CCEfgwWgwCF6X0wMEVwwBHBiFoDPbgvAsCvCMOIDB%2BB2LWDwJBQh1dMCqFgThRYrdKalD/rQPAwQZoQOcFgP%2BBBiB4GHkQ6Y4kVIKHvk/F%2B8CZCCBEGIdgUg5HyCUGoP%2BuhGgGCMCAUwxhzCcO5PAaYqBYjlEIZeR%2BUReCoEgsQIRWAZ6QGmJYWB5R7DjUGJ4KI2YfB%2BE6AUboXAVhxASEkAQXiMhhPKGMLoRQmhuJaP0WoLh6ggB8Qk6hAhWg1FiYE8IwSLDJMicMZJeS
 JgRBcY3JRJcy4Vz/pPbYAAlXUQhHCXlvoWI%2BkhgDIChNo5kEBBE2y7g2CAuBCAkGblwSYvB55aGNKQDaTAsDhAgN3Xu%2BhOCD1IOPGxnBp6zzbh3RZ/cVj1IwZPOZJzph2MSHYSQQA). Also, given you're not checking for NaN (and neither is it done for other architectures), I assume that this is done before this is called?

Hi, thanks for your review. 

- About NaN, INF and other special values:

According to RISC-V ISA paragraph 8.7 for fcvt.l instruction (Table 8.4) we have some special return values: 
1) if we exceed minimum input value, or  got -INF it returns (-2^63)
2) if we exceed maximum input value, or got +INF or NaN it returns (2^63-1)

(Also, if we exceed maximum/minimum input values on input we have already integer value, 
because according to IEEE754 double-precision f.p. format all doubles more +/- 2^52 are already integer values)

So we need to check if we got -2^63 or 2^63-1 after double->long int conversion,
comment lines from [src/hotspot/cpu/riscv/macroAssembler_riscv.cpp:4288](https://github.com/openjdk/jdk/pull/14991/files#diff-7a5c3ed05b6f3f06ed1c59f5fc2a14ec566a6a5bd1d09606115767daa99115bdR4288-R4291) describes how do we change result (converted_dbl -> converted_dbl_masked) and constant (mask) to check this values. I have tried to check for NaN and +-inf with just one conditional branch.
If we got -2^63 or 2^63-1 we return input value, therefore NaN -> NaN; +/- INF -> +/- INF; double that is already integer stays same as required by the ceil/floor/rint function descriptions.

- About case when we can use less instructions:

Of course, we can use a bit less instructions, but the main goal during intrinsic writing was minimizing count of expensive instructions (so instead of flt.d was used integer instructions on converted_dbl etc.)
We already have some cases when less expensive instructions were chosen instead of reducing their number. For example: https://bugs.openjdk.org/browse/JDK-8297359

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14991#discussion_r1274701998