RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v9]
Srinivas Vamsi Parasa
duke at openjdk.java.net
Tue May 24 20:56:57 UTC 2022
On Sat, 21 May 2022 15:42:34 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:
>> Hi Vladimir (@vnkozlov)
>>
>> For 32-bit, in the case of double, we see a performance improvement using the `vfpclasssd` instruction, but **without** `vfpclasssd` we see a **40% decrease** in performance for `isFinite()` compared to the original Java code. Below is the code that implements the intrinsic using SSE.
>>
>> Is it OK to skip support for **non**-`vfpclasssd` on 32-bit?
>>
>>
>> void C2_MacroAssembler::double_class_check_sse(int opcode, XMMRegister src, Register dst, Register temp, Register temp1) {
>>   int32_t POS_INF_HI = 0x7ff00000;        // hi 32 bits of +Infinity
>>   int32_t KILL_SIGN_MASK_HI = 0x7fffffff; // hi 32 bits, clears the sign bit
>>
>>   pshuflw(src, src, 0x4e); // switch hi to lo (modifies src)
>>   movdl(temp, src);        // temp = hi 32 bits of the double
>>   movl(temp1, KILL_SIGN_MASK_HI);
>>   andl(temp, temp1);       // drop the sign bit
>>   movl(temp1, POS_INF_HI);
>>   cmpl(temp, temp1);       // only the hi 32 bits are compared
>>   switch (opcode) {
>>     case Op_IsFiniteD:
>>       setb(Assembler::below, dst);
>>       break;
>>     case Op_IsInfiniteD:
>>       setb(Assembler::equal, dst);
>>       break;
>>     case Op_IsNaND:
>>       setb(Assembler::above, dst);
>>       break;
>>     default:
>>       assert(false, "%s", NodeClassNames[opcode]);
>>   }
>>   andl(dst, 0xff);
>> }
>
> Yes, but add a comment about that. Also, for 32-bit you need to check for SSE2 support, which `pshuflw` requires.
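For readers following the thread, the bit-level test behind the quoted SSE routine can be mirrored in plain Java. This is only a sketch and not code from the PR: the class name and the helper methods (`maskedHi`, `isFiniteHi`, `magnitudeBits`, `isInfiniteBits`, `isNaNBits`) are made up for illustration. The finite check uses just the high 32 bits, as the quoted routine does. For the infinite and NaN checks the sketch compares all 64 bits, on the assumption that a NaN payload may sit entirely in the low word.

```java
public class DoubleClassCheckSketch {
    // Same constants as the quoted assembler routine.
    static final int POS_INF_HI = 0x7ff00000;        // high 32 bits of +Infinity
    static final int KILL_SIGN_MASK_HI = 0x7fffffff; // clears the sign bit

    // Hypothetical helper: high 32 bits of the raw encoding, sign bit cleared.
    static int maskedHi(double d) {
        return (int) (Double.doubleToRawLongBits(d) >>> 32) & KILL_SIGN_MASK_HI;
    }

    // Finite values have an exponent field below all-ones, so the high word decides.
    static boolean isFiniteHi(double d) {
        return maskedHi(d) < POS_INF_HI;
    }

    // Hypothetical helper: full 64-bit encoding with the sign bit cleared.
    static long magnitudeBits(double d) {
        return Double.doubleToRawLongBits(d) & 0x7fffffffffffffffL;
    }

    // Infinity: exponent all-ones and mantissa zero (checked over all 64 bits).
    static boolean isInfiniteBits(double d) {
        return magnitudeBits(d) == 0x7ff0000000000000L;
    }

    // NaN: exponent all-ones and mantissa non-zero (checked over all 64 bits).
    static boolean isNaNBits(double d) {
        return magnitudeBits(d) > 0x7ff0000000000000L;
    }

    public static void main(String[] args) {
        double[] samples = {0.0, -1.5, Double.MAX_VALUE,
                            Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN};
        for (double d : samples) {
            System.out.println(d + ": finite=" + isFiniteHi(d)
                    + " infinite=" + isInfiniteBits(d)
                    + " nan=" + isNaNBits(d));
        }
    }
}
```

Because the sign bit is cleared first, the masked values are non-negative, so Java's signed comparison gives the same ordering as the unsigned `below`/`above` conditions in the assembler version.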
Hi Vladimir (@vnkozlov),
Could you please review this updated PR? In this updated patch, we **removed** the intrinsics that do **not** use the `vfpclasss/d` instructions.
- Collected new performance data for the `vfpclasss/d` intrinsics after rebasing onto the latest changes (which include #8525, submitted by @merykitty).
- Using the `vfpclasss/d` instructions gives up to `70%` speedup over the existing baseline.
- This works on both 64-bit and 32-bit.
Please see the updated data below (the RFE main text has also been updated).
Benchmark (ns/op)                Baseline   Intrinsic (vfpclasss/d)   Speedup (%)
FloatClassCheck.testIsFinite        0.562                     0.406           28%
FloatClassCheck.testIsInfinite      0.815                     0.383           53%
FloatClassCheck.testIsNaN            0.63                     0.382           39%
DoubleClassCheck.testIsFinite       0.565                     0.409           28%
DoubleClassCheck.testIsInfinite     0.812                     0.375           54%
DoubleClassCheck.testIsNaN          0.631                      0.38           40%
FPComparison.isFiniteDouble       332.638                   272.577           18%
FPComparison.isFiniteFloat        413.217                   331.825           20%
FPComparison.isInfiniteDouble     874.897                   240.632           72%
FPComparison.isInfiniteFloat      872.279                   321.269           63%
FPComparison.isNanDouble          286.566                    240.36           16%
FPComparison.isNanFloat           346.123                   316.923            8%
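The Speedup column appears to be the relative reduction in time per operation, rounded to a whole percent. That reading is an assumption (the mail does not state the formula), but it is consistent with the rows above. A small sketch with a made-up class and method name:

```java
public class SpeedupSketch {
    // Assumed formula: percentage reduction in ns/op relative to the baseline,
    // rounded to the nearest whole percent.
    static long speedupPercent(double baselineNs, double intrinsicNs) {
        return Math.round(100.0 * (baselineNs - intrinsicNs) / baselineNs);
    }

    public static void main(String[] args) {
        // Two rows taken from the table in this mail.
        System.out.println(speedupPercent(0.562, 0.406));     // FloatClassCheck.testIsFinite
        System.out.println(speedupPercent(346.123, 316.923)); // FPComparison.isNanFloat
    }
}
```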
Thanks,
Vamsi
-------------
PR: https://git.openjdk.java.net/jdk/pull/8459