RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v9]

Tue May 24 20:56:57 UTC 2022

On Sat, 21 May 2022 15:42:34 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> Hi Vladimir (@vnkozlov)
>> 
>> For 32-bit, in the case of double, we see a performance improvement using the `vfpclasssd` instruction, but **without** `vfpclasssd` we see a **40% decrease** in performance for `isFinite()` compared to the original Java code. Below is the code that implements the intrinsic using SSE.
>> 
>> Is it OK to skip support for **non**-`vfpclasssd` on 32-bit?
>> 
>> 
>> void C2_MacroAssembler::double_class_check_sse(int opcode, XMMRegister src, Register dst, Register temp, Register temp1) {
>>   int32_t POS_INF_HI = 0x7ff00000; // hi 32bits
>>   int32_t KILL_SIGN_MASK_HI = 0x7fffffff; // hi 32 bits
>> 
>>   pshuflw(src, src, 0x4e); //switch hi to lo
>>   movdl(temp, src);
>>   movl(temp1, KILL_SIGN_MASK_HI);
>>   andl(temp, temp1);
>>   movl(temp1, POS_INF_HI);
>>   cmpl(temp, temp1);
>>   switch (opcode) {
>>     case Op_IsFiniteD:
>>       setb(Assembler::below, dst);
>>       break;
>>     case Op_IsInfiniteD:
>>       setb(Assembler::equal, dst);
>>       break;
>>     case Op_IsNaND:
>>       setb(Assembler::above, dst);
>>       break;
>>     default:
>>       assert(false, "%s", NodeClassNames[opcode]);
>>   }
>>   andl(dst, 0xff);
>> }
>
>> For 32-bit, in the case of double, we see a performance improvement using the `vfpclasssd` instruction, but **without** `vfpclasssd` we see a **40% decrease** in performance for `isFinite()` compared to the original Java code. Below is the code that implements the intrinsic using SSE.
>> 
>> Is it OK to skip support for **non**-`vfpclasssd` on 32-bit?
> 
> Yes, but add a comment about that. Also, for 32-bit you need to check for SSE2 support, which is required by `pshuflw`.

Hi Vladimir (@vnkozlov),

Could you please review this updated PR? In this updated patch, we **removed** the intrinsics that do **not** use the `vfpclasss/d` instructions.

- We collected new performance data for the `vfpclasss/d` intrinsics after rebasing onto the latest changes (which include #8525, submitted by @merykitty).
- Using the `vfpclasss/d` instructions gives up to roughly 70% speedup over the existing baseline.
- This works on both 64-bit and 32-bit.
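
For context, the three predicates can be expressed as plain bit tests on the raw IEEE 754 pattern. The sketch below is illustrative only and is not the code in this PR; the class name `FpClassSketch` and the constant names are hypothetical. It clears the sign bit and compares the full 64-bit pattern against that of +Infinity, then checks itself against the `java.lang.Double` methods:

```java
public final class FpClassSketch {
    // Hypothetical names, for illustration only.
    private static final long ABS_MASK = 0x7fffffffffffffffL; // clears the sign bit
    private static final long POS_INF  = 0x7ff0000000000000L; // bit pattern of +Infinity

    // With the sign bit cleared the value is non-negative, so a signed compare is safe.
    private static long magnitudeBits(double d) {
        return Double.doubleToRawLongBits(d) & ABS_MASK;
    }

    static boolean isFinite(double d)   { return magnitudeBits(d) <  POS_INF; }
    static boolean isInfinite(double d) { return magnitudeBits(d) == POS_INF; }
    static boolean isNaN(double d)      { return magnitudeBits(d) >  POS_INF; }

    public static void main(String[] args) {
        double[] samples = {
            0.0, -0.0, 1.5, -Double.MAX_VALUE, Double.MIN_VALUE,
            Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN,
            // A NaN whose payload lives only in the low 32 bits.
            Double.longBitsToDouble(0x7ff0000000000001L)
        };
        for (double v : samples) {
            boolean agrees = isFinite(v) == Double.isFinite(v)
                          && isInfinite(v) == Double.isInfinite(v)
                          && isNaN(v) == Double.isNaN(v);
            System.out.println(v + " agrees with java.lang.Double: " + agrees);
        }
    }
}
```

The last sample is there because a check that looks only at the upper 32 bits cannot tell such a NaN apart from Infinity, so the sketch compares all 64 bits.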

Please see the updated data below (the RFE main text has been updated as well).

Benchmark (ns/op)                 Baseline  Intrinsic (vfpclasss/d)  Speedup (%)
FloatClassCheck.testIsFinite      0.562     0.406                    28%
FloatClassCheck.testIsInfinite    0.815     0.383                    53%
FloatClassCheck.testIsNaN         0.63      0.382                    39%
DoubleClassCheck.testIsFinite     0.565     0.409                    28%
DoubleClassCheck.testIsInfinite   0.812     0.375                    54%
DoubleClassCheck.testIsNaN        0.631     0.38                     40%
FPComparison.isFiniteDouble       332.638   272.577                  18%
FPComparison.isFiniteFloat        413.217   331.825                  20%
FPComparison.isInfiniteDouble     874.897   240.632                  72%
FPComparison.isInfiniteFloat      872.279   321.269                  63%
FPComparison.isNanDouble          286.566   240.36                   16%
FPComparison.isNanFloat           346.123   316.923                  8%
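
The Speedup column appears consistent with the relative reduction in time per operation, (baseline - intrinsic) / baseline, rounded to the nearest percent. A minimal sketch of that arithmetic, assuming this is how the column was derived (the class and method names are hypothetical):

```java
public final class SpeedupSketch {
    // Relative reduction in ns/op, as a whole-number percentage.
    static long speedupPercent(double baselineNs, double intrinsicNs) {
        return Math.round((baselineNs - intrinsicNs) / baselineNs * 100.0);
    }

    public static void main(String[] args) {
        // Values taken from the FloatClassCheck rows of the table.
        System.out.println("testIsFinite:   " + speedupPercent(0.562, 0.406) + "%");
        System.out.println("testIsInfinite: " + speedupPercent(0.815, 0.383) + "%");
        System.out.println("testIsNaN:      " + speedupPercent(0.63, 0.382) + "%");
    }
}
```

Note that under this definition a 72% figure means the intrinsic takes about 28% of the baseline time, i.e. roughly 3.6x faster.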

Thanks,
Vamsi

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459