RFR: 8338021: Support saturating vector operators in VectorAPI [v4]

Thu Aug 29 23:41:22 UTC 2024

On Mon, 19 Aug 2024 07:19:30 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> As per the discussion on panama-dev mailing list[1], patch adds the support following new vector operators.
>> 
>> 
>>      . SUADD   : Saturating unsigned addition.
>>      . SADD    : Saturating signed addition. 
>>      . SUSUB   : Saturating unsigned subtraction.
>>      . SSUB    : Saturating signed subtraction.
>>      . UMAX    : Unsigned max
>>      . UMIN    : Unsigned min.
>>      
>> 
>> New vector operators are applicable to only integral types since their values wraparound in over/underflowing scenarios after setting appropriate status flags. For floating point types, as per IEEE 754 specs there are multiple schemes to handler underflow, one of them is gradual underflow which transitions the value to subnormal range. Similarly, overflow implicitly saturates the floating-point value to an Infinite value.
>> 
>> As the name suggests, these are saturating operations, i.e. the result of the computation is strictly capped by lower and upper bounds of the result type and is not wrapped around in underflowing or overflowing scenarios.
>> 
>> Summary of changes:
>> - Java side implementation of new vector operators.
>> - Add new scalar saturating APIs for each of the above saturating vector operator in corresponding primitive box classes, fallback implementation of vector operators is based over it.
>> - C2 compiler IR and inline expander changes.
>> - Optimized x86 backend implementation for new vector operators and their predicated counterparts.
>> - Extends existing VectorAPI Jtreg test suite to cover new operations.
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> PS: Intrinsification and auto-vectorization of new core-lib API will be addressed separately in a follow-up patch.
>> 
>> [1] https://mail.openjdk.org/pipermail/panama-dev/2024-May/020408.html
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Review comments resolutions.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6674:

> 6672:   // Res = Mask ? Zero : Res
> 6673:   evmovdqu(etype, ktmp, dst, dst, false, vlen_enc);
> 6674: }

We could directly do masked evpsubd/evpsubq here with merge as false.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6698:

> 6696:   // Unsigned values ranges comprise of only +ve numbers, thus there exist only an upper bound saturation.
> 6697:   // overflow = ((UMAX - MAX(SRC1 & SRC2)) <u MIN(SRC1, SRC2)) >>> 31 == 1
> 6698:   // Res = Signed Add INP1, INP2

The >>> 31 is not coded so comment could be improved to match the code.
Comment has SRC1/INP1 term mixed.
Also, could overflow not be implemented based on much simpler Java scalar algo:
Overflow = Res <u  (INP1 | INP2)
This is much straight forward, also evex supports unsigned comparison.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6716:

> 6714: //
> 6715: // Adaptation of unsigned addition overflow detection from hacker's delight
> 6716: // section 2-13 : overflow = ((a & b) | ((a | b) & ~(s))) >>> 31 == 1

Not clear what is s here?  I think it is s = a + b. Could you please update the comments to indicate this.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6738:

> 6736:                                                        XMMRegister xtmp1, XMMRegister xtmp2, XMMRegister xtmp3,
> 6737:                                                        XMMRegister xtmp4, int vlen_enc) {
> 6738:   // Res = Signed Add INP1, INP2

Wondering if we could implement overflow here also based on much simpler Java scalar algo:
Overflow = Res <u (INP1 | INP2) which is same as:
                   Res <s ((INP1 + MIN) | (INP2 + MIN))

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6745:

> 6743:   vpcmpeqd(xtmp3, xtmp3, xtmp3, vlen_enc);
> 6744:   // T2 = ~Res
> 6745:   vpxor(xtmp2, xtmp3, dst, vlen_enc);

Did you mean this to be T3 = ~Res

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6749:

> 6747:   vpor(xtmp2, xtmp2, src2, vlen_enc);
> 6748:   // Compute mask for muxing T1 with T3 using SRC1.
> 6749:   vpsign_extend_dq(etype, xtmp4, src1, vlen_enc);

I don't think we need to do the sign extension. The blend instruction uses most significant bit to do the blend.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6932:

> 6930: 
> 6931:   // Sign-extend to compute overflow detection mask.
> 6932:   vpsign_extend_dq(etype, xtmp3, xtmp2, vlen_enc);

Sign extend to lower bits not needed as blend uses msbit only.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6939:

> 6937: 
> 6938:   // Compose saturating min/max vector using first input polarity mask.
> 6939:   vpsign_extend_dq(etype, xtmp4, src1, vlen_enc);

Sign extend to lower bits not needed as blend uses msbit only.

src/hotspot/cpu/x86/x86.ad line 10656:

> 10654:   match(Set dst (SaturatingSubVI src1 src2));
> 10655:   match(Set dst (SaturatingSubVL src1 src2));
> 10656:   effect(TEMP ktmp);

This needs TEMP dst as well.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737116841
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737272705
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737306541
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737307396
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737325898
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737338765
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737467234
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737467902
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1737489758