RFR: 8338021: Support saturating vector operators in VectorAPI [v4]

Mon Sep 2 12:21:00 UTC 2024

On Mon, 26 Aug 2024 22:17:55 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:

>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Review comments resolutions.
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 10229:
> 
>> 10227:   InstructionMark im(this);
>> 10228:   assert(VM_Version::supports_avx512bw() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");
>> 10229:   InstructionAttr attributes(vector_len, /* vex_w */ true,/* legacy_mode */ false, /* no_mask_reg */ false,/* uses_vl */ true);
> 
> vex_w could be false here.

Encoding specification mentions W bit gets ignored, so no functional issues, will make it false to comply with our convention.

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6698:
> 
>> 6696:   // Unsigned values ranges comprise of only +ve numbers, thus there exist only an upper bound saturation.
>> 6697:   // overflow = ((UMAX - MAX(SRC1 & SRC2)) <u MIN(SRC1, SRC2)) >>> 31 == 1
>> 6698:   // Res = Signed Add INP1, INP2
> 
> The >>> 31 is not coded so comment could be improved to match the code.
> Comment has SRC1/INP1 term mixed.
> Also, could overflow not be implemented based on much simpler Java scalar algo:
> Overflow = Res <u  (INP1 | INP2)
> This is much straight forward, also evex supports unsigned comparison.

Java scalar algo was empirically proved to hold good, I also verified with Alive2 solver which proved its semantic equivalence to HD section 2-13 based vector implementation.
Here is the link to Alive2 solver which operates on LLVM IR inputs.
[https://alive2.llvm.org/ce/z/XDQ7dY](https://alive2.llvm.org/ce/z/XDQ7dY)

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6749:
> 
>> 6747:   vpor(xtmp2, xtmp2, src2, vlen_enc);
>> 6748:   // Compute mask for muxing T1 with T3 using SRC1.
>> 6749:   vpsign_extend_dq(etype, xtmp4, src1, vlen_enc);
> 
> I don't think we need to do the sign extension. The blend instruction uses most significant bit to do the blend.

Original vector is has double / quad word lanes which are being blended using byte level mask, sign extension will ensure that sign bit is propagated to MSB bits of each constituent byte mask corresponding double / quad word source lane.

> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 6939:
> 
>> 6937: 
>> 6938:   // Compose saturating min/max vector using first input polarity mask.
>> 6939:   vpsign_extend_dq(etype, xtmp4, src1, vlen_enc);
> 
> Sign extend to lower bits not needed as blend uses msbit only.

Original vector is has double / quad word lanes which are being blended using byte level mask, sign extension will ensure that sign bit is propagated to MSB bits of each constituent byte mask corresponding double / quad word source lane.

> src/hotspot/cpu/x86/x86.ad line 1953:
> 
>> 1951:        if (UseAVX < 1 || size_in_bits < 128 || (size_in_bits == 512 && !VM_Version::supports_avx512bw())) {
>> 1952:          return false;
>> 1953:        }
> 
> UseAVX < 1 could be written as UseAVX == 0. Could we not do register version for size_in_bit < 128?

I get your point, but constraints ensure we only address cases with vector size >= 128 bit in this patch..

> src/hotspot/cpu/x86/x86.ad line 10635:
> 
>> 10633: %}
>> 10634: 
>> 10635: instruct saturating_unsigned_add_reg_avx(vec dst, vec src1, vec src2, vec xtmp1, vec xtmp2, vec xtmp3, vec xtmp4)
> 
> Should the temp here and all the places related to !avx512vl() be legVec instead of vec?

Predicate already has AVX512VL check and so does dynamic register classes associated with its operands.

> src/hotspot/cpu/x86/x86.ad line 10656:
> 
>> 10654:   match(Set dst (SaturatingSubVI src1 src2));
>> 10655:   match(Set dst (SaturatingSubVL src1 src2));
>> 10656:   effect(TEMP ktmp);
> 
> This needs TEMP dst as well.

There is no use of either of the source operands after assignment to dst in the macro assembly routine.

> src/java.base/share/classes/java/lang/Byte.java line 647:
> 
>> 645:      */
>> 646:     public static byte subSaturating(byte a, byte b) {
>> 647:         byte res = (byte)(a - b);
> 
> Could we not do subSaturating as an int operation on similar lines as addSaturating?

Yes, core libs also have {add/subtract}Exact API which instead of saturating over / underflowing values throws ArithmeticException. Streamlining overflow checks for saturating long operations on the same lines to address Joe's concerns on new constants.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740820063
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740819360
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740823487
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740823011
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740819742
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740819629
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740822270
PR Review Comment: https://git.openjdk.org/jdk/pull/20507#discussion_r1740821085