RFR: 8277997: Intrinsic creation for VectorMask.fromLong API [v3]

Tue Dec 7 02:22:14 UTC 2021

On Mon, 6 Dec 2021 17:44:01 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Summary of changes:
>> 
>> 1) Inline expansion of the VectorMask.fromLong API, including the Java API implementation and C2 IR changes.
>> 2) X86 backend support for AVX512 and AVX2 targets.
>> 3) New IR transformations to handle the following patterns:
>>   a) Mask2Long + Long2Mask -> MaskCast (when source and destination mask lengths are equal)
>>   b) Long2Mask + Mask2Long -> Long
>> 4) The following performance data was collected for the new JMH micro included with the patch:
>> 
>> System Configuration : Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (40C 2S Icelake Server)
>> 
>> Benchmark | Baseline AVX2 (ops/ms) | Withopt AVX2 (ops/ms) | Gain factor (AVX2) | Baseline AVX3 (ops/ms) | Withopt AVX3 (ops/ms) | Gain factor (AVX3)
>> -- | -- | -- | -- | -- | -- | --
>> MaskFromLongBenchmark.microMaskFromLong_Byte128 | 20050.884 | 36414.349 | 1.816096936 | 19699.631 | 36412.252 | 1.848372287
>> MaskFromLongBenchmark.microMaskFromLong_Byte256 | 17589.496 | 36418.368 | 2.070461143 | 17211.451 | 36407.44 | 2.115303352
>> MaskFromLongBenchmark.microMaskFromLong_Byte512 | 2824.411 | 2492.795 | 0.882589326 | 6359.071 | 36405.344 | 5.72494693
>> MaskFromLongBenchmark.microMaskFromLong_Byte64 | 23507.28 | 36424.668 | 1.549505855 | 22659.666 | 36420.345 | 1.607276338
>> MaskFromLongBenchmark.microMaskFromLong_Integer128 | 24567.895 | 36411.602 | 1.482080659 | 24620.619 | 36397.005 | 1.478313969
>> MaskFromLongBenchmark.microMaskFromLong_Integer256 | 23495.078 | 36411.981 | 1.549770595 | 22823.846 | 36395.703 | 1.594634971
>> MaskFromLongBenchmark.microMaskFromLong_Integer512 | 12377.022 | 11478.101 | 0.927371786 | 19701.118 | 36394.878 | 1.847350897
>> MaskFromLongBenchmark.microMaskFromLong_Integer64 | 22169.231 | 17791.849 | 0.802546962 | 23603.169 | 18055.166 | 0.76494669
>> MaskFromLongBenchmark.microMaskFromLong_Long128 | 22312.568 | 17859.474 | 0.800422166 | 22171.303 | 18106.295 | 0.816654529
>> MaskFromLongBenchmark.microMaskFromLong_Long256 | 24271.19 | 36416.883 | 1.500416049 | 24621.327 | 36390.41 | 1.478003602
>> MaskFromLongBenchmark.microMaskFromLong_Long512 | 15289.749 | 13860.775 | 0.906540389 | 23003.816 | 36396.033 | 1.582173714
>> MaskFromLongBenchmark.microMaskFromLong_Long64 | 27086.471 | 20490.828 | 0.756496777 | 27177.133 | 20441.112 | 0.752143797
>> MaskFromLongBenchmark.microMaskFromLong_Short128 | 23504.216 | 36412.66 | 1.549196961 | 22823.401 | 36417.799 | 1.595634191
>> MaskFromLongBenchmark.microMaskFromLong_Short256 | 20056.61 | 36403.277 | 1.815026418 | 19699.502 | 36412.605 | 1.84840231
>> MaskFromLongBenchmark.microMaskFromLong_Short512 | 4775.721 | 6827.594 | 1.429646749 | 17209.782 | 36388.226 | 2.114392036
>> MaskFromLongBenchmark.microMaskFromLong_Short64 | 24759.049 | 36381.539 | 1.469423927 | 24506.013 | 36413.099 | 1.48588426
>> 
>> 
>> 
>> Kindly review and share feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8277997: Review comments resolved.
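For reference while reading the transformations in item 3, the two conversions can be modelled in plain Java. This is only a sketch of the documented API-level semantics of VectorMask.fromLong and VectorMask.toLong (lane i corresponds to bit i), not the C2 IR from the patch; the class name, the helper names, and the use of boolean[] as a stand-in for a mask are all assumptions made for illustration.

```java
public class MaskLongRoundTrip {
    // Long2Mask model (hypothetical helper): lane i is set when bit i of bits is set.
    static boolean[] fromLong(long bits, int laneCount) {
        boolean[] lanes = new boolean[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = ((bits >>> i) & 1L) != 0;
        }
        return lanes;
    }

    // Mask2Long model (hypothetical helper): pack lane i into bit i of the result.
    static long toLong(boolean[] lanes) {
        long bits = 0L;
        for (int i = 0; i < lanes.length; i++) {
            if (lanes[i]) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        long bits = 0b1010_0110L;
        // All set bits fit in 8 lanes, so the round trip returns the input.
        System.out.println(toLong(fromLong(bits, 8)) == bits);              // true
        // With only 4 lanes, bits 4 and above are dropped by the model.
        System.out.println(Long.toBinaryString(toLong(fromLong(bits, 4)))); // 110
    }
}
```

In this model the Long2Mask + Mask2Long round trip is the identity only for bits below the lane count; any higher bits in the input are lost.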

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4070:

> 4068:   movq(rtmp2, src);
> 4069:   mov64(rtmp1, 0x0101010101010101L);
> 4070:   pdep(rtmp1, rtmp2, rtmp1);

For masklen < 8, we could generate pdep(rtmp1, src, rtmp1) directly;
rtmp2 is not required in that case.
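
To make the effect of the pdep with the 0x0101010101010101 constant concrete: each of the low 8 source bits is deposited into the least-significant bit of its own byte, giving one 0x00/0x01 byte per mask lane. Below is a minimal software model in Java, assuming the standard BMI2 PDEP semantics; the class and the helper names pdep and maskBitsToBytes are hypothetical and are not part of the patch or of HotSpot.

```java
public class PdepSketch {
    // Software model of BMI2 PDEP (hypothetical helper): deposit the low-order
    // bits of src, in order, into the positions of the set bits of mask.
    static long pdep(long src, long mask) {
        long result = 0L;
        while (mask != 0) {
            long lowest = mask & -mask;   // lowest set bit remaining in mask
            if ((src & 1L) != 0) {
                result |= lowest;         // deposit the current source bit there
            }
            src >>>= 1;                   // advance to the next source bit
            mask &= mask - 1;             // clear the bit just handled
        }
        return result;
    }

    // Expand up to 8 mask bits into one byte per lane (0x00 or 0x01).
    static long maskBitsToBytes(long bits) {
        return pdep(bits, 0x0101010101010101L);
    }

    public static void main(String[] args) {
        // Bits 0, 1 and 3 are set, so bytes 0, 1 and 3 become 0x01.
        System.out.println(Long.toHexString(maskBitsToBytes(0b1011L))); // 1000101
    }
}
```

Only the low 8 bits of the source influence the result, because the constant has exactly 8 set bits.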

src/hotspot/cpu/x86/x86.ad line 9516:

> 9514:     int vec_enc  = vector_length_encoding(mask_len*8);
> 9515:     __ vector_long_to_maskvec($dst$$XMMRegister, $src$$Register, $rtmp1$$Register,
> 9516:                               $rtmp2$$Register, $xtmp1$$XMMRegister, mask_len, vec_enc);

xtmp2 is not being used here?

src/hotspot/cpu/x86/x86.ad line 9529:

> 9527:     int mask_len = Matcher::vector_length(this);
> 9528:     __ movq($rtmp$$Register, $src$$Register);
> 9529:     __ kmov($dst$$KRegister, $rtmp$$Register);

Why the extra move to rtmp here? Can't we generate kmov(dst, src) directly?

src/hotspot/share/opto/vectorIntrinsics.cpp line 803:

> 801:   // MODE_BITS_COERCED_BROADCAST for VectorMask.maskAll operation.
> 802:   // MODE_BITS_COERCED_LONG_TO_MASK for VectorMask.fromLong operation.
> 803:   const TypeInt*     mode         = gvn().type(argument(5))->isa_int();

Isn't mode argument(4)?

src/hotspot/share/opto/vectornode.cpp line 1506:

> 1504:      if (src->Opcode() == Op_VectorStoreMask) {
> 1505:        src = src->in(1);
> 1506:      }

What if src happens to be a phi node here?

-------------

PR: https://git.openjdk.java.net/jdk/pull/6646