RFR (M) 8222074: Enhance auto vectorization for x86

Mon May 6 22:35:19 UTC 2019

Thanks a lot VladimirK and VladimirI. Vivek has offered to push the patch. 

Best Regards,
Sandhya

-----Original Message-----
From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] 
Sent: Monday, May 06, 2019 3:05 PM
To: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>; Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
Cc: hotspot-compiler-dev at openjdk.java.net
Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86

This looks good to me too. Thank you for cleaning this up and keeping size increase very small.

Thanks,
Vladimir K

On 5/6/19 12:15 PM, Vladimir Ivanov wrote:
> 
>> http://cr.openjdk.java.net/~sviswanathan/8222074/webrev.03/
> 
> Looks good.
> 
> Testing results are good as well.
> 
> Best regards,
> Vladimir Ivanov
> 
>> Yes, the footprint numbers continue to hold with this patch:
>>     The x86.ad file is 126 lines smaller.
>>     The libjvm size increase is only 0.24%.
>>
>> Best Regards,
>> Sandhya
>>
>> -----Original Message-----
>> From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com]
>> Sent: Friday, May 03, 2019 4:22 PM
>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>> Cc: hotspot-compiler-dev at openjdk.java.net; Vladimir Kozlov <vladimir.kozlov at oracle.com>
>> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
>>
>>
>>> http://cr.openjdk.java.net/~sviswanathan/8222074/webrev.02/
>>
>> Much better! I like how AD files look now. I assume static footprint
>> numbers you provided earlier are still valid.
>>
>> +void MacroAssembler::vabsnegd(int opcode, XMMRegister dst, Register scr) {
>> +  if (opcode == Op_AbsVD) {
>> +    andpd(dst,
>> ExternalAddress(StubRoutines::x86::vector_double_sign_mask()), scr);
>> +  } else {
>> +    assert((opcode == Op_NegVD),"opcode should be Op_NegD");
>> +    xorpd(dst,
>> ExternalAddress(StubRoutines::x86::vector_double_sign_flip()), scr);
>> +  }
>> +}
>>
>> It's a bit odd to see C2-specific stuff in MacroAssembler, but I'm
>> perfectly fine with incrementally refactor it later.
>>
>> For now, just guard relevant code with #ifdef COMPILER2.
>>
>> Otherwise, looks very good!
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>>
>>> Looking forward to your feedback.
>>>
>>> Best Regards,
>>> Sandhya
>>>
>>>
>>> -----Original Message-----
>>> From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com]
>>> Sent: Wednesday, May 01, 2019 5:09 PM
>>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Vladimir Kozlov <vladimir.kozlov at oracle.com>
>>> Cc: hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
>>>
>>> Sounds good, thanks!
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> On 01/05/2019 15:16, Viswanathan, Sandhya wrote:
>>>> I should add here that your suggestion of adding generic shift instruction etc to the macroAssembler is also 
>>>> wonderful instead of function pointer.  I will look into making that change as well.
>>>>
>>>> Best Regards,
>>>> Sandhya
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Viswanathan, Sandhya
>>>> Sent: Wednesday, May 01, 2019 3:10 PM
>>>> To: 'Vladimir Ivanov' <vladimir.x.ivanov at oracle.com>; Vladimir Kozlov <vladimir.kozlov at oracle.com>
>>>> Cc: hotspot-compiler-dev at openjdk.java.net
>>>> Subject: RE: RFR (M) 8222074: Enhance auto vectorization for x86
>>>>
>>>> Hi Vladimir,
>>>>
>>>> I agree, I wanted to show both the approaches in this patch to get your feedback:
>>>> 1) with emit as a function
>>>> 2) with emit part in the instruct body itself
>>>>
>>>> With emit as a function it becomes hard to read and I personally prefer it in the instruct itself as is done for 
>>>> vabsneg2D etc. That is what you are recommending as well so I feel good.
>>>>
>>>> Once the adlc enhancement is done both the approaches should give similar binary size. Till then there will be small 
>>>> overhead with approach 2) as emit is duplicated per match rule.
>>>>
>>>> I will send an updated patch fixing the two issues you mentioned in your previous email plus this change of using 
>>>> approach 2).
>>>>
>>>> Please do let me know if you want to see any other change in this patch.
>>>>
>>>> Best Regards,
>>>> Sandhya
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com]
>>>> Sent: Wednesday, May 01, 2019 2:58 PM
>>>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Vladimir Kozlov <vladimir.kozlov at oracle.com>
>>>> Cc: hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
>>>>
>>>>
>>>>> http://cr.openjdk.java.net/~sviswanathan/8222074/webrev.01/
>>>>
>>>> Nice job, Sandhya! Glad to hear the approach pays off!
>>>>
>>>> Unfortunately, I must note that AD file becomes much more obscure.
>>>> Especially with those function pointers.
>>>>
>>>> 1528 void emit_vshift16B_code(MacroAssembler& _masm, int opcode, XMMRegister dst,
>>>> 1529                         XMMRegister src, XMMRegister shift,
>>>> 1530                         XMMRegister tmp1, XMMRegister tmp2,
>>>> Register scratch) {
>>>> 1531   XX_Inst extendinst = get_extend_inst(opcode == Op_URShiftVB ?
>>>> false : true);
>>>> 1532   XX_Inst shiftinst = get_xx_inst(opcode);
>>>> 1533
>>>> 1534   (_masm.*extendinst)(tmp1, src);
>>>> 1535   (_masm.*shiftinst)(tmp1, shift);
>>>> 1536   __ pshufd(tmp2, src, 0xE);
>>>> 1537   (_masm.*extendinst)(tmp2, tmp2);
>>>> 1538   (_masm.*shiftinst)(tmp2, shift);
>>>> 1539   __ movdqu(dst, ExternalAddress(vector_short_to_byte_mask()),
>>>> scratch);
>>>> 1540   __ pand(tmp2, dst);
>>>> 1541   __ pand(dst, tmp1);
>>>> 1542   __ packuswb(dst, tmp2);
>>>> 1543 }
>>>>
>>>> Have you tried to encapsulate that into x86-specific MacroAssembler?
>>>>
>>>> 8682 instruct vshift16B(vecX dst, vecX src, vecS shift, vecX tmp1, vecX tmp2, rRegI scratch) %{
>>>> 8683   predicate(UseSSE > 3  && UseAVX <= 1 && n->as_Vector()->length()
>>>> == 16);
>>>> 8684   match(Set dst (LShiftVB src shift));
>>>> 8685   match(Set dst (RShiftVB src shift));
>>>> 8686   match(Set dst (URShiftVB src shift));
>>>> 8687   effect(TEMP dst, TEMP tmp1, TEMP tmp2, TEMP scratch);
>>>> 8688   format %{"pmovxbw   $tmp1,$src\n\t"
>>>> 8689            "shiftop   $tmp1,$shift\n\t"
>>>> 8690            "pshufd    $tmp2,$src\n\t"
>>>> 8691            "pmovxbw   $tmp2,$tmp2\n\t"
>>>> 8692            "shiftop   $tmp2,$shift\n\t"
>>>> 8693            "movdqu    $dst,[0x00ff00ff0x00ff00ff]\n\t"
>>>> 8694            "pand      $tmp2,$dst\n\t"
>>>> 8695            "pand      $dst,$tmp1\n\t"
>>>> 8696            "packuswb  $dst,$tmp2\n\t! packed16B shift" %}
>>>> 8697   ins_encode %{
>>>> 8698     emit_vshift16B_code(_masm, this->as_Mach()->ideal_Opcode() ,
>>>> $dst$$XMMRegister, $src$$XMMRegister, $shift$$XMMRegister, $tmp1$$XMMRegister, $tmp2$$XMMRegister, $scratch$$Register);
>>>> 8699   %}
>>>> 8700   ins_pipe( pipe_slow );
>>>> 8701 %}
>>>>
>>>> can be turned into something like:
>>>>
>>>> instruct vshift16B(vecX dst, vecX src, vecS shift, vecX tmp1, vecX tmp2, rRegI scratch) %{
>>>>       predicate(n->as_Vector()->length() == 16);
>>>>       match(Set dst (LShiftVB src shift));
>>>>       match(Set dst (RShiftVB src shift));
>>>>       match(Set dst (URShiftVB src shift));
>>>>       effect(TEMP dst, TEMP tmp1, TEMP tmp2, TEMP scratch);
>>>>       format %{"packed16B shift" %}
>>>>       ins_encode %{
>>>>         int vlen = 0; // 128-bit
>>>>         BasicType elem_type = T_BYTE;
>>>>         int shift_mode = ...; // L/R/UR or S/U + L/R
>>>>         __ vshift(vlen, elem_type, shift_mode,
>>>>                   $dst$$..., $src$$..., $shift$$...,
>>>>           $tmp1$$..., $tmp2$$..., $scratch$$...);
>>>>        %}
>>>>
>>>> Then MA::vshift can dispatch between different implementations depending on SSE/AVX level available. Do you see any 
>>>> problems with that from footprint perspective?
>>>>
>>>> Ideally, I'd prefer to see a library of operations on vectors encapsulated in MacroAssembler (or a subclass) and 
>>>> used in x86.ad. That will accommodate further reductions in AD instructions needed.
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>>> With this webrev the ad file has only about 60 lines effectively added.
>>>>> Also the generated product libjvm.so size only increases by about 0.26% vs the prior 1.50%.
>>>>> I have used multiple match rules in one instruct for same size shift related rules and also for the new Abs/Neg rules.
>>>>> What I noticed is that the adlc still duplicates lot of code and there is potential to further improve code size 
>>>>> for multiple match rule case by improving the adlc itself.
>>>>> The adlc improvement (like removing duplicate emits, formats, expand, pipeline etc) can be done as a separate RFE.
>>>>> In this webrev, I have also fixed the errors reported by Vladimir Ivanov and corrected the issues reported by 
>>>>> jcheck tool.
>>>>> Also taken into account reducing the temporary by using TEMP dst for multiply rules.
>>>>>
>>>>> The compiler jtreg tests and the java math tests pass on Haswell, SKX, and KNL.
>>>>>
>>>>> Your review and feedback is welcome.
>>>>>
>>>>> Best Regards,
>>>>> Sandhya
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-compiler-dev
>>>>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of
>>>>> Viswanathan, Sandhya
>>>>> Sent: Wednesday, April 10, 2019 10:22 AM
>>>>> To: Vladimir Kozlov <vladimir.kozlov at oracle.com>; B. Blaser
>>>>> <bsrbnd at gmail.com>
>>>>> Cc: hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: RFR (M) 8222074: Enhance auto vectorization for x86
>>>>>
>>>>> Yes good catch, in mul32B_reg_avx(), the last two instructions are the only place where dst is used:
>>>>>
>>>>>         __ vpackuswb($dst$$XMMRegister, $tmp2$$XMMRegister, $tmp1$$XMMRegister, vector_len);
>>>>>         __ vpermq($dst$$XMMRegister, $dst$$XMMRegister, 0xD8,
>>>>> vector_len);
>>>>>
>>>>> Here dst can be same as tmp2 or tmp1 in packuswb() and so the effect TEMP dst is not required.
>>>>>
>>>>> Best Regards,
>>>>> Sandhya
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>>> Sent: Wednesday, April 10, 2019 9:59 AM
>>>>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; B. Blaser
>>>>> <bsrbnd at gmail.com>
>>>>> Cc: hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
>>>>>
>>>>> On 4/10/19 8:36 AM, Viswanathan, Sandhya wrote:
>>>>>> Hi Bernard,
>>>>>>
>>>>>> One could add TEMP dst in effect() to let the register allocator know that dst needs to be different from src.
>>>>>
>>>>> Yes, we use this way. Or, in mul4B_reg() case, we can use $dst instead
>>>>> $tmp2 to avoid overwriting
>>>>> $src2 before we get value from it if $dst = $src2.
>>>>>
>>>>> On other hand, mul32B_reg_avx() and other have 'TEMP dst' effect but $dst is used only for final result.
>>>>>
>>>>> It is a little mess which may cause ineffective use of registers in compiled code.
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>> Sandhya
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: B. Blaser [mailto:bsrbnd at gmail.com]
>>>>>> Sent: Wednesday, April 10, 2019 4:10 AM
>>>>>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>>>>> Cc: Vladimir Kozlov <vladimir.kozlov at oracle.com>;
>>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>>> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
>>>>>>
>>>>>> Hi Sandhya and Vladimir K.,
>>>>>>
>>>>>> On Wed, 10 Apr 2019 at 03:06, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
>>>>>>>
>>>>>>> Hi Vladimir,
>>>>>>>
>>>>>>> Yes, I missed the question below:
>>>>>>>>> There are cases where we can use less `TEMP tmp` registers by using 'dst' register like in mul4B_reg(). Is it 
>>>>>>>>> intentional to not use 'dst' there?
>>>>>>>
>>>>>>> No it is not intentional, we can use the dst register in those cases and reduced the tmps.
>>>>>>
>>>>>> I guess we have to be careful using $dst instead of $tmp registers as the allocator sometimes provides identical 
>>>>>> $src & $dst. Also, I'm not sure this would be possible in the case of mul4B_reg():
>>>>>>
>>>>>> 7349   format %{"pmovsxbw  $tmp,$src1\n\t"
>>>>>> 7350            "pmovsxbw  $tmp2,$src2\n\t"
>>>>>>
>>>>>> I believe this couldn't work if you use $dst instead of $tmp and $dst = $src2, what do you think?
>>>>>>
>>>>>> Thanks,
>>>>>> Bernard
>>>>>>