RFR (M) 8222074: Enhance auto vectorization for x86

Tue May 7 17:41:03 UTC 2019

Good point, Eric. I agree that having microbenchmarks accompanying such 
changes would be very valuable.

There's a rich set of microbenchmarks in Project Panama for Vector API 
and some of them target auto-vectorization case as well. Sandya, can you 
share the plans, please, regarding contributing relevant ones to the 
mainline?

I'm fine with handling that as a separate RFE.

Best regards,
Vladimir Ivanov

On 06/05/2019 07:21, Eric Caspole wrote:
> Hi Sandhya,
> Could add some new JMH to this webrev that target the java code that 
> show the benefit of these changes? Or, you could look through the 
> existing ones in
>   test/micro/org/openjdk/bench/
> 
> and mention in the bug which existing ones exercise these changes. That 
> will be a big help to us in the course of working on JDK 13.
> Thanks,
> Eric
> 
> 
> 
> On 5/3/19 19:02, Viswanathan, Sandhya wrote:
>> Hi Vladimir,
>>
>> Please find below the updated webrev which implements all your inputs:
>> http://cr.openjdk.java.net/~sviswanathan/8222074/webrev.02/
>>
>> Looking forward to your feedback.
>>
>> Best Regards,
>> Sandhya
>>
>>
>> -----Original Message-----
>> From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com]
>> Sent: Wednesday, May 01, 2019 5:09 PM
>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Vladimir 
>> Kozlov <vladimir.kozlov at oracle.com>
>> Cc: hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
>>
>> Sounds good, thanks!
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 01/05/2019 15:16, Viswanathan, Sandhya wrote:
>>> I should add here that your suggestion of adding generic shift 
>>> instruction etc to the macroAssembler is also wonderful instead of 
>>> function pointer.  I will look into making that change as well.
>>>
>>> Best Regards,
>>> Sandhya
>>>
>>>
>>> -----Original Message-----
>>> From: Viswanathan, Sandhya
>>> Sent: Wednesday, May 01, 2019 3:10 PM
>>> To: 'Vladimir Ivanov' <vladimir.x.ivanov at oracle.com>; Vladimir Kozlov 
>>> <vladimir.kozlov at oracle.com>
>>> Cc: hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: RFR (M) 8222074: Enhance auto vectorization for x86
>>>
>>> Hi Vladimir,
>>>
>>> I agree, I wanted to show both the approaches in this patch to get 
>>> your feedback:
>>> 1) with emit as a function
>>> 2) with emit part in the instruct body itself
>>>
>>> With emit as a function it becomes hard to read and I personally 
>>> prefer it in the instruct itself as is done for vabsneg2D etc. That 
>>> is what you are recommending as well so I feel good.
>>>
>>> Once the adlc enhancement is done both the approaches should give 
>>> similar binary size. Till then there will be small overhead with 
>>> approach 2) as emit is duplicated per match rule.
>>>
>>> I will send an updated patch fixing the two issues you mentioned in 
>>> your previous email plus this change of using approach 2).
>>>
>>> Please do let me know if you want to see any other change in this patch.
>>>
>>> Best Regards,
>>> Sandhya
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com]
>>> Sent: Wednesday, May 01, 2019 2:58 PM
>>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Vladimir 
>>> Kozlov <vladimir.kozlov at oracle.com>
>>> Cc: hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
>>>
>>>
>>>> http://cr.openjdk.java.net/~sviswanathan/8222074/webrev.01/
>>>
>>> Nice job, Sandhya! Glad to hear the approach pays off!
>>>
>>> Unfortunately, I must note that AD file becomes much more obscure.
>>> Especially with those function pointers.
>>>
>>> 1528 void emit_vshift16B_code(MacroAssembler& _masm, int opcode, 
>>> XMMRegister dst,
>>> 1529                         XMMRegister src, XMMRegister shift,
>>> 1530                         XMMRegister tmp1, XMMRegister tmp2,
>>> Register scratch) {
>>> 1531   XX_Inst extendinst = get_extend_inst(opcode == Op_URShiftVB ?
>>> false : true);
>>> 1532   XX_Inst shiftinst = get_xx_inst(opcode);
>>> 1533
>>> 1534   (_masm.*extendinst)(tmp1, src);
>>> 1535   (_masm.*shiftinst)(tmp1, shift);
>>> 1536   __ pshufd(tmp2, src, 0xE);
>>> 1537   (_masm.*extendinst)(tmp2, tmp2);
>>> 1538   (_masm.*shiftinst)(tmp2, shift);
>>> 1539   __ movdqu(dst, ExternalAddress(vector_short_to_byte_mask()),
>>> scratch);
>>> 1540   __ pand(tmp2, dst);
>>> 1541   __ pand(dst, tmp1);
>>> 1542   __ packuswb(dst, tmp2);
>>> 1543 }
>>>
>>> Have you tried to encapsulate that into x86-specific MacroAssembler?
>>>
>>> 8682 instruct vshift16B(vecX dst, vecX src, vecS shift, vecX tmp1, 
>>> vecX tmp2, rRegI scratch) %{
>>> 8683   predicate(UseSSE > 3  && UseAVX <= 1 && n->as_Vector()->length()
>>> == 16);
>>> 8684   match(Set dst (LShiftVB src shift));
>>> 8685   match(Set dst (RShiftVB src shift));
>>> 8686   match(Set dst (URShiftVB src shift));
>>> 8687   effect(TEMP dst, TEMP tmp1, TEMP tmp2, TEMP scratch);
>>> 8688   format %{"pmovxbw   $tmp1,$src\n\t"
>>> 8689            "shiftop   $tmp1,$shift\n\t"
>>> 8690            "pshufd    $tmp2,$src\n\t"
>>> 8691            "pmovxbw   $tmp2,$tmp2\n\t"
>>> 8692            "shiftop   $tmp2,$shift\n\t"
>>> 8693            "movdqu    $dst,[0x00ff00ff0x00ff00ff]\n\t"
>>> 8694            "pand      $tmp2,$dst\n\t"
>>> 8695            "pand      $dst,$tmp1\n\t"
>>> 8696            "packuswb  $dst,$tmp2\n\t! packed16B shift" %}
>>> 8697   ins_encode %{
>>> 8698     emit_vshift16B_code(_masm, this->as_Mach()->ideal_Opcode() ,
>>> $dst$$XMMRegister, $src$$XMMRegister, $shift$$XMMRegister, 
>>> $tmp1$$XMMRegister, $tmp2$$XMMRegister, $scratch$$Register);
>>> 8699   %}
>>> 8700   ins_pipe( pipe_slow );
>>> 8701 %}
>>>
>>> can be turned into something like:
>>>
>>> instruct vshift16B(vecX dst, vecX src, vecS shift, vecX tmp1, vecX 
>>> tmp2, rRegI scratch) %{
>>>      predicate(n->as_Vector()->length() == 16);
>>>      match(Set dst (LShiftVB src shift));
>>>      match(Set dst (RShiftVB src shift));
>>>      match(Set dst (URShiftVB src shift));
>>>      effect(TEMP dst, TEMP tmp1, TEMP tmp2, TEMP scratch);
>>>      format %{"packed16B shift" %}
>>>      ins_encode %{
>>>        int vlen = 0; // 128-bit
>>>        BasicType elem_type = T_BYTE;
>>>        int shift_mode = ...; // L/R/UR or S/U + L/R
>>>        __ vshift(vlen, elem_type, shift_mode,
>>>                  $dst$$..., $src$$..., $shift$$...,
>>>           $tmp1$$..., $tmp2$$..., $scratch$$...);
>>>       %}
>>>
>>> Then MA::vshift can dispatch between different implementations 
>>> depending on SSE/AVX level available. Do you see any problems with 
>>> that from footprint perspective?
>>>
>>> Ideally, I'd prefer to see a library of operations on vectors 
>>> encapsulated in MacroAssembler (or a subclass) and used in x86.ad. 
>>> That will accommodate further reductions in AD instructions needed.
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>> With this webrev the ad file has only about 60 lines effectively added.
>>>> Also the generated product libjvm.so size only increases by about 
>>>> 0.26% vs the prior 1.50%.
>>>> I have used multiple match rules in one instruct for same size shift 
>>>> related rules and also for the new Abs/Neg rules.
>>>> What I noticed is that the adlc still duplicates lot of code and 
>>>> there is potential to further improve code size for multiple match 
>>>> rule case by improving the adlc itself.
>>>> The adlc improvement (like removing duplicate emits, formats, 
>>>> expand, pipeline etc) can be done as a separate RFE.
>>>> In this webrev, I have also fixed the errors reported by Vladimir 
>>>> Ivanov and corrected the issues reported by jcheck tool.
>>>> Also taken into account reducing the temporary by using TEMP dst for 
>>>> multiply rules.
>>>>
>>>> The compiler jtreg tests and the java math tests pass on Haswell, 
>>>> SKX, and KNL.
>>>>
>>>> Your review and feedback is welcome.
>>>>
>>>> Best Regards,
>>>> Sandhya
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: hotspot-compiler-dev
>>>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of
>>>> Viswanathan, Sandhya
>>>> Sent: Wednesday, April 10, 2019 10:22 AM
>>>> To: Vladimir Kozlov <vladimir.kozlov at oracle.com>; B. Blaser
>>>> <bsrbnd at gmail.com>
>>>> Cc: hotspot-compiler-dev at openjdk.java.net
>>>> Subject: RE: RFR (M) 8222074: Enhance auto vectorization for x86
>>>>
>>>> Yes good catch, in mul32B_reg_avx(), the last two instructions are 
>>>> the only place where dst is used:
>>>>
>>>>        __ vpackuswb($dst$$XMMRegister, $tmp2$$XMMRegister, 
>>>> $tmp1$$XMMRegister, vector_len);
>>>>        __ vpermq($dst$$XMMRegister, $dst$$XMMRegister, 0xD8,
>>>> vector_len);
>>>>
>>>> Here dst can be same as tmp2 or tmp1 in packuswb() and so the effect 
>>>> TEMP dst is not required.
>>>>
>>>> Best Regards,
>>>> Sandhya
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>> Sent: Wednesday, April 10, 2019 9:59 AM
>>>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; B. Blaser
>>>> <bsrbnd at gmail.com>
>>>> Cc: hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
>>>>
>>>> On 4/10/19 8:36 AM, Viswanathan, Sandhya wrote:
>>>>> Hi Bernard,
>>>>>
>>>>> One could add TEMP dst in effect() to let the register allocator 
>>>>> know that dst needs to be different from src.
>>>>
>>>> Yes, we use this way. Or, in mul4B_reg() case, we can use $dst instead
>>>> $tmp2 to avoid overwriting
>>>> $src2 before we get value from it if $dst = $src2.
>>>>
>>>> On other hand, mul32B_reg_avx() and other have 'TEMP dst' effect but 
>>>> $dst is used only for final result.
>>>>
>>>> It is a little mess which may cause ineffective use of registers in 
>>>> compiled code.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>>>
>>>>> Best Regards,
>>>>> Sandhya
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: B. Blaser [mailto:bsrbnd at gmail.com]
>>>>> Sent: Wednesday, April 10, 2019 4:10 AM
>>>>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>>>> Cc: Vladimir Kozlov <vladimir.kozlov at oracle.com>;
>>>>> hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
>>>>>
>>>>> Hi Sandhya and Vladimir K.,
>>>>>
>>>>> On Wed, 10 Apr 2019 at 03:06, Viswanathan, Sandhya 
>>>>> <sandhya.viswanathan at intel.com> wrote:
>>>>>>
>>>>>> Hi Vladimir,
>>>>>>
>>>>>> Yes, I missed the question below:
>>>>>>>> There are cases where we can use less `TEMP tmp` registers by 
>>>>>>>> using 'dst' register like in mul4B_reg(). Is it intentional to 
>>>>>>>> not use 'dst' there?
>>>>>>
>>>>>> No it is not intentional, we can use the dst register in those 
>>>>>> cases and reduced the tmps.
>>>>>
>>>>> I guess we have to be careful using $dst instead of $tmp registers 
>>>>> as the allocator sometimes provides identical $src & $dst. Also, 
>>>>> I'm not sure this would be possible in the case of mul4B_reg():
>>>>>
>>>>> 7349   format %{"pmovsxbw  $tmp,$src1\n\t"
>>>>> 7350            "pmovsxbw  $tmp2,$src2\n\t"
>>>>>
>>>>> I believe this couldn't work if you use $dst instead of $tmp and 
>>>>> $dst = $src2, what do you think?
>>>>>
>>>>> Thanks,
>>>>> Bernard
>>>>>