RFR (M) 8222074: Enhance auto vectorization for x86

Viswanathan, Sandhya sandhya.viswanathan at intel.com
Wed May 1 22:09:49 UTC 2019


Hi Vladimir,

I agree. I wanted to show both approaches in this patch to get your feedback:
1) with emit as a function
2) with emit part in the instruct body itself

With emit as a function it becomes hard to read, and I personally prefer having the emit in the instruct itself, as is done for vabsneg2D etc. That is what you are recommending as well, so I feel good about it.

Once the adlc enhancement is done, both approaches should give a similar binary size. Until then there will be a small overhead with approach 2), as the emit code is duplicated per match rule.

I will send an updated patch fixing the two issues you mentioned in your previous email, plus this change to use approach 2).

Please do let me know if you want to see any other changes in this patch.

Best Regards,
Sandhya



-----Original Message-----
From: Vladimir Ivanov [mailto:vladimir.x.ivanov at oracle.com] 
Sent: Wednesday, May 01, 2019 2:58 PM
To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Vladimir Kozlov <vladimir.kozlov at oracle.com>
Cc: hotspot-compiler-dev at openjdk.java.net
Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86


> http://cr.openjdk.java.net/~sviswanathan/8222074/webrev.01/

Nice job, Sandhya! Glad to hear the approach pays off!

Unfortunately, I must note that the AD file becomes much more obscure, especially with those function pointers.

1528 void emit_vshift16B_code(MacroAssembler& _masm, int opcode, XMMRegister dst,
1529                         XMMRegister src, XMMRegister shift,
1530                         XMMRegister tmp1, XMMRegister tmp2, Register scratch) {
1531   XX_Inst extendinst = get_extend_inst(opcode == Op_URShiftVB ? false : true);
1532   XX_Inst shiftinst = get_xx_inst(opcode);
1533
1534   (_masm.*extendinst)(tmp1, src);
1535   (_masm.*shiftinst)(tmp1, shift);
1536   __ pshufd(tmp2, src, 0xE);
1537   (_masm.*extendinst)(tmp2, tmp2);
1538   (_masm.*shiftinst)(tmp2, shift);
1539   __ movdqu(dst, ExternalAddress(vector_short_to_byte_mask()), scratch);
1540   __ pand(tmp2, dst);
1541   __ pand(dst, tmp1);
1542   __ packuswb(dst, tmp2);
1543 }

Have you tried to encapsulate that into x86-specific MacroAssembler?
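
For reference, the construct above is plain C++ pointer-to-member-function dispatch: an emitter is selected once and then invoked through the pointer. A minimal standalone illustration of the same shape (not HotSpot code; the Asm class and its methods here are made up, only XX_Inst and the call syntax mirror the webrev):

#include <cstdio>

// Stand-in for MacroAssembler, just to show the dispatch mechanics.
class Asm {
 public:
  void pmovsxbw(int dst, int src) { std::printf("pmovsxbw %d,%d\n", dst, src); }
  void pmovzxbw(int dst, int src) { std::printf("pmovzxbw %d,%d\n", dst, src); }
};

// Pointer-to-member-function type, analogous to XX_Inst above.
typedef void (Asm::*XX_Inst)(int dst, int src);

int main() {
  Asm masm;
  bool sign_extend = true;  // e.g. opcode != Op_URShiftVB
  XX_Inst extendinst = sign_extend ? &Asm::pmovsxbw : &Asm::pmovzxbw;
  (masm.*extendinst)(/*dst=*/1, /*src=*/2);  // same shape as (_masm.*extendinst)(tmp1, src)
  return 0;
}

The indirection works, but every call site needs the (obj.*ptr)(...) syntax, which is part of what makes the AD file harder to read.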

8682 instruct vshift16B(vecX dst, vecX src, vecS shift, vecX tmp1, vecX tmp2, rRegI scratch) %{
8683   predicate(UseSSE > 3 && UseAVX <= 1 && n->as_Vector()->length() == 16);
8684   match(Set dst (LShiftVB src shift));
8685   match(Set dst (RShiftVB src shift));
8686   match(Set dst (URShiftVB src shift));
8687   effect(TEMP dst, TEMP tmp1, TEMP tmp2, TEMP scratch);
8688   format %{"pmovxbw   $tmp1,$src\n\t"
8689            "shiftop   $tmp1,$shift\n\t"
8690            "pshufd    $tmp2,$src\n\t"
8691            "pmovxbw   $tmp2,$tmp2\n\t"
8692            "shiftop   $tmp2,$shift\n\t"
8693            "movdqu    $dst,[0x00ff00ff0x00ff00ff]\n\t"
8694            "pand      $tmp2,$dst\n\t"
8695            "pand      $dst,$tmp1\n\t"
8696            "packuswb  $dst,$tmp2\n\t! packed16B shift" %}
8697   ins_encode %{
8698     emit_vshift16B_code(_masm, this->as_Mach()->ideal_Opcode(), $dst$$XMMRegister, $src$$XMMRegister, $shift$$XMMRegister, $tmp1$$XMMRegister, $tmp2$$XMMRegister, $scratch$$Register);
8699   %}
8700   ins_pipe( pipe_slow );
8701 %}

can be turned into something like:

instruct vshift16B(vecX dst, vecX src, vecS shift, vecX tmp1, vecX tmp2, rRegI scratch) %{
  predicate(n->as_Vector()->length() == 16);
  match(Set dst (LShiftVB src shift));
  match(Set dst (RShiftVB src shift));
  match(Set dst (URShiftVB src shift));
  effect(TEMP dst, TEMP tmp1, TEMP tmp2, TEMP scratch);
  format %{"packed16B shift" %}
  ins_encode %{
    int vlen = 0; // 128-bit
    BasicType elem_type = T_BYTE;
    int shift_mode = ...; // L/R/UR or S/U + L/R
    __ vshift(vlen, elem_type, shift_mode,
              $dst$$..., $src$$..., $shift$$...,
              $tmp1$$..., $tmp2$$..., $scratch$$...);
  %}
  ins_pipe( pipe_slow );
%}

Then MA::vshift can dispatch between different implementations depending on the SSE/AVX level available. Do you see any problems with that from a footprint perspective?
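
To make the suggestion concrete, here is a rough sketch of how the 16-byte case could fold into MacroAssembler, reusing the instruction sequence from emit_vshift16B_code above. This assumes HotSpot types; the names vshift16B and pshiftw and their parameter lists are placeholders of mine, not anything in the webrev:

// Sketch only: pick the 16-bit shift matching the ideal opcode.
void MacroAssembler::pshiftw(int opcode, XMMRegister dst, XMMRegister shift) {
  switch (opcode) {
    case Op_LShiftVB:  psllw(dst, shift); break;
    case Op_RShiftVB:  psraw(dst, shift); break;
    case Op_URShiftVB: psrlw(dst, shift); break;
    default:           ShouldNotReachHere();
  }
}

// Sketch only: 16-byte (SSE) variant of the byte-shift sequence.
void MacroAssembler::vshift16B(int opcode, XMMRegister dst, XMMRegister src,
                               XMMRegister shift, XMMRegister tmp1,
                               XMMRegister tmp2, Register scratch) {
  bool sign = (opcode != Op_URShiftVB);
  // Widen the low 8 bytes to words and shift them.
  if (sign) { pmovsxbw(tmp1, src); } else { pmovzxbw(tmp1, src); }
  pshiftw(opcode, tmp1, shift);
  // Widen the high 8 bytes to words and shift them.
  pshufd(tmp2, src, 0xE);
  if (sign) { pmovsxbw(tmp2, tmp2); } else { pmovzxbw(tmp2, tmp2); }
  pshiftw(opcode, tmp2, shift);
  // Mask each word back down to a byte and repack into dst.
  movdqu(dst, ExternalAddress(vector_short_to_byte_mask()), scratch);
  pand(tmp2, dst);
  pand(dst, tmp1);
  packuswb(dst, tmp2);
}

A vlen/elem_type-based vshift entry point could then switch on the vector length and the UseAVX level and fall back to this sequence for the SSE case, so the instruct bodies stay one-liners.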

Ideally, I'd prefer to see a library of operations on vectors encapsulated in MacroAssembler (or a subclass) and used in x86.ad. That would accommodate further reductions in the number of AD instructions needed.

Best regards,
Vladimir Ivanov

> With this webrev, the ad file has only about 60 lines effectively added.
> Also, the generated product libjvm.so size increases by only about 0.26%, vs. the prior 1.50%.
> I have used multiple match rules in one instruct for the same-size shift-related rules and also for the new Abs/Neg rules.
> What I noticed is that the adlc still duplicates a lot of code, and there is potential to further improve code size for the multiple-match-rule case by improving the adlc itself.
> The adlc improvement (like removing duplicate emits, formats, expand, pipeline, etc.) can be done as a separate RFE.
>   
> In this webrev, I have also fixed the errors reported by Vladimir Ivanov and corrected the issues reported by the jcheck tool.
> I have also taken into account reducing the temporaries by using TEMP dst for the multiply rules.
> 
> The compiler jtreg tests and the java math tests pass on Haswell, SKX, and KNL.
> 
> Your review and feedback are welcome.
> 
> Best Regards,
> Sandhya
> 
> 
> -----Original Message-----
> From: hotspot-compiler-dev 
> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of 
> Viswanathan, Sandhya
> Sent: Wednesday, April 10, 2019 10:22 AM
> To: Vladimir Kozlov <vladimir.kozlov at oracle.com>; B. Blaser 
> <bsrbnd at gmail.com>
> Cc: hotspot-compiler-dev at openjdk.java.net
> Subject: RE: RFR (M) 8222074: Enhance auto vectorization for x86
> 
> Yes, good catch. In mul32B_reg_avx(), the last two instructions are the only place where dst is used:
> 
>      __ vpackuswb($dst$$XMMRegister, $tmp2$$XMMRegister, $tmp1$$XMMRegister, vector_len);
>      __ vpermq($dst$$XMMRegister, $dst$$XMMRegister, 0xD8, vector_len);
> 
> Here dst can be the same as tmp2 or tmp1 in packuswb(), and so the TEMP dst effect is not required.
> 
> Best Regards,
> Sandhya
> 
> 
> -----Original Message-----
> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
> Sent: Wednesday, April 10, 2019 9:59 AM
> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; B. Blaser 
> <bsrbnd at gmail.com>
> Cc: hotspot-compiler-dev at openjdk.java.net
> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
> 
> On 4/10/19 8:36 AM, Viswanathan, Sandhya wrote:
>> Hi Bernard,
>>
>> One could add TEMP dst in effect() to let the register allocator know that dst needs to be different from src.
> 
> Yes, we use this approach. Or, in the mul4B_reg() case, we can use $dst instead of $tmp2 to avoid overwriting $src2 before we get the value from it if $dst = $src2.
> 
> On the other hand, mul32B_reg_avx() and others have the 'TEMP dst' effect, but $dst is used only for the final result.
> 
> It is a bit of a mess, which may cause inefficient use of registers in compiled code.
> 
> Thanks,
> Vladimir
> 
>>
>> Best Regards,
>> Sandhya
>>
>>
>> -----Original Message-----
>> From: B. Blaser [mailto:bsrbnd at gmail.com]
>> Sent: Wednesday, April 10, 2019 4:10 AM
>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>> Cc: Vladimir Kozlov <vladimir.kozlov at oracle.com>; 
>> hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
>>
>> Hi Sandhya and Vladimir K.,
>>
>> On Wed, 10 Apr 2019 at 03:06, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
>>>
>>> Hi Vladimir,
>>>
>>> Yes, I missed the question below:
>>>>> There are cases where we can use fewer `TEMP tmp` registers by using the 'dst' register, like in mul4B_reg(). Is it intentional to not use 'dst' there?
>>>
>>> No, it is not intentional; we can use the dst register in those cases and reduce the tmps.
>>
>> I guess we have to be careful when using $dst instead of $tmp registers, as the allocator sometimes provides identical $src & $dst. Also, I'm not sure this would be possible in the case of mul4B_reg():
>>
>> 7349   format %{"pmovsxbw  $tmp,$src1\n\t"
>> 7350            "pmovsxbw  $tmp2,$src2\n\t"
>>
>> I believe this couldn't work if you used $dst instead of $tmp and $dst = $src2; what do you think?
>>
>> Thanks,
>> Bernard
>>

