RFR (M) 8222074: Enhance auto vectorization for x86

Wed May 1 21:58:01 UTC 2019

> http://cr.openjdk.java.net/~sviswanathan/8222074/webrev.01/

Nice job, Sandhya! Glad to hear the approach pays off!

Unfortunately, I must note that AD file becomes much more obscure. 
Especially with those function pointers.

1528 void emit_vshift16B_code(MacroAssembler& _masm, int opcode, 
XMMRegister dst,
1529                         XMMRegister src, XMMRegister shift,
1530                         XMMRegister tmp1, XMMRegister tmp2, 
Register scratch) {
1531   XX_Inst extendinst = get_extend_inst(opcode == Op_URShiftVB ? 
false : true);
1532   XX_Inst shiftinst = get_xx_inst(opcode);
1533
1534   (_masm.*extendinst)(tmp1, src);
1535   (_masm.*shiftinst)(tmp1, shift);
1536   __ pshufd(tmp2, src, 0xE);
1537   (_masm.*extendinst)(tmp2, tmp2);
1538   (_masm.*shiftinst)(tmp2, shift);
1539   __ movdqu(dst, ExternalAddress(vector_short_to_byte_mask()), 
scratch);
1540   __ pand(tmp2, dst);
1541   __ pand(dst, tmp1);
1542   __ packuswb(dst, tmp2);
1543 }

Have you tried to encapsulate that into x86-specific MacroAssembler?

8682 instruct vshift16B(vecX dst, vecX src, vecS shift, vecX tmp1, vecX 
tmp2, rRegI scratch) %{
8683   predicate(UseSSE > 3  && UseAVX <= 1 && n->as_Vector()->length() 
== 16);
8684   match(Set dst (LShiftVB src shift));
8685   match(Set dst (RShiftVB src shift));
8686   match(Set dst (URShiftVB src shift));
8687   effect(TEMP dst, TEMP tmp1, TEMP tmp2, TEMP scratch);
8688   format %{"pmovxbw   $tmp1,$src\n\t"
8689            "shiftop   $tmp1,$shift\n\t"
8690            "pshufd    $tmp2,$src\n\t"
8691            "pmovxbw   $tmp2,$tmp2\n\t"
8692            "shiftop   $tmp2,$shift\n\t"
8693            "movdqu    $dst,[0x00ff00ff0x00ff00ff]\n\t"
8694            "pand      $tmp2,$dst\n\t"
8695            "pand      $dst,$tmp1\n\t"
8696            "packuswb  $dst,$tmp2\n\t! packed16B shift" %}
8697   ins_encode %{
8698     emit_vshift16B_code(_masm, this->as_Mach()->ideal_Opcode() , 
$dst$$XMMRegister, $src$$XMMRegister, $shift$$XMMRegister, 
$tmp1$$XMMRegister, $tmp2$$XMMRegister, $scratch$$Register);
8699   %}
8700   ins_pipe( pipe_slow );
8701 %}

can be turned into something like:

instruct vshift16B(vecX dst, vecX src, vecS shift, vecX tmp1, vecX tmp2, 
rRegI scratch) %{
   predicate(n->as_Vector()->length() == 16);
   match(Set dst (LShiftVB src shift));
   match(Set dst (RShiftVB src shift));
   match(Set dst (URShiftVB src shift));
   effect(TEMP dst, TEMP tmp1, TEMP tmp2, TEMP scratch);
   format %{"packed16B shift" %}
   ins_encode %{
     int vlen = 0; // 128-bit
     BasicType elem_type = T_BYTE;
     int shift_mode = ...; // L/R/UR or S/U + L/R
     __ vshift(vlen, elem_type, shift_mode,
               $dst$$..., $src$$..., $shift$$...,
	      $tmp1$$..., $tmp2$$..., $scratch$$...);
    %}

Then MA::vshift can dispatch between different implementations depending 
on SSE/AVX level available. Do you see any problems with that from 
footprint perspective?

Ideally, I'd prefer to see a library of operations on vectors 
encapsulated in MacroAssembler (or a subclass) and used in x86.ad. That 
will accommodate further reductions in AD instructions needed.

Best regards,
Vladimir Ivanov

> With this webrev the ad file has only about 60 lines effectively added.
> Also the generated product libjvm.so size only increases by about 0.26% vs the prior 1.50%.
> I have used multiple match rules in one instruct for same size shift related rules and also for the new Abs/Neg rules.
> What I noticed is that the adlc still duplicates lot of code and there is potential to further improve code size for multiple match rule case by improving the adlc itself.
> The adlc improvement (like removing duplicate emits, formats, expand, pipeline etc) can be done as a separate RFE.
>   
> In this webrev, I have also fixed the errors reported by Vladimir Ivanov and corrected the issues reported by jcheck tool.
> Also taken into account reducing the temporary by using TEMP dst for multiply rules.
> 
> The compiler jtreg tests and the java math tests pass on Haswell, SKX, and KNL.
> 
> Your review and feedback is welcome.
> 
> Best Regards,
> Sandhya
> 
> 
> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Viswanathan, Sandhya
> Sent: Wednesday, April 10, 2019 10:22 AM
> To: Vladimir Kozlov <vladimir.kozlov at oracle.com>; B. Blaser <bsrbnd at gmail.com>
> Cc: hotspot-compiler-dev at openjdk.java.net
> Subject: RE: RFR (M) 8222074: Enhance auto vectorization for x86
> 
> Yes good catch, in mul32B_reg_avx(), the last two instructions are the only place where dst is used:
> 
>      __ vpackuswb($dst$$XMMRegister, $tmp2$$XMMRegister, $tmp1$$XMMRegister, vector_len);
>      __ vpermq($dst$$XMMRegister, $dst$$XMMRegister, 0xD8, vector_len);
> 
> Here dst can be same as tmp2 or tmp1 in packuswb() and so the effect TEMP dst is not required.
> 
> Best Regards,
> Sandhya
> 
> 
> -----Original Message-----
> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
> Sent: Wednesday, April 10, 2019 9:59 AM
> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; B. Blaser <bsrbnd at gmail.com>
> Cc: hotspot-compiler-dev at openjdk.java.net
> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
> 
> On 4/10/19 8:36 AM, Viswanathan, Sandhya wrote:
>> Hi Bernard,
>>
>> One could add TEMP dst in effect() to let the register allocator know that dst needs to be different from src.
> 
> Yes, we use this way. Or, in mul4B_reg() case, we can use $dst instead $tmp2 to avoid overwriting
> $src2 before we get value from it if $dst = $src2.
> 
> On other hand, mul32B_reg_avx() and other have 'TEMP dst' effect but $dst is used only for final result.
> 
> It is a little mess which may cause ineffective use of registers in compiled code.
> 
> Thanks,
> Vladimir
> 
>>
>> Best Regards,
>> Sandhya
>>
>>
>> -----Original Message-----
>> From: B. Blaser [mailto:bsrbnd at gmail.com]
>> Sent: Wednesday, April 10, 2019 4:10 AM
>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>> Cc: Vladimir Kozlov <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: RFR (M) 8222074: Enhance auto vectorization for x86
>>
>> Hi Sandhya and Vladimir K.,
>>
>> On Wed, 10 Apr 2019 at 03:06, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
>>>
>>> Hi Vladimir,
>>>
>>> Yes, I missed the question below:
>>>>> There are cases where we can use less `TEMP tmp` registers by using 'dst' register like in mul4B_reg(). Is it intentional to not use 'dst' there?
>>>
>>> No it is not intentional, we can use the dst register in those cases and reduced the tmps.
>>
>> I guess we have to be careful using $dst instead of $tmp registers as the allocator sometimes provides identical $src & $dst. Also, I'm not sure this would be possible in the case of mul4B_reg():
>>
>> 7349   format %{"pmovsxbw  $tmp,$src1\n\t"
>> 7350            "pmovsxbw  $tmp2,$src2\n\t"
>>
>> I believe this couldn't work if you use $dst instead of $tmp and $dst = $src2, what do you think?
>>
>> Thanks,
>> Bernard
>>