[15] RFR (L) 8235825: C2: Merge AD instructions for Replicate nodes

Mon Dec 16 22:29:09 UTC 2019

Hi Vladimir,

Updated version:
   http://cr.openjdk.java.net/~vlivanov/jbhateja/8235825/webrev.02/

What's changed:
   * Added more comments
   * Fixed missing cases (Repl4B_imm() and Repl8B_imm)
     * Double-checked that there are no other missing cases left:

http://cr.openjdk.java.net/~vlivanov/jbhateja/8235825/webrev.02/repl.txt

I'd like to reiterate that I deliberately don't want to spend too much 
time polishing the current version because I consider it as an interim 
point and not as the optimal and desired one (though it's definitely 
much better than where we started both in number of instructions and 
clarity).

The shape I want to see in the next couple of iterations is the following:

   (1) all the operation implementations encapsulated in MacroAssembler
     * CPU dispatching will happen there, not in AD file;

   (2) get rid of vec vs legVec separation as much as possible
     * one of the ways to fix it is to introduce additional operand 
types (for example, vecBW == {legVec when avx512() && !avx512bw(); vec, 
otherwise})

It would turn current ReplB_reg & ReplB_reg_leg into:

instruct ReplB_reg(vecBW dst, scalar src) %{
   match(Set dst (ReplicateB src));
   format %{ "replicate $dst,$src" %}
   ins_encode %{
     uint vlen = vector_length(this);
     __ replicate_byte($dst$$XMMRegister, $src$$Register, vlen);
   %}
%}

where MacroAssembler::replicate_byte() hides all the dispatching logic 
against CPU capabilities and vector length.

Moreover, it opens additional merging opportunities. As an example:

instruct ReplBS_reg(vecBW dst, rRegI src) %{
   match(Set dst (ReplicateB src));
   match(Set dst (ReplicateS src));
   format %{ "replicate $dst,$src" %}
   ins_encode %{
     uint vlen = vector_length(this);
     switch (ideal_Opcode()) {
        case Op_ReplicateB: __ replicate_byte($dst$$XMMRegister, 
$src$$Register, vlen); break;
        case Op_ReplicateS: __ replicate_short($dst$$XMMRegister, 
$src$$Register, vlen); break;
        default: ShouldNotReachHere();
     }
   %}
   ins_pipe( pipe_slow );
%}

If we agree that this is the direction we want to move, splitting 
instructions is counter-productive and just pushes more work for later 
iterations. Same applies to dispatching logic on CPU & vector length.

Best regards,
Vladimir Ivanov

On 14.12.2019 01:18, Vladimir Kozlov wrote:
> On 12/13/19 2:27 AM, Vladimir Ivanov wrote:
>> Thanks for the feedback, Vladimir.
>>
>>> replicateB
>>>
>>> Can you fold it differently?
>>>
>>> ReplB_reg_leg
>>
>> Are you talking about ReplB_reg?
> 
> Yes, I was talking about ReplB_reg. I thought we can combine all [8-64] 
> length vectors but I missed that ReplB_reg_leg uses legVec and needs 
> separate instructions :(
> 
> And I wanted to separate instruction which use avx 512 (evpbroadcastb) 
> because it is difficult to see relation between predicate condition 
> (length() == 64 && VM_Version::supports_avx512bw()) and check in code 
> (vlen == 64 || VM_Version::supports_avx512vlbw()). First, || vs &&. 
> Second, avx512bw vs avx512vlbw. May be better to have a separate 
> instruction for this.
> 
>>
>>>    predicate(!VM_Version::supports_avx512vlbw());
>>
>> 3151 instruct ReplB_reg_leg(legVec dst, rRegI src) %{
>> 3152   predicate(n->as_Vector()->length() == 64 && 
>> !VM_Version::supports_avx512bw());
>>
>> For ReplB_reg_leg the predicate can't be simplified: it is applicable 
>> only to 512bit vector when AVX512BW is absent. Otherwise, legVec 
>> constraint will be unnecessarily applied to other configurations.
> 
> That is why you replaced !avx512vlbw with !avx512bw?
> May be this section of code need comment which explains why one or an 
> other is used.
> 
>>
>> 3119 instruct ReplB_reg(vec dst, rRegI src) %{
>> 3120   predicate((n->as_Vector()->length() <= 32) ||
>> 3121             (n->as_Vector()->length() == 64 && 
>> VM_Version::supports_avx512bw()));
>>
>> For ReplB_reg there's a shorter version:
>>
>> predicate(n->as_Vector()->length() <= 32 || 
>> VM_Version::supports_avx512bw());
>>
>>
>> But do you find it easier to read? For example, when you are checking 
>> that all configurations are covered:
>>
>> predicate(n->as_Vector()->length() <= 32 ||
>>            VM_Version::supports_avx512bw());
>>
>> instruct ReplB_reg_leg(legVec dst, rRegI src) %{
>>    predicate(n->as_Vector()->length() == 64 &&
>>             !VM_Version::supports_avx512bw());
>>
>> vs
>>
>> instruct ReplB_reg_leg(legVec dst, rRegI src) %{
>>    predicate(n->as_Vector()->length() == 64 && 
>> !VM_Version::supports_avx512bw());
>>
>> instruct ReplB_reg(vec dst, rRegI src) %{
>>    predicate((n->as_Vector()->length() <= 32) ||
>>              (n->as_Vector()->length() == 64 && 
>> VM_Version::supports_avx512bw()));
> 
> I think next conditions in predicates require comment about using 
> avx512vlbw and avx512bw.
> 
> !   predicate((n->as_Vector()->length() <= 32 && 
> VM_Version::supports_avx512vlbw()) ||
> !             (n->as_Vector()->length() == 64 && 
> VM_Version::supports_avx512bw()));
> 
>>
>>
>>>    ins_encode %{
>>>      uint vlen = vector_length(this);
>>>      __ movdl($dst$$XMMRegister, $src$$Register);
>>>      __ punpcklbw($dst$$XMMRegister, $dst$$XMMRegister);
>>>      __ pshuflw($dst$$XMMRegister, $dst$$XMMRegister, 0x00);
>>>      if (vlen > 8) {
>>>        __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister);
>>>        if (vlen > 16) {
>>>          __ vinserti128_high($dst$$XMMRegister, $dst$$XMMRegister);
>>>          if (vlen > 32) {
>>>            assert(vlen == 64, "sanity");
>>>            __ vinserti64x4($dst$$XMMRegister, $dst$$XMMRegister, 
>>> $dst$$XMMRegister, 0x1);
>>
>> Yes, it should work as well. Do you find it easier to read though?
> 
> Code is smaller.
> 
>>
>>> Similar ReplB_imm_leg for which I don't see new implementation.
>>
>> Good catch. Added it back.
>>
>> (FTR completeness for reg2reg variants (_reg*) is mandatory. But for 
>> _mem and _imm it is optional: if they some configuration isn't 
>> covered, _reg is used. But it was in original code, so I added it back.)
> 
> I don't see code which was in Repl4B_imm() and Repl8B_imm() (only movdl 
> and movq without vpbroadcastb).
> 
>>
>> Updated version:
>>    http://cr.openjdk.java.net/~vlivanov/jbhateja/8235825/webrev.00
> 
> Should be webrev.01
> 
>>
>> Another thing I noticed is that for ReplI/.../ReplD cases avx512vl 
>> checks are not necessary:
>>
>> http://cr.openjdk.java.net/~vlivanov/jbhateja/8235825/webrev.01/broadcast_vl/ 
>>
>>
>> The code assumes that evpbroadcastd/evpbroadcastq, 
>> vpbroadcastdvpbroadcastq, vpbroadcastss/vpbroadcastsd need AVX512VL 
>> for 512bit case, but Intel manual says AVX512F is enough.
>>
>> I plan to handle it as a separate change, but let me know if you want 
>> to incorporate it into 8235825.
> 
> Yes, lets do it separately.
> 
>>
>>> It should also simplify code for avx512 which one or 2 instructions.
>>
>> Can you elaborate, please? Are you talking about the case when version 
>> for different vector sizes differ in 1-2 instructions (like ReplB_reg)?
> No, I was talking about cases when evpbroadcastb and vpbroadcastb 
> instructions are used. I was think to have them in separate 
> instructions. In your latest version it would be only evpbroadcastb case 
> from ReplB_reg().
> 
> Thanks,
> Vladimir
> 
>>
>>> Other types changes can be done same way.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>> On 12/12/19 3:19 AM, Vladimir Ivanov wrote:
>>>> http://cr.openjdk.java.net/~vlivanov/jbhateja/8235825/webrev.00/all/
>>>> https://bugs.openjdk.java.net/browse/JDK-8235825
>>>>
>>>> Merge AD instructions for the following vector nodes:
>>>>    - ReplicateB, ..., ReplicateD
>>>>
>>>> Individual patches:
>>>>
>>>> http://cr.openjdk.java.net/~vlivanov/jbhateja/8235825/webrev.00/individual 
>>>>
>>>>
>>>> Testing: tier1-4, test run on different CPU flavors (KNL, CLX)
>>>>
>>>> Contributed-by: Jatin Bhateja <jatin.bhateja at intel.com>
>>>> Reviewed-by: vlivanov, sviswanathan, ?
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov