[15] RFR (L) 8235825: C2: Merge AD instructions for Replicate nodes

Wed Dec 18 01:06:30 UTC 2019

On 12/16/19 2:29 PM, Vladimir Ivanov wrote:
> Hi Vladimir,
> 
> Updated version:
>    http://cr.openjdk.java.net/~vlivanov/jbhateja/8235825/webrev.02/

Good.

> 
> What's changed:
>    * Added more comments
>    * Fixed missing cases (Repl4B_imm() and Repl8B_imm)
>      * Double-checked that there are no other missing cases left:
> 
> http://cr.openjdk.java.net/~vlivanov/jbhateja/8235825/webrev.02/repl.txt
> 
> I'd like to reiterate that I deliberately don't want to spend too much time polishing the current version because I 
> consider it as an interim point and not as the optimal and desired one (though it's definitely much better than where we 
> started both in number of instructions and clarity).

I was about to comment changes for Long values but I agree that what is dine is enough.

> 
> The shape I want to see in the next couple of iterations is the following:
> 
>    (1) all the operation implementations encapsulated in MacroAssembler
>      * CPU dispatching will happen there, not in AD file;

Yes, we discussed it before - asm instructions selection should be done in MacroAssembler (it is here exactly for that 
purpose).

> 
>    (2) get rid of vec vs legVec separation as much as possible
>      * one of the ways to fix it is to introduce additional operand types (for example, vecBW == {legVec when avx512() 
> && !avx512bw(); vec, otherwise})

It would be nice.

> 
> It would turn current ReplB_reg & ReplB_reg_leg into:
> 
> instruct ReplB_reg(vecBW dst, scalar src) %{
>    match(Set dst (ReplicateB src));
>    format %{ "replicate $dst,$src" %}
>    ins_encode %{
>      uint vlen = vector_length(this);
>      __ replicate_byte($dst$$XMMRegister, $src$$Register, vlen);
>    %}
> %}
> 
> where MacroAssembler::replicate_byte() hides all the dispatching logic against CPU capabilities and vector length.
> 
> Moreover, it opens additional merging opportunities. As an example:
> 
> instruct ReplBS_reg(vecBW dst, rRegI src) %{
>    match(Set dst (ReplicateB src));
>    match(Set dst (ReplicateS src));
>    format %{ "replicate $dst,$src" %}
>    ins_encode %{
>      uint vlen = vector_length(this);
>      switch (ideal_Opcode()) {
>         case Op_ReplicateB: __ replicate_byte($dst$$XMMRegister, $src$$Register, vlen); break;
>         case Op_ReplicateS: __ replicate_short($dst$$XMMRegister, $src$$Register, vlen); break;
>         default: ShouldNotReachHere();
>      }
>    %}
>    ins_pipe( pipe_slow );
> %}
> 
> If we agree that this is the direction we want to move, splitting instructions is counter-productive and just pushes 
> more work for later iterations. Same applies to dispatching logic on CPU & vector length.

Yes, I agree.

Thanks,
Vladimir

> 
> Best regards,
> Vladimir Ivanov
> 
> On 14.12.2019 01:18, Vladimir Kozlov wrote:
>> On 12/13/19 2:27 AM, Vladimir Ivanov wrote:
>>> Thanks for the feedback, Vladimir.
>>>
>>>> replicateB
>>>>
>>>> Can you fold it differently?
>>>>
>>>> ReplB_reg_leg
>>>
>>> Are you talking about ReplB_reg?
>>
>> Yes, I was talking about ReplB_reg. I thought we can combine all [8-64] length vectors but I missed that ReplB_reg_leg 
>> uses legVec and needs separate instructions :(
>>
>> And I wanted to separate instruction which use avx 512 (evpbroadcastb) because it is difficult to see relation between 
>> predicate condition (length() == 64 && VM_Version::supports_avx512bw()) and check in code (vlen == 64 || 
>> VM_Version::supports_avx512vlbw()). First, || vs &&. Second, avx512bw vs avx512vlbw. May be better to have a separate 
>> instruction for this.
>>
>>>
>>>>    predicate(!VM_Version::supports_avx512vlbw());
>>>
>>> 3151 instruct ReplB_reg_leg(legVec dst, rRegI src) %{
>>> 3152   predicate(n->as_Vector()->length() == 64 && !VM_Version::supports_avx512bw());
>>>
>>> For ReplB_reg_leg the predicate can't be simplified: it is applicable only to 512bit vector when AVX512BW is absent. 
>>> Otherwise, legVec constraint will be unnecessarily applied to other configurations.
>>
>> That is why you replaced !avx512vlbw with !avx512bw?
>> May be this section of code need comment which explains why one or an other is used.
>>
>>>
>>> 3119 instruct ReplB_reg(vec dst, rRegI src) %{
>>> 3120   predicate((n->as_Vector()->length() <= 32) ||
>>> 3121             (n->as_Vector()->length() == 64 && VM_Version::supports_avx512bw()));
>>>
>>> For ReplB_reg there's a shorter version:
>>>
>>> predicate(n->as_Vector()->length() <= 32 || VM_Version::supports_avx512bw());
>>>
>>>
>>> But do you find it easier to read? For example, when you are checking that all configurations are covered:
>>>
>>> predicate(n->as_Vector()->length() <= 32 ||
>>>            VM_Version::supports_avx512bw());
>>>
>>> instruct ReplB_reg_leg(legVec dst, rRegI src) %{
>>>    predicate(n->as_Vector()->length() == 64 &&
>>>             !VM_Version::supports_avx512bw());
>>>
>>> vs
>>>
>>> instruct ReplB_reg_leg(legVec dst, rRegI src) %{
>>>    predicate(n->as_Vector()->length() == 64 && !VM_Version::supports_avx512bw());
>>>
>>> instruct ReplB_reg(vec dst, rRegI src) %{
>>>    predicate((n->as_Vector()->length() <= 32) ||
>>>              (n->as_Vector()->length() == 64 && VM_Version::supports_avx512bw()));
>>
>> I think next conditions in predicates require comment about using avx512vlbw and avx512bw.
>>
>> !   predicate((n->as_Vector()->length() <= 32 && VM_Version::supports_avx512vlbw()) ||
>> !             (n->as_Vector()->length() == 64 && VM_Version::supports_avx512bw()));
>>
>>>
>>>
>>>>    ins_encode %{
>>>>      uint vlen = vector_length(this);
>>>>      __ movdl($dst$$XMMRegister, $src$$Register);
>>>>      __ punpcklbw($dst$$XMMRegister, $dst$$XMMRegister);
>>>>      __ pshuflw($dst$$XMMRegister, $dst$$XMMRegister, 0x00);
>>>>      if (vlen > 8) {
>>>>        __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister);
>>>>        if (vlen > 16) {
>>>>          __ vinserti128_high($dst$$XMMRegister, $dst$$XMMRegister);
>>>>          if (vlen > 32) {
>>>>            assert(vlen == 64, "sanity");
>>>>            __ vinserti64x4($dst$$XMMRegister, $dst$$XMMRegister, $dst$$XMMRegister, 0x1);
>>>
>>> Yes, it should work as well. Do you find it easier to read though?
>>
>> Code is smaller.
>>
>>>
>>>> Similar ReplB_imm_leg for which I don't see new implementation.
>>>
>>> Good catch. Added it back.
>>>
>>> (FTR completeness for reg2reg variants (_reg*) is mandatory. But for _mem and _imm it is optional: if they some 
>>> configuration isn't covered, _reg is used. But it was in original code, so I added it back.)
>>
>> I don't see code which was in Repl4B_imm() and Repl8B_imm() (only movdl and movq without vpbroadcastb).
>>
>>>
>>> Updated version:
>>>    http://cr.openjdk.java.net/~vlivanov/jbhateja/8235825/webrev.00
>>
>> Should be webrev.01
>>
>>>
>>> Another thing I noticed is that for ReplI/.../ReplD cases avx512vl checks are not necessary:
>>>
>>> http://cr.openjdk.java.net/~vlivanov/jbhateja/8235825/webrev.01/broadcast_vl/
>>>
>>> The code assumes that evpbroadcastd/evpbroadcastq, vpbroadcastdvpbroadcastq, vpbroadcastss/vpbroadcastsd need 
>>> AVX512VL for 512bit case, but Intel manual says AVX512F is enough.
>>>
>>> I plan to handle it as a separate change, but let me know if you want to incorporate it into 8235825.
>>
>> Yes, lets do it separately.
>>
>>>
>>>> It should also simplify code for avx512 which one or 2 instructions.
>>>
>>> Can you elaborate, please? Are you talking about the case when version for different vector sizes differ in 1-2 
>>> instructions (like ReplB_reg)?
>> No, I was talking about cases when evpbroadcastb and vpbroadcastb instructions are used. I was think to have them in 
>> separate instructions. In your latest version it would be only evpbroadcastb case from ReplB_reg().
>>
>> Thanks,
>> Vladimir
>>
>>>
>>>> Other types changes can be done same way.
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>> On 12/12/19 3:19 AM, Vladimir Ivanov wrote:
>>>>> http://cr.openjdk.java.net/~vlivanov/jbhateja/8235825/webrev.00/all/
>>>>> https://bugs.openjdk.java.net/browse/JDK-8235825
>>>>>
>>>>> Merge AD instructions for the following vector nodes:
>>>>>    - ReplicateB, ..., ReplicateD
>>>>>
>>>>> Individual patches:
>>>>>
>>>>> http://cr.openjdk.java.net/~vlivanov/jbhateja/8235825/webrev.00/individual
>>>>>
>>>>> Testing: tier1-4, test run on different CPU flavors (KNL, CLX)
>>>>>
>>>>> Contributed-by: Jatin Bhateja <jatin.bhateja at intel.com>
>>>>> Reviewed-by: vlivanov, sviswanathan, ?
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov