RFR: 8264104: Eliminate unnecessary vector mask conversion during VectorUnbox for floating point VectorMask [v3]

Mon Apr 12 16:52:47 UTC 2021

On Mon, 12 Apr 2021 07:00:58 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> src/hotspot/share/opto/vectornode.cpp line 1237:
>> 
>>> 1235:           if (in_vt->length_in_bytes() == out_vt->length_in_bytes() &&
>>> 1236:               Matcher::match_rule_supported(Op_VectorMaskCast)) {
>>> 1237:             // VectorUnbox (VectorBox vmask) ==> VectorMaskCast (vmask)
>> 
>> It's better to implement it as a 2-step transformation and place it in `VectorLoadMaskNode::Ideal()`:
>> 
>> VectorUnbox (VectorBox vmask) ==> VectorLoadMask (VectorStoreMask vmask) => VectorMaskCast (vmask)
>
> Thanks for your comments. Yes, theoretically it's better to place it in `VectorLoadMaskNode::Ideal()`. Unfortunately, we met an issue that is related to optimization for `VectorStoreMask`. Considering the following case:
> 
>    LoadVector                                              LoadVector
>            |                                                    |
>    VectorLoadMask  (double)                          VectorLoadMask (double)
>                |                                               |
>            VectorUnbox   (long)          ==>           VectorStoreMask  (double)
>                                                                |
>                                                       VectorLoadMask  (long)
> 
> This case loads the masking values for a double type, and does a bitwise `and` operation. Since the type is mismatched, the compiler will try to do `VectorUnbox (VectorBox vmask) ==> VectorLoadMask (VectorStoreMask vmask)`. However, since there is the transformation `VectorStoreMask (VectorLoadMask value)  ==> value`, the above `VectorStoreMaskNode` will be optimized out. The final graph looks like:
> 
> 
>    LoadVector                                                  LoadVector
>           |                                                      /      \
>         VectorLoadMask  (double)                               /         \
>               |                      ==>     VectorLoadMask (double)      \
>          VectorStoreMask (double)                                   VectorLoadMask (long)
>                    |
>             VectorLoadMask (long)
> 
> Since the two `VectorLoadMaskNode` have different element type, the GVN cannot optimize out one. So finally there will be two similar  `VectorLoadMaskNode`s. That's also why I override the `cmp/hash` for `VectorLoadMaskNode` in the first version.
> 
> So I prefer to add `VectorUnbox (VectorBox vmask) ==> VectorMaskCast (vmask)` directly.

Ok, so you face a transformation ordering problem here. 

By working on `VectorUnbox (VectorBox vmask)` you effectively delay `VectorStoreMask (VectorLoadMask vmask) => vmask` transformation.

As an alternative you could:

(1) check for `VectorLoadMask` users before applying `VectorStoreMask (VectorLoadMask vmask) => vmask`;

(2) nest adjacent casts: 

VectorLoadMask #double (1 LoadVector)
VectorLoadMask #long   (1 LoadVector)
==>
VectorMaskCast #long (VectorLoadMask #double (1 LoadVector)

The latter looks more powerful (and hence preferrable), but I'm fine with what you have right now. (It can be enhanced later.)
Please, leave a comment describing the motivation for doing the transformation directly on `VectorUnbox (VectorBox ...)`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3238