RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4]

Tue Jul 22 03:22:28 UTC 2025

On Mon, 21 Jul 2025 06:41:43 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:
>> 
>>  - Refactor the implementation
>>    
>>    Do the convertion in C2's IGVN phase to cover more cases.
>>  - Merge branch 'master' into JDK-8356760
>>  - Simplify the test code
>>  - Address some review comments
>>    
>>    Add support for the following patterns:
>>      toLong(maskAll(true))  => (-1ULL >> (64 -vlen))
>>      toLong(maskAll(false)) => 0
>>    
>>    And add more test cases.
>>  - Merge branch 'master' into JDK-8356760
>>  - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases
>>    
>>    If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would
>>    set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent
>>    to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is
>>    relative smaller than that of `fromLong`. This patch does the conversion
>>    for these cases if `l` is a compile time constant.
>>    
>>    And this conversion also enables further optimizations that recognize
>>    maskAll patterns, see [1].
>>    
>>    Some JTReg test cases are added to ensure the optimization is effective.
>>    
>>    I tried many different ways to write a JMH benchmark, but failed. Since
>>    the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific
>>    compile-time constant, the statement will be hoisted out of the loop.
>>    If we don't use a loop, the hotspot will become other instructions, and
>>    no obvious performance change was observed. However, combined with the
>>    optimization of [1], we can observe a performance improvement of about
>>    7% on both aarch64 and x64.
>>    
>>    The patch was tested on both aarch64 and x64, all of tier1 tier2 and
>>    tier3 tests passed.
>>    
>>    [1] https://github.com/openjdk/jdk/pull/24674
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 692:
> 
>> 690:     // generate a MaskAll or Replicate instead.
>> 691: 
>> 692:     // The "maskAll" API uses the corresponding integer types for floating-point data.
> 
> This is because mask all only accepts -1 and 0 values, since -1.0f in float in IEEE 754 format does not set all bits hence an floating point to integral conversion is mandatory here.

Good to know this, thanks!

> src/hotspot/share/opto/vectornode.cpp line 1520:
> 
>> 1518:   uint vlen = vt->length();
>> 1519:   BasicType bt = vt->element_basic_type();
>> 1520:   int opc = is_mask ? Op_MaskAll : Op_Replicate;
> 
> You can remove this check, since VectorNode::scalar2vector alreday has a match rule for Op_MaskAll

Do you mean this check  `Matcher::match_rule_supported_vector(opc, vlen, maskall_bt)` ? I think it's necessary ? Because in theory some platforms don't support both `MaskAll` and `Replicate`.  Of course, this situation may not exist in reality. If `MaskAll` and `Replicate` are not supported, then `VectorLongToMask` should not be supported either, and this function will not be called.

> src/hotspot/share/opto/vectornode.cpp line 1532:
> 
>> 1530:     } else {
>> 1531:       con = phase->intcon(con_value);
>> 1532:     }
> 
> Suggestion:
> 
>     phase->makecon(TypeInteger::make(bits_type->get_con(), maskall_bt)

This should be:
`con = phase->makecon(TypeInteger::make(con_value, maskall_bt == T_LONG ? T_LONG : T_INT));` because `maskall_bt` can be `T_BYTE` or `T_SHORT`.
Since we still need to check `maskall_bt`, I tend to the current approach because it has fewer function calls.

> src/hotspot/share/opto/vectornode.cpp line 1544:
> 
>> 1542: 
>> 1543: Node* VectorLoadMaskNode::Ideal(PhaseGVN* phase, bool can_reshape) {
>> 1544:   // VectorLoadMask(VectorLongToMask(-1/0)) => Replicate(-1/0)
> 
> FTR: This is only useful for non-predicated targets. Since on predicated target VectorLongToMask is not succeeded by VectorLoadMask
> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L703

Yes, thanks for your clarification.

> test/micro/org/openjdk/bench/jdk/incubator/vector/MaskFromLongToLongBenchmark.java line 34:
> 
>> 32: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"})
>> 33: public class MaskFromLongToLongBenchmark {
>> 34:     private static final int ITERATION = 10000;
> 
> It will be nice to add a synthetic micro for cast chain transform added along with this patch. following micro shows around 1.5x gains on AVX2 system because of widening cast elision.
> 
> 
> import jdk.incubator.vector.*;
> import java.util.stream.IntStream;
> 
> public class mask_cast_chain {
>    public static final VectorSpecies<Float> FSP = FloatVector.SPECIES_128;
> 
>    public static long micro(float [] src1, float [] src2, int ctr) {
>        long res = 0;
>        for (int i = 0; i < FSP.loopBound(src1.length); i += FSP.length()) {
>             res += FloatVector.fromArray(FSP, src1, i)
>                          .compare(VectorOperators.GE, FloatVector.fromArray(FSP, src2, i))
>                          .cast(DoubleVector.SPECIES_256)
>                          .cast(FloatVector.SPECIES_128)
>                          .toLong();
>        }
>        return res * ctr;
>    }
> 
>    public static void main(String [] args) {
>        float [] src1 = new float[1024];
>        float [] src2 = new float[1024];
> 
>        IntStream.range(0, src1.length).forEach(i -> {src1[i] = (float)i;});
>        IntStream.range(0, src2.length).forEach(i -> {src2[i] = (float)500;});
> 
>        long res = 0;
>        for (int i = 0; i < 100000; i++) {
>           res += micro(src1, src2, i);
>        }
>        long t1 = System.currentTimeMillis();
>        for (int i = 0; i < 100000; i++) {
>           res += micro(src1, src2, i);
>        }
>        long t2 = System.currentTimeMillis();
>        System.out.println("[time] " + (t2 - t1) + "ms" + " [res] " + res);
>    }
> }

Ok~

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220925155
PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220905045
PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220919188
PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220924181
PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220927997