RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4]

Mon Jul 21 12:10:47 UTC 2025

On Thu, 17 Jul 2025 09:09:14 GMT, erifan <duke at openjdk.org> wrote:

>> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent to `maskAll(true)` or `maskAll(false)`. Because `maskAll` is relatively cheaper than `fromLong`, this patch performs the conversion in these cases.
>> 
>> The conversion is done in C2's IGVN phase. On platforms (like Arm NEON) that don't support `VectorLongToMask`, the conversion is done during intrinsification if `MaskAll` or `Replicate` is supported.
>> 
>> Because this optimization requires the input long value of `VectorMask.fromLong` to be a specific compile-time constant, and such expressions are usually hoisted out of loops, the conversion alone shows no noticeable performance change.
>> 
>> This conversion also enables further optimizations that recognize maskAll patterns, see [1]. Combined with [1], we observe a performance improvement of about 7% on both aarch64 and x64.
>> 
>> As `VectorLongToMask` is converted to `MaskAll` or `Replicate`, some existing optimizations recognizing the `VectorLongToMask` will be affected, like
>> 
>>   VectorMaskToLong (VectorLongToMask x) => x
>> 
>> 
>> Hence, this patch also added the following optimizations:
>> 
>>   VectorMaskToLong (MaskAll x) => (x & (-1ULL >> (64 - vlen)))    // x is -1 or 0
>>   VectorMaskToLong (VectorStoreMask (Replicate x)) => (x & (-1ULL >> (64 - vlen)))  // x is -1 or 0
>> 
>>   VectorMaskCast (VectorMaskCast x) => x
>> 
>> And we can see noticeable performance improvement with the above optimizations for floating-point types.
>> 
>> Benchmarks on Nvidia Grace machine with option `-XX:UseSVE=2`:
>> 
>> Benchmark                          Unit   Before       Error        After        Error        Uplift
>> microMaskFromLongToLong_Double128  ops/s  1522384.986  1324881.46   2835774480   403575069.7  1862.71
>> microMaskFromLongToLong_Double256  ops/s  4275.415598  28.560622    4285.587451  27.633101    1
>> microMaskFromLongToLong_Double512  ops/s  3702.171936  9.528497     3692.747579  18.47744     0.99
>> microMaskFromLongToLong_Double64   ops/s  4624.452243  37.388427    4616.320519  23.455954    0.99
>> microMaskFromLongToLong_Float128   ops/s  1239661.887  1286803.852  2842927993   360468218.3  2293.3
>> microMaskFromLongToLong_Float256   ops/s  3681.64954   15.153633    3685.411771  21.737124    1
>> microMaskFromLongToLong_Float512   ops/s  3007.563025  10.189944    3022.002986  14.137287    1
>> microMaskFromLongToLong_Float64    ops/s  1646664.258  1375451.279  2948453900   397472562.4  1790.56
>> 
>> 
>> Benchmarks on AMD EPYC 9124 16-Core Processor with option `-XX:UseAVX=3`:
>> 
>> Benchm...
>
> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:
> 
>  - Refactor the implementation
>    
>    Do the conversion in C2's IGVN phase to cover more cases.
>  - Merge branch 'master' into JDK-8356760
>  - Simplify the test code
>  - Address some review comments
>    
>    Add support for the following patterns:
>      toLong(maskAll(true))  => (-1ULL >> (64 - vlen))
>      toLong(maskAll(false)) => 0
>    
>    And add more test cases.
>  - Merge branch 'master' into JDK-8356760
>  - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases
>    
>    If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would
>    set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent
>    to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is
>    relatively smaller than that of `fromLong`. This patch does the conversion
>    for these cases if `l` is a compile time constant.
>    
>    And this conversion also enables further optimizations that recognize
>    maskAll patterns, see [1].
>    
>    Some JTReg test cases are added to ensure the optimization is effective.
>    
>    I tried many different ways to write a JMH benchmark, but failed. Since
>    the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific
>    compile-time constant, the statement will be hoisted out of the loop.
>    If we don't use a loop, the hotspot will become other instructions, and
>    no obvious performance change was observed. However, combined with the
>    optimization of [1], we can observe a performance improvement of about
>    7% on both aarch64 and x64.
>    
>    The patch was tested on both aarch64 and x64; all tier1, tier2, and
>    tier3 tests passed.
>    
>    [1] https://github.com/openjdk/jdk/pull/24674
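
For reference, the all-true/all-false folding rules quoted above can be checked with plain long arithmetic. The sketch below is only an illustration: the class and method names are hypothetical and not taken from the patch, and it assumes that only the low `vlen` bits of the input are significant for `fromLong`. In Java, `-1ULL >> (64 - vlen)` is spelled `-1L >>> (64 - vlen)`.

```java
// Hypothetical helper for illustration only; none of these names come from the patch.
class MaskAllFolding {
    // Low vlen bits set: the Java spelling of (-1ULL >> (64 - vlen)), for 1 <= vlen <= 64.
    static long laneBits(int vlen) {
        if (vlen < 1 || vlen > 64) {
            throw new IllegalArgumentException("vlen must be in [1, 64]");
        }
        return -1L >>> (64 - vlen);
    }

    // Models toLong(maskAll(x)), where x is -1 (all lanes set) or 0 (no lane set).
    static long toLongOfMaskAll(long x, int vlen) {
        return x & laneBits(vlen);
    }

    // Assumption: bits above the low vlen bits of l do not affect the mask.
    static boolean isAllTrue(long l, int vlen) {
        return (l & laneBits(vlen)) == laneBits(vlen);
    }

    static boolean isAllFalse(long l, int vlen) {
        return (l & laneBits(vlen)) == 0;
    }

    public static void main(String[] args) {
        // A 4-lane mask (e.g. the lane count of FloatVector.SPECIES_128).
        System.out.println(Long.toHexString(toLongOfMaskAll(-1L, 4)));
        System.out.println(toLongOfMaskAll(0L, 4));
        System.out.println(isAllTrue(0xFL, 4) + " " + isAllFalse(0x10L, 4));
    }
}
```

A constant that passes `isAllTrue` or `isAllFalse` for the species' lane count is the kind of input the patch description says gets converted to `maskAll(true)` or `maskAll(false)`.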

The rest of the patch looks good to me, apart from the minor changes proposed.

test/micro/org/openjdk/bench/jdk/incubator/vector/MaskFromLongToLongBenchmark.java line 34:

> 32: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"})
> 33: public class MaskFromLongToLongBenchmark {
> 34:     private static final int ITERATION = 10000;

It would be nice to add a synthetic micro for the cast chain transform added along with this patch. The following micro shows around 1.5x gains on an AVX2 system.

import jdk.incubator.vector.*;
import java.util.stream.IntStream;

// Requires --add-modules=jdk.incubator.vector.
public class mask_cast_chain {
    public static final VectorSpecies<Float> FSP = FloatVector.SPECIES_128;

    // Compares src1 >= src2 lane-wise, round-trips each mask through a cast
    // chain (float 128-bit -> double 256-bit -> float 128-bit), and sums the
    // resulting mask bits.
    public static long micro(float[] src1, float[] src2, int ctr) {
        long res = 0;
        for (int i = 0; i < FSP.loopBound(src1.length); i += FSP.length()) {
            res += FloatVector.fromArray(FSP, src1, i)
                              .compare(VectorOperators.GE, FloatVector.fromArray(FSP, src2, i))
                              .cast(DoubleVector.SPECIES_256)
                              .cast(FloatVector.SPECIES_128)
                              .toLong();
        }
        return res * ctr;
    }

    public static void main(String[] args) {
        float[] src1 = new float[1024];
        float[] src2 = new float[1024];

        IntStream.range(0, src1.length).forEach(i -> { src1[i] = (float) i; });
        IntStream.range(0, src2.length).forEach(i -> { src2[i] = (float) 500; });

        // Warm-up so that micro() is JIT-compiled before timing.
        long res = 0;
        for (int i = 0; i < 100000; i++) {
            res += micro(src1, src2, i);
        }
        // Timed run.
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            res += micro(src1, src2, i);
        }
        long t2 = System.currentTimeMillis();
        System.out.println("[time] " + (t2 - t1) + "ms" + " [res] " + res);
    }
}
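
If it helps when turning this into a proper micro, here is a minimal scalar sketch of what `micro` should return. It assumes the cast chain leaves the mask unchanged, which holds here because both species have four lanes. The class name and the hard-coded lane count are hypothetical and only for illustration; this is not part of the patch.

```java
// Hypothetical scalar reference for the vector micro; illustration only.
class ScalarMaskCastChain {
    // Lane count of FloatVector.SPECIES_128 (128 bits / 32-bit floats).
    static final int LANES = 4;

    static long micro(float[] src1, float[] src2, int ctr) {
        long res = 0;
        // Largest multiple of LANES not exceeding the array length.
        int bound = src1.length - (src1.length % LANES);
        for (int i = 0; i < bound; i += LANES) {
            long bits = 0;
            for (int lane = 0; lane < LANES; lane++) {
                // Same predicate as compare(VectorOperators.GE, ...).
                if (src1[i + lane] >= src2[i + lane]) {
                    bits |= 1L << lane; // lane k maps to bit k, as in toLong()
                }
            }
            res += bits;
        }
        return res * ctr;
    }

    public static void main(String[] args) {
        float[] src1 = new float[1024];
        float[] src2 = new float[1024];
        for (int i = 0; i < src1.length; i++) {
            src1[i] = (float) i;
            src2[i] = 500f;
        }
        // 131 chunks starting at index 500 have all four lanes set: 131 * 15.
        System.out.println(micro(src1, src2, 1)); // prints 1965
    }
}
```

Comparing this value against the vector version for the same `ctr` gives a quick correctness check before and after the transform.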

-------------

PR Review: https://git.openjdk.org/jdk/pull/25793#pullrequestreview-3037791349
PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2218999865