RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4]

Thu Jul 24 03:41:56 UTC 2025

On Tue, 22 Jul 2025 03:18:19 GMT, erifan <duke at openjdk.org> wrote:

>> test/micro/org/openjdk/bench/jdk/incubator/vector/MaskFromLongToLongBenchmark.java line 34:
>> 
>>> 32: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"})
>>> 33: public class MaskFromLongToLongBenchmark {
>>> 34:     private static final int ITERATION = 10000;
>> 
>> It would be nice to add a synthetic micro for the cast chain transform added along with this patch. The following micro shows around 1.5x gains on an AVX2 system because of widening cast elision.
>> 
>> 
>> import jdk.incubator.vector.*;
>> import java.util.stream.IntStream;
>> 
>> public class mask_cast_chain {
>>    public static final VectorSpecies<Float> FSP = FloatVector.SPECIES_128;
>> 
>>    public static long micro(float [] src1, float [] src2, int ctr) {
>>        long res = 0;
>>        for (int i = 0; i < FSP.loopBound(src1.length); i += FSP.length()) {
>>             res += FloatVector.fromArray(FSP, src1, i)
>>                          .compare(VectorOperators.GE, FloatVector.fromArray(FSP, src2, i))
>>                          .cast(DoubleVector.SPECIES_256)
>>                          .cast(FloatVector.SPECIES_128)
>>                          .toLong();
>>        }
>>        return res * ctr;
>>    }
>> 
>>    public static void main(String [] args) {
>>        float [] src1 = new float[1024];
>>        float [] src2 = new float[1024];
>> 
>>        IntStream.range(0, src1.length).forEach(i -> {src1[i] = (float)i;});
>>        IntStream.range(0, src2.length).forEach(i -> {src2[i] = (float)500;});
>> 
>>        long res = 0;
>>        for (int i = 0; i < 100000; i++) {
>>           res += micro(src1, src2, i);
>>        }
>>        long t1 = System.currentTimeMillis();
>>        for (int i = 0; i < 100000; i++) {
>>           res += micro(src1, src2, i);
>>        }
>>        long t2 = System.currentTimeMillis();
>>        System.out.println("[time] " + (t2 - t1) + "ms" + " [res] " + res);
>>    }
>> }
>
> Ok~

I added some JMH benchmarks; the code differs slightly from yours. On my AVX2 system the results show a ~17% performance improvement for the applicable cases. There is no performance change on the avx3 system, because `cast` is already lowered to nothing there.
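
For readers following along, here is a minimal scalar sketch of why the cast chain in the micro above can be elided. It does not use the Vector API or JMH. It only models a mask as a `boolean[]`, under the assumption (per the `VectorMask.cast` Javadoc) that a cast between species with the same lane count leaves the per-lane booleans unchanged. The class name `MaskCastChainModel` and all of its methods are hypothetical and are not part of the patch.

```java
final class MaskCastChainModel {
    // Scalar model of compare(GE): lane n is set when a[off + n] >= b[off + n].
    static boolean[] compareGe(float[] a, float[] b, int off, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int n = 0; n < lanes; n++) {
            mask[n] = a[off + n] >= b[off + n];
        }
        return mask;
    }

    // Scalar model of a mask cast between species of equal lane count
    // (assumption: only the element type changes, the lane values do not).
    static boolean[] castSameLaneCount(boolean[] mask) {
        return mask.clone();
    }

    // Scalar model of toLong(): lane n maps to bit n, lowest lane first.
    static long toLong(boolean[] mask) {
        long bits = 0L;
        for (int n = 0; n < mask.length; n++) {
            if (mask[n]) {
                bits |= 1L << n;
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        float[] a = {0f, 499f, 500f, 501f};
        float[] b = {500f, 500f, 500f, 500f};
        boolean[] m = compareGe(a, b, 0, 4);
        // float x 4 -> double x 4 -> float x 4, as in the micro's cast chain.
        boolean[] roundTrip = castSameLaneCount(castSameLaneCount(m));
        System.out.println(toLong(m) == toLong(roundTrip));
    }
}
```

In this model the widening and narrowing casts cancel out, which is the property the cast chain transform relies on. It says nothing about the cost of the casts on real hardware; that is what the JMH numbers measure.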

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2227250664