RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4]
erifan
duke at openjdk.org
Thu Jul 24 03:41:56 UTC 2025
On Tue, 22 Jul 2025 03:18:19 GMT, erifan <duke at openjdk.org> wrote:
>> test/micro/org/openjdk/bench/jdk/incubator/vector/MaskFromLongToLongBenchmark.java line 34:
>>
>>> 32: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"})
>>> 33: public class MaskFromLongToLongBenchmark {
>>> 34: private static final int ITERATION = 10000;
>>
>> It would be nice to add a synthetic micro for the cast chain transform added along with this patch. The following micro shows around 1.5x gains on an AVX2 system because of widening cast elision.
>>
>>
>> import jdk.incubator.vector.*;
>> import java.util.stream.IntStream;
>>
>> public class mask_cast_chain {
>>     public static final VectorSpecies<Float> FSP = FloatVector.SPECIES_128;
>>
>>     public static long micro(float [] src1, float [] src2, int ctr) {
>>         long res = 0;
>>         for (int i = 0; i < FSP.loopBound(src1.length); i += FSP.length()) {
>>             res += FloatVector.fromArray(FSP, src1, i)
>>                               .compare(VectorOperators.GE, FloatVector.fromArray(FSP, src2, i))
>>                               .cast(DoubleVector.SPECIES_256)
>>                               .cast(FloatVector.SPECIES_128)
>>                               .toLong();
>>         }
>>         return res * ctr;
>>     }
>>
>>     public static void main(String [] args) {
>>         float [] src1 = new float[1024];
>>         float [] src2 = new float[1024];
>>
>>         IntStream.range(0, src1.length).forEach(i -> {src1[i] = (float)i;});
>>         IntStream.range(0, src2.length).forEach(i -> {src2[i] = (float)500;});
>>
>>         long res = 0;
>>         for (int i = 0; i < 100000; i++) {
>>             res += micro(src1, src2, i);
>>         }
>>         long t1 = System.currentTimeMillis();
>>         for (int i = 0; i < 100000; i++) {
>>             res += micro(src1, src2, i);
>>         }
>>         long t2 = System.currentTimeMillis();
>>         System.out.println("[time] " + (t2 - t1) + "ms" + " [res] " + res);
>>     }
>> }
>
> Ok~
Added some JMH benchmarks; the code differs slightly from yours. Test results show that on my AVX2 system there is a ~17% performance improvement for applicable cases. There is no performance change on my AVX3 system, because `cast` is lowered to nothing there.
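
For readers following along, here is a rough scalar sketch of why the cast chain in the quoted micro can be elided. It is only a model, not the HotSpot transform or the benchmark added in the PR: a mask is represented as a plain boolean array, and the class and method names (MaskCastChainModel, compareGE, cast, toLong) are hypothetical. The one assumption taken from the API is that a mask cast requires the same lane count on both sides, so it never changes which lanes are set.

```java
public class MaskCastChainModel {
    // Scalar stand-in for compare(GE): one boolean per lane (hypothetical helper).
    public static boolean[] compareGE(float[] a, float[] b, int offset, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            mask[i] = a[offset + i] >= b[offset + i];
        }
        return mask;
    }

    // Models a mask cast between species: the lane count must match,
    // and the per-lane bits are carried over unchanged.
    public static boolean[] cast(boolean[] mask, int targetLaneCount) {
        if (mask.length != targetLaneCount) {
            throw new IllegalArgumentException("lane count mismatch");
        }
        return mask.clone();
    }

    // Models toLong(): lane i contributes bit i of the result.
    public static long toLong(boolean[] mask) {
        long bits = 0L;
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        float[] a = {1f, 600f, 500f, 2f};
        float[] b = {500f, 500f, 500f, 500f};
        boolean[] m = compareGE(a, b, 0, 4);   // 4 lanes, like a 128-bit float species

        long direct = toLong(m);
        // 4 float lanes -> 4 double lanes -> 4 float lanes, as in the micro.
        long viaChain = toLong(cast(cast(m, 4), 4));

        System.out.println("direct=" + direct + " viaChain=" + viaChain);
    }
}
```

In this model the round trip through the wider species leaves the lane bits untouched, so the result of toLong is the same with or without the two casts; that is the property the cast elision relies on.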
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2227250664