RFR: 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases [v4]
erifan
duke at openjdk.org
Tue Jul 22 03:22:28 UTC 2025
On Mon, 21 Jul 2025 06:41:43 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision:
>>
>> - Refactor the implementation
>>
>> Do the convertion in C2's IGVN phase to cover more cases.
>> - Merge branch 'master' into JDK-8356760
>> - Simplify the test code
>> - Address some review comments
>>
>> Add support for the following patterns:
>> toLong(maskAll(true)) => (-1ULL >> (64 -vlen))
>> toLong(maskAll(false)) => 0
>>
>> And add more test cases.
>> - Merge branch 'master' into JDK-8356760
>> - 8356760: VectorAPI: Optimize VectorMask.fromLong for all-true/all-false cases
>>
>> If the input long value `l` of `VectorMask.fromLong(SPECIES, l)` would
>> set or unset all lanes, `VectorMask.fromLong(SPECIES, l)` is equivalent
>> to `maskAll(true)` or `maskAll(false)`. But the cost of `maskAll` is
>> relative smaller than that of `fromLong`. This patch does the conversion
>> for these cases if `l` is a compile time constant.
>>
>> And this conversion also enables further optimizations that recognize
>> maskAll patterns, see [1].
>>
>> Some JTReg test cases are added to ensure the optimization is effective.
>>
>> I tried many different ways to write a JMH benchmark, but failed. Since
>> the input of `VectorMask.fromLong(SPECIES, l)` needs to be a specific
>> compile-time constant, the statement will be hoisted out of the loop.
>> If we don't use a loop, the hotspot will become other instructions, and
>> no obvious performance change was observed. However, combined with the
>> optimization of [1], we can observe a performance improvement of about
>> 7% on both aarch64 and x64.
>>
>> The patch was tested on both aarch64 and x64, all of tier1 tier2 and
>> tier3 tests passed.
>>
>> [1] https://github.com/openjdk/jdk/pull/24674
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 692:
>
>> 690: // generate a MaskAll or Replicate instead.
>> 691:
>> 692: // The "maskAll" API uses the corresponding integer types for floating-point data.
>
> This is because mask all only accepts -1 and 0 values, since -1.0f in float in IEEE 754 format does not set all bits hence an floating point to integral conversion is mandatory here.
Good to know this, thanks!
> src/hotspot/share/opto/vectornode.cpp line 1520:
>
>> 1518: uint vlen = vt->length();
>> 1519: BasicType bt = vt->element_basic_type();
>> 1520: int opc = is_mask ? Op_MaskAll : Op_Replicate;
>
> You can remove this check, since VectorNode::scalar2vector alreday has a match rule for Op_MaskAll
Do you mean this check `Matcher::match_rule_supported_vector(opc, vlen, maskall_bt)` ? I think it's necessary ? Because in theory some platforms don't support both `MaskAll` and `Replicate`. Of course, this situation may not exist in reality. If `MaskAll` and `Replicate` are not supported, then `VectorLongToMask` should not be supported either, and this function will not be called.
> src/hotspot/share/opto/vectornode.cpp line 1532:
>
>> 1530: } else {
>> 1531: con = phase->intcon(con_value);
>> 1532: }
>
> Suggestion:
>
> phase->makecon(TypeInteger::make(bits_type->get_con(), maskall_bt)
This should be:
`con = phase->makecon(TypeInteger::make(con_value, maskall_bt == T_LONG ? T_LONG : T_INT));` because `maskall_bt` can be `T_BYTE` or `T_SHORT`.
Since we still need to check `maskall_bt`, I tend to the current approach because it has fewer function calls.
> src/hotspot/share/opto/vectornode.cpp line 1544:
>
>> 1542:
>> 1543: Node* VectorLoadMaskNode::Ideal(PhaseGVN* phase, bool can_reshape) {
>> 1544: // VectorLoadMask(VectorLongToMask(-1/0)) => Replicate(-1/0)
>
> FTR: This is only useful for non-predicated targets. Since on predicated target VectorLongToMask is not succeeded by VectorLoadMask
> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L703
Yes, thanks for your clarification.
> test/micro/org/openjdk/bench/jdk/incubator/vector/MaskFromLongToLongBenchmark.java line 34:
>
>> 32: @Fork(value = 1, jvmArgs = {"--add-modules=jdk.incubator.vector"})
>> 33: public class MaskFromLongToLongBenchmark {
>> 34: private static final int ITERATION = 10000;
>
> It will be nice to add a synthetic micro for cast chain transform added along with this patch. following micro shows around 1.5x gains on AVX2 system because of widening cast elision.
>
>
> import jdk.incubator.vector.*;
> import java.util.stream.IntStream;
>
> public class mask_cast_chain {
> public static final VectorSpecies<Float> FSP = FloatVector.SPECIES_128;
>
> public static long micro(float [] src1, float [] src2, int ctr) {
> long res = 0;
> for (int i = 0; i < FSP.loopBound(src1.length); i += FSP.length()) {
> res += FloatVector.fromArray(FSP, src1, i)
> .compare(VectorOperators.GE, FloatVector.fromArray(FSP, src2, i))
> .cast(DoubleVector.SPECIES_256)
> .cast(FloatVector.SPECIES_128)
> .toLong();
> }
> return res * ctr;
> }
>
> public static void main(String [] args) {
> float [] src1 = new float[1024];
> float [] src2 = new float[1024];
>
> IntStream.range(0, src1.length).forEach(i -> {src1[i] = (float)i;});
> IntStream.range(0, src2.length).forEach(i -> {src2[i] = (float)500;});
>
> long res = 0;
> for (int i = 0; i < 100000; i++) {
> res += micro(src1, src2, i);
> }
> long t1 = System.currentTimeMillis();
> for (int i = 0; i < 100000; i++) {
> res += micro(src1, src2, i);
> }
> long t2 = System.currentTimeMillis();
> System.out.println("[time] " + (t2 - t1) + "ms" + " [res] " + res);
> }
> }
Ok~
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220925155
PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220905045
PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220919188
PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220924181
PR Review Comment: https://git.openjdk.org/jdk/pull/25793#discussion_r2220927997
More information about the hotspot-compiler-dev
mailing list