RFR: 8354242: VectorAPI: combine vector not operation with compare [v3]
Emanuel Peter
epeter at openjdk.org
Mon Apr 28 06:49:55 UTC 2025
On Fri, 25 Apr 2025 07:24:15 GMT, erifan <duke at openjdk.org> wrote:
>> This patch optimizes the following patterns:
>> For integer types:
>>
>> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1))
>> => (VectorMaskCmp src1 src2 ncond)
>> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1))
>> => (VectorMaskCmp src1 src2 ncond)
>>
>> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult or ugt; ncond is the negated comparison of cond.
>>
>> For float and double types:
>>
>> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1))
>> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))
>> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1))
>> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))
>>
>> cond can be eq or ne.
>>
>> Benchmarks on an Nvidia Grace machine with 128-bit SVE2, with option `-XX:UseSVE=2`:
>>
>> Benchmark                   Unit   Before Score  Error        After Score  Error        Uplift
>> testCompareEQMaskNotByte    ops/s  7912127.225   2677.289518  10266136.26  8955.008548  1.29
>> testCompareEQMaskNotDouble  ops/s  884737.6799   446.963779   1179760.772  448.031844   1.33
>> testCompareEQMaskNotFloat   ops/s  1765045.787   682.332214   2359520.803  896.305743   1.33
>> testCompareEQMaskNotInt     ops/s  1787221.411   977.743935   2353952.519  960.069976   1.31
>> testCompareEQMaskNotLong    ops/s  895297.1974   673.44808    1178449.02   323.804205   1.31
>> testCompareEQMaskNotShort   ops/s  3339987.002   3415.2226    4712761.965  2110.862053  1.41
>> testCompareGEMaskNotByte    ops/s  7907615.16    4094.243652  10251646.9   9486.699831  1.29
>> testCompareGEMaskNotInt     ops/s  1683738.958   4233.813092  2352855.205  1251.952546  1.39
>> testCompareGEMaskNotLong    ops/s  854496.1561   8594.598885  1177811.493  521.1229     1.37
>> testCompareGEMaskNotShort   ops/s  3341860.309   1578.975338  4714008.434  1681.10365   1.41
>> testCompareGTMaskNotByte    ops/s  7910823.674   2993.367032  10245063.58  9774.75138   1.29
>> testCompareGTMaskNotInt     ops/s  1673393.928   3153.099431  2353654.521  1190.848583  1.4
>> testCompareGTMaskNotLong    ops/s  849405.9159   2432.858159  1177952.041  359.96413    1.38
>> testCompareGTMaskNotShort   ops/s  3339509.141   3339.976585  4711442.496  2673.364893  1.41
>> testCompareLEMaskNotByte    ops/s  7911340.004   3114.69191   10231626.5   27134.20035  1.29
>> testCompareLEMaskNotInt     ops/s  1675812.113   1340.969885  2353255.341  1452.4522    1.4
>> testCompareLEMaskNotLong    ops/s  848862.8036   6564.841731  1177763.623  539.290106   1.38
>> testCompareLEMaskNotShort   ops/s  3324951.54    2380.29473   4712116.251  1544.559684  1.41
>> testCompareLTMaskNotByte    ops/s  7910390.844   2630.861436  10239567.69  6487.441672  1.29
>> testCompareLTMaskNotInt     ops/s  16721...
>
> erifan has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
>
> - Addressed some review comments
>
> 1. Call VectorNode::Ideal() only once in XorVNode::Ideal.
> 2. Improve code comments.
> - Merge branch 'master' into JDK-8354242
> - Merge branch 'master' into JDK-8354242
> - 8354242: VectorAPI: combine vector not operation with compare
>
> This patch optimizes the following patterns:
> For integer types:
> ```
> (XorV (VectorMaskCmp src1 src2 cond) (Replicate -1))
> => (VectorMaskCmp src1 src2 ncond)
> (XorVMask (VectorMaskCmp src1 src2 cond) (MaskAll m1))
> => (VectorMaskCmp src1 src2 ncond)
> ```
> cond can be eq, ne, le, ge, lt, gt, ule, uge, ult or ugt; ncond is the
> negated comparison of cond.
>
> For float and double types:
> ```
> (XorV (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (Replicate -1))
> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))
> (XorVMask (VectorMaskCast (VectorMaskCmp src1 src2 cond)) (MaskAll m1))
> => (VectorMaskCast (VectorMaskCmp src1 src2 ncond))
> ```
> cond can be eq or ne.
>
> Benchmarks on an Nvidia Grace machine with 128-bit SVE2:
> With option `-XX:UseSVE=2`:
> ```
> Benchmark                   Unit   Before Score  Error        After Score  Error        Uplift
> testCompareEQMaskNotByte    ops/s  7912127.225   2677.289518  10266136.26  8955.008548  1.29
> testCompareEQMaskNotDouble  ops/s  884737.6799   446.963779   1179760.772  448.031844   1.33
> testCompareEQMaskNotFloat   ops/s  1765045.787   682.332214   2359520.803  896.305743   1.33
> testCompareEQMaskNotInt     ops/s  1787221.411   977.743935   2353952.519  960.069976   1.31
> testCompareEQMaskNotLong    ops/s  895297.1974   673.44808    1178449.02   323.804205   1.31
> testCompareEQMaskNotShort   ops/s  3339987.002   3415.2226    4712761.965  2110.862053  1.41
> testCompareGEMaskNotByte    ops/s  7907615.16    4094.243652  10251646.9   9486.699831  1.29
> testCompareGEMaskNotInt     ops/s  1683738.958   4233.813092  2352855.205  1251.952546  1.39
> testCompareGEMaskNotLong    ops/s  854496.1561   8594.598885  1177811.493  521.1229     1.37
> testCompareGEMaskNotShort   ops/s  3341860.309   1578.975338  4714008.434  1681.10365   1.41
> testCompareGTMaskNotByte    ops/s  7910823.674   2993.367032  10245063.58  9774.75138   1.29
> testCompareGTMaskNotInt     ops/s  1673393.928   3153.099431  2353654.521  1190.848583  1.4
> testCompareGTMaskNotLong    ops/s  849405.9159   2...
Just a drive-by comment for now; I may review this more fully later.
> I would also prefer if you added the IR restrictions rather than the JTREG requires.
The benefit is that we can still run the tests on all platforms, at least for result verification.
>
> Imagine someone adds optimizations to a new platform, but does not know about this test here. They make a mistake, and there is a bug, leading either to a crash or a wrong result. With the requires, your test would never even run, and we would not catch it. With the IR applyIf, we would catch the bug.
Just copy-pasting the IR applyIf everywhere is not that much work, and adding a new platform later is not really hard either.
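
As a minimal sketch of the kind of platform-independent result check meant here, the plain-Java model below verifies the identity the quoted patterns rely on: negating a comparison mask gives the same mask as the negated comparison. It uses neither the Vector API nor the IR framework; the class `NegatedCompareSketch` and its helpers are hypothetical names for illustration, and NaN-free double lanes stand in for integer lanes. It also shows why the float and double patterns are limited to eq and ne: with a NaN lane, both `a < b` and `a >= b` are false, so not(lt) is not ge.

```java
import java.util.Arrays;

// Hypothetical scalar model of lane-wise masks; not the Vector API and not HotSpot code.
public class NegatedCompareSketch {
    // One lane-wise comparison, e.g. (x, y) -> x < y.
    interface Cmp { boolean test(double a, double b); }

    // Build a mask by applying the comparison to each pair of lanes.
    static boolean[] mask(double[] a, double[] b, Cmp cond) {
        boolean[] m = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            m[i] = cond.test(a[i], b[i]);
        }
        return m;
    }

    // Lane-wise NOT, the scalar analogue of XOR with an all-true mask.
    static boolean[] not(boolean[] m) {
        boolean[] r = new boolean[m.length];
        for (int i = 0; i < m.length; i++) {
            r[i] = !m[i];
        }
        return r;
    }

    // True if not(cond) and ncond produce the same mask for these inputs.
    static boolean sameAsNegated(double[] a, double[] b, Cmp cond, Cmp ncond) {
        return Arrays.equals(not(mask(a, b, cond)), mask(a, b, ncond));
    }

    public static void main(String[] args) {
        double[] noNaN = {1.0, 2.0, 3.0};          // behaves like integer lanes
        double[] withNaN = {1.0, 2.0, Double.NaN};
        double[] rhs = {2.0, 2.0, 2.0};

        // Without NaN, not(lt) matches ge.
        System.out.println(sameAsNegated(noNaN, rhs, (x, y) -> x < y, (x, y) -> x >= y));   // true
        // With NaN, not(eq) still matches ne.
        System.out.println(sameAsNegated(withNaN, rhs, (x, y) -> x == y, (x, y) -> x != y)); // true
        // With NaN, not(lt) no longer matches ge.
        System.out.println(sameAsNegated(withNaN, rhs, (x, y) -> x < y, (x, y) -> x >= y));  // false
    }
}
```

A check of this shape needs no platform gating, so it can run everywhere, while only the IR shape assertions would be restricted per platform.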
-------------
Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/24674#pullrequestreview-2798141911