RFR: 8340272: C2 SuperWord: JMH benchmark for Reduction vectorization
Emanuel Peter
epeter at openjdk.org
Tue Sep 17 12:03:18 UTC 2024
I'm adding some proper JMH benchmarks for vectorized reductions. There are already some other benchmarks, but they are either not comprehensive or not JMH-based.
Additionally, I wanted to do a performance investigation, hopefully leading to some improvements. **See Future Work below**.
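For context, the benchmark kernels are plain reduction loops over arrays. Here is a minimal sketch of the general shape, purely illustrative: the class, method and field names are made up and are not the actual ones from the benchmark file.

```java
import org.openjdk.jmh.annotations.*;
import java.util.Random;

@State(Scope.Thread)
public class ReductionSketch {
    @Param({"2048"})
    int size;
    int[] ints;
    float[] floats;

    @Setup
    public void init() {
        Random r = new Random(42);
        ints = new int[size];
        floats = new float[size];
        for (int i = 0; i < size; i++) {
            ints[i] = r.nextInt();
            floats[i] = r.nextFloat();
        }
    }

    // "Simple" shape: the loop body is nothing but the reduction itself.
    @Benchmark
    public int intAddSimpleSketch() {
        int sum = 0;
        for (int i = 0; i < ints.length; i++) {
            sum += ints[i];
        }
        return sum;
    }

    // Dot-product shape: extra "work" (the multiply) feeds the reduction.
    @Benchmark
    public float floatDotProductSketch() {
        float sum = 0;
        for (int i = 0; i < floats.length; i++) {
            sum += floats[i] * floats[i];
        }
        return sum;
    }
}
```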
**How I run my benchmarks**
All benchmarks:
`make test TEST="micro:vm.compiler.VectorReduction2" CONF=linux-x64`
A specific benchmark, with a profiler that tells me which code snippet is hottest:
`make test TEST="micro:vm.compiler.VectorReduction2.*doubleMinDotProduct" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"`
**JMH logs**
Run on my AVX512 laptop, with master:
[run_avx512_master.txt](https://github.com/user-attachments/files/17025111/run_avx512_master.txt)
Run on a remote ASIMD (aarch64, NEON) machine:
[run_asimd_master.txt](https://github.com/user-attachments/files/17025579/run_asimd_master.txt)
**Results**
I ran it on 2 machines so far: on the left my AVX512 machine, on the right an ASIMD/NEON/aarch64 machine.
Here are the interesting `int / long / float / double` results; discussion further below:

And here are the less spectacular `byte / char / short` results. There is no vectorization in these cases. But there seems to be some issue with over-unrolling on my AVX512 machine: one case I looked at would only unroll 4x without SuperWord, but 16x with it, and that seems to be unfavourable.

Here is the PDF:
[benchmark_results.pdf](https://github.com/user-attachments/files/17027695/benchmark_results.pdf)
**Why are all the ...Simple benchmarks not vectorizing, i.e. "not profitable"?**
Apparently, there must be sufficiently many "work" vectors to outweigh the "reduction" vectors.
The idea used to be that one should have at least 2 work vectors, which tend to be profitable, to outweigh the cost of a single reduction vector:
```
// Check if reductions are connected
if (is_marked_reduction(p0)) {
  Node* second_in = p0->in(2);
  Node_List* second_pk = get_pack(second_in);
  if ((second_pk == nullptr) || (_num_work_vecs == _num_reductions)) {
    // No parent pack or not enough work
    // to cover reduction expansion overhead
    return false;
  } else if (second_pk->size() != p->size()) {
    return false;
  }
}
```
But when I disable this code, I see this on the aarch64/ASIMD machine:
VectorReduction2.NoSuperword.intAddSimple 2048 0 avgt 3 751.453 ± 1603.353 ns/op
VectorReduction2.WithSuperword.intAddSimple 2048 0 avgt 3 344.263 ± 2.888 ns/op
Hence, this assumption no longer holds. I think it is because we are actually able to move the reductions out of the loop now, and that was not the case when this code was added.
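Conceptually, "moving the reduction out of the loop" means keeping a vector of independent partial sums inside the loop and performing the expensive cross-lane reduction only once after the loop. Here is a hedged scalar-Java illustration of the idea, where each array slot stands in for one vector lane (this reassociation is legal for `int` because integer addition is associative):

```java
// Illustration only: models what the vectorized int-add loop effectively computes.
static int sumWithReductionMovedOutOfLoop(int[] a) {
    int[] lanes = new int[4];            // stands in for one vector register of partial sums
    int i = 0;
    for (; i + 4 <= a.length; i += 4) {  // main loop: only cheap lane-wise adds
        lanes[0] += a[i];
        lanes[1] += a[i + 1];
        lanes[2] += a[i + 2];
        lanes[3] += a[i + 3];
    }
    int sum = lanes[0] + lanes[1] + lanes[2] + lanes[3]; // cross-lane reduction, done once
    for (; i < a.length; i++) {          // scalar post-loop for the remaining elements
        sum += a[i];
    }
    return sum;
}
```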
**2-Element Reductions for INT / LONG**
Apparently, all 2-element int and long reductions are currently deemed not profitable, see change: https://github.com/openjdk/jdk/commit/a880f3d1399469c2cd7ef1ace1deb7e04c5ab3d5
This means that the `long` reductions do not vectorize on the ASIMD / aarch64 machine with `MaxVectorSize=16`: a 16-byte vector holds only 16 / 8 = 2 longs, so they fall exactly into the rejected 2-element case.
```
// Length 2 reductions of INT/LONG do not offer performance benefits
if (((arith_type->basic_type() == T_INT) || (arith_type->basic_type() == T_LONG)) && (size == 2)) {
  retValue = false;
} else {
  retValue = ReductionNode::implemented(opc, size, arith_type->basic_type());
}
```
**AARCH64 / NEON / ASIMD**
The matcher code below is why the `float / double` `add / mul` reductions (yellow `MATCHER` in the results PDF) do not vectorize. We might be able to tackle this with an appropriate alternative implementation.
Also, all of the `long` cases fail: for one, because they would only be 2-element reductions (since we only have `MaxVectorSize=16` here), but also because `MulVL` is apparently not allowed, see below.
```
bool Matcher::match_rule_supported_auto_vectorization(int opcode, int vlen, BasicType bt) {
  if (UseSVE == 0) {
    // These operations are not profitable to be vectorized on NEON, because no direct
    // NEON instructions support them. But the match rule support for them is profitable for
    // Vector API intrinsics.
    if ((opcode == Op_VectorCastD2X && bt == T_INT) ||
        (opcode == Op_VectorCastL2X && bt == T_FLOAT) ||
        (opcode == Op_CountLeadingZerosV && bt == T_LONG) ||
        (opcode == Op_CountTrailingZerosV && bt == T_LONG) ||
        // The implementations of Op_AddReductionVD/F in Neon are for the Vector API only.
        // They are not suitable for auto-vectorization because the result would not conform
        // to the JLS, Section Evaluation Order.
        opcode == Op_AddReductionVD || opcode == Op_AddReductionVF ||
        opcode == Op_MulReductionVD || opcode == Op_MulReductionVF ||
        opcode == Op_MulVL) {
      return false;
    }
  }
  return match_rule_supported_vector(opcode, vlen, bt);
}
```
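To see why a lane-parallel reduction would not conform to the JLS evaluation order: float addition is not associative, so reassociating the adds can change the rounded result. A minimal, self-contained example:

```java
public class FloatAssociativity {
    public static void main(String[] args) {
        float a = 1e8f, b = -1e8f, c = 1f;
        // Strict left-to-right order, as required for the scalar loop:
        float strict = (a + b) + c;   // (1e8 - 1e8) + 1  ==  1.0f
        // Reassociated order, as a lane-parallel reduction would effectively produce:
        float reassoc = a + (b + c);  // -1e8 + 1 rounds back to -1e8, so the result is 0.0f
        System.out.println(strict + " vs " + reassoc); // prints "1.0 vs 0.0"
    }
}
```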
**Float / Double with Add and Mul**
On `aarch64` NEON these cases do not vectorize, see the previous section.
It turns out that many of these cases actually do vectorize (on x64), but the code is just as fast as the scalar code.
This is because the reduction order is strict, to maintain correct rounding.
Interestingly, the code runs at about the same speed whether vectorized or not. It seems that the latency of the reduction chain is simply the determining factor, no matter if it is vectorized or scalar: every add into the accumulator depends on the previous one.
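To make that latency argument concrete: the strict order forces a single dependency chain through the accumulator, so the loop cannot run faster than one float-add latency per element, whether the adds are scalar or packed. Here is a hedged sketch of the manual rewrite that would break the chain; note that it changes the summation order and hence the rounding, which is exactly why C2 must not do this on its own:

```java
// Illustration only: independent accumulators break the single dependency chain,
// but the result may differ in the last bits from the strict left-to-right sum.
static float sumWithPartialAccumulators(float[] a) {
    float s0 = 0, s1 = 0;
    int i = 0;
    for (; i + 2 <= a.length; i += 2) {
        s0 += a[i];      // chain 0
        s1 += a[i + 1];  // chain 1, independent of chain 0
    }
    float sum = s0 + s1;          // combine the partial sums once, after the loop
    for (; i < a.length; i++) {   // scalar tail
        sum += a[i];
    }
    return sum;
}
```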
Running this, for example, shows that the loop bodies are quite different:
`make test TEST="micro:vm.compiler.VectorReduction2.*floatAddBig" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"`
Scalar loop body; note that only scalar `xmm` registers are used:
0x00007fcee43ffedb: vaddss %xmm3,%xmm6,%xmm6
0x00007fcee43ffedf: vmulss %xmm2,%xmm4,%xmm4
6.69% 0x00007fcee43ffee3: vmulss %xmm2,%xmm1,%xmm1
0x00007fcee43ffee7: vaddss %xmm4,%xmm5,%xmm3
0x00007fcee43ffeeb: vmulss %xmm18,%xmm13,%xmm2
0x00007fcee43ffef1: vaddss %xmm1,%xmm3,%xmm3
0x00007fcee43ffef5: vmulss %xmm12,%xmm10,%xmm1
0x00007fcee43ffefa: vmulss %xmm11,%xmm13,%xmm5
0x00007fcee43ffeff: vaddss %xmm17,%xmm1,%xmm1
0x00007fcee43fff05: vaddss %xmm5,%xmm2,%xmm4 ; {no_reloc}
5.81% 0x00007fcee43fff09: vaddss %xmm14,%xmm1,%xmm1
0x00007fcee43fff0e: vaddss %xmm15,%xmm4,%xmm4
0x00007fcee43fff13: vaddss %xmm9,%xmm4,%xmm2
0x00007fcee43fff18: vaddss %xmm2,%xmm3,%xmm3
24.14% 0x00007fcee43fff1c: vaddss %xmm3,%xmm6,%xmm2
24.50% 0x00007fcee43fff20: vaddss %xmm2,%xmm1,%xmm9
The vector loop uses `zmm`, `ymm` and `xmm` registers, and not just `mul` and `add` instructions, but also `vpshufd` and `vextractf128` to shuffle the values around in the reduction:
0x00007f4578400c36: vmulps %zmm3,%zmm2,%zmm12
0x00007f4578400c3c: vmulps %zmm9,%zmm11,%zmm13
0.33% 0x00007f4578400c42: vmulps %zmm10,%zmm11,%zmm11
0x00007f4578400c48: vmulps %zmm2,%zmm4,%zmm2
0x00007f4578400c4e: vaddps %zmm13,%zmm11,%zmm11
0x00007f4578400c54: vmulps %zmm3,%zmm4,%zmm3
0x00007f4578400c5a: vmulps %zmm5,%zmm15,%zmm13
0x00007f4578400c60: vaddps %zmm2,%zmm3,%zmm2
0x00007f4578400c66: vmulps %zmm10,%zmm9,%zmm3
0x00007f4578400c6c: vaddps %zmm12,%zmm2,%zmm2
0.39% 0x00007f4578400c72: vaddps %zmm3,%zmm11,%zmm3 ; {no_reloc}
0x00007f4578400c78: vmulps %zmm7,%zmm8,%zmm4
0x00007f4578400c7e: vmulps %zmm6,%zmm8,%zmm8
0.03% 0x00007f4578400c84: vmulps %zmm7,%zmm6,%zmm6
0x00007f4578400c8a: vaddps %zmm8,%zmm4,%zmm4
0x00007f4578400c90: vmulps %zmm5,%zmm16,%zmm5
0x00007f4578400c96: vaddps %zmm6,%zmm4,%zmm4
0x00007f4578400c9c: vmulps %zmm15,%zmm16,%zmm6
0.36% 0x00007f4578400ca2: vaddps %zmm6,%zmm5,%zmm5
0x00007f4578400ca8: vaddps %zmm13,%zmm5,%zmm5
0x00007f4578400cae: vaddss %xmm3,%xmm1,%xmm1
0x00007f4578400cb2: vpshufd $0x1,%xmm3,%xmm13
0.23% 0x00007f4578400cb7: vaddss %xmm13,%xmm1,%xmm1
0.85% 0x00007f4578400cbc: vpshufd $0x2,%xmm3,%xmm13
0x00007f4578400cc1: vaddss %xmm13,%xmm1,%xmm1
1.31% 0x00007f4578400cc6: vpshufd $0x3,%xmm3,%xmm13
0x00007f4578400ccb: vaddss %xmm13,%xmm1,%xmm1
1.61% 0x00007f4578400cd0: vextractf128 $0x1,%ymm3,%xmm13
0x00007f4578400cd6: vaddss %xmm13,%xmm1,%xmm1
1.08% 0x00007f4578400cdb: vpshufd $0x1,%xmm13,%xmm12
0x00007f4578400ce1: vaddss %xmm12,%xmm1,%xmm1
1.48% 0x00007f4578400ce6: vpshufd $0x2,%xmm13,%xmm12
0x00007f4578400cec: vaddss %xmm12,%xmm1,%xmm1
1.48% 0x00007f4578400cf1: vpshufd $0x3,%xmm13,%xmm12
0x00007f4578400cf7: vaddss %xmm12,%xmm1,%xmm1
1.15% 0x00007f4578400cfc: vextracti64x4 $0x1,%zmm3,%ymm12
0x00007f4578400d03: vaddss %xmm12,%xmm1,%xmm1
1.64% 0x00007f4578400d08: vpshufd $0x1,%xmm12,%xmm13
0x00007f4578400d0e: vaddss %xmm13,%xmm1,%xmm1
1.12% 0x00007f4578400d13: vpshufd $0x2,%xmm12,%xmm13
0x00007f4578400d19: vaddss %xmm13,%xmm1,%xmm1
1.34% 0x00007f4578400d1e: vpshufd $0x3,%xmm12,%xmm13
0x00007f4578400d24: vaddss %xmm13,%xmm1,%xmm1
1.41% 0x00007f4578400d29: vextractf128 $0x1,%ymm12,%xmm13
0x00007f4578400d2f: vaddss %xmm13,%xmm1,%xmm1
1.54% 0x00007f4578400d34: vpshufd $0x1,%xmm13,%xmm12
0x00007f4578400d3a: vaddss %xmm12,%xmm1,%xmm1
1.41% 0x00007f4578400d3f: vpshufd $0x2,%xmm13,%xmm12
0x00007f4578400d45: vaddss %xmm12,%xmm1,%xmm1
1.28% 0x00007f4578400d4a: vpshufd $0x3,%xmm13,%xmm12
0x00007f4578400d50: vaddss %xmm12,%xmm1,%xmm1
1.57% 0x00007f4578400d55: vaddss %xmm4,%xmm1,%xmm1
1.31% 0x00007f4578400d59: vpshufd $0x1,%xmm4,%xmm6
0x00007f4578400d5e: vaddss %xmm6,%xmm1,%xmm1
1.28% 0x00007f4578400d62: vpshufd $0x2,%xmm4,%xmm6
0x00007f4578400d67: vaddss %xmm6,%xmm1,%xmm1
1.34% 0x00007f4578400d6b: vpshufd $0x3,%xmm4,%xmm6
0x00007f4578400d70: vaddss %xmm6,%xmm1,%xmm1
1.05% 0x00007f4578400d74: vextractf128 $0x1,%ymm4,%xmm6 ; {no_reloc}
0x00007f4578400d7a: vaddss %xmm6,%xmm1,%xmm1
1.67% 0x00007f4578400d7e: vpshufd $0x1,%xmm6,%xmm11
0x00007f4578400d83: vaddss %xmm11,%xmm1,%xmm1
1.48% 0x00007f4578400d88: vpshufd $0x2,%xmm6,%xmm11
0x00007f4578400d8d: vaddss %xmm11,%xmm1,%xmm1
1.25% 0x00007f4578400d92: vpshufd $0x3,%xmm6,%xmm11
0x00007f4578400d97: vaddss %xmm11,%xmm1,%xmm1
1.21% 0x00007f4578400d9c: vextracti64x4 $0x1,%zmm4,%ymm11
0x00007f4578400da3: vaddss %xmm11,%xmm1,%xmm1
1.94% 0x00007f4578400da8: vpshufd $0x1,%xmm11,%xmm6
0x00007f4578400dae: vaddss %xmm6,%xmm1,%xmm1
1.21% 0x00007f4578400db2: vpshufd $0x2,%xmm11,%xmm6
0x00007f4578400db8: vaddss %xmm6,%xmm1,%xmm1
1.77% 0x00007f4578400dbc: vpshufd $0x3,%xmm11,%xmm6
0x00007f4578400dc2: vaddss %xmm6,%xmm1,%xmm1
1.57% 0x00007f4578400dc6: vextractf128 $0x1,%ymm11,%xmm6
0x00007f4578400dcc: vaddss %xmm6,%xmm1,%xmm1
1.02% 0x00007f4578400dd0: vpshufd $0x1,%xmm6,%xmm11
0x00007f4578400dd5: vaddss %xmm11,%xmm1,%xmm1
1.48% 0x00007f4578400dda: vpshufd $0x2,%xmm6,%xmm11
0x00007f4578400ddf: vaddss %xmm11,%xmm1,%xmm1
1.31% 0x00007f4578400de4: vpshufd $0x3,%xmm6,%xmm11
0x00007f4578400de9: vaddss %xmm11,%xmm1,%xmm1
1.54% 0x00007f4578400dee: vaddss %xmm5,%xmm1,%xmm1
1.61% 0x00007f4578400df2: vpshufd $0x1,%xmm5,%xmm9
0x00007f4578400df7: vaddss %xmm9,%xmm1,%xmm1
1.51% 0x00007f4578400dfc: vpshufd $0x2,%xmm5,%xmm9
0x00007f4578400e01: vaddss %xmm9,%xmm1,%xmm1
1.61% 0x00007f4578400e06: vpshufd $0x3,%xmm5,%xmm9
0x00007f4578400e0b: vaddss %xmm9,%xmm1,%xmm1
1.34% 0x00007f4578400e10: vextractf128 $0x1,%ymm5,%xmm9
0x00007f4578400e16: vaddss %xmm9,%xmm1,%xmm1
1.25% 0x00007f4578400e1b: vpshufd $0x1,%xmm9,%xmm8
0x00007f4578400e21: vaddss %xmm8,%xmm1,%xmm1
2.16% 0x00007f4578400e26: vpshufd $0x2,%xmm9,%xmm8
0x00007f4578400e2c: vaddss %xmm8,%xmm1,%xmm1
1.44% 0x00007f4578400e31: vpshufd $0x3,%xmm9,%xmm8
0x00007f4578400e37: vaddss %xmm8,%xmm1,%xmm1
1.38% 0x00007f4578400e3c: vextracti64x4 $0x1,%zmm5,%ymm8
0x00007f4578400e43: vaddss %xmm8,%xmm1,%xmm1
1.51% 0x00007f4578400e48: vpshufd $0x1,%xmm8,%xmm9
0x00007f4578400e4e: vaddss %xmm9,%xmm1,%xmm1
1.74% 0x00007f4578400e53: vpshufd $0x2,%xmm8,%xmm9
0x00007f4578400e59: vaddss %xmm9,%xmm1,%xmm1
1.87% 0x00007f4578400e5e: vpshufd $0x3,%xmm8,%xmm9
0x00007f4578400e64: vaddss %xmm9,%xmm1,%xmm1
1.28% 0x00007f4578400e69: vextractf128 $0x1,%ymm8,%xmm9
0x00007f4578400e6f: vaddss %xmm9,%xmm1,%xmm1
1.67% 0x00007f4578400e74: vpshufd $0x1,%xmm9,%xmm8 ; {no_reloc}
0x00007f4578400e7a: vaddss %xmm8,%xmm1,%xmm1
1.57% 0x00007f4578400e7f: vpshufd $0x2,%xmm9,%xmm8
0x00007f4578400e85: vaddss %xmm8,%xmm1,%xmm1
1.48% 0x00007f4578400e8a: vpshufd $0x3,%xmm9,%xmm8
0x00007f4578400e90: vaddss %xmm8,%xmm1,%xmm1
1.57% 0x00007f4578400e95: vaddss %xmm2,%xmm1,%xmm1
1.57% 0x00007f4578400e99: vpshufd $0x1,%xmm2,%xmm7
0x00007f4578400e9e: vaddss %xmm7,%xmm1,%xmm1
1.61% 0x00007f4578400ea2: vpshufd $0x2,%xmm2,%xmm7
0x00007f4578400ea7: vaddss %xmm7,%xmm1,%xmm1
1.48% 0x00007f4578400eab: vpshufd $0x3,%xmm2,%xmm7
0x00007f4578400eb0: vaddss %xmm7,%xmm1,%xmm1
1.44% 0x00007f4578400eb4: vextractf128 $0x1,%ymm2,%xmm7
0x00007f4578400eba: vaddss %xmm7,%xmm1,%xmm1
1.02% 0x00007f4578400ebe: vpshufd $0x1,%xmm7,%xmm10
0x00007f4578400ec3: vaddss %xmm10,%xmm1,%xmm1
1.54% 0x00007f4578400ec8: vpshufd $0x2,%xmm7,%xmm10
0x00007f4578400ecd: vaddss %xmm10,%xmm1,%xmm1
1.31% 0x00007f4578400ed2: vpshufd $0x3,%xmm7,%xmm10
0x00007f4578400ed7: vaddss %xmm10,%xmm1,%xmm1
1.28% 0x00007f4578400edc: vextracti64x4 $0x1,%zmm2,%ymm10
0x00007f4578400ee3: vaddss %xmm10,%xmm1,%xmm1
1.28% 0x00007f4578400ee8: vpshufd $0x1,%xmm10,%xmm7
0x00007f4578400eee: vaddss %xmm7,%xmm1,%xmm1
2.03% 0x00007f4578400ef2: vpshufd $0x2,%xmm10,%xmm7
0x00007f4578400ef8: vaddss %xmm7,%xmm1,%xmm1
1.44% 0x00007f4578400efc: vpshufd $0x3,%xmm10,%xmm7
0x00007f4578400f02: vaddss %xmm7,%xmm1,%xmm1
1.48% 0x00007f4578400f06: vextractf128 $0x1,%ymm10,%xmm7
0x00007f4578400f0c: vaddss %xmm7,%xmm1,%xmm1
1.31% 0x00007f4578400f10: vpshufd $0x1,%xmm7,%xmm10
0x00007f4578400f15: vaddss %xmm10,%xmm1,%xmm1
1.08% 0x00007f4578400f1a: vpshufd $0x2,%xmm7,%xmm10
0x00007f4578400f1f: vaddss %xmm10,%xmm1,%xmm1
1.64% 0x00007f4578400f24: vpshufd $0x3,%xmm7,%xmm10
0x00007f4578400f29: vaddss %xmm10,%xmm1,%xmm1
**No vectorization for longMulBig?**
Locally, I can get vectorization in another setting, but somehow not in the JMH benchmark. This is strange, and I'll have to keep investigating.
**No speedup with doubleMinDotProduct**
Strangely, on my AVX512 machine, that benchmark did vectorize, but it did not show any speedup. I'm quite confused about that, especially because the analogous benchmark `doubleMaxDotProduct` vectorizes just fine. More investigation needed.
----------------------------------------
**Future Work**
- [JDK-8307513](https://bugs.openjdk.org/browse/JDK-8307513): C2: intrinsify Math.max(long,long) and Math.min(long,long)
- The `Simple` benchmarks should be profitable -> allow vectorization!
- Investigate the results for `byte / char / short`: is it all due to over-unrolling?
- `float / double` with `add / mul` (see the hypothetical sketch after this list):
  - The main issue is the strict order of operations. Maybe we can add some `Float.addAssociative` so that we can do a non-strict reduction? That could be a cool feature for those who care about performance and are willing to give up some rounding precision.
  - The aarch64 backend simply refuses to vectorize these, because there is no strict-order implementation. That could be changed, though we would need a benchmark where it is worth it...
- Investigate the strange results with `longMulBig` and `doubleMinDotProduct`.
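Regarding the `Float.addAssociative` idea: no such API exists today, so the following is purely a hypothetical sketch of what an opt-in, reassociable reduction could look like; the name and semantics are invented here. With such an add, the compiler would be free to generate the partial-accumulator / moved-out-of-loop form sketched earlier.

```java
// Hypothetical sketch only: there is no Float.addAssociative in the JDK.
// The placeholder below stands in for an add whose uses the compiler may reassociate,
// enabling lane-parallel float reductions at the cost of strict-order rounding.
final class NonStrictReduction {
    // Placeholder for a hypothetical Float.addAssociative(float, float).
    static float addAssociative(float a, float b) {
        return a + b; // same value, but reassociation would be permitted by contract
    }

    static float sum(float[] a) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum = addAssociative(sum, a[i]); // compiler could vectorize this reduction freely
        }
        return sum;
    }
}
```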
-------------
Commit messages:
- 8340272
Changes: https://git.openjdk.org/jdk/pull/21032/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21032&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8340272
Stats: 1454 lines in 1 file changed: 1454 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/21032.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/21032/head:pull/21032
PR: https://git.openjdk.org/jdk/pull/21032