RFR: 8340272: C2 SuperWord: JMH benchmark for Reduction vectorization
Emanuel Peter
epeter at openjdk.org
Tue Sep 17 12:03:18 UTC 2024
On Tue, 17 Sep 2024 07:53:40 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
> I'm adding some proper JMH benchmarks for vectorized reductions. There are already some others, but they are not comprehensive or not JMH.
>
> Plus, I wanted to do a performance investigation, hopefully leading to some improvements. **See Future Work below**.
>
> **How I run my benchmarks**
>
> All benchmarks
> `make test TEST="micro:vm.compiler.VectorReduction2" CONF=linux-x64`
>
> A specific benchmark, with a profiler that tells me which code snippet is hottest:
> `make test TEST="micro:vm.compiler.VectorReduction2.*doubleMinDotProduct" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"`
>
> **JMH logs**
>
> Run on my AVX512 laptop, with master:
> [run_avx512_master.txt](https://github.com/user-attachments/files/17025111/run_avx512_master.txt)
>
> Run on remote asimd (aarch64, NEON) machine:
> [run_asimd_master.txt](https://github.com/user-attachments/files/17025579/run_asimd_master.txt)
>
> **Results**
>
> I ran it on 2 machines so far. Left: my AVX512 machine. Right: an ASIMD/NEON/aarch64 machine.
>
> Here are the interesting `int / long / float / double` results, with discussion further below:
> 
>
>
> And here are the less spectacular `byte / char / short` results. None of these cases vectorize. But there seems to be an issue with over-unrolling on my AVX512 machine: one case I looked at unrolled only 4x without SuperWord but 16x with it, and that seems to be unfavourable.
>
> 
>
> Here is the PDF:
> [benchmark_results.pdf](https://github.com/user-attachments/files/17027695/benchmark_results.pdf)
>
>
> **Why are all the ...Simple benchmarks not vectorizing, i.e. "not profitable"?**
>
> Apparently, there must be sufficient "work" vectors to outweigh the "reduction" vectors.
> The idea used to be that one needs at least 2 work vectors, which tend to be profitable, to outweigh the cost of a single reduction vector.
>
>     // Check if reductions are connected
>     if (is_marked_reduction(p0)) {
>       Node* second_in = p0->in(2);
>       Node_List* second_pk = get_pack(second_in);
>       if ((second_pk == nullptr) || (_num_work_vecs == _num_reductions)) {
>         // No parent pack or not enough work
>         // to cover reduction expansion overhead
>         return false;
>       } else if (second_pk->size() != p->size()) {
>         return false;
>       }
>     }
>
>
> But when I disable this check, I see the following on the aarch64/ASIMD machine:
>
> VectorReduction2.NoSuperword.intAddSimpl...
@galderz You can use this JMH benchmark for your work in https://github.com/openjdk/jdk/pull/20098 if you want.
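
For context on the `...Simple` question in the quoted description, here is a minimal Java sketch of the two loop shapes involved, plus a simplified model of the quoted profitability condition. The class and method names are hypothetical and not taken from the benchmark source. The pack counts in the comments assume that every non-reduction pack (a load, a multiply) counts as a work vector, and `reductionCheckPasses` leaves out the pack-size comparison from the quoted code.

```java
// Hypothetical illustration; names are not from the VectorReduction2 benchmark source.
final class ReductionShapes {

    // "Simple" shape: the loaded value feeds the reduction directly.
    // Assumed packs: 1 load (work vector) + 1 add reduction.
    static int sumSimple(int[] a) {
        int acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i];
        }
        return acc;
    }

    // "Dot product" shape: there is element-wise work before the reduction.
    // Assumed packs: 2 loads + 1 multiply (work vectors) + 1 add reduction.
    static int dotProduct(int[] a, int[] b) {
        int acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i] * b[i];
        }
        return acc;
    }

    // Simplified model of the quoted condition: the reduction is rejected when
    // its input has no pack, or when the work-vector count equals the
    // reduction count. The pack-size comparison is omitted here.
    static boolean reductionCheckPasses(boolean hasInputPack, int numWorkVecs, int numReductions) {
        return hasInputPack && numWorkVecs != numReductions;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {5, 6, 7, 8};
        System.out.println("sumSimple  = " + sumSimple(a));
        System.out.println("dotProduct = " + dotProduct(a, b));
        // Pack counts below are the assumed counts from the comments above.
        System.out.println("simple passes check:      " + reductionCheckPasses(true, 1, 1));
        System.out.println("dot product passes check: " + reductionCheckPasses(true, 3, 1));
    }
}
```

Under these assumed counts, the simple shape has 1 work vector and 1 reduction, so the model rejects it, while the dot-product shape has 3 work vectors and passes. That would be consistent with the `...Simple` benchmarks being reported as "not profitable".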
-------------
PR Comment: https://git.openjdk.org/jdk/pull/21032#issuecomment-2355504929