RFR: 8340272: C2 SuperWord: JMH benchmark for Reduction vectorization

Jasmine Karthikeyan jkarthikeyan at openjdk.org
Wed Sep 18 03:01:05 UTC 2024


On Tue, 17 Sep 2024 07:53:40 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> I'm adding some proper JMH benchmarks for vectorized reductions. There are already some other benchmarks, but they are either not comprehensive or not JMH-based.
> 
> Plus, I wanted to do a performance investigation, hopefully leading to some improvements. **See Future Work below**.
> 
> **How I run my benchmarks**
> 
> All benchmarks:
> `make test TEST="micro:vm.compiler.VectorReduction2" CONF=linux-x64`
> 
> A specific benchmark, run with a profiler that shows which code snippet is hottest:
> `make test TEST="micro:vm.compiler.VectorReduction2.*doubleMinDotProduct" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"`
> 
> **JMH logs**
> 
> Run on my AVX512 laptop, with master:
> [run_avx512_master.txt](https://github.com/user-attachments/files/17025111/run_avx512_master.txt)
> 
> Run on remote asimd (aarch64, NEON) machine:
> [run_asimd_master.txt](https://github.com/user-attachments/files/17025579/run_asimd_master.txt)
> 
> **Results**
> 
> I ran it on 2 machines so far: on the left, my AVX512 machine; on the right, an ASIMD/NEON/aarch64 machine.
> 
> Here are the interesting `int / long / float / double` results; discussion follows further below:
> ![image](https://github.com/user-attachments/assets/20abfa7b-aee6-4654-bf4d-e3abc4bbfc8b)
> 
> 
> And here are the less spectacular `byte / char / short` results. These cases do not vectorize at all. But there seems to be some issue with over-unrolling on my AVX512 machine: one case I looked at would only unroll 4x without SuperWord, but 16x with it, and that seems to be unfavourable.
> 
> ![image](https://github.com/user-attachments/assets/6e1c69cf-db6c-4d33-8750-c8797ffc39a2)
> 
> Here is the PDF:
> [benchmark_results.pdf](https://github.com/user-attachments/files/17027695/benchmark_results.pdf)
> 
> 
> **Why are all the ...Simple benchmarks not vectorizing, i.e. "not profitable"?**
> 
> Apparently, there must be sufficient "work" vectors to outweigh the "reduction" vectors.
> The idea used to be that one should have at least 2 work vectors, which tend to be profitable, to outweigh the cost of a single reduction vector.
> 
>   // Check if reductions are connected
>   if (is_marked_reduction(p0)) {
>     Node* second_in = p0->in(2);
>     Node_List* second_pk = get_pack(second_in);
>     if ((second_pk == nullptr) || (_num_work_vecs == _num_reductions)) {
>       // No parent pack or not enough work
>       // to cover reduction expansion overhead
>       return false;
>     } else if (second_pk->size() != p->size()) {
>       return false;
>     }
>   }
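> 
> Concretely, the check distinguishes loops like these two (a hedged sketch: the class name is illustrative, the method names mirror the benchmark naming pattern, and the comments paraphrase the heuristic rather than quote the implementation):
> 
>   class ReductionContrast {
>       // "Simple" reduction: the loop does nothing but load and accumulate,
>       // so there is not enough vector "work" to cover the reduction
>       // expansion overhead, and the pack is rejected as unprofitable.
>       static int intAddSimple(int[] a) {
>           int acc = 0;
>           for (int i = 0; i < a.length; i++) {
>               acc += a[i];
>           }
>           return acc;
>       }
> 
>       // Dot product: the multiply adds an extra "work" vector on top of
>       // the loads, so the reduction is considered worth vectorizing.
>       static int intAddDotProduct(int[] a, int[] b) {
>           int acc = 0;
>           for (int i = 0; i < a.length; i++) {
>               acc += a[i] * b[i];
>           }
>           return acc;
>       }
>   }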
> 
> 
> But when I disable this code, I see this on the aarch64/ASIMD machine:
> 
> VectorReduction2.NoSuperword.intAddSimpl...

Looks nice; the benchmark is very thorough! I was interested to see how it performed on my Zen 3 (AVX2) machine, so I've attached the results here in case they're interesting/useful: [perf_results.txt](https://github.com/user-attachments/files/17037796/perf_results.txt)

-------------

Marked as reviewed by jkarthikeyan (Committer).

PR Review: https://git.openjdk.org/jdk/pull/21032#pullrequestreview-2311518449

