RFR: 8340272: C2 SuperWord: JMH benchmark for Reduction vectorization

Jasmine Karthikeyan jkarthikeyan at openjdk.org
Wed Sep 18 14:20:14 UTC 2024


On Tue, 17 Sep 2024 07:53:40 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> I'm adding some proper JMH benchmarks for vectorized reductions. There are already some benchmarks, but they are either not comprehensive or not JMH.
> 
> Plus, I wanted to do a performance-investigation, hopefully leading to some improvements. **See Future Work below**.
> 
> **How I run my benchmarks**
> 
> All benchmarks
> `make test TEST="micro:vm.compiler.VectorReduction2" CONF=linux-x64`
> 
> Some specific benchmark, with profiler that tells me which code snippet is hottest:
> `make test TEST="micro:vm.compiler.VectorReduction2.*doubleMinDotProduct" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"`
> 
> **JMH logs**
> 
> Run on my AVX512 laptop, with master:
> [run_avx512_master.txt](https://github.com/user-attachments/files/17025111/run_avx512_master.txt)
> 
> Run on remote asimd (aarch64, NEON) machine:
> [run_asimd_master.txt](https://github.com/user-attachments/files/17025579/run_asimd_master.txt)
> 
> **Results**
> 
> I ran it on 2 machines so far. Left: my AVX512 machine; right: an ASIMD/NEON/aarch64 machine.
> 
> Here are the interesting `int / long / float / double` results; discussion further below:
> ![image](https://github.com/user-attachments/assets/20abfa7b-aee6-4654-bf4d-e3abc4bbfc8b)
> 
> 
> And here are the less spectacular `byte / char / short` results. These cases are not vectorized. But there seems to be some issue with over-unrolling on my AVX512 machine: one case I looked at would only unroll 4x without SuperWord, but 16x with it, and that seems to be unfavourable.
> 
> ![image](https://github.com/user-attachments/assets/6e1c69cf-db6c-4d33-8750-c8797ffc39a2)
> 
> Here is the PDF:
> [benchmark_results.pdf](https://github.com/user-attachments/files/17027695/benchmark_results.pdf)
> 
> 
> **Why are all the ...Simple benchmarks not vectorizing, i.e. "not profitable"?**
> 
> Apparently, there must be sufficient "work" vectors to outweigh the "reduction" vectors.
> The idea was that one should have at least 2 work vectors, which tend to be profitable, to outweigh the cost of a single reduction vector.
> 
>   // Check if reductions are connected
>   if (is_marked_reduction(p0)) {
>     Node* second_in = p0->in(2);
>     Node_List* second_pk = get_pack(second_in);
>     if ((second_pk == nullptr) || (_num_work_vecs == _num_reductions)) {
>       // No parent pack or not enough work
>       // to cover reduction expansion overhead
>       return false;
>     } else if (second_pk->size() != p->size()) {
>       return false;
>     }
>   }
> 
> 
> But when I disable this code, then I see on the aarch64/ASIMD machine:
> 
> VectorReduction2.NoSuperword.intAddSimpl...

The subword results seem quite tricky, especially since some things that performed well for me (like char) ended up causing regressions on your machine. The long results are also quite strange, but they may just be random noise. I'll definitely make sure to investigate further. Thanks a lot for the analysis!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21032#issuecomment-2358604647


More information about the hotspot-compiler-dev mailing list