RFR: 8340272: C2 SuperWord: JMH benchmark for Reduction vectorization

Emanuel Peter epeter at openjdk.org
Wed Sep 18 07:26:08 UTC 2024


On Wed, 18 Sep 2024 02:58:10 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:

>> I'm adding some proper JMH benchmarks for vectorized reductions. There are already some others, but they are either not comprehensive or not JMH benchmarks.
>> 
>> Plus, I wanted to do a performance-investigation, hopefully leading to some improvements. **See Future Work below**.
>> 
>> **How I run my benchmarks**
>> 
>> All benchmarks
>> `make test TEST="micro:vm.compiler.VectorReduction2" CONF=linux-x64`
>> 
>> A specific benchmark, with a profiler that tells me which code snippet is hottest:
>> `make test TEST="micro:vm.compiler.VectorReduction2.*doubleMinDotProduct" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"`
>> 
>> **JMH logs**
>> 
>> Run on my AVX512 laptop, with master:
>> [run_avx512_master.txt](https://github.com/user-attachments/files/17025111/run_avx512_master.txt)
>> 
>> Run on remote asimd (aarch64, NEON) machine:
>> [run_asimd_master.txt](https://github.com/user-attachments/files/17025579/run_asimd_master.txt)
>> 
>> **Results**
>> 
>> I ran it on 2 machines so far. Left: my AVX512 machine; right: an ASIMD/NEON/aarch64 machine.
>> 
>> Here are the interesting `int / long / float / double` results, with discussion further below:
>> ![image](https://github.com/user-attachments/assets/20abfa7b-aee6-4654-bf4d-e3abc4bbfc8b)
>> 
>> 
>> And here are the less spectacular `byte / char / short` results. These cases do not vectorize. But there seems to be an issue with over-unrolling on my AVX512 machine: one case I looked at unrolls only 4x without SuperWord, but 16x with it, and that seems to be unfavourable.
>> 
>> ![image](https://github.com/user-attachments/assets/6e1c69cf-db6c-4d33-8750-c8797ffc39a2)
>> 
>> Here is the PDF:
>> [benchmark_results.pdf](https://github.com/user-attachments/files/17027695/benchmark_results.pdf)
>> 
>> 
>> **Why are all the ...Simple benchmarks not vectorizing, i.e. "not profitable"?**
>> 
>> Apparently, there must be sufficiently many "work" vectors to outweigh the "reduction" vectors.
>> The idea is that one should have at least 2 work vectors, which tend to be profitable, to outweigh the cost of a single reduction vector.
>> 
>>   // Check if reductions are connected
>>   if (is_marked_reduction(p0)) {
>>     Node* second_in = p0->in(2);
>>     Node_List* second_pk = get_pack(second_in);
>>     if ((second_pk == nullptr) || (_num_work_vecs == _num_reductions)) {
>>       // No parent pack or not enough work
>>       // to cover reduction expansion overhead
>>       return false;
>>     } else if (second_pk->size() != p->size()) {
>>       return false;
>>     }
>>   }
>> 
>> 
>> ...
>
> Looks nice, the benchmark is very thorough! I was interested to see how it performed on my Zen 3 (AVX2) machine, I've attached the results here in case it's interesting/useful: [perf_results.txt](https://github.com/user-attachments/files/17037796/perf_results.txt)

@jaskarth thanks for the benchmark!

I included it in these results now:
[benchmark_results.pdf](https://github.com/user-attachments/files/17040018/benchmark_results.pdf)

The results are quite comparable to the AVX512 results. Some comments:
- `byte / char / short`: there is also some variation here, but it seems slightly different. We might want to investigate that anyway, especially the regressions in the `15-25%` range. It is also possible that we could invest more effort to vectorize these cases as well, but the IR is more complicated because of all the "cast to byte/char/short" operations, i.e. the left- and right-shifting required to remove the upper bits. Pattern matching those cases is difficult with the current SuperWord structure, as far as I can see. I'm open to ideas/suggestions here ;)
- `int / float / double` performance is as expected, parallel to ASIMD and AVX512. Good.
- `long`:
  - `MulVL` is not implemented for AVX2 (the hardware does not support it, as far as I know), so those benchmark results are as expected.
  - Your results around the long min/max are a bit unexpected, especially because the vectorization is not supposed to work there, as far as I know. It could be interesting to investigate that further.
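To illustrate the casting issue with the sub-int reductions mentioned above, here is a minimal sketch (my own example with hypothetical names, not the actual benchmark code): in Java, a `byte` accumulation widens both operands to `int` on every add, and the narrowing cast back to `byte` is what shows up in the IR as the extra shift pattern.

```java
// Hypothetical sketch of a sub-int (byte) reduction loop; not the
// actual benchmark code from this PR.
public class ByteSumSketch {
    public static byte byteSum(byte[] a) {
        byte sum = 0;
        for (int i = 0; i < a.length; i++) {
            // The '+' widens both operands to int; the narrowing cast back
            // to byte is represented in C2's IR as a left shift followed by
            // an arithmetic right shift by 24, which is the extra pattern
            // SuperWord would have to match to vectorize this.
            sum = (byte) (sum + a[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(byteSum(new byte[]{1, 2, 3}));   // prints 6
        System.out.println(byteSum(new byte[]{100, 100})); // prints -56 (wraps)
    }
}
```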

![image](https://github.com/user-attachments/assets/b4e28637-04e8-431f-bd4f-9170d9461133)

![image](https://github.com/user-attachments/assets/1d0caa02-399c-4549-b314-ef460af133f6)
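For reference, the difference between the non-vectorizing `...Simple` shape and a profitable dot-product shape, as discussed in the profitability check quoted above, can be sketched like this (illustrative Java with hypothetical names, not the actual benchmark code):

```java
// Illustrative loop shapes for the reduction-profitability heuristic.
public class ReductionShapes {
    // "Simple" reduction: the loop body is only the reduction itself, so
    // there is no separate "work" vector pack (_num_work_vecs equals
    // _num_reductions) and the profitability check bails out.
    public static int sumSimple(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Dot product: the multiply forms a separate "work" vector pack,
    // whose profit can outweigh the cost of the reduction vector.
    public static int dotProduct(int[] a, int[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3}, b = {4, 5, 6};
        System.out.println(sumSimple(a));     // prints 6
        System.out.println(dotProduct(a, b)); // prints 32
    }
}
```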

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21032#issuecomment-2357701154
