RFR: 8340272: C2 SuperWord: JMH benchmark for Reduction vectorization

Emanuel Peter epeter at openjdk.org
Wed Sep 18 12:06:12 UTC 2024


On Wed, 18 Sep 2024 02:58:10 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:

>> I'm adding proper JMH benchmarks for vectorized reductions. There are already some benchmarks for this, but they are either not comprehensive or not JMH.
>> 
>> Plus, I wanted to do a performance investigation, hopefully leading to some improvements. **See Future Work below**.
>> 
>> **How I run my benchmarks**
>> 
>> All benchmarks:
>> `make test TEST="micro:vm.compiler.VectorReduction2" CONF=linux-x64`
>> 
>> A specific benchmark, with a profiler that shows which code snippet is hottest:
>> `make test TEST="micro:vm.compiler.VectorReduction2.*doubleMinDotProduct" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm"`
>> 
>> **JMH logs**
>> 
>> Run on my AVX512 laptop, with master:
>> [run_avx512_master.txt](https://github.com/user-attachments/files/17025111/run_avx512_master.txt)
>> 
>> Run on remote asimd (aarch64, NEON) machine:
>> [run_asimd_master.txt](https://github.com/user-attachments/files/17025579/run_asimd_master.txt)
>> 
>> **Results**
>> 
>> I ran it on 2 machines so far: on the left my AVX512 machine, on the right an ASIMD/NEON/aarch64 machine.
>> 
>> Here are the interesting `int / long / float / double` results; discussion further below:
>> ![image](https://github.com/user-attachments/assets/20abfa7b-aee6-4654-bf4d-e3abc4bbfc8b)
>> 
>> 
>> And here are the less spectacular `byte / char / short` results. These cases do not vectorize at all. But there seems to be an issue with over-unrolling on my AVX512 machine: one case I looked at unrolls only 4x without SuperWord, but 16x with it, and that seems to be unfavourable.
>> 
>> ![image](https://github.com/user-attachments/assets/6e1c69cf-db6c-4d33-8750-c8797ffc39a2)
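>> 
>> To verify the unrolling behaviour, one can rerun the benchmarks with SuperWord disabled and compare the generated code in the perfasm output. Something like this should work (a sketch; `-XX:-UseSuperWord` is appended to the forked JMH VMs via JMH's `-jvmArgsAppend`):
>> `make test TEST="micro:vm.compiler.VectorReduction2" CONF=linux-x64 MICRO="OPTIONS=-prof perfasm -jvmArgsAppend -XX:-UseSuperWord"`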
>> 
>> Here is the PDF:
>> [benchmark_results.pdf](https://github.com/user-attachments/files/17027695/benchmark_results.pdf)
>> 
>> 
>> **Why do all the ...Simple benchmarks fail to vectorize, i.e. why are they "not profitable"?**
>> 
>> Apparently, there must be sufficient "work" vectors to outweigh the "reduction" vectors.
>> The idea used to be that one should have at least 2 work vectors, which tend to be profitable, to outweigh the cost of a single reduction vector.
>> 
>>   // Check if reductions are connected
>>   if (is_marked_reduction(p0)) {
>>     Node* second_in = p0->in(2);
>>     Node_List* second_pk = get_pack(second_in);
>>     if ((second_pk == nullptr) || (_num_work_vecs == _num_reductions)) {
>>       // No parent pack or not enough work
>>       // to cover reduction expansion overhead
>>       return false;
>>     } else if (second_pk->size() != p->size()) {
>>       return false;
>>     }
>>   }
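>> 
>> To make this concrete, the two kinds of kernels look roughly like this. This is only an illustrative sketch, not the actual VectorReduction2 code, and the method names are made up:
>> 
>>   // "Simple" reduction: the loop body only loads and accumulates. There is no
>>   // extra "work" vector to outweigh the reduction overhead, so the check above
>>   // rejects the reduction pack as "not profitable".
>>   static int intAddSimple(int[] a) {
>>     int sum = 0;
>>     for (int i = 0; i < a.length; i++) {
>>       sum += a[i];
>>     }
>>     return sum;
>>   }
>> 
>>   // Dot-product style reduction: the multiply contributes additional work
>>   // vectors per reduction vector, so the check passes and the loop vectorizes.
>>   static int intAddDotProduct(int[] a, int[] b) {
>>     int sum = 0;
>>     for (int i = 0; i < a.length; i++) {
>>       sum += a[i] * b[i];
>>     }
>>     return sum;
>>   }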
>> 
>> 
>> ...
>
> Looks nice, the benchmark is very thorough! I was interested to see how it performed on my Zen 3 (AVX2) machine, so I've attached the results here in case they are interesting/useful: [perf_results.txt](https://github.com/user-attachments/files/17037796/perf_results.txt)

@jaskarth @vnkozlov thanks for the review!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21032#issuecomment-2358280361

