Matrix Multiplication with Vector API - weird behaviour
Paul Sandoz
paul.sandoz at oracle.com
Fri Nov 18 18:48:13 UTC 2022
Hi Martin,
Thank you for doing this. This would make a good contribution to the vector benchmarks if you are willing to do so.
As to your question I am unsure; perhaps someone from Intel can chime in regarding the caching behavior?
If you are not already doing so, switch off tiered compilation (-XX:-TieredCompilation) when running the benchmark.
Misaligned vector loads/stores can cause some performance degradation, but I would not expect that much. Running with multiple forks should help average it out (perhaps you are already doing that?). Alternatively, one can run with a HotSpot flag to align allocations to more than the default 8-byte alignment.
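One way to wire both suggestions into JMH is via the fork annotation; this is a sketch (the benchmark class name is illustrative, and the 32-byte alignment value is just one legal power-of-two setting for the real HotSpot flag -XX:ObjectAlignmentInBytes):

```java
import org.openjdk.jmh.annotations.*;

// Run 5 forks, each with C2-only compilation and allocations aligned
// to 32 bytes instead of the default 8 bytes.
@Fork(value = 5, jvmArgsAppend = {
        "-XX:-TieredCompilation",
        "-XX:ObjectAlignmentInBytes=32"
})
@State(Scope.Benchmark)
public class MatMulBench { /* benchmark methods go here */ }
```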
Check using perfasm whether the generated code for the blocked vector kernel differs across the matrix dimensions. Perhaps the different loop bounds cause C2 to compile it differently?
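Concretely, one could run the JMH jar with the perfasm profiler at two sizes and compare the hot loop bodies; the jar, benchmark, and parameter names below are illustrative, not taken from the gist:

```shell
# Capture perfasm output for two matrix sizes, then diff the hot loops.
java -jar benchmarks.jar MatMulBench.blockedAVX512 \
     -p size=512  -prof perfasm > perfasm-512.txt
java -jar benchmarks.jar MatMulBench.blockedAVX512 \
     -p size=1024 -prof perfasm > perfasm-1024.txt
```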
Paul.
> On Nov 18, 2022, at 8:39 AM, Martin Stypinski <mstypinski at gmail.com> wrote:
>
> I recently looked into the Vector API (Java 18 Incubator III) and noticed some strange behavior. You can find my implementations and the raw data of the benchmark, including perfnorm on ubuntu 20.04 LTS, here:
> • Code Examples: https://gist.github.com/Styp/a4b398a0113c3430ebf02f020c1f52ff
> • Benchmark incl. Perfnorm: https://gist.github.com/Styp/e0e2aead7a0c3ba4934c5e5675c5253b
>
> I made a visualization, which is attached here: https://ibb.co/V9wHJd1
>
> To explain a little background:
> • The baseline is the most primitive implementation of matrix multiplication.
> • "Blocked" is the version that is optimized by exploiting cache locality. (Implemented accordingly: https://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf)
> • Simple AVX-256 / Simple AVX-512 is the most straightforward way to use Vector API on the baseline.
> • "Blocked" AVX-256 / Blocked AVX-512 is the implementation of the blocked version using Vector API.
>
> Now my questions and comments:
> • The CPU used in this benchmark is a Xeon Gold 6212U with 768KB L1D, 1MB L2, and a shared 35.75MB L3 cache.
> • The speed-up between Baseline and Blocked looks perfectly fine to me.
> • What is happening around 1024x1024 array size? The bigger the array size, the more benefit the cache-local implementation provides?!
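For reference, the cache-blocking scheme Martin describes (following the CS:APP blocking write-up) can be sketched in plain Java. The class name, the ikj loop order, and BLOCK = 64 are illustrative choices, not taken from his gist; matrices are assumed square, row-major, flattened into double[n * n]:

```java
public class BlockedMatMul {
    static final int BLOCK = 64; // tile edge; a tuning parameter

    // Baseline: straightforward triple loop (ikj order streams through b and c).
    static double[] multiplyBaseline(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++)
                    c[i * n + j] += aik * b[k * n + j];
            }
        return c;
    }

    // Blocked: iterate over BLOCK x BLOCK tiles so each tile's working set
    // stays resident in L1/L2 cache while it is being reused.
    static double[] multiplyBlocked(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    for (int i = ii; i < Math.min(ii + BLOCK, n); i++)
                        for (int k = kk; k < Math.min(kk + BLOCK, n); k++) {
                            double aik = a[i * n + k];
                            for (int j = jj; j < Math.min(jj + BLOCK, n); j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
        return c;
    }
}
```

Both variants compute the same product; only the traversal order changes, which is why the payoff grows once a full row of the matrix no longer fits in cache.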
More information about the panama-dev mailing list