Matrix Multiplication with Vector API - weird behaviour

Martin Stypinski mstypinski at gmail.com
Fri Nov 18 16:39:32 UTC 2022


I recently looked into the Vector API (JDK 18, third incubator) and noticed
some strange behavior. You can find my implementations and the raw benchmark
data, including perfnorm output on Ubuntu 20.04 LTS, here:

   - Code Examples:
   https://gist.github.com/Styp/a4b398a0113c3430ebf02f020c1f52ff
   - Benchmark incl. Perfnorm:
   https://gist.github.com/Styp/e0e2aead7a0c3ba4934c5e5675c5253b


I made a visualization of the results, which you can find here: https://ibb.co/V9wHJd1


To give a little background:

   - The baseline is the most primitive implementation of matrix
   multiplication.
   - "Blocked" is the version optimized for cache locality, implemented
   following https://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf.
   - Simple AVX-256 / Simple AVX-512 is the most straightforward way to
   apply the Vector API to the baseline (see the sketch after this list).
   - Blocked AVX-256 / Blocked AVX-512 is the blocked version implemented
   with the Vector API.

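For context, here is a minimal sketch of roughly what I mean by the "simple"
vectorized kernel, assuming square, row-major float[] matrices and a
zero-initialized result array; class and method names here are just
illustrative, the actual implementations are in the gist linked above
(run with --add-modules jdk.incubator.vector on JDK 18):

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

class SimpleAvx256MatMul {
    // 256-bit species, i.e. 8 floats per vector (matches AVX-256).
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;

    // C += A * B for square, row-major n x n float matrices.
    // Assumes c[] is zero-initialized by the caller.
    static void multiply(float[] a, float[] b, float[] c, int n) {
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                // Broadcast a[i][k] once, then stream along row k of B and row i of C.
                FloatVector aik = FloatVector.broadcast(SPECIES, a[i * n + k]);
                int j = 0;
                for (; j < SPECIES.loopBound(n); j += SPECIES.length()) {
                    FloatVector bv = FloatVector.fromArray(SPECIES, b, k * n + j);
                    FloatVector cv = FloatVector.fromArray(SPECIES, c, i * n + j);
                    aik.fma(bv, cv).intoArray(c, i * n + j);
                }
                // Scalar tail when n is not a multiple of the vector length.
                for (; j < n; j++) {
                    c[i * n + j] += a[i * n + k] * b[k * n + j];
                }
            }
        }
    }
}

The blocked AVX variants combine this kind of inner loop with the blocking
scheme from the CMU handout.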

Now my questions and comments:

   - The CPU used in this benchmark is a Xeon Gold 6212U with 768 KB of L1D
   cache, 1 MB of L2 cache, and a shared 35.75 MB L3 cache.
   - The speed-up from Baseline to Blocked looks perfectly fine to me.
   - What is happening around a matrix size of 1024x1024? The larger the
   matrices, the more benefit the cache-local implementation provides?!
   (See the rough working-set numbers after this list.)
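
For reference, the back-of-the-envelope working-set arithmetic I have in
mind; this is plain arithmetic, not a measurement, and the hypothetical
snippet below just prints the footprint against the cache sizes listed
above:

public class FootprintCheck {
    public static void main(String[] args) {
        int n = 1024;
        long perMatrix = (long) n * n * Float.BYTES;   // 1024 * 1024 * 4 bytes = 4 MiB
        long footprint = 3 * perMatrix;                // A, B and C together: ~12 MiB
        System.out.printf("one %dx%d float matrix: %d KiB%n", n, n, perMatrix / 1024);
        System.out.printf("A + B + C working set : %d KiB%n", footprint / 1024);
        // ~12 MiB is well above the 1 MB L2 cache but still below the
        // 35.75 MB shared L3, so around this size the working set stops
        // fitting comfortably in L2, which is presumably where blocking
        // starts to pay off.
    }
}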