<div dir="ltr"><p style="color:rgb(14,16,26);background-color:transparent;margin-top:0pt;margin-bottom:0pt"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">I recently looked into the Vector API (third incubator, Java 18) and noticed some strange behavior. You can find my implementations and the raw benchmark data, including perfnorm output on Ubuntu 20.04 LTS, here:</span></p><ul style="color:rgb(14,16,26);background-color:transparent;margin-top:0pt;margin-bottom:0pt"><li style="background-color:transparent;margin-top:0pt;margin-bottom:0pt;list-style-type:disc"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">Code examples: <a href="https://gist.github.com/Styp/a4b398a0113c3430ebf02f020c1f52ff">https://gist.github.com/Styp/a4b398a0113c3430ebf02f020c1f52ff</a></span></li><li style="background-color:transparent;margin-top:0pt;margin-bottom:0pt;list-style-type:disc"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">Benchmark incl. 
Perfnorm: <a href="https://gist.github.com/Styp/e0e2aead7a0c3ba4934c5e5675c5253b">https://gist.github.com/Styp/e0e2aead7a0c3ba4934c5e5675c5253b</a></span></li></ul><p style="color:rgb(14,16,26);background-color:transparent;margin-top:0pt;margin-bottom:0pt"><br></p><p style="color:rgb(14,16,26);background-color:transparent;margin-top:0pt;margin-bottom:0pt"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">I made a visualization of the results, available here: <a href="https://ibb.co/V9wHJd1">https://ibb.co/V9wHJd1</a></span></p><p style="color:rgb(14,16,26);background-color:transparent;margin-top:0pt;margin-bottom:0pt"><br></p><p style="color:rgb(14,16,26);background-color:transparent;margin-top:0pt;margin-bottom:0pt"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">Some background:</span></p><ul style="color:rgb(14,16,26);background-color:transparent;margin-top:0pt;margin-bottom:0pt"><li style="background-color:transparent;margin-top:0pt;margin-bottom:0pt;list-style-type:disc"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">The baseline is the naive implementation of matrix multiplication.</span></li><li style="background-color:transparent;margin-top:0pt;margin-bottom:0pt;list-style-type:disc"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">"Blocked" is the version optimized for cache locality. 
(Implemented according to <a href="https://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf">https://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf</a>)</span></li><li style="background-color:transparent;margin-top:0pt;margin-bottom:0pt;list-style-type:disc"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">Simple AVX-256 / Simple AVX-512 is the most straightforward application of the Vector API to the baseline.</span></li><li style="background-color:transparent;margin-top:0pt;margin-bottom:0pt;list-style-type:disc"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">Blocked AVX-256 / Blocked AVX-512 is the blocked version implemented with the Vector API.</span></li></ul><p style="color:rgb(14,16,26);background-color:transparent;margin-top:0pt;margin-bottom:0pt"><br></p><p style="color:rgb(14,16,26);background-color:transparent;margin-top:0pt;margin-bottom:0pt"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">Now my questions and comments:</span></p><ul style="color:rgb(14,16,26);background-color:transparent;margin-top:0pt;margin-bottom:0pt"><li style="background-color:transparent;margin-top:0pt;margin-bottom:0pt;list-style-type:disc"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">The CPU used in this benchmark is a Xeon Gold 6212U with 768KB L1D, 1MB L2 and a shared 35.75MB L3 cache.</span></li><li style="background-color:transparent;margin-top:0pt;margin-bottom:0pt;list-style-type:disc"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">The speed-up from Baseline to Blocked looks perfectly fine to me.</span></li><li style="background-color:transparent;margin-top:0pt;margin-bottom:0pt;list-style-type:disc"><span style="background-color:transparent;margin-top:0pt;margin-bottom:0pt">What is happening around an array size of 1024x1024? 
I would expect the cache-local implementation to provide more benefit the bigger the array gets.</span></li></ul></div>
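<p>For readers who do not have the gist open, here is a minimal sketch of what "Baseline" and "Blocked" refer to. It is not the code from the gist: the class name <code>MatMulSketch</code>, the method names, the flat row-major <code>double[]</code> layout and the <code>blockSize</code> parameter are all assumptions made for illustration. The blocked loop order follows the general scheme in the CS:APP blocking note linked above. The Vector API variants are left out so that the sketch runs on a plain JDK without incubator modules.</p>

```java
import java.util.Arrays;

// Hypothetical sketch, not the benchmarked code: square n x n matrices
// stored row-major in flat double[] arrays.
final class MatMulSketch {

    // Baseline: naive triple loop computing C = A * B.
    static double[] baseline(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
        return c;
    }

    // Blocked: same arithmetic, but the k and j loops are tiled so that a
    // blockSize x blockSize tile of B is reused across all rows of A
    // before the next tile is touched.
    static double[] blocked(double[] a, double[] b, int n, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        double[] c = new double[n * n];
        for (int kk = 0; kk < n; kk += blockSize) {
            int kEnd = Math.min(kk + blockSize, n); // handles n % blockSize != 0
            for (int jj = 0; jj < n; jj += blockSize) {
                int jEnd = Math.min(jj + blockSize, n);
                for (int i = 0; i < n; i++) {
                    for (int j = jj; j < jEnd; j++) {
                        // Continue the partial sum left by earlier k-tiles.
                        double sum = c[i * n + j];
                        for (int k = kk; k < kEnd; k++) {
                            sum += a[i * n + k] * b[k * n + j];
                        }
                        c[i * n + j] = sum;
                    }
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 4;
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        for (int i = 0; i < n * n; i++) {
            a[i] = i + 1; // small integer values, so results are exact
            b[i] = i % 3;
        }
        // Both variants should agree element for element.
        System.out.println(Arrays.equals(baseline(a, b, n), blocked(a, b, n, 2)));
    }
}
```

<p>In this sketch the block size is a free parameter. Whether a given tile of B actually stays in L1D or L2 depends on the block size and the element type, which would be one thing to check against the cache sizes listed above.</p>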