<div dir="ltr"><div dir="ltr">Hi Paul,<div><br></div><div>Thank you for doing this. This would make a good contribution to the vector benchmarks if you are willing to do so:<br></div><div>- Sure, I can do this, provided it takes a reasonable amount of time. I was inspired by Richard Startin's blog: <a href="https://richardstartin.github.io" target="_blank">https://richardstartin.github.io</a><span class="gmail-Apple-converted-space"> </span>and Peter Abeles's Git repos: <a href="https://github.com/lessthanoptimal" target="_blank">https://github.com/lessthanoptimal</a></div><div>Is there a repo with all the performance tests?</div><div><br></div>I already ran the code with the following setup:<br>@Fork(jvmArgsPrepend = {"--add-modules=jdk.incubator.vector",<br>        "-XX:-TieredCompilation",<br>        "-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0"})<span class="gmail-im" style="color:rgb(80,0,80)"><div><br></div><div>Check using perfasm if the blocked vector generated code is different for the different matrix dimensions. Perhaps the different loop bounds result in C2 compiling differently?<br></div></span><div>- I'll do this and get back to you...</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Nov 18, 2022 at 19:48, Paul Sandoz <<a href="mailto:paul.sandoz@oracle.com">paul.sandoz@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Hi Martin,<br>

<br>

Thank you for doing this. This would make a good contribution to the vector benchmarks if you are willing to do so.<br>

<br>

As to your question, I am unsure. Perhaps someone from Intel can chime in regarding caching behavior?<br>

<br>

If you are not already doing so, switch off tiered compilation (-XX:-TieredCompilation) when running the benchmark.<br>

<br>

Misaligned vector loads/stores can result in some performance degradation, but I would not expect that much. Running with multiple forks should help average that out (perhaps you are also doing that?). Alternatively, one can run with a HotSpot flag that aligns allocations to a boundary larger than the default 8 bytes.<br>

<br>

Check using perfasm if the blocked vector generated code is different for the different matrix dimensions. Perhaps the different loop bounds result in C2 compiling differently? <br>

<br>

Paul.<br>

<br>

> On Nov 18, 2022, at 8:39 AM, Martin Stypinski <<a href="mailto:mstypinski@gmail.com" target="_blank">mstypinski@gmail.com</a>> wrote:<br>

> <br>

> I recently looked into the Vector API (Java 18, third incubator) and noticed some strange behavior. You can find my implementations and the raw benchmark data, including perfnorm output on Ubuntu 20.04 LTS, here:<br>

>       • Code Examples: <a href="https://gist.github.com/Styp/a4b398a0113c3430ebf02f020c1f52ff" rel="noreferrer" target="_blank">https://gist.github.com/Styp/a4b398a0113c3430ebf02f020c1f52ff</a><br>

>       • Benchmark incl. Perfnorm: <a href="https://gist.github.com/Styp/e0e2aead7a0c3ba4934c5e5675c5253b" rel="noreferrer" target="_blank">https://gist.github.com/Styp/e0e2aead7a0c3ba4934c5e5675c5253b</a><br>

> <br>

> I made a visualization, which is available here: <a href="https://ibb.co/V9wHJd1" rel="noreferrer" target="_blank">https://ibb.co/V9wHJd1</a><br>

> <br>

> To explain a little background:<br>

>       • The baseline is the most primitive implementation of matrix multiplication.<br>

>       • "Blocked" is the version that is optimized by exploiting cache locality. (Implemented according to: <a href="https://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf" rel="noreferrer" target="_blank">https://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf</a>)<br>

>       • "Simple AVX-256" / "Simple AVX-512" are the most straightforward applications of the Vector API to the baseline.<br>

>       • "Blocked AVX-256" / "Blocked AVX-512" are the implementations of the blocked version using the Vector API.<br>
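
<br>
For readers without the gist at hand, the difference between the baseline and the "blocked" variant listed above can be sketched roughly as follows. This is a minimal scalar illustration in the spirit of the CS:APP blocking note, not the code from the gist: the class name BlockedMatMul, the flat row-major double[] layout, and the blockSize parameter are all assumptions made for the example, and the Vector API is deliberately left out.<br>
<pre>
```java
import java.util.Arrays;

// Hypothetical illustration only; names and layout are assumptions, not the gist's code.
final class BlockedMatMul {

    // Baseline: plain triple loop over row-major n x n matrices.
    static double[] baseline(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
        return c;
    }

    // Blocked: same arithmetic, but k and j are tiled so that a
    // blockSize x blockSize tile of b is reused across all rows i
    // before the next tile is touched.
    static double[] blocked(double[] a, double[] b, int n, int blockSize) {
        double[] c = new double[n * n];
        for (int kk = 0; kk < n; kk += blockSize) {
            int kEnd = Math.min(kk + blockSize, n);   // handles n not divisible by blockSize
            for (int jj = 0; jj < n; jj += blockSize) {
                int jEnd = Math.min(jj + blockSize, n);
                for (int i = 0; i < n; i++) {
                    for (int j = jj; j < jEnd; j++) {
                        double sum = c[i * n + j];    // continue the partial sum from earlier k tiles
                        for (int k = kk; k < kEnd; k++) {
                            sum += a[i * n + k] * b[k * n + j];
                        }
                        c[i * n + j] = sum;
                    }
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 5;
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        for (int i = 0; i < n * n; i++) {
            a[i] = i + 1;
            b[i] = (i % 7) - 3;
        }
        // Both variants add the products for each element in the same order,
        // so the results are expected to be identical.
        System.out.println(Arrays.equals(baseline(a, b, n), blocked(a, b, n, 2)));
    }
}
```
</pre>
The block size would normally be tuned so that one tile of b fits comfortably in L1 or L2; the value 2 in main is only there to exercise the edge handling on a tiny matrix.<br>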

> <br>

> Now my questions and comments:<br>

>       • The CPU used in this benchmark is a Xeon Gold 6212U with 768KB L1D, 1MB L2 and a shared 35.75MB L3 cache<br>

>       • The speed-up between Baseline and Blocked looks perfectly fine to me.<br>

>       • What is happening around the 1024x1024 array size? The bigger the array size, the more benefit the cache-local implementation provides?!<br>

<br>

</blockquote></div></div>