Matrix Multiplication with Vector API - weird behaviour

Martin Stypinski mstypinski at gmail.com
Tue Nov 22 17:06:53 UTC 2022


Hi Paul,

Thank you for doing this. This would make a good contribution to the vector
benchmarks if you are willing to do so:
- Sure, I can do this if it can be done in a reasonable amount of time. I
was inspired by Richard Startin's blog (https://richardstartin.github.io)
and Peter Abeles' Git repos (https://github.com/lessthanoptimal).
Is there a repo with all the performance tests?

I am already running the code with the following JMH setup:
@Fork(jvmArgsPrepend = {
        "--add-modules=jdk.incubator.vector",   // enable the incubator module
        "-XX:-TieredCompilation",               // C2 only, no tiered warm-up
        "-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0"}) // skip OOB checks

Check using perfasm if the blocked vector generated code is different for
the different matrix dimensions. Perhaps the different loop bounds result
in C2 compiling differently?
- I'll do this and come back with the results...
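
For the record, here is how I intend to attach perfasm (a sketch using the
JMH runner API; the include regex is a placeholder for my benchmark class,
and perfasm needs Linux perf plus the hsdis disassembler to show the
generated code):

import org.openjdk.jmh.profile.LinuxPerfAsmProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class PerfAsmRunner {
    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include("MatrixMultiplication.*blocked.*") // placeholder regex
                .addProfiler(LinuxPerfAsmProfiler.class)    // dump hottest generated assembly
                .forks(1) // a single fork keeps the assembly output readable
                .build();
        new Runner(opt).run();
    }
}

That should let me diff the generated loops for matrix sizes below and
above 1024x1024.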

On Fri, Nov 18, 2022 at 19:48, Paul Sandoz <
paul.sandoz at oracle.com> wrote:

> Hi Martin,
>
> Thank you for doing this. This would make a good contribution to the
> vector benchmarks if you are willing to do so.
>
> As to your question, I am unsure; perhaps someone from Intel can chime in
> regarding the caching behavior?
>
> If not already doing so, switch off tiered compilation
> (-XX:-TieredCompilation) when running the benchmark.
>
> Misaligned vector loads/stores can result in some performance degradation
> but I would not expect that much. Running with multiple forks should help
> average that out (perhaps you are also doing that?). Or one can run with a
> HotSpot flag to align allocations larger than the default 8-byte alignment.
>
> Check using perfasm if the blocked vector generated code is different for
> the different matrix dimensions. Perhaps the different loop bounds result
> in C2 compiling differently?
>
> Paul.
>
> > On Nov 18, 2022, at 8:39 AM, Martin Stypinski <mstypinski at gmail.com>
> > wrote:
> >
> > I recently looked into the Vector API (Java 18 Incubator III) and
> > noticed some strange behavior. You can find my implementations and the
> > raw data of the benchmark, including perfnorm on Ubuntu 20.04 LTS, here:
> >       • Code Examples:
> > https://gist.github.com/Styp/a4b398a0113c3430ebf02f020c1f52ff
> >       • Benchmark incl. perfnorm:
> > https://gist.github.com/Styp/e0e2aead7a0c3ba4934c5e5675c5253b
> >
> > I made a visualization of the results, which is attached here:
> > https://ibb.co/V9wHJd1
> >
> > To explain a little background:
> >       • The baseline is the most primitive implementation of matrix
> > multiplication.
> >       • "Blocked" is the version optimized to exploit cache locality
> > (implemented following
> > https://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf).
> >       • Simple AVX-256 / Simple AVX-512 is the most straightforward way
> > to use the Vector API on the baseline.
> >       • Blocked AVX-256 / Blocked AVX-512 is the implementation of the
> > blocked version using the Vector API.
> >
> > Now my questions and comments:
> >       • The CPU used in this benchmark is a Xeon Gold 6212U with 768 KB
> > of L1D cache (total), 1 MB of L2 per core, and a shared 35.75 MB L3
> > cache.
> >       • The speed-up between Baseline and Blocked looks perfectly fine
> > to me.
> >       • What is happening around the 1024x1024 array size? The bigger
> > the array size, the more benefit the cache-local implementation
> > provides?!
>
>
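
For anyone skimming the archive: a rough sketch of the kind of blocked +
vectorized kernel discussed above (class name, tile size, and the
divisibility assumptions are mine; the real implementations are in the
gist linked earlier):

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class BlockedVectorMatMul {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
    // Placeholder tile size; tune it so the working set of three tiles
    // fits in cache, per the CMU blocking note linked above.
    static final int BLOCK = 64;

    // Row-major n x n matrices. To keep the sketch short, it assumes n is
    // a multiple of BLOCK and BLOCK a multiple of the vector length.
    static void multiply(float[] a, float[] b, float[] c, int n) {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    for (int i = ii; i < ii + BLOCK; i++)
                        for (int k = kk; k < kk + BLOCK; k++) {
                            // Broadcast a[i][k], then fused multiply-add
                            // along row j: c[i][j] += a[i][k] * b[k][j].
                            var va = FloatVector.broadcast(SPECIES, a[i * n + k]);
                            for (int j = jj; j < jj + BLOCK; j += SPECIES.length()) {
                                var vb = FloatVector.fromArray(SPECIES, b, k * n + j);
                                var vc = FloatVector.fromArray(SPECIES, c, i * n + j);
                                va.fma(vb, vc).intoArray(c, i * n + j);
                            }
                        }
    }
}

On this Xeon, SPECIES_PREFERRED should select 512-bit vectors (16 float
lanes), so each inner-loop store writes a full 64-byte vector; whether
those stores land on cache-line boundaries is exactly where the
allocation-alignment flag mentioned above could matter.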