Intel AMX and feature detection

Robert Muir rcmuir at gmail.com
Tue Jun 25 21:58:40 UTC 2024


On Tue, Jun 25, 2024 at 5:47 PM Robert Muir <rcmuir at gmail.com> wrote:
>
> On Tue, Jun 25, 2024 at 2:37 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
> > It would be useful to understand more about why you needed to avoid FMA on Apple Silicon and what limitations you hit for AVX-512 (it's particularly challenging, Intel vs AMD, in some cases with AVX-512). It may be that in many cases accessing the CPU flags is useful to you because you are trying to work around limitations in certain hardware that the current Vector API implementation is not aware of (likely the auto-vectorizer may not be either)?
> >
>
> For the case of FMA: on some CPUs it gives worse performance than
> separate mul+add. For example, AMD CPUs until recently, when the FMA
> latency was reduced. So we want to use whichever is fastest, since we
> don't care about an ulp here (we are using the Vector API with
> floating point).
>
> For the case of AVX-512: it is about avoiding downclocking. We got a
> report of a performance regression from a user and tracked it down to
> this with https://github.com/travisdowns/avx-turbo. So it is better to
> avoid 512-bit multiply. The "PREFERRED" species says 512 bits, but for
> that instruction it is not what you want.

I forgot to add a link to the current (ugly, bad!, hacky!) heuristics we use:

https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/Constants.java#L88-L130

Really we just want to know the CPU family, but we can't get it
portably, so we "infer" it from other JVM options that OpenJDK sets
based on the CPU family.
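
As a sketch of that trick (my illustration, not Lucene's actual code): HotSpot exposes the flags it derived from CPU detection through HotSpotDiagnosticMXBean, so a library can read a flag like UseAVX back to guess what the hardware supports. Flag names are HotSpot-specific and architecture-specific, so a robust caller must treat every lookup as optional:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.Optional;

public class VmFeatureSniff {
    // Returns the value of a HotSpot VM flag, or empty if this JVM
    // does not define the flag (wrong vendor, wrong architecture, ...).
    static Optional<String> readVmOption(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return Optional.empty(); // not a HotSpot-compatible JVM
            }
            return Optional.of(bean.getVMOption(name).getValue());
        } catch (IllegalArgumentException | UnsupportedOperationException e) {
            return Optional.empty(); // flag not present on this JVM/CPU
        }
    }

    public static void main(String[] args) {
        // UseAVX is only defined on x86 HotSpot; elsewhere this prints the fallback.
        System.out.println("UseAVX = "
            + readVmOption("UseAVX").orElse("(not available)"));
    }
}
```

This is exactly the kind of inference the linked Constants.java performs: the flag's value reflects what OpenJDK decided about the CPU, not what the application asked for, which makes it a usable (if ugly) proxy for the CPU family.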


More information about the panama-dev mailing list