Intel AMX and feature detection
Robert Muir
rcmuir at gmail.com
Tue Jun 25 21:47:57 UTC 2024
On Tue, Jun 25, 2024 at 2:37 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:
> It would be useful to understand more why you needed to avoid FMA on Apple Silicon and what limitations you hit for AVX-512 (it's particularly challenging Intel vs AMD in some cases with AVX-512). It may be in many cases accessing the CPU flags is useful to you because you are trying to work around limitations in certain hardware that the current Vector API implementation is not aware of (likely the auto-vectorizer may not be either)?
>
For the case of FMA: on some CPUs it gives worse performance than a
separate mul+add. For example, AMD CPUs had higher FMA latency until
recent generations, when it was reduced. So we want to use whichever is
fastest, since we don't care about an ulp here (we are using the Vector
API with floating point).
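To make the ulp point concrete, here is a minimal scalar sketch (plain
Java, no Vector API) comparing fused and unfused evaluation; the same
choice exists lanewise in the Vector API between FloatVector.fma() and
mul().add(). The inputs here are arbitrary illustrative values, not from
the original thread:

```java
public class FmaCompare {
    // Fused multiply-add: one rounding step, possibly different latency.
    static float fused(float a, float b, float c) {
        return Math.fma(a, b, c);
    }

    // Separate multiply then add: two rounding steps, so the result can
    // differ from the fused version by about an ulp.
    static float split(float a, float b, float c) {
        return a * b + c;
    }

    public static void main(String[] args) {
        float a = 0.1f, b = 0.2f, c = 0.3f;
        float f = fused(a, b, c);
        float s = split(a, b, c);
        System.out.println("fused=" + f + " split=" + s
                + " diff=" + Math.abs(f - s));
    }
}
```

If the workload tolerates that last-ulp difference, either form is
acceptable numerically, and the faster one for the target CPU can be
chosen.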
For the case of AVX-512: it is about avoiding downclocking. We got a
performance-regression report from a user and tracked it down to this
with https://github.com/travisdowns/avx-turbo. So it is better to avoid
512-bit multiply. SPECIES_PREFERRED says 512, but for that instruction
it is not what you want.
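One way to act on that, sketched below under the assumption that capping
at 256 bits is the right call for the workload: pick the preferred
species only when it is at most 256 bits wide, otherwise fall back to
the 256-bit species. This needs the incubator module
(--add-modules jdk.incubator.vector):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesChoice {
    // SPECIES_PREFERRED can report 512 bits on AVX-512 hardware, but
    // heavy 512-bit instructions (e.g. multiply) can trigger
    // downclocking on some Intel CPUs, so cap the width at 256 bits.
    static final VectorSpecies<Float> SPECIES =
            FloatVector.SPECIES_PREFERRED.vectorBitSize() > 256
                    ? FloatVector.SPECIES_256
                    : FloatVector.SPECIES_PREFERRED;

    public static void main(String[] args) {
        System.out.println("using " + SPECIES.vectorBitSize()
                + "-bit vectors, " + SPECIES.length() + " lanes");
    }
}
```

This is a blanket cap; a finer-grained approach would consult the CPU
flags the thread is discussing, which the Vector API does not currently
expose.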
More information about the panama-dev mailing list