Intel AMX and feature detection

Daniel Lemire daniel at lemire.me
Tue Jun 25 23:07:11 UTC 2024


If the CPU has AVX-512 VBMI2, then it is Ice Lake or better, or AMD Zen 4 or better, and the downclocking should not be a concern.

I would suggest you verify that the issues that have been reported still apply to a CPU with VBMI2.
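One way to check for the feature at runtime is to read the kernel's CPU flags (a sketch, assuming Linux, where the flag appears in /proc/cpuinfo as "avx512_vbmi2"; the helper name is hypothetical, not a library API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class VbmiCheck {
    // Linux-only sketch: scan /proc/cpuinfo for the avx512_vbmi2 flag.
    // Per the note above, this flag implies Ice Lake+ or Zen 4+, where
    // AVX-512 downclocking should no longer be a concern.
    static boolean hasAvx512Vbmi2() {
        Path cpuinfo = Path.of("/proc/cpuinfo");
        try {
            if (Files.exists(cpuinfo)) {
                return Files.readString(cpuinfo).contains("avx512_vbmi2");
            }
        } catch (IOException e) {
            // Unreadable cpuinfo: treat the feature as unknown/absent.
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("avx512_vbmi2 present: " + hasAvx512Vbmi2());
    }
}
```

On non-Linux systems this simply reports false; a production check would query CPUID directly (leaf 7, subleaf 0, ECX bit 6) via native code.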

- Daniel

On Tue, Jun 25, 2024, at 17:47, Robert Muir wrote:
> On Tue, Jun 25, 2024 at 2:37 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:
> 
> > It would be useful to understand more about why you needed to avoid FMA on Apple Silicon and what limitations you hit for AVX-512 (it is particularly challenging with AVX-512, where Intel and AMD differ in some cases). It may be that in many cases accessing the CPU flags is useful to you because you are trying to work around limitations in certain hardware that the current Vector API implementation is not aware of (and likely the auto-vectorizer is not either)?
> >
> 
> For the case of FMA: on some CPUs it gives worse performance than
> mul+add; for example, AMD CPUs until recently, when the FMA latency
> was reduced. So we want to use whichever is fastest, since we don't
> care about an ulp here (we are using the Vector API with floating
> point).
> 
> For the case of AVX-512: it is about avoiding downclocking. We got a
> report of a performance regression from a user and tracked it down to
> this with https://github.com/travisdowns/avx-turbo. So it is better
> to avoid 512-bit multiply. "PREFERRED" says 512 bits, but for that
> instruction it is not what you want.
> 
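The ulp point above can be made concrete: Math.fma rounds once, while a * b + c rounds twice, so the two can differ by an ulp (a minimal scalar sketch; the same trade-off applies to the Vector API's fma versus mul-then-add lanewise operations):

```java
public class FmaDemo {
    public static void main(String[] args) {
        // Chosen so the intermediate product a*b loses its low bits:
        // the fused form keeps them (single rounding), the split form
        // rounds the product first and then adds.
        double a = 1e16;
        double b = 1.0 + Math.ulp(1.0);
        double c = -1e16;

        double fused = Math.fma(a, b, c); // one rounding of a*b + c
        double split = a * b + c;         // a*b rounded, then the add

        System.out.println("fma:     " + fused);
        System.out.println("mul+add: " + split);
        System.out.println("differ:  " + (fused != split));
    }
}
```

When that last bit does not matter, picking whichever form is faster on the target CPU, as described above, costs nothing in practice.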