Intel AMX and feature detection

Tue Jun 25 18:04:58 UTC 2024

Hi Uwe,

The last two links are the same, was that intended?

I think there are two cases here:

1. The JVM configuration does not support the direct compilation of *any* Vector API expressions to vector hardware instructions.

2. The JVM configuration supports the direct compilation of Vector API expressions, but due to hardware restrictions not all expressions can be compiled optimally. This can be split into two cases:
  2.1 generate a set of instructions emulating the expression as optimally as possible for the current hardware (e.g. using blend instructions for masks); or
  2.2 fall back to Java, which in general is a bug and where it would be useful to optionally log some sort of warning.
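
To make case 1 concrete, here is a minimal sketch of the kind of heuristic an application can use today to guess whether it is running in such a configuration. It is not Lucene's code and not a JDK API: the class and method names are hypothetical, and it assumes (based on the slowdowns reported in this thread) that only C2 compiles Vector API expressions directly, so anything that rules out C2 is treated as case 1. The flags TieredStopAtLevel and UseJVMCICompiler and the "emulated-client" marker in java.vm.info are HotSpot-specific.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.Optional;

// Hypothetical helper for illustration; not part of the JDK or of Lucene.
final class VectorJitHeuristic {

    /** Reads a HotSpot VM option; empty if the bean or the option is unavailable. */
    static Optional<String> vmOption(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean == null
                ? Optional.empty()
                : Optional.ofNullable(bean.getVMOption(name).getValue());
        } catch (RuntimeException | LinkageError e) {
            // Unknown option, non-HotSpot VM, or management classes not present.
            return Optional.empty();
        }
    }

    /**
     * Pure decision function over already-read values.
     * Assumption: C2 is required for direct compilation of Vector API code.
     */
    static boolean c2LikelyAvailable(Optional<String> tieredStopAtLevel,
                                     Optional<String> useJvmciCompiler,
                                     String vmInfo) {
        // A client-emulating VM runs C1 only.
        if (vmInfo != null && vmInfo.contains("emulated-client")) {
            return false;
        }
        // A JVMCI compiler (e.g. Graal) replaces C2 as the top-tier JIT.
        if (useJvmciCompiler.map(Boolean::parseBoolean).orElse(false)) {
            return false;
        }
        // Tier 4 is C2; a lower stop level means C2 is never reached.
        int level;
        try {
            level = tieredStopAtLevel.map(Integer::parseInt).orElse(4);
        } catch (NumberFormatException e) {
            level = 4; // unparsable value: assume the default
        }
        return level >= 4;
    }

    /** Convenience overload that reads the values from the running VM. */
    static boolean c2LikelyAvailable() {
        return c2LikelyAvailable(
            vmOption("TieredStopAtLevel"),
            vmOption("UseJVMCICompiler"),
            System.getProperty("java.vm.info", ""));
    }

    public static void main(String[] args) {
        System.out.println("C2 likely available: " + c2LikelyAvailable());
    }
}
```

A check like this can only approximate case 1. It says nothing about case 2, where the answer depends on the specific expression and the hardware.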

It would be useful to understand more about why you needed to avoid FMA on Apple Silicon and what limitations you hit for AVX-512 (Intel vs AMD is particularly challenging in some cases with AVX-512). It may be that in many cases accessing the CPU flags is useful to you because you are trying to work around limitations in certain hardware that the current Vector API implementation is not aware of (and likely the auto-vectorizer is not either)?
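
As a way to frame that discussion, the application-side workaround can be sketched as a small policy function. Everything here is hypothetical: the class, the record, the capAvx512 knob (the thread does not say which Intel/AMD parts are affected), and the 128-bit minimum are assumptions for illustration, not Lucene's actual rules. In real code the preferred bit size would come from VectorShape.preferredShape().vectorBitSize(); it is passed in as a plain int here so the sketch does not depend on the incubator module.

```java
import java.util.Locale;

// Hypothetical policy sketch; names and thresholds are assumptions.
final class VectorPolicySketch {

    /** What the application decides to use on this machine. */
    record Policy(boolean useVectorApi, boolean useFma, int maxVectorBits) {}

    static Policy choose(String osName, String osArch, int preferredBits,
                         boolean c2Available, boolean capAvx512) {
        // Assumption: without C2, or with very narrow vectors, stay scalar.
        if (!c2Available || preferredBits < 128) {
            return new Policy(false, false, 0);
        }
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        // "Apple Silicon" approximated as macOS on aarch64.
        boolean appleSilicon = os.contains("mac") && arch.equals("aarch64");
        boolean x86 = arch.equals("amd64") || arch.equals("x86_64");

        // Optionally cap x86 at 256 bits to stay off AVX-512 code paths.
        int bits = (x86 && capAvx512) ? Math.min(preferredBits, 256) : preferredBits;

        // Avoid FMA on Apple Silicon, as described in the quoted mail.
        return new Policy(true, !appleSilicon, bits);
    }

    public static void main(String[] args) {
        // Preferred bit size is supplied on the command line for this sketch.
        int preferredBits = args.length > 0 ? Integer.parseInt(args[0]) : 128;
        Policy p = choose(System.getProperty("os.name", ""),
                          System.getProperty("os.arch", ""),
                          preferredBits, true, false);
        System.out.println(p);
    }
}
```

The point of the sketch is that each branch encodes knowledge about specific hardware that lives outside the Vector API, which is exactly the information it would be good to understand better.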

Paul.

> On Jun 24, 2024, at 5:46 AM, Uwe Schindler <uschindler at apache.org> wrote:
> 
> Hi,
> I fully agree with the 2nd point. The vector API requires some feature detection, otherwise it is impossible to use it without the risk of a dramatic slowdown (40x with Graal or C1 only). In Apache Lucene we have support for the vector API, but we decide which of the algorithms in Apache Lucene are delegated to the Panama vectorized implementation based on best guesses from parsing the command line flags exposed by the HotSpot MXBeans.
> In addition, the FFM API is also damn slow once you enable Graal or disable C2 (e.g., client VM). So our code is a real spaghetti-code mess to detect if it is useful to switch to vectorized impls using Panama-Vector.
> I am planning to submit a feature request about this. It would be good to get at least the actual maximum bit size and which of the vector operators are supported (like masks, FMA, ...). One problem is also that if C2 is disabled, the code returns default values for the maximum vector size/species.
> Have a look at these code disasters:
>     • https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java#L103-L139 (worst, it parses Hotspot flags and disables by inspecting system properties)
>     • https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73 (this is mostly OK)
>     • https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73 (here we have to use different implementations depending on the vector bit size in the default species, ...)
> Some of that code can't be avoided even with a feature detection API, as we, for example, avoid Panama Vectors with FMA on Apple Silicon or avoid AVX-512 on some Intel/AMD silicon. I am not sure what exactly the problem was; slowness in some combinations, for sure.
> Uwe
> On 17.06.2024 at 06:26, Andrii Lomakin wrote:
>> Hi guys.
>> 
>> I have three questions:
>> 
>> 1. Do you plan to add support for Intel AMX instructions? According
>> to Intel reports, it can add 2-3 times speedup in deep learning model
>> inference.
>> 2. The next question follows from the first one. Even now, masks are
>> not supported in every architecture, but AFAIK, there is no way to
>> detect whether they are supported at runtime. Do you plan to provide a
>> so-called "feature detection" API?
>> 3. And the last question: even in older instruction sets, there are
>> some instructions that use register values as masks (blending, for
>> example). Will those instructions be supported on architectures that
>> do not support masking registers per se?
>> 
> -- 
> Uwe Schindler
> uschindler at apache.org 
> ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
> Bremen, Germany
> https://lucene.apache.org/
> https://solr.apache.org/