Intel AMX and feature detection
John Rose
john.r.rose at oracle.com
Wed Jun 26 22:28:29 UTC 2024
Random idea of the day: We could overload the preferred
species mechanism to also say whether any vector at all
is welcome, by adding a SPECIES_NONE (or SPECIES_SCALAR)
to the enum… Then you uniformly query the species, and
on J9 and Graal and C1 you get NONE, on the platforms
Daniel Lemire mentions you get AVX-512, and on others
you get other reasonable choices.
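A rough sketch of how user code might consume that query (SPECIES_NONE
below is the hypothetical sentinel; it does not exist in today's
jdk.incubator.vector API, and the two kernel methods are placeholders):

    VectorSpecies<Float> preferred = FloatVector.SPECIES_PREFERRED;
    if (preferred == FloatVector.SPECIES_NONE) {
        // J9, Graal, C1: no vector is welcome; keep the scalar loops.
        runScalarKernel();
    } else {
        // AVX-512, NEON, etc.: size the kernel to the preferred species.
        runVectorKernel(preferred);
    }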
On 26 Jun 2024, at 7:11, Uwe Schindler wrote:
> Hi,
>
> I just want to explain a bit what the difference between your
> statement and the Panama Vector API is:
>
>> On 25.06.2024 at 20:04, Paul Sandoz wrote:
>> Hi Uwe,
>>
>> The last two links are the same, was that intended?
>
> Sorry, the last link should have gone to the actual implementation and
> usage of Panama APIs:
> https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
>
> This file actually uses the Panama Vector API, but it has some static
> final constants based on which the code calling the Vector API sometimes
> chooses a different implementation, depending on bit size and some minor
> differences between silicon types. Sorry for the wrong link, copy-paste
> problem! My fault!
>
> It would be nice to see this as an example where Panama Vector API is
> used in the wild!
>
> Maybe let's keep the FMA yes/no discussion out of the Panama context,
> as it also affects Math.fma. I agree that in Lucene's code we need
> some special "hacks" to figure out if FMA works well. So let's ignore
> that. See my "P.S." at the end of this mail about FMA in general!
>
>> I think there are two cases here:
>>
>> 1. The JVM configuration does not support the direct compilation of
>> *any* Vector API expressions to vector hardware instructions.
>
> That's exactly what we would like to know: there must be a way to
> figure out beforehand whether the code you intend to write with the
> Panama Vector API is actually fast and gets optimized by the JVM.
> There are many traps:
>
> * If you use Graal: it's damn slow, forget about using it!
> * If you use Eclipse OpenJ9: damn slow, forget about using it!
> * If HotSpot is in client mode (no C2 enabled): damn slow, forget
> about using it!
>
> Basically this is a big on/off switch, and code should easily be able
> to detect it beforehand: some general thing like a static getter at the
> root of the Panama Vector API ("is it optimized at all?"). If Graal,
> OpenJ9, or HotSpot C1 would just return false, we would be super happy!
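>
> Such a getter might look like the following sketch (the class and
> method names are invented for illustration; nothing like this exists
> in the API today):
>
>     // Hypothetical API: a single global capability query.
>     public final class VectorRuntime {
>         // Would return true only when Vector API expressions compile
>         // to real vector instructions; false on C1-only, OpenJ9, or
>         // Graal configurations without intrinsic support.
>         public static boolean isOptimized() {
>             return false; // placeholder body
>         }
>     }
>
>     // Libraries would then pick an implementation once, at startup:
>     // static final boolean USE_PANAMA = VectorRuntime.isOptimized();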
>
> This would solve most problems here. The problem of the Panama Vector
> API is: everybody should normally use scalar code (without Panama) and
> let HotSpot optimize it. But if we know that Panama would bring a
> significant improvement, we can spend time on writing a Panama
> variant. That's what Lucene tries to do. The scalar code is easy to
> read and runs fast if optimized. But to further improve it (e.g., for
> floats, where the order of add/mul really matters and HotSpot is
> limited by its requirement to be correct) you can make code run 6
> times faster with Panama. And only under those circumstances do we
> really want to execute Panama code.
>
> But for this you need to figure out when it makes sense.
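>
> To make the reordering point concrete, here is a minimal sketch (not
> Lucene's actual kernel) of a float dot product in both styles; it needs
> --add-modules jdk.incubator.vector. The scalar adds must stay in source
> order, while the Panama version reassociates explicitly across lanes,
> which is exactly what the auto-vectorizer is not allowed to do for
> floats:
>
>     import jdk.incubator.vector.FloatVector;
>     import jdk.incubator.vector.VectorOperators;
>     import jdk.incubator.vector.VectorSpecies;
>
>     class Dot {
>         static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
>
>         // Scalar: HotSpot must keep the adds in order to stay correct.
>         static float scalarDot(float[] a, float[] b) {
>             float sum = 0f;
>             for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
>             return sum;
>         }
>
>         // Panama: accumulate per lane, reduce once at the end. The
>         // result may differ from scalarDot in the last bits.
>         static float vectorDot(float[] a, float[] b) {
>             var acc = FloatVector.zero(SPECIES);
>             int i = 0;
>             int bound = SPECIES.loopBound(a.length);
>             for (; i < bound; i += SPECIES.length()) {
>                 var va = FloatVector.fromArray(SPECIES, a, i);
>                 var vb = FloatVector.fromArray(SPECIES, b, i);
>                 acc = acc.add(va.mul(vb));
>             }
>             float sum = acc.reduceLanes(VectorOperators.ADD);
>             for (; i < a.length; i++) sum += a[i] * b[i]; // scalar tail
>             return sum;
>         }
>     }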
>
>> 2. The JVM configuration supports the direct compilation of Vector
>> API expressions, but due to hardware restrictions not all expressions
>> can be compiled optimally. This can be split into two cases:
>> 2.1 generate a set of instructions emulating the expression as
>> optimally as possible for the current hardware (e.g. using blend
>> instructions for masks); or
>> 2.2 fall back to Java, which in general is a bug and where it would
>> be useful to optionally log some sort of warning.
>
> This is exactly the kind of thing we would like to know for some
> lower-level decisions. The code behind the last link (the one that was
> unfortunately hidden in my original mail) needs to know:
>
> * What operators are there at all (like you say)?
> * What bit sizes are there?
>
>> It would be useful to understand more why you needed to avoid FMA on
>> Apple Silicon and what limitations you hit for AVX-512 (it's
>> particularly challenging for Intel vs AMD in some cases with AVX-512).
>> It may be that in many cases accessing the CPU flags is useful to you
>> because you are trying to work around limitations in certain hardware
>> that the current Vector API implementation is not aware of (likely the
>> auto-vectorizer may not be either)?
> Let's put that aside. Sorry, that's too specific.
>> Paul.
>
> Sorry for the always lengthy mails,
>
> Uwe
>
> P.S.: Here is some additional discussion about FMA and its
> implementation in the JDK in general:
>
> We would like to have a "speed over correctness" variant in the Math
> class that falls back to plain mul/add if there's no CPU instruction,
> and does NOT use BigDecimal. The problem here is that the BigDecimal
> fallback is a huge trap! (There are issues about this already in the
> bug tracker, but all are closed with "works as expected, as
> correctness matters".) If you write code where exactness does not
> matter and you want the fastest variant of Math#fma (no matter if it
> is a separate mul/add or a fused fma), there should be a way to use it,
> especially for use cases like machine learning. In financial
> applications it is of course important that Math.fma() always returns
> the same result. In Lucene we have a helper method for that: if
> HotSpot's "UseFMA" is detected to be true, we use it (with some stupid
> hacks for Apple Silicon, but let's keep that out of the discussion):
>
> private static float fma(float a, float b, float c) {
>     // Use the fused instruction only where we detected it to be fast;
>     // otherwise a separate multiply and add beats Math.fma's slow
>     // fallback. We accept the last-bit difference in the result.
>     if (Constants.HAS_FAST_SCALAR_FMA) {
>         return Math.fma(a, b, c);
>     } else {
>         return a * b + c;
>     }
> }
>
> We don't care about exactness there. It would be cool if something
> like this existed as an alternative in the Math class: "give us the
> fastest way to multiply and add three floats, no matter how exact it
> is; it should just be fast".
>
>>> On Jun 24, 2024, at 5:46 AM, Uwe Schindler <uschindler at apache.org>
>>> wrote:
>>>
>>> Hi,
>>> I fully agree with the 2nd point. The Vector API requires some feature
>>> detection, otherwise it is impossible to use it without the risk of
>>> a dramatic slowdown (40x with Graal or C1 only). In Apache Lucene we
>>> have support for the Vector API, but based on some best guesses
>>> from parsing HotSpot MXBean command-line flags we decide which of
>>> the algorithms in Apache Lucene are delegated to the Panama
>>> vectorized implementation.
>>> In addition, the FFM API is also damn slow once you enable Graal or
>>> disable C2 (e.g., client VM). So our code is a real spaghetti-code
>>> mess just to detect whether it is useful to switch to vectorized
>>> implementations using Panama Vector.
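>>> The flag probing boils down to something like this stripped-down
>>> sketch (not the exact Lucene code; it assumes a HotSpot JVM with the
>>> jdk.management module present):
>>>
>>>     import java.lang.management.ManagementFactory;
>>>     import com.sun.management.HotSpotDiagnosticMXBean;
>>>
>>>     static boolean isVMOptionTrue(String name) {
>>>         try {
>>>             var bean = ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
>>>             return bean != null
>>>                 && Boolean.parseBoolean(bean.getVMOption(name).getValue());
>>>         } catch (RuntimeException e) {
>>>             return false; // unknown flag or unsupported bean: assume "off"
>>>         }
>>>     }
>>>
>>>     // e.g. isVMOptionTrue("UseFMA") as one input to HAS_FAST_SCALAR_FMA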
>>> I am planning to submit a feature request about this. It would be
>>> good to get at least the actual maximum bit size and which of the
>>> vector operators are supported (like masks, FMA, ...). One problem is
>>> also that if C2 is disabled, the code returns default values for the
>>> maximum vector size/species.
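>>> For the bit size, the best available proxy today is the preferred
>>> species, with exactly the caveat above (a minimal sketch):
>>>
>>>     import jdk.incubator.vector.FloatVector;
>>>
>>>     // e.g. 512 on AVX-512, 128 on NEON; but it still reports a
>>>     // "reasonable" default when C2 is disabled and nothing is
>>>     // actually intrinsified, so it cannot serve as the on/off switch.
>>>     int maxVectorBits = FloatVector.SPECIES_PREFERRED.vectorBitSize();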
>>> Have a look at these code disasters:
>>> • https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java#L103-L139
>>> (worst: it parses HotSpot flags and disables itself by inspecting
>>> system properties)
>>> • https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73
>>> (this is mostly OK)
>>> • https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73
>>> (here we have to use a different implementation depending on the
>>> vector bit size in the default species, ...)
>>> Some of that code can't be avoided by a feature detection API, as we
>>> for example avoid Panama vectors with FMA on Apple Silicon or avoid
>>> AVX-512 on some Intel/AMD silicon; not sure what the problem was,
>>> slowness in some combinations, for sure.
>>> Uwe
>>> On 17.06.2024 at 06:26, Andrii Lomakin wrote:
>>>> Hi guys.
>>>>
>>>> I have three questions:
>>>>
>>>> 1. Do you plan to add support for Intel AMX instructions? According
>>>> to Intel reports, they can give a 2-3x speedup in deep learning
>>>> model inference.
>>>> 2. The next question follows from the first one. Even now, masks are
>>>> not supported on every architecture, but AFAIK there is no way to
>>>> detect whether they are supported at runtime. Do you plan to provide
>>>> a so-called "feature detection" API?
>>>> 3. And the last question: even in older instruction sets there are
>>>> some instructions that use register values as masks (blending, for
>>>> example). Will those instructions be supported on architectures that
>>>> do not support masking registers per se?
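>>>>
>>>> (For reference, a blend in the Vector API looks like the minimal
>>>> sketch below, a lane-wise max; whether it maps to a native blend
>>>> instruction or to mask registers depends on the target:)
>>>>
>>>>     import jdk.incubator.vector.FloatVector;
>>>>     import jdk.incubator.vector.VectorMask;
>>>>     import jdk.incubator.vector.VectorOperators;
>>>>
>>>>     static FloatVector max(FloatVector a, FloatVector b) {
>>>>         VectorMask<Float> aGreater = a.compare(VectorOperators.GT, b);
>>>>         return b.blend(a, aGreater); // take a's lane where a > b
>>>>     }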
>>>>
>>> --
>>> Uwe Schindler
>>> uschindler at apache.org ASF Member, Member of PMC and Committer of
>>> Apache Lucene and Apache Solr
>>> Bremen, Germany
>>> https://lucene.apache.org/
>>> https://solr.apache.org/
>
> --
> Uwe Schindler
> uschindler at apache.org ASF Member, Member of PMC and Committer of
> Apache Lucene and Apache Solr
> Bremen, Germany
> https://lucene.apache.org/
> https://solr.apache.org/