Intel AMX and feature detection
John Rose
john.r.rose at oracle.com
Wed Jun 26 22:55:42 UTC 2024
Actually we have VectorShape.S_64_BIT which, if it is
the preferred shape, is really telling you that the
“vector” processing is happening inside the CPU, not the VPU.
That’s a good-enough hint to avoid the Vector API,
right?
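
For example, a minimal sketch of that check (assuming only that the
jdk.incubator.vector module is present):

import jdk.incubator.vector.VectorShape;

// If the preferred shape is just 64 bits wide, the "vector" work is
// likely done in the scalar CPU pipeline, so the Vector API is
// unlikely to pay off.
static boolean vectorApiLooksUseful() {
  return VectorShape.preferredShape() != VectorShape.S_64_BIT;
}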
On 26 Jun 2024, at 15:28, John Rose wrote:
> Random idea of the day: We could overload the preferred
> species mechanism to also say whether any vector at all
> is welcome, by adding a SPECIES_NONE (or SPECIES_SCALAR)
> to the enum… Then you uniformly query the species, and
> on J9 and Graal and C1 you get NONE, on the platforms
> Daniel Lemire mentions you get AVX-512, and on others
> you get other reasonable choices.
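>
> Sketching that hypothetical query (neither SPECIES_NONE nor anything
> like it exists in today's API):
>
> import jdk.incubator.vector.FloatVector;
> import jdk.incubator.vector.VectorSpecies;
>
> // Hypothetical: SPECIES_NONE is not part of the current API.
> static boolean useVectorApi() {
>   VectorSpecies<Float> s = FloatVector.SPECIES_PREFERRED;
>   return s != FloatVector.SPECIES_NONE;  // hypothetical constant
> }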
>
> On 26 Jun 2024, at 7:11, Uwe Schindler wrote:
>
>> Hi,
>>
>> I just want to explain a bit of the difference between your
>> statement and the Panama Vector API:
>>
>> On 25.06.2024 at 20:04, Paul Sandoz wrote:
>>> Hi Uwe,
>>>
>>> The last two links are the same, was that intended?
>>
>> Sorry, the last link should have gone to the actual implementation
>> and usage of Panama APIs:
>> https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
>>
>> This file actually uses the Panama Vector API, but it has some static
>> final constants through which the code calling the Vector API selects
>> a different implementation depending on bit size and some minor
>> differences between silicon types. Sorry for the wrong link, a
>> copy/paste problem! My fault!
>>
>> It would be nice to see this as an example of the Panama Vector API
>> being used in the wild!
>>
>> Maybe let's keep the FMA yes/no discussion out of the Panama context,
>> as it also affects Math.fma. I agree that in Lucene's code we need
>> some special "hacks" to figure out whether FMA works well. So let's
>> ignore that. See my "P.S." at the end of this mail about FMA in
>> general!
>>
>>> I think there are two cases here:
>>>
>>> 1. The JVM configuration does not support the direct compilation of
>>> *any* Vector API expressions to vector hardware instructions.
>>
>> That's exactly what we would like to know: there must be a way to
>> figure out beforehand whether the code you intend to write with the
>> Panama Vector API is actually fast and gets optimized by the JVM.
>> There are many traps:
>>
>> * If you use Graal: it's damn slow, forget about using it!
>> * If you use Eclipse OpenJ9: damn slow, forget about using it!
>> * If Hotspot is in client mode (no C2 enabled): damn slow, forget
>> about using it!
>>
>> Basically this is a big on/off switch, and code should easily be able
>> to detect it beforehand via some general thing like a static getter at
>> the root of the Panama Vector API ("is it optimized at all?"). If
>> Graal, OpenJ9, or Hotspot C1 would just return false, we would be
>> super happy!
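>>
>> For illustration, such a getter could look roughly like this (the
>> name is made up; nothing like it exists today):
>>
>> // Hypothetical API sketch - this method does not exist:
>> // one global switch saying whether Vector API expressions compile
>> // to real vector instructions on this JVM configuration.
>> if (!VectorApiSupport.isOptimized()) {  // hypothetical
>>   // stay with plain scalar code
>> }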
>>
>> This would solve most problems here. The problem of the Panama Vector
>> API is: everybody should normally use scalar code (without Panama)
>> and let Hotspot optimize it. But if we know that Panama would bring a
>> significant improvement, we can spend the time to write a Panama
>> variant. That's what Lucene tries to do. The scalar code is easy to
>> read and runs fast if optimized. But to further improve it (e.g., for
>> floats, where the order of add/mul really matters and Hotspot is
>> limited by its requirement to stay correct) you can make code run
>> 6 times faster with Panama. And only under those circumstances do we
>> really want to execute Panama code.
>>
>> But for this you need to figure out when it makes sense.
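>>
>> To illustrate the 6x case, here is a minimal sketch of such a float
>> kernel (assuming jdk.incubator.vector): the lane-wise accumulation
>> reorders the additions, which Hotspot must not do on its own because
>> it changes the float result:
>>
>> import jdk.incubator.vector.FloatVector;
>> import jdk.incubator.vector.VectorOperators;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
>>
>> // Dot product with SPECIES.length() parallel accumulators; not
>> // bit-identical to the strict left-to-right scalar sum.
>> static float dotProduct(float[] a, float[] b) {
>>   FloatVector acc = FloatVector.zero(SPECIES);
>>   int i = 0;
>>   int bound = SPECIES.loopBound(a.length);
>>   for (; i < bound; i += SPECIES.length()) {
>>     FloatVector va = FloatVector.fromArray(SPECIES, a, i);
>>     FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
>>     acc = va.fma(vb, acc);  // acc += a[i..] * b[i..], per lane
>>   }
>>   float sum = acc.reduceLanes(VectorOperators.ADD);
>>   for (; i < a.length; i++) {
>>     sum += a[i] * b[i];  // scalar tail
>>   }
>>   return sum;
>> }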
>>
>>> 2. The JVM configuration supports the direct compilation of Vector
>>> API expressions but due to hardware restrictions not all expressions
>>> can be compiled optimally. This can be split into two cases
>>> 2.1 generate set of instructions emulating the expression as
>>> optimally as possible for the current hardware (e.g. using blend
>>> instructions for masks); or
>>> 2.2 fallback to Java, which in general is a bug and where it
>>> would be useful to optionally log some sort of warning.
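>>> For instance (a sketch, assuming jdk.incubator.vector), a masked
>>> lane-wise select like the following can be emulated with a blend
>>> instruction on hardware without mask registers:
>>>
>>> import jdk.incubator.vector.FloatVector;
>>> import jdk.incubator.vector.VectorMask;
>>> import jdk.incubator.vector.VectorOperators;
>>>
>>> // Lane-wise max via compare + blend; on AVX2 (no mask registers)
>>> // this can compile to a vblendvps-style instruction.
>>> static FloatVector maxPerLane(FloatVector va, FloatVector vb) {
>>>   VectorMask<Float> m = va.compare(VectorOperators.LT, vb);
>>>   return va.blend(vb, m);  // take vb where va < vb
>>> }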
>>
>> This is exactly the kind of thing we would like to know for some
>> lower-level decisions. The code behind the last link (the one that
>> was unfortunately hidden by my copy/paste mistake) needs to know:
>>
>> * What operators are there at all (like you say).
>> * What bit sizes are there?
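>>
>> Part of this can already be queried today, e.g. (a sketch; but as I
>> note in my earlier mail below, with C2 disabled these still report
>> default values):
>>
>> import jdk.incubator.vector.FloatVector;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> // Preferred species, its bit size and lane count.
>> static void printVectorInfo() {
>>   VectorSpecies<Float> s = FloatVector.SPECIES_PREFERRED;
>>   System.out.println(s.vectorBitSize() + " bits, " + s.length() + " lanes");
>> }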
>>
>>> It would be useful to understand more about why you needed to avoid
>>> FMA on Apple Silicon and what limitations you hit for AVX-512 (it's
>>> particularly challenging for Intel vs. AMD in some cases with
>>> AVX-512). It may be that in many cases accessing the CPU flags is
>>> useful to you because you are trying to work around limitations in
>>> certain hardware that the current Vector API implementation is not
>>> aware of (and likely the auto-vectorizer is not either)?
>> Let's put that aside. Sorry, that's too specific.
>>> Paul.
>>
>> Sorry for the always lengthy mails,
>>
>> Uwe
>>
>> P.S.: Here is some additional discussion about FMA and its
>> implementation in the JDK in general:
>>
>> We would like to have a "speed over correctness" variant in the Math
>> class that falls back to plain mul/add if there is no CPU instruction,
>> and does NOT use BigDecimal. The problem here is that the BigDecimal
>> implementation is a huge trap! (There are issues about this already in
>> the bug tracker, but all are closed with "works as expected, as
>> correctness matters".) If you write code where correctness does not
>> matter and you want the fastest variant of Math#fma (no matter if it
>> is a separate mul/add or a real fma), there should be a way to use it,
>> especially for use cases like machine learning. In financial
>> applications it is of course important that Math.fma() always returns
>> the same result. In Lucene we have a helper method for that: if
>> Hotspot's "UseFMA" flag is detected to be true, we use it (with some
>> stupid hacks for Apple Silicon, but let's keep that out of the
>> discussion):
>>
>> // Use the hardware FMA only where it is known to be fast; otherwise
>> // fall back to a separate multiply and add.
>> private static float fma(float a, float b, float c) {
>>   if (Constants.HAS_FAST_SCALAR_FMA) {
>>     return Math.fma(a, b, c);
>>   } else {
>>     return a * b + c;
>>   }
>> }
>>
>> We don't care about correctness here. It would be cool if something
>> like this existed as an alternative in the Math class: "give us the
>> fastest way to multiply and add three floats, no matter how exact the
>> result is; it should just be fast".
>>
>>>> On Jun 24, 2024, at 5:46 AM, Uwe Schindler <uschindler at apache.org>
>>>> wrote:
>>>>
>>>> Hi,
>>>> I fully agree about the 2nd point. The Vector API requires some
>>>> feature detection, otherwise it is impossible to use it without the
>>>> risk of a dramatic slowdown (40x with Graal or C1 only). In Apache
>>>> Lucene we have support for the Vector API, but based on some best
>>>> guesses from parsing Hotspot's command line flags via MXBeans, we
>>>> decide which of the algorithms in Apache Lucene are delegated to the
>>>> Panama vectorized implementation.
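>>>>
>>>> For example, roughly this kind of flag sniffing (a sketch; the real
>>>> code is linked below and is Hotspot-only):
>>>>
>>>> import java.lang.management.ManagementFactory;
>>>> import com.sun.management.HotSpotDiagnosticMXBean;
>>>>
>>>> // Ask Hotspot for its MaxVectorSize flag; may return null or throw
>>>> // on other JVMs, so callers must catch and fall back.
>>>> static int maxVectorSize() {
>>>>   var bean = ManagementFactory
>>>>       .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
>>>>   return Integer.parseInt(bean.getVMOption("MaxVectorSize").getValue());
>>>> }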
>>>> In addition, the FFM API is also damn slow once you enable Graal or
>>>> disable C2 (e.g., the client VM). So our code is a real
>>>> spaghetti-code mess to detect whether it is useful to switch to
>>>> vectorized impls using Panama-Vector.
>>>> I am planning to submit a feature request about this. It would be
>>>> good to get at least the actual maximum bit size and which of the
>>>> vector operators are supported (like masks, FMA, ...). One problem
>>>> is also that if C2 is disabled, the code returns default values for
>>>> the maximum vector size/species.
>>>> Have a look at these code disasters:
>>>> •https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java#L103-L139
>>>> (worst, it parses Hotspot flags and disables by inspecting system
>>>> properties)
>>>> •https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73
>>>> (this is mostly OK)
>>>> •https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73
>>>> (here we have to use different implementations depending on the
>>>> vector bitsize of the default species, ...)
>>>> Some of that code can't be avoided by a feature detection API, as we
>>>> for example avoid Panama vectors with FMA on Apple Silicon, or avoid
>>>> AVX-512 on some Intel/AMD silicon; not sure what the problem was -
>>>> slowness in some combinations, for sure.
>>>> Uwe
>>>> On 17.06.2024 at 06:26, Andrii Lomakin wrote:
>>>>> Hi guys.
>>>>>
>>>>> I have three questions:
>>>>>
>>>>> 1. Do you plan to add support for Intel AMX instructions?
>>>>> According to Intel reports, they can add a 2-3x speedup in deep
>>>>> learning model inference.
>>>>> 2. The next question follows from the first one. Even now, masks
>>>>> are not supported on every architecture, but AFAIK there is no way
>>>>> to detect at runtime whether they are supported. Do you plan to
>>>>> provide a so-called "feature detection" API?
>>>>> 3. And the last question: even in older instruction sets there are
>>>>> some instructions that use register values as masks - blending, for
>>>>> example. Will those instructions be supported on architectures that
>>>>> do not support mask registers per se?
>>>>>
>>>> --
>>>> Uwe Schindler
>>>> uschindler at apache.org ASF Member, Member of PMC and Committer of
>>>> Apache Lucene and Apache Solr
>>>> Bremen, Germany
>>>> https://lucene.apache.org/
>>>> https://solr.apache.org/
>>
>> --
>> Uwe Schindler
>> uschindler at apache.org ASF Member, Member of PMC and Committer of
>> Apache Lucene and Apache Solr
>> Bremen, Germany
>> https://lucene.apache.org/
>> https://solr.apache.org/