Intel AMX and feature detection
John Rose
john.r.rose at oracle.com
Wed Jun 26 22:55:42 UTC 2024
Actually we have VectorShape.S_64_BIT which, if it is
the preferred shape, is really telling you that the
“vector” processing is happening inside the CPU, not the VPU.
That’s a good-enough hint to avoid the Vector API,
right?
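
For example, a minimal sketch of that check (assuming only that the
jdk.incubator.vector module is present):

import jdk.incubator.vector.VectorShape;

// If the preferred shape is just 64 bits wide, the "vector" work is
// likely done in the scalar CPU pipeline, so the Vector API is
// unlikely to pay off.
static boolean vectorApiLooksUseful() {
  return VectorShape.preferredShape() != VectorShape.S_64_BIT;
}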
On 26 Jun 2024, at 15:28, John Rose wrote:
> Random idea of the day: We could overload the preferred
> species mechanism to also say whether any vector at all
> is welcome, by adding a SPECIES_NONE (or SPECIES_SCALAR)
> to the enum… Then you uniformly query the species, and
> on J9 and Graal and C1 you get NONE, on the platforms
> Daniel Lemire mentions you get AVX-512, and on others
> you get other reasonable choices.
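>
> Sketching that hypothetical query (neither SPECIES_NONE nor anything
> like it exists in today's API):
>
> import jdk.incubator.vector.FloatVector;
> import jdk.incubator.vector.VectorSpecies;
>
> // Hypothetical: SPECIES_NONE is not part of the current API.
> static boolean useVectorApi() {
>   VectorSpecies<Float> s = FloatVector.SPECIES_PREFERRED;
>   return s != FloatVector.SPECIES_NONE;  // hypothetical constant
> }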
>
> On 26 Jun 2024, at 7:11, Uwe Schindler wrote:
>
>> Hi,
>>
>> I just want to explain a bit of the difference between your
>> statement and the Panama Vector API:
>>
>> On 25.06.2024 at 20:04, Paul Sandoz wrote:
>>> Hi Uwe,
>>>
>>> The last two links are the same, was that intended?
>>
>> Sorry, the last link should have gone to the actual implementation
>> and usage of Panama APIs:
>> https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
>>
>> This file actually uses the Panama Vector API, but it has some static
>> final constants through which the code calling the Vector API selects
>> a different implementation depending on bit size and some minor
>> differences between silicon types. Sorry for the wrong link, a
>> copy/paste problem! My fault!
>>
>> It would be nice to see this as an example of the Panama Vector API
>> being used in the wild!
>>
>> Maybe let's keep the FMA yes/no discussion out of the Panama context,
>> as it also affects Math.fma. I agree that in Lucene's code we need
>> some special "hacks" to figure out whether FMA works well. So let's
>> ignore that. See my "P.S." at the end of this mail about FMA in
>> general!
>>
>>> I think there are two cases here:
>>>
>>> 1. The JVM configuration does not support the direct compilation of
>>> *any* Vector API expressions to vector hardware instructions.
>>
>> That's exactly what we would like to know: there must be a way to
>> figure out beforehand whether the code you intend to write with the
>> Panama Vector API is actually fast and gets optimized by the JVM.
>> There are many traps:
>>
>> * If you use Graal: it's damn slow, forget about using it!
>> * If you use Eclipse OpenJ9: damn slow, forget about using it!
>> * If Hotspot is in client mode (no C2 enabled): damn slow, forget
>> about using it!
>>
>> Basically this is a big on/off switch, and code should easily be able
>> to detect it beforehand via some general thing like a static getter at
>> the root of the Panama Vector API ("is it optimized at all?"). If
>> Graal, OpenJ9, or Hotspot C1 would just return false, we would be
>> super happy!
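>>
>> For illustration, such a getter could look roughly like this (the
>> name is made up; nothing like it exists today):
>>
>> // Hypothetical API sketch - this method does not exist:
>> // one global switch saying whether Vector API expressions compile
>> // to real vector instructions on this JVM configuration.
>> if (!VectorApiSupport.isOptimized()) {  // hypothetical
>>   // stay with plain scalar code
>> }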
>>
>> This would solve most problems here. The problem of the Panama Vector
>> API is: everybody should normally use scalar code (without Panama)
>> and let Hotspot optimize it. But if we know that Panama would bring a
>> significant improvement, we can spend the time to write a Panama
>> variant. That's what Lucene tries to do. The scalar code is easy to
>> read and runs fast if optimized. But to further improve it (e.g., for
>> floats, where the order of add/mul really matters and Hotspot is
>> limited by its requirement to stay correct) you can make code run
>> 6 times faster with Panama. And only under those circumstances do we
>> really want to execute Panama code.
>>
>> But for this you need to figure out when it makes sense.
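>>
>> To illustrate the 6x case, here is a minimal sketch of such a float
>> kernel (assuming jdk.incubator.vector): the lane-wise accumulation
>> reorders the additions, which Hotspot must not do on its own because
>> it changes the float result:
>>
>> import jdk.incubator.vector.FloatVector;
>> import jdk.incubator.vector.VectorOperators;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
>>
>> // Dot product with SPECIES.length() parallel accumulators; not
>> // bit-identical to the strict left-to-right scalar sum.
>> static float dotProduct(float[] a, float[] b) {
>>   FloatVector acc = FloatVector.zero(SPECIES);
>>   int i = 0;
>>   int bound = SPECIES.loopBound(a.length);
>>   for (; i < bound; i += SPECIES.length()) {
>>     FloatVector va = FloatVector.fromArray(SPECIES, a, i);
>>     FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
>>     acc = va.fma(vb, acc);  // acc += a[i..] * b[i..], per lane
>>   }
>>   float sum = acc.reduceLanes(VectorOperators.ADD);
>>   for (; i < a.length; i++) {
>>     sum += a[i] * b[i];  // scalar tail
>>   }
>>   return sum;
>> }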
>>
>>> 2. The JVM configuration supports the direct compilation of Vector
>>> API expressions but due to hardware restrictions not all expressions
>>> can be compiled optimally. This can be split into two cases
>>> 2.1 generate set of instructions emulating the expression as
>>> optimally as possible for the current hardware (e.g. using blend
>>> instructions for masks); or
>>> 2.2 fallback to Java, which in general is a bug and where it
>>> would be useful to optionally log some sort of warning.
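>>> For instance (a sketch, assuming jdk.incubator.vector), a masked
>>> lane-wise select like the following can be emulated with a blend
>>> instruction on hardware without mask registers:
>>>
>>> import jdk.incubator.vector.FloatVector;
>>> import jdk.incubator.vector.VectorMask;
>>> import jdk.incubator.vector.VectorOperators;
>>>
>>> // Lane-wise max via compare + blend; on AVX2 (no mask registers)
>>> // this can compile to a vblendvps-style instruction.
>>> static FloatVector maxPerLane(FloatVector va, FloatVector vb) {
>>>   VectorMask<Float> m = va.compare(VectorOperators.LT, vb);
>>>   return va.blend(vb, m);  // take vb where va < vb
>>> }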
>>
>> This is exactly the kind of thing we would like to know for some
>> lower-level decisions. The code behind the last link (the one that
>> was unfortunately hidden by my copy/paste mistake) needs to know:
>>
>> * What operators are there at all (like you say).
>> * What bit sizes are there?
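>>
>> Part of this can already be queried today, e.g. (a sketch; but as I
>> note in my earlier mail below, with C2 disabled these still report
>> default values):
>>
>> import jdk.incubator.vector.FloatVector;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> // Preferred species, its bit size and lane count.
>> static void printVectorInfo() {
>>   VectorSpecies<Float> s = FloatVector.SPECIES_PREFERRED;
>>   System.out.println(s.vectorBitSize() + " bits, " + s.length() + " lanes");
>> }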
>>
>>> It would be useful to understand more about why you needed to avoid
>>> FMA on Apple Silicon and what limitations you hit for AVX-512 (it's
>>> particularly challenging for Intel vs. AMD in some cases with
>>> AVX-512). It may be that in many cases accessing the CPU flags is
>>> useful to you because you are trying to work around limitations in
>>> certain hardware that the current Vector API implementation is not
>>> aware of (and likely the auto-vectorizer is not either)?
>> Let's put that aside. Sorry, that's too specific.
>>> Paul.
>>
>> Sorry for the always lengthy mails,
>>
>> Uwe
>>
>> P.S.: Here is some additional discussion about FMA and its
>> implementation in the JDK in general:
>>
>> We would like to have a "speed over correctness" variant in the Math
>> class that falls back to plain mul/add if there is no CPU instruction,
>> and does NOT use BigDecimal. The problem here is that the BigDecimal
>> implementation is a huge trap! (There are issues about this already in
>> the bug tracker, but all are closed with "works as expected, as
>> correctness matters".) If you write code where correctness does not
>> matter and you want the fastest variant of Math#fma (no matter if it
>> is a separate mul/add or a real fma), there should be a way to use it,
>> especially for use cases like machine learning. In financial
>> applications it is of course important that Math.fma() always returns
>> the same result. In Lucene we have a helper method for that: if
>> Hotspot's "UseFMA" flag is detected to be true, we use it (with some
>> stupid hacks for Apple Silicon, but let's keep that out of the
>> discussion):
>>
>> // Use the hardware FMA only where it is known to be fast; otherwise
>> // fall back to a separate multiply and add.
>> private static float fma(float a, float b, float c) {
>>   if (Constants.HAS_FAST_SCALAR_FMA) {
>>     return Math.fma(a, b, c);
>>   } else {
>>     return a * b + c;
>>   }
>> }
>>
>> We don't care about correctness here. It would be cool if something
>> like this existed as an alternative in the Math class: "give us the
>> fastest way to multiply and add three floats, no matter how exact the
>> result is; it should just be fast".
>>
>>>> On Jun 24, 2024, at 5:46 AM, Uwe Schindler <uschindler at apache.org>
>>>> wrote:
>>>>
>>>> Hi,
>>>> I fully agree about the 2nd point. The Vector API requires some
>>>> feature detection, otherwise it is impossible to use it without the
>>>> risk of a dramatic slowdown (40x with Graal or C1 only). In Apache
>>>> Lucene we have support for the Vector API, but based on some best
>>>> guesses from parsing Hotspot's command line flags via MXBeans, we
>>>> decide which of the algorithms in Apache Lucene are delegated to the
>>>> Panama vectorized implementation.
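>>>>
>>>> For example, roughly this kind of flag sniffing (a sketch; the real
>>>> code is linked below and is Hotspot-only):
>>>>
>>>> import java.lang.management.ManagementFactory;
>>>> import com.sun.management.HotSpotDiagnosticMXBean;
>>>>
>>>> // Ask Hotspot for its MaxVectorSize flag; may return null or throw
>>>> // on other JVMs, so callers must catch and fall back.
>>>> static int maxVectorSize() {
>>>>   var bean = ManagementFactory
>>>>       .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
>>>>   return Integer.parseInt(bean.getVMOption("MaxVectorSize").getValue());
>>>> }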
>>>> In addition, the FFM API is also damn slow once you enable Graal or
>>>> disable C2 (e.g., the client VM). So our code is a real
>>>> spaghetti-code mess to detect whether it is useful to switch to
>>>> vectorized impls using Panama-Vector.
>>>> I am planning to submit a feature request about this. It would be
>>>> good to get at least the actual maximum bit size and which of the
>>>> vector operators are supported (like masks, FMA, ...). One problem
>>>> is also that if C2 is disabled, the code returns default values for
>>>> the maximum vector size/species.
>>>> Have a look at these code disasters:
>>>> •https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java#L103-L139
>>>> (worst, it parses Hotspot flags and disables by inspecting system
>>>> properties)
>>>> •https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73
>>>> (this is mostly OK)
>>>> •https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73
>>>> (here we have to use different implementations depending on the
>>>> vector bitsize of the default species, ...)
>>>> Some of that code can't be avoided by a feature detection API, as we
>>>> for example avoid Panama vectors with FMA on Apple Silicon, or avoid
>>>> AVX-512 on some Intel/AMD silicon; not sure what the problem was -
>>>> slowness in some combinations, for sure.
>>>> Uwe
>>>> On 17.06.2024 at 06:26, Andrii Lomakin wrote:
>>>>> Hi guys.
>>>>>
>>>>> I have three questions:
>>>>>
>>>>> 1. Do you plan to add support for Intel AMX instructions?
>>>>> According to Intel reports, they can add a 2-3x speedup in deep
>>>>> learning model inference.
>>>>> 2. The next question follows from the first one. Even now, masks
>>>>> are not supported on every architecture, but AFAIK there is no way
>>>>> to detect at runtime whether they are supported. Do you plan to
>>>>> provide a so-called "feature detection" API?
>>>>> 3. And the last question: even in older instruction sets there are
>>>>> some instructions that use register values as masks - blending, for
>>>>> example. Will those instructions be supported on architectures that
>>>>> do not support mask registers per se?
>>>>>
>>>> --
>>>> Uwe Schindler
>>>> uschindler at apache.org ASF Member, Member of PMC and Committer of
>>>> Apache Lucene and Apache Solr
>>>> Bremen, Germany
>>>> https://lucene.apache.org/
>>>> https://solr.apache.org/
>>
>> --
>> Uwe Schindler
>> uschindler at apache.org ASF Member, Member of PMC and Committer of
>> Apache Lucene and Apache Solr
>> Bremen, Germany
>> https://lucene.apache.org/
>> https://solr.apache.org/