Intel AMX and feature detection

John Rose john.r.rose at oracle.com
Wed Jun 26 22:28:29 UTC 2024


Random idea of the day:  We could overload the preferred
species mechanism to also say whether any vector at all
is welcome, by adding a SPECIES_NONE (or SPECIES_SCALAR)
to the enum…  Then you uniformly query the species, and
on J9 and Graal and C1 you get NONE, on the platforms
Daniel Lemire mentions you get AVX-512, and on others
you get other reasonable choices.
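
A caller would then write something like this (hypothetical pseudocode; SPECIES_NONE and the two run* methods do not exist today and only illustrate the proposed query):

```java
// Hypothetical: SPECIES_NONE is the proposed "no vectors are welcome" value.
VectorSpecies<Float> s = FloatVector.SPECIES_PREFERRED;
if (s == FloatVector.SPECIES_NONE) {
    runScalarPath();      // J9, Graal, C1: no useful vector support
} else {
    runVectorPath(s);     // AVX-512, NEON, etc.: pick widths from s
}
```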

On 26 Jun 2024, at 7:11, Uwe Schindler wrote:

> Hi,
>
> I just want to explain a bit what the difference between your 
> statement and the Panama Vector API is:
>
> Am 25.06.2024 um 20:04 schrieb Paul Sandoz:
>> Hi Uwe,
>>
>> The last two links are the same, was that intended?
>
> Sorry, the last link should have gone to the actual implementation and 
> usage of Panama APIs: 
> https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
>
> This file actually uses the Panama Vector API, but it also has some 
> static final constants that make the calling code choose a different 
> implementation depending on vector bit size and some minor differences 
> between silicon types. Sorry for the wrong link, a copy-paste problem! 
> My fault!
>
> It would be nice to see this as an example where Panama Vector API is 
> used in the wild!
>
> Maybe let's keep the FMA yes/no discussion out of the Panama context, 
> as it also affects Math.fma. I agree that in Lucene's code we need 
> some special "hacks" to figure out if FMA works well. So let's ignore 
> that; see my "P.S." at the end of this mail about FMA in general!
>
>> I think there are two cases here:
>>
>> 1. The JVM configuration does not support the direct compilation of 
>> *any* Vector API expressions to vector hardware instructions.
>
> That's exactly what we would like to know: there must be a way to 
> figure out beforehand whether the code you intend to write with the 
> Panama Vector API will actually be fast and get optimized by the JVM. 
> There are many traps:
>
>  * If you use Graal: it's damn slow, forget it!
>  * If you use Eclipse OpenJ9: damn slow, forget it!
>  * If Hotspot is in client mode (no C2 enabled): damn slow, forget it!
>
> Basically this is a big on/off switch: code should easily be able to 
> detect this beforehand. Some general thing like a static getter at the 
> root of the Panama Vector API ("is it optimized at all?"). If Graal, 
> OpenJ9, or Hotspot C1 just returned false, we would be super happy!
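> Until such a getter exists, the closest thing is a heuristic probe. 
> The sketch below is hypothetical (the class name, the flag choice, and 
> the "level 4 means C2" assumption are illustrative, not Lucene's actual 
> code); it answers false on anything that is not HotSpot with C2:

```java
import java.lang.management.ManagementFactory;

// Hedged sketch of an "is the Vector API likely intrinsified?" probe.
// HotSpot-only: on Graal/OpenJ9 the MXBean lookup fails or returns null,
// and we conservatively answer false.
class VectorSupportHeuristic {
  static boolean likelyOptimized() {
    try {
      com.sun.management.HotSpotDiagnosticMXBean bean =
          ManagementFactory.getPlatformMXBean(
              com.sun.management.HotSpotDiagnosticMXBean.class);
      if (bean == null) {
        return false; // not a HotSpot VM
      }
      // TieredStopAtLevel below 4 means C2 never runs, so Vector API
      // expressions fall back to the slow Java implementation.
      int stopAt = Integer.parseInt(
          bean.getVMOption("TieredStopAtLevel").getValue());
      return stopAt >= 4;
    } catch (RuntimeException | LinkageError e) {
      return false; // unknown VM or missing flag: assume unoptimized
    }
  }
}
```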
>
> This would solve most problems here. The problem of the Panama Vector 
> API is: everybody should normally use scalar code (without Panama) and 
> let Hotspot optimize it. But if we know that Panama would bring a 
> significant improvement, we can invest time in writing a Panama 
> variant. That's what Lucene tries to do. The scalar code is easy to 
> read and runs fast if optimized. But to improve it further (e.g., for 
> floats, where the order of add/mul really matters and Hotspot is 
> limited by its requirement to stay correct), you can make code run 6 
> times faster with Panama. And only under those circumstances do we 
> really want to execute Panama code.
>
> But for this you need to figure out when it makes sense.
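> As a concrete (simplified, hypothetical) sketch of that float-ordering 
> point: Hotspot must preserve the sequential rounding of a scalar float 
> reduction, so it cannot reassociate it; a programmer who accepts the 
> different rounding can split the accumulator, which is essentially what 
> the Vector API's lanes do:

```java
// Sketch: why float summation order blocks auto-vectorization.
// dotScalar fixes one rounding order; dotReassociated uses four
// independent accumulators (different rounding, much more ILP/SIMD).
class DotProducts {
  static float dotScalar(float[] a, float[] b) {
    float acc = 0f;
    for (int i = 0; i < a.length; i++) {
      acc += a[i] * b[i]; // each add depends on the previous one
    }
    return acc;
  }

  static float dotReassociated(float[] a, float[] b) {
    float a0 = 0f, a1 = 0f, a2 = 0f, a3 = 0f;
    int i = 0;
    for (; i + 3 < a.length; i += 4) { // four independent chains
      a0 += a[i]     * b[i];
      a1 += a[i + 1] * b[i + 1];
      a2 += a[i + 2] * b[i + 2];
      a3 += a[i + 3] * b[i + 3];
    }
    float acc = a0 + a1 + a2 + a3;
    for (; i < a.length; i++) { // scalar tail
      acc += a[i] * b[i];
    }
    return acc;
  }
}
```

> For small exact inputs the two agree, but in general they differ in the 
> last bits, which is exactly why the JIT may not do this on its own.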
>
>> 2. The JVM configuration supports the direct compilation of Vector 
>> API expressions but due to hardware restrictions not all expressions 
>> can be compiled optimally. This can be split into two cases
>>    2.1 generate set of instructions emulating the expression as 
>> optimally as possible for the current hardware (e.g. using blend 
>> instructions for masks); or
>>    2.2 fallback to Java, which in general is a bug and where it would 
>> be useful to optionally log some sort of warning.
>
> This is exactly the kind of thing we would like to know for some 
> lower-level decisions. The code behind the last link (the one that was 
> unfortunately hidden by my original copy-paste mistake) needs to know:
>
>  * What operators are there at all (like you say).
>  * What bit sizes are there?
>
>> It would be useful to understand more about why you needed to avoid 
>> FMA on Apple Silicon and what limitations you hit for AVX-512 (it's 
>> particularly challenging with Intel vs. AMD in some cases with 
>> AVX-512). It may be that in many cases accessing the CPU flags is 
>> useful to you because you are trying to work around limitations in 
>> certain hardware that the current Vector API implementation is not 
>> aware of (and likely the auto-vectorizer is not either)?
> Let's put that aside. Sorry, that's too specific.
>> Paul.
>
> Sorry for the always lengthy mails,
>
> Uwe
>
> P.S.: Here is some additional discussion about FMA and its 
> implementation in JDK in general:
>
> We would like to have a "speed over correctness" variant in the Math 
> class that falls back to a plain mul/add if there's no CPU instruction, 
> and does NOT use BigDecimal. The problem here is that the BigDecimal 
> fallback is a huge trap! (There are issues about this already in the 
> bug tracker, but all are closed with "works as expected, as 
> correctness matters".) If you write code where correctness does not 
> matter and you want the fastest variant of Math#fma (no matter whether 
> it is a separate mul/add or a real fma), there should be a way to use 
> it, especially for use cases like machine learning. In financial 
> applications it is of course important that Math.fma() always returns 
> the same result. In Lucene we have a helper method for that: if 
> Hotspot's "UseFMA" is detected to be true, we use it (with some stupid 
> hacks for Apple Silicon, but let's keep that out of the discussion):
>
>   private static float fma(float a, float b, float c) {
>     if (Constants.HAS_FAST_SCALAR_FMA) {
>       return Math.fma(a, b, c);
>     } else {
>       return a * b + c;
>     }
>   }
>
> We don't care about correctness there. It would be cool if something 
> like this existed as an alternative in the Math class: "give us the 
> fastest way to multiply and add three floats, no matter how exact it 
> is; it should just be fast".
>
>>> On Jun 24, 2024, at 5:46 AM, Uwe Schindler<uschindler at apache.org>  
>>> wrote:
>>>
>>> Hi,
>>> I agree fully with the 2nd point. The Vector API requires some 
>>> feature detection; otherwise it is impossible to use it without the 
>>> risk of a dramatic slowdown (40x with Graal or C1 only). In Apache 
>>> Lucene we have support for the Vector API, but we decide which of 
>>> the algorithms in Apache Lucene are delegated to the Panama 
>>> vectorized implementation based on best guesses from parsing 
>>> HotSpot MXBean command-line flags.
>>> In addition, the FFM API is also damn slow once you enable Graal or 
>>> disable C2 (e.g., client VM). So our code is a real spaghetti-code 
>>> mess to detect if it is useful to switch to vectorized impls using 
>>> Panama-Vector.
>>> I am planning to submit a feature request about this. It would be 
>>> good to get at least the actual maximum bit size and which of the 
>>> vector operators are supported (like masks, FMA, ...). One problem 
>>> is also that if C2 is disabled, the code returns default values for 
>>> the maximum vector size/species.
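>>> For illustration, the only related thing the API exposes today is 
>>> the preferred species, and (as said above) it reports a default 
>>> value even when nothing is intrinsified. A minimal probe (needs 
>>> --add-modules jdk.incubator.vector; the class name is made up):

```java
// Requires: --add-modules jdk.incubator.vector
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

class SpeciesProbe {
  public static void main(String[] args) {
    VectorSpecies<Float> s = FloatVector.SPECIES_PREFERRED;
    // Caveat: with C2 disabled this still prints a default bit size;
    // it does not tell you whether the species is actually intrinsified.
    System.out.println("preferred float species: " + s
        + " (" + s.vectorBitSize() + " bits, " + s.length() + " lanes)");
  }
}
```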
>>> Have a look at these code disasters:
>>>https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java#L103-L139 
>>>  (the worst one: it parses Hotspot flags and disables features by 
>>> inspecting system properties)
>>>https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73 
>>>  (this is mostly OK)
>>>https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73 
>>>  (here we have to use different implementations depending on the 
>>> vector bit size of the default species, ...)
>>> Some of that code can't be avoided by a feature-detection API, as 
>>> we, for example, avoid Panama vectors with FMA on Apple Silicon and 
>>> avoid AVX-512 on some Intel/AMD silicon; I'm not sure what the 
>>> problem was - slowness in some combinations, for sure.
>>> Uwe
>>> Am 17.06.2024 um 06:26 schrieb Andrii Lomakin:
>>>> Hi guys.
>>>>
>>>> I have three questions:
>>>>
>>>> 1. Do you plan to add support for Intel AMX instructions? 
>>>> According to Intel reports, they can add a 2-3x speedup in deep 
>>>> learning model inference.
>>>> 2. The next question follows from the first one. Even now, masks 
>>>> are not supported on every architecture, but AFAIK there is no way 
>>>> to detect at runtime whether they are supported. Do you plan to 
>>>> provide a so-called "feature detection" API?
>>>> 3. And the last question: even in older instruction sets there are 
>>>> some instructions that use register values as masks (blending, for 
>>>> example). Will those instructions be supported on architectures 
>>>> that do not support mask registers per se?
>>>>
>>> -- 
>>> Uwe Schindler
>>> uschindler at apache.org  ASF Member, Member of PMC and Committer of 
>>> Apache Lucene and Apache Solr
>>> Bremen, Germany
>>> https://lucene.apache.org/
>>> https://solr.apache.org/
>
> -- 
> Uwe Schindler
> uschindler at apache.org  ASF Member, Member of PMC and Committer of 
> Apache Lucene and Apache Solr
> Bremen, Germany
> https://lucene.apache.org/
> https://solr.apache.org/


More information about the panama-dev mailing list