<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body><div style="font-family: sans-serif;"><div class="plaintext" style="white-space: normal;"><p dir="auto">Actually we have VectorShape.S_64_BIT which, if it is
<br>
the preferred shape, is really telling you the
<br>
“vector” processing is inside the CPU, not the VPU.
<br>
That’s a good-enough hint to avoid the Vector API,
<br>
right?</p>
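For what it's worth, that hint can be queried today. A minimal sketch, assuming the runtime exposes jdk.incubator.vector; the 64-bit threshold and its interpretation are this sketch's heuristic, not an official API contract:

```java
import jdk.incubator.vector.VectorShape;

public class PreferredShapeHint {
    // Heuristic (assumption): if the preferred shape is only 64 bits wide,
    // "vector" processing happens in ordinary CPU registers, not the VPU,
    // so the Vector API is unlikely to pay off on this configuration.
    static boolean vectorApiLooksWorthwhile() {
        return VectorShape.preferredShape().vectorBitSize() > 64;
    }

    public static void main(String[] args) {
        System.out.println(vectorApiLooksWorthwhile());
    }
}
```

Run with `--add-modules jdk.incubator.vector`; on most x86-64 and AArch64 machines this prints `true`.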
<p dir="auto">On 26 Jun 2024, at 15:28, John Rose wrote:</p>
</div><blockquote class="embedded" style="margin: 0 0 5px; padding-left: 5px; border-left: 2px solid #777777; color: #777777;"><div id="771B1F1C-6E9F-492D-98EE-256D34044680">
<div style="font-family: sans-serif;">
<div class="plaintext" style="white-space: normal;">
<p dir="auto">Random idea of the day: We could overload the preferred<br>
species mechanism to also say whether any vector at all<br>
is welcome, by adding a SPECIES_NONE (or SPECIES_SCALAR)<br>
to the enum… Then you uniformly query the species, and<br>
on J9 and Graal and C1 you get NONE, on the platforms<br>
Daniel Lemire mentions you get AVX-512, and on others<br>
you get other reasonable choices.</p>
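The idea could look roughly like this. Note that SPECIES_NONE does not exist in the current API: the enum and the lookup below are purely hypothetical sketches of the proposal, with the JIT's capability modeled as plain parameters for illustration:

```java
public class SpeciesQuery {
    // Hypothetical: models the proposed SPECIES_NONE addition; this is not
    // a real jdk.incubator.vector type.
    enum Preferred { NONE, S_128_BIT, S_256_BIT, S_512_BIT }

    // On J9, Graal, or C1-only Hotspot this would report NONE; on the
    // AVX-512 hardware Daniel Lemire mentions, S_512_BIT; on others,
    // another reasonable choice.
    static Preferred preferred(boolean jitVectorizes, int maxVectorBits) {
        if (!jitVectorizes) return Preferred.NONE;
        if (maxVectorBits >= 512) return Preferred.S_512_BIT;
        if (maxVectorBits >= 256) return Preferred.S_256_BIT;
        return Preferred.S_128_BIT;
    }

    public static void main(String[] args) {
        System.out.println(preferred(false, 512)); // NONE
        System.out.println(preferred(true, 512));  // S_512_BIT
    }
}
```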
<p dir="auto">On 26 Jun 2024, at 7:11, Uwe Schindler wrote:</p>
</div>
<blockquote class="embedded" style="margin: 0 0 5px; padding-left: 5px; border-left: 2px solid #777777; color: #777777;">
<div id="CEF0B61B-AFAE-4A68-AEC9-06A7D6D071A7">
<p>Hi,</p>
<p>I just want to explain a bit the difference between your statement and the Panama Vector API:<br></p>
<div class="moz-cite-prefix">On 25.06.2024 at 20:04, Paul Sandoz wrote:<br></div>
<blockquote type="cite" cite="mid:78A74DCA-6E80-4199-94C5-02651D2E2B8C@oracle.com">
<pre class="moz-quote-pre" wrap="">Hi Uwe,
The last two links are the same, was that intended?</pre></blockquote>
<p>Sorry, the last link should have gone to the actual implementation and usage of Panama APIs: <a class="moz-txt-link-freetext" href="https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java">https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java</a></p>
<p>This file actually uses the Panama Vector API, but it has some static final constants that make the code calling the Vector API choose a different implementation depending on bit size and some minor differences between silicon types. Sorry for the wrong link, a copy-paste problem! My fault!</p>
<p><span style="white-space: pre-wrap">It would be nice to see this as an example of the Panama Vector API being used in the wild!</span></p>
<p>Maybe let's keep the FMA yes/no discussion out of the Panama context, as it also affects Math.fma. I agree that in Lucene's code we need some special "hacks" to figure out if FMA works well, so let's ignore that. See my "P.S." at the end of this mail about FMA in general!</p>
<blockquote type="cite" cite="mid:78A74DCA-6E80-4199-94C5-02651D2E2B8C@oracle.com">
<pre class="moz-quote-pre" wrap="">I think there are two cases here:
1. The JVM configuration does not support the direct compilation of *any* Vector API expressions to vector hardware instructions.</pre></blockquote>
<p>That's exactly what we would like to know: there must be a way to figure out beforehand whether the code you intend to write with the Panama Vector API is actually fast and gets optimized by the JVM. There are many traps:<br></p>
<ul>
<li>If you use Graal: it's damn slow, forget about it!</li>
<li>If you use Eclipse OpenJ9: damn slow, forget about it!</li>
<li>If Hotspot is in client mode (no C2 enabled): damn slow, forget about it!<br></li>
</ul>
<p>Basically this is a big on/off switch, and code should be able to detect it easily beforehand: some general thing like a static getter at the root of the Panama Vector API ("is it optimized at all?"). If Graal, OpenJ9, or Hotspot C1 would just return false, we would be super happy!</p>
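Library code would then reduce to a single guard. A sketch, assuming a hypothetical `isOptimized()`-style getter (faked here with a system property so the snippet is self-contained); the point is the dispatch pattern, not the names:

```java
public class VectorGate {
    // Hypothetical switch: a real API would ask the runtime, e.g. something
    // like an isOptimized() getter; here we fake it with a system property.
    static final boolean VECTOR_API_IS_FAST =
        Boolean.getBoolean("vector.api.fast");

    static float dotProduct(float[] a, float[] b) {
        if (VECTOR_API_IS_FAST) {
            // ...delegate to a Panama Vector API implementation here...
        }
        // Scalar fallback: easy to read, fast when C2 optimizes it.
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(dotProduct(new float[] {1, 2}, new float[] {3, 4})); // 11.0
    }
}
```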
This would solve most problems here. The problem with the Panama Vector API is this: everybody should normally use scalar code (without Panama) and let Hotspot optimize it. But if we know that Panama would bring a significant improvement, we can spend time writing a Panama variant; that's what Lucene tries to do. The scalar code is easy to read and runs fast when optimized. But to improve it further (e.g., for floats, where the order of add/mul really matters and Hotspot is limited by its requirement to be correct) you can make code run six times faster with Panama. Only under those circumstances do we really want to execute Panama code.<br>
<p>But for this you need to figure out when it makes sense.</p>
<blockquote type="cite" cite="mid:78A74DCA-6E80-4199-94C5-02651D2E2B8C@oracle.com">
<pre class="moz-quote-pre" wrap="">2. The JVM configuration supports the direct compilation of Vector API expressions but due to hardware restrictions not all expressions can be compiled optimally. This can be split into two cases
2.1 generate set of instructions emulating the expression as optimally as possible for the current hardware (e.g. using blend instructions for masks); or
2.2 fall back to Java, which in general is a bug and where it would be useful to optionally log some sort of warning.
</pre></blockquote>
<p>This is exactly the kind of thing we would like to know for some lower-level decisions. The code behind the last link, which was unfortunately hidden by my original wrong link, needs to know:</p>
<ul>
<li><span style="white-space: pre-wrap">What operators are there at all (like you say).</span></li>
<li><span style="white-space: pre-wrap">What bit sizes are there?</span></li>
</ul>
<blockquote type="cite" cite="mid:78A74DCA-6E80-4199-94C5-02651D2E2B8C@oracle.com">
<pre class="moz-quote-pre" wrap="">It would be useful to understand more why you needed to avoid FMA on Apple Silicon and what limitations you hit for AVX-512 (it's particular challenging Intel vs AMD in some cases with AVX-512). It may be in many cases accessing the CPU flags is useful to you because you are trying to workaround limitations in the certain hardware that the current Vector API implementation is not aware of (likely the auto-vectorizer may not be either)?</pre></blockquote>
Let's put that aside. Sorry, that's too specific.
<blockquote type="cite" cite="mid:78A74DCA-6E80-4199-94C5-02651D2E2B8C@oracle.com">
<pre class="moz-quote-pre" wrap="">Paul.</pre></blockquote>
<p><span style="white-space: pre-wrap">Sorry for the always lengthy mails,</span></p>
<p><span style="white-space: pre-wrap">Uwe</span></p>
<p>P.S.: Here is some additional discussion about FMA and its implementation in JDK in general:</p>
<p>We would like to have a "speed over correctness" variant in the Math class that falls back to plain mul/add if there is no CPU instruction, and does NOT use BigDecimal. The problem here is that the BigDecimal implementation is a huge trap! (There are issues about this already in the bug tracker, but all are closed with "works as expected, as correctness matters".) If you write code where correctness does not matter and you want the fastest variant of Math#fma (no matter whether it is a separate mul/add or a fused fma), there should be a way to use it, especially for use cases like machine learning. In financial applications it is of course important that Math.fma() always returns the same result. In Lucene we have a helper method for that: if Hotspot's "UseFMA" is detected to be true, we use it (with some stupid hacks for Apple Silicon, but let's keep that out of the discussion):</p>
<pre>  private static float fma(float a, float b, float c) {
    if (Constants.HAS_FAST_SCALAR_FMA) {
      return Math.fma(a, b, c);
    } else {
      return a * b + c;
    }
  }</pre>
<p>We don't care about correctness there. It would be cool if something like this existed as an alternative in the Math class: "give us the fastest way to multiply and add three floats, no matter how exact the result is; it should just be fast".</p>
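To make the correctness/speed trade-off concrete, here is a small worked example (standard Java, no assumed APIs) where the fused and the separate path give different answers: the exact product 0.1f × 10 is 1 + 2^-26, so fma keeps the residue while the separate multiply rounds it away before the add:

```java
public class FmaDifference {
    public static void main(String[] args) {
        // Exact product 0.1f * 10 = 1 + 2^-26; fma adds -1 before rounding,
        // leaving exactly 2^-26 (about 1.49e-8).
        float fused = Math.fma(0.1f, 10f, -1f);
        // The separate multiply rounds 1 + 2^-26 to 1.0f first, so the
        // subsequent subtraction yields exactly 0.
        float separate = 0.1f * 10f - 1f;
        System.out.println(fused);    // 1.4901161E-8
        System.out.println(separate); // 0.0
    }
}
```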
<blockquote type="cite" cite="mid:78A74DCA-6E80-4199-94C5-02651D2E2B8C@oracle.com">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">On Jun 24, 2024, at 5:46 AM, Uwe Schindler <a class="moz-txt-link-rfc2396E" href="mailto:uschindler@apache.org"><uschindler@apache.org></a> wrote:
Hi,
I agree fully about the 2nd point. The Vector API requires some feature detection, otherwise it is impossible to use it without the risk of a dramatic slowdown (40x with Graal or C1 only). In Apache Lucene we have support for the Vector API, but based on some best guesses from parsing Hotspot MXBean command-line flags, we decide which of the algorithms in Apache Lucene are delegated to the Panama vectorized implementation.
In addition, the FFM API is also damn slow once you enable Graal or disable C2 (e.g., client VM). So our code is a real spaghetti-code mess just to detect whether it is useful to switch to vectorized impls using Panama-Vector.
I am planning to submit a feature request about this. It would be good to get at least the actual maximum bit size and which of the vector operators are supported (like masks, FMA, ...). One problem is also that, if C2 is disabled, the code returns default values for the maximum vector size/species.
Have a look at these code disasters:
• <a class="moz-txt-link-freetext" href="https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java#L103-L139">https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java#L103-L139</a> (worst, it parses Hotspot flags and disables by inspecting system properties)
• <a class="moz-txt-link-freetext" href="https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73">https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73</a> (this is mostly OK)
• <a class="moz-txt-link-freetext" href="https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73">https://github.com/apache/lucene/blob/3ae59a9809d9239593aa94dcc23f8ce382d59e60/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java#L40-L73</a> (here we have to use a different implementation depending on the vector bitsize of the default species, ...)
Some of that code can't be avoided by a feature-detection API, as we for example avoid Panama vectors with FMA on Apple Silicon or avoid AVX-512 on some Intel/AMD silicon; not sure what the problem was, slowness in some combinations, for sure.
Uwe
On 17.06.2024 at 06:26, Andrii Lomakin wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Hi guys.
I have three questions:
1. Do you plan to add support for Intel AMX instructions? According
to Intel reports, it can add 2-3 times speedup in deep learning model
inference.
2. The next question follows from the first one. Even now, masks are
not supported in every architecture, but AFAIK, there is no way to
detect whether they are supported at runtime. Do you plan to provide a
so-called "feature detection" API?
3. And the last question: even in older instruction sets, there are
some instructions that use register values as masks, blending for
example. Will those instructions be supported on architectures that
do not support masking registers per se?
</pre></blockquote>
<pre class="moz-quote-pre" wrap="">--
Uwe Schindler
<a class="moz-txt-link-abbreviated" href="mailto:uschindler@apache.org">uschindler@apache.org</a>
ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
Bremen, Germany
<a class="moz-txt-link-freetext" href="https://lucene.apache.org/">https://lucene.apache.org/</a>
<a class="moz-txt-link-freetext" href="https://solr.apache.org/">https://solr.apache.org/</a>
</pre></blockquote>
<pre class="moz-quote-pre" wrap="">
</pre></blockquote>
<pre class="moz-signature" cols="72">--
Uwe Schindler
<a class="moz-txt-link-abbreviated" href="mailto:uschindler@apache.org">uschindler@apache.org</a>
ASF Member, Member of PMC and Committer of Apache Lucene and Apache Solr
Bremen, Germany
<a class="moz-txt-link-freetext" href="https://lucene.apache.org/">https://lucene.apache.org/</a>
<a class="moz-txt-link-freetext" href="https://solr.apache.org/">https://solr.apache.org/</a></pre></div>
</blockquote>
<div class="plaintext" style="white-space: normal;"></div>
</div></div></blockquote>
<div class="plaintext" style="white-space: normal;">
</div>
</div></body>
</html>