Intel AMX and feature detection

John Rose john.r.rose at oracle.com
Tue Jul 2 01:34:43 UTC 2024


I suppose the per-operation aspect could come in as an optional
list of operations, to pass (using varargs) to the
preferred-species API point.

public static final VectorSpecies<Float> WHICH_FLOAT =
    VectorSpecies.ofPreferred(Float.class,
                              VectorOperators.FMA, VectorOperators.ADD);

This could be factored down into a less convenient fine-grained API:

public static final VectorSpecies<Float> WHICH_FLOAT =
    VectorSpecies.ofLargestShape(Float.class)
                 .preferredOp(VectorOperators.FMA)
                 .preferredOp(VectorOperators.ADD);

The assumption there would be that each preferredOp would potentially make the vector size smaller, and you’d start with an optimistically wide shape (“largest shape”).
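Note that the varargs `ofPreferred(Float.class, ops...)` and the chained `preferredOp` calls above are proposed API, not something that exists today. For contrast, a minimal sketch of the two endpoints the proposal would refine between, using only the current incubator API:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesProbe {
    public static void main(String[] args) {
        // Today's API: a single per-platform preferred species per element
        // type, with no way to refine it per operation.
        VectorSpecies<Float> preferred = FloatVector.SPECIES_PREFERRED;

        // The optimistically wide starting point ("largest shape") that
        // the proposed preferredOp calls would narrow down from.
        VectorSpecies<Float> widest = VectorSpecies.ofLargestShape(Float.class);

        System.out.println("preferred: " + preferred.vectorBitSize() + " bits");
        System.out.println("widest:    " + widest.vectorBitSize() + " bits");
    }
}
```

(Compile and run with `--add-modules jdk.incubator.vector`.)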

On 28 Jun 2024, at 15:33, Vladimir Ivanov wrote:

>> On Skylake we restrict the JVM to UseAVX=2 overall (no AVX-512 ISA), and the user can override with UseAVX=3.
>> On Cascade Lake we only restrict the auto-vectorizer to 256-bit vector width with the AVX-512 ISA, and still allow the intrinsics and the Vector API to benefit from 512-bit vector width. FMA is expensive, but not every vector instruction is, and when a user explicitly uses IntVector.SPECIES_512 they would expect to get 512-bit code generation, wouldn't they? Also, SPECIES_PREFERRED and SPECIES_MAX for an element type are interconnected. So there are a lot of questions on how to handle everything if we do want to restrict SPECIES_PREFERRED.
>> As Paul mentioned, we need an override anyway, so the thought came to mind: if the existing option (MaxVectorSize=32) would work, why not use the override the other way around, and then we don't have to go into all the complications.
>
> All those are good points, Sandhya.
>
> The notion of preferred vector shape turns out to be way too vague. It was a good first approximation when we compared AVX vs AVX2 vs baseline AVX512 (Skylake support), but now it doesn't cover all the different flavors and varying quality of support of multiple AVX512* ISA extensions.
>
> As I suggested in a separate email, a new per-operation API may be a better option here.
>
> Best regards,
> Vladimir Ivanov
>
>>
>> Best Regards,
>> Sandhya
>>
>> -----Original Message-----
>> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>> Sent: Thursday, June 27, 2024 3:55 PM
>> To: Viswanathan, Sandhya <sandhya.viswanathan at intel.com>; Paul Sandoz <paul.sandoz at oracle.com>; John Rose <john.r.rose at oracle.com>
>> Cc: Uwe Schindler <uschindler at apache.org>; panama-dev at openjdk.org
>> Subject: Re: Intel AMX and feature detection
>>
>>
>>> It is possible to do the “down bit’ing” today by setting -XX:MaxVectorSize=32 on the JVM command line. This sets the preferred species to 256-bit vector size with the AVX-512 ISA.  Would that work by any chance? Or I guess that is not what we want ...
>>
>> HotSpot JVM already does that: by default, AVX512 is not used on Skylake CPUs [1] even though they support AVX512F et al. (unless the user explicitly specifies -XX:UseAVX=3). Maybe the JVM should be more aggressive; hard to say. (It'll affect Cascade Lake.) The choice is not specific to the Vector API, but affects the whole JVM (in particular, VM intrinsics and the C2 auto-vectorizer).
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> [1]
>> https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/vm_version_x86.cpp#L1005
>>
>>> -----Original Message-----
>>> From: panama-dev <panama-dev-retn at openjdk.org> On Behalf Of Paul
>>> Sandoz
>>> Sent: Thursday, June 27, 2024 7:59 AM
>>> To: John Rose <john.r.rose at oracle.com>
>>> Cc: Uwe Schindler <uschindler at apache.org>; panama-dev at openjdk.org
>>> Subject: Re: Intel AMX and feature detection
>>>
>>>
>>>
>>>> On Jun 26, 2024, at 3:55 PM, John Rose <john.r.rose at oracle.com> wrote:
>>>>
>>>> Actually we have VectorShape.S_64_BIT which, if it is the preferred
>>>> shape, is really telling you the “vector” processing is inside the
>>>> CPU, not the VPU.
>>>> That’s a good-enough hint to avoid the Vector API, right?
>>>
>>> Yes, I think that is a good proxy for lack of any compiler support.
>>>
>>> I would like to hear opinions from the Intel folks on “down bit’ing” from 512 to 256 on AVX-512 without VBMI2. It seems pragmatic. What about the auto-vectorizer? We also have HotSpot flags to say use AVX2 on an AVX-512 machine as a workaround. If we do this, perhaps we need a flag to override the “down bit’ing”.
>>>
>>> I will log issues for both.
>>>
>>> Paul.
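The S_64_BIT heuristic John and Paul agree on above can be sketched with today's incubator API (the method name `vectorHardwareLikely` is made up for illustration):

```java
import jdk.incubator.vector.VectorShape;

public class VectorApiHint {
    // If the platform's preferred shape is only 64 bits wide, "vector"
    // processing is effectively happening in scalar CPU registers, so a
    // library may prefer its scalar code path over the Vector API.
    static boolean vectorHardwareLikely() {
        return VectorShape.preferredShape() != VectorShape.S_64_BIT;
    }

    public static void main(String[] args) {
        System.out.println("worth using Vector API: " + vectorHardwareLikely());
    }
}
```

(Run with `--add-modules jdk.incubator.vector`.)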

