Intel AMX and feature detection
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Jun 27 09:49:51 UTC 2024
There's this interesting project that is related:
https://github.com/YaSuenag/ffmasm
It can be used to create an executable memory segment with custom assembly.
However, the resulting code is not treated as an intrinsic, and even
using the linker's critical option is not enough to reach performance
parity with the Vector API (of course).
It would be cool if there were a way to reconcile that project with the
old code snippets one, perhaps via some restricted method :-)
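For reference, here is a minimal sketch of a downcall using the critical linker option mentioned above. It binds to libc's abs (a stand-in target; the class name and choice of function are just for illustration) with Linker.Option.critical, which trims some of the VM/native transition overhead in exchange for restrictions on what the native code may do:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class CriticalCall {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // Look up libc's abs through the native linker's default lookup
        MemorySegment absAddr = linker.defaultLookup().find("abs").orElseThrow();
        // critical(false): no heap access from native code, fewer transition costs
        MethodHandle abs = linker.downcallHandle(
            absAddr,
            FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.JAVA_INT),
            Linker.Option.critical(false));
        int r = (int) abs.invokeExact(-42);
        System.out.println(r);
    }
}
```

Even with critical linking, each call still goes through a call boundary that C2 cannot see through, which is the gap relative to a true intrinsic.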
Maurizio
On 21/06/2024 22:53, Paul Sandoz wrote:
>
>> On Jun 20, 2024, at 11:17 PM, Andrii Lomakin <andrii0lomakin at gmail.com> wrote:
>>
>> Hi Paul.
>>
>> Thank you for all your help.
>>
>> I will raise this topic again when float16 support is landed. You will
>> probably have new ideas about the details of the implementation till
>> this time.
>> I remember we already discussed the possibility of implementing
>> special vector shapes in another thread.
>>
>> I am afraid that fine-grained foreign memory calls will kill all
>> performance benefits.
> Yes, it might in its current form, even when linking to “critical” functions. I mostly mention it as a way to more quickly experiment with AMX and Java.
>
> Our first experiments leveraging vector hardware instructions in Java used a technique we called code snippets, where we could bind a method handle to some x86 code via a calling convention (see this presentation [1] from slide 37). It was effective for experimenting, and did not have the VM/native transition costs currently associated with Panama. Something of that nature might allow users to link to more specialized instructions, much as if it were a compiler intrinsic (a downside is that the code snippet is opaque to C2).
>
> Paul.
>
> [1] https://cr.openjdk.org/~psandoz/conferences/2015-JavaOne/j1-2015-unsafe-CON7076.pdf
>
>> On Thu, Jun 20, 2024 at 8:55 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:
>>> Hi Andrii,
>>>
>>> We have thought about AMX a little bit, but nothing concrete has emerged so far. It may be that we can lean on special vector shapes (e.g. viewed linearly, with a max size of 1024 bytes), where vectors of such shapes would correspond to tile registers that can be used with a limited set of operators supported in the hardware, e.g. DOT. I believe the element types supported are int8 and float16, and the Vector API would need to be extended for that (we are investigating float16). One challenge might be managing the register file, which I believe is programmable, and it may require some sort of scoped execution to configure/release it.
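As a point of reference for the DOT operator mentioned above, here is a scalar sketch (not a proposed API, just plain Java) of what an AMX-style int8 tile dot product computes: an int8 x int8 matrix product accumulated into int32 tiles, which the hardware instruction performs four signed bytes per dword lane at a time:

```java
// Scalar model of an AMX TDPBSSD-style tile operation:
// C (int32, MxN) += A (int8, MxK) * B (int8, KxN).
// Hardware processes K in groups of 4 bytes per 32-bit lane;
// plain nested loops give the same result here.
public class AmxDotSketch {
    static void tileDotInt8(byte[][] a, byte[][] b, int[][] c) {
        int m = a.length, k = b.length, n = c[0].length;
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int p = 0; p < k; p++)
                    c[i][j] += a[i][p] * b[p][j]; // widened to int, no overflow
    }

    public static void main(String[] args) {
        byte[][] a = {{1, 2}, {3, 4}};
        byte[][] b = {{5, 6}, {7, 8}};
        int[][] c = new int[2][2];
        tileDotInt8(a, b, c);
        System.out.println(c[0][0] + " " + c[0][1] + " " + c[1][0] + " " + c[1][1]);
    }
}
```

A vector-shape mapping of this would treat A, B, and C as tile-shaped vectors and expose the whole inner loop nest as a single DOT operator.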
>>>
>>> As an interim experiment it may be possible to leverage Panama and native methods using the AMX intrinsics.
>>>
>>> No current plans to support a feature detection API. On architectures that don't support explicit mask registers and mask-register-accepting instructions, we emulate using vector registers and blend instructions, as you indicate.
>>>
>>> Paul.
>>>
>>>> On Jun 16, 2024, at 9:26 PM, Andrii Lomakin <andrii0lomakin at gmail.com> wrote:
>>>>
>>>> Hi guys.
>>>>
>>>> I have three questions:
>>>>
>>>> 1. Do you plan to add support for Intel AMX instructions? According
>>>> to Intel reports, they can provide a 2-3x speedup in deep learning
>>>> model inference.
>>>> 2. The next question follows from the first one. Even now, masks are
>>>> not supported on every architecture, but AFAIK there is no way to
>>>> detect whether they are supported at runtime. Do you plan to provide
>>>> a so-called "feature detection" API?
>>>> 3. And the last question: even in older instruction sets, there are
>>>> some instructions that use register values as masks (blending, for
>>>> example). Will those instructions be supported on architectures that
>>>> do not support mask registers per se?
More information about the panama-dev mailing list