Using the Vector API to access SIMD instructions

Sat Oct 9 12:08:48 UTC 2021

----- Original Message -----
> From: "raffaello giulietti" <raffaello.giulietti at gmail.com>
> To: "Remi Forax" <forax at univ-mlv.fr>
> Cc: "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
> Sent: Friday, October 8, 2021 19:43:24
> Subject: Re: Using the Vector API to access SIMD instructions

> On 2021-10-08 18:50, Remi Forax wrote:
>> ----- Original Message -----
>>> From: "raffaello giulietti" <raffaello.giulietti at gmail.com>
>>> To: "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
>>> Sent: Friday, October 8, 2021 17:16:41
>>> Subject: Using the Vector API to access SIMD instructions
>> 
>>> Hello,
>>>
>>> I'm implementing two decimal floating-point formats, as defined by the
>>> IEEE 754 spec. I'm anticipating that they will become primitive classes
>>> (JEP 401) if I can make them perform reasonably well.
>>>
>>> In Decimal128 I find myself coding something like
>>>
>>>      int c0, c1, c2, c3;
>>>      long m;
>>>      int f, ph;
>>>
>>>      ...
>>>
>>>      int d0 = (int) ((m * c0) >>> f);
>>>      int d1 = (int) ((m * c1) >>> f);
>>>      int d2 = (int) ((m * c2) >>> f);
>>>      int d3 = (int) ((m * c3) >>> f);
>>>
>>>      int e0 = c0 - ph * d0;
>>>      int e1 = c1 - ph * d1;
>>>      int e2 = c2 - ph * d2;
>>>      int e3 = c3 - ph * d3;
>>>
>>> I would like to code these repetitive 4 + 4 lines as SIMD operations
>>> using the Vector API.
>>>
>>> However, it seems to me that I would have to re-code them 3 times,
>>> depending on the preferred size of the underlying SIMD registers and
>>> hoping that the preferred size can be constant folded and dead code be
>>> eliminated by C2.
>> 
>> You can use IntVector.SPECIES_128
>>    https://docs.oracle.com/en/java/javase/17/docs/api/jdk.incubator.vector/jdk/incubator/vector/IntVector.html#SPECIES_128
>> and write the code once and run it everywhere :)
>> 
> 
> This is less than optimal if the platform supports larger registers, I
> guess. For example, even my old laptop supports 256 bit YMM registers.

I don't think it works that way. IntVector.SPECIES_128 lets you express 128-bit int vectors;
it does not imply that 256-bit registers cannot be used.
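
As a rough sketch of what "write the code once" could look like for the second group of four lines (e0..e3), assuming the four c and d values are staged in int arrays. The class and method names are made up for illustration, and the code has to be compiled and run with --add-modules jdk.incubator.vector:

```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical helper for illustration only, not code from this thread.
final class Decimal128Lanes {
    // Fixed 128-bit shape: exactly four int lanes on every platform.
    private static final VectorSpecies<Integer> S = IntVector.SPECIES_128;

    // Computes e[i] = c[i] - ph * d[i] for i in 0..3 with one vector expression.
    // Both arrays are assumed to hold at least four elements.
    static int[] remainders(int[] c, int[] d, int ph) {
        IntVector vc = IntVector.fromArray(S, c, 0);
        IntVector vd = IntVector.fromArray(S, d, 0);
        int[] e = new int[S.length()];
        vc.sub(vd.mul(ph)).intoArray(e, 0);   // lane-wise c - ph * d
        return e;
    }
}
```

Note that the first group (d0..d3) multiplies a long by an int, so those products need 64-bit lanes; they would not fit in the same four-lane IntVector and would call for a LongVector instead.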

> 
> And there seems to be no way to "load/unload" a Vector without passing
> through an array or a ByteBuffer.

I wonder whether zero + withLane + withLane etc. is optimized by C2, as is done for primitive classes (Valhalla)?
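
To make the idea concrete, here is a minimal sketch of building the vector lane by lane, without an array or a ByteBuffer. The class and method names are hypothetical, it needs --add-modules jdk.incubator.vector, and whether C2 turns the withLane chain into something efficient is exactly the open question:

```java
import jdk.incubator.vector.IntVector;

// Hypothetical helper for illustration only, not code from this thread.
final class LaneBuild {
    // Packs four scalars into a 128-bit int vector, one lane at a time.
    static IntVector of(int c0, int c1, int c2, int c3) {
        return IntVector.zero(IntVector.SPECIES_128)
                .withLane(0, c0)
                .withLane(1, c1)
                .withLane(2, c2)
                .withLane(3, c3);
    }

    // Computes c - ph * d lane-wise; results are read back with lane(i).
    static IntVector remainders(IntVector c, IntVector d, int ph) {
        return c.sub(d.mul(ph));
    }
}
```

The results can then be "unloaded" the same way, e.g. int e0 = e.lane(0); and so on for the other three lanes.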

> 
> In the end I think that the Vector API is designed more to support
> vector operations on large arrays than on small fixed-size data as here.
> My attempts to vectorize the above scalar code end up being rather ugly
> and probably not worth the possible performance enhancements.
> 
> 
> Greetings
> Raffaello

regards,
Rémi