Using the Vector API to access SIMD instructions

Sun Oct 10 12:16:52 UTC 2021

On 2021-10-09 14:08, forax at univ-mlv.fr wrote:
> ----- Original Message -----
>> From: "raffaello giulietti" <raffaello.giulietti at gmail.com>
>> To: "Remi Forax" <forax at univ-mlv.fr>
>> Cc: "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
>> Sent: Friday, 8 October 2021 19:43:24
>> Subject: Re: Using the Vector API to access SIMD instructions
> 
>> On 2021-10-08 18:50, Remi Forax wrote:
>>> ----- Original Message -----
>>>> From: "raffaello giulietti" <raffaello.giulietti at gmail.com>
>>>> To: "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
>>>> Sent: Friday, 8 October 2021 17:16:41
>>>> Subject: Using the Vector API to access SIMD instructions
>>>
>>>> Hello,
>>>>
>>>> I'm implementing two decimal floating-point formats, as defined by the
>>>> IEEE 754 spec. I'm anticipating that they will become primitive classes
>>>> (JEP 401) if I can make them perform reasonably well.
>>>>
>>>> In Decimal128 I find myself coding something like
>>>>
>>>>       int c0, c1, c2, c3;
>>>>       long m;
>>>>       int f, ph;
>>>>
>>>>       ...
>>>>
>>>>       int d0 = (int) ((m * c0) >>> f);
>>>>       int d1 = (int) ((m * c1) >>> f);
>>>>       int d2 = (int) ((m * c2) >>> f);
>>>>       int d3 = (int) ((m * c3) >>> f);
>>>>
>>>>       int e0 = c0 - ph * d0;
>>>>       int e1 = c1 - ph * d1;
>>>>       int e2 = c2 - ph * d2;
>>>>       int e3 = c3 - ph * d3;
>>>>
>>>> I would like to code these repetitive 4 + 4 lines as SIMD operations
>>>> using the Vector API.
>>>>
>>>> However, it seems to me that I would have to re-code them 3 times,
>>>> depending on the preferred size of the underlying SIMD registers and
>>>> hoping that the preferred size can be constant folded and dead code be
>>>> eliminated by C2.
>>>
>>> You can use IntVector.SPECIES_128
>>>     https://docs.oracle.com/en/java/javase/17/docs/api/jdk.incubator.vector/jdk/incubator/vector/IntVector.html#SPECIES_128
>>> and write the code once and run it everywhere :)
>>>
>>
>> This is less than optimal if the platform supports larger registers, I
>> guess. For example, even my old laptop supports 256 bit YMM registers.
> 
> I don't think it works that way: IntVector.SPECIES_128 lets you express 128-bit int vectors,
> but it does not say that 256-bit registers cannot be used.
> 

Hi Rémi,

please note that the first 4 scalar lines contain multiplications on 
longs, so one would have to issue two sequences on 
LongVector.SPECIES_128 or one sequence on LongVector.SPECIES_256.
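
To illustrate why the lanes have to be 64 bits wide, here is a minimal 
scalar sketch. The class and method names, and the sample values, are 
made up for this example; only the expression shape comes from the code 
above.

```java
class LaneWidthDemo {
    // Same shape as the scalar lines: the product is formed in 64 bits.
    static int viaLong(long m, int c, int f) {
        return (int) ((m * c) >>> f);
    }

    // Hypothetical 32-bit variant: the product wraps around before the shift.
    static int viaInt(long m, int c, int f) {
        return ((int) m * c) >>> f;
    }

    public static void main(String[] args) {
        long m = 3_000_000_000L;               // does not fit in an int
        System.out.println(viaLong(m, 5, 4));  // prints 937500000
        System.out.println(viaInt(m, 5, 4));   // prints 132193632
    }
}
```

Four such products need 4 x 64 = 256 bits, hence two SPECIES_128 
sequences or one SPECIES_256 sequence.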

So how would you code these 8 straightforward scalar lines for the 
platform's best available register size (SPECIES_PREFERRED) to 
improve performance?

I see two ways to code the first 4 lines:

* making use of an intermediate array (whose allocation you hope will be 
elided by the runtime compiler's escape analysis) and following a pattern 
of code similar to the excerpts in the javadoc

* or switching over the underlying preferred species size and writing 3 
specializations for LongVector.SPECIES_[64,128,256], hoping that 2 of 
them are seen as dead code by the runtime compiler's constant folding 
over SPECIES_PREFERRED.

Similarly for the last 4 lines above but on IntVector.SPECIES_[64,128].
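
A rough sketch of the first option for the first 4 lines, assuming the 
JDK 17 incubator module (compile and run with 
--add-modules jdk.incubator.vector). The class name, the method name and 
the packing of c0..c3 into an array are my own choices for illustration, 
and I have not measured whether C2 elides the arrays:

```java
import jdk.incubator.vector.LongVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

class PreferredSpeciesSketch {
    private static final VectorSpecies<Long> S = LongVector.SPECIES_PREFERRED;

    // Computes d[i] = (int) ((m * c[i]) >>> f) for every i.
    static int[] quotients(long m, int[] c, int f) {
        // Widen the int constants to 64-bit lanes (sign-extending, as m * c0 does).
        long[] wide = new long[c.length];
        for (int i = 0; i < c.length; i++) {
            wide[i] = c[i];
        }
        long[] out = new long[c.length];
        // One iteration with 256-bit (or wider) registers, two with 128-bit ones.
        // The mask covers the case where the species has more lanes than elements.
        for (int i = 0; i < c.length; i += S.length()) {
            VectorMask<Long> mask = S.indexInRange(i, c.length);
            LongVector.fromArray(S, wide, i, mask)
                    .mul(m)                               // lane-wise m * c[i]
                    .lanewise(VectorOperators.LSHR, f)    // lane-wise >>> f
                    .intoArray(out, i, mask);
        }
        // Narrow back to int, like the (int) casts in the scalar code.
        int[] d = new int[c.length];
        for (int i = 0; i < c.length; i++) {
            d[i] = (int) out[i];
        }
        return d;
    }
}
```

This is written once for any register size, but it pays for that with 
two temporary arrays and the widening/narrowing loops around the 
two vector operations.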

Greetings
Raffaello

>>
>> And there seems to be no way to "load/unload" a Vector without passing
>> through an array or a ByteBuffer.
> 
> I wonder if zero + withLane + withLane etc. is optimized by C2, as is done for primitive classes (Valhalla)?
> 
>>
>> In the end I think that the Vector API is designed more for supporting
>> vector operations on large arrays than on small fixed-size data as here.
>> My attempts to vectorize the above scalar code end up being rather ugly
>> and probably not worth the possible performance enhancements.
>>
>>
>> Greetings
>> Raffaello
> 
> regards,
> Rémi
>