Using the Vector API to access SIMD instructions
Raffaello Giulietti
raffaello.giulietti at gmail.com
Sun Oct 10 12:16:52 UTC 2021
On 2021-10-09 14:08, forax at univ-mlv.fr wrote:
> ----- Original Message -----
>> From: "raffaello giulietti" <raffaello.giulietti at gmail.com>
>> To: "Remi Forax" <forax at univ-mlv.fr>
>> Cc: "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>> Sent: Friday, 8 October 2021 19:43:24
>> Subject: Re: Using the Vector API to access SIMD instructions
>
>> On 2021-10-08 18:50, Remi Forax wrote:
>>> ----- Original Message -----
>>>> From: "raffaello giulietti" <raffaello.giulietti at gmail.com>
>>>> To: "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>>> Sent: Friday, 8 October 2021 17:16:41
>>>> Subject: Using the Vector API to access SIMD instructions
>>>
>>>> Hello,
>>>>
>>>> I'm implementing two decimal floating-point formats, as defined by the
>>>> IEEE 754 spec. I'm anticipating that they will become primitive classes
>>>> (JEP 401) if I can make them perform reasonably well.
>>>>
>>>> In Decimal128 I find myself coding something like
>>>>
>>>> int c0, c1, c2, c3;
>>>> long m;
>>>> int f, ph;
>>>>
>>>> ...
>>>>
>>>> int d0 = (int) ((m * c0) >>> f);
>>>> int d1 = (int) ((m * c1) >>> f);
>>>> int d2 = (int) ((m * c2) >>> f);
>>>> int d3 = (int) ((m * c3) >>> f);
>>>>
>>>> int e0 = c0 - ph * d0;
>>>> int e1 = c1 - ph * d1;
>>>> int e2 = c2 - ph * d2;
>>>> int e3 = c3 - ph * d3;
>>>>
>>>> I would like to code these repetitive 4 + 4 lines as SIMD operations
>>>> using the Vector API.
>>>>
>>>> However, it seems to me that I would have to re-code them 3 times,
>>>> depending on the preferred size of the underlying SIMD registers and
>>>> hoping that the preferred size can be constant folded and dead code be
>>>> eliminated by C2.
>>>
>>> You can use IntVector.SPECIES_128
>>> https://docs.oracle.com/en/java/javase/17/docs/api/jdk.incubator.vector/jdk/incubator/vector/IntVector.html#SPECIES_128
>>> and write the code once and run it everywhere :)
>>>
>>
>> This is less than optimal if the platform supports larger registers, I
>> guess. For example, even my old laptop supports 256 bit YMM registers.
>
> I don't think it works that way: IntVector.SPECIES_128 lets you express 128-bit int vectors,
> but it does not say that 256-bit registers cannot be used.
>
Hi Rémi,

please note that the first 4 scalar lines contain multiplications on
longs, so one would have to issue two sequences on
LongVector.SPECIES_128 or one sequence on LongVector.SPECIES_256.

So how would you code these 8 straightforward scalar lines for
the platform's best available register size (SPECIES_PREFERRED) to
enhance performance?

I see two ways to code the first 4 lines:
* making use of an intermediate array (whose allocation you hope will be
  elided by the runtime compiler's escape analysis) and following a pattern
  of code similar to the excerpts in the javadoc
* or switching over the underlying preferred species size and writing 3
  specializations for LongVector.SPECIES_[64,128,256], hoping that 2 of
  them are seen as dead code by the runtime compiler's constant folding
  over SPECIES_PREFERRED.

Similarly for the last 4 lines above but on IntVector.SPECIES_[64,128].
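
As a point of reference for either approach, here is a minimal sketch in
plain Java (no Vector API involved) of the 8 scalar lines rewritten as a
loop over a small array. The class name LaneSketch, the method names and
the sample values in main are hypothetical and chosen only for
illustration. The loop form mirrors the lane structure a vectorized
version would have to reproduce: a 64-bit multiply and shift per lane,
then narrowing to 32-bit lanes.

```java
import java.util.Arrays;

final class LaneSketch {

    // The 4 + 4 scalar lines from the message, unchanged in meaning.
    public static int[] scalar(long m, int f, int ph,
                               int c0, int c1, int c2, int c3) {
        int d0 = (int) ((m * c0) >>> f);
        int d1 = (int) ((m * c1) >>> f);
        int d2 = (int) ((m * c2) >>> f);
        int d3 = (int) ((m * c3) >>> f);

        int e0 = c0 - ph * d0;
        int e1 = c1 - ph * d1;
        int e2 = c2 - ph * d2;
        int e3 = c3 - ph * d3;
        return new int[] {e0, e1, e2, e3};
    }

    // Same computation, one array element per lane.
    public static int[] looped(long m, int f, int ph, int[] c) {
        int[] e = new int[c.length];
        for (int i = 0; i < c.length; i++) {
            // c[i] is widened to long: this step needs 64-bit lanes.
            int d = (int) ((m * c[i]) >>> f);
            // From here on everything fits in 32-bit lanes.
            e[i] = c[i] - ph * d;
        }
        return e;
    }

    public static void main(String[] args) {
        // Arbitrary sample inputs, not taken from the Decimal128 code.
        long m = 1_000_000_007L;
        int f = 20, ph = 10;
        int[] c = {123, 456, 789, 1011};

        int[] a = scalar(m, f, ph, c[0], c[1], c[2], c[3]);
        int[] b = looped(m, f, ph, c);
        System.out.println(Arrays.equals(a, b));
    }
}
```

A Vector API version, whatever species it ends up using, can then be
checked lane by lane against looped().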

Greetings
Raffaello
>>
>> And there seems to be no way to "load/unload" a Vector without passing
>> through an array or a ByteBuffer.
>
> I wonder if zero + withLane + withLane etc. is optimized by C2, as is done for primitive classes (Valhalla)?
>
>>
>> In the end I think that the Vector API is designed more to support
>> vector operations on large arrays than on small fixed-size data as here.
>> My attempts to vectorize the above scalar code end up being rather ugly
>> and probably not worth the possible performance enhancements.
>>
>>
>> Greetings
>> Raffaello
>
> regards,
> Rémi
>