Vector API FMA on CascadeLake
John Rose
john.r.rose at oracle.com
Tue Jul 2 01:24:17 UTC 2024
Nice results!
That hand-unrolling makes me wish we had synthetic multi-vector types.
Something like this, which would declare four physical vectors at a
time:
    public static final VectorSpecies<Float> ZMM_FLOAT =
        FloatVector.SPECIES_512.replicate(4);
In this case, that would be sufficient to unroll the loop.
On 28 Jun 2024, at 18:19, Viswanathan, Sandhya wrote:
> I did some jmh experiments with small kernels and see good gains
> (1.3x, for 4096 array size) with 512-bit vector fma on CascadeLake
> (8280L) for dot product over 256-bit vector fma. The kernels I tried
> are:
>
> public static final VectorSpecies<Float> YMM_FLOAT =
>     FloatVector.SPECIES_256;
> public static final VectorSpecies<Float> ZMM_FLOAT =
>     FloatVector.SPECIES_512;
>
> @Benchmark
> public float vectorUnrolled512() {
>     var sum1 = FloatVector.zero(ZMM_FLOAT);
>     var sum2 = FloatVector.zero(ZMM_FLOAT);
>     var sum3 = FloatVector.zero(ZMM_FLOAT);
>     var sum4 = FloatVector.zero(ZMM_FLOAT);
>     int width = ZMM_FLOAT.length();
>     for (int i = 0; i <= (left.length - width * 4); i += width * 4) {
>         sum1 = FloatVector.fromArray(ZMM_FLOAT, left, i)
>             .fma(FloatVector.fromArray(ZMM_FLOAT, right, i), sum1);
>         sum2 = FloatVector.fromArray(ZMM_FLOAT, left, i + width)
>             .fma(FloatVector.fromArray(ZMM_FLOAT, right, i + width), sum2);
>         sum3 = FloatVector.fromArray(ZMM_FLOAT, left, i + width * 2)
>             .fma(FloatVector.fromArray(ZMM_FLOAT, right, i + width * 2), sum3);
>         sum4 = FloatVector.fromArray(ZMM_FLOAT, left, i + width * 3)
>             .fma(FloatVector.fromArray(ZMM_FLOAT, right, i + width * 3), sum4);
>     }
>     return sum1.add(sum2).add(sum3).add(sum4)
>         .reduceLanes(VectorOperators.ADD);
> }
>
> @Benchmark
> public float vectorUnrolled256() {
>     var sum1 = FloatVector.zero(YMM_FLOAT);
>     var sum2 = FloatVector.zero(YMM_FLOAT);
>     var sum3 = FloatVector.zero(YMM_FLOAT);
>     var sum4 = FloatVector.zero(YMM_FLOAT);
>     int width = YMM_FLOAT.length();
>     for (int i = 0; i <= (left.length - width * 4); i += width * 4) {
>         sum1 = FloatVector.fromArray(YMM_FLOAT, left, i)
>             .fma(FloatVector.fromArray(YMM_FLOAT, right, i), sum1);
>         sum2 = FloatVector.fromArray(YMM_FLOAT, left, i + width)
>             .fma(FloatVector.fromArray(YMM_FLOAT, right, i + width), sum2);
>         sum3 = FloatVector.fromArray(YMM_FLOAT, left, i + width * 2)
>             .fma(FloatVector.fromArray(YMM_FLOAT, right, i + width * 2), sum3);
>         sum4 = FloatVector.fromArray(YMM_FLOAT, left, i + width * 3)
>             .fma(FloatVector.fromArray(YMM_FLOAT, right, i + width * 3), sum4);
>     }
>     return sum1.add(sum2).add(sum3).add(sum4)
>         .reduceLanes(VectorOperators.ADD);
> }
>
> The non-unrolled kernels also show good gains (1.8x) with 512-bit over
> 256-bit:
>
> @Benchmark
> public float vector512() {
>     var sum = FloatVector.zero(ZMM_FLOAT);
>     int width = ZMM_FLOAT.length();
>     for (int i = 0; i <= (left.length - width); i += width) {
>         var l = FloatVector.fromArray(ZMM_FLOAT, left, i);
>         var r = FloatVector.fromArray(ZMM_FLOAT, right, i);
>         sum = l.fma(r, sum);
>     }
>     return sum.reduceLanes(VectorOperators.ADD);
> }
>
> @Benchmark
> public float vector256() {
>     var sum = FloatVector.zero(YMM_FLOAT);
>     int width = YMM_FLOAT.length();
>     for (int i = 0; i <= (left.length - width); i += width) {
>         var l = FloatVector.fromArray(YMM_FLOAT, left, i);
>         var r = FloatVector.fromArray(YMM_FLOAT, right, i);
>         sum = l.fma(r, sum);
>     }
>     return sum.reduceLanes(VectorOperators.ADD);
> }
>
> Note that hand-unrolled kernels with multiple accumulators are the
> way to go: fma/multiply has high latency, so hand-unrolled kernels can
> give very good perf gains over non-unrolled ones.
>
> Best Regards,
> Sandhya
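
For readers who want to see the multiple-accumulator idea in isolation,
here is a minimal scalar sketch. It is not the benchmark code quoted
above and it does not use the Vector API: the class name
DotProductSketch and both method names are made up for illustration,
and Math.fma stands in for the vector fma. The point is only the shape
of the dependency chains: one accumulator forms a single serial chain
of fma operations, while four accumulators form four chains that do
not depend on each other.

```java
// Illustrative scalar sketch only; names here are hypothetical and
// this is not the JMH benchmark from the quoted message.
final class DotProductSketch {

    // Single accumulator: each fma needs the result of the previous
    // one, so the whole loop is one serial dependency chain.
    static float dotOneAccumulator(float[] left, float[] right) {
        float sum = 0f;
        for (int i = 0; i < left.length; i++) {
            sum = Math.fma(left[i], right[i], sum);
        }
        return sum;
    }

    // Four accumulators: the four fma chains are independent of each
    // other, which leaves the hardware free to overlap them.
    static float dotFourAccumulators(float[] left, float[] right) {
        float sum1 = 0f, sum2 = 0f, sum3 = 0f, sum4 = 0f;
        int bound = left.length - (left.length % 4);
        int i = 0;
        for (; i < bound; i += 4) {
            sum1 = Math.fma(left[i],     right[i],     sum1);
            sum2 = Math.fma(left[i + 1], right[i + 1], sum2);
            sum3 = Math.fma(left[i + 2], right[i + 2], sum3);
            sum4 = Math.fma(left[i + 3], right[i + 3], sum4);
        }
        // Scalar tail for lengths that are not a multiple of 4.
        float tail = 0f;
        for (; i < left.length; i++) {
            tail = Math.fma(left[i], right[i], tail);
        }
        // Combine the partial sums once, after the loop.
        return (sum1 + sum2) + (sum3 + sum4) + tail;
    }
}
```

Because the two methods add the products in a different order, their
float results can differ in the last bits for general inputs; they
agree exactly only when every intermediate sum is exactly
representable, as with small integer values.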