Vector API FMA on CascadeLake
Viswanathan, Sandhya
sandhya.viswanathan at intel.com
Sat Jun 29 01:19:11 UTC 2024
I did some JMH experiments with small kernels and see good gains (1.3x, for a 4096-element array) with 512-bit vector FMA over 256-bit vector FMA for a dot product on Cascade Lake (Xeon Platinum 8280L). The kernels I tried are:
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// left and right are float[] benchmark state fields (see the harness sketch at the end).
public static final VectorSpecies<Float> YMM_FLOAT = FloatVector.SPECIES_256;
public static final VectorSpecies<Float> ZMM_FLOAT = FloatVector.SPECIES_512;
@Benchmark
public float vectorUnrolled512() {
    // Four independent accumulators break the loop-carried fma dependency chain.
    var sum1 = FloatVector.zero(ZMM_FLOAT);
    var sum2 = FloatVector.zero(ZMM_FLOAT);
    var sum3 = FloatVector.zero(ZMM_FLOAT);
    var sum4 = FloatVector.zero(ZMM_FLOAT);
    int width = ZMM_FLOAT.length();
    for (int i = 0; i <= (left.length - width * 4); i += width * 4) {
        sum1 = FloatVector.fromArray(ZMM_FLOAT, left, i).fma(FloatVector.fromArray(ZMM_FLOAT, right, i), sum1);
        sum2 = FloatVector.fromArray(ZMM_FLOAT, left, i + width).fma(FloatVector.fromArray(ZMM_FLOAT, right, i + width), sum2);
        sum3 = FloatVector.fromArray(ZMM_FLOAT, left, i + width * 2).fma(FloatVector.fromArray(ZMM_FLOAT, right, i + width * 2), sum3);
        sum4 = FloatVector.fromArray(ZMM_FLOAT, left, i + width * 3).fma(FloatVector.fromArray(ZMM_FLOAT, right, i + width * 3), sum4);
    }
    return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(VectorOperators.ADD);
}
@Benchmark
public float vectorUnrolled256() {
    var sum1 = FloatVector.zero(YMM_FLOAT);
    var sum2 = FloatVector.zero(YMM_FLOAT);
    var sum3 = FloatVector.zero(YMM_FLOAT);
    var sum4 = FloatVector.zero(YMM_FLOAT);
    int width = YMM_FLOAT.length();
    for (int i = 0; i <= (left.length - width * 4); i += width * 4) {
        sum1 = FloatVector.fromArray(YMM_FLOAT, left, i).fma(FloatVector.fromArray(YMM_FLOAT, right, i), sum1);
        sum2 = FloatVector.fromArray(YMM_FLOAT, left, i + width).fma(FloatVector.fromArray(YMM_FLOAT, right, i + width), sum2);
        sum3 = FloatVector.fromArray(YMM_FLOAT, left, i + width * 2).fma(FloatVector.fromArray(YMM_FLOAT, right, i + width * 2), sum3);
        sum4 = FloatVector.fromArray(YMM_FLOAT, left, i + width * 3).fma(FloatVector.fromArray(YMM_FLOAT, right, i + width * 3), sum4);
    }
    return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(VectorOperators.ADD);
}
The simple, non-unrolled kernels also show good gains (1.8x) with 512-bit over 256-bit:
@Benchmark
public float vector512() {
    // A single accumulator: every fma waits on the previous one (latency-bound).
    var sum = FloatVector.zero(ZMM_FLOAT);
    int width = ZMM_FLOAT.length();
    for (int i = 0; i <= (left.length - width); i += width) {
        var l = FloatVector.fromArray(ZMM_FLOAT, left, i);
        var r = FloatVector.fromArray(ZMM_FLOAT, right, i);
        sum = l.fma(r, sum);
    }
    return sum.reduceLanes(VectorOperators.ADD);
}
@Benchmark
public float vector256() {
    var sum = FloatVector.zero(YMM_FLOAT);
    int width = YMM_FLOAT.length();
    for (int i = 0; i <= (left.length - width); i += width) {
        var l = FloatVector.fromArray(YMM_FLOAT, left, i);
        var r = FloatVector.fromArray(YMM_FLOAT, right, i);
        sum = l.fma(r, sum);
    }
    return sum.reduceLanes(VectorOperators.ADD);
}
Note that hand-unrolled kernels with multiple accumulators are the way to go: FMA/multiply has high latency, and several independent accumulators break the loop-carried dependency chain, so the hand-unrolled kernels give very good performance gains over the non-unrolled ones.
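For anyone who wants to reproduce this, here is a minimal sketch of the JMH harness these kernels assume. It is not from the original message: the FmaDotProduct class name, the setup logic, and the scalar baseline are mine; the 4096 element count is the array size quoted above. The Vector API is incubating, so it needs --add-modules jdk.incubator.vector at compile and run time. The vector256/vector512/vectorUnrolled256/vectorUnrolled512 benchmarks above would live in this class.

import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class FmaDotProduct { // hypothetical class name, not from the post
    @Param({"4096"}) // the array size quoted for the 1.3x result
    public int size;

    public float[] left;
    public float[] right;

    @Setup(Level.Trial)
    public void setup() {
        // Fill both operands with random data once per trial.
        left = new float[size];
        right = new float[size];
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < size; i++) {
            left[i] = rnd.nextFloat();
            right[i] = rnd.nextFloat();
        }
    }

    // ... the vector256/vector512/vectorUnrolled* benchmarks from above go here ...

    @Benchmark
    public float scalar() {
        // Scalar baseline (not in the original post), for reference.
        float sum = 0f;
        for (int i = 0; i < left.length; i++) {
            sum = Math.fma(left[i], right[i], sum);
        }
        return sum;
    }
}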
Best Regards,
Sandhya