Vector API FMA on CascadeLake
Robert Muir
rcmuir at gmail.com
Tue Jul 2 14:26:41 UTC 2024
Thank you for benchmarking. For the downclocking issue reported by a
user, we only saw it with integer multiplication, not floating point:
the dot product of ByteVectors.
Instead of converting ByteVector (128-bit) to IntVector (512-bit) and
doing the multiply/add there, we work around the problem with an
intermediate conversion of ByteVector (128-bit) to ShortVector
(256-bit), doing the multiplication, then converting to IntVector
(512-bit) for the addition. It still avoids overflow, but keeps the
heavy multiplication off the AVX-512 units.
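That widening trick looks roughly like this (a sketch, not the exact code from the report: the class/method names, species choices, and tail handling are illustrative; running it requires `--add-modules jdk.incubator.vector`):

```java
import jdk.incubator.vector.*;

public class ByteDotSketch {
    static final VectorSpecies<Byte> BYTE_128 = ByteVector.SPECIES_128;

    // Dot product of two byte[] arrays: widen bytes to shorts for the
    // multiply (a byte*byte product always fits in a short), then widen
    // the products to ints for accumulation. The multiplication stays on
    // 256-bit vectors; only the (cheap) addition runs at 512 bits, which
    // sidesteps the heavy AVX-512 integer multiply.
    static int dot(byte[] a, byte[] b) {
        var acc = IntVector.zero(IntVector.SPECIES_512);
        int i = 0;
        int bound = BYTE_128.loopBound(a.length);
        for (; i < bound; i += BYTE_128.length()) {
            var va = ByteVector.fromArray(BYTE_128, a, i);
            var vb = ByteVector.fromArray(BYTE_128, b, i);
            // 128-bit bytes -> 256-bit shorts (same lane count, part 0)
            var sa = va.convertShape(VectorOperators.B2S, ShortVector.SPECIES_256, 0);
            var sb = vb.convertShape(VectorOperators.B2S, ShortVector.SPECIES_256, 0);
            var prod = sa.mul(sb); // 16-bit multiply, no overflow for byte inputs
            // 256-bit shorts -> 512-bit ints, then accumulate
            acc = acc.add(prod.convertShape(VectorOperators.S2I, IntVector.SPECIES_512, 0));
        }
        int sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) {
            sum += a[i] * b[i]; // scalar tail
        }
        return sum;
    }
}
```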
On Fri, Jun 28, 2024 at 9:19 PM Viswanathan, Sandhya
<sandhya.viswanathan at intel.com> wrote:
>
> I did some JMH experiments with small kernels and see good gains (1.3x, for a 4096-element array) with 512-bit vector FMA over 256-bit vector FMA for dot product on CascadeLake (8280L). The kernels I tried are:
>
>
>
> public static final VectorSpecies<Float> YMM_FLOAT = FloatVector.SPECIES_256;
> public static final VectorSpecies<Float> ZMM_FLOAT = FloatVector.SPECIES_512;
>
> @Benchmark
> public float vectorUnrolled512() {
>     var sum1 = FloatVector.zero(ZMM_FLOAT);
>     var sum2 = FloatVector.zero(ZMM_FLOAT);
>     var sum3 = FloatVector.zero(ZMM_FLOAT);
>     var sum4 = FloatVector.zero(ZMM_FLOAT);
>     int width = ZMM_FLOAT.length();
>     for (int i = 0; i <= (left.length - width * 4); i += width * 4) {
>         sum1 = FloatVector.fromArray(ZMM_FLOAT, left, i).fma(FloatVector.fromArray(ZMM_FLOAT, right, i), sum1);
>         sum2 = FloatVector.fromArray(ZMM_FLOAT, left, i + width).fma(FloatVector.fromArray(ZMM_FLOAT, right, i + width), sum2);
>         sum3 = FloatVector.fromArray(ZMM_FLOAT, left, i + width * 2).fma(FloatVector.fromArray(ZMM_FLOAT, right, i + width * 2), sum3);
>         sum4 = FloatVector.fromArray(ZMM_FLOAT, left, i + width * 3).fma(FloatVector.fromArray(ZMM_FLOAT, right, i + width * 3), sum4);
>     }
>     return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(VectorOperators.ADD);
> }
>
> @Benchmark
> public float vectorUnrolled256() {
>     var sum1 = FloatVector.zero(YMM_FLOAT);
>     var sum2 = FloatVector.zero(YMM_FLOAT);
>     var sum3 = FloatVector.zero(YMM_FLOAT);
>     var sum4 = FloatVector.zero(YMM_FLOAT);
>     int width = YMM_FLOAT.length();
>     for (int i = 0; i <= (left.length - width * 4); i += width * 4) {
>         sum1 = FloatVector.fromArray(YMM_FLOAT, left, i).fma(FloatVector.fromArray(YMM_FLOAT, right, i), sum1);
>         sum2 = FloatVector.fromArray(YMM_FLOAT, left, i + width).fma(FloatVector.fromArray(YMM_FLOAT, right, i + width), sum2);
>         sum3 = FloatVector.fromArray(YMM_FLOAT, left, i + width * 2).fma(FloatVector.fromArray(YMM_FLOAT, right, i + width * 2), sum3);
>         sum4 = FloatVector.fromArray(YMM_FLOAT, left, i + width * 3).fma(FloatVector.fromArray(YMM_FLOAT, right, i + width * 3), sum4);
>     }
>     return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(VectorOperators.ADD);
> }
>
>
>
> The simple (non-unrolled) kernels also show good gains (1.8x) with 512-bit over 256-bit:
>
>
>
> @Benchmark
> public float vector512() {
>     var sum = FloatVector.zero(ZMM_FLOAT);
>     int width = ZMM_FLOAT.length();
>     for (int i = 0; i <= (left.length - width); i += width) {
>         var l = FloatVector.fromArray(ZMM_FLOAT, left, i);
>         var r = FloatVector.fromArray(ZMM_FLOAT, right, i);
>         sum = l.fma(r, sum);
>     }
>     return sum.reduceLanes(VectorOperators.ADD);
> }
>
> @Benchmark
> public float vector256() {
>     var sum = FloatVector.zero(YMM_FLOAT);
>     int width = YMM_FLOAT.length();
>     for (int i = 0; i <= (left.length - width); i += width) {
>         var l = FloatVector.fromArray(YMM_FLOAT, left, i);
>         var r = FloatVector.fromArray(YMM_FLOAT, right, i);
>         sum = l.fma(r, sum);
>     }
>     return sum.reduceLanes(VectorOperators.ADD);
> }
>
>
>
> Note that hand-unrolled kernels with multiple accumulators are the way to go: FMA/multiply has high latency, and independent accumulators let those latencies overlap (roughly, with 4-cycle FMA latency and two FMA ports you need about eight independent chains in flight to keep the units busy), so hand-unrolled kernels can show very good gains over non-unrolled ones.
>
>
>
> Best Regards,
>
> Sandhya
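The multiple-accumulator point above applies even to scalar code: each independent sum carries its own dependency chain, so the multiply-add latencies can overlap instead of serializing. A minimal plain-Java illustration mirroring the structure of the vectorUnrolled* kernels (class and method names are mine, not from the thread):

```java
public class UnrolledDot {
    // Scalar dot product with four independent accumulators. Each sum
    // forms its own loop-carried dependency chain, so an out-of-order
    // core can keep several multiply-adds in flight at once.
    static float dotUnrolled(float[] a, float[] b) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i <= a.length - 4; i += 4) {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        float sum = s0 + s1 + s2 + s3;
        for (; i < a.length; i++) {
            sum += a[i] * b[i]; // scalar tail
        }
        return sum;
    }
}
```

Note that, as with the vector kernels, regrouping the additions can change the floating-point result slightly relative to a strictly sequential sum.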
More information about the panama-dev mailing list