<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/xhtml; charset=utf-8">
</head>
<body><div style="font-family: sans-serif;"><div class="plaintext" style="white-space: normal;"><p dir="auto">Nice results!</p>
<p dir="auto">That hand-unrolling makes me wish we had synthetic multi-vector types.</p>
<p dir="auto">Something like this, which would declare four physical vectors at a time:</p>
<p dir="auto">public static final VectorSpecies&lt;Float&gt; ZMM_FLOAT = FloatVector.SPECIES_512.replicate(4);</p>
<p dir="auto">In this case, that would be sufficient to unroll the loop.</p>
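<p dir="auto">As a rough illustration only: <code>replicate(4)</code> is a wish rather than an existing Vector API method, so the sketch below emulates the idea in plain scalar Java. The class name and constants are hypothetical. One <code>float[]</code> of 64 lanes stands in for a synthetic vector built from four 512-bit physical vectors, and the kernel collapses back to a single non-unrolled loop over that logical width.</p>

```java
public class ReplicatedSpeciesSketch {
    // Lane count of one 512-bit float vector (512 bits / 32 bits per float).
    static final int PHYSICAL_LANES = 16;
    // Hypothetical replication factor, mirroring the replicate(4) idea.
    static final int REPLICAS = 4;
    // Lane count of the imagined synthetic multi-vector.
    static final int LOGICAL_LANES = PHYSICAL_LANES * REPLICAS;

    // Dot product written as one non-unrolled loop over the logical width.
    static float dot(float[] left, float[] right) {
        // Stands in for a single accumulator of the synthetic species.
        float[] acc = new float[LOGICAL_LANES];
        int i = 0;
        for (; i <= left.length - LOGICAL_LANES; i += LOGICAL_LANES) {
            // Lane-wise fma; a real implementation would map this to four physical fmas.
            for (int lane = 0; lane < LOGICAL_LANES; lane++) {
                acc[lane] = Math.fma(left[i + lane], right[i + lane], acc[lane]);
            }
        }
        // Scalar stand-in for reduceLanes(VectorOperators.ADD).
        float sum = 0f;
        for (float a : acc) {
            sum += a;
        }
        // Scalar tail for lengths that are not a multiple of LOGICAL_LANES.
        for (; i < left.length; i++) {
            sum = Math.fma(left[i], right[i], sum);
        }
        return sum;
    }
}
```

<p dir="auto">The point of the sketch is only the loop shape: the source keeps one accumulator and one stride, while the four-way unrolling would be implied by the species.</p>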
<p dir="auto">On 28 Jun 2024, at 18:19, Viswanathan, Sandhya wrote:</p>
</div><blockquote class="embedded" style="margin: 0 0 5px; padding-left: 5px; border-left: 2px solid #777777; color: #777777;"><div id="61E977EA-789A-4623-894C-3415D3828809">
<div lang="EN-US" link="#0563C1" vlink="#954F72" style="word-wrap:break-word">
<div class="WordSection1" style="page: WordSection1;">
<p class="MsoNormal">I ran some JMH experiments with small dot-product kernels and see good gains (1.3x at an array size of 4096) from 512-bit vector fma over 256-bit vector fma on Cascade Lake (8280L). The kernels I tried are:</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal"> public static final VectorSpecies&lt;Float&gt; YMM_FLOAT = FloatVector.SPECIES_256;</p>
<p class="MsoNormal"> public static final VectorSpecies&lt;Float&gt; ZMM_FLOAT = FloatVector.SPECIES_512;</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal"> @Benchmark</p>
<p class="MsoNormal"> public float vectorUnrolled512() {</p>
<p class="MsoNormal"> var sum1 = FloatVector.zero(ZMM_FLOAT);</p>
<p class="MsoNormal"> var sum2 = FloatVector.zero(ZMM_FLOAT);</p>
<p class="MsoNormal"> var sum3 = FloatVector.zero(ZMM_FLOAT);</p>
<p class="MsoNormal"> var sum4 = FloatVector.zero(ZMM_FLOAT);</p>
<p class="MsoNormal"> int width = ZMM_FLOAT.length();</p>
<p class="MsoNormal"> for (int i = 0; i <= (left.length - width * 4); i += width * 4) {</p>
<p class="MsoNormal"> sum1 = FloatVector.fromArray(ZMM_FLOAT, left, i).fma(FloatVector.fromArray(ZMM_FLOAT, right, i), sum1);</p>
<p class="MsoNormal"> sum2 = FloatVector.fromArray(ZMM_FLOAT, left, i + width).fma(FloatVector.fromArray(ZMM_FLOAT, right, i + width), sum2);</p>
<p class="MsoNormal"> sum3 = FloatVector.fromArray(ZMM_FLOAT, left, i + width * 2).fma(FloatVector.fromArray(ZMM_FLOAT, right, i + width * 2), sum3);</p>
<p class="MsoNormal"> sum4 = FloatVector.fromArray(ZMM_FLOAT, left, i + width * 3).fma(FloatVector.fromArray(ZMM_FLOAT, right, i + width * 3), sum4);</p>
<p class="MsoNormal"> }</p>
<p class="MsoNormal"> return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(VectorOperators.ADD);</p>
<p class="MsoNormal"> }</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal"> @Benchmark</p>
<p class="MsoNormal"> public float vectorUnrolled256() {</p>
<p class="MsoNormal"> var sum1 = FloatVector.zero(YMM_FLOAT);</p>
<p class="MsoNormal"> var sum2 = FloatVector.zero(YMM_FLOAT);</p>
<p class="MsoNormal"> var sum3 = FloatVector.zero(YMM_FLOAT);</p>
<p class="MsoNormal"> var sum4 = FloatVector.zero(YMM_FLOAT);</p>
<p class="MsoNormal"> int width = YMM_FLOAT.length();</p>
<p class="MsoNormal"> for (int i = 0; i <= (left.length - width * 4); i += width * 4) {</p>
<p class="MsoNormal"> sum1 = FloatVector.fromArray(YMM_FLOAT, left, i).fma(FloatVector.fromArray(YMM_FLOAT, right, i), sum1);</p>
<p class="MsoNormal"> sum2 = FloatVector.fromArray(YMM_FLOAT, left, i + width).fma(FloatVector.fromArray(YMM_FLOAT, right, i + width), sum2);</p>
<p class="MsoNormal"> sum3 = FloatVector.fromArray(YMM_FLOAT, left, i + width * 2).fma(FloatVector.fromArray(YMM_FLOAT, right, i + width * 2), sum3);</p>
<p class="MsoNormal"> sum4 = FloatVector.fromArray(YMM_FLOAT, left, i + width * 3).fma(FloatVector.fromArray(YMM_FLOAT, right, i + width * 3), sum4);</p>
<p class="MsoNormal"> }</p>
<p class="MsoNormal"> return sum1.add(sum2).add(sum3).add(sum4).reduceLanes(VectorOperators.ADD);</p>
<p class="MsoNormal"> }</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">The non-unrolled kernels also show good gains (1.8x) with 512-bit over 256-bit:</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal"> @Benchmark</p>
<p class="MsoNormal"> public float vector512() {</p>
<p class="MsoNormal"> var sum = FloatVector.zero(ZMM_FLOAT);</p>
<p class="MsoNormal"> int width = ZMM_FLOAT.length();</p>
<p class="MsoNormal"> for (int i = 0; i <= (left.length - width); i += width) {</p>
<p class="MsoNormal"> var l = FloatVector.fromArray(ZMM_FLOAT, left, i);</p>
<p class="MsoNormal"> var r = FloatVector.fromArray(ZMM_FLOAT, right, i);</p>
<p class="MsoNormal"> sum = l.fma(r, sum);</p>
<p class="MsoNormal"> }</p>
<p class="MsoNormal"> return sum.reduceLanes(VectorOperators.ADD);</p>
<p class="MsoNormal"> }</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal"> @Benchmark</p>
<p class="MsoNormal"> public float vector256() {</p>
<p class="MsoNormal"> var sum = FloatVector.zero(YMM_FLOAT);</p>
<p class="MsoNormal"> int width = YMM_FLOAT.length();</p>
<p class="MsoNormal"> for (int i = 0; i <= (left.length - width); i += width) {</p>
<p class="MsoNormal"> var l = FloatVector.fromArray(YMM_FLOAT, left, i);</p>
<p class="MsoNormal"> var r = FloatVector.fromArray(YMM_FLOAT, right, i);</p>
<p class="MsoNormal"> sum = l.fma(r, sum);</p>
<p class="MsoNormal"> }</p>
<p class="MsoNormal"> return sum.reduceLanes(VectorOperators.ADD);</p>
<p class="MsoNormal"> }</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Note that hand-unrolled kernels with multiple accumulators are the way to go: fma/multiply has high latency, so independent accumulators let several operations overlap, and you can get very good perf gains over the non-unrolled kernels.</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Best Regards,</p>
<p class="MsoNormal">Sandhya</p>
</div>
</div></div></blockquote>
<div class="plaintext" style="white-space: normal;">
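<p dir="auto">The latency point in the quoted message can be seen in scalar form. This is a minimal sketch in plain Java, not one of the benchmarked kernels; the class and method names are made up for illustration. With one accumulator each fma depends on the result of the previous one, while four accumulators form four independent dependency chains. It also adds a scalar tail loop, which the quoted kernels do not need at an array size of 4096.</p>

```java
public class AccumulatorSketch {
    // One accumulator: each fma has to wait for the previous result.
    static float dotSingle(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum = Math.fma(a[i], b[i], sum);
        }
        return sum;
    }

    // Four accumulators: four independent chains that can overlap in the pipeline.
    static float dotUnrolled(float[] a, float[] b) {
        float s1 = 0f, s2 = 0f, s3 = 0f, s4 = 0f;
        int i = 0;
        for (; i <= a.length - 4; i += 4) {
            s1 = Math.fma(a[i], b[i], s1);
            s2 = Math.fma(a[i + 1], b[i + 1], s2);
            s3 = Math.fma(a[i + 2], b[i + 2], s3);
            s4 = Math.fma(a[i + 3], b[i + 3], s4);
        }
        // Combine the partial sums once, after the hot loop.
        float sum = (s1 + s2) + (s3 + s4);
        // Scalar tail for lengths that are not a multiple of 4.
        for (; i < a.length; i++) {
            sum = Math.fma(a[i], b[i], sum);
        }
        return sum;
    }
}
```

<p dir="auto">Because the two methods add the products in a different order, their results can differ in the last bits for general float inputs; that is the usual price of reassociating a floating-point sum.</p>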
</div>
</div></body>
</html>