[aarch64-port-dev ] [PATCH] 8217561 : X86: Add floating-point Math.min/max intrinsics, approval request

Fri Mar 1 16:01:07 UTC 2019

On 28/02/2019 12:38, Andrew Haley wrote:
> On 2/28/19 9:54 AM, Andrew Haley wrote:
>> On 2/27/19 8:21 PM, Vladimir Kozlov wrote:
>>
>>> So I have question for aarch64 developers. Are aarch64 fmin/fmax
>>> instructions are always faster than code generated by default?
>> Be aware that AArch64 is an abstract architecture, so it cannot be
>> said to have performance properties.
>>
>> In real hardware, however, the answer is no. Nothing like. I have seen
>> the fmin/fmax instructions cause a 3x slowdown on a reduction loop.
> 
> So Andrew Dinn asked me what machine, and what test. After some time
> trying I confess that I cannot reproduce this result. I didn't think
> much of it at the time, which was why I didn't record that
> information. My apologies.

Ok, that's good to know. The tests I ran to check the benefits of
FPMax/Min intrinsics were for 3 different AArch64 CPUs (AppliedMicro,
Qualcomm and AMD) and they only showed a small degradation in
performance for the intrinsic with sorted data and a good improvement
with random data.

Also, I can now provide some details, including timings, of the
fpmin/max reduction tests I tried on Qualcomm and AMD. I tested 3
separate implementations:

1) only Pengfei's fpmin/max intrinsics, no reduction rules (novec)
2) Pengfei's fpmin/max intrinsics plus reduction rules (vec)
3) Pengfei's fpmin/max intrinsics plus my upgraded reduction rules (vecplus)

The difference between vec and vecplus is that Pengfei only uses the
vector reduction instruction fmaxv for reducing a 4 float vector (T4S).

instruct reduce_max4F(vRegF dst, vRegF src1, vecX src2) %{
  . . .
  ins_encode %{
    __ fmaxv(as_FloatRegister($dst$$reg), __ T4S,
             as_FloatRegister($src2$$reg));
    __ fmaxs(as_FloatRegister($dst$$reg), as_FloatRegister($dst$$reg),
             as_FloatRegister($src1$$reg));
  %}
  . . .

The fmaxv vector operation picks the max of the 4 new vector elements in
one step. The subsequent scalar compare picks it or the current
reduction value for the next cycle round the loop.
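
As a rough illustration of what the two instructions in that rule
compute, here is a scalar Java model. It is only a sketch, not HotSpot
code: the class and method names are made up for this mail, and
Math.max stands in for the fmaxv/fmaxs instructions.

```java
// Hypothetical scalar model of the reduce_max4F encoding; not HotSpot code.
final class ReduceMax4FModel {

    // Models fmaxv with T4S: the max across all four lanes of the vector.
    static float fmaxv4S(float[] lanes) {
        float m = Math.max(lanes[0], lanes[1]);
        m = Math.max(m, lanes[2]);
        return Math.max(m, lanes[3]);
    }

    // Models the whole rule. src1 is the running reduction value and
    // src2 holds the four newly computed vector elements.
    static float reduceMax4F(float src1, float[] src2) {
        float dst = fmaxv4S(src2);   // fmaxv dst, T4S, src2
        return Math.max(dst, src1);  // fmaxs dst, dst, src1
    }
}
```

The value returned by reduceMax4F plays the role of the reduction
value carried into the next trip round the loop.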

For the 2 float and 2 double reduction rules (T2S and T2D) Pengfei's
rules compare each of the 2 vector entries independently, using a vector
element pick and two scalar comparisons. Here is the double version of
Pengfei's encoding (the float version simply replaces D with F and d with f):

instruct reduce_max2D(vRegD dst, vRegD src1, vecX src2, vecX tmp) %{
  . . .
  ins_encode %{
    __ fmaxd(as_FloatRegister($dst$$reg), as_FloatRegister($src1$$reg),
             as_FloatRegister($src2$$reg));
    __ ins(as_FloatRegister($tmp$$reg), __ D,
           as_FloatRegister($src2$$reg), 0, 1);
    __ fmaxd(as_FloatRegister($dst$$reg), as_FloatRegister($dst$$reg),
             as_FloatRegister($tmp$$reg));
  %}
  . . .

My alternative patch modifies the 2D rule to work like the 4S rule i.e.

instruct reduce_max2D(vRegD dst, vRegD src1, vecX src2, vecX tmp) %{
  . . .
  ins_encode %{
    __ fmaxv(as_FloatRegister($dst$$reg), __ T2D,
             as_FloatRegister($src2$$reg));
    __ fmaxd(as_FloatRegister($dst$$reg), as_FloatRegister($dst$$reg),
             as_FloatRegister($src1$$reg));
  %}
  . . .
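
To make the comparison between the two reduce_max2D encodings above
concrete, here is a scalar Java model of each. Again this is just a
sketch under assumptions: the names are invented for illustration,
Math.max stands in for the fmaxd/fmaxv instructions as they are written
in the rules, and a 2-element array stands in for the 2D vector
register.

```java
// Hypothetical scalar models of the two reduce_max2D encodings; not HotSpot code.
final class ReduceMax2DModel {

    // Original encoding: three instructions, two of them scalar compares.
    static double pickAndCompare(double src1, double[] src2) {
        double dst = Math.max(src1, src2[0]);  // fmaxd dst, src1, src2
        double tmp = src2[1];                  // ins tmp, D, src2, 0, 1
        return Math.max(dst, tmp);             // fmaxd dst, dst, tmp
    }

    // Alternative encoding: reduce across the lanes first, then one compare.
    static double acrossLanes(double src1, double[] src2) {
        double dst = Math.max(src2[0], src2[1]);  // fmaxv dst, T2D, src2
        return Math.max(dst, src1);               // fmaxd dst, dst, src1
    }
}
```

Both methods return the same value for the same inputs. The only
difference the model shows is the instruction count (three versus two)
and that the alternative needs no element move into tmp.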

There is a corresponding tweak to the 2F rule but it is somewhat
immaterial since I could not produce a test that would cause it to be
applied.

In fact, like Pengfei, I found it hard to come up with tests that caused
the reduction to be performed.

The obvious example one would want to work would be something like this:

  double da[] = ...
  double db[] = ...

  @Benchmark
  public void testVecMaxDoubleReduce2() {
      double max = 0.0;
      for (int z = 0; z < COUNT; z++) {
	  for (int i = 0; i < LENGTH; i++) {
	      max = Math.max(max, da[i]);
	  }
      }
      dc[0] = max;
  }

Obviously there are 3 more equivalent benchmarks obtained by
substituting Math.min for Math.max and/or float for double.

For this test the max operations are translated to the FPMax intrinsic.
However, the reduction is not applied.  That's because the compiler
never considers performing the da[i] loads in the loop body as vector loads.

Vectorized loading is only performed when two arrays are loaded and
combined using a binary operator. So, the following test does get
vectorized and, as a consequence, is then vector reduced.

  @Benchmark
  public void testVecMaxDoubleReduce3() {
      double max = 0.0;
      for (int z = 0; z < COUNT; z++) {
	  for (int i = 0; i < LENGTH; i++) {
	      max = Math.max(max, da[i] + db[i]);
	  }
      }
      dc[0] = max;
  }

In this case the compiler spots that the adds refer to two elements of
da and db using the same index and decides that it can perform the add
as a 2D vector op. Now that it has the sum as a 2D value it is able to
use the 2D FPMax reduction rule to compute the value of the Math.max
call. This also works for the other 3 cases where a min and/or float
type is substituted.
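
A hand-written Java model of that transformation may help. This is
only an illustration of the shape of the computation, not what C2
literally emits: the class and method names are hypothetical, and the
two local variables stand in for the two lanes of a 2D vector.

```java
// Hypothetical model of the 2-lane vectorized max reduction; not compiler output.
final class VectorizedMaxSketch {

    // The loop as written in the benchmark body.
    static double scalarLoop(double[] da, double[] db) {
        double max = 0.0;
        for (int i = 0; i < da.length; i++) {
            max = Math.max(max, da[i] + db[i]);
        }
        return max;
    }

    // The same loop processed two elements at a time.
    static double twoLaneLoop(double[] da, double[] db) {
        double max = 0.0;
        int i = 0;
        for (; i + 1 < da.length; i += 2) {
            // models the 2D vector add of da[i..i+1] and db[i..i+1]
            double lane0 = da[i] + db[i];
            double lane1 = da[i + 1] + db[i + 1];
            // models the reduction rule: max across lanes, then against max
            max = Math.max(Math.max(lane0, lane1), max);
        }
        // leftover element when the length is odd
        for (; i < da.length; i++) {
            max = Math.max(max, da[i] + db[i]);
        }
        return max;
    }
}
```

Both methods return the same result for the same arrays. The two-lane
version just makes explicit where the vector add and the reduction
step sit.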

I got the following results (in us/op) from running these tests on the AMD box:

Benchmark                  No Redn       Redn           Full Redn
testVecMaxDoubleReduce2    6042 ± 0.47   6041 ±  0.17   6042 ±   0.37
testVecMaxDoubleReduce3    6042 ± 0.57   6042 ±  0.55   3576 ± 143.86

testVecMaxFloatReduce2     6041 ± 0.05   6042 ±  0.47   6042 ±   0.38
testVecMaxFloatReduce3     6042 ± 0.31   1556 ± 17.25   1562 ±  21.96

testVecMinDoubleReduce2    6041 ± 0.05   6042 ±  0.34   6042 ±   0.39
testVecMinDoubleReduce3    6042 ± 0.36   6050 ±  7.20   3322 ±  23.41

testVecMinFloatReduce2     6042 ± 0.49   6042 ±  0.58   6042 ±   0.30
testVecMinFloatReduce3     6042 ± 0.44   1535 ±  3.24   1581 ±  61.56

The 3 columns show no reduction, reduction using Pengfei's T4S rule and
reduction using both T4S and T2D rules. In all 3 cases the FpMin/Max
intrinsic is enabled. As you can see the Reduce2 tests don't get
reduced. The Reduce3 tests do get reduced with both sets of reduction
rules (I verified this in the debugger and by eyeballing the generated
code). For the T4S cases there is a clear improvement with both sets of
rules. For the 2D case Pengfei's rules don't give a better result but
mine do (when you look at the generated code it is clear that Pengfei's
reduction rule is not going to give a better result over the no
reduction case).

Results on the Qualcomm box were pretty much identical.

n.b. the variation in the average values and spreads for some of the
float and double runs is misleading. My machine crashed while running
the tests remotely and I ended up perturbing the box I was running the
tests on when I re-established a login session and checked the progress
of the runs. Previous tests provided more consistent results where the
FloatReduce3 values for vec and vecplus were very close and had a very
small spread (as you would expect) and where the DoubleReduce3 values
for vecplus were also both close to 3300 with a very small spread.

So, these figures seem to show that the use of fmaxv reduction for both
T4S and T2D saves cycles by avoiding the need for extra vector register
element transfers and fpmax/min comparisons.

regards,

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander