RFR: 8258932: AArch64: Enhance floating-point Min/MaxReductionV with fminp/fmaxp [v3]

Mon Jan 11 11:41:14 UTC 2021

On Mon, 11 Jan 2021 10:38:41 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Hi,
>> 
>> I was wrong to say that the code is not vectorized with `COUNT < 12`; it seems the percentage of vectorized code is too small to be caught by `JMH perfasm`.
>> To observe whether `Min/MaxReductionVNode` is created, I added an explicit print in `ReductionNode::make`, like:
>> --- a/src/hotspot/share/opto/vectornode.cpp
>> +++ b/src/hotspot/share/opto/vectornode.cpp
>> @@ -961,7 +961,9 @@ ReductionNode* ReductionNode::make(int opc, Node *ctrl, Node* n1, Node* n2, Basi
>>    case Op_MinReductionV:  return new MinReductionVNode(ctrl, n1, n2);
>> -  case Op_MaxReductionV:  return new MaxReductionVNode(ctrl, n1, n2);
>> +  case Op_MaxReductionV:
>> +    warning("in ReductionNode::make, making a MaxReductionVNode, length %d", n2->bottom_type()->is_vect()->length());
>> +    return new MaxReductionVNode(ctrl, n1, n2);
>>    case Op_AndReductionV:  return new AndReductionVNode(ctrl, n1, n2);
>> 
>> In my observation, we get `Max4F` when `COUNT >= 4`; it is reasonable to create `Max4F` rather than `Max2F`.
>> The `Max2F` is created with `COUNT == 3` and `-XX:-SuperWordLoopUnrollAnalysis`.
>> But I did not find any noticeable improvements with such a small percentage.
>> 
>> The JMH benchmark has been updated; the performance results are:
>> Benchmark                              (COUNT_DOUBLE)  (COUNT_FLOAT)  (seed)  Mode  Cnt    Score   Error  Units
>> # Kunpeng 916, default
>> VectorReductionFloatingMinMax.maxRedD             512              3       0  avgt   10  677.778 ± 0.694  ns/op
>> VectorReductionFloatingMinMax.maxRedF             512              3       0  avgt   10   21.016 ± 0.097  ns/op
>> VectorReductionFloatingMinMax.minRedD             512              3       0  avgt   10  677.633 ± 0.664  ns/op
>> VectorReductionFloatingMinMax.minRedF             512              3       0  avgt   10   21.001 ± 0.019  ns/op
>> # Kunpeng 916, fmaxp/fminp
>> VectorReductionFloatingMinMax.maxRedD             512              3       0  avgt   10  425.776 ± 0.785  ns/op
>> VectorReductionFloatingMinMax.maxRedF             512              3       0  avgt   10   20.883 ± 0.033  ns/op
>> VectorReductionFloatingMinMax.minRedD             512              3       0  avgt   10  426.177 ± 3.258  ns/op
>> VectorReductionFloatingMinMax.minRedF             512              3       0  avgt   10   20.871 ± 0.044  ns/op
>
> Did you try math.abs() for doubles?

Using `Math.abs(doublesA[i] - doublesB[i])` shows an improvement of about 36%.
I updated the double tests to use `Math.abs()`; the results look more consistent now. Thanks.
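For readers without the PR open, the reduction loop under test has roughly the following shape. This is only a minimal sketch: the class name `ReductionSketch`, the method signatures and the initial accumulator values are assumptions for illustration, not the actual benchmark source; only `doublesA`, `doublesB`, `maxRedD` and `minRedD` are names taken from this thread.

```java
// Hypothetical sketch of the reduction kernels; not the actual JMH benchmark code.
final class ReductionSketch {
    // Max reduction over |a[i] - b[i]|. C2 may turn this loop into a
    // MaxReductionV when it vectorizes it.
    static double maxRedD(double[] doublesA, double[] doublesB) {
        double max = 0.0; // assumed start value; |x| is never below 0
        for (int i = 0; i < doublesA.length; i++) {
            max = Math.max(max, Math.abs(doublesA[i] - doublesB[i]));
        }
        return max;
    }

    // Min reduction over |a[i] - b[i]|, the MinReductionV counterpart.
    static double minRedD(double[] doublesA, double[] doublesB) {
        double min = Double.MAX_VALUE; // assumed start value
        for (int i = 0; i < doublesA.length; i++) {
            min = Math.min(min, Math.abs(doublesA[i] - doublesB[i]));
        }
        return min;
    }
}
```

Whether the loop is actually vectorized depends on the array length and the SuperWord flags, as discussed in the quoted message.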
The JMH results for doubles with `Math.abs()`:
Benchmark                              (COUNT_DOUBLE)  (COUNT_FLOAT)  (seed)  Mode  Cnt    Score   Error  Units
# Kunpeng 916, default
VectorReductionFloatingMinMax.maxRedD             512              3       0  avgt   10  681.319 ± 0.658  ns/op
VectorReductionFloatingMinMax.minRedD             512              3       0  avgt   10  682.596 ± 4.322  ns/op
# Kunpeng 916, fmaxp/fminp
VectorReductionFloatingMinMax.maxRedD             512              3       0  avgt   10  439.130 ± 0.450  ns/op => 35.54%
VectorReductionFloatingMinMax.minRedD             512              3       0  avgt   10  439.105 ± 0.435  ns/op => 35.67%
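
The percentages after `=>` are the relative reduction in ns/op against the default build. A small sketch of that arithmetic follows; the class and method names are hypothetical, and the formula is my reading of how the figures were derived, not something stated in the benchmark output.

```java
// Hypothetical helper: relative improvement of a patched score over a baseline.
final class ImprovementSketch {
    // Both scores are average times (ns/op), so lower is better.
    static double improvementPercent(double baselineNs, double patchedNs) {
        return (baselineNs - patchedNs) / baselineNs * 100.0;
    }

    public static void main(String[] args) {
        // Scores copied from the tables above.
        System.out.printf("maxRedD: %.2f%%%n", improvementPercent(681.319, 439.130));
        System.out.printf("minRedD: %.2f%%%n", improvementPercent(682.596, 439.105));
    }
}
```

With the scores above this gives roughly 35.5% for `maxRedD` and 35.7% for `minRedD`.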

-------------

PR: https://git.openjdk.java.net/jdk/pull/1925