RFR: 8337062: x86_64: Unordered add/mul reduction support for vector api [v3]

Thu Jul 25 03:01:35 UTC 2024

On Wed, 24 Jul 2024 20:35:46 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:

>> Vector API doesn't define an order on reduction. The requires_strict_order flag was recently added as part of [JDK-8320725](https://bugs.openjdk.org/browse/JDK-8320725) to identify if a reduction should be ordered or unordered. This flag is used to implement efficient vector api unordered reduction for floating point add/mul on x86_64.
>> 
>> Performance for add reduction before:
>> Benchmark                 (size)   Mode  Cnt     Score    Error   Units
>> Float128Vector.ADDLanes     1024  thrpt    5  4667.317 ±  0.456  ops/ms
>> Float256Vector.ADDLanes     1024  thrpt    5  5861.845 ±  0.933  ops/ms
>> Float512Vector.ADDLanes     1024  thrpt    5  4831.763 ± 36.330  ops/ms
>> Double128Vector.ADDLanes    1024  thrpt    5  2402.777 ±  0.814  ops/ms
>> Double256Vector.ADDLanes    1024  thrpt    5  4628.929 ±  1.638  ops/ms
>> Double512Vector.ADDLanes    1024  thrpt    5  4327.784 ± 13.728  ops/ms
>> 
>> Performance for add reduction after:
>> Benchmark                 (size)   Mode  Cnt      Score     Error   Units
>> Float128Vector.ADDLanes     1024  thrpt    5   4879.820 ±   7.407  ops/ms
>> Float256Vector.ADDLanes     1024  thrpt    5   9614.422 ±   4.621  ops/ms
>> Float512Vector.ADDLanes     1024  thrpt    5  15007.357 ±  57.316  ops/ms
>> Double128Vector.ADDLanes    1024  thrpt    5   2443.077 ±   1.694  ops/ms
>> Double256Vector.ADDLanes    1024  thrpt    5   4873.086 ±   1.680  ops/ms
>> Double512Vector.ADDLanes    1024  thrpt    5   9485.805 ±  31.852  ops/ms
>> 
>> Performance for mul reduction before:
>> Benchmark                 (size)   Mode  Cnt     Score    Error   Units
>> Float128Vector.MULLanes     1024  thrpt    5  4692.669 ±  3.555  ops/ms
>> Float256Vector.MULLanes     1024  thrpt    5  5866.017 ±  7.740  ops/ms
>> Float512Vector.MULLanes     1024  thrpt    5  4852.888 ± 46.561  ops/ms
>> Double128Vector.MULLanes    1024  thrpt    5  2402.173 ±  1.795  ops/ms
>> Double256Vector.MULLanes    1024  thrpt    5  4646.541 ±  2.136  ops/ms
>> Double512Vector.MULLanes    1024  thrpt    5  4292.133 ± 19.717  ops/ms
>> 
>> Performance for mul reduction after:
>> Benchmark                 (size)   Mode  Cnt      Score    Error   Units
>> Float128Vector.MULLanes     1024  thrpt    5   4885.890 ±  1.386  ops/ms
>> Float256Vector.MULLanes     1024  thrpt    5   9441.757 ± 46.048  ops/ms
>> Float512Vector.MULLanes     1024  thrpt    5  15091.997 ± 60.052  ops/ms
>> Double128Vector.MULLanes    1024  thrpt    5   2444.268 ±  1.677  ops/ms
>> Double256...
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
> 
>   review comment resolution

Nice work Sandhya, a few comments.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1795:

> 1793:     case Op_MulReductionVF: mulps(dst, src); break;
> 1794:     case Op_MulReductionVD: mulpd(dst, src); break;
> 1795:     default:                assert(false, "wrong opcode");

Suggestion:

    default: assert(false, "%s", NodeClassNames[opcode]);

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1855:

> 1853:     case Op_MulReductionVF: vmulps(dst, src1, src2, vector_len); break;
> 1854:     case Op_MulReductionVD: vmulpd(dst, src1, src2, vector_len); break;
> 1855:     default:                assert(false, "wrong opcode");

Suggestion:

        default: assert(false, "%s", NodeClassNames[opcode]);

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1891:

> 1889:       break;
> 1890: 
> 1891:     default: assert(false, "wrong opcode");

Suggestion:

       default: assert(false, "%s", NodeClassNames[opcode]);

src/hotspot/cpu/x86/x86.ad line 5155:

> 5153: 
> 5154: instruct unordered_reduction2F(regF dst, regF src1, vec src2) %{
> 5155:   // Non-strictly ordered floating-point add reduction for doubles. This rule is

The use of "doubles" in the comment looks incorrect, since this pattern corresponds to the float type.

src/hotspot/cpu/x86/x86.ad line 5172:

> 5170: 
> 5171: instruct unordered_reduction4F(regF dst, regF src1, vec src2, vec vtmp) %{
> 5172:   // Non-strictly ordered floating-point add reduction for doubles. This rule is

The use of "doubles" in the comment looks incorrect, since this pattern corresponds to the float type.

src/hotspot/cpu/x86/x86.ad line 5189:

> 5187: 
> 5188: instruct unordered_reduction8F(regF dst, regF src1, vec src2, vec vtmp1, vec vtmp2) %{
> 5189:   // Non-strictly ordered floating-point add reduction for doubles. This rule is

The use of "doubles" in the comment looks incorrect, since this pattern corresponds to the float type.

test/jdk/jdk/incubator/vector/templates/Unit-header.template line 103:

> 101: 
> 102:     // for floating point multiplication reduction ops that may introduce rounding errors
> 103:     private static final $type$ RELATIVE_ROUNDING_ERROR_FACTOR_MUL = ($type$)100.0;

Instead of using a fixed rounding factor of 100.0, why not make it equal to the vector length?
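
To make the idea concrete, here is a rough Java sketch of a tolerance that scales with the lane count. It assumes the factor multiplies one ulp of the expected value, which is a guess about how the test uses the constant, not something taken from the template. The class and method names (`ReductionTolerance`, `tolerance`, `closeEnough`) are hypothetical, and the pairwise product is only one possible unordered evaluation order, not necessarily the one the intrinsic uses.

```java
// Hypothetical helper, not part of the JDK test templates.
final class ReductionTolerance {

    // Strictly ordered product, as a scalar reference loop would compute it.
    static float sequentialMul(float[] a) {
        float r = 1.0f;
        for (float v : a) {
            r *= v;
        }
        return r;
    }

    // Pairwise (tree) product: one possible unordered evaluation order.
    // Assumes a non-empty array whose length is a power of two.
    static float pairwiseMul(float[] a) {
        float[] w = a.clone();
        for (int n = w.length; n > 1; n /= 2) {
            for (int i = 0; i < n / 2; i++) {
                w[i] = w[i] * w[i + n / 2];
            }
        }
        return w[0];
    }

    // Assumed tolerance: lane count times one ulp of the expected value.
    // This is a heuristic, not a proven error bound.
    static float tolerance(float expected, int lanes) {
        return lanes * Math.ulp(expected);
    }

    // True when the two results differ by no more than the assumed tolerance.
    static boolean closeEnough(float expected, float actual, int lanes) {
        return Math.abs(expected - actual) <= tolerance(expected, lanes);
    }

    public static void main(String[] args) {
        float[] a = {1.1f, 0.9f, 1.3f, 0.7f, 1.2f, 0.8f, 1.05f, 0.95f};
        float expected = sequentialMul(a);
        float actual = pairwiseMul(a);
        System.out.println("sequential = " + expected + ", pairwise = " + actual);
        System.out.println("within tolerance: " + closeEnough(expected, actual, a.length));
    }
}
```

With a scheme like this the allowed error grows with the number of multiplications in the reduction, so a 16-lane vector gets a wider margin than a 4-lane one instead of both sharing the constant 100.0.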

-------------

PR Review: https://git.openjdk.org/jdk/pull/20306#pullrequestreview-2198105137
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690679989
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690680135
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690680325
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690682656
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690683407
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690683509
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690719459