RFR: 8337062: x86_64: Unordered add/mul reduction support for vector api [v3]
Jatin Bhateja
jbhateja at openjdk.org
Thu Jul 25 03:01:35 UTC 2024
On Wed, 24 Jul 2024 20:35:46 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:
>> Vector API doesn't define an order on reduction. The requires_strict_order flag was recently added as part of [JDK-8320725](https://bugs.openjdk.org/browse/JDK-8320725) to identify if a reduction should be ordered or unordered. This flag is used to implement efficient vector api unordered reduction for floating point add/mul on x86_64.
>>
>> Performance for add reduction before:
>> Benchmark (size) Mode Cnt Score Error Units
>> Float128Vector.ADDLanes 1024 thrpt 5 4667.317 ± 0.456 ops/ms
>> Float256Vector.ADDLanes 1024 thrpt 5 5861.845 ± 0.933 ops/ms
>> Float512Vector.ADDLanes 1024 thrpt 5 4831.763 ± 36.330 ops/ms
>> Double128Vector.ADDLanes 1024 thrpt 5 2402.777 ± 0.814 ops/ms
>> Double256Vector.ADDLanes 1024 thrpt 5 4628.929 ± 1.638 ops/ms
>> Double512Vector.ADDLanes 1024 thrpt 5 4327.784 ± 13.728 ops/ms
>>
>> Performance for add reduction after:
>> Benchmark (size) Mode Cnt Score Error Units
>> Float128Vector.ADDLanes 1024 thrpt 5 4879.820 ± 7.407 ops/ms
>> Float256Vector.ADDLanes 1024 thrpt 5 9614.422 ± 4.621 ops/ms
>> Float512Vector.ADDLanes 1024 thrpt 5 15007.357 ± 57.316 ops/ms
>> Double128Vector.ADDLanes 1024 thrpt 5 2443.077 ± 1.694 ops/ms
>> Double256Vector.ADDLanes 1024 thrpt 5 4873.086 ± 1.680 ops/ms
>> Double512Vector.ADDLanes 1024 thrpt 5 9485.805 ± 31.852 ops/ms
>>
>> Performance for mul reduction before:
>> Benchmark (size) Mode Cnt Score Error Units
>> Float128Vector.MULLanes 1024 thrpt 5 4692.669 ± 3.555 ops/ms
>> Float256Vector.MULLanes 1024 thrpt 5 5866.017 ± 7.740 ops/ms
>> Float512Vector.MULLanes 1024 thrpt 5 4852.888 ± 46.561 ops/ms
>> Double128Vector.MULLanes 1024 thrpt 5 2402.173 ± 1.795 ops/ms
>> Double256Vector.MULLanes 1024 thrpt 5 4646.541 ± 2.136 ops/ms
>> Double512Vector.MULLanes 1024 thrpt 5 4292.133 ± 19.717 ops/ms
>>
>> Performance for mul reduction after:
>> Benchmark (size) Mode Cnt Score Error Units
>> Float128Vector.MULLanes 1024 thrpt 5 4885.890 ± 1.386 ops/ms
>> Float256Vector.MULLanes 1024 thrpt 5 9441.757 ± 46.048 ops/ms
>> Float512Vector.MULLanes 1024 thrpt 5 15091.997 ± 60.052 ops/ms
>> Double128Vector.MULLanes 1024 thrpt 5 2444.268 ± 1.677 ops/ms
>> Double256...
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
> review comment resolution
Nice work Sandhya, few comments.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1795:
> 1793: case Op_MulReductionVF: mulps(dst, src); break;
> 1794: case Op_MulReductionVD: mulpd(dst, src); break;
> 1795: default: assert(false, "wrong opcode");
Suggestion:
default: assert(false, "%s", NodeClassNames[opcode]);
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1855:
> 1853: case Op_MulReductionVF: vmulps(dst, src1, src2, vector_len); break;
> 1854: case Op_MulReductionVD: vmulpd(dst, src1, src2, vector_len); break;
> 1855: default: assert(false, "wrong opcode");
Suggestion:
default: assert(false, "%s", NodeClassNames[opcode]);
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1891:
> 1889: break;
> 1890:
> 1891: default: assert(false, "wrong opcode");
Suggestion:
default: assert(false, "%s", NodeClassNames[opcode]);
src/hotspot/cpu/x86/x86.ad line 5155:
> 5153:
> 5154: instruct unordered_reduction2F(regF dst, regF src1, vec src2) %{
> 5155: // Non-strictly ordered floating-point add reduction for doubles. This rule is
Use of doubles in the comments looks incorrect for since patterns corresponds to float type.
src/hotspot/cpu/x86/x86.ad line 5172:
> 5170:
> 5171: instruct unordered_reduction4F(regF dst, regF src1, vec src2, vec vtmp) %{
> 5172: // Non-strictly ordered floating-point add reduction for doubles. This rule is
Use of doubles in the comments looks incorrect for since patterns corresponds to float type.
src/hotspot/cpu/x86/x86.ad line 5189:
> 5187:
> 5188: instruct unordered_reduction8F(regF dst, regF src1, vec src2, vec vtmp1, vec vtmp2) %{
> 5189: // Non-strictly ordered floating-point add reduction for doubles. This rule is
Use of doubles in the comments looks incorrect for since patterns corresponds to float type.
test/jdk/jdk/incubator/vector/templates/Unit-header.template line 103:
> 101:
> 102: // for floating point multiplication reduction ops that may introduce rounding errors
> 103: private static final $type$ RELATIVE_ROUNDING_ERROR_FACTOR_MUL = ($type$)100.0;
Instead of taking a fixed rounding factor of 100.0, why do you not make it equivalent to vector length ?
-------------
PR Review: https://git.openjdk.org/jdk/pull/20306#pullrequestreview-2198105137
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690679989
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690680135
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690680325
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690682656
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690683407
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690683509
PR Review Comment: https://git.openjdk.org/jdk/pull/20306#discussion_r1690719459
More information about the hotspot-compiler-dev
mailing list