RFR: 8337062: x86_64: Unordered add/mul reduction support for vector api [v4]

Tue Jul 30 22:45:33 UTC 2024

On Thu, 25 Jul 2024 23:40:45 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:

>> The Vector API doesn't define an order for reductions. The requires_strict_order flag was recently added as part of [JDK-8320725](https://bugs.openjdk.org/browse/JDK-8320725) to identify whether a reduction should be ordered or unordered. This PR uses that flag to implement efficient unordered Vector API reductions for floating-point add/mul on x86_64.
>> 
>> Performance for add reduction before:
>> Benchmark                 (size)   Mode  Cnt     Score    Error   Units
>> Float128Vector.ADDLanes     1024  thrpt    5  4667.317 ±  0.456  ops/ms
>> Float256Vector.ADDLanes     1024  thrpt    5  5861.845 ±  0.933  ops/ms
>> Float512Vector.ADDLanes     1024  thrpt    5  4831.763 ± 36.330  ops/ms
>> Double128Vector.ADDLanes    1024  thrpt    5  2402.777 ±  0.814  ops/ms
>> Double256Vector.ADDLanes    1024  thrpt    5  4628.929 ±  1.638  ops/ms
>> Double512Vector.ADDLanes    1024  thrpt    5  4327.784 ± 13.728  ops/ms
>> 
>> Performance for add reduction after:
>> Benchmark                 (size)   Mode  Cnt      Score     Error   Units
>> Float128Vector.ADDLanes     1024  thrpt    5   4879.820 ±   7.407  ops/ms
>> Float256Vector.ADDLanes     1024  thrpt    5   9614.422 ±   4.621  ops/ms
>> Float512Vector.ADDLanes     1024  thrpt    5  15007.357 ±  57.316  ops/ms
>> Double128Vector.ADDLanes    1024  thrpt    5   2443.077 ±   1.694  ops/ms
>> Double256Vector.ADDLanes    1024  thrpt    5   4873.086 ±   1.680  ops/ms
>> Double512Vector.ADDLanes    1024  thrpt    5   9485.805 ±  31.852  ops/ms
>> 
>> Performance for mul reduction before:
>> Benchmark                 (size)   Mode  Cnt     Score    Error   Units
>> Float128Vector.MULLanes     1024  thrpt    5  4692.669 ±  3.555  ops/ms
>> Float256Vector.MULLanes     1024  thrpt    5  5866.017 ±  7.740  ops/ms
>> Float512Vector.MULLanes     1024  thrpt    5  4852.888 ± 46.561  ops/ms
>> Double128Vector.MULLanes    1024  thrpt    5  2402.173 ±  1.795  ops/ms
>> Double256Vector.MULLanes    1024  thrpt    5  4646.541 ±  2.136  ops/ms
>> Double512Vector.MULLanes    1024  thrpt    5  4292.133 ± 19.717  ops/ms
>> 
>> Performance for mul reduction after:
>> Benchmark                 (size)   Mode  Cnt      Score    Error   Units
>> Float128Vector.MULLanes     1024  thrpt    5   4885.890 ±  1.386  ops/ms
>> Float256Vector.MULLanes     1024  thrpt    5   9441.757 ± 46.048  ops/ms
>> Float512Vector.MULLanes     1024  thrpt    5  15091.997 ± 60.052  ops/ms
>> Double128Vector.MULLanes    1024  thrpt    5   2444.268 ±  1.677  ops/ms
>> Double256...
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Jatin review comment resolution

@vnkozlov Please let me know whether you could run this PR through your testing. Any review comments are welcome too. I am hoping to integrate this in the next couple of days.
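
For anyone following the thread, here is a rough plain-Java sketch of why the ordered/unordered distinction in the description matters for floating point. It is not the code in this PR and does not use the Vector API or the x86_64 intrinsics. The class and method names (`ReductionOrderSketch`, `orderedSum`, `pairwiseSum`) are made up for illustration, and the pairwise tree is only one possible unordered grouping, assumed here for simplicity.

```java
public class ReductionOrderSketch {
    // Strictly ordered reduction: ((a0 + a1) + a2) + ...
    // This is the shape a strict-order reduction has to preserve.
    static float orderedSum(float[] a) {
        float acc = 0.0f;
        for (float v : a) {
            acc += v;
        }
        return acc;
    }

    // One possible unordered grouping (illustrative assumption):
    // split the range in half, reduce each half, then combine.
    static float pairwiseSum(float[] a, int lo, int hi) {
        if (hi - lo == 1) {
            return a[lo];
        }
        int mid = (lo + hi) >>> 1;
        return pairwiseSum(a, lo, mid) + pairwiseSum(a, mid, hi);
    }

    public static void main(String[] args) {
        // Float addition is not associative, so the grouping changes the result.
        float[] lanes = {1.0e8f, 1.0f, -1.0e8f, 1.0f};
        System.out.println("ordered  = " + orderedSum(lanes));
        System.out.println("pairwise = " + pairwiseSum(lanes, 0, lanes.length));
    }
}
```

With these four values the two groupings give different sums, which is why an implementation may only reassociate the lanes when the reduction is known not to require strict order.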

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20306#issuecomment-2259325215