RFR: 8337062: x86_64: Unordered add/mul reduction support for vector api [v4]
Sandhya Viswanathan
sviswanathan at openjdk.org
Tue Jul 30 22:45:33 UTC 2024
On Thu, 25 Jul 2024 23:40:45 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:
>> The Vector API does not define an order for reductions. The requires_strict_order flag was recently added as part of [JDK-8320725](https://bugs.openjdk.org/browse/JDK-8320725) to identify whether a reduction should be ordered or unordered. This change uses that flag to implement efficient unordered Vector API reductions for floating-point add/mul on x86_64.
>>
>> Performance for add reduction before:
>> Benchmark                 (size)   Mode  Cnt      Score    Error  Units
>> Float128Vector.ADDLanes     1024  thrpt    5   4667.317 ±  0.456  ops/ms
>> Float256Vector.ADDLanes     1024  thrpt    5   5861.845 ±  0.933  ops/ms
>> Float512Vector.ADDLanes     1024  thrpt    5   4831.763 ± 36.330  ops/ms
>> Double128Vector.ADDLanes    1024  thrpt    5   2402.777 ±  0.814  ops/ms
>> Double256Vector.ADDLanes    1024  thrpt    5   4628.929 ±  1.638  ops/ms
>> Double512Vector.ADDLanes    1024  thrpt    5   4327.784 ± 13.728  ops/ms
>>
>> Performance for add reduction after:
>> Benchmark                 (size)   Mode  Cnt      Score    Error  Units
>> Float128Vector.ADDLanes     1024  thrpt    5   4879.820 ±  7.407  ops/ms
>> Float256Vector.ADDLanes     1024  thrpt    5   9614.422 ±  4.621  ops/ms
>> Float512Vector.ADDLanes     1024  thrpt    5  15007.357 ± 57.316  ops/ms
>> Double128Vector.ADDLanes    1024  thrpt    5   2443.077 ±  1.694  ops/ms
>> Double256Vector.ADDLanes    1024  thrpt    5   4873.086 ±  1.680  ops/ms
>> Double512Vector.ADDLanes    1024  thrpt    5   9485.805 ± 31.852  ops/ms
>>
>> Performance for mul reduction before:
>> Benchmark                 (size)   Mode  Cnt      Score    Error  Units
>> Float128Vector.MULLanes     1024  thrpt    5   4692.669 ±  3.555  ops/ms
>> Float256Vector.MULLanes     1024  thrpt    5   5866.017 ±  7.740  ops/ms
>> Float512Vector.MULLanes     1024  thrpt    5   4852.888 ± 46.561  ops/ms
>> Double128Vector.MULLanes    1024  thrpt    5   2402.173 ±  1.795  ops/ms
>> Double256Vector.MULLanes    1024  thrpt    5   4646.541 ±  2.136  ops/ms
>> Double512Vector.MULLanes    1024  thrpt    5   4292.133 ± 19.717  ops/ms
>>
>> Performance for mul reduction after:
>> Benchmark                 (size)   Mode  Cnt      Score    Error  Units
>> Float128Vector.MULLanes     1024  thrpt    5   4885.890 ±  1.386  ops/ms
>> Float256Vector.MULLanes     1024  thrpt    5   9441.757 ± 46.048  ops/ms
>> Float512Vector.MULLanes     1024  thrpt    5  15091.997 ± 60.052  ops/ms
>> Double128Vector.MULLanes    1024  thrpt    5   2444.268 ±  1.677  ops/ms
>> Double256...
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
> Jatin review comment resolution
@vnkozlov Please let me know whether you could run this PR through your testing. Any review comments are welcome as well. I am hoping to integrate this in the next couple of days.
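
For anyone following the thread, the ordered/unordered distinction from the PR description can be sketched in plain Java. This is only an illustration and not the HotSpot code in this PR: the class and method names (`ReductionOrderSketch`, `orderedSum`, `pairwiseSum`) are hypothetical, and the pairwise tree is just one association an unordered reduction might choose. It shows why the order matters for floating point: addition is not associative, so the two orders can give different results.

```java
public class ReductionOrderSketch {
    // Strictly ordered reduction: ((l0 + l1) + l2) + l3 ...
    // Each add depends on the previous one, forming a serial chain.
    static float orderedSum(float[] lanes) {
        float acc = 0.0f;
        for (float v : lanes) {
            acc += v;
        }
        return acc;
    }

    // One possible unordered reduction: fold the upper half onto the
    // lower half until a single lane remains.
    // Assumes lanes.length is a non-zero power of two.
    static float pairwiseSum(float[] lanes) {
        float[] work = lanes.clone();
        for (int width = work.length / 2; width >= 1; width /= 2) {
            for (int i = 0; i < width; i++) {
                work[i] = work[i] + work[i + width];
            }
        }
        return work[0];
    }

    public static void main(String[] args) {
        // 1e8f + 1.0f rounds back to 1e8f in float, so the order of the
        // additions changes the final value.
        float[] lanes = {1e8f, 1.0f, -1e8f, 1.0f};
        System.out.println("ordered:  " + orderedSum(lanes));
        System.out.println("pairwise: " + pairwiseSum(lanes));
    }
}
```

With these inputs the ordered sum is 1.0 and the pairwise sum is 2.0. Since the Vector API does not define a reduction order, either association is a permissible result there, and the pairwise form needs only log2(n) dependent steps instead of n.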
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20306#issuecomment-2259325215