RFR: 8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long) [v12]
Galder Zamarreño
galder at openjdk.org
Thu Feb 20 06:27:57 UTC 2025
On Wed, 19 Feb 2025 19:50:50 GMT, Evgeny Astigeevich <eastigeevich at openjdk.org> wrote:
>> I will run a comparison next with the same batch of tests but looking at `int` and see if there are any differences compared with `long` or not.
>
> Hi @galderz,
> Results from Graviton 3(Neoverse-V1).
> Without the patch:
>
> Benchmark                       (probability)  (range)  (seed)  (size)   Mode  Cnt      Score    Error  Units
> MinMaxVector.intClippingRange             N/A       90       0    1000  thrpt    8  12565.427 ± 37.538  ops/ms
> MinMaxVector.intClippingRange             N/A      100       0    1000  thrpt    8  12462.072 ± 84.067  ops/ms
> MinMaxVector.intLoopMax                    50      N/A     N/A    2048  thrpt    8   5113.090 ± 68.720  ops/ms
> MinMaxVector.intLoopMax                    80      N/A     N/A    2048  thrpt    8   5129.857 ± 35.005  ops/ms
> MinMaxVector.intLoopMax                   100      N/A     N/A    2048  thrpt    8   5116.081 ±  8.946  ops/ms
> MinMaxVector.intLoopMin                    50      N/A     N/A    2048  thrpt    8   6174.544 ± 52.573  ops/ms
> MinMaxVector.intLoopMin                    80      N/A     N/A    2048  thrpt    8   6110.884 ± 54.447  ops/ms
> MinMaxVector.intLoopMin                   100      N/A     N/A    2048  thrpt    8   6178.661 ± 48.450  ops/ms
> MinMaxVector.intReductionMax               50      N/A     N/A    2048  thrpt    8   5109.270 ± 10.525  ops/ms
> MinMaxVector.intReductionMax               80      N/A     N/A    2048  thrpt    8   5123.426 ± 28.229  ops/ms
> MinMaxVector.intReductionMax              100      N/A     N/A    2048  thrpt    8   5133.799 ±  7.693  ops/ms
> MinMaxVector.intReductionMin               50      N/A     N/A    2048  thrpt    8   5130.209 ± 15.491  ops/ms
> MinMaxVector.intReductionMin               80      N/A     N/A    2048  thrpt    8   5127.823 ± 27.767  ops/ms
> MinMaxVector.intReductionMin              100      N/A     N/A    2048  thrpt    8   5118.217 ± 22.186  ops/ms
> MinMaxVector.longClippingRange            N/A       90       0    1000  thrpt    8   1831.026 ± 15.502  ops/ms
> MinMaxVector.longClippingRange            N/A      100       0    1000  thrpt    8   1827.194 ± 22.076  ops/ms
> MinMaxVector.longLoopMax                   50      N/A     N/A    2048  thrpt    8   2643.383 ±  9.830  ops/ms
> MinMaxVector.longLoopMax                   80      N/A     N/A    2048  thrpt    8   2640.417 ±  7.797  ops/ms
> MinMaxVector.longLoopMax                  100      N/A     N/A    2048  thrpt    8   1244.321 ±  1.001  ops/ms
> MinMaxVector.longLoopMin                   50      N/A     N/A    2048  thrpt    8   3239.234 ±  8.813  ops/ms
> MinMaxVector.longLoopMin                   80      N/A     N/A    2048  thrpt    8   3252.713 ± 3...
Thanks @eastig for the results on Graviton 3. I'm summarising them here:
Benchmark                       (probability)  (range)  (seed)  (size)   Mode  Cnt      Base     Patch  Units
MinMaxVector.longClippingRange            N/A       90       0    1000  thrpt    8  1831.026  5094.259  ops/ms (+178%)
MinMaxVector.longClippingRange            N/A      100       0    1000  thrpt    8  1827.194  5096.835  ops/ms (+180%)
MinMaxVector.longLoopMax                   50      N/A     N/A    2048  thrpt    8  2643.383  2636.438  ops/ms
MinMaxVector.longLoopMax                   80      N/A     N/A    2048  thrpt    8  2640.417  2644.069  ops/ms
MinMaxVector.longLoopMax                  100      N/A     N/A    2048  thrpt    8  1244.321  2646.250  ops/ms (+112%)
MinMaxVector.longLoopMin                   50      N/A     N/A    2048  thrpt    8  3239.234  2648.504  ops/ms (-18%)
MinMaxVector.longLoopMin                   80      N/A     N/A    2048  thrpt    8  3252.713  2658.082  ops/ms (-18%)
MinMaxVector.longLoopMin                  100      N/A     N/A    2048  thrpt    8  1204.370  2647.532  ops/ms (+119%)
MinMaxVector.longReductionMax              50      N/A     N/A    2048  thrpt    8  2536.322  2536.254  ops/ms
MinMaxVector.longReductionMax              80      N/A     N/A    2048  thrpt    8  2536.318  2536.209  ops/ms
MinMaxVector.longReductionMax             100      N/A     N/A    2048  thrpt    8  1395.273  2536.342  ops/ms (+81%)
MinMaxVector.longReductionMin              50      N/A     N/A    2048  thrpt    8  2536.325  2536.271  ops/ms
MinMaxVector.longReductionMin              80      N/A     N/A    2048  thrpt    8  2536.265  2536.250  ops/ms
MinMaxVector.longReductionMin             100      N/A     N/A    2048  thrpt    8  1389.982  2536.246  ops/ms (+82%)
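The percentages in the last column are the relative change of Patch over Base, i.e. (Patch / Base - 1) * 100. A minimal sketch of that arithmetic follows; the class and method names are made up for illustration, and truncating toward zero is an assumption about how the figures were rounded, not something stated in the results.

```java
// Hypothetical helper, not part of the PR or the benchmark.
final class PercentChange {
    // Relative change of patch over base, in whole percent.
    // Assumption: the fractional part is dropped (truncation toward zero).
    static long percentChange(double base, double patch) {
        return (long) ((patch / base - 1.0) * 100.0);
    }

    public static void main(String[] args) {
        // longClippingRange, range 90
        System.out.println(percentChange(1831.026, 5094.259)); // prints 178
        // longLoopMin, probability 50
        System.out.println(percentChange(3239.234, 2648.504)); // prints -18
    }
}
```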
On Graviton 3 the registers are wide enough for vectorization to kick in, so we see improvements similar to those on x64 AVX-512 in https://github.com/openjdk/jdk/pull/20098#issuecomment-2642788364. There is some variance at the 50% and 80% probabilities, where longLoopMin is 18% lower with the patch. This was also observed to a lesser degree on x64, but it looks more pronounced on the aarch64 system. It is interesting that it happened with min but not max, although that could also be down to variance.
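For context, here is a minimal sketch of the loop shapes that the benchmark names suggest. It is not the actual MinMaxVector source: the class name, method names, and inputs are assumptions for illustration. These are the kinds of loops over `long` data in which calls to `Math.max(long,long)` and `Math.min(long,long)` appear.

```java
import java.util.Arrays;

// Hypothetical illustration only; the real benchmark may be structured differently.
final class MinMaxLoopSketch {
    // "Loop" shape: element-wise max of two arrays.
    static void loopMax(long[] a, long[] b, long[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.max(a[i], b[i]);
        }
    }

    // "Reduction" shape: max over a single array.
    static long reductionMax(long[] a) {
        long acc = Long.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            acc = Math.max(acc, a[i]);
        }
        return acc;
    }

    // "Clipping range" shape: clamp each element into [lo, hi].
    static void clippingRange(long[] a, long lo, long hi, long[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.min(Math.max(a[i], lo), hi);
        }
    }

    public static void main(String[] args) {
        long[] out = new long[3];
        loopMax(new long[] {1, 5, -2}, new long[] {4, 2, -3}, out);
        System.out.println(Arrays.toString(out)); // prints [4, 5, -2]
        System.out.println(reductionMax(new long[] {3, -7, 42, 0})); // prints 42
        clippingRange(new long[] {-5, 50, 150}, 0, 100, out);
        System.out.println(Arrays.toString(out)); // prints [0, 50, 100]
    }
}
```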
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20098#issuecomment-2670574593