RFR: 8353741: Improve UUID.toString performance by using SIMD within a register instead of table lookup
Johannes Graham
duke at openjdk.org
Fri Apr 4 16:47:19 UTC 2025
On Sat, 11 Jan 2025 05:21:36 GMT, Shaojin Wen <swen at openjdk.org> wrote:
>> Improve the performance of UUID::toString by using Long.expand and SWAR (SIMD within a register) instead of table lookup. Eliminating the table lookup can also avoid the performance degradation problem when the cache misses.
>
> The new implementation improves performance on the aarch64 architecture but results in a performance regression on x64.
>
> ## 1. Script
>
> git remote add wenshao git@github.com:wenshao/jdk.git
> git fetch wenshao
>
> # baseline dfaa89162a3
> git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
> make test TEST="micro:java.util.UUIDBench.toString"
>
> # current c513087056b
> git checkout c513087056be8c1e1a915625e0b425a7ecbb21d6
> make test TEST="micro:java.util.UUIDBench.toString"
>
>
> ## 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)
>
> -Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
> -UUIDBench.toString 20000 thrpt 15 94.274 ± 0.452 ops/us
>
> +Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
> +UUIDBench.toString 20000 thrpt 15 80.241 ± 0.894 ops/us -14.88%
>
>
>
> ## 3. aliyun_ecs_c8i_x64 (CPU Intel® Xeon® Emerald Rapids)
>
> -Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
> -UUIDBench.toString 20000 thrpt 15 85.323 ± 2.044 ops/us
>
> +Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
> +UUIDBench.toString 20000 thrpt 15 73.636 ± 0.590 ops/us -13.69%
>
>
> ## 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)
>
> -Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
> -UUIDBench.toString 20000 thrpt 15 69.286 ± 1.136 ops/us
>
> +Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
> +UUIDBench.toString 20000 thrpt 15 80.475 ± 0.310 ops/us +16.14%
>
>
>
> ## 5. MacBook M1 Pro (aarch64)
>
> -Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
> -UUIDBench.toString 20000 thrpt 15 108.254 ± 1.167 ops/us
>
> +Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
> +UUIDBench.toString 20000 thrpt 15 122.313 ± 0.820 ops/us +12.98%
>
>
>
> ## 6. orange_pi5_aarch64 (CPU RK3588S)
>
> -Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
> -UUIDBench.toString 20000 thrpt 15 37.783 ± 1.553 ops/us
>
> +Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
> +UUIDBench.toString 20000 thrpt 15 42.928 ± 2.534 ops/us +13.61%
>
>
>
>
> ## 7. orange_aipro_aarch64 (CPU TAISHANV200M)
>
> -Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
> -UUIDBench.toString 20000 thrpt 15 13.822 ± 0.203 ops/us
>
> +Benchmark (size) M...
With regard to the aarch64 vector intrinsic, I don't have access to an aarch64 machine to try it on (I'm faking it on x64 by disabling the intrinsic). @wenshao would it be possible for you to try the Long.expand version of this patch together with the patch from https://github.com/openjdk/jdk/pull/23089 to see how aarch64 performs?
> ARMv8 includes Apple M1/M2, AWS Graviton 3; ARMv9.0 includes Apple M3/M4, Aliyun Yitian 710.
An interesting piece of trivia - while the M4 is ARMv9, it appears not to support SVE - in particular the bdep instruction that this code would use. See https://github.com/llvm/llvm-project/blob/14b44179cb61dd551c911dea54de57b588621005/llvm/lib/Target/AArch64/AArch64Processors.td#L923
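For readers following along, here is a minimal sketch of the general Long.expand plus SWAR idea discussed in this thread. It is not the code from the patch: the class and method names (SwarHex, hex8, toHex) are made up for illustration, and it assumes JDK 19 or later, where Long.expand is available. It converts the 8 nibbles of an int into 8 lowercase hex ASCII bytes packed in a long, without any lookup table.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the implementation from the PR.
final class SwarHex {

    // Returns 8 ASCII hex digits packed in a long, most significant digit
    // in the most significant byte.
    static long hex8(int i) {
        // Spread each of the 8 nibbles into the low 4 bits of its own byte.
        long x = Long.expand(Integer.toUnsignedLong(i), 0x0F0F0F0F0F0F0F0FL);
        // Adding 6 sets bit 4 of a byte exactly when its nibble is >= 10.
        long m = (x + 0x0606060606060606L) & 0x1010101010101010L;
        // Every byte gets '0' (0x30); bytes holding 10..15 get an extra
        // 0x27 so that 10 maps to 'a' (0x61). No byte overflows into its
        // neighbour because the largest result is 0x66 ('f').
        return x + 0x3030303030303030L + (m >>> 4) * 0x27;
    }

    static String toHex(int i) {
        // ByteBuffer is big-endian by default, so the highest byte comes first.
        byte[] buf = ByteBuffer.allocate(Long.BYTES).putLong(hex8(i)).array();
        return new String(buf, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(toHex(0x0123abcd));
    }
}
```

Whether Long.expand compiles down to a single instruction depends on the platform intrinsic (for example PDEP on x64 with BMI2), which is why the fallback path matters for the aarch64 numbers above.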
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2590911374
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2614028489