RFR: 8353741: Improve UUID.toString performance by using SIMD within a register instead of table lookup

Fri Apr 4 16:47:19 UTC 2025

On Sat, 11 Jan 2025 05:21:36 GMT, Shaojin Wen <swen at openjdk.org> wrote:

>> Improve the performance of UUID::toString by using Long.expand and SWAR (SIMD within a register) instead of table lookup. Eliminating the table lookup can also avoid the performance degradation problem when the cache misses.
>
> The new implementation improves performance on the aarch64 architecture but results in a performance regression on x64.
> 
> ## 1. Script
> 
> git remote add wenshao git@github.com:wenshao/jdk.git
> git fetch wenshao
> 
> # baseline dfaa89162a3
> git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
> make test TEST="micro:java.util.UUIDBench.toString"
> 
> # current c513087056b
> git checkout c513087056be8c1e1a915625e0b425a7ecbb21d6
> make test TEST="micro:java.util.UUIDBench.toString"
> 
> 
> ## 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)
> 
> -Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
> -UUIDBench.toString   20000  thrpt   15  94.274 ± 0.452  ops/us
> 
> +Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
> +UUIDBench.toString   20000  thrpt   15  80.241 ± 0.894  ops/us -14.88%
> 
> 
> 
> ## 3. aliyun_ecs_c8i_x64 (CPU Intel®Xeon®Emerald Rapids)
> 
> -Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
> -UUIDBench.toString   20000  thrpt   15  85.323 ± 2.044  ops/us
> 
> +Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
> +UUIDBench.toString   20000  thrpt   15  73.636 ± 0.590  ops/us -13.69%
> 
> 
> ## 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)
> 
> -Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
> -UUIDBench.toString   20000  thrpt   15  69.286 ± 1.136  ops/us
> 
> +Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
> +UUIDBench.toString   20000  thrpt   15  80.475 ± 0.310  ops/us +16.14%
> 
> 
> 
> ## 5. MacBook M1 Pro (aarch64)
> 
> -Benchmark           (size)   Mode  Cnt    Score   Error   Units (baseline dfaa89162a3)
> -UUIDBench.toString   20000  thrpt   15  108.254 ± 1.167  ops/us
> 
> +Benchmark           (size)   Mode  Cnt    Score   Error   Units (current c513087056b)
> +UUIDBench.toString   20000  thrpt   15  122.313 ± 0.820  ops/us +12.98%
> 
> 
> 
> ## 6. orange_pi5_aarch64 (CPU RK3588S)
> 
> -Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
> -UUIDBench.toString   20000  thrpt   15  37.783 ± 1.553  ops/us
> 
> +Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
> +UUIDBench.toString   20000  thrpt   15  42.928 ± 2.534  ops/us +13.61%
> 
> 
> 
> 
> ## 7. orange_aipro_aarch64 (CPU TAISHANV200M)
> 
> -Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
> -UUIDBench.toString   20000  thrpt   15  13.822 ± 0.203  ops/us
> 
> +Benchmark           (size)   M...

With regard to the aarch64 vector intrinsic, I don't have access to an aarch64 machine to try it on (I'm faking it on x64 by disabling the intrinsic). @wenshao would it be possible for you to try the Long.expand version of this patch together with the patch from https://github.com/openjdk/jdk/pull/23089 to see how aarch64 performs?
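
For anyone following along, here is a minimal sketch of the Long.expand plus SWAR idea, so it is clear what is being measured. It is not the code from the PR: the class name SwarHex and the methods hex8Ascii, hex8 and toUuidString are hypothetical, and the string assembly is written for readability rather than speed. Long.expand (available since JDK 19) spreads the eight nibbles of a 32-bit value into eight separate bytes, and the SWAR step then turns all eight nibbles into ASCII hex digits at once, without a lookup table.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical illustration only; names are not taken from the PR.
public final class SwarHex {
    private SwarHex() {}

    /** Returns 8 ASCII hex digits packed in a long, most significant digit in the top byte. */
    static long hex8Ascii(int v) {
        // Deposit each nibble of v into the low 4 bits of its own byte.
        long x = Long.expand(v & 0xFFFFFFFFL, 0x0F0F0F0F0F0F0F0FL);
        // Adding 6 sets bit 4 of a byte exactly when its nibble is >= 10.
        long m = (x + 0x0606060606060606L) & 0x1010101010101010L;
        // Add '0' to every byte, plus 39 ('a' - '0' - 10) to bytes holding a..f.
        return x + 0x3030303030303030L + (m >>> 4) * 39;
    }

    /** Formats the 32 bits of v as 8 lowercase hex characters. */
    static String hex8(int v) {
        long ascii = hex8Ascii(v);
        byte[] out = new byte[8];
        for (int k = 0; k < 8; k++) {
            out[k] = (byte) (ascii >>> (56 - 8 * k)); // top byte is the first digit
        }
        return new String(out, StandardCharsets.US_ASCII);
    }

    /** Builds the 8-4-4-4-12 textual form from four 32-bit pieces. */
    static String toUuidString(UUID u) {
        long msb = u.getMostSignificantBits();
        long lsb = u.getLeastSignificantBits();
        String a = hex8((int) (msb >>> 32));
        String b = hex8((int) msb);
        String c = hex8((int) (lsb >>> 32));
        String d = hex8((int) lsb);
        return a + "-" + b.substring(0, 4) + "-" + b.substring(4)
                + "-" + c.substring(0, 4) + "-" + c.substring(4) + d;
    }
}
```

Calling SwarHex.toUuidString(uuid) should give the same text as uuid.toString(). Whether this beats the table lookup depends on how Long.expand is compiled on the target CPU, which is the question above.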

> ARMv8 includes Apple M1/M2, AWS Graviton 3; ARMv9.0 includes Apple M3/M4, Aliyun Yitian 710.

An interesting piece of trivia: while the M4 is ARMv9, it appears not to support SVE, and in particular not the bdep instruction that this code would use. See https://github.com/llvm/llvm-project/blob/14b44179cb61dd551c911dea54de57b588621005/llvm/lib/Target/AArch64/AArch64Processors.td#L923

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2590911374
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2614028489