RFR: 8353741: Improve UUID.toString performance by using SIMD within a register instead of table lookup

Shaojin Wen swen at openjdk.org
Fri Apr 4 16:47:19 UTC 2025


On Mon, 6 Jan 2025 13:18:50 GMT, Shaojin Wen <swen at openjdk.org> wrote:

> Improve the performance of UUID::toString by using Long.expand and SWAR (SIMD within a register) instead of table lookup. Eliminating the table lookup can also avoid the performance degradation problem when the cache misses.
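
For readers unfamiliar with the trick, the idea in the quoted description can be sketched as follows. This is an illustration written for this thread, not necessarily the code in the PR: the class and method names are hypothetical, and Long.expand requires JDK 19 or later. It converts the low 32 bits of a value into eight lowercase hex digits without any lookup table.

```java
final class SwarHexSketch {
    /** Packs the 8 lowercase hex digits of the low 32 bits of i into a long, first digit in the lowest byte. */
    static long hex8(long i) {
        // Spread the 8 nibbles into the low 4 bits of 8 separate bytes.
        i = Long.expand(i, 0x0F0F_0F0F_0F0F_0F0FL);
        // Bit 4 of a byte is set exactly when its nibble is >= 10 (nibble + 6 >= 16).
        long m = (i + 0x0606_0606_0606_0606L) & 0x1010_1010_1010_1010L;
        // Per byte: '0' + nibble, plus 0x27 (= 0x20 + 0x08 - 0x01) for nibbles >= 10, so 10 maps to 'a'.
        long ascii = ((m << 1) + (m >> 1) - (m >> 4)) + 0x3030_3030_3030_3030L + i;
        // Move the most significant digit into the lowest byte.
        return Long.reverseBytes(ascii);
    }

    /** Unpacks the result of hex8 into a String, lowest byte first. */
    static String toHex8(int value) {
        long packed = hex8(value & 0xFFFF_FFFFL);
        char[] out = new char[8];
        for (int k = 0; k < 8; k++) {
            out[k] = (char) ((packed >>> (8 * k)) & 0xFF);
        }
        return new String(out);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 0x12345678, 0x89abcdef, -1};
        for (int v : samples) {
            // Cross-check against the JDK's own formatter.
            String expected = String.format("%08x", v);
            if (!toHex8(v).equals(expected)) {
                throw new AssertionError(expected + " != " + toHex8(v));
            }
        }
        System.out.println(toHex8(0x89abcdef));
    }
}
```

All eight digits are produced with a handful of arithmetic instructions on one register, so there is no data-dependent memory access that could miss the cache.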

On x64, performance improves significantly. However, on some aarch64 platforms, performance regresses. The performance numbers are as follows:

## 1. Script

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao

# baseline dfaa89162a3
git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
make test TEST="micro:java.util.UUIDBench.toString"

# current 010ab70c00b
git checkout 010ab70c00b7c0f417127c050654a381b489d052
make test TEST="micro:java.util.UUIDBench.toString"


## 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  84.620 ± 15.957  ops/us

+Benchmark           (size)   Mode  Cnt    Score   Error   Units (current 010ab70c00b)
+UUIDBench.toString   20000  thrpt   15  130.913 ± 0.111  ops/us +54.70%



## 3. aliyun_ecs_c8i_x64 (CPU Intel® Xeon® Emerald Rapids)

-Benchmark           (size)   Mode  Cnt   Score    Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  84.754 ± 0.291  ops/us

+Benchmark           (size)   Mode  Cnt    Score   Error   Units (current 010ab70c00b)
+UUIDBench.toString   20000  thrpt   15  94.817 ± 0.231  ops/us +11.87%


## 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  70.288 ± 0.147  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current 010ab70c00b)
+UUIDBench.toString   20000  thrpt   15  92.088 ± 0.137  ops/us +31.01%


## 5. MacBook M1 Pro (aarch64)

-Benchmark           (size)   Mode  Cnt    Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  109.001 ± 0.354  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current 010ab70c00b)
+UUIDBench.toString   20000  thrpt   15  80.671 ± 0.722  ops/us -25.99%


## 6. orange_pi5_aarch64 (CPU RK3588S)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  37.752 ± 1.430  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current 010ab70c00b)
+UUIDBench.toString   20000  thrpt   15  30.940 ± 1.474  ops/us -18.04%



## 7. orange_aipro_aarch64 (CPU TAISHANV200M)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  13.764 ± 0.262  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current 010ab70c00b)
+UUIDBench.toString   20000  thrpt   15  13.310 ± 0.175  ops/us -3.29%

// Method 1:
i = Long.reverseBytes(Long.expand(i, 0x0F0F_0F0F_0F0F_0F0FL));

// Method 2:
i = ((i & 0xF0000000L) >> 28)
  | ((i & 0xF000000L) >> 16)
  | ((i & 0xF00000L) >> 4)
  | ((i & 0xF0000L) << 8)
  | ((i & 0xF000L) << 20)
  | ((i & 0xF00L) << 32)
  | ((i & 0xF0L) << 44)
  | ((i & 0xFL) << 56);


Note: Long.reverseBytes + Long.expand (Method 1) is faster on x64 and ARMv9.
On AArch64 with ARMv8, however, it is slower than the manual unrolling shown in Method 2.
ARMv8 includes Apple M1/M2 and AWS Graviton 3; ARMv9.0 includes Apple M3/M4 and Aliyun Yitian 710.
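
Since the two methods are meant to be interchangeable, a quick equivalence check may be useful before comparing their speed. The following is a minimal sketch, assuming JDK 19 or later for Long.expand; the class name and the random sampling are illustrative choices and not part of the PR.

```java
import java.util.SplittableRandom;

final class ExpandEquivalenceSketch {
    // Method 1: spread the low 8 nibbles into 8 bytes, then reverse the byte order.
    static long method1(long i) {
        return Long.reverseBytes(Long.expand(i, 0x0F0F_0F0F_0F0F_0F0FL));
    }

    // Method 2: move each nibble to its final byte position with explicit shifts.
    static long method2(long i) {
        return ((i & 0xF0000000L) >> 28)
             | ((i & 0xF000000L) >> 16)
             | ((i & 0xF00000L) >> 4)
             | ((i & 0xF0000L) << 8)
             | ((i & 0xF000L) << 20)
             | ((i & 0xF00L) << 32)
             | ((i & 0xF0L) << 44)
             | ((i & 0xFL) << 56);
    }

    public static void main(String[] args) {
        // Fixed seed so that a failure is reproducible.
        SplittableRandom rnd = new SplittableRandom(42);
        for (int n = 0; n < 1_000_000; n++) {
            long i = rnd.nextInt() & 0xFFFF_FFFFL;
            if (method1(i) != method2(i)) {
                throw new AssertionError("mismatch for 0x" + Long.toHexString(i));
            }
        }
        System.out.println("Method 1 and Method 2 agree on all sampled inputs");
    }
}
```

Both methods only read the low 32 bits of the input, so the check samples 32-bit values.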

The new implementation improves performance on the aarch64 architecture but results in a performance regression on x64.

## 1. Script

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao

# baseline dfaa89162a3
git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
make test TEST="micro:java.util.UUIDBench.toString"

# current c513087056b
git checkout c513087056be8c1e1a915625e0b425a7ecbb21d6
make test TEST="micro:java.util.UUIDBench.toString"


## 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  94.274 ± 0.452  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
+UUIDBench.toString   20000  thrpt   15  80.241 ± 0.894  ops/us -14.88%



## 3. aliyun_ecs_c8i_x64 (CPU Intel® Xeon® Emerald Rapids)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  85.323 ± 2.044  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
+UUIDBench.toString   20000  thrpt   15  73.636 ± 0.590  ops/us -13.69%


## 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  69.286 ± 1.136  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
+UUIDBench.toString   20000  thrpt   15  80.475 ± 0.310  ops/us +16.14%



## 5. MacBook M1 Pro (aarch64)

-Benchmark           (size)   Mode  Cnt    Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  108.254 ± 1.167  ops/us

+Benchmark           (size)   Mode  Cnt    Score   Error   Units (current c513087056b)
+UUIDBench.toString   20000  thrpt   15  122.313 ± 0.820  ops/us +12.98%



## 6. orange_pi5_aarch64 (CPU RK3588S)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  37.783 ± 1.553  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
+UUIDBench.toString   20000  thrpt   15  42.928 ± 2.534  ops/us +13.61%




## 7. orange_aipro_aarch64 (CPU TAISHANV200M)

-Benchmark           (size)   Mode  Cnt   Score   Error   Units (baseline dfaa89162a3)
-UUIDBench.toString   20000  thrpt   15  13.822 ± 0.203  ops/us

+Benchmark           (size)   Mode  Cnt   Score   Error   Units (current c513087056b)
+UUIDBench.toString   20000  thrpt   15  18.946 ± 0.156  ops/us +37.07%

Keep it alive.

PR #23089 fixes the performance degradation of Long.expand on aarch64. After merging master, this PR now uses Long.expand. Tests on a range of x64 and aarch64 CPUs show performance improvements, as follows:

## 1. Script

git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao

# baseline 3241b4e111e
git checkout 3241b4e111e3dbf475c0e5be117c2a8d1a63ad35
make test TEST="micro:java.util.UUIDBench.toString"

# current 1059d39f3fb
git checkout 1059d39f3fb3dc58bafb78cf71d387a140130b6f
make test TEST="micro:java.util.UUIDBench.toString"


## 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)

Benchmark           (size)   Mode  Cnt   Score   Error   Units (3241b4e111e)
UUIDBench.toString   20000  thrpt   15  94.372 ± 0.227  ops/us

Benchmark           (size)   Mode  Cnt    Score   Error   Units (1059d39f3fb)
UUIDBench.toString   20000  thrpt   15  116.365 ± 0.405  ops/us +23.30%


## 3. aliyun_ecs_c8i_x64 (CPU Intel® Xeon® Emerald Rapids)

Benchmark           (size)   Mode  Cnt   Score   Error   Units (3241b4e111e)
UUIDBench.toString   20000  thrpt   15  58.594 ± 0.673  ops/us

Benchmark           (size)   Mode  Cnt   Score   Error   Units (1059d39f3fb)
UUIDBench.toString   20000  thrpt   15  61.610 ± 0.677  ops/us +5.14%



## 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710 ARM v9)

Benchmark           (size)   Mode  Cnt   Score   Error   Units (3241b4e111e)
UUIDBench.toString   20000  thrpt   15  69.094 ± 0.615  ops/us

Benchmark           (size)   Mode  Cnt   Score   Error   Units (1059d39f3fb)
UUIDBench.toString   20000  thrpt   15  80.880 ± 0.563  ops/us +17.05%


## 5. MacBook M1 Pro (aarch64)

Benchmark           (size)   Mode  Cnt   Score   Error   Units (3241b4e111e)
UUIDBench.toString   20000  thrpt   15  99.817 ± 2.557  ops/us

Benchmark           (size)   Mode  Cnt    Score   Error   Units (1059d39f3fb)
UUIDBench.toString   20000  thrpt   15  110.155 ± 0.957  ops/us +10.35%


## 6. orange_pi5_aarch64 (CPU RK3588S ARMv8.4)

Benchmark           (size)   Mode  Cnt   Score   Error   Units (3241b4e111e)
UUIDBench.toString   20000  thrpt   15  37.790 ± 1.828  ops/us

Benchmark           (size)   Mode  Cnt   Score   Error   Units (1059d39f3fb)
UUIDBench.toString   20000  thrpt   15  41.086 ± 1.676  ops/us +8.72%


## 7. aws_c8g_aarch64 (CPU Graviton4 ARM v9.0)

Benchmark           (size)   Mode  Cnt   Score   Error   Units (3241b4e111e)
UUIDBench.toString   20000  thrpt   15  78.927 ± 0.683  ops/us

Benchmark           (size)   Mode  Cnt   Score   Error   Units (1059d39f3fb)
UUIDBench.toString   20000  thrpt   15  88.432 ± 0.708  ops/us +12.04%
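
The percentage at the end of each result line appears to be the relative change in throughput against the baseline score. A minimal sketch of that arithmetic, using a hypothetical helper name and the scores from the aws_c8g_aarch64 table above:

```java
import java.util.Locale;

final class RelativeChangeSketch {
    /** Relative change of current against baseline, in percent; positive means higher throughput. */
    static double relativeChangePercent(double baseline, double current) {
        return (current - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Baseline and current scores in ops/us, taken from the table above.
        double change = relativeChangePercent(78.927, 88.432);
        System.out.println(String.format(Locale.ROOT, "%+.2f%%", change));
    }
}
```

Some of the reported figures differ from this calculation in the last digit, which is consistent with truncation rather than rounding.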

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2573456774
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2577693860
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2585077631
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2683406127
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2708921975
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2779245843

