RFR: 8353741: Improve UUID.toString performance by using SIMD within a register instead of table lookup
Shaojin Wen
swen at openjdk.org
Fri Apr 4 16:47:19 UTC 2025
On Mon, 6 Jan 2025 13:18:50 GMT, Shaojin Wen <swen at openjdk.org> wrote:
> Improve the performance of UUID::toString by using Long.expand and SWAR (SIMD within a register) instead of table lookup. Eliminating the table lookup can also avoid the performance degradation problem when the cache misses.
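To illustrate the idea for readers unfamiliar with SWAR, here is a minimal sketch of converting 32 bits to 8 hex characters without a lookup table. The class and method names (SwarHexSketch, expandNibbles, toHexAscii, hex8) are hypothetical, and the digit-to-ASCII arithmetic is one possible formulation assumed for illustration; it is not necessarily the code in this PR. It requires JDK 19 or later for Long.expand.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the PR's actual code.
class SwarHexSketch {
    // Spread the 8 nibbles of a 32-bit value into the low nibble of 8 bytes,
    // then reverse the bytes so the most significant nibble ends up in byte 0.
    static long expandNibbles(int value) {
        return Long.reverseBytes(
                Long.expand(value & 0xFFFF_FFFFL, 0x0F0F_0F0F_0F0F_0F0FL));
    }

    // Convert 8 packed values (0..15, one per byte) to ASCII '0'-'9' / 'a'-'f'
    // in parallel. Assumes every byte of the input is in the range 0..15.
    static long toHexAscii(long nibbles) {
        // Bit 4 of a byte is set exactly when that byte's value is >= 10.
        long m = (nibbles + 0x0606_0606_0606_0606L) & 0x1010_1010_1010_1010L;
        // For bytes >= 10 add 0x27 (= 0x20 + 0x08 - 0x01), the gap between
        // '9' + 1 and 'a'; then add '0' (0x30) to every byte.
        return nibbles + 0x3030_3030_3030_3030L
                + ((m << 1) + (m >> 1) - (m >> 4));
    }

    // Produce the 8 lowercase hex characters of a 32-bit value.
    static String hex8(int value) {
        long packed = toHexAscii(expandNibbles(value));
        byte[] out = new byte[8];
        for (int k = 0; k < 8; k++) {
            out[k] = (byte) (packed >>> (8 * k)); // byte 0 is the first character
        }
        return new String(out, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(hex8(0x12ABCDEF)); // prints 12abcdef
    }
}
```

The sketch copies bytes out one at a time for clarity; a real implementation would presumably store the packed long into the target byte array in a single write.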
On x64, performance improves significantly. However, on some aarch64 platforms, performance regresses. The performance numbers are as follows:
## 1. Script
git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
# baseline dfaa89162a3
git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
make test TEST="micro:java.util.UUIDBench.toString"
# current 010ab70c00b
git checkout 010ab70c00b7c0f417127c050654a381b489d052
make test TEST="micro:java.util.UUIDBench.toString"
## 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)
-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 84.620 ± 15.957 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current 010ab70c00b)
+UUIDBench.toString 20000 thrpt 15 130.913 ± 0.111 ops/us +54.70%
## 3. aliyun_ecs_c8i_x64 (CPU Intel® Xeon® Emerald Rapids)
-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 84.754 ± 0.291 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current 010ab70c00b)
+UUIDBench.toString 20000 thrpt 15 94.817 ± 0.231 ops/us +11.87%
## 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)
-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 70.288 ± 0.147 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current 010ab70c00b)
+UUIDBench.toString 20000 thrpt 15 92.088 ± 0.137 ops/us +31.01%
## 5. MacBook M1 Pro (aarch64)
-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 109.001 ± 0.354 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current 010ab70c00b)
+UUIDBench.toString 20000 thrpt 15 80.671 ± 0.722 ops/us -25.99%
## 6. orange_pi5_aarch64 (CPU RK3588S)
-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 37.752 ± 1.430 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current 010ab70c00b)
+UUIDBench.toString 20000 thrpt 15 30.940 ± 1.474 ops/us -18.04%
## 7. orange_aipro_aarch64 (CPU TAISHANV200M)
-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 13.764 ± 0.262 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current 010ab70c00b)
+UUIDBench.toString 20000 thrpt 15 13.310 ± 0.175 ops/us -3.29%
// Method 1:
i = Long.reverseBytes(Long.expand(i, 0x0F0F_0F0F_0F0F_0F0FL));
// Method 2:
i = ((i & 0xF0000000L) >> 28)
  | ((i & 0xF000000L) >> 16)
  | ((i & 0xF00000L) >> 4)
  | ((i & 0xF0000L) << 8)
  | ((i & 0xF000L) << 20)
  | ((i & 0xF00L) << 32)
  | ((i & 0xF0L) << 44)
  | ((i & 0xFL) << 56);
Note: Long.reverseBytes + Long.expand (Method 1) is faster on x64 and ARMv9.
On AArch64 with ARMv8, however, it is slower than the manual unrolling shown in Method 2.
ARMv8 includes Apple M1/M2, AWS Graviton 3; ARMv9.0 includes Apple M3/M4, Aliyun Yitian 710.
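As a sanity check that the two methods above compute the same value, the following minimal sketch wraps each in a method and compares them on a few inputs. The class and method names (NibbleSpreadCheck, method1, method2, agree) are hypothetical and only for illustration; Long.expand requires JDK 19 or later.

```java
// Hypothetical harness; the two method bodies are copied from the note above.
class NibbleSpreadCheck {
    // Method 1: bit-expand the low 32 bits into the low nibble of each byte,
    // then reverse the byte order.
    static long method1(long i) {
        return Long.reverseBytes(Long.expand(i, 0x0F0F_0F0F_0F0F_0F0FL));
    }

    // Method 2: move each of the 8 nibbles to its target byte by hand.
    static long method2(long i) {
        return ((i & 0xF0000000L) >> 28)
             | ((i & 0xF000000L) >> 16)
             | ((i & 0xF00000L) >> 4)
             | ((i & 0xF0000L) << 8)
             | ((i & 0xF000L) << 20)
             | ((i & 0xF00L) << 32)
             | ((i & 0xF0L) << 44)
             | ((i & 0xFL) << 56);
    }

    static boolean agree(long i) {
        return method1(i) == method2(i);
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 0x12ABCDEFL, 0xFFFFFFFFL, 0x80000000L};
        for (long s : samples) {
            System.out.printf("%08x -> %016x (agree: %b)%n",
                    s, method1(s), agree(s));
        }
    }
}
```

For example, the input 0x12ABCDEF becomes 0x0F0E0D0C0B0A0201 with either method: the most significant nibble (1) lands in the lowest byte, so a little-endian store writes the digits in reading order.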
The new implementation (commit c513087056b) improves performance on aarch64 but regresses on x64. The numbers are as follows:
## 1. Script
git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
# baseline dfaa89162a3
git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
make test TEST="micro:java.util.UUIDBench.toString"
# current c513087056b
git checkout c513087056be8c1e1a915625e0b425a7ecbb21d6
make test TEST="micro:java.util.UUIDBench.toString"
## 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)
-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 94.274 ± 0.452 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
+UUIDBench.toString 20000 thrpt 15 80.241 ± 0.894 ops/us -14.88%
## 3. aliyun_ecs_c8i_x64 (CPU Intel® Xeon® Emerald Rapids)
-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 85.323 ± 2.044 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
+UUIDBench.toString 20000 thrpt 15 73.636 ± 0.590 ops/us -13.69%
## 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)
-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 69.286 ± 1.136 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
+UUIDBench.toString 20000 thrpt 15 80.475 ± 0.310 ops/us +16.14%
## 5. MacBook M1 Pro (aarch64)
-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 108.254 ± 1.167 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
+UUIDBench.toString 20000 thrpt 15 122.313 ± 0.820 ops/us +12.98%
## 6. orange_pi5_aarch64 (CPU RK3588S)
-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 37.783 ± 1.553 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
+UUIDBench.toString 20000 thrpt 15 42.928 ± 2.534 ops/us +13.61%
## 7. orange_aipro_aarch64 (CPU TAISHANV200M)
-Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
-UUIDBench.toString 20000 thrpt 15 13.822 ± 0.203 ops/us
+Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
+UUIDBench.toString 20000 thrpt 15 18.946 ± 0.156 ops/us +37.07%
Keep it alive.
Keep it alive.
PR #23089 fixes the performance degradation of Long.expand on aarch64. After merging master, this PR uses Long.expand again. Tests on a range of x64 and aarch64 CPUs all show performance improvements, as follows:
## 1. Script
git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
# baseline 3241b4e111e
git checkout 3241b4e111e3dbf475c0e5be117c2a8d1a63ad35
make test TEST="micro:java.util.UUIDBench.toString"
# current 1059d39f3fb
git checkout 1059d39f3fb3dc58bafb78cf71d387a140130b6f
make test TEST="micro:java.util.UUIDBench.toString"
## 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)
Benchmark (size) Mode Cnt Score Error Units (3241b4e111e)
UUIDBench.toString 20000 thrpt 15 94.372 ± 0.227 ops/us
Benchmark (size) Mode Cnt Score Error Units (1059d39f3fb)
UUIDBench.toString 20000 thrpt 15 116.365 ± 0.405 ops/us +23.30%
## 3. aliyun_ecs_c8i_x64 (CPU Intel® Xeon® Emerald Rapids)
Benchmark (size) Mode Cnt Score Error Units (3241b4e111e)
UUIDBench.toString 20000 thrpt 15 58.594 ± 0.673 ops/us
Benchmark (size) Mode Cnt Score Error Units (1059d39f3fb)
UUIDBench.toString 20000 thrpt 15 61.610 ± 0.677 ops/us +5.14%
## 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710 ARM v9)
Benchmark (size) Mode Cnt Score Error Units (3241b4e111e)
UUIDBench.toString 20000 thrpt 15 69.094 ± 0.615 ops/us
Benchmark (size) Mode Cnt Score Error Units (1059d39f3fb)
UUIDBench.toString 20000 thrpt 15 80.880 ± 0.563 ops/us +17.05%
## 5. MacBook M1 Pro (aarch64)
Benchmark (size) Mode Cnt Score Error Units (3241b4e111e)
UUIDBench.toString 20000 thrpt 15 99.817 ± 2.557 ops/us
Benchmark (size) Mode Cnt Score Error Units (1059d39f3fb)
UUIDBench.toString 20000 thrpt 15 110.155 ± 0.957 ops/us +10.35%
## 6. orange_pi5_aarch64 (CPU RK3588S ARMv8.4)
Benchmark (size) Mode Cnt Score Error Units (3241b4e111e)
UUIDBench.toString 20000 thrpt 15 37.790 ± 1.828 ops/us
Benchmark (size) Mode Cnt Score Error Units (1059d39f3fb)
UUIDBench.toString 20000 thrpt 15 41.086 ± 1.676 ops/us +8.72%
## 7. aws_c8g_aarch64 (CPU Graviton4 ARM v9.0)
Benchmark (size) Mode Cnt Score Error Units (3241b4e111e)
UUIDBench.toString 20000 thrpt 15 78.927 ± 0.683 ops/us
Benchmark (size) Mode Cnt Score Error Units (1059d39f3fb)
UUIDBench.toString 20000 thrpt 15 88.432 ± 0.708 ops/us +12.04%
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2573456774
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2577693860
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2585077631
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2683406127
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2708921975
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2779245843