RFR: 8353741: Improve UUID.toString performance by using SIMD within a register instead of table lookup
Shaojin Wen
swen at openjdk.org
Fri Apr 4 16:47:19 UTC 2025
On Tue, 14 Jan 2025 19:14:05 GMT, Johannes Graham <duke at openjdk.org> wrote:
>> The new implementation improves performance on the aarch64 architecture but results in a performance regression on x64.
>>
>> ## 1. Script
>>
>> git remote add wenshao git@github.com:wenshao/jdk.git
>> git fetch wenshao
>>
>> # baseline dfaa89162a3
>> git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
>> make test TEST="micro:java.util.UUIDBench.toString"
>>
>> # current c513087056b
>> git checkout c513087056be8c1e1a915625e0b425a7ecbb21d6
>> make test TEST="micro:java.util.UUIDBench.toString"
>>
>>
>> ## 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)
>>
>> -Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
>> -UUIDBench.toString 20000 thrpt 15 94.274 ± 0.452 ops/us
>>
>> +Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
>> +UUIDBench.toString 20000 thrpt 15 80.241 ± 0.894 ops/us -14.88%
>>
>>
>>
>> ## 3. aliyun_ecs_c8i_x64 (CPU Intel®Xeon®Emerald Rapids)
>>
>> -Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
>> -UUIDBench.toString 20000 thrpt 15 85.323 ± 2.044 ops/us
>>
>> +Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
>> +UUIDBench.toString 20000 thrpt 15 73.636 ± 0.590 ops/us -13.69%
>>
>>
>> ## 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710)
>>
>> -Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
>> -UUIDBench.toString 20000 thrpt 15 69.286 ± 1.136 ops/us
>>
>> +Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
>> +UUIDBench.toString 20000 thrpt 15 80.475 ± 0.310 ops/us +16.14%
>>
>>
>>
>> ## 5. MacBook M1 Pro (aarch64)
>>
>> -Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
>> -UUIDBench.toString 20000 thrpt 15 108.254 ± 1.167 ops/us
>>
>> +Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
>> +UUIDBench.toString 20000 thrpt 15 122.313 ± 0.820 ops/us +12.98%
>>
>>
>>
>> ## 6. orange_pi5_aarch64 (CPU RK3588S)
>>
>> -Benchmark (size) Mode Cnt Score Error Units (baseline dfaa89162a3)
>> -UUIDBench.toString 20000 thrpt 15 37.783 ± 1.553 ops/us
>>
>> +Benchmark (size) Mode Cnt Score Error Units (current c513087056b)
>> +UUIDBench.toString 20000 thrpt 15 42.928 ± 2.534 ops/us +13.61%
>>
>>
>>
>>
>> ## 7. orange_aipro_aarch64 (CPU TAISHANV200M)
>>
>> -Benchmark (size) Mode Cnt Sco...
>
> With regard to the aarch64 vector intrinsic, I don't have access to an aarch64 machine to try it on (I'm faking it on x64 by disabling the intrinsic). @wenshao would it be possible for you to try the Long.expand version of this patch with the patch from https://github.com/openjdk/jdk/pull/23089 to see how aarch64 performs?
@j3graham
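With the patch from PR 23089 applied, the xor_const + Long.expand version (4f54ac68a9f) shows a noticeable performance improvement over the baseline on every machine tested except AWS C7g (aarch64), where it is slower than the baseline.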
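As background on the Long.expand variant mentioned above, here is a minimal sketch of the general idea: spread the nibbles of a value into separate bytes with `Long.expand`, then turn all of them into ASCII hex digits at once with plain long arithmetic instead of a lookup table. This is an illustration only and is not the code from the PR; the class name `SwarHexSketch` and the method `hex8` are hypothetical, and it assumes JDK 19 or later, where `Long.expand` is available.

```java
// Illustrative sketch only; not the code from the PR. Requires JDK 19+.
final class SwarHexSketch {

    /** Returns the 8 lowercase hex digits of value, most significant digit first. */
    static String hex8(int value) {
        // Spread the 8 nibbles of the int into the low 4 bits of each of 8 bytes.
        long nibbles = Long.expand(value & 0xFFFF_FFFFL, 0x0F0F_0F0F_0F0F_0F0FL);

        // Per byte: adding 6 sets bit 4 exactly when the nibble is in 10..15.
        long isLetter = ((nibbles + 0x0606_0606_0606_0606L) & 0x1010_1010_1010_1010L) >>> 4;

        // Digits become '0' + nibble; letters need an extra 'a' - '0' - 10 = 39.
        // No byte exceeds 'f' (102), so nothing carries into a neighbouring byte.
        long ascii = nibbles + 0x3030_3030_3030_3030L + isLetter * 39;

        // Emit the bytes from most significant to least significant.
        char[] out = new char[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (char) ((ascii >>> (56 - 8 * i)) & 0xFF);
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(hex8(0x1234ABCD)); // prints 1234abcd
    }
}
```

A UUID string consists of 32 hex digits plus four dashes, so a full toString would apply this kind of step to each part of the two 64-bit halves. How well it performs depends on whether `Long.expand` is intrinsified on the target CPU, which is what the numbers below measure.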
## 1. Script
git remote add wenshao git@github.com:wenshao/jdk.git
git fetch wenshao
# baseline dfaa89162a3
git checkout dfaa89162a35acd20b1ed35e147f9626a181510a
make test TEST="micro:java.util.UUIDBench.toString"
# current c513087056b
git checkout c513087056be8c1e1a915625e0b425a7ecbb21d6
make test TEST="micro:java.util.UUIDBench.toString"
# xor_const + Long.expand 4f54ac68a9f
git checkout 4f54ac68a9fdb635ea2a3f03787cbf0d052dac25
make test TEST="micro:java.util.UUIDBench.toString"
## 2. aliyun_ecs_c8a_x64 (CPU AMD EPYC™ Genoa)
Benchmark           (size)   Mode  Cnt    Score   Error   Units  Commit
UUIDBench.toString   20000  thrpt   15   94.273 ± 0.196  ops/us  dfaa89162a3
UUIDBench.toString   20000  thrpt   15   79.701 ± 0.979  ops/us  c513087056b
UUIDBench.toString   20000  thrpt   15  131.954 ± 1.005  ops/us  4f54ac68a9f
## 3. aliyun_ecs_c8i_x64 (CPU Intel®Xeon®Emerald Rapids)
Benchmark           (size)   Mode  Cnt    Score   Error   Units  Commit
UUIDBench.toString   20000  thrpt   15  110.221 ± 4.370  ops/us  dfaa89162a3
UUIDBench.toString   20000  thrpt   15   78.233 ± 0.790  ops/us  c513087056b
UUIDBench.toString   20000  thrpt   15  136.119 ± 0.464  ops/us  4f54ac68a9f
## 4. aliyun_ecs_c8y_aarch64 (CPU Aliyun Yitian 710 ARM v9)
Benchmark           (size)   Mode  Cnt    Score   Error   Units  Commit
UUIDBench.toString   20000  thrpt   15   70.538 ± 0.095  ops/us  dfaa89162a3
UUIDBench.toString   20000  thrpt   15   80.501 ± 0.280  ops/us  c513087056b
UUIDBench.toString   20000  thrpt   15   93.289 ± 0.665  ops/us  4f54ac68a9f
## 5. MacBook M1 Pro (aarch64)
Benchmark           (size)   Mode  Cnt    Score   Error   Units  Commit
UUIDBench.toString   20000  thrpt   15  106.552 ± 0.856  ops/us  dfaa89162a3
UUIDBench.toString   20000  thrpt   15  120.775 ± 0.755  ops/us  c513087056b
UUIDBench.toString   20000  thrpt   15  121.762 ± 0.826  ops/us  4f54ac68a9f
## 6. orange_pi5_aarch64 (CPU RK3588S ARMv8.4)
Benchmark           (size)   Mode  Cnt    Score   Error   Units  Commit
UUIDBench.toString   20000  thrpt   15   37.314 ± 1.616  ops/us  dfaa89162a3
UUIDBench.toString   20000  thrpt   15   43.791 ± 2.181  ops/us  c513087056b
UUIDBench.toString   20000  thrpt   15   43.906 ± 1.287  ops/us  4f54ac68a9f
## 7. aws_c7g_aarch64 (CPU Graviton3 ARMv8.4)
Benchmark           (size)   Mode  Cnt    Score   Error   Units  Commit
UUIDBench.toString   20000  thrpt   15   65.280 ± 0.742  ops/us  dfaa89162a3
UUIDBench.toString   20000  thrpt   15   59.123 ± 0.338  ops/us  c513087056b
UUIDBench.toString   20000  thrpt   15   58.846 ± 0.729  ops/us  4f54ac68a9f
## 8. aws_c8g_aarch64 (CPU Graviton4 ARM v9.0)
Benchmark           (size)   Mode  Cnt    Score   Error   Units  Commit
UUIDBench.toString   20000  thrpt   15   81.226 ± 0.374  ops/us  dfaa89162a3
UUIDBench.toString   20000  thrpt   15   87.328 ± 1.086  ops/us  c513087056b
UUIDBench.toString   20000  thrpt   15   93.546 ± 1.623  ops/us  4f54ac68a9f
## 9. orange_aipro_aarch64 (CPU TAISHANV200M)
Benchmark           (size)   Mode  Cnt    Score   Error   Units  Commit
UUIDBench.toString   20000  thrpt   15   13.828 ± 0.142  ops/us  dfaa89162a3
UUIDBench.toString   20000  thrpt   15   18.870 ± 0.251  ops/us  c513087056b
UUIDBench.toString   20000  thrpt   15   18.833 ± 0.192  ops/us  4f54ac68a9f
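The earlier message reported a percentage change next to each result, while the tables in this message list only raw scores. A small helper along these lines can derive the same kind of figure for any pair of rows. It is a sketch: the class name `RelativeChangeSketch` is hypothetical, the input scores are copied from section 7 (aws_c7g_aarch64), and the JMH error margins are ignored.

```java
import java.util.Locale;

// Illustrative helper; not part of the PR or of the benchmark itself.
final class RelativeChangeSketch {

    /** Percentage change of candidate relative to baseline; positive means higher throughput. */
    static double percentChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Scores in ops/us, copied from section 7 (aws_c7g_aarch64).
        double baseline = 65.280; // dfaa89162a3
        double current = 59.123;  // c513087056b
        double expand = 58.846;   // 4f54ac68a9f

        System.out.printf(Locale.ROOT, "c513087056b: %+.2f%%%n", percentChange(baseline, current));
        System.out.printf(Locale.ROOT, "4f54ac68a9f: %+.2f%%%n", percentChange(baseline, expand));
    }
}
```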
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22928#issuecomment-2593333971