RFR: 8353558: x86: Use CLFLUSHOPT/CLWB/CPUID for ICache sync
Aleksey Shipilev
shade at openjdk.org
Tue Apr 15 10:51:02 UTC 2025
On Wed, 2 Apr 2025 18:42:03 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
> For Leyden, which wants to load a lot of code as fast as it can, code cache flush costs are now a significant part of the picture. There are single-digit-percent startup time opportunities in better ICache syncs.
>
> It is not sufficiently clear why icache flushes are needed on x86 at all. The Intel/AMD manuals say the instruction caches are fully coherent, and GCC's `__builtin___clear_cache` intrinsic expands to nothing on x86. It looks like a single serializing instruction such as `cpuid` might be enough for the entire flush to take effect; this is what our `OrderAccess::cross_modify_fence` does. Still, we can maintain the old behavior while flushing the caches more smartly: CLFLUSHOPT and CLWB are available on modern x86 (both flavors are sketched below the quote).
>
> See more discussion and references in the RFE. The performance data is in the comments in this PR.
>
> Additional testing:
> - [x] Linux x86_64 server fastdebug, `all`
> - [x] Linux x86_64 server fastdebug, `all` + `X86ICacheSync={0,1,2,3,4}`
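For reference, a minimal standalone sketch of the two flavors from the quote above (hypothetical code assuming GCC/Clang inline asm, 64-byte cache lines, and CLFLUSHOPT support; the actual change emits equivalent stubs through the macro assembler, so names here are illustrative only):

```c++
// Hypothetical standalone sketch, not the actual HotSpot code.
#include <stddef.h>
#include <stdint.h>

static void sync_with_clflushopt(const void* start, size_t nbytes) {
  const uintptr_t line = 64;
  uintptr_t p   = (uintptr_t)start & ~(line - 1);   // align down to a line
  uintptr_t end = (uintptr_t)start + nbytes;
  for (; p < end; p += line) {
    // Weakly ordered: flushes of different lines can overlap in flight.
    __asm__ volatile("clflushopt (%0)" :: "r"(p) : "memory");
  }
  // CLFLUSHOPT is not ordered against itself, so fence once at the end.
  __asm__ volatile("sfence" ::: "memory");
}

static void sync_with_cpuid(const void*, size_t) {
  // A single serializing instruction for the whole range; this is what
  // OrderAccess::cross_modify_fence effectively relies on.
  uint32_t eax = 0, ebx, ecx = 0, edx;
  __asm__ volatile("cpuid"
                   : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                   :: "memory");
}
```

The CLWB variant is the same loop with `clwb` in place of `clflushopt`. Since CLWB writes lines back without invalidating them, later reads of the same lines can still hit in cache, which shows up in the read-back numbers below. CLFLUSH, by contrast, is ordered with respect to other CLFLUSH executions, so a CLFLUSH loop effectively serializes every iteration.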
On my 5950X, with the new gtest "microbenchmark":
$ CONF=linux-x86_64-server-release make test TEST=gtest:ICacheTest TEST_VM_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:X86ICacheSync=..."
# X86ICacheSync=0 (no flushing at all, lowest overhead "baseline")
256 bytes flushed in 29 ns, read back in 31 ns
512 bytes flushed in 30 ns, read back in 32 ns
1024 bytes flushed in 30 ns, read back in 33 ns
2048 bytes flushed in 31 ns, read back in 40 ns
4096 bytes flushed in 30 ns, read back in 47 ns
8192 bytes flushed in 31 ns, read back in 66 ns
16384 bytes flushed in 31 ns, read back in 109 ns
32768 bytes flushed in 31 ns, read back in 215 ns
65536 bytes flushed in 29 ns, read back in 683 ns
131072 bytes flushed in 29 ns, read back in 1289 ns
262144 bytes flushed in 29 ns, read back in 2531 ns
# X86ICacheSync=1 (CLFLUSH loop)
256 bytes flushed in 258 ns, read back in 138 ns
512 bytes flushed in 249 ns, read back in 154 ns
1024 bytes flushed in 274 ns, read back in 156 ns
2048 bytes flushed in 307 ns, read back in 246 ns
4096 bytes flushed in 390 ns, read back in 392 ns
8192 bytes flushed in 551 ns, read back in 671 ns
16384 bytes flushed in 879 ns, read back in 1287 ns
32768 bytes flushed in 1534 ns, read back in 1647 ns
65536 bytes flushed in 2810 ns, read back in 2492 ns
131072 bytes flushed in 5417 ns, read back in 4904 ns
262144 bytes flushed in 10642 ns, read back in 9692 ns
# X86ICacheSync=2 (CLFLUSHOPT loop)
256 bytes flushed in 34 ns, read back in 287 ns
512 bytes flushed in 42 ns, read back in 153 ns
1024 bytes flushed in 40 ns, read back in 310 ns
2048 bytes flushed in 70 ns, read back in 345 ns
4096 bytes flushed in 97 ns, read back in 490 ns
8192 bytes flushed in 260 ns, read back in 676 ns
16384 bytes flushed in 652 ns, read back in 1279 ns
32768 bytes flushed in 1234 ns, read back in 1641 ns
65536 bytes flushed in 2598 ns, read back in 2509 ns
131072 bytes flushed in 5128 ns, read back in 4846 ns
262144 bytes flushed in 10498 ns, read back in 9471 ns
# X86ICacheSync=3 (CLWB loop)
256 bytes flushed in 33 ns, read back in 151 ns
512 bytes flushed in 44 ns, read back in 147 ns
1024 bytes flushed in 39 ns, read back in 164 ns
2048 bytes flushed in 37 ns, read back in 209 ns
4096 bytes flushed in 125 ns, read back in 209 ns
8192 bytes flushed in 220 ns, read back in 322 ns
16384 bytes flushed in 492 ns, read back in 512 ns
32768 bytes flushed in 1030 ns, read back in 805 ns
65536 bytes flushed in 2269 ns, read back in 1240 ns
131072 bytes flushed in 4828 ns, read back in 2010 ns
262144 bytes flushed in 10106 ns, read back in 3366 ns
# X86ICacheSync=4 (single CPUID)
256 bytes flushed in 57 ns, read back in 30 ns
512 bytes flushed in 57 ns, read back in 32 ns
1024 bytes flushed in 65 ns, read back in 33 ns
2048 bytes flushed in 57 ns, read back in 37 ns
4096 bytes flushed in 57 ns, read back in 47 ns
8192 bytes flushed in 59 ns, read back in 64 ns
16384 bytes flushed in 62 ns, read back in 110 ns
32768 bytes flushed in 56 ns, read back in 208 ns
65536 bytes flushed in 56 ns, read back in 685 ns
131072 bytes flushed in 56 ns, read back in 1285 ns
262144 bytes flushed in 57 ns, read back in 2610 ns
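For context, the gtest "microbenchmark" boils down to roughly this shape (a hypothetical sketch, not the actual test source; in the real test the flush step goes through `ICache::invalidate_range` under the selected `X86ICacheSync` mode):

```c++
// Hypothetical sketch of the measurement shape; names are illustrative.
#include <chrono>
#include <cstdio>
#include <cstring>

static volatile int g_sink;  // keeps the read-back loop from being elided

static void measure(char* buf, size_t size, void (*flush)(void*, size_t)) {
  using clock = std::chrono::steady_clock;
  using std::chrono::duration_cast;
  using std::chrono::nanoseconds;

  memset(buf, 0xAA, size);  // dirty the lines, as code installation would

  auto t0 = clock::now();
  flush(buf, size);         // stand-in for ICache::invalidate_range
  auto t1 = clock::now();

  int sum = 0;
  for (size_t i = 0; i < size; i++) sum += buf[i];  // read back
  auto t2 = clock::now();
  g_sink = sum;

  printf("%zu bytes flushed in %lld ns, read back in %lld ns\n", size,
         (long long)duration_cast<nanoseconds>(t1 - t0).count(),
         (long long)duration_cast<nanoseconds>(t2 - t1).count());
}

int main() {
  static char buf[262144];
  for (size_t size = 256; size <= sizeof(buf); size *= 2) {
    measure(buf, size, [](void*, size_t) { /* X86ICacheSync=0: no-op */ });
  }
  return 0;
}
```

The read-back phase is what separates the modes: CLFLUSH/CLFLUSHOPT evict the just-written lines and pay to re-fetch them, while CLWB (write-back without invalidate) and CPUID leave them largely cached.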
Motivational improvements on Leyden benchmarks, on the same Ryzen 5950X:
Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
-cp JavacBenchApp.jar -XX:AOTCache=app.aot \
-XX:+UnlockDiagnosticVMOptions -XX:X86ICacheSync=... JavacBenchApp 50
--- X86ICacheSync=0 (no flushing at all)
Time (mean ± σ): 362.4 ms ± 4.6 ms [User: 679.1 ms, System: 103.0 ms]
Range (min … max): 350.8 ms … 369.1 ms 30 runs
--- X86ICacheSync=1 (CLFLUSH loop)
Time (mean ± σ): 384.0 ms ± 5.2 ms [User: 729.0 ms, System: 108.1 ms]
Range (min … max): 373.3 ms … 394.4 ms 30 runs
--- X86ICacheSync=2 (CLFLUSHOPT loop)
Time (mean ± σ): 381.5 ms ± 2.8 ms [User: 721.3 ms, System: 107.0 ms]
Range (min … max): 375.3 ms … 386.6 ms 30 runs
--- X86ICacheSync=3 (CLWB loop)
Time (mean ± σ): 374.1 ms ± 3.6 ms [User: 704.3 ms, System: 105.7 ms]
Range (min … max): 363.6 ms … 382.1 ms 30 runs
--- X86ICacheSync=4 (single CPUID)
Time (mean ± σ): 368.1 ms ± 4.0 ms [User: 689.0 ms, System: 106.0 ms]
Range (min … max): 359.0 ms … 376.0 ms 30 runs
Note the improvement in "User" time, as well as in end-to-end execution time. This is because we spend less time in flushes when installing AOT code, and/or flush in a way that is kinder to temporally close uses of the same lines: CLWB writes lines back without invalidating them, so nearby reads still hit in cache. Note that CLWB/CPUID get fairly close to not flushing at all.
The improvements are visible even in simple HelloWorld scenarios:
$ taskset -c 0-7 hyperfine -w 100 -r 300 "build/linux-x86_64-server-release/images/jdk/bin/java \
-XX:+UnlockDiagnosticVMOptions -XX:X86ICacheSync=0 \
-Xmx64m -Xms64m -cp ../leyden-perf/hellostream.jar HelloStream"
# -XX:X86ICacheSync=0 (no flushing at all, lowest overhead "baseline")
Time (mean ± σ): 30.9 ms ± 0.3 ms [User: 24.3 ms, System: 17.1 ms]
Range (min … max): 30.4 ms … 32.6 ms 300 runs
# -XX:X86ICacheSync=1 (CLFLUSH loop)
Time (mean ± σ): 31.7 ms ± 0.2 ms [User: 25.1 ms, System: 17.3 ms]
Range (min … max): 31.2 ms … 32.5 ms 300 runs
# -XX:X86ICacheSync=2 (CLFLUSHOPT loop)
Time (mean ± σ): 31.4 ms ± 0.2 ms [User: 24.8 ms, System: 17.3 ms]
Range (min … max): 31.0 ms … 32.3 ms 300 runs
# -XX:X86ICacheSync=3 (CLWB loop)
Time (mean ± σ): 31.2 ms ± 0.2 ms [User: 24.6 ms, System: 17.2 ms]
Range (min … max): 30.7 ms … 31.8 ms 300 runs
# -XX:X86ICacheSync=4 (single CPUID)
Time (mean ± σ): 30.9 ms ± 0.2 ms [User: 24.2 ms, System: 17.3 ms]
Range (min … max): 30.5 ms … 31.9 ms 300 runs
-------------
PR Comment: https://git.openjdk.org/jdk/pull/24389#issuecomment-2773416930
PR Comment: https://git.openjdk.org/jdk/pull/24389#issuecomment-2773436318
PR Comment: https://git.openjdk.org/jdk/pull/24389#issuecomment-2801077127