RFR: 8353558: x86: Use CLFLUSHOPT/CLWB/CPUID for ICache sync
Aleksey Shipilev
shade at openjdk.org
Tue Apr 15 10:51:02 UTC 2025
On Wed, 2 Apr 2025 18:42:03 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
> For Leyden, which wants to load a lot of code as fast as it can, code cache flush costs are now a significant part of the picture. There are single-digit-percent startup time opportunities in better ICache syncs.
>
> It is not sufficiently clear why icache flushes are needed on x86 at all. The Intel/AMD manuals say the instruction caches are fully coherent, and GCC's `__builtin___clear_cache` intrinsic expands to nothing on x86. It looks like a single serializing instruction such as `cpuid` might be enough for the entire flush to take effect; this is what our `OrderAccess::cross_modify_fence` does. Still, we can maintain the old behavior while flushing the caches more smartly: CLFLUSHOPT and CLWB are available on modern x86 (both flavors are sketched below the quote).
>
> See more discussion and references in the RFE. The performance data is in the comments in this PR.
>
> Additional testing:
> - [x] Linux x86_64 server fastdebug, `all`
> - [x] Linux x86_64 server fastdebug, `all` + `X86ICacheSync={0,1,2,3,4}`
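For reference, a minimal standalone sketch of the two flavors from the quote above (hypothetical code assuming GCC/Clang inline asm, 64-byte cache lines, and CLFLUSHOPT support; the actual change emits equivalent stubs through the macro assembler, so names here are illustrative only):

```c++
// Hypothetical standalone sketch, not the actual HotSpot code.
#include <stddef.h>
#include <stdint.h>

static void sync_with_clflushopt(const void* start, size_t nbytes) {
  const uintptr_t line = 64;
  uintptr_t p   = (uintptr_t)start & ~(line - 1);   // align down to a line
  uintptr_t end = (uintptr_t)start + nbytes;
  for (; p < end; p += line) {
    // Weakly ordered: flushes of different lines can overlap in flight.
    __asm__ volatile("clflushopt (%0)" :: "r"(p) : "memory");
  }
  // CLFLUSHOPT is not ordered against itself, so fence once at the end.
  __asm__ volatile("sfence" ::: "memory");
}

static void sync_with_cpuid(const void*, size_t) {
  // A single serializing instruction for the whole range; this is what
  // OrderAccess::cross_modify_fence effectively relies on.
  uint32_t eax = 0, ebx, ecx = 0, edx;
  __asm__ volatile("cpuid"
                   : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                   :: "memory");
}
```

The CLWB variant is the same loop with `clwb` in place of `clflushopt`. Since CLWB writes lines back without invalidating them, later reads of the same lines can still hit in cache, which shows up in the read-back numbers below. CLFLUSH, by contrast, is ordered with respect to other CLFLUSH executions, so a CLFLUSH loop effectively serializes every iteration.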
On my 5950X, with the new gtest "microbenchmark":
$ CONF=linux-x86_64-server-release make test TEST=gtest:ICacheTest TEST_VM_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:X86ICacheSync=..."
# X86ICacheSync=0 (no flushing at all, lowest overhead "baseline")
256 bytes flushed in 29 ns, read back in 31 ns
512 bytes flushed in 30 ns, read back in 32 ns
1024 bytes flushed in 30 ns, read back in 33 ns
2048 bytes flushed in 31 ns, read back in 40 ns
4096 bytes flushed in 30 ns, read back in 47 ns
8192 bytes flushed in 31 ns, read back in 66 ns
16384 bytes flushed in 31 ns, read back in 109 ns
32768 bytes flushed in 31 ns, read back in 215 ns
65536 bytes flushed in 29 ns, read back in 683 ns
131072 bytes flushed in 29 ns, read back in 1289 ns
262144 bytes flushed in 29 ns, read back in 2531 ns
# X86ICacheSync=1 (CLFLUSH loop)
256 bytes flushed in 258 ns, read back in 138 ns
512 bytes flushed in 249 ns, read back in 154 ns
1024 bytes flushed in 274 ns, read back in 156 ns
2048 bytes flushed in 307 ns, read back in 246 ns
4096 bytes flushed in 390 ns, read back in 392 ns
8192 bytes flushed in 551 ns, read back in 671 ns
16384 bytes flushed in 879 ns, read back in 1287 ns
32768 bytes flushed in 1534 ns, read back in 1647 ns
65536 bytes flushed in 2810 ns, read back in 2492 ns
131072 bytes flushed in 5417 ns, read back in 4904 ns
262144 bytes flushed in 10642 ns, read back in 9692 ns
# X86ICacheSync=2 (CLFLUSHOPT loop)
256 bytes flushed in 34 ns, read back in 287 ns
512 bytes flushed in 42 ns, read back in 153 ns
1024 bytes flushed in 40 ns, read back in 310 ns
2048 bytes flushed in 70 ns, read back in 345 ns
4096 bytes flushed in 97 ns, read back in 490 ns
8192 bytes flushed in 260 ns, read back in 676 ns
16384 bytes flushed in 652 ns, read back in 1279 ns
32768 bytes flushed in 1234 ns, read back in 1641 ns
65536 bytes flushed in 2598 ns, read back in 2509 ns
131072 bytes flushed in 5128 ns, read back in 4846 ns
262144 bytes flushed in 10498 ns, read back in 9471 ns
# X86ICacheSync=3 (CLWB loop)
256 bytes flushed in 33 ns, read back in 151 ns
512 bytes flushed in 44 ns, read back in 147 ns
1024 bytes flushed in 39 ns, read back in 164 ns
2048 bytes flushed in 37 ns, read back in 209 ns
4096 bytes flushed in 125 ns, read back in 209 ns
8192 bytes flushed in 220 ns, read back in 322 ns
16384 bytes flushed in 492 ns, read back in 512 ns
32768 bytes flushed in 1030 ns, read back in 805 ns
65536 bytes flushed in 2269 ns, read back in 1240 ns
131072 bytes flushed in 4828 ns, read back in 2010 ns
262144 bytes flushed in 10106 ns, read back in 3366 ns
# X86ICacheSync=4 (single CPUID)
256 bytes flushed in 57 ns, read back in 30 ns
512 bytes flushed in 57 ns, read back in 32 ns
1024 bytes flushed in 65 ns, read back in 33 ns
2048 bytes flushed in 57 ns, read back in 37 ns
4096 bytes flushed in 57 ns, read back in 47 ns
8192 bytes flushed in 59 ns, read back in 64 ns
16384 bytes flushed in 62 ns, read back in 110 ns
32768 bytes flushed in 56 ns, read back in 208 ns
65536 bytes flushed in 56 ns, read back in 685 ns
131072 bytes flushed in 56 ns, read back in 1285 ns
262144 bytes flushed in 57 ns, read back in 2610 ns
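For context, the gtest "microbenchmark" boils down to roughly this shape (a hypothetical sketch, not the actual test source; in the real test the flush step goes through `ICache::invalidate_range` under the selected `X86ICacheSync` mode):

```c++
// Hypothetical sketch of the measurement shape; names are illustrative.
#include <chrono>
#include <cstdio>
#include <cstring>

static volatile int g_sink;  // keeps the read-back loop from being elided

static void measure(char* buf, size_t size, void (*flush)(void*, size_t)) {
  using clock = std::chrono::steady_clock;
  using std::chrono::duration_cast;
  using std::chrono::nanoseconds;

  memset(buf, 0xAA, size);  // dirty the lines, as code installation would

  auto t0 = clock::now();
  flush(buf, size);         // stand-in for ICache::invalidate_range
  auto t1 = clock::now();

  int sum = 0;
  for (size_t i = 0; i < size; i++) sum += buf[i];  // read back
  auto t2 = clock::now();
  g_sink = sum;

  printf("%zu bytes flushed in %lld ns, read back in %lld ns\n", size,
         (long long)duration_cast<nanoseconds>(t1 - t0).count(),
         (long long)duration_cast<nanoseconds>(t2 - t1).count());
}

int main() {
  static char buf[262144];
  for (size_t size = 256; size <= sizeof(buf); size *= 2) {
    measure(buf, size, [](void*, size_t) { /* X86ICacheSync=0: no-op */ });
  }
  return 0;
}
```

The read-back phase is what separates the modes: CLFLUSH/CLFLUSHOPT evict the just-written lines and pay to re-fetch them, while CLWB (write-back without invalidate) and CPUID leave them largely cached.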
Motivational improvements on Leyden benchmarks, on the same Ryzen 5950X:
Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -Xms64m -Xmx1g -XX:+UseSerialGC \
-cp JavacBenchApp.jar -XX:AOTCache=app.aot \
-XX:+UnlockDiagnosticVMOptions -XX:X86ICacheSync=... JavacBenchApp 50
--- X86ICacheSync=0 (no flushing at all)
Time (mean ± σ): 362.4 ms ± 4.6 ms [User: 679.1 ms, System: 103.0 ms]
Range (min … max): 350.8 ms … 369.1 ms 30 runs
--- X86ICacheSync=1 (CLFLUSH loop)
Time (mean ± σ): 384.0 ms ± 5.2 ms [User: 729.0 ms, System: 108.1 ms]
Range (min … max): 373.3 ms … 394.4 ms 30 runs
--- X86ICacheSync=2 (CLFLUSHOPT loop)
Time (mean ± σ): 381.5 ms ± 2.8 ms [User: 721.3 ms, System: 107.0 ms]
Range (min … max): 375.3 ms … 386.6 ms 30 runs
--- X86ICacheSync=3 (CLWB loop)
Time (mean ± σ): 374.1 ms ± 3.6 ms [User: 704.3 ms, System: 105.7 ms]
Range (min … max): 363.6 ms … 382.1 ms 30 runs
--- X86ICacheSync=4 (single CPUID)
Time (mean ± σ): 368.1 ms ± 4.0 ms [User: 689.0 ms, System: 106.0 ms]
Range (min … max): 359.0 ms … 376.0 ms 30 runs
Note the improvement in "User" time, as well as in end-to-end execution time. This is because we spend less time in flushes when installing AOT code, and/or flush in a way that is kinder to temporally close uses of the same lines: CLWB writes lines back without invalidating them, so nearby reads still hit in cache. Note that CLWB/CPUID get fairly close to not flushing at all.
The improvements are visible even in simple HelloWorld scenarios:
$ taskset -c 0-7 hyperfine -w 100 -r 300 "build/linux-x86_64-server-release/images/jdk/bin/java \
-XX:+UnlockDiagnosticVMOptions -XX:X86ICacheSync=0 \
-Xmx64m -Xms64m -cp ../leyden-perf/hellostream.jar HelloStream"
# -XX:X86ICacheSync=0 (no flushing at all, lowest overhead "baseline")
Time (mean ± σ): 30.9 ms ± 0.3 ms [User: 24.3 ms, System: 17.1 ms]
Range (min … max): 30.4 ms … 32.6 ms 300 runs
# -XX:X86ICacheSync=1 (CLFLUSH loop)
Time (mean ± σ): 31.7 ms ± 0.2 ms [User: 25.1 ms, System: 17.3 ms]
Range (min … max): 31.2 ms … 32.5 ms 300 runs
# -XX:X86ICacheSync=2 (CLFLUSHOPT loop)
Time (mean ± σ): 31.4 ms ± 0.2 ms [User: 24.8 ms, System: 17.3 ms]
Range (min … max): 31.0 ms … 32.3 ms 300 runs
# -XX:X86ICacheSync=3 (CLWB loop)
Time (mean ± σ): 31.2 ms ± 0.2 ms [User: 24.6 ms, System: 17.2 ms]
Range (min … max): 30.7 ms … 31.8 ms 300 runs
# -XX:X86ICacheSync=4 (single CPUID)
Time (mean ± σ): 30.9 ms ± 0.2 ms [User: 24.2 ms, System: 17.3 ms]
Range (min … max): 30.5 ms … 31.9 ms 300 runs
-------------
PR Comment: https://git.openjdk.org/jdk/pull/24389#issuecomment-2773416930
PR Comment: https://git.openjdk.org/jdk/pull/24389#issuecomment-2773436318
PR Comment: https://git.openjdk.org/jdk/pull/24389#issuecomment-2801077127