RFR: Fix SCCache closing races

Aleksey Shipilev shade at openjdk.org
Wed Sep 4 21:53:16 UTC 2024


I have been scratching my head why Leyden builds crash with multiple iterations of `HelloStream` example.

The crashes are of various shapes, but they have a similar clue: it happens when we touch `SCCache` when it was already closed: `hs_err` shows we have called `before_exit`, `gdb` shows `SCCache::_cache` is already `nullptr`, but the crashes are in compiler code that accesses `SCCache`.

Looking at `SCCache` closing code, I found two major bugs:

1. For cache writer side, we piggyback on `Compile_lock` to abitrate the access to maybe-closing cache. That does not work: in `ciEnv::register_method`, we already have `SCEntry*` in our hands, and once we acquire the `Compile_lock`, the whole `SCCache` might be gone, `SCEntry*` points to garbage, and we SEGV. This one is easy to solve: check that cache is still on, under the lock.

2. For cache reader side, we have a `reading_nmethod` counter, which is supposed to track how many nmethod readers are currently active. Unfortunately, this is still racy: nothing prevents new nmethod readers to appear _"after"_ we closed the cache. So those nmethod readers can access the cache in invalid state. There is also piggybacking on `Compilation_lock`, which AFAICS is for WhiteBox paths! So this is not great. I replaced the whole thing with carefully tracking the nmethod reader counts, and making sure we have the phasing right.

Eliminating `Compilation_lock` wait improves performance and variance on short workloads as well. Observe:


$ rm -f app.cds*; taskset -c 0-7 hyperfine -w 50 -r 100 "build/linux-x86_64-server-release/images/jdk/bin/java -XX:CacheDataStore=app.cds -Xmx256m -Xms256m -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -cp hellostream.jar HelloStream"

# BEFORE (this also crashes every so often)
Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -XX:CacheDataStore=app.cds -Xmx256m -Xms256m -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -cp hellostream.jar HelloStream
  Time (mean ± σ):      24.0 ms ±   4.3 ms    [User: 14.1 ms, System: 9.8 ms]
  Range (min … max):    17.0 ms …  27.3 ms    100 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

# AFTER
Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -XX:CacheDataStore=app.cds -Xmx256m -Xms256m -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -cp hellostream.jar HelloStream
  Time (mean ± σ):      17.3 ms ±   0.3 ms    [User: 10.8 ms, System: 9.4 ms]
  Range (min … max):    16.8 ms …  18.6 ms    100 runs


Additional testing:
 - [x] Linux x86_64 server fastdebug, `runtime/cds`

-------------

Commit messages:
 - Fix for reader side
 - Fix for writer side

Changes: https://git.openjdk.org/leyden/pull/13/files
  Webrev: https://webrevs.openjdk.org/?repo=leyden&pr=13&range=00
  Stats: 101 lines in 3 files changed: 84 ins; 5 del; 12 mod
  Patch: https://git.openjdk.org/leyden/pull/13.diff
  Fetch: git fetch https://git.openjdk.org/leyden.git pull/13/head:pull/13

PR: https://git.openjdk.org/leyden/pull/13


More information about the leyden-dev mailing list