RFR: Fix SCCache closing races

Vladimir Kozlov kvn at openjdk.org
Wed Sep 4 21:53:17 UTC 2024


On Wed, 4 Sep 2024 12:46:31 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> I have been scratching my head why Leyden builds crash with multiple iterations of `HelloStream` example.
> 
> The crashes are of various shapes, but they have a similar clue: it happens when we touch `SCCache` when it was already closed: `hs_err` shows we have called `before_exit`, `gdb` shows `SCCache::_cache` is already `nullptr`, but the crashes are in compiler code that accesses `SCCache`.
> 
> Looking at `SCCache` closing code, I found two major bugs:
> 
> 1. For cache writer side, we piggyback on `Compile_lock` to abitrate the access to maybe-closing cache. That does not work: in `ciEnv::register_method`, we already have `SCEntry*` in our hands, and once we acquire the `Compile_lock`, the whole `SCCache` might be gone, `SCEntry*` points to garbage, and we SEGV. This one is easy to solve: check that cache is still on, under the lock.
> 
> 2. For cache reader side, we have a `reading_nmethod` counter, which is supposed to track how many nmethod readers are currently active. Unfortunately, this is still racy: nothing prevents new nmethod readers to appear _"after"_ we closed the cache. So those nmethod readers can access the cache in invalid state. There is also piggybacking on `Compilation_lock`, which AFAICS is for WhiteBox paths! So this is not great. I replaced the whole thing with carefully tracking the nmethod reader counts, and making sure we have the phasing right.
> 
> Eliminating `Compilation_lock` wait improves performance and variance on short workloads as well. Observe:
> 
> 
> $ rm -f app.cds*; taskset -c 0-7 hyperfine -w 50 -r 100 "build/linux-x86_64-server-release/images/jdk/bin/java -XX:CacheDataStore=app.cds -Xmx256m -Xms256m -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -cp hellostream.jar HelloStream"
> 
> # BEFORE (this also crashes every so often)
> Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -XX:CacheDataStore=app.cds -Xmx256m -Xms256m -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -cp hellostream.jar HelloStream
>   Time (mean ± σ):      24.0 ms ±   4.3 ms    [User: 14.1 ms, System: 9.8 ms]
>   Range (min … max):    17.0 ms …  27.3 ms    100 runs
>  
>   Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
> 
> # AFTER
> Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -XX:CacheDataStore=app.cds -Xmx256m -Xms256m -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -cp hellostream....

Looks good. Thank you for fixing it.

Please, use [premain] prefix in subjects which we use in JBS for our bugs.
And Ioi filed SKARA request to automatically send our PR requests to Leyden-dev mailing list because I missed this PR.

src/hotspot/share/code/SCCache.cpp line 4198:

> 4196:     int cur = Atomic::load(&_nmethod_readers);
> 4197:     int upd = -(cur + 1);
> 4198:     if (cur >= 0 && Atomic::cmpxchg(&_nmethod_readers, cur, upd) == cur) {

Clever.

-------------

Marked as reviewed by kvn (Committer).

PR Review: https://git.openjdk.org/leyden/pull/13#pullrequestreview-2281435783
PR Comment: https://git.openjdk.org/leyden/pull/13#issuecomment-2330191492
PR Review Comment: https://git.openjdk.org/leyden/pull/13#discussion_r1744526295


More information about the leyden-dev mailing list