RFR: Fix SCCache closing races
Aleksey Shipilev
shade at openjdk.org
Wed Sep 4 21:53:18 UTC 2024
On Wed, 4 Sep 2024 12:46:31 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
> I have been scratching my head why Leyden builds crash with multiple iterations of `HelloStream` example.
>
> The crashes are of various shapes, but they have a similar clue: it happens when we touch `SCCache` when it was already closed: `hs_err` shows we have called `before_exit`, `gdb` shows `SCCache::_cache` is already `nullptr`, but the crashes are in compiler code that accesses `SCCache`.
>
> Looking at `SCCache` closing code, I found two major bugs:
>
> 1. For cache writer side, we piggyback on `Compile_lock` to abitrate the access to maybe-closing cache. That does not work: in `ciEnv::register_method`, we already have `SCEntry*` in our hands, and once we acquire the `Compile_lock`, the whole `SCCache` might be gone, `SCEntry*` points to garbage, and we SEGV. This one is easy to solve: check that cache is still on, under the lock.
>
> 2. For cache reader side, we have a `reading_nmethod` counter, which is supposed to track how many nmethod readers are currently active. Unfortunately, this is still racy: nothing prevents new nmethod readers to appear _"after"_ we closed the cache. So those nmethod readers can access the cache in invalid state. There is also piggybacking on `Compilation_lock`, which AFAICS is for WhiteBox paths! So this is not great. I replaced the whole thing with carefully tracking the nmethod reader counts, and making sure we have the phasing right.
>
> Eliminating `Compilation_lock` wait improves performance and variance on short workloads as well. Observe:
>
>
> $ rm -f app.cds*; taskset -c 0-7 hyperfine -w 50 -r 100 "build/linux-x86_64-server-release/images/jdk/bin/java -XX:CacheDataStore=app.cds -Xmx256m -Xms256m -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -cp hellostream.jar HelloStream"
>
> # BEFORE (this also crashes every so often)
> Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -XX:CacheDataStore=app.cds -Xmx256m -Xms256m -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -cp hellostream.jar HelloStream
> Time (mean ± σ): 24.0 ms ± 4.3 ms [User: 14.1 ms, System: 9.8 ms]
> Range (min … max): 17.0 ms … 27.3 ms 100 runs
>
> Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
>
> # AFTER
> Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -XX:CacheDataStore=app.cds -Xmx256m -Xms256m -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -cp hellostream....
@iklam, @veresov -- you might want to take a look. This blocks performance testing, because current `-XX:CacheDataStore` tests crash for me here a lot.
-------------
PR Comment: https://git.openjdk.org/leyden/pull/13#issuecomment-2328966708
More information about the leyden-dev
mailing list