RFR: 8348402: PerfDataManager stalls shutdown for 1ms

Thu Jan 30 16:34:12 UTC 2025

On Fri, 24 Jan 2025 09:27:07 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> Found this when studying Leyden performance. [JDK-8049304](https://bugs.openjdk.org/browse/JDK-8049304) added 1ms sleep on destruction path to catch up with threads updating the counters. 
> 
> I was not able to confidently prove the deletion race is benign. Even though `sleep` is not really a fix for that race either, I think it is safer to use GlobalCounter to coordinate deletions.
> 
> This improves startup (roundtrip) tests for the expected 1ms:
> 
> 
> $ hyperfine -w 10 -r 100 ...
> 
> # Before
> Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xmx128m Hello
>   Time (mean ± σ):      20.0 ms ±   0.3 ms    [User: 11.7 ms, System: 16.0 ms]
>   Range (min … max):    19.0 ms …  21.7 ms    1000 runs
> 
> # After
> Benchmark 1: build/linux-x86_64-server-release/images/jdk/bin/java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xmx128m Hello
>   Time (mean ± σ):      19.0 ms ±   0.3 ms    [User: 11.9 ms, System: 15.8 ms]
>   Range (min … max):    18.2 ms …  20.8 ms    1000 runs
> 
> 
> Additional testing:
>  - [x] Linux AArch64 server fastdebug, `all`
>  - [x] Linux x86_64 server fastdebug, `all`

Yeah, as I said, there would not be a squeaky clean way out of this.

OK, how about this: add the blurb in the `PerfData` usage docs about the way to coordinate with shutdown if any other code ever updates the counters outside the safepoint, use that protection on the only that @pchilano noted, leave this in `PerfDataManager` for any future use. I pushed this variant to PR now.

> @pchilano: But we could increment a local variable instead (passed as argument or returned by EnterI), and then add it to FutileWakeups if we exit the ThreadBlockInVMPreprocess scope.

Right; this sounds quite a bit ad-hoc, and might not work for any other/future counters? I think it would be beneficial to build in the generic mechanism that any future counter might use.

> @dholmes-ora: My main concerns here are the efficiency of it - the release_store_fence in the middle of low-level OM code does concern me a little (it tends to follow an existing fence however so I expect the impact is not great).

With new patch we only do that in a single place, near a futile wakeup, and I assume the costs of acquiring the `GC::CS` drown in everything else.

> @dholmes-ora: Also so the iteration through all threads in GlobalCounter::write_synchronize does make me wonder how long it actually takes to execute the exit path now, compared to the 1ms sleep? 

Good question. I instrumented the `GC::write_synchronize()` path in `PerfDataManager`, and on my x86_64 5950X I am seeing ~1us for 100 threads, ~10us for 1000 threads, ~100us for 10000 threads, and ~300us for 30000 threads (system limit). So it adds up to ~10ns per thread. I think for all thread counts this is a major win over 1ms sleep, and for practical thread counts it is a wash in the whole startup/shutdown time. Removing this 1ms sleep really optimizes very short-lived applications, for which I think it is fair to not expect thousands of threads.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23293#issuecomment-2624977841