RFR: 8344831: [REDO] CDS: Parallel relocation
David Holmes
dholmes at openjdk.org
Wed Nov 27 05:54:37 UTC 2024
On Mon, 25 Nov 2024 17:58:35 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
> This re-does [JDK-8341334](https://bugs.openjdk.org/browse/JDK-8341334), being mindful of failures reported in [JDK-8344583](https://bugs.openjdk.org/browse/JDK-8344583). See the original motivation in [JDK-8341334](https://bugs.openjdk.org/browse/JDK-8341334).
>
> To simplify reviews, this PR contains the original commit (https://github.com/openjdk/jdk/commit/4218b882cd0a75f3159495ec2a57305ecfc8bf69), and amendments are stacked on top of it.
>
> The common theme in all failures that we have seen is VM exiting when `ArchiveWorkers` (`AW`) are still actively waiting on related `Semaphores`. Since the `AW` were static, the destructors for `Semaphores` would run as part of `AW` destruction sequence, and what would make the active workers fail.
>
> We resolve this trouble and strengthen the code with the following amendments:
> 1. `AW` is now a structured stack object. `AW` instances are not exposed statically, so even the abnormal VM termination would not break semaphores. The normal scoping rules would make sure `AW` is shutdown on all normal paths.
> 2. We move `AW` use mark to the only place it is currently used. This allows us to further insulate from `AW` problems if we flip the `AOTCacheParallelRelocation` flag back to `false` in the field.
> 3. `AW` is now a single-use object. Making it reusable proves to be fairly hard, especially when workers are lagging behind during the startup or wakeups. Quite a bit of that can be mitigated by smart shutdown protocol, but it requires more coordination, and extra pool work. Single use `AW` is significantly easier to reason about. We can re-instate reusability in/when we need it.
>
> I will put the performance data in a separate commit.
>
> Additional testing:
> - [x] GHA
> - [x] linux-x86_64-server-fastdebug, `runtime/cds`
> - [x] linux-x86_64-server-fastdebug, `all`
> - [x] macos-aarch64-server-fastdebug, `compiler/compilercontrol compiler/ciReplay runtime/CommandLine/` (100x, no failures)
@shipilev I think this still suffers from a race condition. It is very tricky to coordinate the destruction of a synchronization object such that it cannot be in use - whether a mutex or semaphore etc. - at the time it is destroyed.
When your last worker calls `signal()` it will internally reach a point inside the semaphore code where the thread calling `wait()` is allowed to proceed. That thread can return from `run_task` and trigger the destructor for `ArchiveWorkers` which will destruct the semaphore as well. But that last worker is still inside the code in `signal` and it will depend on how the low-level semaphore implementation is written as to whether that thread will be able to safely complete execution of that code.
More generally if after the `signal` the worker thread touched any of the state of the `ArchiveWorker`, it could be accessing the state of a deleted object.
These are same kinds of problems that required us to develop ThreadSMR.
-------------
PR Review: https://git.openjdk.org/jdk/pull/22369#pullrequestreview-2463956183
More information about the hotspot-runtime-dev
mailing list