RFR: 8344831: [REDO] CDS: Parallel relocation [v6]
Thomas Stuefe
stuefe at openjdk.org
Mon Dec 2 13:42:47 UTC 2024
On Mon, 2 Dec 2024 11:58:08 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
>> This re-does [JDK-8341334](https://bugs.openjdk.org/browse/JDK-8341334), being mindful of failures reported in [JDK-8344583](https://bugs.openjdk.org/browse/JDK-8344583). See the original motivation in [JDK-8341334](https://bugs.openjdk.org/browse/JDK-8341334).
>>
>> To simplify reviews, this PR contains the original commit (https://github.com/openjdk/jdk/commit/4218b882cd0a75f3159495ec2a57305ecfc8bf69), and amendments are stacked on top of it.
>>
>> The common theme in all failures that we have seen is VM exiting when `ArchiveWorkers` (`AW`) are still actively waiting on related `Semaphores`. Since the `AW` were static, the destructors for `Semaphores` would run as part of `AW` destruction sequence, and what would make the active workers fail.
>>
>> We resolve this trouble and strengthen the code with the following amendments:
>> 1. `AW` is now a structured stack object. `AW` instances are not exposed statically, so even the abnormal VM termination would not break semaphores. The normal scoping rules would make sure `AW` is shutdown on all normal paths.
>> 2. We move `AW` use mark to the only place it is currently used. This allows us to further insulate from `AW` problems if we flip the `AOTCacheParallelRelocation` flag back to `false` in the field.
>> 3. `AW` is now a single-use object. Making it reusable proves to be fairly hard, especially when workers are lagging behind during the startup or wakeups. Quite a bit of that can be mitigated by smart shutdown protocol, but it requires more coordination, and extra pool work. Single use `AW` is significantly easier to reason about. We can re-instate reusability in/when we need it.
>>
>> I will put the performance data in a separate commit.
>>
>> Additional testing:
>> - [x] GHA
>> - [x] linux-x86_64-server-fastdebug, `runtime/cds`
>> - [x] linux-x86_64-server-fastdebug, `all`
>> - [x] macos-aarch64-server-fastdebug, `compiler/compilercontrol compiler/ciReplay runtime/CommandLine/` (100x, no failures)
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 10 additional commits since the last revision:
>
> - Rewrite FIXMEs as proper comments
> - Merge branch 'master' into JDK-8344831-cds-parallel-relocation-redo
> - Merge branch 'master' into JDK-8344831-cds-parallel-relocation-redo
> - Attempt at deleting the threads
> - Atomicity improvements
> - We do not actually need to track finished workers
> - Two-stage termination to clean up Semaphore correctly
> - Gross simplification: single-use ArchiveWorkers
> - Actual REDO increment: amendments to fix failures
> - Original JDK-8341334
Almost good, two nits remain.
src/hotspot/share/cds/archiveUtils.cpp line 430:
> 428: while (true) {
> 429: int cur = Atomic::load(&_started_workers);
> 430: if (cur >= _num_workers) {
Trying to understand how we could ever overshoot _num_workers. I think this should work with == too, right? In any case I'm fine with it, just puzzled.
test/hotspot/gtest/cds/test_archiveWorkers.cpp line 46:
> 44: TEST_VM(ArchiveWorkersTest, continuous_restart) {
> 45: for (int c = 0; c < 1000; c++) {
> 46: ArchiveWorkers workers;
Maybe I am dense, but I don't understand what this test does. Is this not just object creation without anything happening? If so, should something happen here?
-------------
PR Review: https://git.openjdk.org/jdk/pull/22369#pullrequestreview-2472630452
PR Review Comment: https://git.openjdk.org/jdk/pull/22369#discussion_r1865862784
PR Review Comment: https://git.openjdk.org/jdk/pull/22369#discussion_r1865806482
More information about the hotspot-runtime-dev
mailing list