RFR: 8344831: [REDO] CDS: Parallel relocation [v5]

Sun Dec 1 21:41:46 UTC 2024

On Fri, 29 Nov 2024 10:35:58 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

>> This re-does [JDK-8341334](https://bugs.openjdk.org/browse/JDK-8341334), being mindful of failures reported in [JDK-8344583](https://bugs.openjdk.org/browse/JDK-8344583). See the original motivation in [JDK-8341334](https://bugs.openjdk.org/browse/JDK-8341334).
>> 
>> To simplify reviews, this PR contains the original commit (https://github.com/openjdk/jdk/commit/4218b882cd0a75f3159495ec2a57305ecfc8bf69), and amendments are stacked on top of it.
>> 
>> The common theme in all failures that we have seen is VM exiting when `ArchiveWorkers` (`AW`) are still actively waiting on related `Semaphores`. Since the `AW` were static, the destructors for `Semaphores` would run as part of `AW` destruction sequence, and what would make the active workers fail. 
>> 
>> We resolve this trouble and strengthen the code with the following amendments:
>>   1. `AW` is now a structured stack object. `AW` instances are not exposed statically, so even the abnormal VM termination would not break semaphores. The normal scoping rules would make sure `AW` is shutdown on all normal paths.
>>   2. We move `AW` use mark to the only place it is currently used. This allows us to further insulate from `AW` problems if we flip the `AOTCacheParallelRelocation` flag back to `false` in the field.
>>   3. `AW` is now a single-use object. Making it reusable proves to be fairly hard, especially when workers are lagging behind during the startup or wakeups. Quite a bit of that can be mitigated by smart shutdown protocol, but it requires more coordination, and extra pool work. Single use `AW` is significantly easier to reason about. We can re-instate reusability in/when we need it.
>> 
>> I will put the performance data in a separate commit.
>> 
>> Additional testing:
>>  - [x] GHA
>>  - [x] linux-x86_64-server-fastdebug, `runtime/cds`
>>  - [x] linux-x86_64-server-fastdebug, `all`
>>  - [x] macos-aarch64-server-fastdebug, `compiler/compilercontrol compiler/ciReplay runtime/CommandLine/` (100x, no failures)
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision:
> 
>  - Merge branch 'master' into JDK-8344831-cds-parallel-relocation-redo
>  - Attempt at deleting the threads
>  - Atomicity improvements
>  - We do not actually need to track finished workers
>  - Two-stage termination to clean up Semaphore correctly
>  - Gross simplification: single-use ArchiveWorkers
>  - Actual REDO increment: amendments to fix failures
>  - Original JDK-8341334

Yes this is a bug in `os::free_thread()`. We have this:

  // osthread() can be null, if creation of thread failed.
  if (osthread() != nullptr) os::free_thread(osthread());

  // Clear Thread::current if thread is deleting itself and it has not
  // already been done. This must be done before the memory is deallocated.
  // Needed to ensure JNI correctly detects non-attached threads.
  if (this == Thread::current_or_null()) {
    Thread::clear_thread_current();
  }

where we clearly state `Thread::current` may have already been cleared - which is the case for `NonJavaThread` (but not `JavaThread`) - yet `free_thread` relies on it being set.

The `NonJavaThread` lifecycle is a bit ad-hoc as different NJT's behave differently. In many cases they never  really terminate, and in others they never delete the thread (e.g. WatcherThread). So we have not hit this bug before.

Your fix is correct (but without the fixme comment please).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22369#issuecomment-2510262380