RFR: 8344583: Make ArchiveWorkers lifecycle robust

Thu Nov 21 02:16:13 UTC 2024

On Wed, 20 Nov 2024 12:57:52 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> It is obvious from the bug report that `ArchiveWorkers` lifecycle is not robust enough. Looks like archive workers escape their normal shutdown sequence when an unusual VM exit path is taken. We have tried to capture this in before_exit, but that is obviously not enough.
> 
> This PR reworks the lifecycle a bit: we now use the scoped object to manage `ArchiveWorkers` state. Previous version was carrying `ArchiveWorkers` for the entire lifecycle of CDS archive load path, but that runs into problems when VM exits in the middle of it. This PR shortens the lifecycle of `ArchiveWorkers` to the only place where they are used. To avoid the loss of usefulness, we now allow multiple pool restarts, so if any other code would need these workers, they can restart the pool again. New gtest checks this works.
> 
> Additional testing:
>  - [x] macos-aarch64-server-fastdebug: reported failing tests are not failing anymore
>  - [x] macos-aarch64-server-fastdebug, `tier{1,2}`
>  - [x] linux-x86_64-server-fastdebug, `all`

I'm struggling to see the complete picture with this code (and I still have concerns about the impact of this new thread pool on startup!).

As far as I can tell the crash arises because we have a static `ArchiveWorkers` which embeds two `Semaphore`s. When the VM terminates and the static destructors are run, those `Semaphore`s get blown away. If the `ArchiveWorkerThreads` are still running in the process when that happens they trigger the crash.

I can't determine from this PR how the changes prevent that from happening. ??

Also with this code:

ArchiveWorkerThread::ArchiveWorkerThread(ArchiveWorkers* pool) : NamedThread(), _pool(pool) {
  set_name("ArchiveWorkerThread");
  os::create_thread(this, os::os_thread);
  os::start_thread(this);
}

please note the cautionary comment in os.cpp:

// The INITIALIZED state is distinguished from the SUSPENDED state because the
// conditions in which a thread is first started are different from those in which
// a suspension is resumed.  These differences make it hard for us to apply the
// tougher checks when starting threads that we want to do when resuming them.
// However, when start_thread is called as a result of Thread.start, on a Java
// thread, the operation is synchronized on the Java Thread object.  So there
// cannot be a race to start the thread and hence for the thread to exit while
// we are working on it.  **Non-Java threads that start Java threads either have
// to do so in a context in which races are impossible, or should do appropriate
// locking.**

void os::start_thread(Thread* thread) {

It could perhaps be made clearer that it applies to creation/starting of all threads in this way: you must ensure the thread being started cannot run to completion and terminate and be deallocated before the starting thread can fully return from the synchronization code that is part of the thread startup protocol.

-------------

PR Review: https://git.openjdk.org/jdk/pull/22276#pullrequestreview-2450077820