RFR: 8323807: Async UL: Add a stalling mode to async UL [v10]
Johan Sjölen
jsjolen at openjdk.org
Thu Jan 23 07:54:54 UTC 2025
On Thu, 23 Jan 2025 03:39:30 GMT, David Holmes <dholmes at openjdk.org> wrote:
>> Johan Sjölen has updated the pull request incrementally with one additional commit since the last revision:
>>
>> NOT_DEBUG is not the same as PRODUCT_ONLY I assume
>
> src/hotspot/share/logging/logAsyncWriter.cpp line 309:
>
>> 307: AsyncLogWriter* instance = AsyncLogWriter::instance();
>> 308: if (instance != nullptr) {
>> 309: if ((uintptr_t)instance == (uintptr_t)Thread::current_or_null()) {
>
> Why the casts ??
Sorry, that was sloppy of me. I still need a cast, but to the proper `Thread` type.
> src/hotspot/share/logging/logAsyncWriter.cpp line 312:
>
>> 310: // If logging from the consuming thread then fall back on synchronous logging.
>> 311: // Otherwise, the consuming thread may wait on itself to print the message,
>> 312: // this obviously leads to a deadlocked system.
>
> I'm somewhat surprised that we already allowing for logging within the logging thread as that potentially leads to infinite recursion. ??
The comment can probably be clarified.
The bug that we're fixing is specifically that the consumer thread logs, not the logging/consumption code which the consumer thread executes. Look at `Thread::call_run`:
```c++
log_debug(os, thread)("Thread " UINTX_FORMAT " stack dimensions: "
PTR_FORMAT "-" PTR_FORMAT " (" SIZE_FORMAT "k).",
os::current_thread_id(), p2i(stack_end()),
p2i(stack_base()), stack_size()/1024);
// Perform <ChildClass> initialization actions
DEBUG_ONLY(_run_state = PRE_RUN;)
this->pre_run();
// Invoke <ChildClass>::run()
DEBUG_ONLY(_run_state = RUN;)
this->run();
// Returned from <ChildClass>::run(). Thread finished.
Clearly we're logging before calling `this->run()` which will enter `AsyncLogWriter::run`.
And now look at the code for advertising that the asynchronous mode is running:
```c++
AsyncLogWriter* self = new AsyncLogWriter();
if (self->_initialized) {
Atomic::release_store_fence(&AsyncLogWriter::_instance, self);
// All readers of _instance after the fence see non-null.
// We use LogOutputList's RCU counters to ensure all synchronous logsites have completed.
// After that, we start AsyncLog Thread and it exclusively takes over all logging I/O.
for (LogTagSet* ts = LogTagSet::first(); ts != nullptr; ts = ts->next()) {
ts->wait_until_no_readers();
}
os::start_thread(self);
So before the thread has been started all new logs are done using asynchronous logging. This means that the `log_debug` in the first snippet will cause the consumer thread to stall, as it's waiting for itself to take the stalled message and print it.
I was first considering fixing the initialization order, by doing something like below. That would also work, but I thought it's easier to reason about synchronous logging for the consumer thread, as it still gives us the same guarantees. After considering this some more, I think that both are necessary. If any of the code in `os::pd_start_thread` ever logs, then we can run into a deadlock in stalling mode.
```c++
// We use LogOutputList's RCU counters to ensure all synchronous logsites have completed.
// After that, we start AsyncLog Thread and it exclusively takes over all logging I/O.
for (LogTagSet* ts = LogTagSet::first(); ts != nullptr; ts = ts->next()) {
ts->wait_until_no_readers();
}
AsyncLogWriter* self = new AsyncLogWriter();
if (self->_initialized) {
os::start_thread(self);
while (!self->_has_entered_printing_loop) {
self->_some_monitor.wait();
}
Atomic::release_store_fence(&AsyncLogWriter::_instance, self);
// All readers of _instance after the fence see non-null.
I actually find this code very strange. Loading the `_instance` was never done using an `acquire` fence, so what guarantees that "All reads of _instance after the fence see non-null"?
The reason that none of this was discovered in testing is that it's extremely improbably that we run out of buffer space while starting the async log thread. This is why I added a test that reduces the buffer size to something very small.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/22770#discussion_r1926515464
PR Review Comment: https://git.openjdk.org/jdk/pull/22770#discussion_r1926514366
More information about the hotspot-runtime-dev
mailing list