RFR: 8229517: Support for optional asynchronous/buffered logging

Wed May 26 05:40:07 UTC 2021

On 25/05/2021 5:30 pm, Xin Liu wrote:
> On Tue, 25 May 2021 06:49:41 GMT, David Holmes <david.holmes at oracle.com> wrote:
> 
>> It isn't at all clear why it was necessary to introduce the io_sem just
> so some gtest will pass. Why did these tests fail?
> 
> You brought up this issue when you ran into crashes. https://github.com/openjdk/jdk/pull/3135#issuecomment-845993038

Yes but I was unclear why these tests were failing the way they were.

> Here is my analysis. AsyncLogWriter::flush() was unsafe.  Previously,  flush() only flushed data out of the buffer. yes, it try to write what it dequeued, but there's a race condition I didn't think of!
> 
> Here is the case:
> A thread invokes AsyncLogWriter::flush() and the buffer is empty() at that time. Previously, write() returned because it had nothing to write. However, the following code in log writer thread may be executing in parallel!
> 
> 
>    while (!it.is_empty()) {
>      AsyncLogMessage* e = it.next();
>      char* msg = e->message();
> 
>      if (msg != nullptr) {
>        e->output()->write_blocking(e->decorations(), msg);
>        os::free(msg);
>      }
>    }
> 
> If that race condition happens, it's unsafe to read log files after AsyncLogWriter::flush(). log writer thread is writing them!

If I understand what you are saying the issue was not the flush() itself 
but the fact that we could continue past the flush() and call 
LogConfiguration::finalize() while the writer thread was still writing 
out messages.

So really what that was indicating was a failure to synchronize with the 
writer thread during "shutdown". That could cause a crash or simply 
result in missing log output.

> It's even not safe to destroy LogOutput instances here. (That's the byproduct bug I discovered).
> https://github.com/openjdk/jdk/pull/3135/files#diff-64ba5a15efefc31cf8bc2b1c51100933075913fc8ed6028ad348197e086c3c33R287
> 
> How to guarantee file writings have done?  That's why I introduced _io_sem.
> Now write() acquires and releases _io_sem no matter what. Even the buffer is empty.
> This can guarantee that the buffer used to be empty and all dequeued messages are written by flush().

Okay so let me walk through this:

- Some logging happens and writer thread is logically woken up but 
doesn't yet execute.
- a shutdown occurs and we reach the point where we call 
LogConfiguration::finalize() which calls disable_outputs()
- at this point no new logging is possible
- we then have a race between the shutdown thread calling flush() and 
the async writer thread
- whomever gets the AsyncLogLock first claims all the buffered messages 
and will write them out
- if the shutdown thread wins there is no problem, the async writer 
thread will just redo _sem.wait()
- if the async writer thread wins we need the shutdown thread to wait 
until the async writer has finished writing

So we could at this point do some direct kind of "handshake" between the 
shutdown thread and the async writer thread to coordinate them. But 
instead the _iosem is introduced. The async writer thread holds the 
_iosem for the duration of the writing. The shutdown thread, after 
finding no messages buffered, can't acquire the _iosem until the async 
writer is done.

Okay I understand the protocol now.

> After I strengthen `flush()`, I found I could even enable all unified logging tests in async mode.

So again prior to this the real problem was a race between final logging 
output and VM termination, that could result in either crashes or lost 
logs messages, which in turn could cause test failures.

At one stage I believe you had intended to terminate the async writer 
thread, and that would have fixed the problem as well - assuming you 
handled the termination properly.

Thanks,
David
-----

> -------------
> 
> PR: https://git.openjdk.java.net/jdk/pull/3135
>