RFR: 8229517: Support for optional asynchronous/buffered logging [v20]

Tue May 25 05:23:29 UTC 2021

Hi Xin,

On 25/05/2021 7:57 am, Xin Liu wrote:
> On Mon, 24 May 2021 04:12:54 GMT, Xin Liu <xliu at openjdk.org> wrote:
> 
>>> This patch provides a buffer to store asynchrounous messages and flush them to
>>> underlying files periodically.
>>
>> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
>>
>>    flush() waits until all pending logging IO operations are done.
>>    
>>    This patch support all gtests in async mode.
>>    make test TEST="gtest:all" TEST_OPTS='VM_OPTIONS=-Xlog:async'
> 
> I have a workaround.  this is a biased solution. Who is taking the buffer lock has the priority to obtain the IO channel as well.
> 
> PS: I have to use trywait here to guarantee non-blocking. wait() will make one thread sleep with the buffer lock.  As a result, logsites can't enqueue messages.
> 
> 
> void AsyncLogWriter::write() {
> ...
>    AsyncLogBuffer logs;
>    bool own_io = false;
> 
>    { // critical region
>      AsyncLogLocker lock;
> 
>      _buffer.pop_all(&logs);
>      // append meta-messages of dropped counters
>      AsyncLogMapIterator dropped_counters_iter(logs);
>      _stats.iterate(&dropped_counters_iter);
>      own_io = _io_sem.trywait();
>    }
> 
>    LinkedListIterator<AsyncLogMessage> it(logs.head());
>    if (!own_io) {
>      _io_sem.wait();
>    }
> ...
> 
> 
> If we only deal with two threads situation, it can prevent the "interleaving" issue.
> Whoever fail get the second chance at `_io_sem.wait()`.

Yes this seems quite reasonable for dealing with that part of the problem.

>> Do the log messages already contain the appropriate timestamps from the original log call, or is that added when the log message is actually output to the destination?
> 
> Current implementation can guarantee both of them.

I'm not sure what that means as I only expect there to be one timestamp. 
I'm just not sure if it is recording the time at which the log call was 
made, or the time (later) when that log record was actually written out?

> I have to say that we may over-thinking here. 

Possibly. :) But best to flesh this out now rather than trying to chase 
down issues later that may not become apparent for a while.

> There are 3 situations where we used flush(). they all belong to "two-thread situation". one log writer thread. the other thread invokes `flush()`.
> 
> 1. termination.
> 2. abortion
> 3. gtest.
> 
> It seems that only os::abort() is tricky.  I understand that it will execute with other parallel threads. Actually, I used to run into problem for this one. TEST="gtest:os.page_size_for_region_with_zero_min_pages_vm_assert_test", which trigger segfault on purpose. Now it works fine.
> 
> What do you think about my workaround?

I like it.

> If you still feel os::abort() is unsafe, how about we just remove flush() from os::abort().

I think doing a hard wait() when aborting is risky - but again I'd like 
to hear other opinions on that (paging @stuefe!).

Not flushing on abort is one option. If we can pass through to write() 
the fact that we are aborting then another option would be to simply 
skip the write if the try_wait fails.

> I have to say that write() is pretty frequent, and buffer is always empty().

Using what applications and logging settings?

If the buffer is nearly always empty and so it is very unlikely that the 
final flush will have much, if anything to do, then arguably you don't 
need the io_sem complexity, as at worst you will potentially see a 
handful of log messages out-of-order.

PS. Testing tiers 1-3 on our CI went okay.

Thanks,
David

> -------------
> 
> PR: https://git.openjdk.java.net/jdk/pull/3135
>