Request for Comment: Add a stalling mode to asynchronous UL

Thomas Stüfe thomas.stuefe at gmail.com
Fri Feb 16 14:56:33 UTC 2024


CC'ing Xin as original author.

At first glance, I like this. I like the ring-buffer approach better, too.
Not sure whether it was discussed originally, but the original PR went on
for a while, IIRC.

Do we even need the "drop" mode? I never liked the idea of losing output,
and maybe stalls are rare enough that they won't matter. Arguably, if you
stall you should increase the log buffer size.

..Thomas


On Fri, Feb 16, 2024 at 3:45 PM Johan Sjolen <johan.sjolen at oracle.com>
wrote:

> Hello Hotspot,
>
> Today, asynchronous UL off-loads the work of writing to output devices:
> log sites copy their messages into a buffer, and an output thread swaps it
> with an empty buffer and flushes the full one.
> If the current buffer runs out of room, then any attempt at logging leads
> to the message being dropped, that is: the message is not printed at all.
> I'd like to suggest that we add a stalling mode to async UL, in which no
> messages are dropped and log sites instead stall until there is sufficient
> room in the buffer for their respective message.
>
> My purpose with this e-mail is twofold. First, I would like to know
> whether there is any interest in such a feature being added to Hotspot.
> Second, I'd like to discuss a change in the design of async UL which will
> permit this feature and other improvements.
> I imagine that the user-facing change for async UL would be to equip
> -Xlog:async with an option, like so:
>
> ```
> $ java -Xlog:async:drop  ... # Drop messages
> $ java -Xlog:async:stall ... # Stall log sites
> $ java -Xlog:async       ... # Drop messages by default
> ```
>
> The rest of this e-mail describes the design changes and the motivation
> behind them, so if you're mostly interested in answering whether or not
> you want a stalling mode in async UL, then you don't need to bother with
> the rest 🙂.
>
> There's a draft PR for these changes; it runs on my Linux computer, and I
> have implemented the necessary changes for macOS and Windows as well. The
> PR is available here: https://github.com/openjdk/jdk/pull/17757
>
> I have described the current async protocol, as found in mainline, in
> appendix 0. There are two issues with this protocol:
>
> 1. During periods of heavy logging there is also high contention on the
> mutex. This can starve the output thread, as it cannot take the mutex to
> swap the buffers.
>    This in turn leads to excessive dropping of messages, and increasing
> the buffer size scales poorly.
> 2. Implementing stalling (as opposed to dropping messages) under this
> protocol leads to unnecessary waiting at the log sites; see the sketch
> after this list.
>    Stalling would mean that a log site has to wait for the entire log
> buffer to be emptied before it can log.
>    This may mean 1MiB of data having to be written out before the swap can
> occur, and of course the swap requires the writing thread to acquire the
> mutex, which has its own issues.
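>
> To make issue 2 concrete, here is a minimal sketch of what a stalling log
> site would look like under the current ping-pong design. This is
> illustrative only: std::mutex and std::condition_variable stand in for
> HotSpot's own locking primitives, and all names here are invented.
>
> ```
> #include <condition_variable>
> #include <cstddef>
> #include <cstring>
> #include <mutex>
>
> struct Buffer {
>   size_t used = 0;
>   size_t capacity = 1 << 20;  // e.g. a 1MiB buffer
>   bool has_space(size_t n) const { return used + n <= capacity; }
>   void write(const char* msg) { used += strlen(msg); /* copy elided */ }
> };
>
> std::mutex m;
> std::condition_variable swapped;  // signalled once the buffers are swapped
> Buffer* cb;                       // buffer currently accepting messages
>
> void stalling_log_site(const char* message) {
>   std::unique_lock<std::mutex> lock(m);
>   while (!cb->has_space(strlen(message))) {
>     // Space only appears after the output thread has flushed the WHOLE
>     // current buffer and performed the swap, so this wait can be very long.
>     swapped.wait(lock);
>   }
>   cb->write(message);
> }
> ```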
>
> Issue number 1 is most easily observed by inducing a (perhaps
> artificially) high amount of logging, by setting logging to all=trace in a
> slow-debug build:
>
> ```
> $ slow-debug/jdk/bin/java -Xlog:async -Xlog:all=trace:file=asynctest::filecount=0
> $ grep 'dropped due to' asynctest
> [0.147s][warning][                       ]   1338 messages dropped due to async logging
> [0.171s][warning][                       ]    496 messages dropped due to async logging
> [0.511s][warning][                       ]     98 messages dropped due to async logging
> [0.542s][warning][                       ]     13 messages dropped due to async logging
> [0.762s][warning][                       ]   5620 messages dropped due to async logging
> [1.286s][warning][                       ]  12440 messages dropped due to async logging
> [1.316s][warning][                       ]   2610 messages dropped due to async logging
> [1.352s][warning][                       ]   2497 messages dropped due to async logging
> ```
>
> I'd like to suggest that we re-implement async logging using a
> producer/consumer protocol backed by a circular buffer instead. This means
> that the output thread no longer suffers from starvation and can
> continuously make progress.
> It also means that we don't have to split the buffer memory in two: the
> log threads and the output thread share the whole buffer at the same
> time.
>
> My solution does seem to improve the dropping situation, as per the same
> test:
>
> ```
> $ ul/jdk/bin/java -Xlog:async:drop -Xlog:all=trace:file=asyncdrop::filecount=0
> $ grep 'dropped due to' asyncdrop
> [0.295s][warning][                       ]    195 messages dropped due to async logging
> [0.949s][warning][                       ]   3012 messages dropped due to async logging
> [1.315s][warning][                       ]   7992 messages dropped due to async logging
> ```
> That is roughly 11 000 dropped messages, as opposed to roughly 25 000 in
> the old solution. I also did some throughput benchmarking, which shows that
> stalling has minimal impact on throughput; see appendix 2 for the
> throughput timings.
>
>
> The new protocol is a basic producer/consumer pattern with two mutexes and
> a circular buffer with two variables: head and tail. The consumer reads
> tail and sets head; the producer reads head and writes to tail. In this
> case the producers are the log sites and the consumer is the output thread.
> See appendix 1 for a pseudo-code description of the protocol. Some details
> are omitted, such as how stalling is implemented (hint: wait/notify on the
> producer lock); please read the code for details.
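>
> For a rough idea of that stall path, here is a minimal self-contained
> sketch, with std::condition_variable standing in for HotSpot's locking
> primitives; the names (space_freed, free_bytes, and so on) are invented
> for illustration, and the draft PR is authoritative:
>
> ```
> #include <condition_variable>
> #include <cstddef>
> #include <mutex>
>
> std::mutex producer_lock;             // serializes the log sites
> std::condition_variable space_freed;  // signalled after head moves forward
> size_t free_bytes = 0;                // maintained by the buffer bookkeeping
>
> // Producer side, called with producer_lock held when the message does not
> // fit: sleep until the output thread has freed enough room. The predicate
> // re-check also guards against spurious wakeups.
> void stall(std::unique_lock<std::mutex>& lock, size_t needed) {
>   space_freed.wait(lock, [&] { return free_bytes >= needed; });
> }
>
> // Consumer side: after advancing head, publish the new amount of free
> // space and wake any stalled log sites.
> void wake_stalled_producers(size_t now_free) {
>   {
>     std::lock_guard<std::mutex> g(producer_lock);
>     free_bytes = now_free;
>   }
>   space_freed.notify_all();
> }
> ```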
>
> TL;DR:
>
> 1. Add a 'stall' strategy to async logging so that we can guarantee 0
> dropped messages at the expense of delays at log sites. This is exposed
> through the command line option. Before: -Xlog:async; after:
> -Xlog:async[:[drop|stall]]. The default is still drop.
> 2. Ping-pong buffers are not a good idea when also supporting stalling,
> because a log site can be stalled for far longer than in the synchronous
> case.
> 3. Therefore, switch out the ping-pong buffers and single mutex for a
> circular buffer and separate producer and consumer mutexes. This also has
> the benefit of reducing dropped messages in my testing.
>
> Thank you for reading and I appreciate any input on this.
> Johan Sjölén
>
>
> Appendix 0, the old async protocol.
>
> Nota bene: Any absence or presence of bugs may be the opposite in the
> actual source code.
>
> ```
> Variables:
> Mutex m;
> Buffer* cb;
> Buffer a;
> Buffer b;
> const char* message;
>
> Methods:
> void Buffer::write(const char* message) {
>   if (has_space(strlen(message))) {
>     internal_write(message);
>   }
> }
>
> LogSite:
> 1. m.lock();
> 2. cb->write(message);
> 3. m.notify(); // Notify while still holding the lock
> 4. m.unlock();
>
> OutputThread:
> 1. m.lock();
> 2. m.wait(); // Releases m while waiting, reacquires it on wakeup
> 3. buffer := cb;
> 4. if cb == &a then cb = &b; else cb = &a;
> 5. m.unlock();
> 6. for each message in buffer: print to the log output
> 7. go to 1.
> ```
>
> Appendix 1, the new protocol.
>
> Nota bene: Any absence or presence of bugs may be the opposite in the
> actual source code.
> ```
> Mutex consumer;
> Mutex producer;
> CircularBuffer cb;
> size_t head;
> size_t tail;
> const char* message;
> char* out;
>
> LogSite:
> 1. producer.lock();
> 2. t = tail; h = head; // Read the current positions
> 3. if unused_memory(h, t) < strlen(message) then drop_or_stall();
> 4. size_t bytes_written = cb.write(message); // Writes string and envelope
> 5. tail = (t + bytes_written) % cb.size(); // Message ready to consume, move tail
> 6. producer.unlock();
>
> OutputThread:
> 1. consumer.lock();
> 2. t = tail; h = head;
> 3. if h == t then consumer.unlock(); return; // No message available
> 4. size_t bytes_read = cb.read_bytes(out, h); // Copy the next message into out
> 5. head = (h + bytes_read) % cb.size(); // More memory available for producers
> 6. consumer.unlock();
> 7. print out to the log output; go to 1.
> ```
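>
> The head/tail arithmetic that unused_memory and the modulo updates above
> rely on is ordinary ring-buffer bookkeeping. A small runnable sketch, again
> with invented names and under the assumption that one byte is kept unused
> so that head == tail always means "empty":
>
> ```
> #include <cassert>
> #include <cstddef>
>
> constexpr size_t SIZE = 16;  // tiny buffer so the wraparound is visible
> size_t head = 0;             // consumer position: next byte to read
> size_t tail = 0;             // producer position: next byte to write
>
> // Bytes occupied by messages not yet consumed.
> size_t used_memory() { return (tail + SIZE - head) % SIZE; }
>
> // Bytes a producer may still write; the -1 keeps one byte unused so a
> // full buffer is never mistaken for an empty one.
> size_t unused_memory() { return SIZE - used_memory() - 1; }
>
> int main() {
>   tail = (tail + 10) % SIZE;  // producer wrote 10 bytes: head=0, tail=10
>   assert(used_memory() == 10 && unused_memory() == 5);
>   head = (head + 10) % SIZE;  // consumer caught up: buffer empty again
>   tail = (tail + 9) % SIZE;   // 9 more bytes wrap: tail == (10+9) % 16 == 3
>   assert(tail == 3 && used_memory() == 9);
>   return 0;
> }
> ```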
>
> Appendix 2, the throughput tests.
>
> ```
> Testing requires the tool pv; example of how to try it yourself:
> $ ./build/slow/jdk/bin/java -Xlog:async -Xlog:all=trace | (pv --rate --timer --bytes > /dev/null)
>
> # Slow debug build
> ## Default sync logging
> ### /dev/null
> 48.5MiB 0:00:02 [17.5MiB/s]
> ### To file
> 48.5MiB 0:00:03 [14.4MiB/s]
>
> ## Old async logging
> ### /dev/null
> 51.1MiB 0:00:02 [20.9MiB/s]
> ### To file
> 57.1MiB 0:00:02 [19.7MiB/s]
> Dropped messages:
> [0.202s][warning][                       ]    350 messages dropped due to async logging
> [1.302s][warning][                       ]   1981 messages dropped due to async logging
> [1.328s][warning][                       ]   2193 messages dropped due to async logging
> [1.354s][warning][                       ]   2464 messages dropped due to async logging
> [1.379s][warning][                       ]   4158 messages dropped due to async logging
> [1.405s][warning][                       ]   3017 messages dropped due to async logging
> [1.435s][warning][                       ]   3818 messages dropped due to async logging
> [1.479s][warning][                       ]   2286 messages dropped due to async logging
>
> ## New async logging (dropping mode)
> ### /dev/null
> 50.6MiB 0:00:02 [20.9MiB/s]
> ### To file
> 50.8MiB 0:00:02 [18.1MiB/s]
> Dropped messages: 0 on this run.
> Did another run as there is high variability in dropped messages:
> 50.2MiB 0:00:02 [18.3MiB/s]
> [1.276s][warning][                       ]   4942 messages dropped due to async logging
>
> ## New async logging (stalling mode)
> ### /dev/null
> 51.3MiB 0:00:02 [19.9MiB/s]
> ### To file
> 50.9MiB 0:00:02 [18.8MiB/s]
>
> # Release build Linux-x64 (to a tmpfs mounted file)
> ## New async drop
> 20,1MiB 0:00:00 [43,3MiB/s]
> 0 dropped messages
> ## New async stall
> 33,7MiB 0:00:00 [41,5MiB/s]
> 0 dropped messages (stalling guarantees no drops)
> ## Old async
> 17,7MiB 0:00:00 [42,2MiB/s]
> [0.059s][warning][                       ]   6860 messages dropped due to async logging
> [0.082s][warning][                       ]  20601 messages dropped due to async logging
> [0,107s][warning][                       ]   9147 messages dropped due to async logging
> [0,131s][warning][                       ]  11113 messages dropped due to async logging
> [0,154s][warning][                       ]  11845 messages dropped due to async logging
> [0,178s][warning][                       ]  10383 messages dropped due to async logging
> [0,197s][warning][                       ]   9376 messages dropped due to async logging
> [0,216s][warning][                       ]  10997 messages dropped due to async logging
> [0,234s][warning][                       ]   5978 messages dropped due to async logging
> [0,252s][warning][                       ]  10439 messages dropped due to async logging
> [0,270s][warning][                       ]   6562 messages dropped due to async logging
> [0,325s][warning][                       ]    391 messages dropped due to async logging
> [0,343s][warning][                       ]   3532 messages dropped due to async logging
> [0,362s][warning][                       ]   3268 messages dropped due to async logging
> [0,395s][warning][                       ]   4182 messages dropped due to async logging
> [0,411s][warning][                       ]   2490 messages dropped due to async logging
> ```
>