RFR: JDK-8299415: Saturate the async log buffer more often
Johan Sjölen
jsjolen at openjdk.org
Sat Dec 31 09:38:47 UTC 2022
On Thu, 29 Dec 2022 19:23:07 GMT, Johan Sjölen <jsjolen at openjdk.org> wrote:
> Hi,
>
> This change introduces a heuristic to ensure that the asynchronous logging buffer is saturated more often. Before this change the logger often wrote single messages to the output; now it outputs significantly more data at a time. I also believe that fewer calls to `notify()` reduce the performance impact of asynchronous logging. Running `perf stat` consistently shows fewer context switches with this change, which supports this hypothesis.
>
> The asynchronous logging tests were run and passed with:
>
>
> _JAVA_OPTIONS='-Xlog:async' make CONF=linux-x64 test TEST='gtest:*Async*'
>
>
> The choice of a ratio of 30% and a timeout of 1 second isn't based on data. It seemed reasonable that waiting 1 second for your *sole* log message is acceptable, and that a 30% threshold is low enough to avoid message drops.
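>
> To make this concrete, here is a minimal standalone model of the heuristic (illustrative only, not the actual HotSpot code; `fraction_used()` is named as in the review discussion below, everything else is hypothetical). The producer only wakes the flusher once the buffer is at least 30% full, while the flusher also wakes on its own after 1 second so a lone message is never delayed for long:
>
> #include <chrono>
> #include <condition_variable>
> #include <deque>
> #include <mutex>
> #include <string>
>
> class AsyncBufferModel {
>   static constexpr double kFlushRatio = 0.30;          // saturation threshold
>   static constexpr auto   kTimeout = std::chrono::seconds(1);
>
>   const size_t _capacity;
>   std::deque<std::string> _buffer;
>   std::mutex _mutex;
>   std::condition_variable _cv;
>
>   double fraction_used() const { return double(_buffer.size()) / _capacity; }
>
> public:
>   explicit AsyncBufferModel(size_t capacity) : _capacity(capacity) {}
>
>   void enqueue(std::string msg) {
>     std::lock_guard<std::mutex> lock(_mutex);
>     _buffer.push_back(std::move(msg));
>     if (fraction_used() >= kFlushRatio) {
>       _cv.notify_one();   // only wake the flusher once the buffer is saturated
>     }
>   }
>
>   // Called by the flush thread; returns the batch of messages to write out.
>   std::deque<std::string> dequeue_batch() {
>     std::unique_lock<std::mutex> lock(_mutex);
>     // Wake on saturation, or after the timeout with whatever has arrived.
>     _cv.wait_for(lock, kTimeout,
>                  [this] { return fraction_used() >= kFlushRatio; });
>     std::deque<std::string> batch;
>     batch.swap(_buffer);
>     return batch;
>   }
> };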
>
> # Data
>
> All invocations of the JVM were run with `-Xlog:async -Xlog:all=trace:file=logs.txt`
>
> As a measurement of buffer usage I logged the current number of messages and the size of the buffer on each call to `AsyncLogWriter::write()`. Below I present how often `AsyncLogWriter::write()` was called with a certain number of messages in the buffer. For example, `1 : 11` reads as "`AsyncLogWriter::write()` was called with 1 message in the buffer 11 times".
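>
> (For illustration, the histograms below could be collected with instrumentation along these lines; this is a hypothetical sketch, not code from the patch:)
>
> #include <cstdio>
> #include <map>
>
> // Hypothetical instrumentation: count, for each observed buffer size,
> // how many AsyncLogWriter::write() calls saw that many messages.
> static std::map<size_t, size_t> g_histogram;
>
> static void record_write(size_t messages_in_buffer) {
>   ++g_histogram[messages_in_buffer];   // called at the top of write()
> }
>
> static void dump_histogram() {
>   std::printf("message_count : occurrence_count\n");
>   for (const auto& [count, occurrences] : g_histogram) {
>     std::printf("%zu : %zu\n", count, occurrences);
>   }
> }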
>
> Before:
>
>
> message_count : occurrence_count
> 1 : 11
> 2 : 6
> 3 : 294
> 4 : 8
> 5 : 10
> 6 : 47
> 7 : 4
> 8 : 1
> 9 : 28
> 10 : 2
> 11 : 2
> 12 : 14
> 15 : 5
> 18 : 3
> 19 : 1
> 20 : 1
> 21 : 1
> 22 : 1
> 26 : 2
> 27 : 3
> 30 : 2
> 33 : 1
> 36 : 1
> 37 : 1
> 50 : 1
> 62 : 1
> 63 : 1
> 75 : 1
> 207 : 1
> 231 : 1
> 1395 : 1
> 2724 : 1
> 2789 : 1
> 3212 : 1
> 3278 : 1
> 3503 : 1
> 4069 : 1
> 4231 : 1
> 4305 : 1
> 4712 : 1
> 5640 : 1
> 5864 : 1
> 5912 : 1
> 5943 : 1
> 6096 : 1
> 6144 : 1
> 6204 : 1
> 6220 : 1
> 6222 : 1
> 6226 : 1
> 6268 : 1
> 6278 : 1
> 6302 : 1
> 6309 : 1
> 6339 : 1
> 6343 : 1
> 6369 : 1
> 6379 : 1
> 6439 : 1
> 6472 : 1
> 6516 : 1
> 6602 : 1
> 6714 : 1
> 6834 : 1
> 7078 : 1
> 7475 : 1
>
> Call count sum: 491
>
>
> After:
>
> 2002 : 1
> 2572 : 1
> 2821 : 1
> 3279 : 1
> 5003 : 1
> 5226 : 1
> 5749 : 1
> 5771 : 1
> 5803 : 1
> 5979 : 1
> 6011 : 1
> 6103 : 1
> 6140 : 1
> 6201 : 1
> 6252 : 1
> 6265 : 1
> 6266 : 1
> 6271 : 1
> 6288 : 1
> 6336 : 1
> 6360 : 1
> 6373 : 1
> 6451 : 1
> 6473 : 1
> 6515 : 1
> 6582 : 1
> 6600 : 1
> 6774 : 1
> 7228 : 1
>
> Call count sum: 29
>
>
>
> I ran the following command multiple times to gather CPU counter information: `sudo perf stat -ddd ./build/linux-x64/jdk/bin/java -Xlog:async -Xlog:all=trace:file=logz.txt`. A before and an after sample are provided below; see context-switches in particular.
>
> Before:
>
>
> 5 568,53 msec task-clock # 2,852 CPUs utilized
> 1 481 context-switches # 265,959 /sec
> 139 cpu-migrations # 24,962 /sec
> 17 148 page-faults # 3,079 K/sec
> 7 492 468 406 cycles # 1,346 GHz (39,34%)
> 11 820 424 570 instructions # 1,58 insn per cycle (46,99%)
> 2 497 513 512 branches # 448,505 M/sec (54,60%)
> 23 018 472 branch-misses # 0,92% of all branches (62,21%)
> 26 465 666 080 slots # 4,753 G/sec (69,82%)
> 9 638 436 928 topdown-retiring # 33,4% retiring (69,82%)
> 10 041 432 678 topdown-bad-spec # 34,8% bad speculation (69,82%)
> 8 310 642 872 topdown-fe-bound # 28,8% frontend bound (69,82%)
> 879 846 428 topdown-be-bound # 3,0% backend bound (69,82%)
> 3 339 457 604 L1-dcache-loads # 599,702 M/sec (69,38%)
> 92 744 182 L1-dcache-load-misses # 2,78% of all L1-dcache accesses (69,35%)
> 5 575 831 LLC-loads # 1,001 M/sec (69,32%)
> 1 173 449 LLC-load-misses # 21,05% of all LL-cache accesses (69,30%)
> <not supported> L1-icache-loads
> 339 072 313 L1-icache-load-misses (30,94%)
> 3 319 520 527 dTLB-loads # 596,121 M/sec (31,19%)
> 637 308 dTLB-load-misses # 0,02% of all dTLB cache accesses (31,38%)
> <not supported> iTLB-loads
> 3 026 652 iTLB-load-misses (31,65%)
> <not supported> L1-dcache-prefetches
> <not supported> L1-dcache-prefetch-misses
>
> 1,952442620 seconds time elapsed
>
> 4,277701000 seconds user
> 1,303833000 seconds sys
>
>
> After:
>
>
> 5 344,15 msec task-clock # 2,867 CPUs utilized
> 994 context-switches # 185,998 /sec
> 101 cpu-migrations # 18,899 /sec
> 16 902 page-faults # 3,163 K/sec
> 7 220 944 481 cycles # 1,351 GHz (39,36%)
> 11 277 945 316 instructions # 1,56 insn per cycle (47,12%)
> 2 371 729 681 branches # 443,799 M/sec (54,73%)
> 22 683 816 branch-misses # 0,96% of all branches (62,18%)
> 24 913 623 884 slots # 4,662 G/sec (69,76%)
> 9 354 630 464 topdown-retiring # 34,6% retiring (69,76%)
> 9 339 435 252 topdown-bad-spec # 34,5% bad speculation (69,76%)
> 7 563 625 485 topdown-fe-bound # 27,9% frontend bound (69,76%)
> 815 220 376 topdown-be-bound # 3,0% backend bound (69,76%)
> 3 209 639 698 L1-dcache-loads # 600,589 M/sec (69,62%)
> 89 008 704 L1-dcache-load-misses # 2,77% of all L1-dcache accesses (69,55%)
> 5 369 924 LLC-loads # 1,005 M/sec (69,16%)
> 1 093 971 LLC-load-misses # 20,37% of all LL-cache accesses (69,32%)
> <not supported> L1-icache-loads
> 323 572 438 L1-icache-load-misses (30,75%)
> 3 096 980 739 dTLB-loads # 579,509 M/sec (31,33%)
> 616 949 dTLB-load-misses # 0,02% of all dTLB cache accesses (31,46%)
> <not supported> iTLB-loads
> 2 741 468 iTLB-load-misses (31,51%)
> <not supported> L1-dcache-prefetches
> <not supported> L1-dcache-prefetch-misses
>
> 1,864260247 seconds time elapsed
>
> 4,129709000 seconds user
> 1,225694000 seconds sys

> Does this increase context switches? In particular, HotSpot may not enable unified logging at all. Does the flush thread then wake up every second for nothing?
Yes, it increases context switching. However, if HotSpot isn't run with `-Xlog:async`, then this code won't run at all.

> I understand that we somehow need to flush the queue if the log stream has ended and `fraction_used()` is still lower than 0.3. I think we can remember a historic `fraction_used()`. As long as this value increases, it suggests that logs are still trickling in; if so, we don't need to worry about it.
The reasoning for the timeout was twofold:
1. As you said: if the log stream has ended and `fraction_used()` is still below 0.3, we should still emit the pending messages in a timely manner.
2. Flushing logs in a timely manner makes asynchronous logging suitable for interactive use.
Let's say that a historic fraction is used; where would the comparison be placed? The only place I can think of is the `enqueue` method, but there we already know that messages are being supplied. We could compute a rate of change and flush each time that rate is too low. Both of these solutions depend on an infinite stream of log messages (however far apart they are) if the timeout is not used. Perhaps this should be combined with a much longer timeout, such as 30 seconds? I think spuriously waking up every 30 seconds is acceptable. A sketch of what such a check could look like follows.
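
For the sake of discussion, a minimal self-contained sketch of the rate-of-change idea (the names and structure are hypothetical, not from the patch):

class TrickleDetector {
  double _last_fraction = 0.0;
public:
  // Decide whether the flush thread should write out the buffer now.
  // current_fraction is the buffer's fraction_used() at this wakeup;
  // long_timeout_expired models the proposed 30 second fallback.
  bool should_flush(double current_fraction, bool long_timeout_expired) {
    const bool still_growing = current_fraction > _last_fraction;
    _last_fraction = current_fraction;
    // If usage stopped growing, logs are no longer trickling in; flush.
    return long_timeout_expired || !still_growing;
  }
};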

> Using this algorithm, I think we can use an even bigger threshold. The reasonable threshold depends on the speed of I/O, which in reality is not known until runtime. I think we can make it adaptive in the future: if we detect overflow, we lower the threshold down to 0.0.
It's reasonable for this RFE to lower the threshold directly down to 0.0 if overflow is detected; I agree that implementing something more dynamic is a good idea for a future RFE.
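
For illustration, lowering the threshold directly on overflow could look like this minimal sketch (hypothetical names, not the patch itself):

class FlushPolicy {
  double _flush_ratio = 0.30;   // the 30% threshold from this change
public:
  double flush_ratio() const { return _flush_ratio; }
  // Called when a message is dropped because the buffer was full:
  // from then on, flush on every enqueue.
  void on_overflow() { _flush_ratio = 0.0; }
};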
-------------
PR: https://git.openjdk.org/jdk/pull/11804