RFR: JDK-8299415: Saturate the async log buffer more often

Johan Sjölen jsjolen at openjdk.org
Thu Jan 5 11:19:49 UTC 2023


On Thu, 29 Dec 2022 19:23:07 GMT, Johan Sjölen <jsjolen at openjdk.org> wrote:

> Hi,
> 
> This change introduces a heuristic to ensure that the asynchronous logging buffer is saturated more often. Before this change the logger often wrote single messages to the output; now it outputs significantly more data at a time. I also believe that fewer calls to `notify()` reduce the performance impact of using asynchronous logging. Running `perf stat` consistently shows a smaller number of context switches with this change, which supports this hypothesis.
> 
> The asynchronous logging tests were run and passed with:
> 
> 
> _JAVA_OPTIONS='-Xlog:async' make CONF=linux-x64 test TEST='gtest:*Async*'
> 
> 
> The choice of a 30% ratio and a 1 second timeout isn't based on data. It seemed reasonable that waiting 1 second for your *sole* log message is acceptable, and that flushing at 30% is enough to avoid any message drops.
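> 
> For illustration, a minimal sketch of the heuristic in HotSpot-style C++ (hypothetical names; the actual patch changes `AsyncLogWriter` itself):
> 
> 
> // Sketch only, not the actual patch.
> // Producer side: enqueue a message and only wake the writer thread when
> // the buffer is at least 30% full, instead of notifying on every message.
> void AsyncLogBuffer::enqueue(const AsyncLogMessage& msg) {
>   MutexLocker ml(&_lock);
>   _messages.push_back(msg);
>   if (_messages.size() * 10 >= _capacity * 3) { // at least 30% full
>     _data_available.notify();
>   }
> }
> 
> // Writer side: wait at most 1 second, so a sole message is still written
> // within a bounded delay even if the 30% threshold is never reached.
> void AsyncLogWriter::run() {
>   while (true) {
>     {
>       MutexLocker ml(&_lock);
>       _data_available.wait(1000 /* ms */);
>     }
>     write(); // drain whatever has accumulated
>   }
> }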
> 
> # Data
> 
> All invocations of the JVM were run with `-Xlog:async -Xlog:all=trace:file=logs.txt`
> 
> As a measurement of buffer usage I logged the current number of messages and the size of the buffer each time `AsyncLogWriter::write()` was called. Below I present how often `AsyncLogWriter::write()` was called with a given number of messages in the buffer. For example, `1 : 11` reads as "`AsyncLogWriter::write()` was called with 1 message in the buffer 11 times".
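> 
> The instrumentation itself was just a counter dump; roughly (hypothetical names, added locally for these measurements):
> 
> 
> // At the top of AsyncLogWriter::write(), before draining the buffer:
> fprintf(stderr, "write(): %zu messages, %zu bytes\n",
>         _buffer.size(), _buffer_size_in_bytes);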
> 
> Before:
> 
> 
> message_count : occurrence_count
> 1 : 11
> 2 : 6
> 3 : 294
> 4 : 8
> 5 : 10
> 6 : 47
> 7 : 4
> 8 : 1
> 9 : 28
> 10 : 2
> 11 : 2
> 12 : 14
> 15 : 5
> 18 : 3
> 19 : 1
> 20 : 1
> 21 : 1
> 22 : 1
> 26 : 2
> 27 : 3
> 30 : 2
> 33 : 1
> 36 : 1
> 37 : 1
> 50 : 1
> 62 : 1
> 63 : 1
> 75 : 1
> 207 : 1
> 231 : 1
> 1395 : 1
> 2724 : 1
> 2789 : 1
> 3212 : 1
> 3278 : 1
> 3503 : 1
> 4069 : 1
> 4231 : 1
> 4305 : 1
> 4712 : 1
> 5640 : 1
> 5864 : 1
> 5912 : 1
> 5943 : 1
> 6096 : 1
> 6144 : 1
> 6204 : 1
> 6220 : 1
> 6222 : 1
> 6226 : 1
> 6268 : 1
> 6278 : 1
> 6302 : 1
> 6309 : 1
> 6339 : 1
> 6343 : 1
> 6369 : 1
> 6379 : 1
> 6439 : 1
> 6472 : 1
> 6516 : 1
> 6602 : 1
> 6714 : 1
> 6834 : 1
> 7078 : 1
> 7475 : 1
> 
> Call count sum: 491
> 
> 
> After:
> 
> 
> message_count : occurrence_count
> 2002 : 1
> 2572 : 1
> 2821 : 1
> 3279 : 1
> 5003 : 1
> 5226 : 1
> 5749 : 1
> 5771 : 1
> 5803 : 1
> 5979 : 1
> 6011 : 1
> 6103 : 1
> 6140 : 1
> 6201 : 1
> 6252 : 1
> 6265 : 1
> 6266 : 1
> 6271 : 1
> 6288 : 1
> 6336 : 1
> 6360 : 1
> 6373 : 1
> 6451 : 1
> 6473 : 1
> 6515 : 1
> 6582 : 1
> 6600 : 1
> 6774 : 1
> 7228 : 1
> 
> Call count sum: 29
> 
> 
> 
> I ran the following command multiple times to gather CPU counter information: `sudo perf stat -ddd ./build/linux-x64/jdk/bin/java -Xlog:async -Xlog:all=trace:file=logz.txt`. A before and after sample is provided below; see context-switches in particular.
> 
> Before:
> 
> 
>                  5 568,53 msec task-clock                #    2,852 CPUs utilized
>              1 481      context-switches          #  265,959 /sec
>                139      cpu-migrations            #   24,962 /sec
>             17 148      page-faults               #    3,079 K/sec
>      7 492 468 406      cycles                    #    1,346 GHz                      (39,34%)
>     11 820 424 570      instructions              #    1,58  insn per cycle           (46,99%)
>      2 497 513 512      branches                  #  448,505 M/sec                    (54,60%)
>         23 018 472      branch-misses             #    0,92% of all branches          (62,21%)
>     26 465 666 080      slots                     #    4,753 G/sec                    (69,82%)
>      9 638 436 928      topdown-retiring          #     33,4% retiring                (69,82%)
>     10 041 432 678      topdown-bad-spec          #     34,8% bad speculation         (69,82%)
>      8 310 642 872      topdown-fe-bound          #     28,8% frontend bound          (69,82%)
>        879 846 428      topdown-be-bound          #      3,0% backend bound           (69,82%)
>      3 339 457 604      L1-dcache-loads           #  599,702 M/sec                    (69,38%)
>         92 744 182      L1-dcache-load-misses     #    2,78% of all L1-dcache accesses  (69,35%)
>          5 575 831      LLC-loads                 #    1,001 M/sec                    (69,32%)
>          1 173 449      LLC-load-misses           #   21,05% of all LL-cache accesses  (69,30%)
>    <not supported>      L1-icache-loads
>        339 072 313      L1-icache-load-misses                                         (30,94%)
>      3 319 520 527      dTLB-loads                #  596,121 M/sec                    (31,19%)
>            637 308      dTLB-load-misses          #    0,02% of all dTLB cache accesses  (31,38%)
>    <not supported>      iTLB-loads
>          3 026 652      iTLB-load-misses                                              (31,65%)
>    <not supported>      L1-dcache-prefetches
>    <not supported>      L1-dcache-prefetch-misses
> 
>        1,952442620 seconds time elapsed
> 
>        4,277701000 seconds user
>        1,303833000 seconds sys
> 
> 
> After:
> 
> 
>           5 344,15 msec task-clock                #    2,867 CPUs utilized
>                994      context-switches          #  185,998 /sec
>                101      cpu-migrations            #   18,899 /sec
>             16 902      page-faults               #    3,163 K/sec
>      7 220 944 481      cycles                    #    1,351 GHz                      (39,36%)
>     11 277 945 316      instructions              #    1,56  insn per cycle           (47,12%)
>      2 371 729 681      branches                  #  443,799 M/sec                    (54,73%)
>         22 683 816      branch-misses             #    0,96% of all branches          (62,18%)
>     24 913 623 884      slots                     #    4,662 G/sec                    (69,76%)
>      9 354 630 464      topdown-retiring          #     34,6% retiring                (69,76%)
>      9 339 435 252      topdown-bad-spec          #     34,5% bad speculation         (69,76%)
>      7 563 625 485      topdown-fe-bound          #     27,9% frontend bound          (69,76%)
>        815 220 376      topdown-be-bound          #      3,0% backend bound           (69,76%)
>      3 209 639 698      L1-dcache-loads           #  600,589 M/sec                    (69,62%)
>         89 008 704      L1-dcache-load-misses     #    2,77% of all L1-dcache accesses  (69,55%)
>          5 369 924      LLC-loads                 #    1,005 M/sec                    (69,16%)
>          1 093 971      LLC-load-misses           #   20,37% of all LL-cache accesses  (69,32%)
>    <not supported>      L1-icache-loads
>        323 572 438      L1-icache-load-misses                                         (30,75%)
>      3 096 980 739      dTLB-loads                #  579,509 M/sec                    (31,33%)
>            616 949      dTLB-load-misses          #    0,02% of all dTLB cache accesses  (31,46%)
>    <not supported>      iTLB-loads
>          2 741 468      iTLB-load-misses                                              (31,51%)
>    <not supported>      L1-dcache-prefetches
>    <not supported>      L1-dcache-prefetch-misses
> 
>        1,864260247 seconds time elapsed
> 
>        4,129709000 seconds user
>        1,225694000 seconds sys

Hi David, I was surprised to hear that your intuition regarding this went against my own, so I redid the measurements and realised I had made a big mistake: I ran the tests without optimisations (a slowdebug build). I then re-ran the measurements with an optimised build (without `perf stat`, as it changes the performance drastically).

The results of those measurements showed that `all=trace` does cause messages to be dropped with my patch. `all=debug` shows behaviour similar to my initial testing: my patch fully saturates the buffer while the original code doesn't. I suspect that there is huge lock contention on the `AsyncLogLocker` in the `all=trace` case, which makes it very difficult for the writer to acquire the lock in time; the buffer then fills up and messages are dropped.

With that in mind, maybe what should be looked into isn't this patch, but rather replacing the buffer with, for example, a lock-free circular one (there may be better-fitting structures), if we want better performance.
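
As a rough sketch of what such a replacement could look like, here is a single-producer/single-consumer lock-free ring (assumed shape only; the real logging case is multi-producer and would need, e.g., a CAS loop on the tail):


#include <atomic>
#include <cstddef>

// Bounded SPSC ring buffer: the producer only touches _tail, the
// consumer only touches _head, so neither ever blocks the other.
template <typename T, size_t N>
class SpscRing {
  T _slots[N];
  std::atomic<size_t> _head{0}; // next slot to read (consumer)
  std::atomic<size_t> _tail{0}; // next slot to write (producer)

public:
  bool try_push(const T& v) {
    size_t tail = _tail.load(std::memory_order_relaxed);
    size_t next = (tail + 1) % N;
    if (next == _head.load(std::memory_order_acquire)) {
      return false; // full: the caller decides whether to drop or retry
    }
    _slots[tail] = v;
    _tail.store(next, std::memory_order_release); // publish the slot
    return true;
  }

  bool try_pop(T& out) {
    size_t head = _head.load(std::memory_order_relaxed);
    if (head == _tail.load(std::memory_order_acquire)) {
      return false; // empty
    }
    out = _slots[head];
    _head.store((head + 1) % N, std::memory_order_release);
    return true;
  }
};

With something along these lines, producers never hold a lock that the writer thread needs, so the `AsyncLogLocker` contention described above disappears by construction.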

@navyxliu, maybe we should avoid merging this patch, as it adds complexity while other measures could give much bigger gains?

-------------

PR: https://git.openjdk.org/jdk/pull/11804

