[External] : Re: Missing gc time in zgc generational logging

Erik Osterlund erik.osterlund at oracle.com
Tue Jan 9 06:42:37 UTC 2024


Hi Mike,

The “Other” times are something different and exclude time to safepoint, etc. All GCs log the time it takes to run the GC operation in the safepoint as part of GC logging.

G1 indeed also stops GC threads when safepointing, leading to confusing logs.

Unfortunately, even if the log included time *to* safepoint in order to better capture the latency impact, which I think is what you are suggesting, there is also the time *from* safepoint (until the threads that were suspended start running again), which no logging currently captures.

I agree that there is room for improvement in the logging story here, for all GCs. Not entirely sure what we want the log to look like though.

/Erik

On 8 Jan 2024, at 17:43, Mike Millson <mmillson at redhat.com> wrote:


Hi Erik,

I think what you are referring to is what other collectors include in "Other" time, is that right?

For example, in the G1 logging below, the "Other: 0.1ms" is time spent outside of garbage collection, and the Safepoint time is equal to the GC time (4.858ms) plus this "Other" time (0.1ms).

In my experience, this "Other" time with other collectors is zero/negligible except when there is some performance issue (e.g. an I/O delay preventing a timely GC log write from one line to the next).

It seems like with ZGC, it's normal (not an indicator of a performance issue) to have this "Other" time, is that correct?

From my perspective, it's nice to be able to relate the GC times to the safepoint times, so I would be in favor of having ZGC logging that allows that. As it stands now, when looking at max pause and throughput, I think I can only trust the Safepoint logging because there is missing time in the GC logging.

~~~
[0.109s][info][safepoint    ] Entering safepoint region: G1CollectForAllocation
[0.109s][info][gc,start     ] GC(1) Pause Young (Concurrent Start) (G1 Evacuation Pause)
[0.109s][info][gc,task      ] GC(1) Using 2 workers of 4 for evacuation
[0.114s][info][gc,phases    ] GC(1)   Pre Evacuate Collection Set: 0.0ms
[0.114s][info][gc,phases    ] GC(1)   Evacuate Collection Set: 4.6ms
[0.114s][info][gc,phases    ] GC(1)   Post Evacuate Collection Set: 0.1ms
[0.114s][info][gc,phases    ] GC(1)   Other: 0.1ms
[0.114s][info][gc,heap      ] GC(1) Eden regions: 1->0(1)
[0.114s][info][gc,heap      ] GC(1) Survivor regions: 1->1(1)
[0.114s][info][gc,heap      ] GC(1) Old regions: 2->2
[0.114s][info][gc,heap      ] GC(1) Humongous regions: 0->0
[0.114s][info][gc,metaspace ] GC(1) Metaspace: 879K->879K(1056768K)
[0.114s][info][gc           ] GC(1) Pause Young (Concurrent Start) (G1 Evacuation Pause) 2M->1M(5M) 4.858ms
[0.114s][info][gc,cpu       ] GC(1) User=0.01s Sys=0.00s Real=0.01s
[0.114s][info][safepoint    ] Leaving safepoint region
[0.114s][info][safepoint    ] Total time for which application threads were stopped: 0.0049236 seconds, Stopping threads took: 0.0000050 seconds
~~~
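As a quick cross-check of the numbers in the log above, the phase times roughly sum to the reported pause, and the pause roughly accounts for the safepoint stop time (small deltas are expected from rounding in the log). A minimal sketch, using only values copied from the log:

```python
# Values taken from the G1 log excerpt above.
phases_ms = {
    "Pre Evacuate Collection Set": 0.0,
    "Evacuate Collection Set": 4.6,
    "Post Evacuate Collection Set": 0.1,
    "Other": 0.1,
}
pause_ms = 4.858                 # "Pause Young" summary line
stopped_ms = 0.0049236 * 1000    # safepoint "threads were stopped" time

# The phases account for the pause, and the pause accounts for the
# safepoint, to within logging rounding error.
assert abs(sum(phases_ms.values()) - pause_ms) < 0.2
assert abs(pause_ms - stopped_ms) < 0.2
```

This is the relationship being asked about: for G1, GC time plus "Other" closes the gap to the safepoint time, which is not the case in the ZGC log below.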

Thank you,
Mike

On Mon, Jan 8, 2024 at 10:25 AM Erik Osterlund <erik.osterlund at oracle.com<mailto:erik.osterlund at oracle.com>> wrote:
Hi Mike,

All GC logs for all GCs show the time spent by the GC operation in a safepoint, while the safepoint logging includes details about time to safepoint etc. While this is indeed the same for all GCs, it might be worth noting that the time to safepoint includes the time it takes to shut down other GC threads as well as mutator threads.

In particular, if there is a concurrent old generation collection going on when synchronizing a young mark start pause, we will first suspend those old generation GC threads, before we start to shut down the mutator threads. In a way, that can make the time to safepoint numbers look a bit inflated, as we are not really interrupting the mutator threads until after the GC threads have been synchronized.

This anomaly never showed up with single generation ZGC, as it didn’t have any “other” generation workers to shut down when scheduling safepoints, and this part was a no-op.

Maybe it would be nice to distinguish how much time is spent synchronizing the Java threads specifically, excluding the time spent waiting for GC threads.

/Erik

> On 8 Jan 2024, at 15:12, Mike Millson <mmillson at redhat.com<mailto:mmillson at redhat.com>> wrote:
>
> There seems to be a lot of missing gc time in the jdk21 zgc generational logging.
>
> Following is an example (complete log attached).
>
> The gc time below is 0.234ms, and the safepoint is 2.7ms ((1900085 + 806967 + 4204) / 10^6).
>
> It's over an order of magnitude off.
>
> [2023-12-12T10:20:45.260+0200][info][gc,phases   ] GC(34) y: Pause Mark Start 0.234ms
> [2023-12-12T10:20:45.260+0200][info][safepoint   ] Safepoint "ZMarkStartYoung", Time since last: 9642266719 ns, Reaching safepoint: 1900085 ns, Cleanup: 4204 ns, At safepoint: 806967 ns, Total: 2711256 ns
>
> Thank you,
> Mike
> <gc.log>
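The arithmetic in the quoted message checks out: the component times in the safepoint line sum exactly to the reported total. A minimal sketch, parsing the log line copied verbatim from above:

```python
import re

# Safepoint log line copied from the quoted message above.
line = ('[2023-12-12T10:20:45.260+0200][info][safepoint   ] '
        'Safepoint "ZMarkStartYoung", Time since last: 9642266719 ns, '
        'Reaching safepoint: 1900085 ns, Cleanup: 4204 ns, '
        'At safepoint: 806967 ns, Total: 2711256 ns')

# Extract the named components, deliberately skipping "Time since last".
fields = dict(re.findall(
    r'(Reaching safepoint|Cleanup|At safepoint|Total): (\d+) ns', line))
reaching = int(fields['Reaching safepoint'])
cleanup = int(fields['Cleanup'])
at_safepoint = int(fields['At safepoint'])
total = int(fields['Total'])

# The three components sum exactly to the reported total of 2711256 ns,
# i.e. about 2.71 ms, versus the 0.234 ms shown by the gc,phases line.
assert reaching + cleanup + at_safepoint == total
```

The gap being reported is thus between the 0.234 ms "Pause Mark Start" in the gc,phases logging and the ~2.71 ms the safepoint logging attributes to the same pause.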

