From duke at openjdk.org Tue Jan 2 03:01:01 2024
From: duke at openjdk.org (duke)
Date: Tue, 2 Jan 2024 03:01:01 GMT
Subject: Withdrawn: 8306656: Generational ZGC: Fix age table logging
In-Reply-To:
References:
Message-ID:

On Fri, 21 Apr 2023 08:36:32 GMT, Erik Duveblad wrote:

> Hey all,
>
> please review this patch that updates the age table logging. The logging now looks like:
>
>     [1.946s][info][gc,reloc ] GC(22) y: Age Table:
>     [1.946s][info][gc,reloc ] GC(22) y:               Live        Small       Medium  Large
>     [1.946s][info][gc,reloc ] GC(22) y:   Eden        0M -> 0M    300 -> 300  0 -> 0  0 -> 0
>     [1.946s][info][gc,reloc ] GC(22) y:   Survivor 1  0M -> 0M    2 -> 2      0 -> 0  0 -> 0
>     [1.946s][info][gc,reloc ] GC(22) y:   Survivor 2  0M -> 0M    1 -> 0      0 -> 0  0 -> 0
>     [1.946s][info][gc,reloc ] GC(22) y:   Survivor 3  0M -> 0M    1 -> 1      0 -> 0  0 -> 0
>     [1.946s][info][gc,reloc ] GC(22) y:   Survivor 4  0M -> 0M    0 -> 1      0 -> 0  0 -> 0
>
> The first column is the size of the live objects with the given age and how that has changed since the previous collection. The remaining columns are the number of pages of the given age and how those have changed from the previous collection to the current one (so the format is `before -> after`). This makes it easy to spot a change in behavior in objects' lifetimes from one collection to another: if objects have similar lifetimes compared to the previous collection, then the distribution of pages should be similar between the current collection and the previous one.
>
> _Note_: I would like to make the table adapt to live sizes smaller than one megabyte, but that should probably be done for the logging as a whole, not just for the age table.
>
> I also cleaned up some minor stuff as I went along.
>
> Testing:
> - [x] Tier 1-3 (macOS-aarch64, Windows-x64, Linux-x64, Linux-aarch64)
> - [x] Local testing on macOS-aarch64
>
> Thanks,
> Erik

This pull request has been closed without being integrated.
-------------
PR: https://git.openjdk.org/zgc/pull/19

From mmillson at redhat.com Mon Jan 8 14:12:19 2024
From: mmillson at redhat.com (Mike Millson)
Date: Mon, 8 Jan 2024 09:12:19 -0500
Subject: Missing gc time in zgc generational logging
Message-ID:

There seems to be a lot of missing gc time in the jdk21 zgc generational logging. Following is an example (complete log attached).

The gc time below is 0.234ms, and the safepoint is 2.7ms ((1900085 + 806967 + 4204) / 10^6). It's over an order of magnitude off.

[2023-12-12T10:20:45.260+0200][info][gc,phases ] GC(34) y: Pause Mark Start 0.234ms
[2023-12-12T10:20:45.260+0200][info][safepoint ] Safepoint "ZMarkStartYoung", Time since last: 9642266719 ns, Reaching safepoint: 1900085 ns, Cleanup: 4204 ns, At safepoint: 806967 ns, Total: 2711256 ns

Thank you,
Mike

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gc.log
Type: text/x-log
Size: 1750801 bytes
Desc: not available

From erik.osterlund at oracle.com Mon Jan 8 15:24:43 2024
From: erik.osterlund at oracle.com (Erik Osterlund)
Date: Mon, 8 Jan 2024 15:24:43 +0000
Subject: Missing gc time in zgc generational logging
In-Reply-To:
References:
Message-ID:

Hi Mike,

All GC logs for all GCs show the time spent by the GC operation in a safepoint, while the safepoint logging includes details about time to safepoint etc. While this is indeed the same for all GCs, it might be worth noting that the time to safepoint includes the time it takes to shut down other GC threads as well as mutator threads. In particular, if there is a concurrent old generation collection going on when synchronizing a young mark start pause, we will first suspend those old generation GC threads before we start to shut down the mutator threads.
In a way, that can make the time to safepoint numbers look a bit inflated, as we are not really interrupting the mutator threads until after the GC threads have been synchronized. This anomaly never showed up with single generation ZGC, as it didn't have any "other" generation workers to shut down when scheduling safepoints, and this part was a no-op.

Maybe it would be nice to distinguish how long is spent on synchronizing the Java threads specifically, excluding the time waiting for GC threads.

/Erik

> On 8 Jan 2024, at 15:12, Mike Millson wrote:
>
> There seems to be a lot of missing gc time in the jdk21 zgc generational logging.
>
> Following is an example (complete log attached).
>
> The gc time below is 0.234ms, and the safepoint is 2.7ms ((1900085 + 806967 + 4204) / 10^6).
>
> It's over an order of magnitude off.
>
> [2023-12-12T10:20:45.260+0200][info][gc,phases ] GC(34) y: Pause Mark Start 0.234ms
> [2023-12-12T10:20:45.260+0200][info][safepoint ] Safepoint "ZMarkStartYoung", Time since last: 9642266719 ns, Reaching safepoint: 1900085 ns, Cleanup: 4204 ns, At safepoint: 806967 ns, Total: 2711256 ns
>
> Thank you,
> Mike

From mmillson at redhat.com Mon Jan 8 16:42:19 2024
From: mmillson at redhat.com (Mike Millson)
Date: Mon, 8 Jan 2024 11:42:19 -0500
Subject: Missing gc time in zgc generational logging
In-Reply-To:
References:
Message-ID:

Hi Erik,

I think what you are referring to is what other collectors include in "Other" time, is that right? For example, in the G1 logging below, the "Other: 0.1ms" is time spent outside of garbage collection, and the Safepoint time is equal to the GC time (4.858ms) plus this "Other" time (0.1ms). In my experience, this "Other" time with other collectors is zero/negligible except when there is some performance issue (e.g. an I/O delay preventing a timely GC log write from one line to the next). It seems like with ZGC, it's normal (not an indicator of a performance issue) to have this "Other" time, is that correct?
From my perspective, it's nice to be able to relate the GC times to the safepoint times, so I would be in favor of having ZGC logging that allows that. As it stands now, when looking at max pause and throughput, I think I can only trust the Safepoint logging, because there is missing time in the GC logging.

~~~
[0.109s][info][safepoint    ] Entering safepoint region: G1CollectForAllocation
[0.109s][info][gc,start     ] GC(1) Pause Young (Concurrent Start) (G1 Evacuation Pause)
[0.109s][info][gc,task      ] GC(1) Using 2 workers of 4 for evacuation
[0.114s][info][gc,phases    ] GC(1)   Pre Evacuate Collection Set: 0.0ms
[0.114s][info][gc,phases    ] GC(1)   Evacuate Collection Set: 4.6ms
[0.114s][info][gc,phases    ] GC(1)   Post Evacuate Collection Set: 0.1ms
[0.114s][info][gc,phases    ] GC(1)   Other: 0.1ms
[0.114s][info][gc,heap      ] GC(1) Eden regions: 1->0(1)
[0.114s][info][gc,heap      ] GC(1) Survivor regions: 1->1(1)
[0.114s][info][gc,heap      ] GC(1) Old regions: 2->2
[0.114s][info][gc,heap      ] GC(1) Humongous regions: 0->0
[0.114s][info][gc,metaspace ] GC(1) Metaspace: 879K->879K(1056768K)
[0.114s][info][gc           ] GC(1) Pause Young (Concurrent Start) (G1 Evacuation Pause) 2M->1M(5M) 4.858ms
[0.114s][info][gc,cpu       ] GC(1) User=0.01s Sys=0.00s Real=0.01s
[0.114s][info][safepoint    ] Leaving safepoint region
[0.114s][info][safepoint    ] Total time for which application threads were stopped: 0.0049236 seconds, Stopping threads took: 0.0000050 seconds
~~~

Thank you,
Mike

On Mon, Jan 8, 2024 at 10:25 AM Erik Osterlund wrote:

> Hi Mike,
>
> All GC logs for all GCs show the time spent by the GC operation in a safepoint, while the safepoint logging includes details about time to safepoint etc. While this is indeed the same for all GCs, it might be worth noting that the time to safepoint includes the time it takes to shut down other GC threads as well as mutator threads.
> In particular, if there is a concurrent old generation collection going on when synchronizing a young mark start pause, we will first suspend those old generation GC threads before we start to shut down the mutator threads. In a way, that can make the time to safepoint numbers look a bit inflated, as we are not really interrupting the mutator threads until after the GC threads have been synchronized. This anomaly never showed up with single generation ZGC, as it didn't have any "other" generation workers to shut down when scheduling safepoints, and this part was a no-op.
>
> Maybe it would be nice to distinguish how long is spent on synchronizing the Java threads specifically, excluding the time waiting for GC threads.
>
> /Erik
>
> > On 8 Jan 2024, at 15:12, Mike Millson wrote:
> >
> > There seems to be a lot of missing gc time in the jdk21 zgc generational logging.
> >
> > Following is an example (complete log attached).
> >
> > The gc time below is 0.234ms, and the safepoint is 2.7ms ((1900085 + 806967 + 4204) / 10^6).
> >
> > It's over an order of magnitude off.
> >
> > [2023-12-12T10:20:45.260+0200][info][gc,phases ] GC(34) y: Pause Mark Start 0.234ms
> > [2023-12-12T10:20:45.260+0200][info][safepoint ] Safepoint "ZMarkStartYoung", Time since last: 9642266719 ns, Reaching safepoint: 1900085 ns, Cleanup: 4204 ns, At safepoint: 806967 ns, Total: 2711256 ns
> >
> > Thank you,
> > Mike

From erik.osterlund at oracle.com Tue Jan 9 06:42:37 2024
From: erik.osterlund at oracle.com (Erik Osterlund)
Date: Tue, 9 Jan 2024 06:42:37 +0000
Subject: [External] : Re: Missing gc time in zgc generational logging
In-Reply-To:
References:
Message-ID: <383E7D56-EDC5-4F78-BE52-C1CDD5CCD7E3@oracle.com>

Hi Mike,

The "Other" times are something different and exclude time to safepoint, etc.
All GCs log the time it takes to run the GC operation in the safepoint as part of GC logging. G1 indeed also stops GC threads when safepointing, leading to confusing logs. Unfortunately, even if the log included time *to* safepoint in order to improve the latency impact understanding, which I think you are suggesting, it makes me wonder about time *from* safepoint (until the threads that were suspended start running again), which no logging currently captures. I agree that there is room for improvement in the logging story here, for all GCs. Not entirely sure what we want the log to look like, though.

/Erik

On 8 Jan 2024, at 17:43, Mike Millson wrote:

Hi Erik,

I think what you are referring to is what other collectors include in "Other" time, is that right? For example, in the G1 logging below, the "Other: 0.1ms" is time spent outside of garbage collection, and the Safepoint time is equal to the GC time (4.858ms) plus this "Other" time (0.1ms). In my experience, this "Other" time with other collectors is zero/negligible except when there is some performance issue (e.g. an I/O delay preventing a timely GC log write from one line to the next). It seems like with ZGC, it's normal (not an indicator of a performance issue) to have this "Other" time, is that correct?

From my perspective, it's nice to be able to relate the GC times to the safepoint times, so I would be in favor of having ZGC logging that allows that. As it stands now, when looking at max pause and throughput, I think I can only trust the Safepoint logging because there is missing time in the GC logging.
~~~
[0.109s][info][safepoint    ] Entering safepoint region: G1CollectForAllocation
[0.109s][info][gc,start     ] GC(1) Pause Young (Concurrent Start) (G1 Evacuation Pause)
[0.109s][info][gc,task      ] GC(1) Using 2 workers of 4 for evacuation
[0.114s][info][gc,phases    ] GC(1)   Pre Evacuate Collection Set: 0.0ms
[0.114s][info][gc,phases    ] GC(1)   Evacuate Collection Set: 4.6ms
[0.114s][info][gc,phases    ] GC(1)   Post Evacuate Collection Set: 0.1ms
[0.114s][info][gc,phases    ] GC(1)   Other: 0.1ms
[0.114s][info][gc,heap      ] GC(1) Eden regions: 1->0(1)
[0.114s][info][gc,heap      ] GC(1) Survivor regions: 1->1(1)
[0.114s][info][gc,heap      ] GC(1) Old regions: 2->2
[0.114s][info][gc,heap      ] GC(1) Humongous regions: 0->0
[0.114s][info][gc,metaspace ] GC(1) Metaspace: 879K->879K(1056768K)
[0.114s][info][gc           ] GC(1) Pause Young (Concurrent Start) (G1 Evacuation Pause) 2M->1M(5M) 4.858ms
[0.114s][info][gc,cpu       ] GC(1) User=0.01s Sys=0.00s Real=0.01s
[0.114s][info][safepoint    ] Leaving safepoint region
[0.114s][info][safepoint    ] Total time for which application threads were stopped: 0.0049236 seconds, Stopping threads took: 0.0000050 seconds
~~~

Thank you,
Mike

On Mon, Jan 8, 2024 at 10:25 AM Erik Osterlund wrote:

Hi Mike,

All GC logs for all GCs show the time spent by the GC operation in a safepoint, while the safepoint logging includes details about time to safepoint etc. While this is indeed the same for all GCs, it might be worth noting that the time to safepoint includes the time it takes to shut down other GC threads as well as mutator threads. In particular, if there is a concurrent old generation collection going on when synchronizing a young mark start pause, we will first suspend those old generation GC threads before we start to shut down the mutator threads. In a way, that can make the time to safepoint numbers look a bit inflated, as we are not really interrupting the mutator threads until after the GC threads have been synchronized.
This anomaly never showed up with single generation ZGC, as it didn't have any "other" generation workers to shut down when scheduling safepoints, and this part was a no-op.

Maybe it would be nice to distinguish how long is spent on synchronizing the Java threads specifically, excluding the time waiting for GC threads.

/Erik

> On 8 Jan 2024, at 15:12, Mike Millson wrote:
>
> There seems to be a lot of missing gc time in the jdk21 zgc generational logging.
>
> Following is an example (complete log attached).
>
> The gc time below is 0.234ms, and the safepoint is 2.7ms ((1900085 + 806967 + 4204) / 10^6).
>
> It's over an order of magnitude off.
>
> [2023-12-12T10:20:45.260+0200][info][gc,phases ] GC(34) y: Pause Mark Start 0.234ms
> [2023-12-12T10:20:45.260+0200][info][safepoint ] Safepoint "ZMarkStartYoung", Time since last: 9642266719 ns, Reaching safepoint: 1900085 ns, Cleanup: 4204 ns, At safepoint: 806967 ns, Total: 2711256 ns
>
> Thank you,
> Mike
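[Editor's note: the arithmetic discussed in this thread can be sketched as follows. This is a minimal, hypothetical check script (not part of the JDK or any log-parsing tool) that sums the component times from the `ZMarkStartYoung` safepoint line quoted above and converts nanoseconds to milliseconds, showing how the logged pause relates to the total safepoint time.]

```python
NS_PER_MS = 1_000_000

# Components of the "ZMarkStartYoung" safepoint line quoted in this thread.
reaching_ns = 1_900_085    # Reaching safepoint
cleanup_ns = 4_204         # Cleanup
at_safepoint_ns = 806_967  # At safepoint

# Total safepoint time in ms; matches the logged "Total: 2711256 ns".
total_ms = (reaching_ns + cleanup_ns + at_safepoint_ns) / NS_PER_MS

# The logged GC pause: "GC(34) y: Pause Mark Start 0.234ms".
pause_ms = 0.234

print(f"safepoint total: {total_ms:.6f} ms")       # 2.711256 ms
print(f"logged GC pause: {pause_ms} ms")
# The difference is time spent reaching the safepoint (which, per Erik,
# can include suspending old-generation GC workers), cleanup, and the
# part of the at-safepoint time outside the measured GC operation.
print(f"outside the logged pause: {total_ms - pause_ms:.6f} ms")
```

Run against the numbers from the log excerpt, this reproduces the roughly order-of-magnitude gap Mike describes between the 0.234 ms pause and the 2.7 ms safepoint.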