What gc.log lines reliably indicate STW pauses for ZGC?
Hi ZGC developers, I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as: [2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms that is, we check every line for the following condition (we don't use pattern matching for speed): if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like [2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms i.e. the condition should be if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line I wonder if this is all we need or I am missing something? Also, what's the meaning of lines like these: [2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise. Thank you, Misha
Hi Misha, Could you try GCViewer? https://github.com/chewiebug/GCViewer enhance support for zgc and shenandoah https://github.com/chewiebug/GCViewer/pull/261
2023年9月7日 03:15,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Hi ZGC developers,
I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as:
[2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms
that is, we check every line for the following condition (we don't use pattern matching for speed):
if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line
However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like
Please look at the enhance support for ZGC about GCViewer. Thanks, Leslie Zhai
[2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms
i.e. the condition should be
if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line
I wonder if this is all we need or I am missing something?
Also, what's the meaning of lines like these:
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise.
Thank you,
Misha
Hi Leslie, Thank you for your response. I am not sure I understand it, though. GCViewer is not a small tool, which extracts all kinds of information from the GC log. However, our goal for now is very limited: we just need to recognize and parse log lines that indicate STW GC pauses, that's it. We cannot reuse GCViewer code. On the other hand, I was hoping that I could get a more definitive/accurate answer directly from ZGC developers, rather than by reading code of a 3rd party tool. May I ask you to answer at least my second question: what's the meaning of lines like the one below? [2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms Thank you, Misha ________________________________ From: Leslie Zhai <zhaixiang@loongson.cn> Sent: Wednesday, September 6, 2023 5:35 PM To: Misha Dmitriev <mdmitriev@linkedin.com> Cc: zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha, Could you try GCViewer? https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fchewiebug%2FGCViewer&data=05%7C01%7Cmdmitriev%40linkedin.com%7Ce07cc8ab5300462417f408dbaf3a598d%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638296437390082029%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=YjtYLdtX%2FyjxNld0k7LcHZVe1YlS2IoDOedJv3JzHsQ%3D&reserved=0<https://github.com/chewiebug/GCViewer> enhance support for zgc and shenandoah https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fchewiebug%2FGCViewer%2Fpull%2F261&data=05%7C01%7Cmdmitriev%40linkedin.com%7Ce07cc8ab5300462417f408dbaf3a598d%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638296437390082029%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=YvfbyGgBl9s5fZ4aNqCkget1AXKGIJr0mYnhSC%2FAAXc%3D&reserved=0<https://github.com/chewiebug/GCViewer/pull/261>
2023年9月7日 03:15,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Hi ZGC developers,
I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as:
[2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms
that is, we check every line for the following condition (we don't use pattern matching for speed):
if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line
However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like
Please look at the enhance support for ZGC about GCViewer. Thanks, Leslie Zhai
[2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms
i.e. the condition should be
if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line
I wonder if this is all we need or I am missing something?
Also, what's the meaning of lines like these:
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise.
Thank you,
Misha
2023年9月8日 04:45,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Hi Leslie,
Thank you for your response. I am not sure I understand it, though. GCViewer is not a small tool, which extracts all kinds of information from the GC log. However, our goal for now is very limited: we just need to recognize and parse log lines that indicate STW GC pauses, that's it. We cannot reuse GCViewer code. On the other hand, I was hoping that I could get a more definitive/accurate answer directly from ZGC developers, rather than by reading code of a 3rd party tool.
May I ask you to answer at least my second question: what's the meaning of lines like the one below?
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
It is the Concurrent Mark Subphase, which doesn’t need to STW at all. Thanks, Leslie Zhai
Thank you,
Misha From: Leslie Zhai <zhaixiang@loongson.cn> Sent: Wednesday, September 6, 2023 5:35 PM To: Misha Dmitriev <mdmitriev@linkedin.com> Cc: zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha,
Could you try GCViewer? https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com...
enhance support for zgc and shenandoah https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com...
2023年9月7日 03:15,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Hi ZGC developers,
I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as:
[2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms
that is, we check every line for the following condition (we don't use pattern matching for speed):
if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line
However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like
Please look at the enhance support for ZGC about GCViewer.
Thanks, Leslie Zhai
[2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms
i.e. the condition should be
if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line
I wonder if this is all we need or I am missing something?
Also, what's the meaning of lines like these:
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise.
Thank you,
Misha
2023年9月8日 09:38,Leslie Zhai <zhaixiang@LOONGSON.CN> 写道:
2023年9月8日 04:45,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Hi Leslie,
Thank you for your response. I am not sure I understand it, though. GCViewer is not a small tool, which extracts all kinds of information from the GC log. However, our goal for now is very limited: we just need to recognize and parse log lines that indicate STW GC pauses, that's it. We cannot reuse GCViewer code. On the other hand, I was hoping that I could get a more definitive/accurate answer directly from ZGC developers, rather than by reading code of a 3rd party tool.
May I ask you to answer at least my second question: what's the meaning of lines like the one below?
Single generation ZGC ZMark concurrent thread worker *with* timeout support https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/x/xMark.cpp#... Thanks, Leslie Zhai
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
It is the Concurrent Mark Subphase, which doesn’t need to STW at all.
Thanks, Leslie Zhai
Thank you,
Misha From: Leslie Zhai <zhaixiang@loongson.cn> Sent: Wednesday, September 6, 2023 5:35 PM To: Misha Dmitriev <mdmitriev@linkedin.com> Cc: zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha,
Could you try GCViewer? https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com...
enhance support for zgc and shenandoah https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com...
2023年9月7日 03:15,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Hi ZGC developers,
I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as:
[2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms
that is, we check every line for the following condition (we don't use pattern matching for speed):
if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line
However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like
Please look at the enhance support for ZGC about GCViewer.
Thanks, Leslie Zhai
[2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms
i.e. the condition should be
if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line
I wonder if this is all we need or I am missing something?
Also, what's the meaning of lines like these:
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise.
Thank you,
Misha
Hi Misha, On 2023-09-06 21:15, Misha Dmitriev wrote:
Hi ZGC developers,
I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as:
[2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms
that is, we check every line for the following condition (we don't use pattern matching for speed):
if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line
However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like
[2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms
i.e. the condition should be
if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line
I wonder if this is all we need or I am missing something?
I think you also need to search for 'info' to make sure you don't get the more detailed breakdown of the pauses.
Also, what's the meaning of lines like these:
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise.
This line prints what one of the GC worker threads was doing during the Mark End pause. The "(ZWorker#2)" indicates that this a GC worker thread. You should filter out these lines if you try to collect the GC pauses. FWIW, in the version you are running this line is printed on the 'debug' level. However, in the latest source code this has changed and worker logging has been moved to the 'trace' level. There we have this breakdown: 1) 'info' logs a full pause or concurrent phase 2) 'debug' logs a subphase of (1) 3) 'trace' logs what individual worker threads are doing HTH, StefanK
Thank you,
Misha
Thank you very much, Stefan! Really appreciate your response. Misha ________________________________ From: Stefan Karlsson <stefan.karlsson@oracle.com> Sent: Thursday, September 7, 2023 11:08 PM To: Misha Dmitriev <mdmitriev@linkedin.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha, On 2023-09-06 21:15, Misha Dmitriev wrote: Hi ZGC developers, I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as: [2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms that is, we check every line for the following condition (we don't use pattern matching for speed): if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like [2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms i.e. the condition should be if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line I wonder if this is all we need or I am missing something? I think you also need to search for 'info' to make sure you don't get the more detailed breakdown of the pauses. Also, what's the meaning of lines like these: [2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise. This line prints what one of the GC worker threads was doing during the Mark End pause. The "(ZWorker#2)" indicates that this a GC worker thread. You should filter out these lines if you try to collect the GC pauses. FWIW, in the version you are running this line is printed on the 'debug' level. However, in the latest source code this has changed and worker logging has been moved to the 'trace' level. There we have this breakdown: 1) 'info' logs a full pause or concurrent phase 2) 'debug' logs a subphase of (1) 3) 'trace' logs what individual worker threads are doing HTH, StefanK Thank you, Misha
Hi Leslie, I think it's worth letting you know, just in case: GCViewer seems to not handle ZGC logs very well at this time. I hope you can pass this information to its maintainers. I guess it's also worth notifying the ZGC group about this issue, in case you guys want to use GCViewer at some point. I used it a lot in the past to analyze CMS and G1 logs, and I still consider it a very good tool. So it would be sad if it doesn't support more modern GCs well. Anyway, here is why I think so. I am attaching a short ZGC log, which was created by taking one of our real production logs and cutting it to a few hundred lines. When I grep it for pauses, I get the following: $ grep ") Pause" jdk17-zgc.gc.log [2023-09-05T18:54:16.683+0000][506290.976s][info ][gc,phases ] GC(76751) Pause Mark End 0.070ms [2023-09-05T18:54:16.765+0000][506291.058s][info ][gc,phases ] GC(76751) Pause Relocate Start 0.013ms [2023-09-05T18:54:20.922+0000][506295.215s][info ][gc,phases ] GC(76752) Pause Mark Start 0.017ms [2023-09-05T18:54:23.456+0000][506297.749s][info ][gc,phases ] GC(76752) Pause Mark End 0.060ms [2023-09-05T18:54:23.566+0000][506297.859s][info ][gc,phases ] GC(76752) Pause Relocate Start 0.010ms [2023-09-05T18:54:23.623+0000][506297.916s][info ][gc,phases ] GC(76753) Pause Mark Start 0.012ms [2023-09-05T18:54:24.664+0000][506298.956s][info ][gc,phases ] GC(76753) Pause Mark End 0.064ms [2023-09-05T18:54:24.721+0000][506299.013s][info ][gc,phases ] GC(76753) Pause Relocate Start 0.012ms There are clearly 8 pauses. However, when I feed it to GCViewer (I built it from the latest source code), it says that there are just 3 pauses. It also prints some NaNs, probably due to lack of precision. On bigger logs we observed similar discrepancies, though the difference between pauses obtained by grepping and reported by GCViewer was not so big. Misha ________________________________ From: Misha Dmitriev <mdmitriev@linkedin.com> Sent: Friday, September 8, 2023 11:40 AM To: Stefan Karlsson <stefan.karlsson@oracle.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Thank you very much, Stefan! Really appreciate your response. Misha ________________________________ From: Stefan Karlsson <stefan.karlsson@oracle.com> Sent: Thursday, September 7, 2023 11:08 PM To: Misha Dmitriev <mdmitriev@linkedin.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha, On 2023-09-06 21:15, Misha Dmitriev wrote: Hi ZGC developers, I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as: [2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms that is, we check every line for the following condition (we don't use pattern matching for speed): if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like [2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms i.e. the condition should be if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line I wonder if this is all we need or I am missing something? I think you also need to search for 'info' to make sure you don't get the more detailed breakdown of the pauses. Also, what's the meaning of lines like these: [2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise. This line prints what one of the GC worker threads was doing during the Mark End pause. The "(ZWorker#2)" indicates that this a GC worker thread. You should filter out these lines if you try to collect the GC pauses. FWIW, in the version you are running this line is printed on the 'debug' level. However, in the latest source code this has changed and worker logging has been moved to the 'trace' level. There we have this breakdown: 1) 'info' logs a full pause or concurrent phase 2) 'debug' logs a subphase of (1) 3) 'trace' logs what individual worker threads are doing HTH, StefanK Thank you, Misha
Hi Misha, Thanks for your bug report!
2023年9月15日 13:07,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Hi Leslie,
I think it's worth letting you know, just in case: GCViewer seems to not handle ZGC logs very well at this time. I hope you can pass this information to its maintainers. I guess it's also worth notifying the ZGC group about this issue, in case you guys want to use GCViewer at some point. I used it a lot in the past to analyze CMS and G1 logs, and I still consider it a very good tool. So it would be sad if it doesn't support more modern GCs well.
Anyway, here is why I think so. I am attaching a short ZGC log, which was created by taking one of our real production logs and cutting it to a few hundred lines. When I grep it for pauses, I get the following:
$ grep ") Pause" jdk17-zgc.gc.log [2023-09-05T18:54:16.683+0000][506290.976s][info ][gc,phases ] GC(76751) Pause Mark End 0.070ms [2023-09-05T18:54:16.765+0000][506291.058s][info ][gc,phases ] GC(76751) Pause Relocate Start 0.013ms [2023-09-05T18:54:20.922+0000][506295.215s][info ][gc,phases ] GC(76752) Pause Mark Start 0.017ms [2023-09-05T18:54:23.456+0000][506297.749s][info ][gc,phases ] GC(76752) Pause Mark End 0.060ms [2023-09-05T18:54:23.566+0000][506297.859s][info ][gc,phases ] GC(76752) Pause Relocate Start 0.010ms [2023-09-05T18:54:23.623+0000][506297.916s][info ][gc,phases ] GC(76753) Pause Mark Start 0.012ms [2023-09-05T18:54:24.664+0000][506298.956s][info ][gc,phases ] GC(76753) Pause Mark End 0.064ms [2023-09-05T18:54:24.721+0000][506299.013s][info ][gc,phases ] GC(76753) Pause Relocate Start 0.012ms
There are clearly 8 pauses. However, when I feed it to GCViewer (I built it from the latest source code), it says that there are just 3 pauses. It also prints some NaNs, probably due to lack of precision. On bigger logs we observed similar discrepancies, though the difference between pauses obtained by grepping and reported by GCViewer was not so big.
I am fixing the bug, owing to the enhancement about ZGC and ShenandoahGC was implemented by ourselves. Thanks, Leslie Zhai
Misha
From: Misha Dmitriev <mdmitriev@linkedin.com> Sent: Friday, September 8, 2023 11:40 AM To: Stefan Karlsson <stefan.karlsson@oracle.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Thank you very much, Stefan! Really appreciate your response.
Misha
From: Stefan Karlsson <stefan.karlsson@oracle.com> Sent: Thursday, September 7, 2023 11:08 PM To: Misha Dmitriev <mdmitriev@linkedin.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha,
On 2023-09-06 21:15, Misha Dmitriev wrote:
Hi ZGC developers,
I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as:
[2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms
that is, we check every line for the following condition (we don't use pattern matching for speed):
if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line
However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like
[2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms
i.e. the condition should be
if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line
I wonder if this is all we need or I am missing something?
I think you also need to search for 'info' to make sure you don't get the more detailed breakdown of the pauses.
Also, what's the meaning of lines like these:
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise.
This line prints what one of the GC worker threads was doing during the Mark End pause. The "(ZWorker#2)" indicates that this a GC worker thread. You should filter out these lines if you try to collect the GC pauses.
FWIW, in the version you are running this line is printed on the 'debug' level. However, in the latest source code this has changed and worker logging has been moved to the 'trace' level. There we have this breakdown: 1) 'info' logs a full pause or concurrent phase 2) 'debug' logs a subphase of (1) 3) 'trace' logs what individual worker threads are doing
HTH, StefanK
Thank you,
Misha
<jdk17-zgc.gc.log>
Hi Misha, I investigated why GCViewer failed to parse jdk17 single generation ZGC: The intuitive phenomenon is `parentEvent` is null[1]: AbstractGCEvent<?> parentEvent = context.getPartialEventsMap().get(event.getNumber() + "”); Then there is no opportunity to return ZGC event[2] but instead return null. But why `parentEvent` AKA `context.getPartialEventsMap().get(eventNumber)` is null? That’s due to `Map<String, AbstractGCEvent<?>> partialEventsMap` doesn’t contain (through `put`) such event[3]: context.getPartialEventsMap().put(event.getNumber() + "", event); So the root cause is there are no `gc,start` tags to `handleTagGcStartTail`[4]: case TAG_GC_START: // <— TAG_GC_START = "gc,start"; returnEvent = handleTagGcStartTail(context, event); break; I am investigating why jdk17 single generation `ZStatPhaseCycle::register_start` failed to `log_info(gc, start)`??? I also dumped the jdk17 single generation ZGC log for Renaissance[6] philosophers for x86_64, aarch64 and also loongarch64: -XX:+UseZGC -Xlog:gc*=debug:file=zgc-`date +%Y%m%d%H%M%S`.log:time,uptime,level,tags -jar renaissance-gpl-0.14.2.jar philosophers And there are `gc,start` tags: [2023-09-16T10:27:34.880+0800][0.485s][info ][gc,start ] GC(0) Garbage Collection (System.gc()) [2023-09-16T10:27:35.903+0800][1.508s][info ][gc,start ] GC(1) Garbage Collection (Warmup) [2023-09-16T10:27:36.803+0800][2.408s][info ][gc,start ] GC(2) Garbage Collection (Warmup) [2023-09-16T10:27:38.103+0800][3.708s][info ][gc,start ] GC(3) Garbage Collection (Warmup) [2023-09-16T10:27:38.282+0800][3.887s][info ][gc,start ] GC(4) Garbage Collection (System.gc()) [2023-09-16T10:27:39.103+0800][4.708s][info ][gc,start ] GC(5) Garbage Collection (Proactive) [2023-09-16T10:27:40.103+0800][5.708s][info ][gc,start ] GC(6) Garbage Collection (Proactive) [2023-09-16T10:27:40.903+0800][6.508s][info ][gc,start ] GC(7) Garbage Collection (Proactive) Why there are no `gc,start` tags in your jdk17-zgc.gc.log??? Could you run Renaissance philosophers and dump the log in your real production environment shown as above? What is your real production environment? lscpu? sysctl -a | grep machdep.cpu? Thanks, Leslie Zhai 1. https://github.com/chewiebug/GCViewer/blob/develop/src/main/java/com/tagtrau... 2. https://github.com/chewiebug/GCViewer/blob/develop/src/main/java/com/tagtrau... 3. https://github.com/chewiebug/GCViewer/blob/develop/src/main/java/com/tagtrau... 4. https://github.com/chewiebug/GCViewer/blob/develop/src/main/java/com/tagtrau... 5. https://github.com/openjdk/jdk17u-dev/blob/master/src/hotspot/share/gc/z/zSt... 6. https://github.com/renaissance-benchmarks/renaissance
2023年9月15日 13:07,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Hi Leslie,
I think it's worth letting you know, just in case: GCViewer seems to not handle ZGC logs very well at this time. I hope you can pass this information to its maintainers. I guess it's also worth notifying the ZGC group about this issue, in case you guys want to use GCViewer at some point. I used it a lot in the past to analyze CMS and G1 logs, and I still consider it a very good tool. So it would be sad if it doesn't support more modern GCs well.
Anyway, here is why I think so. I am attaching a short ZGC log, which was created by taking one of our real production logs and cutting it to a few hundred lines. When I grep it for pauses, I get the following:
$ grep ") Pause" jdk17-zgc.gc.log [2023-09-05T18:54:16.683+0000][506290.976s][info ][gc,phases ] GC(76751) Pause Mark End 0.070ms [2023-09-05T18:54:16.765+0000][506291.058s][info ][gc,phases ] GC(76751) Pause Relocate Start 0.013ms [2023-09-05T18:54:20.922+0000][506295.215s][info ][gc,phases ] GC(76752) Pause Mark Start 0.017ms [2023-09-05T18:54:23.456+0000][506297.749s][info ][gc,phases ] GC(76752) Pause Mark End 0.060ms [2023-09-05T18:54:23.566+0000][506297.859s][info ][gc,phases ] GC(76752) Pause Relocate Start 0.010ms [2023-09-05T18:54:23.623+0000][506297.916s][info ][gc,phases ] GC(76753) Pause Mark Start 0.012ms [2023-09-05T18:54:24.664+0000][506298.956s][info ][gc,phases ] GC(76753) Pause Mark End 0.064ms [2023-09-05T18:54:24.721+0000][506299.013s][info ][gc,phases ] GC(76753) Pause Relocate Start 0.012ms
There are clearly 8 pauses. However, when I feed it to GCViewer (I built it from the latest source code), it says that there are just 3 pauses. It also prints some NaNs, probably due to lack of precision. On bigger logs we observed similar discrepancies, though the difference between pauses obtained by grepping and reported by GCViewer was not so big.
Misha
From: Misha Dmitriev <mdmitriev@linkedin.com> Sent: Friday, September 8, 2023 11:40 AM To: Stefan Karlsson <stefan.karlsson@oracle.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Thank you very much, Stefan! Really appreciate your response.
Misha
From: Stefan Karlsson <stefan.karlsson@oracle.com> Sent: Thursday, September 7, 2023 11:08 PM To: Misha Dmitriev <mdmitriev@linkedin.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha,
On 2023-09-06 21:15, Misha Dmitriev wrote:
Hi ZGC developers,
I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as:
[2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms
that is, we check every line for the following condition (we don't use pattern matching for speed):
if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line
However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like
[2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms
i.e. the condition should be
if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line
I wonder if this is all we need or I am missing something?
I think you also need to search for 'info' to make sure you don't get the more detailed breakdown of the pauses.
Also, what's the meaning of lines like these:
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise.
This line prints what one of the GC worker threads was doing during the Mark End pause. The "(ZWorker#2)" indicates that this a GC worker thread. You should filter out these lines if you try to collect the GC pauses.
FWIW, in the version you are running this line is printed on the 'debug' level. However, in the latest source code this has changed and worker logging has been moved to the 'trace' level. There we have this breakdown: 1) 'info' logs a full pause or concurrent phase 2) 'debug' logs a subphase of (1) 3) 'trace' logs what individual worker threads are doing
HTH, StefanK
Thank you,
Misha
<jdk17-zgc.gc.log>
Hi Leslie, I am currently on vacation, will get back to this when I return in the second half of the next week. For now, I can only say that of two ZGC logs that I currently have, from two different apps running in production, one doesn't have any lines with the 'gc,start' tags (and it's a pretty big log, 71M - whereas another one does have some such lines. Otherwise the logs look similar. When I return, I'll check the GC logging configuration etc. for both apps. Meanwhile, it would be good if you further investigate the conditions under which ZStatPhaseCycle::register_start may not print to the log (or may not be called at all?) Misha ________________________________ From: Leslie Zhai <zhaixiang@loongson.cn> Sent: Friday, September 15, 2023 7:40 PM To: Misha Dmitriev <mdmitriev@linkedin.com> Cc: zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha, I investigated why GCViewer failed to parse jdk17 single generation ZGC: The intuitive phenomenon is `parentEvent` is null[1]: AbstractGCEvent<?> parentEvent = context.getPartialEventsMap().get(event.getNumber() + "”); Then there is no opportunity to return ZGC event[2] but instead return null. But why `parentEvent` AKA `context.getPartialEventsMap().get(eventNumber)` is null? That’s due to `Map<String, AbstractGCEvent<?>> partialEventsMap` doesn’t contain (through `put`) such event[3]: context.getPartialEventsMap().put(event.getNumber() + "", event); So the root cause is there are no `gc,start` tags to `handleTagGcStartTail`[4]: case TAG_GC_START: // <― TAG_GC_START = "gc,start"; returnEvent = handleTagGcStartTail(context, event); break; I am investigating why jdk17 single generation `ZStatPhaseCycle::register_start` failed to `log_info(gc, start)`??? I also dumped the jdk17 single generation ZGC log for Renaissance[6] philosophers for x86_64, aarch64 and also loongarch64: -XX:+UseZGC -Xlog:gc*=debug:file=zgc-`date +%Y%m%d%H%M%S`.log:time,uptime,level,tags -jar renaissance-gpl-0.14.2.jar philosophers And there are `gc,start` tags: [2023-09-16T10:27:34.880+0800][0.485s][info ][gc,start ] GC(0) Garbage Collection (System.gc()) [2023-09-16T10:27:35.903+0800][1.508s][info ][gc,start ] GC(1) Garbage Collection (Warmup) [2023-09-16T10:27:36.803+0800][2.408s][info ][gc,start ] GC(2) Garbage Collection (Warmup) [2023-09-16T10:27:38.103+0800][3.708s][info ][gc,start ] GC(3) Garbage Collection (Warmup) [2023-09-16T10:27:38.282+0800][3.887s][info ][gc,start ] GC(4) Garbage Collection (System.gc()) [2023-09-16T10:27:39.103+0800][4.708s][info ][gc,start ] GC(5) Garbage Collection (Proactive) [2023-09-16T10:27:40.103+0800][5.708s][info ][gc,start ] GC(6) Garbage Collection (Proactive) [2023-09-16T10:27:40.903+0800][6.508s][info ][gc,start ] GC(7) Garbage Collection (Proactive) Why there are no `gc,start` tags in your jdk17-zgc.gc.log??? Could you run Renaissance philosophers and dump the log in your real production environment shown as above? What is your real production environment? lscpu? sysctl -a | grep machdep.cpu? Thanks, Leslie Zhai 1. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fchewiebug%2FGCViewer%2Fblob%2Fdevelop%2Fsrc%2Fmain%2Fjava%2Fcom%2Ftagtraum%2Fperf%2Fgcviewer%2Fimp%2FDataReaderUnifiedJvmLogging.java%23L303&data=05%7C01%7Cmdmitriev%40linkedin.com%7C4d110bc930284e890e6308dbb65e5ec5%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638304288679249773%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=7KtXLn6Vpau46Nf15NuSsUwvkH%2BCAorYI2S1QosBXTg%3D&reserved=0<https://github.com/chewiebug/GCViewer/blob/develop/src/main/java/com/tagtraum/perf/gcviewer/imp/DataReaderUnifiedJvmLogging.java#L303> 2. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fchewiebug%2FGCViewer%2Fblob%2Fdevelop%2Fsrc%2Fmain%2Fjava%2Fcom%2Ftagtraum%2Fperf%2Fgcviewer%2Fmodel%2FAbstractGCEvent.java%23L714&data=05%7C01%7Cmdmitriev%40linkedin.com%7C4d110bc930284e890e6308dbb65e5ec5%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638304288679249773%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=IlIjpMQRSM72bBG3UQMNGaEKvVKvyUblfVREIlkVNSY%3D&reserved=0<https://github.com/chewiebug/GCViewer/blob/develop/src/main/java/com/tagtraum/perf/gcviewer/model/AbstractGCEvent.java#L714> 3. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fchewiebug%2FGCViewer%2Fblob%2Fdevelop%2Fsrc%2Fmain%2Fjava%2Fcom%2Ftagtraum%2Fperf%2Fgcviewer%2Fimp%2FDataReaderUnifiedJvmLogging.java%23L296&data=05%7C01%7Cmdmitriev%40linkedin.com%7C4d110bc930284e890e6308dbb65e5ec5%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638304288679249773%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=uarOB8FIf7dWkgomQ7ICTRhL6yUbGBKK%2BwL2X6kOozI%3D&reserved=0<https://github.com/chewiebug/GCViewer/blob/develop/src/main/java/com/tagtraum/perf/gcviewer/imp/DataReaderUnifiedJvmLogging.java#L296> 4. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fchewiebug%2FGCViewer%2Fblob%2Fdevelop%2Fsrc%2Fmain%2Fjava%2Fcom%2Ftagtraum%2Fperf%2Fgcviewer%2Fimp%2FDataReaderUnifiedJvmLogging.java%23L264&data=05%7C01%7Cmdmitriev%40linkedin.com%7C4d110bc930284e890e6308dbb65e5ec5%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638304288679249773%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=%2B%2Fd9vgM%2BcDmQG%2Baihz4CcgTWQjxrujdR6y5WflT17tM%3D&reserved=0<https://github.com/chewiebug/GCViewer/blob/develop/src/main/java/com/tagtraum/perf/gcviewer/imp/DataReaderUnifiedJvmLogging.java#L264> 5. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fopenjdk%2Fjdk17u-dev%2Fblob%2Fmaster%2Fsrc%2Fhotspot%2Fshare%2Fgc%2Fz%2FzStat.cpp%23L641&data=05%7C01%7Cmdmitriev%40linkedin.com%7C4d110bc930284e890e6308dbb65e5ec5%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638304288679249773%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=%2BKM2Bpba3lJu1LFEw7TdTV6nYyaBQV9jf%2FuweQnVYgU%3D&reserved=0<https://github.com/openjdk/jdk17u-dev/blob/master/src/hotspot/share/gc/z/zStat.cpp#L641> 6. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Frenaissance-benchmarks%2Frenaissance&data=05%7C01%7Cmdmitriev%40linkedin.com%7C4d110bc930284e890e6308dbb65e5ec5%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638304288679249773%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=JxCJHokctj2387jELulYTqmGMP2IykBym1LAJcf2Qn8%3D&reserved=0<https://github.com/renaissance-benchmarks/renaissance>
2023年9月15日 13:07,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Hi Leslie,
I think it's worth letting you know, just in case: GCViewer seems to not handle ZGC logs very well at this time. I hope you can pass this information to its maintainers. I guess it's also worth notifying the ZGC group about this issue, in case you guys want to use GCViewer at some point. I used it a lot in the past to analyze CMS and G1 logs, and I still consider it a very good tool. So it would be sad if it doesn't support more modern GCs well.
Anyway, here is why I think so. I am attaching a short ZGC log, which was created by taking one of our real production logs and cutting it to a few hundred lines. When I grep it for pauses, I get the following:
$ grep ") Pause" jdk17-zgc.gc.log [2023-09-05T18:54:16.683+0000][506290.976s][info ][gc,phases ] GC(76751) Pause Mark End 0.070ms [2023-09-05T18:54:16.765+0000][506291.058s][info ][gc,phases ] GC(76751) Pause Relocate Start 0.013ms [2023-09-05T18:54:20.922+0000][506295.215s][info ][gc,phases ] GC(76752) Pause Mark Start 0.017ms [2023-09-05T18:54:23.456+0000][506297.749s][info ][gc,phases ] GC(76752) Pause Mark End 0.060ms [2023-09-05T18:54:23.566+0000][506297.859s][info ][gc,phases ] GC(76752) Pause Relocate Start 0.010ms [2023-09-05T18:54:23.623+0000][506297.916s][info ][gc,phases ] GC(76753) Pause Mark Start 0.012ms [2023-09-05T18:54:24.664+0000][506298.956s][info ][gc,phases ] GC(76753) Pause Mark End 0.064ms [2023-09-05T18:54:24.721+0000][506299.013s][info ][gc,phases ] GC(76753) Pause Relocate Start 0.012ms
There are clearly 8 pauses. However, when I feed it to GCViewer (I built it from the latest source code), it says that there are just 3 pauses. It also prints some NaNs, probably due to lack of precision. On bigger logs we observed similar discrepancies, though the difference between pauses obtained by grepping and reported by GCViewer was not so big.
Misha
From: Misha Dmitriev <mdmitriev@linkedin.com> Sent: Friday, September 8, 2023 11:40 AM To: Stefan Karlsson <stefan.karlsson@oracle.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Thank you very much, Stefan! Really appreciate your response.
Misha
From: Stefan Karlsson <stefan.karlsson@oracle.com> Sent: Thursday, September 7, 2023 11:08 PM To: Misha Dmitriev <mdmitriev@linkedin.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha,
On 2023-09-06 21:15, Misha Dmitriev wrote:
Hi ZGC developers,
I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as:
[2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms
that is, we check every line for the following condition (we don't use pattern matching for speed):
if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line
However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like
[2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms
i.e. the condition should be
if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line
I wonder if this is all we need or I am missing something?
I think you also need to search for 'info' to make sure you don't get the more detailed breakdown of the pauses.
Also, what's the meaning of lines like these:
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise.
This line prints what one of the GC worker threads was doing during the Mark End pause. The "(ZWorker#2)" indicates that this a GC worker thread. You should filter out these lines if you try to collect the GC pauses.
FWIW, in the version you are running this line is printed on the 'debug' level. However, in the latest source code this has changed and worker logging has been moved to the 'trace' level. There we have this breakdown: 1) 'info' logs a full pause or concurrent phase 2) 'debug' logs a subphase of (1) 3) 'trace' logs what individual worker threads are doing
HTH, StefanK
Thank you,
Misha
<jdk17-zgc.gc.log>
Hi Leslie, To close the loop for now: turns out that the app from which I got the "problematic" gc.log (with no "gc,start" tags) is no longer running with ZGC. That was just an experiment, and all the logs are already gone (or at least it would be quite hard to find them and determine the JVM configuration at the time when they were generated). A few apps that currently do run with ZGC generate logs with "gc,start" present. So for now I am not sure what it was - perhaps some kind of misconfiguration somewhere. I'll keep an eye on this as more apps at LI pick up ZGC, and will let you know if anything suspicious happens. Misha ________________________________ From: zgc-dev <zgc-dev-retn@openjdk.org> on behalf of Misha Dmitriev <mdmitriev@linkedin.com> Sent: Sunday, September 17, 2023 12:46 PM To: Leslie Zhai <zhaixiang@loongson.cn> Cc: zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Leslie, I am currently on vacation, will get back to this when I return in the second half of the next week. For now, I can only say that of two ZGC logs that I currently have, from two different apps running in production, one doesn't have any lines with the 'gc,start' tags (and it's a pretty big log, 71M - whereas another one does have some such lines. Otherwise the logs look similar. When I return, I'll check the GC logging configuration etc. for both apps. Meanwhile, it would be good if you further investigate the conditions under which ZStatPhaseCycle::register_start may not print to the log (or may not be called at all?) Misha ________________________________ From: Leslie Zhai <zhaixiang@loongson.cn> Sent: Friday, September 15, 2023 7:40 PM To: Misha Dmitriev <mdmitriev@linkedin.com> Cc: zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha, I investigated why GCViewer failed to parse jdk17 single generation ZGC: The intuitive phenomenon is `parentEvent` is null[1]: AbstractGCEvent<?> parentEvent = context.getPartialEventsMap().get(event.getNumber() + "”); Then there is no opportunity to return ZGC event[2] but instead return null. But why `parentEvent` AKA `context.getPartialEventsMap().get(eventNumber)` is null? That’s due to `Map<String, AbstractGCEvent<?>> partialEventsMap` doesn’t contain (through `put`) such event[3]: context.getPartialEventsMap().put(event.getNumber() + "", event); So the root cause is there are no `gc,start` tags to `handleTagGcStartTail`[4]: case TAG_GC_START: // <― TAG_GC_START = "gc,start"; returnEvent = handleTagGcStartTail(context, event); break; I am investigating why jdk17 single generation `ZStatPhaseCycle::register_start` failed to `log_info(gc, start)`??? I also dumped the jdk17 single generation ZGC log for Renaissance[6] philosophers for x86_64, aarch64 and also loongarch64: -XX:+UseZGC -Xlog:gc*=debug:file=zgc-`date +%Y%m%d%H%M%S`.log:time,uptime,level,tags -jar renaissance-gpl-0.14.2.jar philosophers And there are `gc,start` tags: [2023-09-16T10:27:34.880+0800][0.485s][info ][gc,start ] GC(0) Garbage Collection (System.gc()) [2023-09-16T10:27:35.903+0800][1.508s][info ][gc,start ] GC(1) Garbage Collection (Warmup) [2023-09-16T10:27:36.803+0800][2.408s][info ][gc,start ] GC(2) Garbage Collection (Warmup) [2023-09-16T10:27:38.103+0800][3.708s][info ][gc,start ] GC(3) Garbage Collection (Warmup) [2023-09-16T10:27:38.282+0800][3.887s][info ][gc,start ] GC(4) Garbage Collection (System.gc()) [2023-09-16T10:27:39.103+0800][4.708s][info ][gc,start ] GC(5) Garbage Collection (Proactive) [2023-09-16T10:27:40.103+0800][5.708s][info ][gc,start ] GC(6) Garbage Collection (Proactive) [2023-09-16T10:27:40.903+0800][6.508s][info ][gc,start ] GC(7) Garbage Collection (Proactive) Why there are no `gc,start` tags in your jdk17-zgc.gc.log??? Could you run Renaissance philosophers and dump the log in your real production environment shown as above? What is your real production environment? lscpu? sysctl -a | grep machdep.cpu? Thanks, Leslie Zhai 1. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fchewiebug%2FGCViewer%2Fblob%2Fdevelop%2Fsrc%2Fmain%2Fjava%2Fcom%2Ftagtraum%2Fperf%2Fgcviewer%2Fimp%2FDataReaderUnifiedJvmLogging.java%23L303&data=05%7C01%7Cmdmitriev%40linkedin.com%7C4d110bc930284e890e6308dbb65e5ec5%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638304288679249773%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=7KtXLn6Vpau46Nf15NuSsUwvkH%2BCAorYI2S1QosBXTg%3D&reserved=0<https://github.com/chewiebug/GCViewer/blob/develop/src/main/java/com/tagtraum/perf/gcviewer/imp/DataReaderUnifiedJvmLogging.java#L303> 2. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fchewiebug%2FGCViewer%2Fblob%2Fdevelop%2Fsrc%2Fmain%2Fjava%2Fcom%2Ftagtraum%2Fperf%2Fgcviewer%2Fmodel%2FAbstractGCEvent.java%23L714&data=05%7C01%7Cmdmitriev%40linkedin.com%7C4d110bc930284e890e6308dbb65e5ec5%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638304288679249773%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=IlIjpMQRSM72bBG3UQMNGaEKvVKvyUblfVREIlkVNSY%3D&reserved=0<https://github.com/chewiebug/GCViewer/blob/develop/src/main/java/com/tagtraum/perf/gcviewer/model/AbstractGCEvent.java#L714> 3. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fchewiebug%2FGCViewer%2Fblob%2Fdevelop%2Fsrc%2Fmain%2Fjava%2Fcom%2Ftagtraum%2Fperf%2Fgcviewer%2Fimp%2FDataReaderUnifiedJvmLogging.java%23L296&data=05%7C01%7Cmdmitriev%40linkedin.com%7C4d110bc930284e890e6308dbb65e5ec5%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638304288679249773%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=uarOB8FIf7dWkgomQ7ICTRhL6yUbGBKK%2BwL2X6kOozI%3D&reserved=0<https://github.com/chewiebug/GCViewer/blob/develop/src/main/java/com/tagtraum/perf/gcviewer/imp/DataReaderUnifiedJvmLogging.java#L296> 4. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fchewiebug%2FGCViewer%2Fblob%2Fdevelop%2Fsrc%2Fmain%2Fjava%2Fcom%2Ftagtraum%2Fperf%2Fgcviewer%2Fimp%2FDataReaderUnifiedJvmLogging.java%23L264&data=05%7C01%7Cmdmitriev%40linkedin.com%7C4d110bc930284e890e6308dbb65e5ec5%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638304288679249773%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=%2B%2Fd9vgM%2BcDmQG%2Baihz4CcgTWQjxrujdR6y5WflT17tM%3D&reserved=0<https://github.com/chewiebug/GCViewer/blob/develop/src/main/java/com/tagtraum/perf/gcviewer/imp/DataReaderUnifiedJvmLogging.java#L264> 5. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fopenjdk%2Fjdk17u-dev%2Fblob%2Fmaster%2Fsrc%2Fhotspot%2Fshare%2Fgc%2Fz%2FzStat.cpp%23L641&data=05%7C01%7Cmdmitriev%40linkedin.com%7C4d110bc930284e890e6308dbb65e5ec5%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638304288679249773%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=%2BKM2Bpba3lJu1LFEw7TdTV6nYyaBQV9jf%2FuweQnVYgU%3D&reserved=0<https://github.com/openjdk/jdk17u-dev/blob/master/src/hotspot/share/gc/z/zStat.cpp#L641> 6. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Frenaissance-benchmarks%2Frenaissance&data=05%7C01%7Cmdmitriev%40linkedin.com%7C4d110bc930284e890e6308dbb65e5ec5%7C72f988bf86f141af91ab2d7cd011db47%7C0%7C0%7C638304288679249773%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=JxCJHokctj2387jELulYTqmGMP2IykBym1LAJcf2Qn8%3D&reserved=0<https://github.com/renaissance-benchmarks/renaissance>
2023年9月15日 13:07,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Hi Leslie,
I think it's worth letting you know, just in case: GCViewer seems to not handle ZGC logs very well at this time. I hope you can pass this information to its maintainers. I guess it's also worth notifying the ZGC group about this issue, in case you guys want to use GCViewer at some point. I used it a lot in the past to analyze CMS and G1 logs, and I still consider it a very good tool. So it would be sad if it doesn't support more modern GCs well.
Anyway, here is why I think so. I am attaching a short ZGC log, which was created by taking one of our real production logs and cutting it to a few hundred lines. When I grep it for pauses, I get the following:
$ grep ") Pause" jdk17-zgc.gc.log [2023-09-05T18:54:16.683+0000][506290.976s][info ][gc,phases ] GC(76751) Pause Mark End 0.070ms [2023-09-05T18:54:16.765+0000][506291.058s][info ][gc,phases ] GC(76751) Pause Relocate Start 0.013ms [2023-09-05T18:54:20.922+0000][506295.215s][info ][gc,phases ] GC(76752) Pause Mark Start 0.017ms [2023-09-05T18:54:23.456+0000][506297.749s][info ][gc,phases ] GC(76752) Pause Mark End 0.060ms [2023-09-05T18:54:23.566+0000][506297.859s][info ][gc,phases ] GC(76752) Pause Relocate Start 0.010ms [2023-09-05T18:54:23.623+0000][506297.916s][info ][gc,phases ] GC(76753) Pause Mark Start 0.012ms [2023-09-05T18:54:24.664+0000][506298.956s][info ][gc,phases ] GC(76753) Pause Mark End 0.064ms [2023-09-05T18:54:24.721+0000][506299.013s][info ][gc,phases ] GC(76753) Pause Relocate Start 0.012ms
There are clearly 8 pauses. However, when I feed it to GCViewer (I built it from the latest source code), it says that there are just 3 pauses. It also prints some NaNs, probably due to lack of precision. On bigger logs we observed similar discrepancies, though the difference between pauses obtained by grepping and reported by GCViewer was not so big.
Misha
From: Misha Dmitriev <mdmitriev@linkedin.com> Sent: Friday, September 8, 2023 11:40 AM To: Stefan Karlsson <stefan.karlsson@oracle.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Thank you very much, Stefan! Really appreciate your response.
Misha
From: Stefan Karlsson <stefan.karlsson@oracle.com> Sent: Thursday, September 7, 2023 11:08 PM To: Misha Dmitriev <mdmitriev@linkedin.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha,
On 2023-09-06 21:15, Misha Dmitriev wrote:
Hi ZGC developers,
I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as:
[2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms
that is, we check every line for the following condition (we don't use pattern matching for speed):
if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line
However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like
[2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms
i.e. the condition should be
if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line
I wonder if this is all we need or I am missing something?
I think you also need to search for 'info' to make sure you don't get the more detailed breakdown of the pauses.
Also, what's the meaning of lines like these:
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise.
This line prints what one of the GC worker threads was doing during the Mark End pause. The "(ZWorker#2)" indicates that this a GC worker thread. You should filter out these lines if you try to collect the GC pauses.
FWIW, in the version you are running this line is printed on the 'debug' level. However, in the latest source code this has changed and worker logging has been moved to the 'trace' level. There we have this breakdown: 1) 'info' logs a full pause or concurrent phase 2) 'debug' logs a subphase of (1) 3) 'trace' logs what individual worker threads are doing
HTH, StefanK
Thank you,
Misha
<jdk17-zgc.gc.log>
Hi Misha, Thanks for your update!
2023年9月21日 06:09,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Hi Leslie,
To close the loop for now: turns out that the app from which I got the "problematic" gc.log (with no "gc,start" tags) is no longer running with ZGC. That was just an experiment, and all the logs are already gone (or at least it would be quite hard to find them and determine the JVM configuration at the time when they were generated). A few apps that currently do run with ZGC generate logs with "gc,start" present.
So for now I am not sure what it was - perhaps some kind of misconfiguration somewhere. I'll keep an eye on this as more apps at LI pick up ZGC, and will let you know if anything suspicious happens.
It might be xlog Logging infrastructure related issue. And I investigated `ConcurrentGCThread::run`[1] always trigger the `ZDriverRequest` then `ZDriverGCScope` constructed `ZPhaseCycle` AKA `ZStatPhaseCycle` which generated logs with the “gc,start” tags except that there is no gc at all when `request.cause() == GCCause::_no_gc`[2]. 1. https://github.com/openjdk/jdk17u-dev/blob/master/src/hotspot/share/gc/share... 2. https://github.com/openjdk/jdk17u-dev/blob/master/src/hotspot/share/gc/z/zDr... Thanks, Leslie Zhai
Misha
From: zgc-dev <zgc-dev-retn@openjdk.org> on behalf of Misha Dmitriev <mdmitriev@linkedin.com> Sent: Sunday, September 17, 2023 12:46 PM To: Leslie Zhai <zhaixiang@loongson.cn> Cc: zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Leslie,
I am currently on vacation, will get back to this when I return in the second half of the next week.
For now, I can only say that of two ZGC logs that I currently have, from two different apps running in production, one doesn't have any lines with the 'gc,start' tags (and it's a pretty big log, 71M - whereas another one does have some such lines. Otherwise the logs look similar.
When I return, I'll check the GC logging configuration etc. for both apps. Meanwhile, it would be good if you further investigate the conditions under which ZStatPhaseCycle::register_start may not print to the log (or may not be called at all?)
MishaFrom: Leslie Zhai <zhaixiang@loongson.cn> Sent: Friday, September 15, 2023 7:40 PM To: Misha Dmitriev <mdmitriev@linkedin.com> Cc: zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha,
I investigated why GCViewer failed to parse jdk17 single generation ZGC:
The intuitive phenomenon is `parentEvent` is null[1]:
AbstractGCEvent<?> parentEvent = context.getPartialEventsMap().get(event.getNumber() + "”);
Then there is no opportunity to return ZGC event[2] but instead return null.
But why `parentEvent` AKA `context.getPartialEventsMap().get(eventNumber)` is null?
That’s due to `Map<String, AbstractGCEvent<?>> partialEventsMap` doesn’t contain (through `put`) such event[3]:
context.getPartialEventsMap().put(event.getNumber() + "", event);
So the root cause is there are no `gc,start` tags to `handleTagGcStartTail`[4]:
case TAG_GC_START: // <― TAG_GC_START = "gc,start"; returnEvent = handleTagGcStartTail(context, event); break;
I am investigating why jdk17 single generation `ZStatPhaseCycle::register_start` failed to `log_info(gc, start)`???
I also dumped the jdk17 single generation ZGC log for Renaissance[6] philosophers for x86_64, aarch64 and also loongarch64:
-XX:+UseZGC -Xlog:gc*=debug:file=zgc-`date +%Y%m%d%H%M%S`.log:time,uptime,level,tags -jar renaissance-gpl-0.14.2.jar philosophers
And there are `gc,start` tags:
[2023-09-16T10:27:34.880+0800][0.485s][info ][gc,start ] GC(0) Garbage Collection (System.gc()) [2023-09-16T10:27:35.903+0800][1.508s][info ][gc,start ] GC(1) Garbage Collection (Warmup) [2023-09-16T10:27:36.803+0800][2.408s][info ][gc,start ] GC(2) Garbage Collection (Warmup) [2023-09-16T10:27:38.103+0800][3.708s][info ][gc,start ] GC(3) Garbage Collection (Warmup) [2023-09-16T10:27:38.282+0800][3.887s][info ][gc,start ] GC(4) Garbage Collection (System.gc()) [2023-09-16T10:27:39.103+0800][4.708s][info ][gc,start ] GC(5) Garbage Collection (Proactive) [2023-09-16T10:27:40.103+0800][5.708s][info ][gc,start ] GC(6) Garbage Collection (Proactive) [2023-09-16T10:27:40.903+0800][6.508s][info ][gc,start ] GC(7) Garbage Collection (Proactive)
Why there are no `gc,start` tags in your jdk17-zgc.gc.log???
Could you run Renaissance philosophers and dump the log in your real production environment shown as above?
What is your real production environment? lscpu? sysctl -a | grep machdep.cpu?
Thanks, Leslie Zhai
1. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com...
2. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com...
3. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com...
4. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com...
5. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com...
6. https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com...
2023年9月15日 13:07,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Hi Leslie,
I think it's worth letting you know, just in case: GCViewer seems to not handle ZGC logs very well at this time. I hope you can pass this information to its maintainers. I guess it's also worth notifying the ZGC group about this issue, in case you guys want to use GCViewer at some point. I used it a lot in the past to analyze CMS and G1 logs, and I still consider it a very good tool. So it would be sad if it doesn't support more modern GCs well.
Anyway, here is why I think so. I am attaching a short ZGC log, which was created by taking one of our real production logs and cutting it to a few hundred lines. When I grep it for pauses, I get the following:
$ grep ") Pause" jdk17-zgc.gc.log [2023-09-05T18:54:16.683+0000][506290.976s][info ][gc,phases ] GC(76751) Pause Mark End 0.070ms [2023-09-05T18:54:16.765+0000][506291.058s][info ][gc,phases ] GC(76751) Pause Relocate Start 0.013ms [2023-09-05T18:54:20.922+0000][506295.215s][info ][gc,phases ] GC(76752) Pause Mark Start 0.017ms [2023-09-05T18:54:23.456+0000][506297.749s][info ][gc,phases ] GC(76752) Pause Mark End 0.060ms [2023-09-05T18:54:23.566+0000][506297.859s][info ][gc,phases ] GC(76752) Pause Relocate Start 0.010ms [2023-09-05T18:54:23.623+0000][506297.916s][info ][gc,phases ] GC(76753) Pause Mark Start 0.012ms [2023-09-05T18:54:24.664+0000][506298.956s][info ][gc,phases ] GC(76753) Pause Mark End 0.064ms [2023-09-05T18:54:24.721+0000][506299.013s][info ][gc,phases ] GC(76753) Pause Relocate Start 0.012ms
There are clearly 8 pauses. However, when I feed it to GCViewer (I built it from the latest source code), it says that there are just 3 pauses. It also prints some NaNs, probably due to lack of precision. On bigger logs we observed similar discrepancies, though the difference between pauses obtained by grepping and reported by GCViewer was not so big.
Misha
From: Misha Dmitriev <mdmitriev@linkedin.com> Sent: Friday, September 8, 2023 11:40 AM To: Stefan Karlsson <stefan.karlsson@oracle.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Thank you very much, Stefan! Really appreciate your response.
Misha
From: Stefan Karlsson <stefan.karlsson@oracle.com> Sent: Thursday, September 7, 2023 11:08 PM To: Misha Dmitriev <mdmitriev@linkedin.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha,
On 2023-09-06 21:15, Misha Dmitriev wrote:
Hi ZGC developers,
I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as:
[2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms
that is, we check every line for the following condition (we don't use pattern matching for speed):
if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line
However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like
[2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms
i.e. the condition should be
if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line
I wonder if this is all we need or I am missing something?
I think you also need to search for 'info' to make sure you don't get the more detailed breakdown of the pauses.
Also, what's the meaning of lines like these:
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise.
This line prints what one of the GC worker threads was doing during the Mark End pause. The "(ZWorker#2)" indicates that this a GC worker thread. You should filter out these lines if you try to collect the GC pauses.
FWIW, in the version you are running this line is printed on the 'debug' level. However, in the latest source code this has changed and worker logging has been moved to the 'trace' level. There we have this breakdown: 1) 'info' logs a full pause or concurrent phase 2) 'debug' logs a subphase of (1) 3) 'trace' logs what individual worker threads are doing
HTH, StefanK
Thank you,
Misha
<jdk17-zgc.gc.log>
Nice catch! WIP: https://github.com/xiangzhai/GCViewer/commits/fix-parent-event-is-null Thanks, Leslie Zhai
2023年9月9日 02:40,Misha Dmitriev <mdmitriev@linkedin.com> 写道:
Thank you very much, Stefan! Really appreciate your response.
Misha
From: Stefan Karlsson <stefan.karlsson@oracle.com> Sent: Thursday, September 7, 2023 11:08 PM To: Misha Dmitriev <mdmitriev@linkedin.com>; zgc-dev@openjdk.org <zgc-dev@openjdk.org> Subject: Re: What gc.log lines reliably indicate STW pauses for ZGC? Hi Misha,
On 2023-09-06 21:15, Misha Dmitriev wrote:
Hi ZGC developers,
I work at LinkedIn, where we have internal tooling for end-to-end GC performance monitoring and analysis. To record GC pauses and other interesting events, we parse GC logs as they are generated, in near real time. Since JDK 11, the unified logging format allowed us to extract STW pauses for all GCs in the same straightforward way: we look for lines such as:
[2020-03-09T19:12:22.542+0000][info][gc ] GC(62189) Pause Young (Allocation Failure) 14263M->9334M(14745M) 41.359ms
that is, we check every line for the following condition (we don't use pattern matching for speed):
if (line.endsWith("ms") && line.contains("[gc ") && line.contains("Pause ")) { // Extract pause time in ms from the end of the line
However, looks like ZGC, though it uses the unified logging format, has a different format for STW pause lines. From the logs that we have, I am not 100% sure I understand everything. So far looks like the candidates lines look like
[2023-09-05T18:47:27.323+0000][505881.615s][info ][gc,phases ] GC(76657) Pause Mark Start 0.024ms
i.e. the condition should be
if (line.endsWith("ms") && line.contains("[gc,phases ") && line.contains("Pause ")) { // Extract pause time in ms from the end of line
I wonder if this is all we need or I am missing something?
I think you also need to search for 'info' to make sure you don't get the more detailed breakdown of the pauses.
Also, what's the meaning of lines like these:
[2023-09-05T18:47:44.003+0000][505898.295s][debug ][gc,phases ] GC(76661) Pause Mark Try Complete (ZWorker#2) 0.002ms
Formally they look similar to the previous line, but "debug" and "ZWorker#2" suggests that it may not be a STW pause but pausing just one thread? Please advise.
This line prints what one of the GC worker threads was doing during the Mark End pause. The "(ZWorker#2)" indicates that this a GC worker thread. You should filter out these lines if you try to collect the GC pauses.
FWIW, in the version you are running this line is printed on the 'debug' level. However, in the latest source code this has changed and worker logging has been moved to the 'trace' level. There we have this breakdown: 1) 'info' logs a full pause or concurrent phase 2) 'debug' logs a subphase of (1) 3) 'trace' logs what individual worker threads are doing
HTH, StefanK
Thank you,
Misha
participants (3)
-
Leslie Zhai
-
Misha Dmitriev
-
Stefan Karlsson