ZGC logs

Wed Apr 22 10:46:28 UTC 2020

Hi Kirk,

On 4/21/20 11:23 PM, Kirk Pepperdine wrote:
> Hi,
> 
> I’ve been looking at GC log data trying to reconcile some of the numbers.
> 
> [4.719s][info ][gc,heap                     ] GC(3)                    Mark Start           Mark End     Relocate Start       Relocate End               High                Low
> [4.719s][info ][gc,heap                     ] GC(3)  Capacity:        1032M (25%)        1176M (29%)        1176M (29%)        1176M (29%)        1176M (29%)        1032M (25%)
> [4.719s][info ][gc,heap                     ] GC(3)   Reserve:           42M (1%)           42M (1%)           42M (1%)           42M (1%)           42M (1%)           42M (1%)
> [4.719s][info ][gc,heap                     ] GC(3)      Free:        3064M (75%)        2954M (72%)        3818M (93%)        3704M (90%)        3842M (94%)        2920M (71%)
> [4.719s][info ][gc,heap                     ] GC(3)      Used:         990M (24%)        1100M (27%)          236M (6%)          350M (9%)        1134M (28%)          212M (5%)
> [4.719s][info ][gc,heap                     ] GC(3)      Live:                  -           10M (0%)           10M (0%)           10M (0%)                  -                  -
> [4.719s][info ][gc,heap                     ] GC(3) Allocated:                  -          208M (5%)          240M (6%)         764M (19%)                  -                  -
> [4.719s][info ][gc,heap                     ] GC(3)   Garbage:                  -         979M (24%)           83M (2%)            5M (0%)                  -                  -
> [4.719s][info ][gc,heap                     ] GC(3) Reclaimed:                  -                  -         896M (22%)         974M (24%)                  -                  -
> 
> 
> If I understand what this log is telling me, there is initially 990M of data, of which 10M is marked Live, leaving 979M of garbage (-1 rounding error). During the concurrent mark an additional 208M of data was allocated, requiring an additional 144M from committed memory. By Mark End, used had increased by 110M, not 208M. That leaves 98M unaccounted for. I have some ideas, but I’d feel more comfortable if someone could offer a comment on how to reconcile the live, allocated, garbage and reclaimed numbers with used. For example, I would assume that the 350M remaining at Relocate End is a mix of live and floating garbage. That said, I cannot seem to reconcile the used, allocated, and reclaimed numbers in a way that yields 350M. The calculation that seems to work is used at Mark End + allocated - reclaimed = used at Relocate Start. That is, 1100 + (240-208) - 896 = 236. However, applying that logic shifted over fails: 236 + (764-240) - (974-896) = 682 != 350.
> 
> So I seem to be missing bits that my staring at zStat hasn’t cleared up.
> 
> All comments appreciated.

What's confusing, as it doesn't show in the log, is that the "allocated"
number can be inflated because some page allocations were "undone".
This can happen when, for example, two Java threads compete to allocate
a new shared medium page: both threads allocate one medium page each,
but only one thread wins the race to install its page. The thread that
lost the race then immediately frees its newly allocated page (undoes
the allocation). However, we only increase the "allocated" number when
pages are allocated; we don't decrease it when we undo an allocation.
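
To make the accounting concrete, here is a minimal model of the two
counters. It is only a sketch, not the actual ZPageAllocator code: the
names PageCounters, allocate_page, undo_page and
simulate_medium_page_race are made up for illustration, and the only
thing taken from the description above is the behavior (an allocation
bumps both "used" and "allocated", an undo lowers "used" only).

```cpp
#include <cassert>
#include <cstddef>

// Simplified, hypothetical model of the accounting described above.
// This is not the real ZPageAllocator; all names are illustrative.
struct PageCounters {
  std::size_t used = 0;       // bytes currently in use
  std::size_t allocated = 0;  // bytes counted as "allocated" in the log

  // Every page allocation bumps both counters.
  void allocate_page(std::size_t size) {
    used += size;
    allocated += size;
  }

  // Undoing an allocation gives the page back, but leaves
  // "allocated" untouched. This is what inflates the logged number.
  void undo_page(std::size_t size) {
    assert(size <= used);
    used -= size;
  }
};

// Two threads race to install a shared page of the given size.
// Both allocate a page; the loser immediately undoes its allocation.
inline PageCounters simulate_medium_page_race(std::size_t page_size) {
  PageCounters counters;
  counters.allocate_page(page_size);  // thread A allocates
  counters.allocate_page(page_size);  // thread B allocates
  counters.undo_page(page_size);      // the loser frees its page again
  return counters;
}
```

Calling simulate_medium_page_race with some page size S leaves "used" at
S but "allocated" at 2 * S, which is the kind of gap you are seeing
between the two columns.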

You can confirm that it's caused by "undo" by looking in the statistics
table for the "Memory: Undo Page Allocation" line, which should show
non-zero numbers. The need to undo allocations is usually a rare event,
at least on most workloads.

One could argue that the "allocated" number is correct (in the sense
that these pages were in fact allocated), but I think it would be more
helpful to log the "allocated - undone" number, as the undo part is more
of an artifact of how things work internally, and not what a user
expects to see.
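
As a rough check against the Mark Start and Mark End columns of your
log, the arithmetic can be sketched as below. It assumes that undone
allocations are the only thing separating "allocated" from the growth
in "used" during marking (the log shows nothing reclaimed before Mark
End); that assumption is mine and should be checked against the
statistics table. The constant and function names are made up for
illustration.

```cpp
#include <cassert>

// Values in megabytes, read from the GC(3) heap table.
const long kUsedMarkStart = 990;
const long kUsedMarkEnd = 1100;
const long kAllocatedMarkEnd = 208;

// Growth in "used" over the mark phase.
inline long used_growth(long used_start, long used_end) {
  return used_end - used_start;
}

// Assumption: nothing was reclaimed in this interval, so any gap
// between "allocated" and the growth in "used" is undone allocations.
inline long implied_undone(long allocated, long growth) {
  assert(allocated >= growth);
  return allocated - growth;
}

// The "allocated - undone" figure that would be more helpful to log.
inline long adjusted_allocated(long allocated, long undone) {
  return allocated - undone;
}
```

With the numbers above, used_growth gives 110M and implied_undone gives
98M, the same 98M you could not account for. If the assumption holds,
the adjusted "allocated" at Mark End would be 110M.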

Here's a patch to adjust "allocated" so that it takes undone allocations 
into account. Feel free to try it out and see if the numbers make more 
sense.

diff --git a/src/hotspot/share/gc/z/zPageAllocator.cpp b/src/hotspot/share/gc/z/zPageAllocator.cpp
--- a/src/hotspot/share/gc/z/zPageAllocator.cpp
+++ b/src/hotspot/share/gc/z/zPageAllocator.cpp
@@ -287,12 +287,14 @@
  }

  void ZPageAllocator::decrease_used(size_t size, bool reclaimed) {
+  // Only pages explicitly released with the reclaimed flag set
+  // counts as reclaimed bytes. This flag is true when a worker
+  // releases a page after relocation, and is false when we
+  // release a page to undo an allocation.
    if (reclaimed) {
-    // Only pages explicitly released with the reclaimed flag set
-    // counts as reclaimed bytes. This flag is typically true when
-    // a worker releases a page after relocation, and is typically
-    // false when we release a page to undo an allocation.
      _reclaimed += size;
+  } else {
+    _allocated -= size;
    }
    _used -= size;
    if (_used < _used_low) {


cheers,
Per

> 
> Kind regards,
> Kirk
>