Hello everyone.

I did my best to search the web for any prior discussion on this. Haven't found anything useful.

I am trying to understand why there is a noticeable difference between the size of all objects on the heap and the heap used (after full GC). The heap used size can be 10%+ more than the size of all live objects.

Let's say I want to fill a 1 GiB heap with 4MiB byte[] objects. Naively I'd imagine I can store 1 GiB / 4MiB = 256 such byte[] on the heap. (I made a simple program that just allocates byte[], stores it in a list, does GC and waits (so I can do jcmd or similar), nothing else.)

With EpsilonGC -> 255x 4MiB byte[] allocations, after this the app crashes with out of memory
With SerialGC -> 246x
With ParallelGC -> 233x
With G1GC -> 204x
With ZGC -> 170x

For example in the ZGC case, where I have 170x of 4MiB byte[] on the heap.

GC.heap_info:

 ZHeap           used 1022M, capacity 1024M, max capacity 1024M
 Metaspace       used 407K, committed 576K, reserved 1114112K
  class space    used 24K, committed 128K, reserved 1048576K

GC.class_histogram:

Total         15118     713971800 (~714M)

In this case does it mean ZGC is wasting 1022M - 714M = 308M for doing its "thing"? That is 308M/714M ≈ 43% overhead?

This example might be convoluted and atypical of any production environment. I am seeing the difference between live set and heap used in production at around 12.5% for the 3 servers I looked at.

Is there any other way to estimate the overhead apart from looking at the difference between the live set and heap used? Does ZGC have any internal statistic of the overhead? I'd prefer not to assume 12.5% is the number to use and then get surprised that in some case it might be 25%.

Do you have any recommendations regarding ZGC overhead when estimating heap space?

Thank you
Alen
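The exact program isn't shown in the thread; a minimal sketch of the experiment described above (the class name, the chunk-count cap, and the sleep-to-allow-jcmd step are assumptions) could look like:

```java
import java.util.ArrayList;
import java.util.List;

// Run with e.g. -Xmx1g -XX:+UseZGC (or another collector) and compare the counts.
public class HeapFill {
    static final int CHUNK = 4 * 1024 * 1024; // 4 MiB per byte[]

    // Allocate 4 MiB arrays until OOME (or maxChunks is hit); return how many fit.
    // The sink list keeps every array strongly reachable so GC cannot reclaim them.
    static int fill(List<byte[]> sink, int maxChunks) {
        try {
            while (sink.size() < maxChunks) {
                sink.add(new byte[CHUNK]);
            }
        } catch (OutOfMemoryError e) {
            // Stop allocating; keep what we already pinned.
        }
        return sink.size();
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> pinned = new ArrayList<>();
        int n = fill(pinned, Integer.MAX_VALUE);
        System.out.println("Allocated " + n + "x 4MiB byte[]");
        System.gc();                  // request a full collection
        Thread.sleep(Long.MAX_VALUE); // park so jcmd <pid> GC.heap_info can be run
    }
}
```

Note that with EpsilonGC the OutOfMemoryError is not recoverable in the same way, which matches the observed crash at 255 allocations.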
Hi Alen, On 2023-04-19 13:38, Alen Vrečko wrote:
Hello everyone.
I did my best to search the web for any prior discussion on this. Haven't found anything useful.
I am trying to understand why there is a noticeable difference between the size of all objects on the heap and the heap used (after full GC). The heap used size can be 10%+ more than the size of all live objects.
The difference can come from two sources:

1) There is unusable memory in the regions, caused by address and size alignment requirements. (This is probably what you see in your 4MB array test.)

2) We only calculate the `live` value when we mark through objects on the heap. While ZGC runs a cycle it more or less ignores new objects that the Java threads allocate during the GC cycle. This means that the `used` value will increase, but the `live` value will not be updated until the Mark End of the next GC cycle.

The latter also makes it a bit misleading to look at the used value after the GC cycle. With stop-the-world GCs, that used value is an OK approximation of what is live on the heap. With a concurrent GC it includes all the memory allocated during the GC cycle. The used number is still true, but some (many) of those allocated objects were short-lived and died; we won't be able to figure that out until the next GC cycle.

I don't know if you have seen this, but we have a table where we try to give you a picture of how the values progressed during the GC cycle:

[6.150s][1681984168056ms][info ][gc,heap ] GC(0)                Mark Start       Mark End  Relocate Start   Relocate End         High          Low
[6.150s][1681984168056ms][info ][gc,heap ] GC(0)  Capacity:   3282M (10%)    3536M (11%)     3580M (11%)    3612M (11%)   3612M (11%)  3282M (10%)
[6.150s][1681984168056ms][info ][gc,heap ] GC(0)      Free:  28780M (90%)   28538M (89%)    28720M (90%)   29066M (91%)  29068M (91%) 28434M (89%)
[6.150s][1681984168056ms][info ][gc,heap ] GC(0)      Used:   3234M (10%)    3476M (11%)     3294M (10%)     2948M (9%)   3580M (11%)   2946M (9%)
[6.150s][1681984168056ms][info ][gc,heap ] GC(0)      Live:         -         2496M (8%)      2496M (8%)      2496M (8%)        -            -
[6.150s][1681984168056ms][info ][gc,heap ] GC(0) Allocated:         -          242M (1%)       364M (1%)       411M (1%)        -            -
[6.150s][1681984168056ms][info ][gc,heap ] GC(0)   Garbage:         -          737M (2%)       433M (1%)        39M (0%)        -            -
[6.150s][1681984168056ms][info ][gc,heap ] GC(0) Reclaimed:         -              -           304M (1%)       697M (2%)        -            -

The `garbage` at Mark End is a diff between what was `used` when the GC cycle started and what we later found to be `live` in that used memory.
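A minimal sketch checking that bookkeeping against the numbers in the table above (values in MB as logged; the class and method names are made up, and the log rounds, so the garbage figure is off by about 1MB):

```java
// Sanity-check the relationships between the GC(0) table columns.
public class GcTableCheck {
    // Garbage found at Mark End = used when the cycle started - live objects found.
    static long garbage(long usedAtMarkStart, long liveAtMarkEnd) {
        return usedAtMarkStart - liveAtMarkEnd;
    }

    // Used grows during marking by whatever the Java threads allocate meanwhile.
    static long usedAtMarkEnd(long usedAtMarkStart, long allocatedDuringMark) {
        return usedAtMarkStart + allocatedDuringMark;
    }

    public static void main(String[] args) {
        System.out.println(garbage(3234, 2496));      // 738, ~737M in the log
        System.out.println(usedAtMarkEnd(3234, 242)); // 3476, matches the table
    }
}
```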
Let's say I want to fill 1 GiB heap with 4MiB byte[] objects.
Naively I'd imagine I can store 1 GiB / 4MiB = 256 such byte[] on the heap.
(I made a simple program that just allocates byte[], stores it in a list, does GC and waits (so I can do jcmd or similar), nothing else)
With EpsilonGC -> 255x 4MiB byte[] allocations, after this the app crashes with out of memory
With SerialGC -> 246x
With ParallelGC -> 233x
With G1GC -> 204x
With ZGC -> 170x
For example in the ZGC case, where I have 170x of 4MiB byte[] on the heap.
GC.heap_info:
 ZHeap           used 1022M, capacity 1024M, max capacity 1024M
 Metaspace       used 407K, committed 576K, reserved 1114112K
  class space    used 24K, committed 128K, reserved 1048576K
GC.class_histogram:
Total 15118 713971800 (~714M)
In this case does it mean ZGC is wasting 1022M - 714M = 308M for doing its "thing"? That is 308M/714M ≈ 43% overhead?
My guess is that the object header (typically 16 bytes) pushed the object's size slightly beyond 4MB. ZGC allocates large objects in their own regions. Those regions are 2MB aligned, which makes your ~4MB objects `use` 6MB. You would probably see similar results with G1 when the heap region size is increased, which happens when the heap max size is larger. You can test that by explicitly running with -XX:G1HeapRegionSize=2MB, to use a larger heap region size.
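That arithmetic also recovers the observed count: a 4MiB array plus a ~16-byte header spans three 2MiB granules, i.e. 6MiB of heap, and 1024M / 6M ≈ 170. A minimal sketch of that rounding (the header size and 2MiB alignment are taken from the explanation above; the class and method names are made up):

```java
public class GranuleMath {
    static final long GRANULE = 2L << 20; // 2 MiB alignment for large-object regions

    // Round a size up to the next 2 MiB granule boundary.
    static long alignedUse(long objectBytes) {
        return ((objectBytes + GRANULE - 1) / GRANULE) * GRANULE;
    }

    public static void main(String[] args) {
        long header  = 16;                // typical object header + array length
        long payload = 4L * 1024 * 1024;  // 4 MiB of array data
        long use = alignedUse(payload + header);
        System.out.println(use / (1 << 20) + " MiB per array"); // 6 MiB
        System.out.println(1024 / (use / (1 << 20)) + " arrays fit in 1 GiB"); // 170
    }
}
```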
This example might be convoluted and atypical of any production environment.
I am seeing the difference between live set and heap used in production at around 12.5% for the 3 servers I looked at.
Is there any other way to estimate the overhead apart from looking at the difference between the live set and heap used? Does ZGC have any internal statistic of the overhead?
I don't think we have a way to differentiate between the overhead caused by (1) and (2) above.
I'd prefer not to assume 12.5% is the number to use and then get surprised that in some case it might be 25%?
The overhead of yet-to-be-collected garbage can easily be above 25%. It all depends on the workload. We strive to keep the fragmentation below the -XX:ZFragmentationLimit, which is set to 25% by default, but that doesn't include the overhead of newly allocated objects (and it doesn't include the large objects).
Do you have any recommendations regarding ZGC overhead when estimating heap space?
Unfortunately, I can't. It depends on how intricate the object graph is (meaning how long it will take to mark through it), how many live objects you have, the allocation rate, number of cores, etc. There's a constant race between the GC and the allocating Java threads. If the Java threads "win", and use up all memory before the GC can mark through the object graph and then give back memory to the Java application, then the Java threads will stall waiting for more memory. You need to test with your workload and see if you've given enough heap memory to allow ZGC to complete its cycles without causing allocation stalls. Thanks, StefanK
Thank you Alen
Hi Stefan. Thank you for the explanations. Makes sense.

Alen V

On Thu, 20 Apr 2023 at 17:15, Stefan Karlsson <stefan.karlsson@oracle.com> wrote:
participants (2)
- Alen Vrečko
- Stefan Karlsson