Why does the load average on the host increase when an Allocation Stall happens?
Hi,

I have noticed that the load average on the host increases 5 to 10 times when an Allocation Stall happens, and I am trying to understand what causes the load average to increase when this happens.

Looking at the code in zPageAllocator.cpp:

    do {
      // Start asynchronous GC
      ZCollectedHeap::heap()->collect(GCCause::_z_allocation_stall);

      // Wait for allocation to complete or fail
      page = request.wait();
    } while (page == gc_marker);

It seems request.wait() internally does a get call on ZFuture.

1. Will this thread spin on the CPU, or is it async (meaning this thread will go to sleep and be woken up when the page is ready, so other processes can occupy this CPU)?

2. Since the load average increase matches exactly with the allocation stall, is there any other operation (like flushing pages) that can cause this behavior?

Since I haven't enabled the "gc,stats" tag in my logging, I missed some information there. I will try to get that information when I can reproduce it.

TIA,
Sundar
* Sundara Mohan M.:
I have noticed that the load average on the host increases 5 to 10 times when an Allocation Stall happens, and I am trying to understand what causes the load average to increase when this happens.
2. Since the load average increase matches exactly with the allocation stall, is there any other operation (like flushing pages) that can cause this behavior?
I don't know ZGC internals, but I think it stalls the application when the GC cannot keep up. This is a last resort. Before that happens, more GC threads will try hard to reclaim memory. That work increases system load.

An alternative explanation could be that something else consumes CPU resources, taking them away from the GC threads, so that they cannot keep up, and ZGC has to introduce allocation stalls.
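Both explanations above predict more runnable threads around the time of a stall. On Linux, the load average counts tasks that are runnable or in uninterruptible sleep, so sampling /proc/loadavg next to the GC log timestamps is one way to see which threads are contributing. The following is a minimal sketch, assuming a Linux host; LoadSample, parse_loadavg and read_loadavg are hypothetical names introduced for illustration and are not part of the JDK.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// One sample of /proc/loadavg (hypothetical helper type, Linux only).
struct LoadSample {
  double load1;   // 1-minute load average
  double load5;   // 5-minute load average
  double load15;  // 15-minute load average
  int runnable;   // currently runnable scheduling entities
  int total;      // scheduling entities that currently exist
};

// Parses a line such as "0.20 0.18 0.12 1/80 11206".
// Returns false if the line does not have the expected shape.
bool parse_loadavg(const std::string& line, LoadSample& out) {
  std::istringstream in(line);
  char slash = 0;
  if (!(in >> out.load1 >> out.load5 >> out.load15
           >> out.runnable >> slash >> out.total)) {
    return false;
  }
  return slash == '/';
}

// Reads and parses the current sample; returns false if the file is
// missing (for example on a non-Linux host) or malformed.
bool read_loadavg(LoadSample& out) {
  std::ifstream file("/proc/loadavg");
  std::string line;
  if (!std::getline(file, line)) {
    return false;
  }
  return parse_loadavg(line, out);
}
```

Calling read_loadavg periodically and logging the runnable count with a timestamp makes it possible to line the samples up against "Allocation Stall" entries in the GC log.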
On 11/14/19 7:58 PM, Sundara Mohan M wrote:
Hi, I have noticed that the load average on the host increases 5 to 10 times when an Allocation Stall happens, and I am trying to understand what causes the load average to increase when this happens.
It's impossible to say with certainty without inspecting what's actually going on in the system. Florian's explanations are good. It could just be that your application workload is peaking, which in turn causes the allocation stalls.
Looking at the code in zPageAllocator.cpp:

    do {
      // Start asynchronous GC
      ZCollectedHeap::heap()->collect(GCCause::_z_allocation_stall);

      // Wait for allocation to complete or fail
      page = request.wait();
    } while (page == gc_marker);

It seems request.wait() internally does a get call on ZFuture.

1. Will this thread spin on the CPU, or is it async (meaning this thread will go to sleep and be woken up when the page is ready, so other processes can occupy this CPU)?

It's async.

/Per

2. Since the load average increase matches exactly with the allocation stall, is there any other operation (like flushing pages) that can cause this behavior?

Since I haven't enabled the "gc,stats" tag in my logging, I missed some information there. I will try to get that information when I can reproduce it.

TIA,
Sundar
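Per's answer that the wait is async means the stalled thread sleeps until it is woken, rather than spinning on a CPU. The sketch below illustrates that pattern with standard C++ primitives. It is not HotSpot's ZFuture: BlockingFuture and wait_for_value are hypothetical names, and the use of a mutex plus condition variable is an assumption made for the example.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a future whose get() blocks instead of spinning.
// This is NOT HotSpot's ZFuture; it only illustrates the sleep/wake pattern.
template <typename T>
class BlockingFuture {
public:
  // Called by the producer (in the analogy, the GC side) once the value is ready.
  void set(T value) {
    {
      std::lock_guard<std::mutex> lock(_mutex);
      _value = value;
      _ready = true;
    }
    _cv.notify_one();  // wake the sleeping waiter, if any
  }

  // Called by the consumer (in the analogy, the stalled thread). While
  // waiting, the thread is descheduled, so other work can use the CPU.
  T get() {
    std::unique_lock<std::mutex> lock(_mutex);
    _cv.wait(lock, [this] { return _ready; });
    return _value;
  }

private:
  std::mutex _mutex;
  std::condition_variable _cv;
  bool _ready = false;
  T _value{};
};

// Starts a producer that delivers `value` after `delay_ms` milliseconds,
// then blocks the calling thread in get() until the value arrives.
int wait_for_value(int value, int delay_ms) {
  BlockingFuture<int> future;
  std::thread producer([&future, value, delay_ms] {
    std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
    future.set(value);
  });
  const int result = future.get();
  producer.join();
  return result;
}
```

A thread blocked this way is in an ordinary sleep, so on Linux it should not add to the load average by itself while it waits.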
Thank you for the clarification. I will try to get more GC logs and system information during that time to get more detail.

Thanks,
Sundar

On Mon, Nov 18, 2019 at 12:59 AM Per Liden <per.liden@oracle.com> wrote:
[...]
participants (3)
- Florian Weimer
- Per Liden
- Sundara Mohan M