From nadav.wiener at gmail.com  Tue Nov  4 11:41:03 2014
From: nadav.wiener at gmail.com (Nadav Wiener)
Date: Tue, 04 Nov 2014 11:41:03 +0000
Subject: Flooded ConcurrentLinkedQueue leads to long minor GCs
Message-ID: 

Our setup: a queue with multiple producer threads and a single
consumer thread. Producers add entries (simple value objects) to the
queue at potentially high rates. A single consumer thread is triggered
periodically to drain the queue. (A stripped-down sketch of the setup
is appended at the end of this message.)

For most workloads, this setup has worked fine until now. Recently, a
use case arose where producer output "got dialled all the way up to
11" for a period of several minutes, and the consumer was no longer
able to keep up. At that point the queue starts backing up, with
entries being retained for 1-2 minutes. By itself this isn't a severe
problem -- we use the queue for deferred tasks. However, the retained
queue entries seem to severely impact minor GC pause times (over 1
second), and therefore overall system latency.

I am asking for the mailing list's help in understanding the exact
mechanics behind the long minor collection pauses.

1) Our current understanding is that each minor GC has to deal with a
large number of objects that survive from one collection to the next,
leading to heavy copying between survivor spaces in our relatively
large young generation (1 GB, split evenly between eden and the
survivor spaces).

2) Another explanation I pursued earlier was that queue entries were
being held long enough to become tenured. At that point, the old
generation collector (we use CMS) would start interacting with minor
GCs, increasing dirty card handling time. I'll add that the overall
impact on old generation GC timings is negligible.

#2 might be an additional contributing factor, but the gradual buildup
of minor GC times seems to indicate #1 is the more significant one.

This is what we are doing for the specific queue in question:

1) Enforcing a (large) strict cap on the queue. The cap should be
large enough to absorb floods; in the event the queue does fill up, we
opt to discard all queue contents rather than reject new entries
sporadically. (Sketch appended below.)

2) Migrating to an LMAX Disruptor based implementation --
microbenchmarks indicate this produces almost no garbage at all (a few
kilobytes per collection). (Also sketched below.)

We'd really appreciate corroboration of this, or alternative
explanations and suggestions. It's important for us to have a clear
understanding of what's going on, and peer review of the proposed
solutions, as this will influence further changes to our system.

Suggestions for GC-friendly implementations of unbounded (dynamically
growing) queues would also be welcome.

Thanks,
Nadav Wiener.
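
P.S. A few sketches follow for concreteness. First, a stripped-down
version of the current setup -- class and method names, the drain
interval, and the payload are all simplified/illustrative; the real
consumer does more than this:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class DeferredTaskQueue {

        // Simple value object; under flood conditions millions of these
        // can be alive in the queue at once, surviving many minor GCs.
        static final class Entry {
            final long payload;
            Entry(long payload) { this.payload = payload; }
        }

        // Unbounded MPSC usage: many producers offer, one consumer drains.
        private final Queue<Entry> queue = new ConcurrentLinkedQueue<>();

        // Called concurrently from many producer threads.
        public void submit(long payload) {
            queue.offer(new Entry(payload));
        }

        // A single consumer, triggered periodically, drains everything queued.
        public void startConsumer(ScheduledExecutorService scheduler) {
            scheduler.scheduleWithFixedDelay(this::drain, 0, 100,
                    TimeUnit.MILLISECONDS);
        }

        private void drain() {
            Entry e;
            while ((e = queue.poll()) != null) {
                process(e);
            }
        }

        private void process(Entry e) {
            // ... actual deferred-task handling elided ...
        }

        public static void main(String[] args) throws InterruptedException {
            DeferredTaskQueue q = new DeferredTaskQueue();
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            q.startConsumer(scheduler);
            q.submit(42L);
            Thread.sleep(500);
            scheduler.shutdown();
        }
    }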
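
For reference, the young generation layout from explanation #1
corresponds roughly to flags along these lines. Only the 1 GB young
gen and CMS are from our actual setup; the 4 GB total heap is an
illustrative placeholder. -XX:SurvivorRatio=2 is what yields a 512 MB
eden plus two 256 MB survivor spaces, i.e. the even eden/survivor
split. -XX:+PrintTenuringDistribution prints object age histograms at
each minor GC, which is the kind of diagnostic that can confirm or
rule out the premature tenuring from explanation #2:

    java -Xms4g -Xmx4g -Xmn1g -XX:SurvivorRatio=2 \
         -XX:+UseConcMarkSweepGC \
         -XX:+PrintGCDetails -XX:+PrintTenuringDistribution \
         -jar app.jar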
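
The strict cap from fix #1 would look roughly like this (a minimal
sketch). We track the size in a separate counter because
ConcurrentLinkedQueue.size() is O(n), and the discard is deliberately
racy -- concurrent producers hitting the cap at the same time may drop
a few fresh entries too, which is acceptable for deferred tasks:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicInteger;

    /**
     * Bounded MPSC queue. When the cap is exceeded, the whole backlog
     * is discarded rather than rejecting new entries sporadically.
     */
    final class DiscardingBoundedQueue<E> {

        private final Queue<E> queue = new ConcurrentLinkedQueue<>();
        private final AtomicInteger size = new AtomicInteger();
        private final int capacity;

        DiscardingBoundedQueue(int capacity) {
            this.capacity = capacity;
        }

        // Called by producers.
        void offer(E entry) {
            if (size.incrementAndGet() > capacity) {
                // Flood: drop everything currently buffered.
                E discarded;
                while ((discarded = queue.poll()) != null) {
                    size.decrementAndGet();
                }
            }
            queue.offer(entry);
        }

        // Called by the single consumer.
        E poll() {
            E e = queue.poll();
            if (e != null) {
                size.decrementAndGet();
            }
            return e;
        }
    }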
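
And the direction of fix #2, written against the Disruptor 3.x DSL
(ring size, wait strategy defaults and names are placeholders). The
relevant property is that the ring's event objects are preallocated
once and mutated in place, so a flood recycles the same entries
instead of allocating new ones:

    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.RingBuffer;
    import com.lmax.disruptor.dsl.Disruptor;
    import java.util.concurrent.Executors;

    public class DisruptorTaskQueue {

        // Mutable event slot: allocated once per ring position, then reused.
        static final class TaskEvent {
            long payload;
        }

        public static void main(String[] args) {
            // Ring size must be a power of two; it doubles as the strict cap.
            Disruptor<TaskEvent> disruptor = new Disruptor<>(
                    TaskEvent::new,   // factory runs once per slot, up front
                    1 << 20,          // placeholder size
                    Executors.defaultThreadFactory());

            // Single consumer thread, as in the current design.
            disruptor.handleEventsWith(
                    (EventHandler<TaskEvent>) (event, sequence, endOfBatch) ->
                            process(event.payload));
            disruptor.start();

            RingBuffer<TaskEvent> ring = disruptor.getRingBuffer();

            // Producers publish by mutating a preallocated slot -- no
            // allocation on the hot path, hence almost no garbage under flood.
            ring.publishEvent((event, sequence, value) -> event.payload = value,
                    42L);

            disruptor.shutdown();   // drains outstanding events, then stops
        }

        private static void process(long payload) {
            // ... actual deferred-task handling elided ...
        }
    }

One difference from the discard-all cap: when the ring is full,
publishEvent blocks the producer until a slot frees up; Disruptor also
offers tryPublishEvent for a fail-instead-of-block policy.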