From nadav.wiener at gmail.com  Tue Nov  4 11:41:03 2014
From: nadav.wiener at gmail.com (Nadav Wiener)
Date: Tue, 04 Nov 2014 11:41:03 +0000
Subject: Flooded ConcurrentLinkedQueue leads to long minor GCs
Message-ID: 

Our setup: a queue with multiple producer threads and a single
consumer thread. Producers add entries (simple value objects) to the
queue at potentially high rates. A single consumer thread is triggered
periodically to drain the queue. (A stripped-down sketch of the setup
is appended at the end of this message.)

For most workloads, this setup has worked fine until now. Recently, a
use case arose where producer output "got dialled all the way up to
11" for a period of several minutes, and the consumer was no longer
able to keep up. At that point the queue starts backing up, with
entries being retained for 1-2 minutes. By itself this isn't a severe
problem -- we use the queue for deferred tasks. However, the retained
queue entries seem to severely impact minor GC pause times (over 1
second), and therefore overall system latency.

I am asking for the mailing list's help in understanding the exact
mechanics behind the long minor collection pauses.

1) Our current understanding is that each minor GC has to deal with a
large number of objects that survive from one collection to the next,
leading to heavy copying between survivor spaces in our relatively
large young generation (1 GB, split evenly between eden and the
survivor spaces).

2) Another explanation I pursued earlier was that queue entries were
being held long enough to become tenured. At that point, the old
generation collector (we use CMS) would start interacting with minor
GCs, increasing dirty card handling time. I'll add that the overall
impact on old generation GC timings is negligible.

#2 might be an additional contributing factor, but the gradual buildup
of minor GC times seems to indicate #1 is the more significant one.

This is what we are doing for the specific queue in question:

1) Enforcing a (large) strict cap on the queue. The cap should be
large enough to absorb floods; in the event the queue does fill up, we
opt to discard all queue contents rather than reject new entries
sporadically. (Sketch appended below.)

2) Migrating to an LMAX Disruptor based implementation --
microbenchmarks indicate this produces almost no garbage at all (a few
kilobytes per collection). (Also sketched below.)

We'd really appreciate corroboration of this, or alternative
explanations and suggestions. It's important for us to have a clear
understanding of what's going on, and peer review of the proposed
solutions, as this will influence further changes to our system.

Suggestions for GC-friendly implementations of unbounded (dynamically
growing) queues would also be welcome.

Thanks,
Nadav Wiener.
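
P.S. A few sketches follow for concreteness. First, a stripped-down
version of the current setup -- class and method names, the drain
interval, and the payload are all simplified/illustrative; the real
consumer does more than this:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class DeferredTaskQueue {

        // Simple value object; under flood conditions millions of these
        // can be alive in the queue at once, surviving many minor GCs.
        static final class Entry {
            final long payload;
            Entry(long payload) { this.payload = payload; }
        }

        // Unbounded MPSC usage: many producers offer, one consumer drains.
        private final Queue<Entry> queue = new ConcurrentLinkedQueue<>();

        // Called concurrently from many producer threads.
        public void submit(long payload) {
            queue.offer(new Entry(payload));
        }

        // A single consumer, triggered periodically, drains everything queued.
        public void startConsumer(ScheduledExecutorService scheduler) {
            scheduler.scheduleWithFixedDelay(this::drain, 0, 100,
                    TimeUnit.MILLISECONDS);
        }

        private void drain() {
            Entry e;
            while ((e = queue.poll()) != null) {
                process(e);
            }
        }

        private void process(Entry e) {
            // ... actual deferred-task handling elided ...
        }

        public static void main(String[] args) throws InterruptedException {
            DeferredTaskQueue q = new DeferredTaskQueue();
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            q.startConsumer(scheduler);
            q.submit(42L);
            Thread.sleep(500);
            scheduler.shutdown();
        }
    }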
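
For reference, the young generation layout from explanation #1
corresponds roughly to flags along these lines. Only the 1 GB young
gen and CMS are from our actual setup; the 4 GB total heap is an
illustrative placeholder. -XX:SurvivorRatio=2 is what yields a 512 MB
eden plus two 256 MB survivor spaces, i.e. the even eden/survivor
split. -XX:+PrintTenuringDistribution prints object age histograms at
each minor GC, which is the kind of diagnostic that can confirm or
rule out the premature tenuring from explanation #2:

    java -Xms4g -Xmx4g -Xmn1g -XX:SurvivorRatio=2 \
         -XX:+UseConcMarkSweepGC \
         -XX:+PrintGCDetails -XX:+PrintTenuringDistribution \
         -jar app.jar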
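
The strict cap from fix #1 would look roughly like this (a minimal
sketch). We track the size in a separate counter because
ConcurrentLinkedQueue.size() is O(n), and the discard is deliberately
racy -- concurrent producers hitting the cap at the same time may drop
a few fresh entries too, which is acceptable for deferred tasks:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicInteger;

    /**
     * Bounded MPSC queue. When the cap is exceeded, the whole backlog
     * is discarded rather than rejecting new entries sporadically.
     */
    final class DiscardingBoundedQueue<E> {

        private final Queue<E> queue = new ConcurrentLinkedQueue<>();
        private final AtomicInteger size = new AtomicInteger();
        private final int capacity;

        DiscardingBoundedQueue(int capacity) {
            this.capacity = capacity;
        }

        // Called by producers.
        void offer(E entry) {
            if (size.incrementAndGet() > capacity) {
                // Flood: drop everything currently buffered.
                E discarded;
                while ((discarded = queue.poll()) != null) {
                    size.decrementAndGet();
                }
            }
            queue.offer(entry);
        }

        // Called by the single consumer.
        E poll() {
            E e = queue.poll();
            if (e != null) {
                size.decrementAndGet();
            }
            return e;
        }
    }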
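
And the direction of fix #2, written against the Disruptor 3.x DSL
(ring size, wait strategy defaults and names are placeholders). The
relevant property is that the ring's event objects are preallocated
once and mutated in place, so a flood recycles the same entries
instead of allocating new ones:

    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.RingBuffer;
    import com.lmax.disruptor.dsl.Disruptor;
    import java.util.concurrent.Executors;

    public class DisruptorTaskQueue {

        // Mutable event slot: allocated once per ring position, then reused.
        static final class TaskEvent {
            long payload;
        }

        public static void main(String[] args) {
            // Ring size must be a power of two; it doubles as the strict cap.
            Disruptor<TaskEvent> disruptor = new Disruptor<>(
                    TaskEvent::new,   // factory runs once per slot, up front
                    1 << 20,          // placeholder size
                    Executors.defaultThreadFactory());

            // Single consumer thread, as in the current design.
            disruptor.handleEventsWith(
                    (EventHandler<TaskEvent>) (event, sequence, endOfBatch) ->
                            process(event.payload));
            disruptor.start();

            RingBuffer<TaskEvent> ring = disruptor.getRingBuffer();

            // Producers publish by mutating a preallocated slot -- no
            // allocation on the hot path, hence almost no garbage under flood.
            ring.publishEvent((event, sequence, value) -> event.payload = value,
                    42L);

            disruptor.shutdown();   // drains outstanding events, then stops
        }

        private static void process(long payload) {
            // ... actual deferred-task handling elided ...
        }
    }

One difference from the discard-all cap: when the ring is full,
publishEvent blocks the producer until a slot frees up; Disruptor also
offers tryPublishEvent for a fail-instead-of-block policy.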