RFR: 8133051: Concurrent refinement threads may be activated and deactivated at random
Kim Barrett
kim.barrett at oracle.com
Wed Apr 6 02:15:45 UTC 2016
> On Apr 5, 2016, at 9:02 AM, Kim Barrett <kim.barrett at oracle.com> wrote:
> I've been so focused on the many-refinement-thread problems that I
> forgot to consider the small-number-of-threads case here. Please hold
> off on reviewing...
[Taking off hold; I've figured out how to deal with the small number
of threads case, and it turns out it helps the large number of threads
case too. I've updated the description of the changes from the
initial RFR to account for the change to the activation of the primary
concurrent refinement thread.]
Please review this change to the G1 concurrent refinement thread
controller. This change addresses unnecessary activation when there
are many threads and few buffers to be processed. It also addresses
delayed activation due to mis-configuration of the dirty card queue
set's notification mechanism.
This change continues to use (more or less) the existing control
model, only avoiding obviously wasted effort or undesirable delays.
Further enhancements to the control model will be made under
JDK-8137022 or subtasks from that.
- Changed the G1 concurrent refinement thread activation controller to
use a minimum buffer count step between (de)activation values for the
threads. This is accomplished by having a minimum yellow zone size,
based on the number of refinement threads. This avoids waking up more
refinement threads than there are buffers available to process. (It
is, of course, still possible for a refinement thread to wake up and
discover it has no work to do, because of progress by other threads.
But at least we're no longer waking up threads with a near guarantee
they won't find work to do.)
- As part of the above, changed G1ConcRefinementThresholdStep to have
a minimum value of 1, a default value of 2, and to be used to
determine a lower bound on the thread activation step size. A larger
step size makes it less likely a thread will be woken up and discover
other threads have already completed the work "allocated" to it. Too
large a minimum may overly restrict the number of refinement threads
being activated, leading to missed pause targets.
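To make the step arithmetic concrete, here is a rough sketch of the
idea described in the two items above (names, signatures, and helpers
are illustrative only, not the actual code in the webrev):

  #include <algorithm>
  #include <cstddef>

  // threshold_step stands in for G1ConcRefinementThresholdStep (now >= 1).
  // Giving the yellow zone a minimum size of one step per refinement
  // thread keeps consecutive activation thresholds at least
  // threshold_step buffers apart, so we don't wake more threads than
  // there are buffers to process.
  size_t min_yellow_zone_size(unsigned num_threads, size_t threshold_step) {
    return num_threads * threshold_step;
  }

  // Worker i activates when the number of enqueued buffers reaches
  // roughly green_zone + (i + 1) * step, and deactivates again when
  // the count falls back toward green_zone + i * step.
  size_t activation_threshold(size_t green_zone,
                              size_t yellow_zone,
                              unsigned num_threads,
                              unsigned worker_i,
                              size_t threshold_step) {
    size_t per_thread = (yellow_zone - green_zone) / num_threads;
    size_t step = std::max(per_thread, threshold_step);
    return green_zone + (worker_i + 1) * step;
  }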
- Changed the threshold for activation of the primary concurrent
refinement thread via notification from the dirty card queue set upon
enqueue of a new buffer. It was previously using a notification
threshold of green_zone * (1 + predictor_sigma). This could produce a
significantly larger activation threshold, particularly as the
green_zone value grows, allowing a much larger number of pending
buffers to accumulate for pause-time update_rs to process, leading to
missed update_rs time targets and unnecessary back pressure on the
green_zone size.
Instead we now start with the normal activation threshold for the
primary thread, calculated using the green_zone value and threshold
steps. We limit that using ParallelGCThreads (the number of threads
used by the update_rs phase), possibly running the thread more
aggressively than the normal threshold would suggest, again to limit
the excess over the green_zone value that might be found by update_rs.
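For concreteness, a rough sketch of that calculation (names are
illustrative only, not the actual code in the webrev):

  #include <algorithm>
  #include <cstddef>

  // parallel_gc_threads stands in for ParallelGCThreads, the number of
  // workers available to the pause-time update_rs phase.  Capping the
  // primary (worker 0) step at that count may activate the thread more
  // aggressively than the normal per-thread step would, keeping the
  // excess over green_zone small enough for update_rs to absorb within
  // its time budget.
  size_t primary_activation_threshold(size_t green_zone,
                                      size_t per_thread_step,
                                      unsigned parallel_gc_threads) {
    size_t step = std::min(per_thread_step, (size_t)parallel_gc_threads);
    return green_zone + step;
  }

The dirty card queue set's enqueue notification then uses a threshold
like this, rather than green_zone * (1 + predictor_sigma).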
Using default configuration parameters, comparing runs of specjbb2015
on Linux-x64 with 24 logical processors (so 18 refinement threads with
the default configuration), with these changes we see a noticeable
increase in the steady state green zone value as compared to the
baseline:
            baseline    limit primary
  mean         387           473
  median       390           483
  stddev        68            68
  min          121           166
  max          568           630
across ~375 collection pauses for each case.
We're still using the same green zone adjustment (the first 40 or so
pauses show identical green_zone growth in this comparison). The
difference is in the activation of the primary (zeroth) concurrent
refinement thread by dirty card queue set notification. After a pause
we'll often see a burst of concurrent refinement thread activity, as
dirty cards scheduled for revisiting are processed. Once that's done,
the modified version typically activates / runs / deactivates just the
primary thread as mutators enqueue buffers, keeping the number of
buffers close to the green zone target. The baseline allows the
number of buffers to grow until several threads are activated (4 with
the default configuration used). Sometimes the baseline starts them
too late (or not at all), allowing the number of buffers to
significantly exceed the green zone target when a pause occurs,
leading to the update_rs phase exceeding its time goal.
As a result of this change, ConcurrentG1Refine construction no longer
needs the predictor argument (though it may return with future
improvements to the control model as part of JDK-8137022).
- Command line -XX:G1ConcRefinementThreads=0 now creates zero
concurrent refinement threads, rather than using the ergonomic default
even though zero is explicitly specified. This will result in
mutator-only concurrent processing of dirty card buffers, which may
result in missed pause targets. (Mutator-only processing being
insufficient is one of the issues discussed in JDK-8137022.) The use
of a zero value is mostly intended for testing, rather than production
use.
- Command line -XX:G1ConcRefinementRedZone=0 is no longer documented
as disabling concurrent processing. So far as I can tell, it never
did so. Rather, it meant that buffers completed by mutator threads
were always processed by them (and that only when
G1UseAdaptiveConcRefinement was off). Buffers enqueued for other
reasons would still be processed by the concurrent refinement threads.
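To sketch one plausible reading of that behavior (illustrative only,
not the actual code): the red zone acts as the cap on completed
buffers beyond which a mutator processes its own buffer, so a zero
red zone makes that condition always hold when the cap isn't being
managed adaptively.

  #include <cstddef>

  // Illustrative sketch, not the real dirty card queue set logic: with
  // G1UseAdaptiveConcRefinement off, a mutator processes its completed
  // buffer itself once the completed-buffer count reaches the red
  // zone; a red zone of zero makes that always true.
  bool mutator_processes_own_buffer(size_t completed_buffers,
                                    size_t red_zone,
                                    bool adaptive_refinement) {
    if (adaptive_refinement) {
      return false;  // cap is adjusted adaptively elsewhere; not modeled here
    }
    return completed_buffers >= red_zone;  // always true when red_zone == 0
  }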
CR:
https://bugs.openjdk.java.net/browse/JDK-8133051
Webrev:
http://cr.openjdk.java.net/~kbarrett/8133051/webrev.00/
Testing:
Local specjbb2015 (Linux-x64)
GC nightly with G1
Aurora performance testing - no significant differences.