RFR: 8133051: Concurrent refinement threads may be activated and deactivated at random

Wed Apr 13 11:46:05 UTC 2016

Hi Kim,

On Tue, 2016-04-05 at 22:15 -0400, Kim Barrett wrote:
> > On Apr 5, 2016, at 9:02 AM, Kim Barrett <kim.barrett at oracle.com>
> > wrote:
> > I've been so focused on the many-refinement-thread problems that I
> > forgot to consider the small-number-of-threads case here.  Please
> > hold
> > off on reviewing...
> 
> [Taking off hold; I've figured out how to deal with the small number
> of threads case, and it turns out it helps the large number of 
> threads case too.  I've updated the description of the changes from 
> the initial RFR to account for the change to the activation of the
> primary concurrent refinement thread.]
> 
> Please review this change to the G1 concurrent refinement thread
> controller.  This change addresses unnecessary activation when there
> are many threads and few buffers to be processed.  It also addresses
> delayed activation due to mis-configuration of the dirty card queue
> set's notification mechanism.
> 
> This change continues to use (more or less) the existing control
> model, only avoiding obviously wasted effort or undesirable delays.
> Further enhancements to the control model will be made under
> JDK-8137022 or subtasks from that.
> 
> - Changed the G1 concurrent refinement thread activation controller 
> to use a minimum buffer count step between (de)activation values for 
> the threads.  This is accomplished by having a minimum yellow zone 
> size, based on the number of refinement threads.  This avoids waking 
> up more refinement threads than there are buffers available to 
> process.  (It is, of course, still possible for a refinement thread 
> to wake up and discover it has no work to do, because of progress by 
> other threads. But at least we're no longer waking up threads with a 
> near guarantee they won't find work to do.)
> 
> - As part of the above, changed G1ConcRefinementThresholdStep to have
> a minimum value of one, a default value of 2, and to be used to
> determine a lower bound on the thread activation step size.  A larger
> step size makes it less likely a thread will be woken up and discover
> other threads have already completed the work "allocated" to it.  Too
> large a minimum may overly restrict the number of refinement threads
> being activated, leading to missed pause targets.
> 
> - Changed the threshold for activation of the primary concurrent
> refinement thread via notification from the dirty card queue set upon
> enqueue of a new buffer.  It was previously using a notification
> threshold of green_zone * (1 + predictor_sigma).  This could lead to 
> a significantly larger activation threshold, particularly as the
> green_zone value grows, which could lead to a much larger number of
> pending buffers for pause-time update_rs to process, leading to 
> missed update_rs time targets and unnecessary back pressure on the
> green_zone size.
> 
> Instead we now start with the normal activation threshold for the
> primary thread, calculated using the green_zone value and threshold
> steps.  We limit that using ParallelGCThreads (the number of threads
> used by the update_rs phase), possibly running the thread more
> aggressively than the normal threshold would suggest, again to limit
> the excess over the green_zone value that might be found by
> update_rs.
> 
> Using default configuration parameters, comparing runs of specjbb2015
> on Linux-x64 with 24 logical processors (so 18 refinement threads 
> with the default configuration), with these changes we see a 
> noticable increase in the steady state green zone value as compared 
> to the baseline:
> 
> 	baseline	limit primary
> mean	387		473
> median	390		483
> stddev	 68		 68
> min	121		166
> max	568		630
> 
> across ~375 collection pauses for each case.
> 
> We're still using the same green zone adjustment (the first 40 or so
> pauses show identical green_zone growth in this comparison).  The
> difference is in the activation of the primary (zero'th) concurrent
> refinement thread by dirty card queue set notification.  After a 
> pause we'll often see a burst of concurrent refinement thread 
> activity, as dirty cards scheduled for revisiting are processed. 
>  Once that's done, the modified version typically activates / runs / 
> deactivates just the primary thread as mutators enqueue buffers, 
> keeping the number of buffers close to the green zone target.  The 
> baseline allows the number of buffers to grow until several threads 
> are activated (4 with the default configuration used).  Sometimes the 
> baseline starts them too late (or not at all), allowing the number of 
> buffers to significantly exceed the green zone target when a pause 
> occurs, leading to the update_rs phase exceeding its time goal.
> 
> As a result of this change, ConcurrentG1Refine construction no longer
> needs to predictor argument (though it may return with future
> improvements to the control model as part of JDK-8137022).
> 
> - Command line -XX:G1ConcRefinementThreads=0 now creates zero
> concurrent refinement threads, rather than using the ergonomic 
> default even though zero is explicitly specified.  This will result 
> in mutator-only concurrent processing of dirty card buffers, which 
> may result in missed pause targets.  (Mutator-only processing being
> insufficient is one of the issues discussed in JDK-8137022.) The use
> of a zero value is mostly intended for testing, rather than 
> production use.
> 
> - Command line -XX:G1ConcRefinementRedZone=0 is no longer documented
> as disabling concurrent processing.  So far as I can tell, it never
> did so.  Rather, it meant that buffers completed by mutator threads
> were always processed by them (and that only when G1UseAdaptiveConcRe
> finement was off).  Buffers enqueued for other reasons would still be 
> processed by the concurrent refinement threads.
> 
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8133051
> 
> Webrev:
> http://cr.openjdk.java.net/~kbarrett/8133051/webrev.00/
> 
> Testing:
> Local specjbb2015 (Linux-x64)
> GC nightly with G1
> Aurora performance testing - no significant differences.

- (pre-existing) in the options that were changed, maybe spell out "rem
set" and "rset" for better understanding.

- this is a change of existing behavior: the values for the global
flags used are never written back, not even after initialization. It
would be nice to do so. That is kind of useful without using adaptive
concurrent refinement. The logs won't give you that information in this
case.

Anyway, a log message printing the initial values would be really nice
regardless of whether adaptiveness is turned on.

- is there a reason why this change needs to particularly put the
Thresholds struct into a namespace? Please remove the namespace usage
declaration.

- the log message prints the thresholds, not the zones (ranges). This
might lead to misunderstandings.

The same with variable and parameter names. (this is mostly pre
-existing)

- the defines should not use local variables with leading underscore.
While there are not particular rules for it, variables with leading
underscore are reserved for members.
It initially got me confused with these defines being used in the
context of some object, and referencing that object's members.
Please also consider #undef'ing these local defines.

- in concurrentG1Refine.cpp:154, the assignment of "size = green * 2"
seems quite random, just as the previous factor of 3.
Wouldn't it be better to just use min_size, or do nothing here because
the following value clamping will fix it up anyway?

- the "update_rs_processed_buffers" parameter in various methods should
imho be a size_t, not a double. There does not seem to be a reason to
use a double. (pre-existing)

- I kind of dislike even requiring an unused parameter for a method in
static methods like in calc_new_yellow/red_zone, but I kind of see the
point for consistency's sake. Let's see what others think.

In any case, I would prefer if the TODO's were removed. Most likely
they are related to improvements you have in your queue anyway,
otherwise I would prefer to store TODO's in the bug tracker.

- for consistency: all methods but calc_new_*_zone() parameters that
contain the thresholds have the _zone suffix, but not those:

- concurrentG1Refine.cpp:174ff: indentation of the parameter list does
not conform to any other in the same file.

- the * 2 in the calc_new_yellow_zone() seems to be the same factor as
in calc_init_yellow_zone(). Unless this is changed soon, I would prefer
if it were factored out somehow.

- concurrentRefine.cpp:360+362: instead of the C-style cast, maybe a
static_cast should be used here too.

- in addition to the actual new thresholds, it might be interesting to
get information about the values passed to G1ConcurrentRefine::adjust()
 in a trace message.

Looks good overall though.

Thanks,
  Thomas