Request for Review (xs) - 8159073: Error handling incomplete when creating GC threads lazily

Thu Jul 14 08:47:29 UTC 2016

Hi,

On Wed, 2016-07-13 at 16:46 -0700, Jon Masamitsu wrote:
> 
> On 7/13/2016 4:11 AM, Thomas Schatzl wrote:
> > 
> > Hi Jon,
> > 
> > On Thu, 2016-06-23 at 13:32 -0700, Jon Masamitsu wrote:
> > > 
> > > The change has grown some (small now, not extra small).
> > > 
> > > Added logging for the case where thread creation fails.
> > > 
> > > http://cr.openjdk.java.net/~jmasa/8159073/webrev_delta_00_01a/
> > > 
> > > Added fixes for failed thread creation.
> > > 
> > > http://cr.openjdk.java.net/~jmasa/8159073/webrev_delta_00_01b/
> > > 
> > > Complete fix with test added.
> > > 
> > > http://cr.openjdk.java.net/~jmasa/8159073/webrev.01/
> > > 
> > > Tested with gc_test_suite with new diagnostic flag on and off.
> >    some comments:
> > 
> > - in a few files (concurrentMarkSweepGeneration.cpp,
> > parnewGeneration.cpp) there is a comment like:
> > 
> > 2891       // If the total workers is greater than 1, then multiple
> > workers
> > 2892       // may be used at some time and the initialization has
> > been
> > set
> > 2893       // such that the single threaded path cannot be used.
> > 2894       if (workers->total_workers() > 1) {
> > 
> > Wouldn't it be easier to just remove the special case for running
> > the task in the VM thread?
> I've seen cases with G1 where using the VM thread compared to using
> 1 GC worker thread is about .5ms faster.   For a long time I would
> have blown off .5ms but I'm less inclined to do that now.  It is
> after all 500us :-).  We hear talk about sub ms pauses being
> important to someone somewhere.

Not trying to convince you, and it is a good argument, but:

I kind of doubt that this is significant for a STW collector that takes
(potentially) multiple (tens? hundreds?) of ms to actually get to a
safepoint in the first place.

Note that saving .5ms to be significant here means that the work must
be very little in the first place too, to be completed by a single
thread within a very small time frame so that it is not worth
parallelizing it.

There are also significant gains to be made by just reducing the number
of times we start up threads (e.g. reference processing, G1's "other"
tasks). Not that you couldn't address any of these issues in addition,
but they decrease the impact :)

Further, if we consider doing that, I would prefer to have that
behavior centralized in the run_task() methods. Not have to consider
every time we create and run a gang task whether it might be useful to
run it single-threaded (and add the boiler plate code to do it
everywhere).
I really kind of dislike such generic optimizations applied to single
instances only, and then forgotten everywhere else. (And then somebody
taking lots of time to find and apply the same optimization opportunity
everywhere else using copy&paste, and of course forgetting it in one
particular important place :) Note that this is not a critique of this
change, just there have been a few instances of that littered across gc
code)

>  Beyond that, one of the instances where this comment occurs is for
> the CMS marking thread where 1 concurrent worker may be common.

Fair argument.

> And if that's not convincing I'll bring out my last resort excuse: 
> "Beyond the scope of this change" :-).  Yes, it would simplify
> things, I'd rather not have that fight over this change.

This has been more of a side comment, no problem :)

> > - WorkGang::run_task(): from my understanding that method has
> > always been intended to be a shortcut for
> > 
> > old_num_workers = num_workers;
> > update_workers(num_workers);
> > run_task(cl);
> > update_workers(old_num_workers);
> > 
> > This is not the case here any more. The code silently changes the
> > number of dispatched workers, which may lead to confusion. I.e. the
> > caller may expect that the given do_work() closure method will be
> > called with arguments from 0..num_workers-1 eventually.
> I think you're referring to
> 
> 265 void WorkGang::run_task(AbstractGangTask* task) {
> 266   run_task(task, active_workers());
> 267 }

No, run_task() with the num_workers argument. Sorry for being unclear.

> I think the idiom is
> 
> 1) decide if you want to recalculate the number of threads to use 
> (either call calc_active_workers() or use the current value of
> active_workers()).
> 2) call run_task() passing the task and it uses the active_workers()
> that is part of the workgang.
> 
> Repeat at the next parallel task (no dependency on the number of
> active workers used in the last parallel task).
> 
> I know there was some unfortunate cases with G1 where the reporting
> of statistics required that the number of active_workers at the
> parallel task had to be the number of when the statistics were
> reported and required the restoring the previous active_workers
> value.  I had hoped that the new run_task()
> 
> void run_task(AbstractGangTask* task, uint num_workers);
> 
> would help fix that unfortunate case but I am no longer holding out
> hope for that.

I have not seen that in current code, I might have overlooked that :)

> > That is not the case here any more.
> > 
> > I am not sure whether this is a problem, I think that all currently
> > implemented closures are okay with that. But I am not sure what the
> > problem is with the original code, so I would tend to try to avoid
> > such a change in behavior. Again, the worst thing that may happen
> > is that a single worker thread may get his do_work() method called
> > multiple times.
> > 
> > This is something that has been possible, and the code had to
> > prepared for that, since forever.
> > 
> > [Additionally, currently, all existing users of that run_task()
> > method pass (or should pass) num_workers <= active_workers() that
> > call this method, so no extra threads should be created.]
>
> I don't recall thinking that the parameter (num_workers) should be
> less than active_workers.  If num_workers should be less than
> active_workers, that's to be decided by the caller of run_task() and
> run_task() should use num_workers if possible (where possible is
> num_workers <= total_workers).

No, num_workers need not be less than active workers. But the code
change in run_task(, uint) actually forces that now. And it changes the
number of times the do_work() is (supposedly) guaranteed to be called
if num_workers > active_workers.

Currently G1 uses the global active_workers() as a global ceiling for
the maximum number of threads wanted, for no particular reason.

The reasoning is kind of, if active_workers(), which also takes heap
size into account, says, for that application we should not use more
than 2 threads because eg. it only uses a 50M heap. And anything beyond
that is (on a global scale) waste of resources.

In such a situation, using different metrics, there might be cases
where more threads are (slightly) useful. But in case of an exceptional
situation, it would allocate too many resources at that time. Since we
don't ever release some of these resources (e.g. thread specific data
structures), this would have negative impact forever (like memory
consumption) just for a spike of activity.

And on the other hand, it would be too complicated for the local thread
number calculations to always consider every little detail (and repeat
the same calculations). This reasoning imho helps simplifying the code
and hopefully the understanding.

However, the code suggested so far in this change does not support
that, and introduces some semantics change for the caller too.

> > The only problem calling update_active_workers() in that method
> > right now is that it only allows less active workers than total
> > workers if UseDynamicNumberOfGCThreads is true as far as I can see.
> I think UseDynamicNumberOfGCThreads being true allows use to use less
> than total number of threads (which is the default behavior
> forever).  G1 is more parallel and more concurrent and perhaps needs
> more flexibility. That's a change we could consciously make and
> advertise to the world.

I have no problem with that, however I am doubtful that the current
heuristics for setting the number of threads are good enough.

> > 
> > So I tried a vm.gc test run with the patch at
> > http://cr.openjdk.java.net/~tschatzl/8159073/webrev/hotspot.patch
> > (basically your patch, with changes in WorkGang::run_task(), and
> > removing the nasty assert in update_active_workers()), and it
> > passed.
> I can believe that patch workers but I think it is going in the wrong
> direction.
> 
> calc_active_workers() picks the number of GC threads based on heap
> size. I think this is appropriate for many parallel GC tasks.  But
> there are exceptions and  run_task(AbstractGangTask* task, uint
> num_workers) is for those exceptions.  For example, Reference
> processing  does not scale with the size of the heap so we should use
> some other policy to choose the number of parallel workers for
> Reference processing and use run_task(task, num_workers) to say I
> want to override the number of workers active in the workgroup.

Yes, but the existing code does not support that. It strictly bounds
the actual number of "workers" (and actually, in terms of the framework
this is more a number of tasks!) by the number of active threads
available at the moment! :)

E.g. if active_workers = 3 (for whatever reason), but somebody calls
run_task(task, 10), the method silently only performs three "tasks".

And as mentioned already, that changes the existing behavior: instead
of do_work() of the closure being called with worker_id of 0..9, it
will only get called with worker_id of 0..2!

What if the caller wants to have the do_work() call with worker_id 9 do
something different to the other threads?
This will fail with the suggested change, because do_work() will never
be called with 9 then.

Previously, the caller had been guaranteed to get invoked with
worker_id's from 0..num_workers-1. Even if only less parallel threads
were actually used.

(There has never been a guaranteed 1:1 mapping between threads and
tasks/work)

> Alternatively, we could figure out another way to handle the cases
> where the number of active workers needs to be the same between
> tasks.  I'm pretty sure it had to do with reporting information in
> G1.  As I recall it, it went like
> 
> - pick N active workers to do workA
> - pick M active workers to do workB
> - report workA with current active workers
> M and either under report (N > M) or report
> garbage (N < M)

That's basically why my suggested change restores the active_workers
global variable after its work - backwards compatibility to existing
code.

Maybe this has been an issue previously, e.g. with the evacuation
phase: although some threads did not contribute actual work to a
particular phase, but for average calculation they were still accounted
for?
Afaik that has been fixed in the meantime.

Overall, and as a future improvement, I think it might be better to
move the management of "global" active_workers variable out from
WorkGang. And maybe rename the parameters more appropriately. ;)

Thanks,
  Thomas