Request for Review (xs) - 8159073: Error handling incomplete when creating GC threads lazily
Jon Masamitsu
jon.masamitsu at oracle.com
Fri Jul 15 14:28:52 UTC 2016
On 7/14/2016 1:47 AM, Thomas Schatzl wrote:
> Hi,
>
> On Wed, 2016-07-13 at 16:46 -0700, Jon Masamitsu wrote:
>> On 7/13/2016 4:11 AM, Thomas Schatzl wrote:
>>> Hi Jon,
>>>
>>> On Thu, 2016-06-23 at 13:32 -0700, Jon Masamitsu wrote:
>>>> The change has grown some (small now, not extra small).
>>>>
>>>> Added logging for the case where thread creation fails.
>>>>
>>>> http://cr.openjdk.java.net/~jmasa/8159073/webrev_delta_00_01a/
>>>>
>>>> Added fixes for failed thread creation.
>>>>
>>>> http://cr.openjdk.java.net/~jmasa/8159073/webrev_delta_00_01b/
>>>>
>>>> Complete fix with test added.
>>>>
>>>> http://cr.openjdk.java.net/~jmasa/8159073/webrev.01/
>>>>
>>>> Tested with gc_test_suite with new diagnostic flag on and off.
>>> some comments:
>>>
>>> - in a few files (concurrentMarkSweepGeneration.cpp,
>>> parnewGeneration.cpp) there is a comment like:
>>>
>>> 2891   // If the total workers is greater than 1, then multiple workers
>>> 2892   // may be used at some time and the initialization has been set
>>> 2893   // such that the single threaded path cannot be used.
>>> 2894   if (workers->total_workers() > 1) {
>>>
>>> Wouldn't it be easier to just remove the special case for running
>>> the task in the VM thread?
>> I've seen cases with G1 where using the VM thread compared to using
>> 1 GC worker thread is about .5ms faster. For a long time I would
>> have blown off .5ms but I'm less inclined to do that now. It is
>> after all 500us :-). We hear talk about sub ms pauses being
>> important to someone somewhere.
> Not trying to convince you, and it is a good argument, but:
>
> I kind of doubt that this is significant for a STW collector that takes
> (potentially) multiple (tens? hundreds?) of ms to actually get to a
> safepoint in the first place.
>
> Note that saving .5ms to be significant here means that the work must
> be very little in the first place too, to be completed by a single
> thread within a very small time frame so that it is not worth
> parallelizing it.
Right.
>
> There are also significant gains to be made by just reducing the number
> of times we start up threads (e.g. reference processing, G1's "other"
> tasks). Not that you couldn't address any of these issues in addition,
> but they decrease the impact :)
Are you saying that if I delete the option to use the VM thread, then
I don't need that code that checks the special case of 1 thread
(more deleted code) so it simplifies my change?
>
> Further, if we consider doing that, I would prefer to have that
> behavior centralized in the run_task() methods. Not have to consider
> every time we create and run a gang task whether it might be useful to
> run it single-threaded (and add the boiler plate code to do it
> everywhere).
I can see that point although I don't think the centralized point should be
run_task().
> I really kind of dislike such generic optimizations applied to single
> instances only, and then forgotten everywhere else. (And then somebody
> taking lots of time to find and apply the same optimization opportunity
> everywhere else using copy&paste, and of course forgetting it in one
> particular important place :) Note that this is not a critique of this
> change, just there have been a few instances of that littered across gc
> code)
I think using the VM thread was less an optimization than a way to check
that the parallel code being added had the same result as the original
code. But that is neither here nor there now.
>
>> Beyond that, one of the instances where this comment occurs is for
>> the CMS marking thread where 1 concurrent worker may be common.
> Fair argument.
>
>> And if that's not convincing I'll bring out my last resort excuse:
>> "Beyond the scope of this change" :-). Yes, it would simplify
>> things, I'd rather not have that fight over this change.
> This has been more of a side comment, no problem :)
>
>>> - WorkGang::run_task(): from my understanding that method has
>>> always been intended to be a shortcut for
>>>
>>> old_num_workers = num_workers;
>>> update_workers(num_workers);
>>> run_task(cl);
>>> update_workers(old_num_workers);
>>>
>>> This is not the case here any more. The code silently changes the
>>> number of dispatched workers, which may lead to confusion. I.e. the
>>> caller may expect that the given do_work() closure method will be
>>> called with arguments from 0..num_workers-1 eventually.
>> I think you're referring to
>>
>> 265 void WorkGang::run_task(AbstractGangTask* task) {
>> 266 run_task(task, active_workers());
>> 267 }
> No, run_task() with the num_workers argument. Sorry for being unclear.
>
>> I think the idiom is
>>
>> 1) decide if you want to recalculate the number of threads to use
>> (either call calc_active_workers() or use the current value of
>> active_workers()).
>> 2) call run_task() passing the task and it uses the active_workers()
>> that is part of the workgang.
>>
>> Repeat at the next parallel task (no dependency on the number of
>> active workers used in the last parallel task).
>>
>> I know there was some unfortunate cases with G1 where the reporting
>> of statistics required that the number of active_workers at the
>> parallel task had to be the number of when the statistics were
>> reported and required the restoring the previous active_workers
>> value. I had hoped that the new run_task()
>>
>> void run_task(AbstractGangTask* task, uint num_workers);
>>
>> would help fix that unfortunate case but I am no longer holding out
>> hope for that.
> I have not seen that in current code, I might have overlooked that :)
I looked and it is not there anymore. It was in the GC logging code and
has been fixed at some point.
>
>>> That is not the case here any more.
>>>
>>> I am not sure whether this is a problem, I think that all currently
>>> implemented closures are okay with that. But I am not sure what the
>>> problem is with the original code, so I would tend to try to avoid
>>> such a change in behavior. Again, the worst thing that may happen
>>> is that a single worker thread may get his do_work() method called
>>> multiple times.
>>>
>>> This is something that has been possible, and the code has had to be
>>> prepared for that, since forever.
>>>
>>> [Additionally, currently, all existing users of that run_task()
>>> method pass (or should pass) num_workers <= active_workers() that
>>> call this method, so no extra threads should be created.]
>> I don't recall thinking that the parameter (num_workers) should be
>> less than active_workers. If num_workers should be less than
>> active_workers, that's to be decided by the caller of run_task() and
>> run_task() should use num_workers if possible (where possible is
>> num_workers <= total_workers).
> No, num_workers need not be less than active workers. But the code
> change in run_task(, uint) actually forces that now. And it changes the
> number of times the do_work() is (supposedly) guaranteed to be called
> if num_workers > active_workers.
>
> Currently G1 uses the global active_workers() as a global ceiling for
> the maximum number of threads wanted, for no particular reason.
>
> The reasoning is kind of, if active_workers(), which also takes heap
> size into account, says, for that application we should not use more
> than 2 threads because eg. it only uses a 50M heap. And anything beyond
> that is (on a global scale) waste of resources.
>
> In such a situation, using different metrics, there might be cases
> where more threads are (slightly) useful. But in case of an exceptional
> situation, it would allocate too many resources at that time. Since we
> don't ever release some of these resources (e.g. thread specific data
> structures), this would have negative impact forever (like memory
> consumption) just for a spike of activity.
>
> And on the other hand, it would be too complicated for the local thread
> number calculations to always consider every little detail (and repeat
> the same calculations). This reasoning imho helps simplifying the code
> and hopefully the understanding.
>
> However, the code suggested so far in this change does not support
> that, and introduces some semantics change for the caller too.
>
>>> The only problem calling update_active_workers() in that method
>>> right now is that it only allows less active workers than total
>>> workers if UseDynamicNumberOfGCThreads is true as far as I can see.
>> I think UseDynamicNumberOfGCThreads being true allows us to use fewer
>> than the total number of threads (which is the default behavior
>> forever). G1 is more parallel and more concurrent and perhaps needs
>> more flexibility. That's a change we could consciously make and
>> advertise to the world.
> I have no problem with that, however I am doubtful that the current
> heuristics for setting the number of threads are good enough.
I agree that it can be improved.
>
>>> So I tried a vm.gc test run with the patch at
>>> http://cr.openjdk.java.net/~tschatzl/8159073/webrev/hotspot.patch
>>> (basically your patch, with changes in WorkGang::run_task(), and
>>> removing the nasty assert in update_active_workers()), and it
>>> passed.
>> I can believe that patch works but I think it is going in the wrong
>> direction.
>>
>> calc_active_workers() picks the number of GC threads based on heap
>> size. I think this is appropriate for many parallel GC tasks. But
>> there are exceptions and run_task(AbstractGangTask* task, uint
>> num_workers) is for those exceptions. For example, Reference
>> processing does not scale with the size of the heap so we should use
>> some other policy to choose the number of parallel workers for
>> Reference processing and use run_task(task, num_workers) to say I
>> want to override the number of workers active in the workgroup.
> Yes, but the existing code does not support that. It strictly bounds
> the actual number of "workers" (and actually, in terms of the framework
> this is more a number of tasks!) by the number of active threads
> available at the moment! :)
Do you mean this code that is bounding the num_workers by the
number of created threads?
278 uint active_workers = MIN2(_created_workers, num_workers);
279 _dispatcher->coordinator_execute_on_workers(task, active_workers);
>
> E.g. if active_workers = 3 (for whatever reason), but somebody calls
> run_task(task, 10), the method silently only performs three "tasks".
The limiting to active_workers is the kind of thing I see callers of
run_task(task, num_workers) doing. From "share/vm/gc/g1/g1CardLiveData.cpp"
(line 427):

 427   uint const num_workers = (uint)MIN2(num_chunks, (size_t)workers->active_workers());
 428
 429   G1ClearCardLiveDataTask cl(live_cards_bm(), num_chunks);
 430
 431   log_debug(gc, ergo)("Running %s using %u workers for " SIZE_FORMAT " work units.", cl.name(), num_workers, num_chunks);
 432   workers->run_task(&cl, num_workers);
>
> And as mentioned already, that changes the existing behavior: instead
> of do_work() of the closure being called with worker_id of 0..9, it
> will only get called with worker_id of 0..2!
>
> What if the caller wants to have the do_work() call with worker_id 9 do
> something different to the other threads?
> This will fail with the suggested change, because do_work() will never
> be called with 9 then.
>
> Previously, the caller had been guaranteed to get invoked with
> worker_id's from 0..num_workers-1. Even if only less parallel threads
> were actually used.
Let me respond to this once I hear what you have to say about my comments
above.
>
> (There has never been a guaranteed 1:1 mapping between threads and
> tasks/work)
>
>> Alternatively, we could figure out another way to handle the cases
>> where the number of active workers needs to be the same between
>> tasks. I'm pretty sure it had to do with reporting information in
>> G1. As I recall it, it went like
>>
>> - pick N active workers to do workA
>> - pick M active workers to do workB
>> - report workA with current active workers
>> M and either under report (N > M) or report
>> garbage (N < M)
> That's basically why my suggested change restores the active_workers
> global variable after its work - backwards compatibility to existing
> code.
>
> Maybe this has been an issue previously, e.g. with the evacuation
> phase: although some threads did not contribute actual work to a
> particular phase, but for average calculation they were still accounted
> for?
> Afaik that has been fixed in the meantime.
I agree it has been fixed.
I made the following change to guarantee that _active_workers in WorkGang
was not changed by run_task(task, num_workers).

diff --git a/src/share/vm/gc/shared/workgroup.cpp b/src/share/vm/gc/shared/workgroup.cpp
--- a/src/share/vm/gc/shared/workgroup.cpp
+++ b/src/share/vm/gc/shared/workgroup.cpp
@@ -267,6 +267,7 @@
 }

 void WorkGang::run_task(AbstractGangTask* task, uint num_workers) {
+  uint prev_active_workers = active_workers();
   guarantee(num_workers <= total_workers(),
             "Trying to execute task %s with %u workers which is more than the amount of total workers %u.",
             task->name(), num_workers, total_workers());
@@ -276,6 +277,7 @@
   // created workers.
   uint active_workers = MIN2(_created_workers, num_workers);
   _dispatcher->coordinator_execute_on_workers(task, active_workers);
+  guarantee(prev_active_workers == this->active_workers(), "Active workers changed from %u to %u", prev_active_workers, this->active_workers());
 }

It passed vm.gc. That's what I expected.
>
> Overall, and as a future improvement, I think it might be better to
> move the management of "global" active_workers variable out from
> WorkGang. And maybe rename the parameters more appropriately. ;)
In my mind WorkGang is a place to store the number of active workers (and
now created workers). WorkGang knows the maximum number of workers that it
can use, so active_workers seemed to fit. Are you suggesting removing the
run_task(task) method and only having run_task(task, num_workers), so that
the number of active workers is always fed to WorkGang? I would not be
opposed to that.
Jon
>
> Thanks,
> Thomas
>
More information about the hotspot-gc-dev mailing list