RFR (M): 8245721: Refactor the TaskTerminator
Kim Barrett
kim.barrett at oracle.com
Wed Jul 8 15:31:06 UTC 2020
> On Jun 24, 2020, at 4:03 AM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
>
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8245721
> Webrev:
> http://cr.openjdk.java.net/~tschatzl/8245721/webrev/
> Testing:
> tier1-5, many many perf rounds, many tier1-X rounds with other patches
>
> Thanks,
> Thomas
------------------------------------------------------------------------------
src/hotspot/share/gc/shared/taskTerminator.hpp
54 struct SpinContext {
55   uint yield_count;
...
While there's certainly variability on this, discussions I've had with
others and other reviews suggest that the Style Guide's recommendation
to use leading underscores for data member names should apply even to
public members of structs.
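That is, something like

  struct SpinContext {
    uint _yield_count;   // rather than yield_count
    ...
  };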
------------------------------------------------------------------------------
src/hotspot/share/gc/shared/taskTerminator.hpp
128 spin_context.hard_spin_limit = MIN2(2 * spin_context.hard_spin_limit,
129                                     (uint) WorkStealingHardSpins);
In the new code, hard_spin_limit is always (re)initialized to
WorkStealingHardSpins, so this does nothing.
In the old code, the starting spin limit was
WorkStealingHardSpins >> WorkStealingSpinToYieldRatio
and that value was captured in a variable (hard_spin_start) for
use in reinitializing the limit after each yield iteration.
So this is a change, and I'm guessing not an intentional one.
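For reference, the old progression was roughly (sketch from memory,
untested; elides the clamping and the WorkStealingSpinToYieldRatio == 0
case):

  uint hard_spin_limit = WorkStealingHardSpins >> WorkStealingSpinToYieldRatio;
  uint hard_spin_start = hard_spin_limit;  // remember the starting limit
  ...
  // Each hard-spin round doubles the limit, capped:
  hard_spin_limit = MIN2(2 * hard_spin_limit, (uint) WorkStealingHardSpins);
  ...
  // After each yield, restart from the small initial limit:
  hard_spin_limit = hard_spin_start;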
------------------------------------------------------------------------------
src/hotspot/share/gc/shared/taskTerminator.cpp
154 SpinContext spin_context;
If this were moved inside the outer loop of offer_termination, there
would be no need for SpinContext::reset; just use the ctor-initializer
for initialization of the members.
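E.g. (sketch; guessing the member set from the webrev):

  struct SpinContext {
    uint _yield_count;
    uint _hard_spin_limit;
    SpinContext() :
      _yield_count(0),
      _hard_spin_limit((uint) WorkStealingHardSpins) {}
  };

and in offer_termination's outer loop:

  while (/* outer termination loop */) {
    SpinContext spin_context;  // freshly initialized; no reset() needed
    ...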
------------------------------------------------------------------------------
src/hotspot/share/gc/shared/taskTerminator.cpp
I *think* the inner loop of offer_termination would be made clearer by
hoisting the yield_count check out of do_spin_iteration. That is,
instead of
169 bool giveup_spin;
170 do {
171   size_t tasks;
172   bool should_exit_termination;
173   {
174     MutexUnlocker y(_blocker, Mutex::_no_safepoint_check_flag);
175     giveup_spin = do_spin_iteration(spin_context);
176
177     // Dirty read of exit condition.
178     tasks = tasks_in_queue_set();
179     should_exit_termination = exit_termination(tasks, terminator);
180   }
181
182   // Immediately check exit conditions after re-acquiring the lock using the
183   // information gathered just recently.
184   if (_offered_termination == _n_threads) {
185     prepare_for_return(the_thread);
186     assert_queue_set_empty();
187     return true;
188   } else if (should_exit_termination) {
189     prepare_for_return(the_thread, tasks);
190     _offered_termination--;
191     return false;
192   }
193 } while (!giveup_spin);
something like
while (spin_context.yield_count <= WorkStealingYieldsBeforeSleep) {
  // Each spin iteration is counted as a yield for purposes of
  // deciding when to sleep.
  ++spin_context.yield_count;
  size_t tasks;
  bool should_exit_termination;
  {
    MutexUnlocker y(_blocker ...);
    do_spin_iteration(spin_context);
    // Dirty read of exit condition.
    tasks = tasks_in_queue_set();
    should_exit_termination = exit_termination(tasks, terminator);
  }
  // Immediately check exit conditions after re-acquiring the lock.
  if (_offered_termination == _n_threads) {
    ... ;
    return true;
  } else if (should_exit_termination) {
    ... ;
  }
}
This also makes some issues with naming a little more apparent.
do_spin_iteration might spin or might yield, yield_count is counting
both spins and yields, and spin_context is about more than spinning.
Maybe something a little more generic, using "delay" rather than
"spin", like do_delay_step, delay_count, and delay_context?
[pre-existing]
Side question: Should the yield_count test be <= or < ? It seems like
WorkStealingYieldsBeforeSleep == 0 should disable yields and just go
straight to the sleep case. And that's desirable for a uniprocessor,
though why we'd be running a GC that uses this on a uniprocessor system
is a question. An exclusive test would also better match the doc string for
the option.
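That is, the loop condition in the suggestion above would become

  while (spin_context.yield_count < WorkStealingYieldsBeforeSleep) {

so a zero value never enters the spin/yield phase at all.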
------------------------------------------------------------------------------
src/hotspot/share/gc/shared/taskTerminator.cpp
177 // Dirty read of exit condition.
178 tasks = tasks_in_queue_set();
From a comment in the bug, that's intentionally done while not holding
the lock, because doing otherwise makes the locked section too long.
There should be a comment about that so nobody is tempted to move it
(as I did in an earlier version of the above suggested move of the
yield_count checking).
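E.g. something like

  // Deliberate dirty read without holding the lock; re-reading this
  // under the lock would make the locked section too long.
  tasks = tasks_in_queue_set();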
------------------------------------------------------------------------------
src/hotspot/share/gc/shared/taskTerminator.cpp
How bad would it be to not bother collecting the number of tasks and
just have prepare_for_return unconditionally notify_all?
Alternatively, if that's bad and the cost of tasks_in_queue_set() is
also a problem (which I can well believe), maybe the taskqueue
facility can have some specialized helper function added for use here.
All we really want to know in the current code is how many tasks are
available, up to the number of workers; if there are more than that,
we don't need a precise answer.
Such a change can be dealt with later.
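E.g. an untested sketch, modeled on GenericTaskQueueSet::tasks() (the
name tasks_up_to is invented):

  size_t tasks_up_to(size_t limit) {
    size_t count = 0;
    // Stop scanning once we've seen 'limit' tasks; an exact count
    // beyond that isn't needed.
    for (uint i = 0; i < _n && count < limit; i++) {
      count += _queues[i]->size();
    }
    return count;
  }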
Another idea would be to record the task count in the terminator,
along with a last update epoch counter. (Both are only touched under
the lock, so no lock-free synchronization to worry about.) Use the
recorded count if its epoch is more recent than the locally captured
epoch from the most recent local use.
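A sketch of that idea (names invented; everything is under _blocker, so
plain loads and stores):

  size_t _recorded_tasks;  // last result of tasks_in_queue_set()
  uint64_t _update_epoch;  // bumped whenever _recorded_tasks is set

  size_t tasks_maybe_cached(uint64_t& local_epoch) {
    if (_update_epoch <= local_epoch) {
      // Nothing newer than what this caller last saw; rescan.
      _recorded_tasks = tasks_in_queue_set();
      _update_epoch++;
    }
    local_epoch = _update_epoch;
    return _recorded_tasks;
  }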
Of course, this is being called after some kind of delay (spin, yield,
sleep), so it's not clear how important the performance is.
------------------------------------------------------------------------------
src/hotspot/share/gc/shared/taskTerminator.cpp
204 } else {
205   size_t tasks = tasks_in_queue_set();
206   if (exit_termination(tasks, terminator)) {
207     prepare_for_return(the_thread, tasks);
208     _offered_termination--;
209     return false;
210   }
211 }
We get here only after our wait on the monitor ended.
If we were woken up by some other thread (so the wait did not time
out), I
think we should skip the exit_termination test (just assume true) and
pass 0 to prepare_for_return. (So no need for task counting in this
case.) So insert something like this before the above
} else if (!timedout) {
  // Woken by another thread; don't bother waking up more threads.
  prepare_for_return(the_thread, 0);
  _offered_termination--;
  return false;
} else {
  ...
I think this also fixes a (pre-existing, I think) bug. Assume there
are enough tasks that exiting is called for, but fewer tasks than the
number of threads waiting. Either the spin master or the first timeout
will wake up some, but not all, of the waiting threads. Each of those
threads (or the spin master) may wake up additional threads, making
the limit on wakeups based on the number of tasks pointless, and also
making the task counting of much more limited value; given that
exit_termination just checks tasks > 0, we could instead just check
whether the queue_set is empty. So I think without this we just have
a slower version of always waking up all waiting threads.
(There is also a small improvement to TaskQueueSuper::is_empty() that
I didn't bother making because it didn't seem to matter.)
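In code, the emptiness variant might look like (untested sketch;
assumes a queue-set-level is_empty() were added, and guesses at the
signature):

  bool exit_termination(TerminatorTerminator* terminator) {
    // No task counting; just check for any available work.
    return !_queue_set->is_empty() ||
           (terminator != NULL && terminator->should_exit_termination());
  }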
------------------------------------------------------------------------------
src/hotspot/share/gc/shared/taskTerminator.cpp
156 MutexLocker x(_blocker, Mutex::_no_safepoint_check_flag);
...
174 MutexUnlocker y(_blocker, Mutex::_no_safepoint_check_flag);
...
197 _blocker->wait_without_safepoint_check(WorkStealingSleepMillis);
I agree with Zhengyu's suggestion of using MonitorLocker and waiting
on the locker, rather than using naked _blocker->wait_xxx.
You replied:
> I plan to drop the for-Java Monitor and use os::PlatformMonitor
> instead here. That removes all _no_safepoint_check_flag code.
I don't think that should be done. PlatformMutex/Monitor are intended
for use in the implementation of Monitor/Mutex, as an
abstraction over the actual platform facilities. The "usual"
facilities should not be casually bypassed. (I am aware there are a
few direct uses of the Platform layer. Some of those might be simple
bugs. Some may be doing somewhat weird things that have difficulties
with the usual facilities, particularly with rank nesting checks,
which might arguably be considered an overly convoluted code bug.)
> Also there is no MonitorUnlocker, so we'd still have to use
> MutexUnlocker. It will look a bit weird to not have matching
> locker/unlocker names.
Given that Mutex is a conceptual (and now for real) base class of
Monitor, I don't see anything wrong or strange with the unlocker
support being only at the Mutex layer. Literally duplicating it to
have a similar name at the Monitor layer would seem strange to me. If
it really makes you feel better, have a (nearby? or shared?) typedef
for MonitorLocker.
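With MonitorLocker the shape would be something like

  MonitorLocker x(_blocker, Mutex::_no_safepoint_check_flag);
  ...
  {
    MutexUnlocker y(_blocker, Mutex::_no_safepoint_check_flag);
    ...
  }
  ...
  x.wait(WorkStealingSleepMillis);  // instead of naked _blocker->wait_xxx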
------------------------------------------------------------------------------