RFR: Optimizing best of two work stealing queue selection

Thomas Schatzl thomas.schatzl at oracle.com
Fri Jun 22 18:08:58 UTC 2018


Hi Zhengyu,

On Fri, 2018-06-22 at 11:27 -0400, Zhengyu Gu wrote:
> Hi Thomas,
> 
> Thanks for the suggestions! Please see comments inline.
> 
> On 06/22/2018 10:56 AM, Thomas Schatzl wrote:
> > Hi Zhengyu,
> > 
> > On Fri, 2018-06-22 at 09:32 -0400, Zhengyu Gu wrote:
> > > Hi,
> > > 
> > > This optimization is based on this paper [1]. The idea is to keep
> > > last successfully stolen queue as one of best of two candidates.
> > 
> >    :)
> > 
> > > My experiments showed that it only marginally improved successful
> > > stealing ratio, given the ratio is *very* high to begin with,
> > > north of 95%, even under extremely unbalanced scenarios (e.g.
> > > concurrent threads = 3, parallel threads = 11, of course, based
> > > on Shenandoah ParallelClaimableQueueSet ).
> > > 
> > > However, it showed remarkable improvement on task termination:
> > > I measured how often worker threads have to exit termination
> > > protocol and retry (failed termination)
> > 
> > The other change suggested in the paper that only checks
> > num_live_threads instead of max_num_threads during termination
> > would also be nice to have, but I think (and you were imho right
> > about this) you wanted to have this separate :)
> 
> To be honest, I have not yet studied this optimization, because
> Shenandoah has its own queue set implementation, so I am not sure
> whether it is applicable to us.

If you mean ParallelClaimableQueueSet: I can't see how it would help in
finding the queues that still have work during stealing (or, in the
case of the mentioned optimization, in doing more frequent
everybody-terminated-yet checks based on the number of active queues).

If I had to guess, I would think the ParallelClaimableQueueSet is a way
of using the same queue set for phases with a changing number of
threads, while making sure that at the end of a phase all queues are
empty. But I may be wrong :)

Also, the ShenandoahTaskTerminator (from my understanding) only helps
with the contention on the "is-there-any-work-remaining" question, and
with waking up only as many threads as there is work (not everyone).

The second optimization in the paper, as far as I understand it, limits
the number of attempts to steal (i.e. after a thread is woken up, how
often it should try to find work): at the moment,
GenericTaskQueueSet::steal() calls steal_best_of_2() up to 2*#threads^2
times to find work within all task queues, regardless of how many
active threads there are.

The solution in the paper is to call steal_best_of_2() only
2*#live_threads^2 times within all task queues and then go back to see
whether everybody terminated.
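A minimal sketch of that bound, with illustrative stand-ins rather than
the real GenericTaskQueueSet types (all names here are hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Illustrative stand-in for a per-thread task queue; not the HotSpot API.
struct TaskQueue {
  std::deque<int> tasks;
  bool pop(int& t) {
    if (tasks.empty()) return false;
    t = tasks.front();
    tasks.pop_front();
    return true;
  }
};

// Sketch of the paper's bound: try to steal at most 2 * live_threads^2
// times (instead of 2 * max_threads^2) before re-checking termination.
bool try_steal(std::vector<TaskQueue>& queues, std::size_t live_threads,
               int& task, std::size_t& attempts_used) {
  const std::size_t max_attempts = 2 * live_threads * live_threads;
  for (std::size_t i = 0; i < max_attempts; ++i) {
    // The real code picks best-of-2 victims at random; a round-robin
    // scan keeps this sketch deterministic.
    std::size_t victim = i % queues.size();
    if (queues[victim].pop(task)) {
      attempts_used = i + 1;
      return true;
    }
  }
  attempts_used = max_attempts;
  return false;  // give up and re-check whether everybody terminated
}
```

With fewer live threads, a worker burns fewer attempts on empty queues
before it returns to the termination check.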

My suggestion is to not look through all task queues, but, while waking
up the threads (where we already find out how many active tasks there
are), also record which threads were found to have work, and only look
through those task queues when trying to steal.

Even if that is an approximation too (while we are looking for work,
threads might complete their queues), it should make these checks more
effective.
However, given that we first try to steal from the last successfully
stolen queue, we might already be extremely efficient in this area.
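For illustration, a minimal sketch of that snapshot idea (hypothetical
names, not HotSpot code):

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical sketch: while waking up threads, snapshot which queues
// were observed to contain work, and let stealing scan only that set.
struct SimpleQueue {
  std::deque<int> tasks;
};

// Taken at wake-up time, where the active queues are already known.
std::vector<std::size_t> queues_with_work(
    const std::vector<SimpleQueue>& queues) {
  std::vector<std::size_t> ids;
  for (std::size_t i = 0; i < queues.size(); ++i) {
    if (!queues[i].tasks.empty()) ids.push_back(i);
  }
  return ids;
}

// The snapshot is an approximation: a queue may drain between the
// snapshot and the steal attempt, hence the re-check before popping.
bool steal_from_snapshot(std::vector<SimpleQueue>& queues,
                         const std::vector<std::size_t>& snapshot,
                         int& task) {
  for (std::size_t id : snapshot) {
    if (!queues[id].tasks.empty()) {
      task = queues[id].tasks.front();
      queues[id].tasks.pop_front();
      return true;
    }
  }
  return false;
}
```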

So, then only actual work stealing from the queues is left to work on
:)

> > That other change probably works because then we do the can-we-
> > terminate check more often the less work there is; see my question
> > about why not only query the queues with known elements (at the
> > time of waking up the threads).
> > 
> > > 
> > > Compiler.sunflow:
> > > 
> > > Before patch:
> > >     Avg: failed termination 4.747   Worst case: 209
> > > 
> > > After patch:
> > >     Avg: failed termination 0.513   Worst case: 11
> > > 
> > > Although the above result is from one particular run, it is
> > > *very reproducible*, also with other benchmarks.
> > > 
> > > 
> > > Webrev:
> > > http://cr.openjdk.java.net/~zgu/shenandoah/optimized_best_of_2/webrev.00/index.html
> > > 
> > > Hopefully, Aleksey and Thomas have better ways to measure the
> > > impact to overall termination.
> > > 
> > 
> > I can do some perf measurements.
> > 
> > My current overall comment of this change is threefold:
> > 
> > - why not immediately contribute in jdk/jdk? There does not seem to
> > be anything Shenandoah-specific about it.
> 
> All my observations are based on the Shenandoah implementation, which
> has a few parts that diverged from upstream (e.g. the termination
> protocol and queue set)

This change has no impact on the termination protocol. The change
itself looks like it almost applies to the current mainline code. I
will give it a try.

> > 
> > - I do not see a point in having an optional switch; it seems
> > obviously superior, and the paper already provided significant
> > feedback on its usefulness. Aleksey and I will also do
> > measurements. If there were a switch to enable this I would make
> > this one runtime (i.e. as parameter to the queueset), and only
> > maybe, if I really wanted a global, use it as default value for
> > that parameter.
> 
> I would love to drop the switch, if your measurements can confirm
> what I observed, with no negative impacts.
> 
> I am also going to run some tests over the weekend.

Okay.

> > - I would try to make some effort to make this "last-queue-id"
> > thread-local. I guess there will be lots of memory ping-pong
> > between threads reading/writing to the same cacheline(s) all the
> > time.
> > 
> > I.e. at least use a PaddedArray, or pass the last-queue-id to
> > steal_best_of_2() like the seed parameter.
> > 
> 
> Yes, it was also in my mind, will fix it.

In case you opt for the second, instead of adding parameters every time
you need a new thread local variable, it might be useful to add a
single "context" struct containing them, and pass only that around.

Might give nicer looking code.
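Something along these lines (all names hypothetical; the real
steal_best_of_2() signature and randomization differ):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-thread stealing context: instead of adding a new
// parameter to steal_best_of_2() for every thread-local value (seed,
// last-queue-id, ...), bundle them and pass only the struct around.
struct StealContext {
  uint32_t seed;     // per-thread RNG state, like the existing seed parameter
  int last_queue_id; // last successfully stolen-from queue, -1 if none yet
};

// Toy xorshift standing in for the victim-selection randomization.
uint32_t next_random(StealContext& ctx) {
  ctx.seed ^= ctx.seed << 13;
  ctx.seed ^= ctx.seed >> 17;
  ctx.seed ^= ctx.seed << 5;
  return ctx.seed;
}

// Best-of-2 pick: reuse the remembered queue as one candidate when
// available, choose the other at random; the caller would update
// ctx.last_queue_id after a successful steal.
int pick_best_of_2(StealContext& ctx, const int* queue_sizes, int n) {
  int a = (ctx.last_queue_id >= 0) ? ctx.last_queue_id
                                   : (int)(next_random(ctx) % n);
  int b = (int)(next_random(ctx) % n);
  return (queue_sizes[a] >= queue_sizes[b]) ? a : b;
}
```

Since the whole context lives in the worker thread, there is no shared
cache line to ping-pong, and adding another per-thread value later only
touches the struct.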

Thanks,
  Thomas



More information about the shenandoah-dev mailing list