RFR: Optimizing best of two work stealing queue selection

Fri Jun 22 14:56:45 UTC 2018

Hi Zhengyu,

On Fri, 2018-06-22 at 09:32 -0400, Zhengyu Gu wrote:
> Hi,
> 
> This optimization is based on this paper [1]. The idea is to keep
> last successfully stolen queue as one of best of two candidates.

  :)

> My experiments showed that it only marginally improved successful 
> stealing ratio, given the ratio is *very* high to begin with, north
> of 95%, even under extremely unbalanced scenarios (e.g. concurrent
> threads = 3, parallel threads = 11, of course, based on Shenandoah 
> ParallelClaimableQueueSet ).
> 
> However, it showed remarkable improvement on task termination:
> I measured how often worker threads have to exit termination
> protocol and retry (failed termination)

The other change suggested in the paper that only checks
num_live_threads instead of max_num_threads during termination would
also be nice to have, but I think (and you were imho right about this)
you wanted to have this separate :)

That other change probably works because then we do the can-we-
terminate check more often the less work there is, see my question
about why not only query the queues with known-elements (at the time of
waking up the threads).

> 
> Compiler.sunflow:
> 
> Before patch:
>    Avg: failed termination 4.747   Worst case: 209
> 
> After patch:
>    Avg: failed termination 0.513   Worst case: 11
> 
> Although, above result is from one particular run, but it is *very 
> reproducible* and also with other benchmarks.
> 
> 
> Webrev: 
> http://cr.openjdk.java.net/~zgu/shenandoah/optimized_best_of_2/webrev
> .00/index.html
> 
> Hopefully, Aleksey and Thomas have better ways to measure the impact
> to overall termination.
> 

I can do some perf measurements.

My current overall comment of this change is threefold:

- why not immediately contribute in jdk/jdk? There does not seem to be
anything Shenandoah-specific about it.

- I do not see a point in having an optional switch; it seems obviously
superior, and the paper already provided significant feedback on its
usefulness. Aleksey and me will also do measurements. If there were a
switch to enable this I would make this one runtime (i.e. as parameter
to the queueset), and only maybe, if I really wanted a global, use it
as default value for that parameter.

- I would try to make some effort to make this "last-queue-id" thread-
local. I guess there will be lots of memory ping-pong between threads
reading/writing to the same cacheline(s) all the time.

I.e. at least use a PaddedArray, or pass the last-queue-id to
steal_best_of_2() like the seed parameter.

Thanks,
  Thomas