RFR: Optimizing best of two work stealing queue selection

Fri Jun 22 15:27:18 UTC 2018

Hi Thomas,

Thanks for the suggestions! Please see comments inline.

On 06/22/2018 10:56 AM, Thomas Schatzl wrote:
> Hi Zhengyu,
> 
> On Fri, 2018-06-22 at 09:32 -0400, Zhengyu Gu wrote:
>> Hi,
>>
>> This optimization is based on this paper [1]. The idea is to keep
>> last successfully stolen queue as one of best of two candidates.
> 
>    :)
> 
>> My experiments showed that it only marginally improved successful
>> stealing ratio, given the ratio is *very* high to begin with, north
>> of 95%, even under extremely unbalanced scenarios (e.g. concurrent
>> threads = 3, parallel threads = 11, of course, based on Shenandoah
>> ParallelClaimableQueueSet ).
>>
>> However, it showed remarkable improvement on task termination:
>> I measured how often worker threads have to exit termination
>> protocol and retry (failed termination)
> 
> The other change suggested in the paper that only checks
> num_live_threads instead of max_num_threads during termination would
> also be nice to have, but I think (and you were imho right about this)
> you wanted to have this separate :)

Be honest, I have yet studied this optimization, cause Shenandoah has 
itself queue set implementation, I am not sure if it is applicable to us.

> 
> That other change probably works because then we do the can-we-
> terminate check more often the less work there is, see my question
> about why not only query the queues with known-elements (at the time of
> waking up the threads).
> 
>>
>> Compiler.sunflow:
>>
>> Before patch:
>>     Avg: failed termination 4.747   Worst case: 209
>>
>> After patch:
>>     Avg: failed termination 0.513   Worst case: 11
>>
>> Although, above result is from one particular run, but it is *very
>> reproducible* and also with other benchmarks.
>>
>>
>> Webrev:
>> http://cr.openjdk.java.net/~zgu/shenandoah/optimized_best_of_2/webrev
>> .00/index.html
>>
>> Hopefully, Aleksey and Thomas have better ways to measure the impact
>> to overall termination.
>>
> 
> I can do some perf measurements.
> 
> My current overall comment of this change is threefold:
> 
> - why not immediately contribute in jdk/jdk? There does not seem to be
> anything Shenandoah-specific about it.

All my observation is based Shenandoah implementation, which has a few 
parts that diverged from upstream (e.g. termination protocol and queue set)

> 
> - I do not see a point in having an optional switch; it seems obviously
> superior, and the paper already provided significant feedback on its
> usefulness. Aleksey and me will also do measurements. If there were a
> switch to enable this I would make this one runtime (i.e. as parameter
> to the queueset), and only maybe, if I really wanted a global, use it
> as default value for that parameter.

I would love to drop the switch, if you guys measurements can confirm 
what I observed, and not negative impacts.

I am also going to run some tests over the weekend.

> 
> - I would try to make some effort to make this "last-queue-id" thread-
> local. I guess there will be lots of memory ping-pong between threads
> reading/writing to the same cacheline(s) all the time.
> 
> I.e. at least use a PaddedArray, or pass the last-queue-id to
> steal_best_of_2() like the seed parameter.
> 

Yes, it was also in my mind, will fix it.

Thanks,

-Zhengyu

> Thanks,
>    Thomas
>