RFR: Optimizing best of two work stealing queue selection

Zhengyu Gu zgu at redhat.com
Mon Jun 25 13:09:26 UTC 2018


Hi,

I did a few runs of the Compiler.sunflow benchmark over the past 
weekend, with different task terminators and best-of-2 queue selection 
algorithms. One factor that is not reflected in the flags is 
ShenandoahParallelClaimableQueueSet, which I believe has an impact on 
termination.

It looks like the optimization always has a positive impact, and the 
results were pretty consistent. The numbers below are failed termination 
counts (average and worst case), as in my earlier measurements. Are the 
results convincing enough to propose this upstream?


* Shenandoah Repo with patch [1]:
                                           ShenandoahGC
                                         Avg.      Worst
+ShenandoahOWST/+OptimizedBestOfTwo     0.145      16
+ShenandoahOWST/-OptimizedBestOfTwo     1.552      269
-ShenandoahOWST/+OptimizedBestOfTwo     0.341      52
-ShenandoahOWST/-OptimizedBestOfTwo     2.327      384

                               G1
                        Avg.       Worst
+OptimizedBestOfTwo    23.049     335
-OptimizedBestOfTwo    43.632     930



* JDK repo with patch [2]
                               G1
                        Avg.       Worst
+OptimizedBestOfTwo    17.496     469
-OptimizedBestOfTwo    26.670     1062


[3] and [4] are the full logs.

Thanks,

-Zhengyu


[1] http://cr.openjdk.java.net/~zgu/tq_terminator/shenandoah.patch
[2] http://cr.openjdk.java.net/~zgu/tq_terminator/jdk.patch
[3] http://cr.openjdk.java.net/~zgu/tq_terminator/shenandoah.log
[4] http://cr.openjdk.java.net/~zgu/tq_terminator/jdk.log





On 06/22/2018 02:08 PM, Thomas Schatzl wrote:
> Hi Zhengyu,
> 
> On Fri, 2018-06-22 at 11:27 -0400, Zhengyu Gu wrote:
>> Hi Thomas,
>>
>> Thanks for the suggestions! Please see comments inline.
>>
>> On 06/22/2018 10:56 AM, Thomas Schatzl wrote:
>>> Hi Zhengyu,
>>>
>>> On Fri, 2018-06-22 at 09:32 -0400, Zhengyu Gu wrote:
>>>> Hi,
>>>>
>>>> This optimization is based on this paper [1]. The idea is to keep
>>>> the last successfully stolen queue as one of the best-of-two
>>>> candidates.
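[Editor's note: a minimal standalone sketch of the selection scheme described above. This is an illustrative model only, not the actual HotSpot GenericTaskQueue code; TaskQueue, select_victim, NO_QUEUE, and the queue layout are made-up names for this example.]

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Toy stand-in for a worker's task queue; not the HotSpot class.
struct TaskQueue {
  std::vector<int> tasks;
  size_t size() const { return tasks.size(); }
  bool pop(int& out) {
    if (tasks.empty()) return false;
    out = tasks.back();
    tasks.pop_back();
    return true;
  }
};

const size_t NO_QUEUE = static_cast<size_t>(-1);

// Best-of-2 victim selection: if we remember a queue we successfully
// stole from last time, reuse it as one of the two candidates;
// otherwise pick both candidates at random. The fuller candidate wins.
size_t select_victim(const std::vector<TaskQueue>& queues,
                     size_t self, size_t last_stolen) {
  size_t k1 = (last_stolen != NO_QUEUE && last_stolen != self)
                  ? last_stolen
                  : rand() % queues.size();
  size_t k2 = rand() % queues.size();
  if (k1 == self) k1 = (k1 + 1) % queues.size();  // never steal from self
  if (k2 == self) k2 = (k2 + 1) % queues.size();
  return (queues[k1].size() > queues[k2].size()) ? k1 : k2;
}

bool steal_best_of_2(std::vector<TaskQueue>& queues, size_t self,
                     size_t& last_stolen, int& task) {
  size_t victim = select_victim(queues, self, last_stolen);
  if (queues[victim].pop(task)) {
    last_stolen = victim;  // remember the productive victim
    return true;
  }
  last_stolen = NO_QUEUE;  // forget a victim that turned out empty
  return false;
}
```

The only change relative to plain best-of-2 is that one of the two sampled candidates is replaced by the remembered victim, which tends to keep stealing from a queue that recently had surplus work.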
>>>
>>>     :)
>>>
>>>> My experiments showed that it only marginally improved the
>>>> successful stealing ratio, given that the ratio is *very* high to
>>>> begin with, north of 95%, even under extremely unbalanced scenarios
>>>> (e.g. concurrent threads = 3, parallel threads = 11, based, of
>>>> course, on the Shenandoah ParallelClaimableQueueSet).
>>>>
>>>> However, it showed a remarkable improvement in task termination:
>>>> I measured how often worker threads have to exit the termination
>>>> protocol and retry (failed terminations).
>>>
>>> The other change suggested in the paper that only checks
>>> num_live_threads instead of max_num_threads during termination
>>> would also be nice to have, but I think (and you were imho right
>>> about this) you wanted to have this separate :)
>>
>> To be honest, I have not yet studied this optimization. Because
>> Shenandoah has its own queue set implementation, I am not sure if it
>> is applicable to us.
> 
> If you mean ParallelClaimableQueueSet: I can't see how it would help in
> finding the queues that still have work during stealing (or in case of
> the mentioned optimization, do more frequent everybody-terminated-yet
> checks based on number of active queues).
> 
> If I had to guess, I would think the ParallelClaimableQueueSet is a way
> of using the same queue set for phases with a changing number of
> threads, while making sure that at the end of a phase all queues are
> empty. But I may be wrong :)
> 
> Also the ShenandoahTaskTerminator only helps (from my understanding)
> with the contention on the "is-there-any-work-remaining" question and
> only waking up as many threads as there is work (not everyone).
> 
> The second optimization in the paper, as far as I understand it, is
> limiting the number of attempts to steal (i.e. after a thread is
> woken up, how often it should try to find work) - i.e. at the moment,
> in GenericTaskQueueSet::steal(), steal_best_of_2() is called up to
> 2*#threads^2 times to find work within all task queues, regardless of
> how many active threads there are.
> 
> The solution in the paper is to call steal_best_of_2() only 2*#live-
> threads^2 times within all task queues and then go back to see if
> everybody terminated.
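[Editor's note: the bound described above can be sketched as follows. This is an illustrative model, not the real GenericTaskQueueSet::steal(); try_steal stands in for steal_best_of_2().]

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Upper bound on consecutive failed steal attempts before a worker
// re-enters the termination protocol: 2 * n^2, where n is the number
// of *live* threads rather than the configured maximum.
inline size_t steal_attempt_bound(size_t live_threads) {
  return 2 * live_threads * live_threads;
}

// Keep trying to steal until success or the bound is hit; on failure
// the caller goes back to check whether everybody has terminated.
bool steal_with_bound(size_t live_threads,
                      const std::function<bool()>& try_steal) {
  size_t bound = steal_attempt_bound(live_threads);
  for (size_t i = 0; i < bound; i++) {
    if (try_steal()) return true;
  }
  return false;  // give up for now, recheck termination
}
```

With the unbalanced scenario mentioned earlier (3 live threads out of 11), the bound drops from 2*11^2 = 242 attempts to 2*3^2 = 18, so idle workers get back to the termination check much sooner.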
> 
> My suggestion is to not look through all task queues, but, while
> waking up the threads - where we already find out how many active
> tasks there are - to also record which threads were found to have
> work, and only look through those task queues when trying to steal.
> 
> Even if that is an approximation too (while we are looking for work,
> threads might complete their queues), it should make these tests more
> effective.
> However, given that each thread first tries to steal from its own
> queue, we might already be extremely efficient in this area.
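[Editor's note: one way to picture this suggestion, as a standalone sketch with made-up names; the recorded set is only a hint, since a listed queue may drain before anyone steals from it.]

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// At wake-up time (where the number of active queues is already being
// counted), also snapshot *which* queues are non-empty.
std::vector<size_t>
snapshot_nonempty(const std::vector<std::vector<int>>& queues) {
  std::vector<size_t> hint;
  for (size_t i = 0; i < queues.size(); i++) {
    if (!queues[i].empty()) hint.push_back(i);
  }
  return hint;
}

// Try to steal one task, scanning only the hinted queues rather than
// the whole queue set.
bool steal_from_hinted(std::vector<std::vector<int>>& queues,
                       const std::vector<size_t>& hint, int& task) {
  for (size_t idx : hint) {
    if (!queues[idx].empty()) {  // may have drained since the snapshot
      task = queues[idx].back();
      queues[idx].pop_back();
      return true;
    }
  }
  return false;
}
```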
> 
> So, then only actual work stealing from the queues is left to work on
> :)
> 
>>> That other change probably works because then we do the can-we-
>>> terminate check more often the less work there is, see my question
>>> about why not only query the queues with known-elements (at the
>>> time of waking up the threads).
>>>
>>>>
>>>> Compiler.sunflow:
>>>>
>>>> Before patch:
>>>>      Avg: failed termination 4.747   Worst case: 209
>>>>
>>>> After patch:
>>>>      Avg: failed termination 0.513   Worst case: 11
>>>>
>>>> The above result is from one particular run, but it is
>>>> *very reproducible*, also with other benchmarks.
>>>>
>>>>
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~zgu/shenandoah/optimized_best_of_2/webrev.00/index.html
>>>>
>>>> Hopefully, Aleksey and Thomas have better ways to measure the
>>>> impact on overall termination.
>>>>
>>>
>>> I can do some perf measurements.
>>>
>>> My current overall comment of this change is threefold:
>>>
>>> - why not immediately contribute this to jdk/jdk? There does not
>>> seem to be anything Shenandoah-specific about it.
>>
>> All my observations are based on the Shenandoah implementation, which
>> has a few parts that diverged from upstream (e.g. the termination
>> protocol and queue set).
> 
> This change has no impact on the termination protocol. The change
> itself looks like it almost applies to the current mainline code. I
> will give it a try.
> 
>>>
>>> - I do not see a point in having an optional switch; it seems
>>> obviously superior, and the paper already provided significant
>>> feedback on its usefulness. Aleksey and I will also do
>>> measurements. If there were a switch to enable this, I would make
>>> it a runtime one (i.e. a parameter to the queue set), and only
>>> maybe, if I really wanted a global, use it as the default value for
>>> that parameter.
>>
>> I would love to drop the switch if your measurements confirm what I
>> observed, with no negative impacts.
>>
>> I am also going to run some tests over the weekend.
> 
> Okay.
> 
>>> - I would try to make some effort to make this "last-queue-id"
>>> thread-local. I guess there will be lots of memory ping-pong
>>> between threads reading/writing to the same cacheline(s) all the
>>> time.
>>>
>>> I.e. at least use a PaddedArray, or pass the last-queue-id to
>>> steal_best_of_2() like the seed parameter.
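[Editor's note: a sketch of the padding idea mentioned above. The 64-byte cache line size and the PaddedLastQueueId name are illustrative assumptions, not HotSpot's actual PaddedArray implementation.]

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const size_t CACHE_LINE_SIZE = 64;  // typical; platform-dependent

// One last-stolen-queue slot per worker, each padded out to a full
// cache line, so one worker updating its slot does not invalidate the
// line caching another worker's slot (avoiding the cacheline
// ping-pong described above).
struct alignas(CACHE_LINE_SIZE) PaddedLastQueueId {
  uint32_t id;
  char pad[CACHE_LINE_SIZE - sizeof(uint32_t)];
};
```

An array of these (one element per worker thread) keeps the per-thread state logically shared but physically isolated; the alternative mentioned in the thread is to pass the last-queue id into steal_best_of_2() as an extra by-reference parameter, like the existing seed.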
>>>
>>
>> Yes, it was also on my mind; I will fix it.
> 
> In case you opt for the second, instead of adding parameters every time
> you need a new thread local variable, it might be useful to add a
> single "context" struct containing them, and pass only that around.
> 
> Might give nicer looking code.
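[Editor's note: a sketch of the "context" idea. StealContext and next_random are hypothetical names; the point is only that everything the steal path mutates per thread lives in one struct passed by reference.]

```cpp
#include <cassert>
#include <cstdint>

// Per-thread stealing state: the seed already passed to
// steal_best_of_2() plus the new last-queue id, bundled so new
// thread-local fields don't grow every call signature.
struct StealContext {
  uint32_t seed;        // per-thread seed for random victim selection
  uint32_t last_queue;  // queue we last stole from successfully
};

// Simple xorshift step standing in for the victim-selection RNG;
// all mutation goes through the context.
uint32_t next_random(StealContext& ctx) {
  ctx.seed ^= ctx.seed << 13;
  ctx.seed ^= ctx.seed >> 17;
  ctx.seed ^= ctx.seed << 5;
  return ctx.seed;
}
```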
> 
> Thanks,
>    Thomas
> 


More information about the shenandoah-dev mailing list