RFR: Optimizing best of two work stealing queue selection

Fri Jun 22 19:04:26 UTC 2018

Hi Thomas,
> 
> If you mean ParallelClaimableQueueSet: I can't see how it would help in
> finding the queues that still have work during stealing (or in case of
> the mentioned optimization, do more frequent everybody-terminated-yet
> checks based on number of active queues).
> 
> If I had to guess, I would think the ParallelClaimableQueueSet of a way
> of using the same queue set for phases with a changing number of
> threads but make sure that at the end of a phase all queues are empty.
> But I may be wrong :)
> 
Basically, yes.

But I suspect it has effect to balance queues somehow, since I did not 
see high stealing failure rates mentioned in paper.

Normal usage pattern for ParallelClaimableQueueSet, is that, workers 
compete to claim *extra* queues and process them first, in the process, 
they push new tasks to their *owned" queues. In theory, when we enter 
termination phase, we can exclude *extra* queues, since they should all 
be empty - it is on my TODO list for next week.

> Also the ShenandoahTaskTerminator only helps (from my understanding)
> with the contention on the "is-there-any-work-remaining" question and
> only waking up as many threads as there is work (not everyone).

Yes, we only wake up enough workers for remaining tasks.

> 
> The second optimization in the paper, as far as I understand it, is
> limiting on the number of attempts to steal (i.e. after the thread is
> woken up, how often should he try to find work) - i.e. at the moment in
> GenericTaskQueueSet::steal(), steal_best_of_2() is called 2*#threads^2
> attempts to find work within all task queues regardless of how many
> active threads there are.
> 
> The solution in the paper is to call steal_best_of_2() only 2*#live-
> threads^2 within all task queues and then go back to see if everybody
> terminated.

Sounds simply enough, I will give it a try next week.

> 
> My suggestion is to not look through all task queues, but somehow
> during waking up the threads, where we already find out how many active
> tasks there are, also record which threads were found to have work.
> And only look through these task queues when trying to steal.
> 
> Even if that is an approximation too (while we are looking for work,
> threads might complete their queues), it should make these tests more
> effective.
> However given we first try to steal on the queue of the thread we might
> already be extremely efficient in this area.

Eliminating *extra* queues may improve efficiency ... will see if I can 
squeeze some juice here.

>>
>> Yes, it was also in my mind, will fix it.
> 
> In case you opt for the second, instead of adding parameters every time
> you need a new thread local variable, it might be useful to add a
> single "context" struct containing them, and pass only that around.
> 
I notice that ZGC started to use "__thread variables" for thread local 
variables. Could you give us some insights if this is the prefer way to 
go? If there is any compatibility issues?

Thanks,

-Zhengyu

> Might give nicer looking code.
> 
> Thanks,
>    Thomas
>