Potential optimization to the GC termination protocol in JDK8

Mon Jun 11 11:04:31 UTC 2018

Hi Tony,

On Fri, 2018-06-08 at 11:16 -0500, T S wrote:
> On Fri, Jun 8, 2018 at 2:35 AM Thomas Schatzl <thomas.schatzl at oracle.
> com> wrote:
> > 
> > Hi,
> > 
> 
> Hi Thomas,
> 
> Thanks for your feedback.
> 
> >   sorry for being somewhat late in that thread... busy lately...
> > 
> > On Thu, 2018-04-26 at 08:53 +0200, Roman Kennke wrote:
> > > > > [Problem]
> > > > > The work stealing during the GC takes lots of time to
> > > > > terminate.
> > > > >  [...]
> 
> > > Hi Tony,
> > > in Shenandoah we have implemented an improved termination
> > > protocol.
> > > See the comment here:
> > > 
> > > http://hg.openjdk.java.net/shenandoah/jdk/file/b4a3595f7c56/src/h
> > > otsp
> > > ot/share/gc/shenandoah/shenandoahTaskqueue.hpp#l93
> > > 
> > > We intend to upstream this stuff very soon, at which point it can
> > > be used by all other GCs if they wish. It's a drop-in replacement
> > > for the existing task queue code.
> > > 
> > > It sounds like it's similar to what you have in mind. Right?
> > 
> > The ShenandoahTaskTerminator is different in that it optimizes the
> > amount of woken up threads and the detection of that there is work
> > to steal.
> > 
> 
> Thanks for the introduction of ShenandoahTaskTerminator. I will read
> the code later.
> Actually, our optimization on GC terminator is similar as you
> described. We also optimize stealing

The ShenandoahTaskTerminator does not change the actual stealing itself
after non-busy threads are woken up.

It improves the process of waking up threads by (as far as I understand
;) of course):
a) only waking up as many threads as there are work items
b) only one (changing) master thread looks through the queues for work

This is independent and so complementary to the work provided by you.

> in two ways:
> 1, the amount of live GC threads to control the steal total number;
> 2, who or which thread to steal.
> Our modification of steal on JDK 8 can be found on:
> https://github.com/tonys24/optimized-steal
> 
> The basic idea of our work is simple:
> To realize target 1, we replace _n with live_number for live GC
> threads.

I have one question about that, that I think the paper also did not
answer specifically (maybe some interpretation error from me).

So steal_best_of_2, instead of doing 2*N*N (N = number of threads)
steal attempts, it does 2*N_live*N_live steal attempts.

Is there any provision to do these N_live steal attempts only on
threads that actually have work?

I can see that just by reducing the number of steal attempts we improve
performance (I have done these experiments maybe a year or more ago,
like only doing 2*N attempts or other random iteration counts), but
just reducing the steal attempts seems to make it often fall back to
trying to terminate.

Consider the situation when N_live is much smaller than N - then the
probability that a given thread trying to find work is small, and so
there may be a lot of unusuccessful steal attempts.

It seems to be much better to not only do N_live attempts, but do these
attempts on only the queues of threads that are live.

The second optimization shown by the paper, trying to keep stealing
from the same thread as before probably mitigates this issue of
unsuccessful stealing attempts quite a bit, but still...

Property b) of the shenandoah task terminator might help here.

> https://github.com/tonys24/optimized-steal/blob/master/file6.diff
> To realize target 2, we replace steal_best_of_2 with
> steal_best_of_2_kun. We picked two
> threads, one is stolen successfully last time and the other is picked
> randomly. Then we select the one which has longer queue size.

Yeah, that makes sense.

Thanks,
  Thomas