Potential optimization to the GC termination protocol in JDK8

Fri Jun 8 16:16:43 UTC 2018

On Fri, Jun 8, 2018 at 2:35 AM Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
>
> Hi,
>

Hi Thomas,

Thanks for your feedback.

>   sorry for being somewhat late in that thread... busy lately...
>
> On Thu, 2018-04-26 at 08:53 +0200, Roman Kennke wrote:
> > > > [Problem]
> > > > The work stealing during the GC takes lots of time to terminate.
> > > > The Parallel Scavenge in JDK 8 uses a distributed termination
> > > > protocol to synchronize GC threads. After 2 ∗ N consecutive
> > > > unsuccessful steal attempts (steal_best_of_2 function), a GC
> > > > thread enters the termination procedure, where N is the number of
> > > > GC threads.
> > > >
> > > > Suppose there are N GC threads, it takes 2 * N * N failed
> > > > attempts before a GC stops. It is inefficient and takes too much
> > > > time, especially when there are very few GC threads alive. If
> > > > there are hundreds or thousands of GC happen during the app
> > > > execution, that is a big waste of time.
> > > >
> > > > [Solution]
> > > > Is that possible to reduce the number of steal attempt during the
> > > > end of GC? My idea is to record the number of active GC threads
> > > > (i.e., N_live) that are not yet in the termination protocol. A
> > > > thread only steals from the pool of active GC threads because the
> > > > inactive GC threads have no task to be stolen. Accordingly, the
> > > > criteria for thread termination becomes 2 ∗ N_live consecutive
> > > > failed steal attempts. It can reduce the steal attempts to half.
> > > >
> > > > Good forward for others' feedbacks.
>
> Sure we are interested in such an enhancement and would appreciate a
> contribution. However new work should be conducted in the latest
> version only, i.e. at this time JDK 11 (and probably JDK 12 by the time
> such a change would be pushed given the FC date of end of June).
>

Currently, I worked on the JDK 8. I am not sure whether there exist
lots of code changes
in the latest JDK 11 or 12.

> Maybe by coincidence you might be one of the authors of a recent paper
> [1] demonstrating this? If you are interested in contributing, please
> have a look at the corresponding wiki page [4].
>

Yes, I am one of the authors of the paper.

> > > > Thanks.
> > > > Tony
> >
> > Hi Tony,
> > in Shenandoah we have implemented an improved termination protocol.
> > See the comment here:
> >
> > http://hg.openjdk.java.net/shenandoah/jdk/file/b4a3595f7c56/src/hotsp
> > ot/share/gc/shenandoah/shenandoahTaskqueue.hpp#l93
> >
> > We intend to upstream this stuff very soon, at which point it can be
> > used by all other GCs if they wish. It's a drop-in replacement for
> > the existing task queue code.
> >
> > It sounds like it's similar to what you have in mind. Right?
>
> The ShenandoahTaskTerminator is different in that it optimizes the
> amount of woken up threads and the detection of that there is work to
> steal.
>

Thanks for the introduction of ShenandoahTaskTerminator. I will read
the code later.
Actually, our optimization on GC terminator is similar as you
described. We also optimize stealing
in two ways:
1, the amount of live GC threads to control the steal total number;
2, who or which thread to steal.
Our modification of steal on JDK 8 can be found on:
https://github.com/tonys24/optimized-steal

The basic idea of our work is simple:
To realize target 1, we replace _n with live_number for live GC threads.
https://github.com/tonys24/optimized-steal/blob/master/file6.diff
To realize target 2, we replace steal_best_of_2 with
steal_best_of_2_kun. We picked two
threads, one is stolen successfully last time and the other is picked
randomly. Then we select
the one which has longer queue size.

I will read the code of ShenandoahTaskTerminator to see how can we
contribute to the latest
code of OpenJDK.
Thank you so much for your email.

Best
Tony

> The suggestion presented here improves work stealing itself, i.e. how
> the woken up threads search for threads.
>
> Note that I think that with the ShenandoahTaskTerminator such an
> implementation would be easier to implement than with the existing
> ParallelTaskTerminator.
>
> There are other ideas and information about known not-optimal-
> implementation floating around in literature (e.g. [1][2][3], but
> recently I looked there are a lots) to improve this even further.
>
> Thanks,
>   Thomas
>
> [1] Kun Suo, Jia Rao, Hong Jiang, and Witawas Srisa-an. 2018.
> Characterizing and optimizing hotspot parallel garbage collection on
> multicore systems. In Proceedings of the Thirteenth EuroSys Conference
> (EuroSys '18). ACM, New York, NY, USA, Article 35, 15 pages. DOI: https
> ://doi.org/10.1145/3190508.3190512
> [2] Junjie Qian, Witawas Srisa-an, Du Li, Hong Jiang, Sharad Seth, and
> Yaodong Yang. 2015. SmartStealing: Analysis and Optimization of Work
> Stealing in Parallel Garbage Collection for Java VM. In Proceedings of
> the Principles and Practices of Programming on The Java Platform (PPPJ
> '15). ACM, New York, NY, USA, 170-181. DOI: http://dx.doi.org/10.1145/2
> 807426.2807441
> [3] Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro, and Nhan
> Nguyen. 2015. NumaGiC: a Garbage Collector for Big Data on Big NUMA
> Machines. In Proceedings of the Twentieth International Conference on
> Architectural Support for Programming Languages and Operating Systems
> (ASPLOS '15). ACM, New York, NY, USA, 661-673. DOI: https://doi.org/10.
> 1145/2694344.2694361
> [4] http://openjdk.java.net/contribute/

-- 

**********************************
> Tony Suo
> Computer Science, University of Texas at Arlington
**********************************