[Bug] Load imbalance of Tasks among GCTaskThreads during the Young GC

Thomas Schatzl thomas.schatzl at oracle.com
Wed Nov 23 09:52:20 UTC 2016


Hi,

  some thoughts...

On Mon, 2016-11-21 at 11:15 -0600, Tony S wrote:
> Hi Thomas,
> 
> Thank you for your reply.
> 
> On Mon, Nov 21, 2016 at 3:22 AM, Thomas Schatzl
> <thomas.schatzl at oracle.com> wrote:
> > 
> > Hi,
> > 
> > On Fri, 2016-11-18 at 22:54 -0600, Tony S wrote:
> > > 
> > > Hi,
> > > 
> > > [Description]
> > > src/share/vm/gc_implementation/parallelScavenge/psScavenge.cpp
> > > 
> > > PSPromotionManager enqueues various tasks (ThreadRootsTask /
> > > StealTask / ScavengeRootsTask / OldToYoungRootsTask) into
> > > GCTaskQueue.
> > > GCTaskThread asked GCTaskManager to call get_task() to obtain
> > > tasks from the GCTaskQueue. However, I found the tasks are not
> > > distributed evenly among the GCTaskThreads. Some GCTaskThreads
> > > get more tasks than others.
> > This imbalance could have several reasons:
> > 
> > - startup of some tasks is delayed, e.g. due to scheduling
> > threads(s) X gets activated a significant amount of time later than
> > other threads.
> > 
> I am not sure. I tried to bind all the GCTaskThreads to different
> cores to avoid these threads interfering with each other (like below).
> 
[...]
> 
> However, the issue still exists there.

I do not think one can rule that issue out without measuring the delay
between the notification of the worker threads (in
GCTaskManager::add_list()) and the point where a worker thread actually
starts the work (i.e. ends up in the do_it() method of its first task).

This delay between the notification and the thread actually starting
work might be, relatively speaking, a considerable amount of time for
whatever reason, even if you bind the threads to cores.

See e.g. https://bugs.openjdk.java.net/browse/JDK-8087324. I do not
think something similar is implemented (or has been tried out) for
parallel gc.
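
To get a feeling for that delay, one could timestamp the point where
the workers are notified and the point where each worker enters its
first task. Below is a standalone sketch of that measurement pattern
(plain C++ with std::chrono and a condition variable; this is not the
HotSpot code, and in the JVM the analogous points would be the end of
GCTaskManager::add_list() and the entry of the first do_it()):

  // Standalone sketch, not HotSpot code: measure the gap between the
  // moment a "dispatcher" notifies waiting workers and the moment each
  // worker actually gets to run its first task.
  #include <chrono>
  #include <condition_variable>
  #include <cstdio>
  #include <mutex>
  #include <thread>
  #include <vector>

  using Clock = std::chrono::steady_clock;

  int main() {
    const int num_workers = 4;
    std::mutex m;
    std::condition_variable cv;
    bool work_available = false;
    Clock::time_point notify_time;
    std::vector<long long> wakeup_ns(num_workers, 0);
    std::vector<std::thread> workers;

    for (int i = 0; i < num_workers; i++) {
      workers.emplace_back([&, i] {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return work_available; });
        // Earliest point at which this worker could start its first task.
        wakeup_ns[i] = std::chrono::duration_cast<std::chrono::nanoseconds>(
                           Clock::now() - notify_time).count();
        // ... the actual task work would run here ...
      });
    }

    // Give the workers a moment to block on the condition variable.
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    {
      std::lock_guard<std::mutex> lock(m);
      work_available = true;
      notify_time = Clock::now();   // analogous to the end of add_list()
    }
    cv.notify_all();

    for (auto& t : workers) t.join();
    for (int i = 0; i < num_workers; i++)
      std::printf("worker %d wakeup latency: %lld ns\n", i, wakeup_ns[i]);
    return 0;
  }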

> > - the work unit per task is so small, completing some of these
> > tasks is
> > so quick that a thread immediately competes for more work with the
> > other tasks. Due to scheduling that task might somehow get
> > favorable
> > treatment (e.g. OldToYoungRootsTask) by the scheduler (like: it is
> > already running and has not used up its time slice).
> > 
> The execution time for some tasks is small. That is right. For
> example, the tasks below only need hundreds to thousands of
> nanoseconds.
> 
>  OldToYoungRootsTask time:3961  GCTaskThread ID:13
>  ScavengeRootsTask time:1492  GCTaskThread ID:14
>  ThreadRootsTask time:995  GCTaskThread ID:14
> 
> I do not think it is due to the scheduler. I changed the Linux
> scheduler to FIFO instead of CFS. Here are the results. Under FIFO,
> the OldToYoungRootsTasks are still not distributed evenly; only
> GCTaskThreads 0, 3 and 6 execute them.
> 
>   OldToYoungRootsTask time:26748 GCTaskThread ID:6
>   OldToYoungRootsTask time:9327 GCTaskThread ID:3
>   OldToYoungRootsTask time:20358 GCTaskThread ID:0
>   OldToYoungRootsTask time:3413 GCTaskThread ID:6
>   OldToYoungRootsTask time:609 GCTaskThread ID:0
>   OldToYoungRootsTask time:3354 GCTaskThread ID:3
>   OldToYoungRootsTask time:947 GCTaskThread ID:6
>   OldToYoungRootsTask time:2007 GCTaskThread ID:0
>   OldToYoungRootsTask time:861 GCTaskThread ID:6
>   OldToYoungRootsTask time:3364 GCTaskThread ID:3
>   OldToYoungRootsTask time:1493 GCTaskThread ID:0
>   OldToYoungRootsTask time:740 GCTaskThread ID:6
>   OldToYoungRootsTask time:349 GCTaskThread ID:3
>   OldToYoungRootsTask time:73 GCTaskThread ID:0

Are the other threads actually trying to steal work from the queue at
that time? Or are they still in the process of being woken up?

> > > [...]
> > > only the StealTasks are executed evenly on the different
> > > GCTaskThreads. The other tasks are executed on only a few
> > > GCTaskThreads, not evenly distributed.
> > > 
> > > When I increase the number of mutator threads and thus have more
> > > tasks (ThreadRootsTask / StealTask / ScavengeRootsTask /
> > > OldToYoungRootsTask), however, those tasks are still not
> > > distributed evenly among the GCTaskThreads, which causes the load
> > > imbalance and thus prolongs the Young GC time.

Just imo: given the durations in your trace below (in ns, presumably),
this imbalance at the start does not seem that critical compared to how
long the actual work (StealTask) takes.
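
Rough arithmetic on the trace below: all of the OldToYoungRootsTask /
ScavengeRootsTask / ThreadRootsTask entries together add up to well
under 0.2 ms, while every single StealTask runs for 3-4 ms. Even a
perfectly balanced root phase would therefore only shave a small
fraction off the pause.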

> > The trace above does not include start/end times of the threads and
> > tasks so it is hard to diagnose any actual problem, its source, or
> > if there is actually one. I.e. StealTasks of different threads
> > might take a different amount of time, so they may actually finish
> > at approximately the same time anyway? Did you measure how long the
> > threads are actually waiting for work (or completion of other
> > threads)?
> > 
> I added start and end timestamps in the JDK to measure the execution
> time of the different tasks; here are the results (using Ubuntu 16.10,
> scheduler: default CFS, machine: Dell T430, benchmark: DaCapo
> lusearch):
> 
>   OldToYoungRootsTask time:19213  GCTaskThread ID:10
>   OldToYoungRootsTask time:5680  GCTaskThread ID:13
>   OldToYoungRootsTask time:13489  GCTaskThread ID:14
>   OldToYoungRootsTask time:944  GCTaskThread ID:14
>   OldToYoungRootsTask time:821  GCTaskThread ID:14
>   OldToYoungRootsTask time:6227  GCTaskThread ID:13
>   OldToYoungRootsTask time:768  GCTaskThread ID:14
>   OldToYoungRootsTask time:2729  GCTaskThread ID:13
>   OldToYoungRootsTask time:745  GCTaskThread ID:13
>   OldToYoungRootsTask time:4095  GCTaskThread ID:14
>   OldToYoungRootsTask time:774  GCTaskThread ID:13
>   OldToYoungRootsTask time:828  GCTaskThread ID:14
>   OldToYoungRootsTask time:384  GCTaskThread ID:14
>   ScavengeRootsTask time:1248  GCTaskThread ID:14
>   OldToYoungRootsTask time:3961  GCTaskThread ID:13
>   ScavengeRootsTask time:1492  GCTaskThread ID:14
>   ThreadRootsTask time:995  GCTaskThread ID:14
>   ThreadRootsTask time:590  GCTaskThread ID:14
>   ThreadRootsTask time:437  GCTaskThread ID:14
>   ThreadRootsTask time:1341  GCTaskThread ID:2
>   ThreadRootsTask time:377  GCTaskThread ID:14
>   ThreadRootsTask time:310  GCTaskThread ID:2
>   ThreadRootsTask time:431  GCTaskThread ID:14
>   ThreadRootsTask time:376  GCTaskThread ID:2
>   ThreadRootsTask time:277  GCTaskThread ID:2
>   ThreadRootsTask time:400  GCTaskThread ID:2
>   ThreadRootsTask time:3354  GCTaskThread ID:14
>   ThreadRootsTask time:951  GCTaskThread ID:2
>   ThreadRootsTask time:584  GCTaskThread ID:14
>   ThreadRootsTask time:1592  GCTaskThread ID:8
>   OldToYoungRootsTask time:50418  GCTaskThread ID:10
>   ThreadRootsTask time:1255  GCTaskThread ID:8
>   ScavengeRootsTask time:1836  GCTaskThread ID:8
>   ScavengeRootsTask time:1735  GCTaskThread ID:8
>   ScavengeRootsTask time:1128  GCTaskThread ID:8
>   ThreadRootsTask time:14132  GCTaskThread ID:2
>   ThreadRootsTask time:16539  GCTaskThread ID:14
>   ScavengeRootsTask time:510  GCTaskThread ID:14
>   ScavengeRootsTask time:6407  GCTaskThread ID:8
>   StealTask time:3160638  GCTaskThread ID:1
>   StealTask time:3929547  GCTaskThread ID:13
>   StealTask time:3359946  GCTaskThread ID:4
>   StealTask time:3469719  GCTaskThread ID:2
>   StealTask time:3216089  GCTaskThread ID:11
>   StealTask time:3863544  GCTaskThread ID:14
>   StealTask time:3465927  GCTaskThread ID:9
>   StealTask time:3507085  GCTaskThread ID:3
>   StealTask time:3624188  GCTaskThread ID:12
>   StealTask time:3704221  GCTaskThread ID:7
>   StealTask time:3565777  GCTaskThread ID:6
>   StealTask time:3204696  GCTaskThread ID:0
>   StealTask time:4006233  GCTaskThread ID:8
>   StealTask time:3997742  GCTaskThread ID:10
>   StealTask time:3530225  GCTaskThread ID:5
> 
> 
> Also, I found that the StealTasks dominate the total time and
> contribute most of the young GC time.

StealTask does the actual evacuation work, i.e. pulling reachable
objects from the roots or from already evacuated objects. The other
tasks only determine the roots (and maybe copy the immediately
reachable objects; I do not remember exactly, but I think so).

That observation is as expected.

> However, not all the StealTasks really need that much time. If one
> StealTask takes a long time, many other StealTasks will be prolonged.
> This is another issue.

After evacuating all objects reachable from the (strong) roots, the
threads need to synchronize before continuing with the next phases
(e.g. j.l.ref.Reference processing). These phases more or less expect
that before a weaker type of reference is processed, the liveness of
all stronger ones has been determined completely (and it makes the
implementation much easier to simply synchronize all threads together).
The threads also synchronize at the very end to finish the GC pause.

That one thread might make the other threads wait in this case seems to
be expected behavior given some work imbalance.
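
To illustrate the effect, here is a standalone sketch (using
std::barrier from C++20; again not the HotSpot code, just the
synchronization shape): no thread may start the next phase before all
threads have finished the previous one, so a single slow worker delays
everybody.

  // Standalone sketch, not HotSpot code: all worker threads must reach
  // the barrier before any of them may continue with the next GC phase.
  #include <barrier>
  #include <chrono>
  #include <cstdio>
  #include <thread>
  #include <vector>

  int main() {
    const int num_workers = 4;
    std::barrier sync_point(num_workers);

    auto worker = [&](int id, int work_ms) {
      // Phase 1: e.g. evacuation from strong roots; durations differ.
      std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
      std::printf("worker %d finished phase 1 after %d ms, waiting\n",
                  id, work_ms);
      sync_point.arrive_and_wait();  // nobody proceeds until all arrived
      // Phase 2: e.g. j.l.ref.Reference processing.
      std::printf("worker %d starts phase 2\n", id);
    };

    std::vector<std::thread> threads;
    int durations[] = {5, 5, 5, 50};  // one deliberately slow worker
    for (int i = 0; i < num_workers; i++)
      threads.emplace_back(worker, i, durations[i]);
    for (auto& t : threads) t.join();
    return 0;
  }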

The imbalance of the StealTasks themselves may have various reasons,
e.g. imperfect work stealing, or simply that the evacuated object graph
does not offer much opportunity for parallelization. All HotSpot
collectors synchronize on a per-object basis, i.e. only one thread may
ever copy and process a given object. If the object graph contains e.g.
only a few long singly linked lists, the implementation will have a
hard time finding parallelism to exploit for these subgraphs (see the
sketch below).
There is also the environment playing into this of course, potentially
taking CPU time away.
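
To make the per-object point concrete, below is a standalone sketch
(mutex-protected deques and an atomic claim flag per node; HotSpot's
actual task queues are different and more elaborate) where worker
threads traverse a graph and steal work from each other, but every node
may be claimed by exactly one thread. If the graph is one long singly
linked list, at most one node is ever available at a time:

  // Standalone sketch, not HotSpot code: parallel graph traversal where
  // each node may be claimed by exactly one thread, similar in spirit to
  // only one GC thread being allowed to copy a given object. With a graph
  // that is one long singly linked list there is nothing to steal.
  #include <atomic>
  #include <cstdio>
  #include <deque>
  #include <mutex>
  #include <thread>
  #include <vector>

  struct Node {
    std::vector<int> successors;       // outgoing references
    std::atomic<bool> claimed{false};  // set by the thread that processes it
  };

  int main() {
    const int num_nodes = 100000;
    const int num_workers = 4;

    // Build one long singly linked list: node i points to node i+1.
    std::vector<Node> graph(num_nodes);
    for (int i = 0; i + 1 < num_nodes; i++)
      graph[i].successors.push_back(i + 1);

    std::vector<std::deque<int>> queues(num_workers);
    std::vector<std::mutex> locks(num_workers);
    std::atomic<int> processed{0};
    std::vector<int> per_thread(num_workers, 0);

    queues[0].push_back(0);            // the single root goes to worker 0

    auto worker = [&](int id) {
      while (processed.load() < num_nodes) {
        int node = -1;
        // Try the local queue first, then try to steal from the others.
        for (int v = 0; v < num_workers && node < 0; v++) {
          int victim = (id + v) % num_workers;
          std::lock_guard<std::mutex> g(locks[victim]);
          if (!queues[victim].empty()) {
            node = queues[victim].front();
            queues[victim].pop_front();
          }
        }
        if (node < 0) continue;        // nothing available yet, retry
        bool expected = false;
        if (!graph[node].claimed.compare_exchange_strong(expected, true))
          continue;                    // another thread already claimed it
        // "Copy" the object and publish its successors for further work.
        for (int succ : graph[node].successors) {
          std::lock_guard<std::mutex> g(locks[id]);
          queues[id].push_back(succ);
        }
        per_thread[id]++;
        processed.fetch_add(1);
      }
    };

    std::vector<std::thread> threads;
    for (int i = 0; i < num_workers; i++) threads.emplace_back(worker, i);
    for (auto& t : threads) t.join();

    for (int i = 0; i < num_workers; i++)
      std::printf("worker %d processed %d nodes\n", i, per_thread[i]);
    return 0;
  }

The per-worker counts printed at the end may even look balanced, since
any idle worker can steal the single available node, but the traversal
as a whole is still effectively serial no matter how many workers run.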

  Thomas


