Brief ZGC Overview (from digging up the source code)

Sat Dec 8 16:47:15 UTC 2018

Hi,

On 12/06/2018 07:45 PM, Mohammad Dashti wrote:
> Hi,
> I've been digging up the source code of ZGC to get some understanding
> of how it works. I've summarized how I think it works, and would
> appreciate any feedback or corrections. I'm sure i missed a lot of
> details (e.g. I don't talk about relocation at all), but I tried to
> make my description as clear as possible. Here it is:
> 
> There are four threads that are involved in the main Garbage
> Collection (GC) tasks: ZDirector, ZDriver, VMThread, and ZWorker(s).

There's also the ZStat thread, which collects and prints various 
statistics. But it doesn't play any central role in the over all algorithm.

> 
> The ZDirector thread continuously runs and checks for “causes” to run
> the GC. Here, I’m only concerned about the “allocation rate” cause.
> ZDirector calculates the object allocation rate and once the rate is
> above a specified value it will trigger a GC cycle by signalling the

It's not that the allocation rate itself needs to reach a specified 
limit for the to GC kick in. What is considered is the allocation rate 
in relation to amount of free memory and the time to complete a GC 
cycle. This is a simplification, but the GC kicks in when 
(free_memory/allocation_rate) - gc_cycle_duration <= 0.

> ZDriver thread which is waiting to be signalled. The ZDirector is one
> path that can start the GC cycle. The other path is through mutator
> threads when they allocate new objects through the “new” bytecode. A
> GC cycle might be started in the case where there isn’t any available
> heap space. The ZDriver is the thread responsible for going through
> the whole GC cycle step by step. For some stages of GC, such as Mark
> Start and Mark End, mutator threads have to come to a safepoint.
> ZDriver accomplishes this by signalling the VMthread. When the
> VMThread runs, a call is made to ensure that all mutators come to a
> stop. The VMthread then performs the “stop-the-world” necessary
> operations for these stages of GC. Once this is done, ZDriver will
> enter the main concurrent marking cycle. The concurrent GC work is
> delegated to worker threads (ZWorker). The number of worker threads is
> specified at the the launch of the Java command.

There's also heuristics to pick that number, if the user didn't specify it.

> 
> These worker threads run concurrently with mutators. The heart of the
> work is in ZMark:drain() where the worker threads will pop object
> entries from their local stacks and mark_and_follow() these entries.
> Work will continue until there are no more entries to pop. Workers pop
> entries according to the “stripe” that these entries belong to. ZGC
> uses a total number of stripes that is equal to the specified number
> of workers up to a maximum number of 16 stripes. If there are more

Number of stripes used is always the power of 2 that is equal to or 
closest below the number of concurrent worker threads.

> than 16 workers, some workers will work on shared stripes. For
> instance, in the case that we have 16 worker threads there will be 16
> stripes and each thread will pop entries in its corresponding stripe.
> So, thread 0 will pop entries from its local stack for stripe 0,
> thread 1 from stripe 1 and so on. Stripe numbers merely represent
> different segments in the heap. In ZGC, objects belong to ZPages that
> can either be small, medium, or large. For example, small ZPages are
> 2MB in size, so to know which stripe an object belongs to, the object
> address is right-shifted 21 bits to get the ZPage starting address (21
> bits is 2MB). Then, the shifted address is anded with the total number
> of stripes to get the stripe number that an entry corresponds to. The

To be precise, that should be "...anded with the total number of stripes 
minus one..."

> main purpose of having stripes is to reduce the probability that two
> workers have to mark the same objects which can be costly since the
> actual final step in marking is done atomically as will be explained
> next.

The atomic operation itself is typically not the main problem. In fact, 
with strict stripe assignment (only allow one worker per stripe) we 
wouldn't need to mark object atomically at all. The main reason for the 
striping is to limit the amount of memory that is shared between workers 
to get better cache behavior.

> 
> Popping is done on stacks that correspond to stripes. So, there is a
> maximum of 16 stacks corresponding to the maximum of 16 stripes. Each
> stack is a placeholder for a maximum of 254 entries (it has 254
> “slots”). Once the slots are full (entries are pushed to stacks during
> following objects in the marking phase, so slots may become full), the
> stack is added to a list of “published” stacks. These published stacks
> might be used later if a worker thread has no entries to follow in its
> local stack and needs to steal entries from other stacks.

That's correct, but the term "steal" in ZGC typically refers to the 
situation when there's no more work to do on a stripe, so the worker 
assigned to that stripe steals work from different stripe. Grabbing 
published stacks from the stripe a worker is assigned to is just normal 
business.

> 
> Once an object entry is popped, mark_and_follow() is called. This
> function contains the bulk of the concurrent marking phase. It is
> essentially divided into two parts: try_mark_object(), and
> follow_object(). In try_mark_object(), the object will be marked. This
> is done by obtaining the ZPage that the object belongs to, then the
> “index” of this object in the Zpage is calculated. Finally, the object
> is marked by setting a bit in a “bitmap” inside a “livemap” which
> contains liveness information about all objects in the object graph.
> The livemap is essentially the object graph with liveness information
> . Of course, the setting of the bit is done atomically since it is
> possible that the bit will be set from another worker thread that
> might have followed its object path and reached the same object.
> 
> The second half of marking is done in follow_object(), where the
> recently marked object will be followed to add all of its reference
> objects into the corresponding marking stacks. Each Java instance
> class object contains a map (or a number of maps) of reference objects
> (these are called “oops”; an oop (Ordinary Object Pointer) is a
> representation of the object itself in addition to metadata and
> functions to manipulate object address bits). These maps contain a
> number of oops (depending on how many reference objects the instance
> class contains) that are iterated and pushed onto their
> stripe-corresponding marking stack where they will be marked later in
> the try_mark_object(). The whole process of mark_and_follow() is
> repeated until all of the object graph is traversed. Once this is
> complete, all of the worker stacks will be flushed and prepared for
> the next GC cycle.
> 
> Thanks,
> Mohammad
> 

cheers,
Per