G1GC: The design choice of prefetching

Tue Oct 15 08:17:20 UTC 2019

Hi Mingyu,

 >> ------------------------------------------------------------------
 >> From:Mingyu Wu <timberonce at gmail.com>
 >> Send Time:2019 Oct. 15 (Tue.) 09:34
 >> To:hotspot-gc-dev <hotspot-gc-dev at openjdk.java.net>
 >> Subject:G1GC: The design choice of prefetching
 >>
 >> Hi all,
 >> I find that G1GC (in OpenJDK12) implements a method named
 >> *prefetch_and_push*, which prefetches the header and the first field
 >> of an object referenced by a pointer *p* while *p* is about to be
 >> enqueued.
 >> However, the effect of this prefetch instruction can be unstable as
 >> the time when the object is processed is unknown. It is possible that
 >> many references are enqueued before *p* (the data structure is
 >> actually First-In-Last-Out) and finally evict the cache line storing
 >> the object, making the prefetch useless. Therefore, what is the
 >> design choice of those prefetch instructions? Do they stand for some
 >> tradeoffs related to the overhead of prefetching?
 >>
 >> Thanks,
 >> Mingyu
 >
On 15.10.19 08:18, Liang Mao wrote:
> Hi Mingyu,
> 
> The prefetch design is not only available in new versions of G1 GC but introduced
> in very early years in hotspot and other GCs like ParNew. It is kind of aggressive
> prefetching imho which prefetches all the addresses in the ref queue which contains
> *grey pointers* and also creates enough latency between issuing prefetch instructions
> and memory access to maximize the cache utilization.
> There could be the problem you mentioned that cache is evicted if overflowed.
> Maintaining the proper length of the ref queue is the way to avoid this. You can
> look into the issue below which fixed this problem and improved performance in G1.
> https://bugs.openjdk.java.net/browse/JDK-6672778
> OpenJDK developers may correct me if there's something I misunderstood.
> 
> Thanks,
> Liang
> 

As Liang correctly pointed out, the current oop prefetch design in G1 is
mostly based on existing precedent in other GCs and lots of testing.
There are some differences, noted below.

As you also pointed out correctly, there is a tradeoff to be made between
the complexity of this code and the actual gains. This code path is in
my experience *extremely* sensitive to changes, so adding even a simple
heuristic here might nullify all the gains from more timely prefetching.

When implementing JDK-6672778, I ran many tests with variants of this
scheme. The currently implemented one (with the upper/lower "trim"
bound) proved to be fastest overall.
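
To make the upper/lower bound idea concrete, here is a minimal sketch.
It is not the HotSpot code: RefQueue, its fields and the bound values
are made up for illustration. The queue is only drained once it grows
past the upper bound, and then only down to the lower bound, so a
limited number of recently prefetched entries stays pending.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-thread reference queue (LIFO).
struct RefQueue {
  std::vector<int> refs;      // queued references (ints for brevity)
  std::size_t lower;          // assumed lower trim bound
  std::size_t upper;          // assumed upper trim bound
  std::size_t processed = 0;  // number of entries drained so far

  RefQueue(std::size_t lo, std::size_t hi) : lower(lo), upper(hi) {}

  // Enqueue a reference; drain only when the upper bound is exceeded.
  void push(int ref) {
    refs.push_back(ref);
    if (refs.size() > upper) {
      trim_to(lower);
    }
  }

  // Pop entries until at most `target` remain.
  void trim_to(std::size_t target) {
    while (refs.size() > target) {
      refs.pop_back();  // a real collector would process the entry here
      ++processed;
    }
  }
};
```

With lower = 2 and upper = 5, ten pushes trigger two partial drains and
leave two entries on the queue.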

Compared to other collectors, G1 also always prefetches and pushes as 
indicated in the

   // We're not going to even bother checking whether the object is
   // already forwarded or not, as this usually causes an immediate
   // stall. We'll try to prefetch the object (for write, given that
   // we might need to install the forwarding reference) and we'll
   // get back to it when pop it from the queue

comment in G1ScanClosureBase::prefetch_and_push, whereas the other
collectors first check whether the reference has already been
forwarded. The current code proved better for G1 at the time.
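
As a rough illustration of "prefetch and push without checking", the
sketch below enqueues a slot and issues a write prefetch for the object
it points to, without reading the object's header. Obj and ScanQueue
are hypothetical types, and __builtin_prefetch is the GCC/Clang builtin
used here as a stand-in; HotSpot has its own prefetch wrappers.

```cpp
#include <vector>

// Hypothetical object layout: a header word followed by one field.
struct Obj {
  long header;
  long first_field;
};

struct ScanQueue {
  std::vector<Obj**> refs;  // slots waiting to be processed

  // Prefetch the referenced object for write and enqueue the slot.
  // The object itself is deliberately not dereferenced here, since
  // reading its header would likely stall on a cache miss.
  void prefetch_and_push(Obj** p) {
    Obj* obj = *p;
    __builtin_prefetch(obj, 1 /* write */, 3 /* keep in cache */);
    refs.push_back(p);
  }
};
```

The forwarding check then happens when the slot is popped, by which
time the prefetch has hopefully completed.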

Other attempted changes, like prepending a small FIFO of entries in the
push/pop path (to induce some "fixed" latency between prefetching and
work on these references), just made the whole evacuation slower. But
maybe I did something wrong here.
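
For readers unfamiliar with the FIFO idea, one possible shape is
sketched below. This is only a guess at the general technique, not the
variant that was actually tested: DelayFifo, kSlots and its value are
hypothetical. An entry is prefetched when it enters the buffer and is
handed back only after kSlots further pushes, which gives a roughly
fixed distance between the prefetch and the first access.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size delay buffer placed in front of a queue.
class DelayFifo {
 public:
  static constexpr std::size_t kSlots = 4;  // assumed buffer size

  // Prefetch p and store it. Returns the oldest entry if the buffer
  // was already full, otherwise nullptr.
  void* push(void* p) {
    __builtin_prefetch(p, 0 /* read */, 3 /* keep in cache */);
    void* evicted = nullptr;
    if (count_ == kSlots) {
      evicted = slots_[head_];       // oldest entry leaves
      slots_[head_] = p;             // newest entry takes its slot
      head_ = (head_ + 1) % kSlots;
    } else {
      slots_[(head_ + count_) % kSlots] = p;
      ++count_;
    }
    return evicted;
  }

  // Drain remaining entries in FIFO order; nullptr when empty.
  void* pop() {
    if (count_ == 0) return nullptr;
    void* p = slots_[head_];
    head_ = (head_ + 1) % kSlots;
    --count_;
    return p;
  }

 private:
  std::array<void*, kSlots> slots_{};
  std::size_t head_ = 0;   // index of the oldest entry
  std::size_t count_ = 0;  // number of stored entries
};
```

The extra bookkeeping on every push and pop is exactly the kind of
overhead that can outweigh the benefit on such a hot path.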

These measurements might no longer be valid, particularly because of
changes to how the Java heap roots are traversed (JDK-8213108), so
revisiting this may be interesting and fruitful.

It would be really interesting to me to hear back from you or anybody
else in the future about experiments you did, whatever the results are;
even "failed" attempts can be learned from. :)

Thanks,
   Thomas