Remove prefetch during mark

Mon Mar 5 15:10:12 UTC 2018

Hi Per,

We have implemented a pipeline that prefetches according to the following pipeline stages:
1. ZPageTable, 1 cache line
2. ZPage, 2 cache lines
3. ZBitMap, 1 cache line
4. target object 1 or 2 cache lines, ZMarkCache 1 cache line

If nothing is in cache, 6 or 7 cache lines are read from memory, 384 or 448 bytes.

Performance moves around between average 145ns and 170ns or more per ZMarkEntry for different code variations.  (Unmodified code is at the high end of this range.)  I am tending toward the belief that memory bandwidth is the bottleneck.  I have not collected performance monitor data to verify this.  It is notable that I have added so many new instructions without performance penalty and some performance benefit.  However, eliminating branch miss-predict prefetches and cmpxchgs have a positive effect on performance.  That is usually the case only when bandwidth is the bottleneck.

These measurements are based upon counting the entries processed and using the os:current_thread_cpu_time() for time measurement.  I am running SPECjbb2015 and averaging the mark cycles that exceed 2 minutes in wall clock time.  I am using numactl to restrict execution to a single socket.  I am using 44 hyperthreads.  The concurrent GC thread execution times fall below 70% of the wall clock time execution at times, particularly when the page cache is flushing.

In more idle circumstances, the average service time of a ZMarkEntry is as low as 85 ns.

In order to optimize the marking bandwidth, it might be worth thinking about optimizing the ZPageTable for medium size pages and moving sufficient information into the table that, typically, the ZPage object is not referenced.

More later,

Hugh

-----Original Message-----
From: Per Liden [mailto:per.liden at oracle.com] 
Sent: Friday, March 2, 2018 3:51 AM
To: Wilkinson, Hugh <hugh.wilkinson at intel.com>; Steve Blackburn <steve.blackburn at anu.edu.au>; zgc-dev at openjdk.java.net
Subject: Re: Remove prefetch during mark

Hi Hugh,

On 03/01/2018 10:38 PM, Wilkinson, Hugh wrote:
> The construct that we are pursuing is outlined by the pseudocode in my next posting.  It is based upon earlier work done for network packet processing.
> 
> Intermediate FIFOs are used to prefetch ahead by 1 to N entries.  The size of the FIFOs would correspond to memory latencies and can be configured at runtime.
> The cost of this construct is the store and restore context operations.
> 
> The pseudocode implements a 3 stage pipeline:
> 1.	 Get an entry, prefetch the ZLiveMap and ZBitMap.

Wouldn't it not be a good idea to split the above stage in two? I'm thinking you can't prefetch the ZBitMap without first completing a load in the ZLiveMap (ZPage). So basically:

1. Get entry, prefetch ZLiveMap (ZPage)
2. Prefetch ZBitMap
3 ...

Also, have you considered having a stage for prefetching the page table? 
Maybe that's already hot enough and wouldn't help, but something like:

1. Get entry, prefetch ZPage address in ZPageTable 2. Prefetch ZLiveMap (ZPage) 3. Prefetch ZBitMap
4 ...

> 2.	Update the ZLiveMap and ZBitMap and if not already marked, prefetch the object.
> 3.	Update the live bytes and follow the object.
> 
> We find that prefetching the object for already marked objects degrades performance.
> 
> The looping constructs allow filling the FIFOs incrementally, allowing for up to 2 stores for every restore while the FIFOs are filling.
> 
> When the FIFOs are full, each stage of the pipeline executes once each in sequence.
> 
> The use of “do {…;if () continue;…; break; } while(UNLIKELY(true));” eliminates loop overhead and optimizes the code for single iteration execution.

Maybe it's way to early to tell, but I'm of course curious if you've got some early indication of what level of improvement (in terms of overall mark time reduction) this scheme can offer on some known benchmark?

cheers,
Per

> 
> Hugh
> 
> -----Original Message-----
> From: zgc-dev [mailto:zgc-dev-bounces at openjdk.java.net] On Behalf Of 
> Steve Blackburn
> Sent: Thursday, March 1, 2018 6:24 AM
> To: Per Liden <per.liden at oracle.com>; zgc-dev at openjdk.java.net
> Subject: Re: Remove prefetch during mark
> 
> A surprising amount of blood, sweat and tears went into that work before we saw the nice result --- we could not get any performance out of the previously published prefetching work.
> 
> On the face of it, you're really well positioned to get some value given that you're already enqueuing edges.
> 
> I'll be interested to hear how things go.
> 
> Cheers,
> 
> --Steve
> 
> On 1/3/18, 10:18 pm, "Per Liden" <per.liden at oracle.com> wrote:
> 
>      On 03/01/2018 12:12 PM, Per Liden wrote:
>      > Hi,
>      >
>      > On 03/01/2018 01:52 AM, Steve Blackburn wrote:
>      >> Hi all,
>      >>
>      >> I just stumbled upon this thread, and thought I ought to chime in.
>      >>
>      >> You may find our prefetch paper from 10 years ago useful.   Or not! :-).
>      >>
>      >> http://users.cecs.anu.edu.au/~steveb/downloads/pdf/pf-ismm-2007.pdf
>      >
>      > Thanks for the pointer. Link above doesn't seem to work for me, but I
>      > found the paper through ACM.
>      >
>      >>
>      >> The short version is that there were a number of efforts to get
>      >> prefetching working well in the past, but none were effective.  We did
>      >> a pretty detailed study and managed to get some very nice results,
>      >> with two important changes:
>      >>
>      >>    *   FIFO front end to mark queue (without the FIFO the prefetch
>      >> distance is unpredictable)
>      >>    *   Enqueue edges rather than nodes
>      >> Obviously, the situation is different here (concurrent, big change in
>      >> uarch, etc), but still there are some core ideas that you probably
>      >> ought to know.
>      >>
>      >> The impatient may want to jump to section 7.2 and 7.3.    Note the
>      >> last para of 7.3: just adding the FIFO, with no software prefetch may
>      >> bring a win on some architectures.
>      >
>      > We do enqueue edges in ZGC (to enable "striped marking"), so we're
>      > fairly good positioned for prefetching to work, one would think. I
>      > recently did some quick tests with a FIFO in front of the mark stack
>      > (which would match "EdgeSide" in the paper) with varying prefetch
>      > distance, but wasn't able to observe any real improvements. More
>      > measurements and analysis would be needed to understand why.
>      
>      Here's the FIFO prefetch patch I did, in case anyone is interested in
>      doing more work/analysis in this area:
>      
>      http://cr.openjdk.java.net/~pliden/zgc/mark_prefetch/webrev.0/
>      
>      cheers,
>      Per
>      
>      >
>      > cheers,
>      > Per
>      >
>      >>
>      >> Cheers,
>      >>
>      >> --Steve
>      >>
>      >> On 02/14/2018 05:23 PM, Wilkinson, Hugh wrote:
>      >>> I have been looking at this also.
>      >>>
>      >>> I find that if the prefetching occurs 3 popped entries ahead of the
>      >>> processing, then there is a worthwhile benefit.
>      >>>
>      >>> A bit of re-structuring is required to make this easy and efficient.
>      >>>
>      >>> I am prefetching 2 cache lines from the referenced object and also
>      >>> doing a PREFETCHW of the mark bitmap.  (Prefetch::write() requires
>      >>> modification for x86.)
>      >>>
>      >>> With the current code structure, removal of the Prefetch::read()
>      >>> probably makes sense; however, I would like to highlight that marking
>      >>> performance can be improved with sufficiently early software cache
>      >>> prefetches.
>      >>>
>      >>> I expect to share more details later.
>      >>
>      >> Looking forward to that!
>      >>
>      >> cheers,
>      >> Per
>      >>
>      
>