Patch reduces ZGC concurrent marking overhead by ~24%

Fri Jun 8 23:02:42 UTC 2018

Hi Hugh,

Thanks for investigating this. Looking forward to taking it for a spin.
Since we're currently focused on integrating ZGC in JDK 11, I've only
had time to look at the patch briefly, but let's get back to this once
we're past that milestone and things settle down a bit.

Thanks!

/Per

On 06/05/2018 11:11 PM, Wilkinson, Hugh wrote:
> We have posted a ZGC patch to http://cr.openjdk.java.net/~vdeshpande/zgc/webrev.00/ .
> 
> We have posted a bug/enhancement report to https://bugs.openjdk.java.net/browse/JDK-8204350 .
> 
> The starting point for the code was a source "tip" download from 18 May 2018.  This puts it between tags jdk-11-13 and jdk-11-14.
> 
> The patch pipelines concurrent marking and concurrent remapping during marking, using software prefetch instructions to move data from DRAM to cache without waiting for it to arrive.  We used VTune to validate what was being prefetched.  The structures for which prefetching is occurring include ZPageTable, ZPage, ZBitMap, ZForwardingTable, ZMarkCache and the object being marked.
> 
> The pipeline processing is enabled by the +/-ZMarkPipeline option.  Four other options allow setting how far ahead the prefetching occurs (0-8), whether PREFETCHW instructions are used, whether prefetching occurs for remapping and whether a 2 or 4 stage pipeline is used for remapping.
> 
> We evaluated the benefits using SPECjbb2015.  The maximum benefit was seen on a Skylake platform.  The compute processing time per mark entry was reduced by 24.08%, prefetching 6 entries ahead.  The modified code outputs a gc log message of the following form at the end of every concurrent marking phase to report performance:
> [193.140s][info   ][gc] GC(6) Total popped entries = ,75712991,.  Compute ns per popped entry = ,141,.  Wall ns per popped entry = ,186,.  Drain calls per (14) worker thread = ,15400.00,.
> 
> Our evaluation of the benefits has not been exhaustive.  For some workloads and some platforms, the benefits may be higher.
> 
> The code has been tested on Intel processors.  It is intended to build and to execute on other processor architectures.
> 
> Hugh
>
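
For readers unfamiliar with the technique, the prefetch-ahead idea described in the quoted message can be sketched roughly as below. This is only a minimal illustration and not code from the patch: the `Node` type and the `prefetch_for_write` and `mark_all` functions are hypothetical names invented here, the flat vector stands in for a mark stack, and `__builtin_prefetch` is the GCC/Clang builtin (a no-op fallback is assumed elsewhere). The `distance` parameter plays the role of "how far ahead the prefetching occurs".

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an object reached during marking.
struct Node {
    std::uint64_t payload;
    bool marked;
};

// Thin wrapper around the GCC/Clang prefetch builtin; a no-op on other compilers.
// The second argument (1) requests the line with write intent.
inline void prefetch_for_write(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p, 1, 3);
#else
    (void)p;
#endif
}

// Marks every entry, issuing a prefetch for the entry `distance` slots ahead
// so that its cache line is (hopefully) resident by the time it is processed.
// A distance of 0 disables prefetching. Returns the number of newly marked nodes.
std::size_t mark_all(const std::vector<Node*>& entries, std::size_t distance) {
    std::size_t newly_marked = 0;
    const std::size_t n = entries.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (distance != 0 && i + distance < n) {
            prefetch_for_write(entries[i + distance]);
        }
        Node* node = entries[i];
        if (!node->marked) {
            node->marked = true;
            ++newly_marked;
        }
    }
    return newly_marked;
}
```

The prefetch is only a hint, so the result is the same with or without it; the only thing that changes is how long the loop stalls on memory. The best distance depends on the workload and the platform, which is presumably why the patch exposes it as an option.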