RFR: Remove prefetch during mark

Thu Mar 1 16:45:56 UTC 2018

Hi Per,

I believe that Prefetch::read() and Prefetch::write() are defined for x86 by zgc/src/hotspot/os_cpu/linux_x86/prefetch_linux_x86.inline.hpp (jdk-10+43).

This file provides an incorrect translation to assembly code.  It creates a 3-arg effective address that is interpreted by the processor as (base_address, index, scale), i.e. base_address + index * scale.

The size argument of read() and write() is translated as the index.  The scale in the tuple above is always 1 in the translation, but the scale is only the size of the indexed object (1, 2, 4, or 8), not a prefetch length.

So, passing the size of a cacheline to read() or write() will prefetch the cacheline after the one that was intended.
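
To make the arithmetic concrete, here is a small model of the address calculation.  It is only a sketch: the 64-byte line size, the example address and all of the names are my assumptions, not taken from the HotSpot file.

```cpp
#include <cassert>
#include <cstdint>

// Assumed cache line size in bytes (typical for current x86 parts).
constexpr std::uintptr_t kCacheLineSize = 64;

// Models the x86 effective-address form base + index * scale.
constexpr std::uintptr_t effective_address(std::uintptr_t base,
                                           std::uintptr_t index,
                                           std::uintptr_t scale) {
  return base + index * scale;
}

// Number of the cache line that contains the byte at addr.
constexpr std::uintptr_t cache_line_of(std::uintptr_t addr) {
  return addr / kCacheLineSize;
}

// Restates the two cases for a hypothetical line-aligned address.
inline void check_size_cases() {
  const std::uintptr_t base = 0x1000;
  // "size" equal to the line size lands on the following line.
  assert(cache_line_of(effective_address(base, kCacheLineSize, 1)) ==
         cache_line_of(base) + 1);
  // "size" of 1 stays within the intended line.
  assert(cache_line_of(effective_address(base, 1, 1)) ==
         cache_line_of(base));
}
```

Note that the second case relies on the base not being the last byte of its line; base + 1 would otherwise also cross into the next line.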

In this instance, the easiest workaround is to specify a size of 1.

The actual prefetch instructions do not take a size argument.  They only take a byte address reference, and the cacheline containing the byte is prefetched.

A more robust fix is to instantiate the number of prefetch instructions necessary to span the size, but this is only practical up to a maximum of perhaps 4 prefetch instructions, and only for a compile-time constant size argument.
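
One way such a fix might look is sketched below.  This is not a proposed patch: prefetch_read_span, the 64-byte line size and the cap of 4 are hypothetical, and __builtin_prefetch assumes GCC or Clang.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kLineBytes = 64;     // assumed cache line size
constexpr std::size_t kMaxPrefetches = 4;  // practical cap mentioned above

// Lines needed to cover size bytes starting at a line-aligned address.
constexpr std::size_t lines_spanned(std::size_t size) {
  return (size + kLineBytes - 1) / kLineBytes;
}

// Issues one read prefetch per cache line covered by Size bytes.
// Size is a template parameter, so the trip count is a compile-time
// constant and the compiler is free to unroll the loop completely.
// An unaligned addr may leave part of the last line uncovered.
template <std::size_t Size>
inline void prefetch_read_span(const void* addr) {
  static_assert(Size > 0, "size must be non-zero");
  static_assert(lines_spanned(Size) <= kMaxPrefetches,
                "span needs too many prefetch instructions");
  const char* p = static_cast<const char*>(addr);
  for (std::size_t i = 0; i < lines_spanned(Size); ++i) {
    __builtin_prefetch(p + i * kLineBytes, 0 /* read */, 3 /* high locality */);
  }
}
```

A call such as prefetch_read_span<128>(obj) would then issue two prefetches, while a size above 256 is rejected at compile time.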

Current Intel processors will perform a prefetch for exclusive ownership for the PREFETCHW instruction.  Prior to BDW (Broadwell), except potentially for some early Pentium 4s, PREFETCHW was executed as a NOP.  The file could enable execution of a PREFETCHW for a Prefetch::write().
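
As a rough illustration of a write prefetch used on a mark bitmap, the sketch below goes through the compiler builtin rather than inline assembly.  The function names are hypothetical, GCC or Clang is assumed, and whether the builtin is actually emitted as PREFETCHW depends on the target flags in effect.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a write prefetch; not the HotSpot source.
// The second argument (1) asks for a prefetch in anticipation of a store.
inline void prefetch_for_write(void* addr) {
  __builtin_prefetch(addr, 1 /* write */, 3 /* high locality */);
}

// Example use: hint the bitmap word that holds a mark bit, then set the bit.
// In real use the hint would be issued well ahead of the store.
inline void set_mark_bit(std::uint64_t* bitmap, std::uint64_t bit) {
  std::uint64_t* word = bitmap + bit / 64;
  prefetch_for_write(word);
  *word |= std::uint64_t(1) << (bit % 64);
}
```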

Hugh

-----Original Message-----
From: zgc-dev [mailto:zgc-dev-bounces at openjdk.java.net] On Behalf Of Per Liden
Sent: Thursday, March 1, 2018 6:18 AM
To: Steve Blackburn <steve.blackburn at anu.edu.au>; zgc-dev at openjdk.java.net
Subject: Re: RFR: Remove prefetch during mark

On 03/01/2018 12:12 PM, Per Liden wrote:
> Hi,
> 
> On 03/01/2018 01:52 AM, Steve Blackburn wrote:
>> Hi all,
>>
>> I just stumbled upon this thread, and thought I ought to chime in.
>>
>> You may find our prefetch paper from 10 years ago useful.   Or not! :-).
>>                  
>> http://users.cecs.anu.edu.au/~steveb/downloads/pdf/pf-ismm-2007.pdf
> 
> Thanks for the pointer. Link above doesn't seem to work for me, but I 
> found the paper through ACM.
> 
>>
>> The short version is that there were a number of efforts to get 
>> prefetching working well in the past, but none were effective.  We 
>> did a pretty detailed study and managed to get some very nice 
>> results, with two important changes:
>>
>>    *   FIFO front end to mark queue (without the FIFO the prefetch 
>> distance is unpredictable)
>>    *   Enqueue edges rather than nodes
>>
>> Obviously, the situation is different here (concurrent, big change 
>> in uarch, etc), but still there are some core ideas that you 
>> probably ought to know.
>>
>> The impatient may want to jump to section 7.2 and 7.3.    Note the 
>> last para of 7.3: just adding the FIFO, with no software prefetch may 
>> bring a win on some architectures.
> 
> We do enqueue edges in ZGC (to enable "striped marking"), so we're 
> fairly well positioned for prefetching to work, one would think. I 
> recently did some quick tests with a FIFO in front of the mark stack 
> (which would match "EdgeSide" in the paper) with varying prefetch 
> distance, but wasn't able to observe any real improvements. More 
> measurements and analysis would be needed to understand why.

Here's the FIFO prefetch patch I did, in case anyone is interested in doing more work/analysis in this area:

http://cr.openjdk.java.net/~pliden/zgc/mark_prefetch/webrev.0/
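
For readers who have not seen the technique, the general idea of a FIFO in front of the mark stack can be sketched as follows.  This is not the code from the webrev above; PrefetchFifo and its interface are hypothetical, and __builtin_prefetch assumes GCC or Clang.  An address is prefetched when it enters the FIFO and is handed back for processing only after Distance further pushes, which keeps the prefetch distance fixed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size FIFO sitting between the mark stack and the
// marking loop. Distance is the prefetch distance in entries.
template <std::size_t Distance>
class PrefetchFifo {
  static_assert(Distance > 0, "prefetch distance must be non-zero");

 public:
  // Prefetches addr and stores it. Returns the entry displaced by this
  // push (the oldest one), or nullptr while the FIFO is still filling.
  const void* push(const void* addr) {
    __builtin_prefetch(addr, 0 /* read */, 3 /* high locality */);
    const void* oldest = nullptr;
    if (_count == Distance) {
      oldest = _slots[_head];  // full: _head is the oldest entry
    } else {
      ++_count;
    }
    _slots[_head] = addr;
    _head = (_head + 1) % Distance;
    return oldest;
  }

  // Removes and returns the oldest entry; nullptr when empty.
  // Used to drain the FIFO once the mark stack has run dry.
  const void* pop() {
    if (_count == 0) {
      return nullptr;
    }
    const std::size_t tail = (_head + Distance - _count) % Distance;
    --_count;
    return _slots[tail];
  }

 private:
  std::array<const void*, Distance> _slots{};
  std::size_t _head = 0;   // next write position
  std::size_t _count = 0;  // number of valid entries
};
```

A marking loop would push each entry popped from the mark stack, process whatever push() returns, and drain with pop() at the end.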

cheers,
Per

> 
> cheers,
> Per
> 
>>
>> Cheers,
>>
>> --Steve
>>
>> On 02/14/2018 05:23 PM, Wilkinson, Hugh wrote:
>>> I have been looking at this also.
>>>
>>> I find that if the prefetching occurs 3 popped entries ahead of the 
>>> processing, then there is a worthwhile benefit.
>>>
>>> A bit of re-structuring is required to make this easy and efficient.
>>>
>>> I am prefetching 2 cache lines from the referenced object and also 
>>> doing a PREFETCHW of the mark bitmap.  (Prefetch::write() requires 
>>> modification for x86.)
>>>
>>> With the current code structure, removal of the Prefetch::read() 
>>> probably makes sense; however, I would like to highlight that 
>>> marking performance can be improved with sufficiently early software 
>>> cache prefetches.
>>>
>>> I expect to share more details later.
>>
>> Looking forward to that!
>>
>> cheers,
>> Per
>>