RFR: Remove prefetch during mark

Fri Mar 2 11:27:38 UTC 2018

Hi Per,

I was not sure whether an interval of 0 would be problematic for non-x86 architectures.

Hugh

-----Original Message-----
From: Per Liden [mailto:per.liden at oracle.com] 
Sent: Friday, March 2, 2018 2:17 AM
To: Wilkinson, Hugh <hugh.wilkinson at intel.com>; Steve Blackburn <steve.blackburn at anu.edu.au>; zgc-dev at openjdk.java.net
Subject: Re: RFR: Remove prefetch during mark

Hi Hugh,

On 03/01/2018 05:45 PM, Wilkinson, Hugh wrote:
> Hi Per,
> 
> I believe that the Prefetch::read() and Prefetch::write() will be 
> defined for x86 by 
> zgc/src/hotspot/os_cpu/linux_x86/prefetch_linux_x86.inline.hpp. 
> (jdk-10+43)
> 
> This file provides an incorrect translation to assembly code.  It creates a 3-arg effective address that is interpreted by the processor as (base_address, index, size).
> 
> The size argument of read() and write() is translated as an index.  The size in the tuple above is always 1 in the translation, but this is only the size of the indexed object (1, 2, 4, 8).
> 
> So, specifying a size of a cacheline to read() and write() will prefetch the cacheline after the one that was intended.

Wow. Thanks for spotting that. It seems that the word "interval" in the Prefetch interface should really be "offset".
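
To make the offset behaviour concrete, here is a minimal model of the address arithmetic described above. It is only a sketch: the 64-byte cache line size is an assumption, and the function names are hypothetical, not part of the HotSpot Prefetch interface.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 64-byte cache lines.
const std::uintptr_t kCacheLineSize = 64;

// Byte address formed by a (base, index, scale) operand with scale == 1.
inline std::uintptr_t effective_address(std::uintptr_t base, std::uintptr_t index) {
  return base + index * 1;
}

// Number of the cache line that contains a byte address.
inline std::uintptr_t cache_line_of(std::uintptr_t addr) {
  return addr / kCacheLineSize;
}

// How many cache lines past the line of loc a prefetch lands when the
// second argument is used as an index (that is, as a byte offset).
inline std::uintptr_t lines_past_intended(std::uintptr_t loc, std::uintptr_t interval) {
  return cache_line_of(effective_address(loc, interval)) - cache_line_of(loc);
}

// Hypothetical self-check covering the cases raised in this thread.
inline void check_offset_model() {
  assert(lines_past_intended(0x1000, 64) == 1);  // one cache line "size": next line
  assert(lines_past_intended(0x1000, 1) == 0);   // value of 1: intended line
  assert(lines_past_intended(0x103F, 1) == 1);   // but not if loc is the last byte of its line
  assert(lines_past_intended(0x103F, 0) == 0);   // value of 0: always the intended line
}
```

Under this model, a value of 1 still misses the intended line when the address is the last byte of its cache line, while 0 never does.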

> 
> In this instance, the easiest workaround is to specify a size of 1.

Isn't 0 even better?

/Per

> 
> The actual prefetch instructions do not take a size argument.  They only take a byte address reference, and the cacheline containing the byte is prefetched.
> 
> A more robust fix is to instantiate the number of prefetch instructions necessary to span the size, but this is practical only for a compile-time constant size argument, and for a maximum of perhaps 4 prefetch instructions.
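
A rough sketch of what such a spanning prefetch could look like follows. It is not the HotSpot implementation: prefetch_read_span and the helper names are hypothetical, the 64-byte line size is an assumption, and __builtin_prefetch is the GCC/Clang builtin rather than anything from the Prefetch interface.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLineSize = 64;

// Prefetches issued at offsets 0, 64, 128, ... below size.
constexpr std::size_t stride_count(std::size_t size) {
  return (size + kCacheLineSize - 1) / kCacheLineSize;
}

// An extra prefetch of the last byte is needed when the start address may be
// unaligned, unless that byte's offset is already one of the strided offsets.
constexpr bool needs_tail(std::size_t size) {
  return (size - 1) % kCacheLineSize != 0;
}

constexpr std::size_t prefetch_count(std::size_t size) {
  return stride_count(size) + (needs_tail(size) ? 1 : 0);
}

// Hypothetical helper: prefetch every cache line touched by the Size bytes
// starting at p. Size is a compile-time constant, so the loop has a fixed
// trip count and the number of prefetches is bounded at compile time.
template <std::size_t Size>
inline void prefetch_read_span(const void* p) {
  static_assert(Size > 0, "size must be non-zero");
  static_assert(prefetch_count(Size) <= 4, "too many prefetches for one call");
  assert(p != nullptr);
  const char* base = static_cast<const char*>(p);
  for (std::size_t i = 0; i < stride_count(Size); ++i) {
    __builtin_prefetch(base + i * kCacheLineSize, 0, 3);  // read, high locality
  }
  if (needs_tail(Size)) {
    __builtin_prefetch(base + Size - 1, 0, 3);
  }
}
```

For example, prefetch_read_span<128>(obj) issues three prefetches, which covers the case where obj is not cache-line aligned and the 128 bytes straddle three lines.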
> 
> Current Intel processors will perform a prefetch with exclusive ownership 
> for the PREFETCHW instruction.  Prior to Broadwell (BDW), except for potentially 
> some early Pentium 4s, a NOP was executed for a PREFETCHW instance.  
> The file could enable execution of a PREFETCHW for a Prefetch::write(
> 
> Hugh
> 
> 
> -----Original Message-----
> From: zgc-dev [mailto:zgc-dev-bounces at openjdk.java.net] On Behalf Of 
> Per Liden
> Sent: Thursday, March 1, 2018 6:18 AM
> To: Steve Blackburn <steve.blackburn at anu.edu.au>; 
> zgc-dev at openjdk.java.net
> Subject: Re: RFR: Remove prefetch during mark
> 
> On 03/01/2018 12:12 PM, Per Liden wrote:
>> Hi,
>>
>> On 03/01/2018 01:52 AM, Steve Blackburn wrote:
>>> Hi all,
>>>
>>> I just stumbled upon this thread, and thought I ought to chime in.
>>>
>>> You may find our prefetch paper from 10 years ago useful.   Or not! :-).
>>>
>>> http://users.cecs.anu.edu.au/~steveb/downloads/pdf/pf-ismm-2007.pdf
>>
>> Thanks for the pointer. Link above doesn't seem to work for me, but I 
>> found the paper through ACM.
>>
>>>
>>> The short version is that there were a number of efforts to get 
>>> prefetching working well in the past, but none were effective.  We 
>>> did a pretty detailed study and managed to get some very nice 
>>> results, with two important changes:
>>>
>>>     *   FIFO front end to mark queue (without the FIFO the prefetch 
>>> distance is unpredictable)
>>>     *   Enqueue edges rather than nodes
>>>
>>> Obviously, the situation is different here (concurrent, big change in 
>>> uarch, etc), but still there are some core ideas that you probably 
>>> ought to know.
>>>
>>> The impatient may want to jump to sections 7.2 and 7.3.  Note the 
>>> last para of 7.3: just adding the FIFO, with no software prefetch, 
>>> may bring a win on some architectures.
>>
>> We do enqueue edges in ZGC (to enable "striped marking"), so we're 
>> fairly well positioned for prefetching to work, one would think. I 
>> recently did some quick tests with a FIFO in front of the mark stack 
>> (which would match "EdgeSide" in the paper) with varying prefetch 
>> distance, but wasn't able to observe any real improvements. More 
>> measurements and analysis would be needed to understand why.
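
For readers unfamiliar with the idea, the following is a minimal sketch of a FIFO placed in front of a mark stack, where the FIFO capacity acts as the prefetch distance. It is not the code from the webrev and not the design from the paper: PrefetchFifo is a hypothetical class, entries are plain pointers rather than ZGC mark stack entries, and __builtin_prefetch is the GCC/Clang builtin.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity FIFO. Each entry is prefetched when it enters
// and handed back for processing once Distance newer entries have entered.
template <typename T, std::size_t Distance>
class PrefetchFifo {
  static_assert(Distance > 0, "prefetch distance must be non-zero");

  T*          _slots[Distance];
  std::size_t _head = 0;   // index of the oldest entry
  std::size_t _count = 0;  // number of entries currently held

public:
  // Prefetches p and enqueues it. Returns the oldest entry if the FIFO was
  // already full, otherwise nullptr (nothing is ready to process yet).
  T* push(T* p) {
    assert(p != nullptr);  // nullptr is reserved to mean "no entry"
    __builtin_prefetch(p, 0, 3);
    if (_count == Distance) {
      T* const oldest = _slots[_head];
      _slots[_head] = p;  // the freed slot becomes the new tail
      _head = (_head + 1) % Distance;
      return oldest;
    }
    _slots[(_head + _count) % Distance] = p;
    ++_count;
    return nullptr;
  }

  // Drains one entry once the mark stack is empty; nullptr when the FIFO is empty.
  T* pop() {
    if (_count == 0) {
      return nullptr;
    }
    T* const oldest = _slots[_head];
    _head = (_head + 1) % Distance;
    --_count;
    return oldest;
  }
};
```

A marking loop would pop an entry from the mark stack, push() it into the FIFO, and process whatever push() returns; when the stack runs dry it would drain the FIFO with pop(). With Distance set to 3, an entry is processed three pops after it was prefetched.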
> 
> Here's the FIFO prefetch patch I did, in case anyone is interested in doing more work/analysis in this area:
> 
> http://cr.openjdk.java.net/~pliden/zgc/mark_prefetch/webrev.0/
> 
> cheers,
> Per
> 
>>
>> cheers,
>> Per
>>
>>>
>>> Cheers,
>>>
>>> --Steve
>>>
>>> On 02/14/2018 05:23 PM, Wilkinson, Hugh wrote:
>>>> I have been looking at this also.
>>>>
>>>> I find that if the prefetching occurs 3 popped entries ahead of the 
>>>> processing, then there is a worthwhile benefit.
>>>>
>>>> A bit of re-structuring is required to make this easy and efficient.
>>>>
>>>> I am prefetching 2 cache lines from the referenced object and also 
>>>> doing a PREFETCHW of the mark bitmap.  (Prefetch::write() requires 
>>>> modification for x86.)
>>>>
>>>> With the current code structure, removal of the Prefetch::read() 
>>>> probably makes sense; however, I would like to highlight that 
>>>> marking performance can be improved with sufficiently early 
>>>> software cache prefetches.
>>>>
>>>> I expect to share more details later.
>>>
>>> Looking forward to that!
>>>
>>> cheers,
>>> Per
>>>