RFR: Remove prefetch during mark

Mon Mar 5 12:13:39 UTC 2018

Per,

FWIW, adjacent cache line prefetching is usually enabled for clients (desktops, laptops) and disabled for servers.  It has been this way for a long time.  For servers, the bandwidth penalty of adjacent cache line prefetching was likely the determining factor in this difference.

Hugh

-----Original Message-----
From: Per Liden [mailto:per.liden at oracle.com] 
Sent: Monday, March 5, 2018 2:05 AM
To: Wilkinson, Hugh <hugh.wilkinson at intel.com>; Steve Blackburn <steve.blackburn at anu.edu.au>; zgc-dev at openjdk.java.net
Subject: Re: RFR: Remove prefetch during mark

Hi Hugh,

On 03/02/2018 05:24 PM, Wilkinson, Hugh wrote:
> BTW, the DEFAULT_CACHE_LINE_SIZE for x86 is defined at 128 bytes for a
> number of common configurations:
> 1. defined(TIERED) && defined(_LP64)
> 2. (defined(COMPILER2)||defined(SHARK)) && defined(_LP64)
> 
> It seems like the size is 64 bytes only for 32-bit machines.
> 
> I suspect that most x86 machines, particularly big ones, have a cache line size of 64 bytes.

Right, the choice of 128 bytes comes from JEP 143: "Improve Contended Locking" [1]. I personally think it was a mistake to change DEFAULT_CACHE_LINE_SIZE instead of having a different/new constant to control lock alignment/padding where needed. As far as I know, the number 128 was selected because most systems prefetch adjacent cachelines by default.

[1] https://bugs.openjdk.java.net/browse/JDK-8046133
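
To illustrate the idea of a separate constant for lock alignment/padding, here is a minimal sketch. The names kCacheLineSize, kContendedPaddingSize, PaddedCounter and lines_covered are hypothetical and do not exist in HotSpot, and the values 64 and 128 are assumptions taken from the numbers discussed in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Assumed values for illustration only; neither constant exists in HotSpot.
constexpr std::size_t kCacheLineSize = 64;          // physical cache line size
constexpr std::size_t kContendedPaddingSize = 128;  // covers an adjacent-line pair

// A contended field padded and aligned to the larger constant, so that
// adjacent cache line prefetching cannot pull a neighbour's data in with it.
struct alignas(kContendedPaddingSize) PaddedCounter {
  long value;
};

// Number of physical cache lines needed to cover `bytes` bytes.
inline std::size_t lines_covered(std::size_t bytes) {
  assert(bytes > 0);
  return (bytes + kCacheLineSize - 1) / kCacheLineSize;
}
```

With this split, code that reasons about real cache lines (such as prefetch strides) would use the 64-byte constant, while only contended data pays for 128-byte padding.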

/Per

> 
> (globalDefinitions_x86.hpp)
> 
> -----Original Message-----
> From: Per Liden [mailto:per.liden at oracle.com]
> Sent: Friday, March 2, 2018 2:17 AM
> To: Wilkinson, Hugh <hugh.wilkinson at intel.com>; Steve Blackburn 
> <steve.blackburn at anu.edu.au>; zgc-dev at openjdk.java.net
> Subject: Re: RFR: Remove prefetch during mark
> 
> Hi Hugh,
> 
> On 03/01/2018 05:45 PM, Wilkinson, Hugh wrote:
>> Hi Per,
>>
>> I believe that the Prefetch::read() and Prefetch::write() will be 
>> defined for x86 by 
>> zgc/src/hotspot/os_cpu/linux_x86/prefetch_linux_x86.inline.hpp.
>> (jdk-10+43)
>>
>> This file provides an incorrect translation to assembly code.  It creates a 3-arg effective address that is interpreted by the processor as (base_address, index, size).
>>
>> The size argument of read() and write() is translated as an index.  The size in the tuple above is always 1 in the translation, but this is only the size of the indexed object (1, 2, 4, 8).
>>
>> So, specifying a size of a cacheline to read() and write() will prefetch the cacheline after the one that was intended.
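
The arithmetic behind this can be modelled in a few lines. This is only a sketch of the address computation described above, not the HotSpot source; effective_address and cache_line_of are hypothetical helpers, and a 64-byte line is assumed in the usage note.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a "(base, index, scale)" effective address:
// the processor forms base + index * scale.
inline std::uint64_t effective_address(std::uint64_t base, std::uint64_t index,
                                       std::uint64_t scale) {
  assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
  return base + index * scale;
}

// Index of the cache line containing addr; line_size must be a power of two.
inline std::uint64_t cache_line_of(std::uint64_t addr, std::uint64_t line_size) {
  assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
  return addr / line_size;
}
```

With base = 0x1000, an index of 64 and a scale of 1, the effective address is 0x1040, which lies in the line after the one containing base. An index of 0 always stays on the line containing base.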
> 
> Wow. Thanks for spotting that. It seems that the word "interval" in the Prefetch interface should really be "offset".
> 
>>
>> In this instance, the easiest workaround is to specify a size of 1.
> 
> Isn't 0 even better?
> 
> /Per
> 
>>
>> The actual prefetch instructions do not take a size argument.  They only take a byte address reference, and the cacheline containing the byte is prefetched.
>>
>> A more robust fix is to instantiate the number of prefetch instructions necessary to span the size, but this is only practical for a maximum of perhaps 4 prefetch instructions.  It is only practical for a compile-time constant size argument.
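
One way such a fix could look is sketched below. This is not a proposed patch: it uses the GCC/Clang builtin __builtin_prefetch rather than HotSpot's inline assembly, and kLineSize, kMaxPrefetches, lines_spanned, PrefetchLines and prefetch_read_span are hypothetical names. A 64-byte line is assumed, and the pointer is assumed to be line-aligned (an unaligned span can touch one more line than is counted here).

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kLineSize = 64;      // assumed cache line size
constexpr std::size_t kMaxPrefetches = 4;  // upper bound suggested in the thread

// Number of cache lines needed to cover `bytes` bytes from an aligned start.
constexpr std::size_t lines_spanned(std::size_t bytes) {
  return (bytes + kLineSize - 1) / kLineSize;
}

// Emits N read prefetches, one per consecutive cache line, unrolled at
// compile time through template recursion.
template <std::size_t N>
struct PrefetchLines {
  static void run(const char* p) {
    __builtin_prefetch(p, 0, 3);  // 0 = read, 3 = keep in all cache levels
    PrefetchLines<N - 1>::run(p + kLineSize);
  }
};

template <>
struct PrefetchLines<0> {
  static void run(const char*) {}
};

// Prefetch every line in [p, p + Bytes); Bytes must be a compile-time constant.
template <std::size_t Bytes>
inline void prefetch_read_span(const void* p) {
  static_assert(lines_spanned(Bytes) <= kMaxPrefetches,
                "span needs too many prefetch instructions");
  assert(p != nullptr);
  PrefetchLines<lines_spanned(Bytes)>::run(static_cast<const char*>(p));
}
```

A call such as prefetch_read_span<128>(obj) would then expand to two prefetch instructions, and a size that needs more than four is rejected at compile time.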
>>
>> Current Intel processors will perform a prefetch with exclusive ownership
>> for the PREFETCHW instruction.  Prior to Broadwell (BDW), except for
>> potentially some early Pentium 4s, a NOP was executed for a PREFETCHW instance.
>> The file could enable execution of a PREFETCHW for a Prefetch::write(
>>
>> Hugh
>>
>>
>> -----Original Message-----
>> From: zgc-dev [mailto:zgc-dev-bounces at openjdk.java.net] On Behalf Of 
>> Per Liden
>> Sent: Thursday, March 1, 2018 6:18 AM
>> To: Steve Blackburn <steve.blackburn at anu.edu.au>; 
>> zgc-dev at openjdk.java.net
>> Subject: Re: RFR: Remove prefetch during mark
>>
>> On 03/01/2018 12:12 PM, Per Liden wrote:
>>> Hi,
>>>
>>> On 03/01/2018 01:52 AM, Steve Blackburn wrote:
>>>> Hi all,
>>>>
>>>> I just stumbled upon this thread, and thought I ought to chime in.
>>>>
>>>> You may find our prefetch paper from 10 years ago useful.   Or not! :-).
>>>>                    
>>>> http://users.cecs.anu.edu.au/~steveb/downloads/pdf/pf-ismm-2007.pdf
>>>
>>> Thanks for the pointer. Link above doesn't seem to work for me, but 
>>> I found the paper through ACM.
>>>
>>>>
>>>> The short version is that there were a number of efforts to get 
>>>> prefetching working well in the past, but none were effective.  We 
>>>> did a pretty detailed study and managed to get some very nice 
>>>> results, with two important changes:
>>>>
>>>>      *   FIFO front end to mark queue (without the FIFO the 
>>>> prefetch distance is unpredictable)
>>>>      *   Enqueue edges rather than nodes
>>>>
>>>> Obviously, the situation is different here (concurrent, big change 
>>>> in uarch, etc), but still there are some core ideas that you 
>>>> probably ought to know.
>>>>
>>>> The impatient may want to jump to sections 7.2 and 7.3.  Note the 
>>>> last para of 7.3: just adding the FIFO, with no software prefetch, 
>>>> may bring a win on some architectures.
>>>
>>> We do enqueue edges in ZGC (to enable "striped marking"), so we're 
>>> fairly well positioned for prefetching to work, one would think. I 
>>> recently did some quick tests with a FIFO in front of the mark stack 
>>> (which would match "EdgeSide" in the paper) with varying prefetch 
>>> distance, but wasn't able to observe any real improvements. More 
>>> measurements and analysis would be needed to understand why.
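
For readers unfamiliar with the technique, a FIFO in front of a mark stack can be sketched roughly as follows. This is neither the code from the paper nor the patch mentioned in this thread; PrefetchingMarkQueue is a hypothetical class, entries are plain pointers, and the GCC/Clang builtin __builtin_prefetch stands in for Prefetch::read(). The template parameter Depth plays the role of the prefetch distance.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical mark stack with a small FIFO in front of it. Entries are
// prefetched when they move from the stack into the FIFO and are handed to
// the caller only after up to Depth - 1 older entries have been returned.
template <std::size_t Depth>
class PrefetchingMarkQueue {
  static_assert(Depth > 0, "FIFO needs at least one slot");

 public:
  void push(void* p) {
    assert(p != nullptr);  // nullptr is reserved as the "empty" result of pop()
    stack_.push_back(p);
  }

  // Returns the next entry to process, or nullptr when nothing is left.
  void* pop() {
    // Top up the FIFO from the stack, issuing a prefetch for each new entry.
    while (count_ < Depth && !stack_.empty()) {
      void* p = stack_.back();
      stack_.pop_back();
      __builtin_prefetch(p, 0, 3);
      fifo_[(head_ + count_) % Depth] = p;
      ++count_;
    }
    if (count_ == 0) {
      return nullptr;
    }
    void* p = fifo_[head_];
    head_ = (head_ + 1) % Depth;
    --count_;
    return p;
  }

 private:
  std::vector<void*> stack_;
  std::array<void*, Depth> fifo_{};
  std::size_t head_ = 0;
  std::size_t count_ = 0;
};
```

Without the FIFO, an entry pushed and immediately popped gets no time for its prefetch to complete. With it, each entry waits behind the older FIFO entries first, which makes the distance between prefetch and use predictable.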
>>
>> Here's the FIFO prefetch patch I did, in case anyone is interested in doing more work/analysis in this area:
>>
>> http://cr.openjdk.java.net/~pliden/zgc/mark_prefetch/webrev.0/
>>
>> cheers,
>> Per
>>
>>>
>>> cheers,
>>> Per
>>>
>>>>
>>>> Cheers,
>>>>
>>>> --Steve
>>>>
>>>> On 02/14/2018 05:23 PM, Wilkinson, Hugh wrote:
>>>>> I have been looking at this also.
>>>>>
>>>>> I find that if the prefetching occurs 3 popped entries ahead of 
>>>>> the processing, then there is a worthwhile benefit.
>>>>>
>>>>> A bit of re-structuring is required to make this easy and efficient.
>>>>>
>>>>> I am prefetching 2 cache lines from the referenced object and also 
>>>>> doing a PREFETCHW of the mark bitmap.  (Prefetch::write() requires 
>>>>> modification for x86.)
>>>>>
>>>>> With the current code structure, removal of the Prefetch::read() 
>>>>> probably makes sense; however, I would like to highlight that 
>>>>> marking performance can be improved with sufficiently early 
>>>>> software cache prefetches.
>>>>>
>>>>> I expect to share more details later.
>>>>
>>>> Looking forward to that!
>>>>
>>>> cheers,
>>>> Per
>>>>