RFR(XXS): 8235931: add OM_CACHE_LINE_SIZE and use smaller size on SPARCv9 and X64

Fri Jan 17 18:06:24 UTC 2020

Hi Martin,

Thanks for chiming in on this review thread.

For the purposes of this fix (8235931), I'm going to stick with
OM_CACHE_LINE_SIZE and we'll have to hash out a broader definition
of "cache line size" via a different bug ID.

I hope you don't mind that I have no interest in taking on the broader
naming difficulties using 8235931...

Dan

On 12/20/19 9:24 AM, Doerr, Martin wrote:
> Hi,
>
> Just a comment on
>> CACHE_LINE_SIZE 64
>> DOUBLE_CACHE_LINE_SIZE 128
> I think CACHE_LINE_SIZE is too unspecific.
> There are L1I L1D, L2, L3, ... caches and they can have different cache line sizes.
> In addition, they can change between processor models.
>
> So we should have a better name than just "CACHE_LINE_SIZE".
>
> PPC and s390 already have large cache lines (which are real, not faked for performance considerations).
> I guess we don't want to use even more padding on these platforms.
> So we will probably want to set both constants to the same value (needs to get evaluated).
>
> Best regards,
> Martin
>
>
>> -----Original Message-----
>> From: hotspot-runtime-dev <hotspot-runtime-dev-
>> bounces at openjdk.java.net> On Behalf Of Daniel D. Daugherty
>> Sent: Donnerstag, 19. Dezember 2019 18:51
>> To: Claes Redestad <claes.redestad at oracle.com>; David Holmes
>> <david.holmes at oracle.com>; hotspot-runtime-dev at openjdk.java.net
>> Subject: Re: RFR(XXS): 8235931: add OM_CACHE_LINE_SIZE and use smaller
>> size on SPARCv9 and X64
>>
>> Hi David and Claes,
>>
>> Thanks for the fast reviews! Sorry for my delay in responding. I've been
>> working on the Async Monitor Deflation wiki updates...
>>
>> Since Claes was kind enough to reply to David's post and included all
>> of David's comments, I'm going to reply to both here.
>>
>>
>> On 12/18/19 10:03 AM, Claes Redestad wrote:
>>> Hi,
>>>
>>> On 2019-12-18 06:50, David Holmes wrote:
>>>> Hi Dan,
>>>>
>>>> On 18/12/2019 7:34 am, Daniel D. Daugherty wrote:
>>>>> Greetings,
>>>>>
>>>>> I'm extracting another standalone fix from the Async Monitor Deflation
>>>>> project (JDK-8153224) and sending it out for review (and testing)
>>>>> separately. This is an XXS sized fix, but since it involves cache
>>>>> line sizes I don't think a trivial review is appropriate.
>> First, David's part:
>>
>>
>>>> Can you elaborate on why a smaller cache-line-size assumption is
>>>> beneficial here?
>> I originally heard about switching from 128 byte to 64 byte cache-line-size
>> at the June 2019 Runtime offsite. I don't remember who brought it up, but
>> they were saying that Intel no longer recommends using 128 byte
>> cache-line-size
>> (and hasn't recommended that for some time). IIRC, the person also said
>> that
>> their Intel contact was a bit surprised about the 128 byte cache-line-size
>> recommendation.
>>
>> When I talked about Async Monitor Deflation in June I mentioned that I
>> was going to try out 64 byte cache-line-size for ObjectMonitors and so
>> far I haven't seen any performance issues.
>>
>>
>>>> Do you have performance numbers for the bug report?
>> I've been doing Aurora SPECjbb2015 runs (tuned G1, 10 iterations) between
>> my baseline (jdk-14+26) and the various fixes. For this fix (8235931)
>> relative to my first baseline run:
>>
>>       criticalJOPS  -0.40%  (Non-significant)
>>       maxJOPS       +0.23%  (Non-significant)
>>
>> I reran the baseline a second time using the same bits and for this fix
>> (8236035) relative to the second run:
>>
>>       criticalJOPS  +6.96%  (Non-significant)
>>       maxJOPS       +9.96%  (Non-significant)
>>
>> When I compare the two baseline runs:
>>
>>       criticalJOPS   -6.65%  (Non-significant)
>>       maxJOPS        -8.86%  (Non-significant)
>>
>> So even with 10 iterations, these SPECjbb2015 runs are all over the
>> place. And Aurora is agreeing with that by flagging the results as
>> "Non-significant" because the 'p' values are too large.
>>
>> I'm doing another set of Aurora SPECjbb2015 runs and I'm bumping the
>> iterations from 10 to 15. In the recent past, Per told me I might
>> have to do 20 runs of SPECjbb2015 to shake out the variations.
>> We'll see what turns up with this latest set of runs...
>>
>>
>> Second, Claes' part:
>>
>>> I did numerous experiments on this during work on JEP-143 and saw no
>>> improvement on neither "real" benchmarks nor stress tests that try to
>>> explicitly provoke cross-monitor false sharing on various fields in
>>> ObjectMonitors. I've not re-run those experiments on new hardware, so
>>> YMMV. Should probably check if they're still usable..
>> As always, I'm interested in your ObjectMonitor results.
>>
>>
>>> Specifically, for padding values below 64 we could provoke false sharing
>>> effects on stress tests, with negligible movement on "real" benchmarks.
>>> For values above 64 we just waste space.
>> Yup. You've said that a few times and that's one of the reasons that
>> I've included this change in the Async Monitor Deflation project.
>>
>>
>>> There exists a legend that you need double the physical cache line
>>> padding to not provoke the wrath of false sharing in the presence of
>>> aggressive prefetching, which is the reason why the
>>> "DEFAULT_CACHE_LINE_SIZE" is set to twice the typical physical cache
>>> line size in bytes. If we explicitly think we need to pad two lines,
>>> then the constant DEFAULT_CACHE_LINE_SIZE should be named
>> differently.
>>
>> I wouldn't call it "a legend". When we made this change years ago, Dice
>> had very specific advice from someone at Intel to make this change. He
>> could probably still dig up the details...
>>
>>
>>> The patch in question looks good to me,
>> Thanks!
>>
>>
>>> although I'd prefer something like:
>>>
>>> CACHE_LINE_SIZE 64
>>> DOUBLE_CACHE_LINE_SIZE 128
>> I think we should make that change on a more HotSpot wide basis with
>> a different bug ID. Also, do you see those values (64 and 128) being
>> appropriate for all the CPUs/platforms? Or do you foresee those values
>> being defined per-CPU/platform?
>>
>>
>>> .. and then use whichever is deemed more appropriate on a case by case
>>> basis. (Hint: it's probably CACHE_LINE_SIZE)
>> I accept that hint which is why I'm going with OM_CACHE_LINE_SIZE
>> so that ObjectMonitors can have their own case... :-)
>>
>>
>>> $.02
>>>
>>> /Claes
>> And back to David:
>>>> Otherwise the mechanics of the change look fine, except I would
>>>> suggest that this:
>>>>
>>>> #ifndef OM_CACHE_LINE_SIZE
>>>> // Use DEFAULT_CACHE_LINE_SIZE if not already specified for
>>>> // the current build platform.
>>>> #define OM_CACHE_LINE_SIZE DEFAULT_CACHE_LINE_SIZE
>>>> #endif
>>>>
>>>> should go in the shared globalDefinitions.hpp file.
>> I intentionally didn't want to expose OM_CACHE_LINE_SIZE in
>> globalDefinitions.hpp. I'm trying really hard to stay out of that file.
>> Also, is globalDefinitions.hpp processed before or after the CPU/platform
>> specific files?
>>
>> Obviously I need to specify OM_CACHE_LINE_SIZE in the two CPU files:
>>
>>       src/hotspot/cpu/sparc/globalDefinitions_sparc.hpp
>>       src/hotspot/cpu/x86/globalDefinitions_x86.hpp
>>
>> and both halves of the ObjectMonitor subsystem need the above logic:
>>
>>       src/hotspot/share/runtime/objectMonitor.hpp
>>       src/hotspot/share/runtime/synchronizer.hpp
>>
>> but we don't need to expose OM_CACHE_LINE_SIZE elsewhere. Of course,
>> if we end up adding CACHE_LINE_SIZE and DOUBLE_CACHE_LINE_SIZE to
>> the system *and* if we can get all the CPUs/platforms to agree to use
>> CACHE_LINE_SIZE for ObjectMonitors, then OM_CACHE_LINE_SIZE can be
>> retired.
>>
>> Thanks again for the fast reviews!
>>
>> Dan
>>
>>
>>>> Thanks,
>>>> David
>>>> -----
>>>>
>>>>>       JDK-8235931 add OM_CACHE_LINE_SIZE and use smaller size on
>>>>> SPARCv9 and X64
>>>>>       https://bugs.openjdk.java.net/browse/JDK-8235931
>>>>>
>>>>> Here's the webrev URL:
>>>>>
>>>>> http://cr.openjdk.java.net/~dcubed/8235931-webrev/0-for-jdk15/
>>>>>
>>>>> Folks that have reviewed JDK-8153224 will recognize these changes as
>>>>> a subset of the cache line changes from the Async Monitor Deflation
>>>>> project. It's a subset because there are more fields that need to be
>>>>> on separate cache lines in the Async Monitor Deflation project.
>>>>>
>>>>> These changes are being tested in a Mach5 Tier[1-3] run that's still
>>>>> running (JDK14 ATR has priority). I've also run these changes thru
>>>>> a set of 10 SPECjbb2015 runs; I'm waiting for my various SPECjbb2015
>>>>> runs to finish so I can do the analysis.
>>>>>
>>>>> Thanks, in advance, for comments, questions or suggestions.
>>>>>
>>>>> Dan