RFR(XXS): 8235931: add OM_CACHE_LINE_SIZE and use smaller size on SPARCv9 and X64
Daniel D. Daugherty
daniel.daugherty at oracle.com
Tue Jan 21 15:22:29 UTC 2020
On 1/20/20 5:46 AM, Doerr, Martin wrote:
> Hi Dan,
>
>> It's not clear whether you reviewed the entire patch or not.
> Now I have reviewed the entire patch. Sorry for the delay.
Thanks for the review!
>>> I hope you don't mind that I have no interest in taking on the broader
>>> naming difficulties using 8235931...
> Perfectly ok. I appreciate cleaning things up in a separate change.
>
> I don't like having the "#define OM_CACHE_LINE_SIZE DEFAULT_CACHE_LINE_SIZE" twice.
> Maybe padded.hpp would be a good place? It's included from both, synchronizer.hpp and objectMonitor.hpp.
I don't think any of us (David H, you or me) like OM_CACHE_LINE_SIZE being
defined in two places, but I'm trying to avoid polluting non-Monitor places
with an OM_... symbol. That just doesn't feel right.
I own the problem of updating OM_CACHE_LINE_SIZE to whatever symbol we decide
replaces DEFAULT_CACHE_LINE_SIZE in the future, so my preference is to leave
things as they are until we reach that point. Can you live with that?
> synchronizer.cpp:
> Not sure if the padding change within SharedGlobals is beneficial and I don't see how it's related to the ObjectMonitors.
It seemed strange to change from DEFAULT_CACHE_LINE_SIZE to OM_CACHE_LINE_SIZE
in one part of synchronizer.cpp, but not in all parts... Also:
L1666: void ObjectSynchronizer::finish_deflate_idle_monitors(DeflateMonitorCounters* counters) {
L1692:   GVars.stw_random = os::random();
L1693:   GVars.stw_cycle++;
so two of the three fields are touched by ObjectMonitor code...
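For anyone reading along without the webrev open, the struct in question looks
roughly like this (a sketch from memory rather than the actual hunk; the patch
just swaps DEFAULT_CACHE_LINE_SIZE for OM_CACHE_LINE_SIZE in the padding
expressions, and DEFINE_PAD_MINUS_SIZE comes from runtime/padded.hpp):

  struct SharedGlobals {
    char         _pad_prefix[OM_CACHE_LINE_SIZE];
    // Highly shared, mostly-read fields; padding keeps them the sole
    // occupants of their cache line.
    volatile int stw_random;
    volatile int stw_cycle;
    DEFINE_PAD_MINUS_SIZE(1, OM_CACHE_LINE_SIZE, sizeof(volatile int) * 2);
    // Hot read-write field; sequestered from the fields above.
    volatile int hc_sequence;
    DEFINE_PAD_MINUS_SIZE(2, OM_CACHE_LINE_SIZE, sizeof(volatile int));
  };

(If memory serves, hc_sequence is the third field and is only touched by the
hash code generation path.)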
> The remaining parts make sense to me.
Thanks!
> Thanks for sharing performance results.
You're welcome! I'm doing another round of perf testing after rebasing
everything to jdk-14+32... this time 25 runs per config... I'm hoping to
get more stable results, but I'm not optimistic...
Dan
>
> Best regards,
> Martin
>
>
>> -----Original Message-----
>> From: Daniel D. Daugherty <daniel.daugherty at oracle.com>
>> Sent: Friday, January 17, 2020 22:20
>> To: Doerr, Martin <martin.doerr at sap.com>; Claes Redestad
>> <claes.redestad at oracle.com>; David Holmes <david.holmes at oracle.com>;
>> hotspot-runtime-dev at openjdk.java.net
>> Subject: Re: RFR(XXS): 8235931: add OM_CACHE_LINE_SIZE and use smaller
>> size on SPARCv9 and X64
>>
>> Martin,
>>
>> It's not clear whether you reviewed the entire patch or not.
>>
>> Currently I'm planning to list dholmes and redestad as reviewers. Do
>> you also wish to be listed as a reviewer?
>>
>> Dan
>>
>>
>> On 1/17/20 1:06 PM, Daniel D. Daugherty wrote:
>>> Hi Martin,
>>>
>>> Thanks for chiming in on this review thread.
>>>
>>> For the purposes of this fix (8235931), I'm going to stick with
>>> OM_CACHE_LINE_SIZE and we'll have to hash out a broader definition
>>> of "cache line size" via a different bug ID.
>>>
>>> I hope you don't mind that I have no interest in taking on the broader
>>> naming difficulties using 8235931...
>>>
>>> Dan
>>>
>>>
>>> On 12/20/19 9:24 AM, Doerr, Martin wrote:
>>>> Hi,
>>>>
>>>> Just a comment on
>>>>> CACHE_LINE_SIZE 64
>>>>> DOUBLE_CACHE_LINE_SIZE 128
>>>> I think CACHE_LINE_SIZE is too unspecific.
>>>> There are L1I, L1D, L2, L3, ... caches and they can have different
>>>> cache line sizes.
>>>> In addition, they can change between processor models.
>>>>
>>>> So we should have a better name than just "CACHE_LINE_SIZE".
>>>>
>>>> PPC and s390 already have large cache lines (which are real, not
>>>> faked for performance considerations).
>>>> I guess we don't want to use even more padding on these platforms.
>>>> So we will probably want to set both constants to the same value
>>>> (needs to get evaluated).
>>>>
>>>> Best regards,
>>>> Martin
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-runtime-dev <hotspot-runtime-dev-
>>>>> bounces at openjdk.java.net> On Behalf Of Daniel D. Daugherty
>>>>> Sent: Thursday, December 19, 2019 18:51
>>>>> To: Claes Redestad <claes.redestad at oracle.com>; David Holmes
>>>>> <david.holmes at oracle.com>; hotspot-runtime-dev at openjdk.java.net
>>>>> Subject: Re: RFR(XXS): 8235931: add OM_CACHE_LINE_SIZE and use smaller
>>>>> size on SPARCv9 and X64
>>>>>
>>>>> Hi David and Claes,
>>>>>
>>>>> Thanks for the fast reviews! Sorry for my delay in responding. I've been
>>>>> working on the Async Monitor Deflation wiki updates...
>>>>>
>>>>> Since Claes was kind enough to reply to David's post and included all
>>>>> of David's comments, I'm going to reply to both here.
>>>>>
>>>>>
>>>>> On 12/18/19 10:03 AM, Claes Redestad wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 2019-12-18 06:50, David Holmes wrote:
>>>>>>> Hi Dan,
>>>>>>>
>>>>>>> On 18/12/2019 7:34 am, Daniel D. Daugherty wrote:
>>>>>>>> Greetings,
>>>>>>>>
>>>>>>>> I'm extracting another standalone fix from the Async Monitor Deflation
>>>>>>>> project (JDK-8153224) and sending it out for review (and testing)
>>>>>>>> separately. This is an XXS sized fix, but since it involves cache
>>>>>>>> line sizes I don't think a trivial review is appropriate.
>>>>> First, David's part:
>>>>>
>>>>>
>>>>>>> Can you elaborate on why a smaller cache-line-size assumption is
>>>>>>> beneficial here?
>>>>> I originally heard about switching from 128 byte to 64 byte
>>>>> cache-line-size at the June 2019 Runtime offsite. I don't remember who
>>>>> brought it up, but they were saying that Intel no longer recommends
>>>>> using 128 byte cache-line-size (and hasn't recommended that for some
>>>>> time). IIRC, the person also said that their Intel contact was a bit
>>>>> surprised about the 128 byte cache-line-size recommendation.
>>>>>
>>>>> When I talked about Async Monitor Deflation in June I mentioned that I
>>>>> was going to try out 64 byte cache-line-size for ObjectMonitors and so
>>>>> far I haven't seen any performance issues.
>>>>>
>>>>>
>>>>>>> Do you have performance numbers for the bug report?
>>>>> I've been doing Aurora SPECjbb2015 runs (tuned G1, 10 iterations) between
>>>>> my baseline (jdk-14+26) and the various fixes. For this fix (8235931)
>>>>> relative to my first baseline run:
>>>>>
>>>>> criticalJOPS -0.40% (Non-significant)
>>>>> maxJOPS +0.23% (Non-significant)
>>>>>
>>>>> I reran the baseline a second time using the same bits and for this fix
>>>>> (8236035) relative to the second run:
>>>>>
>>>>> criticalJOPS +6.96% (Non-significant)
>>>>> maxJOPS +9.96% (Non-significant)
>>>>>
>>>>> When I compare the two baseline runs:
>>>>>
>>>>> criticalJOPS -6.65% (Non-significant)
>>>>> maxJOPS -8.86% (Non-significant)
>>>>>
>>>>> So even with 10 iterations, these SPECjbb2015 runs are all over the
>>>>> place. And Aurora is agreeing with that by flagging the results as
>>>>> "Non-significant" because the 'p' values are too large.
>>>>>
>>>>> I'm doing another set of Aurora SPECjbb2015 runs and I'm bumping the
>>>>> iterations from 10 to 15. In the recent past, Per told me I might
>>>>> have to do 20 runs of SPECjbb2015 to shake out the variations.
>>>>> We'll see what turns up with this latest set of runs...
>>>>>
>>>>>
>>>>> Second, Claes' part:
>>>>>
>>>>>> I did numerous experiments on this during work on JEP-143 and saw no
>>>>>> improvement on either "real" benchmarks or stress tests that try to
>>>>>> explicitly provoke cross-monitor false sharing on various fields in
>>>>>> ObjectMonitors. I've not re-run those experiments on new hardware, so
>>>>>> YMMV. Should probably check if they're still usable..
>>>>> As always, I'm interested in your ObjectMonitor results.
>>>>>
>>>>>
>>>>>> Specifically, for padding values below 64 we could provoke false sharing
>>>>>> effects on stress tests, with negligible movement on "real" benchmarks.
>>>>>> For values above 64 we just waste space.
>>>>> Yup. You've said that a few times and that's one of the reasons that
>>>>> I've included this change in the Async Monitor Deflation project.
>>>>>
>>>>>
>>>>>> There exists a legend that you need double the physical cache line
>>>>>> padding to not provoke the wrath of false sharing in the presence of
>>>>>> aggressive prefetching, which is the reason why the
>>>>>> "DEFAULT_CACHE_LINE_SIZE" is set to twice the typical physical cache
>>>>>> line size in bytes. If we explicitly think we need to pad two lines,
>>>>>> then the constant DEFAULT_CACHE_LINE_SIZE should be named differently.
>>>>>
>>>>> I wouldn't call it "a legend". When we made this change years ago, Dice
>>>>> had very specific advice from someone at Intel to make this change. He
>>>>> could probably still dig up the details...
>>>>>
>>>>>
>>>>>> The patch in question looks good to me,
>>>>> Thanks!
>>>>>
>>>>>
>>>>>> although I'd prefer something like:
>>>>>>
>>>>>> CACHE_LINE_SIZE 64
>>>>>> DOUBLE_CACHE_LINE_SIZE 128
>>>>> I think we should make that change on a broader, HotSpot-wide basis with
>>>>> a different bug ID. Also, do you see those values (64 and 128) being
>>>>> appropriate for all the CPUs/platforms? Or do you foresee those values
>>>>> being defined per-CPU/platform?
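Just to make that question concrete, here's the kind of per-platform split I'm
imagining. This is purely hypothetical -- the names and values below are not
from any webrev, and the PPC guess simply follows Martin's earlier note that
both constants would likely be equal on platforms with large physical lines:

  // Hypothetical sketch only, not part of 8235931.
  // globalDefinitions_x86.hpp:
  #define CACHE_LINE_SIZE        64
  #define DOUBLE_CACHE_LINE_SIZE (2 * CACHE_LINE_SIZE)

  // globalDefinitions_ppc.hpp: PPC already has large physical cache lines,
  // so both constants would probably end up with the same value there.
  #define CACHE_LINE_SIZE        128
  #define DOUBLE_CACHE_LINE_SIZE CACHE_LINE_SIZE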
>>>>>
>>>>>
>>>>>> .. and then use whichever is deemed more appropriate on a case by case
>>>>>> basis. (Hint: it's probably CACHE_LINE_SIZE)
>>>>> I accept that hint which is why I'm going with OM_CACHE_LINE_SIZE
>>>>> so that ObjectMonitors can have their own case... :-)
>>>>>
>>>>>
>>>>>> $.02
>>>>>>
>>>>>> /Claes
>>>>> And back to David:
>>>>>>> Otherwise the mechanics of the change look fine, except I would
>>>>>>> suggest that this:
>>>>>>>
>>>>>>> #ifndef OM_CACHE_LINE_SIZE
>>>>>>> // Use DEFAULT_CACHE_LINE_SIZE if not already specified for
>>>>>>> // the current build platform.
>>>>>>> #define OM_CACHE_LINE_SIZE DEFAULT_CACHE_LINE_SIZE
>>>>>>> #endif
>>>>>>>
>>>>>>> should go in the shared globalDefinitions.hpp file.
>>>>> I intentionally didn't want to expose OM_CACHE_LINE_SIZE in
>>>>> globalDefinitions.hpp. I'm trying really hard to stay out of that file.
>>>>> Also, is globalDefinitions.hpp processed before or after the
>>>>> CPU/platform-specific files?
>>>>>
>>>>> Obviously I need to specify OM_CACHE_LINE_SIZE in the two CPU files:
>>>>>
>>>>> src/hotspot/cpu/sparc/globalDefinitions_sparc.hpp
>>>>> src/hotspot/cpu/x86/globalDefinitions_x86.hpp
>>>>>
>>>>> and both halves of the ObjectMonitor subsystem need the above logic:
>>>>>
>>>>> src/hotspot/share/runtime/objectMonitor.hpp
>>>>> src/hotspot/share/runtime/synchronizer.hpp
>>>>>
>>>>> but we don't need to expose OM_CACHE_LINE_SIZE elsewhere. Of course,
>>>>> if we end up adding CACHE_LINE_SIZE and DOUBLE_CACHE_LINE_SIZE to
>>>>> the system *and* if we can get all the CPUs/platforms to agree to use
>>>>> CACHE_LINE_SIZE for ObjectMonitors, then OM_CACHE_LINE_SIZE can be
>>>>> retired.
>>>>>
>>>>> Thanks again for the fast reviews!
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>> -----
>>>>>>>
>>>>>>>> JDK-8235931 add OM_CACHE_LINE_SIZE and use smaller size on
>>>>>>>> SPARCv9 and X64
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8235931
>>>>>>>>
>>>>>>>> Here's the webrev URL:
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~dcubed/8235931-webrev/0-for-jdk15/
>>>>>>>>
>>>>>>>> Folks that have reviewed JDK-8153224 will recognize these changes as
>>>>>>>> a subset of the cache line changes from the Async Monitor Deflation
>>>>>>>> project. It's a subset because there are more fields that need to be
>>>>>>>> on separate cache lines in the Async Monitor Deflation project.
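For folks who haven't looked at 8153224 yet, the idiom here is simply to pad a
hot ObjectMonitor field out to OM_CACHE_LINE_SIZE so that the next hot field
starts on its own cache line, roughly like this (a simplified sketch of the
pattern, not the exact hunk in the webrev):

  class ObjectMonitor {
    volatile markWord _header;    // displaced object header word - mark
    // Pad out the rest of the line so _owner lands on a fresh cache line:
    DEFINE_PAD_MINUS_SIZE(0, OM_CACHE_LINE_SIZE, sizeof(volatile markWord));
    void* volatile    _owner;     // owning thread or BasicLock
    // ... remaining fields ...
  };

The Async Monitor Deflation project adds more such pad/field pairs; this fix
only carries a subset of those pads.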
>>>>>>>>
>>>>>>>> These changes are being tested in a Mach5 Tier[1-3] run that's still
>>>>>>>> running (JDK14 ATR has priority). I've also run these changes thru
>>>>>>>> a set of 10 SPECjbb2015 runs; I'm waiting for my various SPECjbb2015
>>>>>>>> runs to finish so I can do the analysis.
>>>>>>>>
>>>>>>>> Thanks, in advance, for comments, questions or suggestions.
>>>>>>>>
>>>>>>>> Dan
>>>