RFR: 8253064: monitor list simplifications and getting rid of TSM [v2]

Mon Nov 9 23:09:59 UTC 2020

Hi Dan,

I'm going to top-post just to keep this short :)

Thanks for all the details below on the performance aspects and use of 
the heuristics - all good.

Regarding "ceiling" versus "watermark" ... a "high watermark" is a 
ceiling so its really a matter of personal preference what terminology 
you want to use.

Thanks again,
David

On 10/11/2020 6:16 am, Daniel D. Daugherty wrote:
> Hi David,
> 
> I'm going to try replying to this review comment via email to see how that
> works out...
> 
> On 11/9/20 1:06 AM, David Holmes wrote:
>> Hi Dan,
>>
>> On 9/11/2020 1:50 pm, Daniel D.Daugherty wrote:
>>> On Sun, 8 Nov 2020 21:43:00 GMT, David Holmes <dholmes at openjdk.org> 
>>> wrote:
>>>
>>>>> How about this:
>>>>>    static MonitorList   _in_use_list;
>>>>>    // The ratio of the current _in_use_list count to the ceiling is 
>>>>> used
>>>>>    // to determine if we are above MonitorUsedDeflationThreshold 
>>>>> and need
>>>>>    // to do an async monitor deflation cycle. The ceiling is 
>>>>> increased by
>>>>>    // AvgMonitorsPerThreadEstimate when a thread is added to the 
>>>>> system
>>>>>    // and is decreased by AvgMonitorsPerThreadEstimate when a 
>>>>> thread is
>>>>>    // removed from the system.
>>>>>    // Note: If the _in_use_list max exceeds the ceiling, then
>>>>>    // monitors_used_above_threshold() will use the in_use_list max 
>>>>> instead
>>>>>    // of the thread count derived ceiling because we have used more
>>>>>    // ObjectMonitors than the estimated average.
>>>>>    static jint          _in_use_list_ceiling;
>>>>
>>>> Thanks for the comment. So instead of checking the threshhold on 
>>>> each OM allocation we use this averaging technique to estimate the 
>>>> number of monitors in use? Can you explain how this came about 
>>>> rather than the simple/obvious check at allocation time. Thanks.
>>>
>>> I'm not sure I understand your question, but let me that a stab at it 
>>> anyway...
>>>
>>> We used to compare the sum of the in-use counts from all the in-use 
>>> lists
>>> with the total population of ObjectMonitors. If that ratio was higher 
>>> than
>>> MonitorUsedDeflationThreshold, then we would do an async deflation 
>>> cycle.
>>> Since we got rid of TSM, we no longer had a population of already 
>>> allocated
>>> ObjectMonitors, we had a max value instead. However, when the VMs use
>>> of ObjectMonitors is first spinning up, the max value is typically 
>>> very close
>>> to the in-use count so we would always be asking for an async-deflation
>>> during that spinning up phase.
>>>
>>> I created the idea of a ceiling value that is tied to thread count 
>>> and the
>>> AvgMonitorsPerThreadEstimate to replace the population value that we
>>> used to have. By comparing the in-use count against the ceiling 
>>> value, we
>>> no longer exceed the MonitorUsedDeflationThreshold when the VMs use
>>> of ObjectMonitors is first spinning up so we no longer do async 
>>> deflations
>>> continuously during that phase. If the max value exceeds the ceiling 
>>> value,
>>> then we're using a LOT of ObjectMonitors and, in that case, we compare
>>> the in-use count against the max to determine if we're exceeding the
>>> MonitorUsedDeflationThreshold.
>>>
>>> Does this help?
>>
>> It helps but I'm still wrestling with what 
>> MonitorUsedDeflationThreshold actually means now.
>>
>> So the existing MonitorUsedDeflationThreshold is used as a measure of 
>> the proportion of monitors actually in-use compared to the number of 
>> monitors pre-allocated.
> 
> Slight correction: not a measure of proportion, but a threshold on the
> proportion.
> 
> 
>> If an inflation request requires a new block to be allocated and we're 
>> above MonitorUsedDeflationThreshold % then a request for async 
>> deflation occurs (when we actually check).
> 
> Not quite. We did have code in om_alloc() that could cause a deflation
> cycle to be invoked, but that code was not enabled by default and was
> removed. See the fix for:
> 
>      JDK-8230940 Obsolete MonitorBound
>      https://bugs.openjdk.java.net/browse/JDK-8230940
> 
> which was pushed in jdk-15-B22.
> 
> In the current baseline, ObjectSynchronizer::is_async_deflation_needed()
> is used to ask if we need to perform an async deflation cycle. That
> function calls monitors_used_above_threshold() which is where the logic
> that uses the MonitorUsedDeflationThreshold value comes into play.
> 
> 
>> The new code, IIUC, says, lets assume we expect 
>> AvgMonitorsPerThreadEstimate monitors-per-thread. If the number of 
>> monitors in-use is > MonitorUsedDeflationThreshold % of 
>> (AvgMonitorsPerThreadEstimate * number_of_threads), then we request 
>> async deflation.
> 
> You understand correctly.
> 
> 
>> So ... obviously we need some kind of watermark based system for 
>> requesting deflation otherwise there will be far too many deflation 
>> requests.
> 
> Yes. That's what I saw before I added AvgMonitorsPerThreadEstimate and
> the ceiling. Although, I think I like the phrase "watermark" better!
> 
> Should I rename this new ceiling concept to watermark??
> 
> 
>> And we also don't want to have check for exceeding the threshold on 
>> every monitor allocation. So the deflation thread will wakeup 
>> periodically and check if the threshold is exceeded.
> 
> Exactly. GuaranteedSafepointInterval (default 1sec) controls how often
> the MonitorDeflationThread will wake up on its own to check for work.
> AsyncDeflationInterval (default 250ms) controls at most how often
> the MonitorDeflationThread will actually do the work.
> 
> However, if an async deflation cycle is directly requested, then it
> will be honored regardless of the limits. At this point, only the
> WhiteBox support uses that feature. Although we do have an RFE from
> you to restore that feature to System.gc():
> 
>    JDK-8249638 Re-instate idle monitor deflation as part of System.gc()
>    https://bugs.openjdk.java.net/browse/JDK-8249638
> 
> 
>> Okay ... so then it comes down to deciding whether 
>> AvgMonitorsPerThreadEstimate is the best way to establish the 
>> watermark and what the default value should be. This doesn't seem like 
>> something that an application developer could reasonably try to 
>> estimate so it is just going to be a tuning knob they adjust somewhat 
>> arbitrarily.
> 
> I actually don't expect anyone to care about AvgMonitorsPerThreadEstimate,
> but I've created it just-in-case.
> 
> 
>> I assume the 1024 default came from tuning something?
> 
> The default 1024 value is the closest I can come to the baseline VM
> behavior where we allowed a per-thread free-list to allocate at most
> 1024 ObjectMonitors in one attempt. Since the per-thread free list is
> the fastest code path for an ObjectMonitor allocation in the baseline
> VM, I thought using that value made for a reasonable default for the
> AvgMonitorsPerThreadEstimate.
> 
> However, we don't pre-allocate in the new code so a thread doesn't
> actually have (at most) 1024 ObjectMonitors waiting on the per-thread
> free list for fast allocation.
> 
> 
>> Have you looked at the affect on memory use these changes have (ie 
>> peak RSS use)?
> 
> Actually what I've been looking at is population values in the baseline
> VM and the max in the new code with Kitchensink8H runs. I don't have
> the absolutely latest values, but, IIRC, the new code has a much smaller
> max value than the baseline population value.
> 
> Obviously this just tells me the raw numbers of ObjectMonitors and doesn't
> include any "hidden" overhead that would be captured by RSS values, but I
> think it gives me a reasonable feel for memory utilization.
> 
> 
>> Did your performance measurements look at using different values?
> 
> No. All my performance runs used default values. I tend to focus on
> out-of-the-box settings unless I'm looking to justify changing some
> default value.
> 
> 
>> (I can imagine that with enough memory we can effectively disable 
>> deflation and so potentially increase performance. OTOH maybe 
>> deflation is so infrequent it is a non-issue.)
>>
>> I have to confess that I never really thought about the old set of 
>> heuristics for this, but the fact we're changing the heuristics does 
>> raise a concern about what impact applications may see.
> 
> None of the performance testing that we've done so far has raised any
> concerns about the new heuristics.
> 
> The new ObjectMonitor inflation mechanism is much, much faster than
> the old mechanism. During my Inflate2 stress testing, the baseline VM
> would peak at a population of about 12 million at 4.5 hours into an 8
> hour run with release bits. With the new code, Inflate2 would reach a
> max of 400+ million at about an hour with no signs that was the actual
> peak when I stopped the run.
> 
> 
>> BTW MonitorUsedDeflationThreshold should really be diagnostic not 
>> experimental, as real applications may need to tune it (and people 
>> often don't want to use experimental flags in production as a matter 
>> of policy).
> 
> MonitorUsedDeflationThreshold wasn't added with this project. It was
> added by Robbin using this bug ID:
> 
>      JDK-8181859 Monitor deflation is not checked in cleanup path
>      https://bugs.openjdk.java.net/browse/JDK-8181859
> 
> way back in jdk-10-B21... I don't know the reason that Robbin created
> the option as experimental rather than diagnostic, but I can investigate.
> 
> Thanks again for the review!
> 
> 
> At this point, I don't see anything that I plan to change in response
> to this set of comments. I do have a query up above about renaming the
> ceiling concept to watermark. Please let me know what you think.
> 
> Dan
> 
>>
>> Thanks,
>> David
>> -----
>>
>>> -------------
>>>
>>> PR: https://git.openjdk.java.net/jdk/pull/642
>>>
>