RFR: 8253064: monitor list simplifications and getting rid of TSM [v2]

Mon Nov 9 20:16:58 UTC 2020

Hi David,

I'm going to try replying to this review comment via email to see how that
works out...

On 11/9/20 1:06 AM, David Holmes wrote:
> Hi Dan,
>
> On 9/11/2020 1:50 pm, Daniel D.Daugherty wrote:
>> On Sun, 8 Nov 2020 21:43:00 GMT, David Holmes <dholmes at openjdk.org> 
>> wrote:
>>
>>>> How about this:
>>>>    static MonitorList   _in_use_list;
>>>>    // The ratio of the current _in_use_list count to the ceiling is 
>>>> used
>>>>    // to determine if we are above MonitorUsedDeflationThreshold 
>>>> and need
>>>>    // to do an async monitor deflation cycle. The ceiling is 
>>>> increased by
>>>>    // AvgMonitorsPerThreadEstimate when a thread is added to the 
>>>> system
>>>>    // and is decreased by AvgMonitorsPerThreadEstimate when a 
>>>> thread is
>>>>    // removed from the system.
>>>>    // Note: If the _in_use_list max exceeds the ceiling, then
>>>>    // monitors_used_above_threshold() will use the in_use_list max 
>>>> instead
>>>>    // of the thread count derived ceiling because we have used more
>>>>    // ObjectMonitors than the estimated average.
>>>>    static jint          _in_use_list_ceiling;
>>>
>>> Thanks for the comment. So instead of checking the threshhold on 
>>> each OM allocation we use this averaging technique to estimate the 
>>> number of monitors in use? Can you explain how this came about 
>>> rather than the simple/obvious check at allocation time. Thanks.
>>
>> I'm not sure I understand your question, but let me that a stab at it 
>> anyway...
>>
>> We used to compare the sum of the in-use counts from all the in-use 
>> lists
>> with the total population of ObjectMonitors. If that ratio was higher 
>> than
>> MonitorUsedDeflationThreshold, then we would do an async deflation 
>> cycle.
>> Since we got rid of TSM, we no longer had a population of already 
>> allocated
>> ObjectMonitors, we had a max value instead. However, when the VMs use
>> of ObjectMonitors is first spinning up, the max value is typically 
>> very close
>> to the in-use count so we would always be asking for an async-deflation
>> during that spinning up phase.
>>
>> I created the idea of a ceiling value that is tied to thread count 
>> and the
>> AvgMonitorsPerThreadEstimate to replace the population value that we
>> used to have. By comparing the in-use count against the ceiling 
>> value, we
>> no longer exceed the MonitorUsedDeflationThreshold when the VMs use
>> of ObjectMonitors is first spinning up so we no longer do async 
>> deflations
>> continuously during that phase. If the max value exceeds the ceiling 
>> value,
>> then we're using a LOT of ObjectMonitors and, in that case, we compare
>> the in-use count against the max to determine if we're exceeding the
>> MonitorUsedDeflationThreshold.
>>
>> Does this help?
>
> It helps but I'm still wrestling with what 
> MonitorUsedDeflationThreshold actually means now.
>
> So the existing MonitorUsedDeflationThreshold is used as a measure of 
> the proportion of monitors actually in-use compared to the number of 
> monitors pre-allocated.

Slight correction: not a measure of proportion, but a threshold on the
proportion.

> If an inflation request requires a new block to be allocated and we're 
> above MonitorUsedDeflationThreshold % then a request for async 
> deflation occurs (when we actually check).

Not quite. We did have code in om_alloc() that could cause a deflation
cycle to be invoked, but that code was not enabled by default and was
removed. See the fix for:

     JDK-8230940 Obsolete MonitorBound
     https://bugs.openjdk.java.net/browse/JDK-8230940

which was pushed in jdk-15-B22.

In the current baseline, ObjectSynchronizer::is_async_deflation_needed()
is used to ask if we need to perform an async deflation cycle. That
function calls monitors_used_above_threshold() which is where the logic
that uses the MonitorUsedDeflationThreshold value comes into play.

> The new code, IIUC, says, lets assume we expect 
> AvgMonitorsPerThreadEstimate monitors-per-thread. If the number of 
> monitors in-use is > MonitorUsedDeflationThreshold % of 
> (AvgMonitorsPerThreadEstimate * number_of_threads), then we request 
> async deflation.

You understand correctly.

> So ... obviously we need some kind of watermark based system for 
> requesting deflation otherwise there will be far too many deflation 
> requests.

Yes. That's what I saw before I added AvgMonitorsPerThreadEstimate and
the ceiling. Although, I think I like the phrase "watermark" better!

Should I rename this new ceiling concept to watermark??

> And we also don't want to have check for exceeding the threshold on 
> every monitor allocation. So the deflation thread will wakeup 
> periodically and check if the threshold is exceeded.

Exactly. GuaranteedSafepointInterval (default 1sec) controls how often
the MonitorDeflationThread will wake up on its own to check for work.
AsyncDeflationInterval (default 250ms) controls at most how often
the MonitorDeflationThread will actually do the work.

However, if an async deflation cycle is directly requested, then it
will be honored regardless of the limits. At this point, only the
WhiteBox support uses that feature. Although we do have an RFE from
you to restore that feature to System.gc():

   JDK-8249638 Re-instate idle monitor deflation as part of System.gc()
   https://bugs.openjdk.java.net/browse/JDK-8249638

> Okay ... so then it comes down to deciding whether 
> AvgMonitorsPerThreadEstimate is the best way to establish the 
> watermark and what the default value should be. This doesn't seem like 
> something that an application developer could reasonably try to 
> estimate so it is just going to be a tuning knob they adjust somewhat 
> arbitrarily.

I actually don't expect anyone to care about AvgMonitorsPerThreadEstimate,
but I've created it just-in-case.

> I assume the 1024 default came from tuning something?

The default 1024 value is the closest I can come to the baseline VM
behavior where we allowed a per-thread free-list to allocate at most
1024 ObjectMonitors in one attempt. Since the per-thread free list is
the fastest code path for an ObjectMonitor allocation in the baseline
VM, I thought using that value made for a reasonable default for the
AvgMonitorsPerThreadEstimate.

However, we don't pre-allocate in the new code so a thread doesn't
actually have (at most) 1024 ObjectMonitors waiting on the per-thread
free list for fast allocation.

> Have you looked at the affect on memory use these changes have (ie 
> peak RSS use)?

Actually what I've been looking at is population values in the baseline
VM and the max in the new code with Kitchensink8H runs. I don't have
the absolutely latest values, but, IIRC, the new code has a much smaller
max value than the baseline population value.

Obviously this just tells me the raw numbers of ObjectMonitors and doesn't
include any "hidden" overhead that would be captured by RSS values, but I
think it gives me a reasonable feel for memory utilization.

> Did your performance measurements look at using different values?

No. All my performance runs used default values. I tend to focus on
out-of-the-box settings unless I'm looking to justify changing some
default value.

> (I can imagine that with enough memory we can effectively disable 
> deflation and so potentially increase performance. OTOH maybe 
> deflation is so infrequent it is a non-issue.)
>
> I have to confess that I never really thought about the old set of 
> heuristics for this, but the fact we're changing the heuristics does 
> raise a concern about what impact applications may see.

None of the performance testing that we've done so far has raised any
concerns about the new heuristics.

The new ObjectMonitor inflation mechanism is much, much faster than
the old mechanism. During my Inflate2 stress testing, the baseline VM
would peak at a population of about 12 million at 4.5 hours into an 8
hour run with release bits. With the new code, Inflate2 would reach a
max of 400+ million at about an hour with no signs that was the actual
peak when I stopped the run.

> BTW MonitorUsedDeflationThreshold should really be diagnostic not 
> experimental, as real applications may need to tune it (and people 
> often don't want to use experimental flags in production as a matter 
> of policy).

MonitorUsedDeflationThreshold wasn't added with this project. It was
added by Robbin using this bug ID:

     JDK-8181859 Monitor deflation is not checked in cleanup path
     https://bugs.openjdk.java.net/browse/JDK-8181859

way back in jdk-10-B21... I don't know the reason that Robbin created
the option as experimental rather than diagnostic, but I can investigate.

Thanks again for the review!

At this point, I don't see anything that I plan to change in response
to this set of comments. I do have a query up above about renaming the
ceiling concept to watermark. Please let me know what you think.

Dan

>
> Thanks,
> David
> -----
>
>> -------------
>>
>> PR: https://git.openjdk.java.net/jdk/pull/642
>>