Small question about JDK-8253064 and ObjectMonitor allocation

Mon Jan 31 22:59:11 UTC 2022

Greetings,

I'm going to try and add some historical context here...

On 1/31/22 3:09 AM, Thomas Stüfe wrote:
> Thanks a lot for your answers, David.

Yes David, thanks for jumping in on this thread.

> On Mon, Jan 31, 2022 at 8:35 AM David Holmes <david.holmes at oracle.com>
> wrote:
>
>> On 31/01/2022 3:54 pm, Thomas Stüfe wrote:
>>> Hi David,
>>>
>>> Thank you for the answer!
>>>
>>> On Mon, Jan 31, 2022 at 6:23 AM David Holmes <david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>> wrote:
>>>
>>>      Hi Thomas,
>>>
>>>      On 31/01/2022 2:32 pm, Thomas Stüfe wrote:
>>>       > Hi,
>>>       >
>>>       > I have a small question about a detail of JDK-8253064.
>>>       >
>>>       > IIUC, before this patch, the VM kept thread-local freelists of
>>>       > pre-allocated ObjectMonitors to reduce allocation contention. Now we just
>>>       > malloc monitors right away.
>>>       >
>>>       > I looked through the issue and the associated PR, but could find no
>>>       > information on why this was done. Dan describes what he did very well:
>>>       > https://github.com/openjdk/jdk/pull/642#issuecomment-720753946, but not
>>>       > why.

Thank you and I'm sorry that my words on 8253064 did not explain the why.

>>>       > I assume that the complexity and memory overhead of the free lists was not
>>>       > worth it? That you found that malloc() is on our platforms "uncontented"
>>>       > enough?
>>>
>>>      The issue was not about freelists and contention it was about requiring
>>>      type-stable-memory: that once a piece of memory was allocated as an
>>>      ObjectMonitor it remained forever after an ObjectMonitor. This allowed
>>>      for various race conditions in the old monitor code maintaining safety.

Erik Osterlund did have some specific reasons for getting rid of 
type-stable-memory.
At least on reason was that it was complicating the work he wanted to do on:

     JDK-8247281 migrate ObjectMonitor::_object to OopStorage
     https://bugs.openjdk.java.net/browse/JDK-8247281

He and I did work together to split this work into the various pieces that
were integrated separately. So TSM didn't prevent work on JDK-8247281, but
it did make it more difficult.

Erik may have had other reasons for getting rid of TSM. I've added him to
this email thread directly so he has a better chance of seeing this query.

>>>      Over time that code changed substantially and the need for
>>>      type-stable-memory for ObjectMonitors disappeared, so we finally got rid
>>>      of it and just moved to a direct allocation scheme.
>>>
>>>
>>> I think I understand, but I was specifically concerned with the question
>>> of allocation contention of ObjectMonitors. That is somewhat independent
>>> from the question of where OMs are allocated.
>>>
>>> Can it happen that lock inflation happens clustered, or does that not
>>> occur in reality?
>>>
>>> AFAIU the old code managed OM storage itself, used global data
>>> structures to do so, and guarded access with a mutex. To reduce
>>> contention, it used a surprisingly large thread-local freelist of 1024
>>> entries. This looks like contention was once a real problem.

We need to take a step back and look at how we got where we are. There
are (at least) four different evolutions or seismic events to this 
subsystem:

1) type-stable-memory and the global lists (block and free)
   - JDK-5030359 "Back-end" synchronization improvements for 1.5.1 or 1.6
     https://bugs.openjdk.java.net/browse/JDK-5030359
   - integrated in jdk-6+50 (partial), jdk-6+53 (finished)

   - This integration introduced type-stable-memory (TSM) and global block
     and free-lists. Dice chose to not use malloc and use TSM for a few
     reasons:
     - direct control of the placement of the ObjectMonitor on a cache-line
       boundary; couldn't trust malloc to keep memory properly aligned.
     - malloc was slower than direct management of blocks of ObjectMonitors.
     - Dice did lots of benchmarking to prove the new code was faster in
       things we cared about at the time on the platforms we cared about.
   - This predates my joining the Runtime team so I don't have exhaustive
     records on this work.

2a) per-thread free list
   - JDK-6468954 Generic synchronization cleanups for J2SE 1.7
     https://bugs.openjdk.java.net/browse/JDK-6468954
   - integrated in jdk-7+12

   - This integration introduced the per-thread free list. This addition
     was to deal with contention with allocating an ObjectMonitor from
     the global free-list.
   - Again, Dice did targeted benchmarking to show that adding a per-thread
     free list  performed better for some specific platforms/configs.
   - This predates my joining the Runtime team so I don't have exhaustive
     records on this work.

2b) per-thread in-use list
   - JDK-6852873 Increase in delta between application stopped time and 
ParNew GC time over application lifetime
     https://bugs.openjdk.java.net/browse/JDK-6852873
   - integrated in jdk-7+99

   - This integration added the MonitorInUseLists concept and was done
     as part of some work to deal with a safepoint performance problem.
   - MonitorInUseLists was disabled by default.
   - The benchmarking done here was focused on safepointing.
   - I joined the Runtime team in Jan 2010, but I don't have exhaustive
     records on this work.

2c) global in-use list
   - JDK-6964164 +MonitorInUseLists can cause leakage of contended objects
     https://bugs.openjdk.java.net/browse/JDK-6964164
   - integrated in jdk-7+102

   - This integration added the global in-use list since having just a
     global block-list and global free-list had some races that caused
     leaks.
   - I joined the Runtime team in Jan 2010, but I don't have exhaustive
     records of this work.

2d) MonitorInUseLists on by default
   - JDK-8149442 MonitorInUseLists should be on by default, deflate idle 
monitors taking too long
     https://bugs.openjdk.java.net/browse/JDK-8149442
   - integrated in jdk-9+120

   - This default switch flip was done to improve the safepoint time
     it takes to deflate idle monitors. The benchmarks done were again
     safepoint focused.

3) async monitor deflation
    - JDK-8153224 Monitor deflation prolong safepoints
      https://bugs.openjdk.java.net/browse/JDK-8153224
    - This evolution switched to async deflation and lock-free lists and
      was very complicated.
    - https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
    - integrated in jdk-15+26

    - Again the focus was on reducing safepoint time. Lots of benchmarking
      done with the perf group. We got faster safepoints, but had to really
      fight to avoid regressions in artifically contended benchmarks which
      is where the lock-free lists stuff came from.

4) monitor list simplifications and getting rid of TSM
    - JDK-8253064 monitor list simplifications and getting rid of TSM
      https://bugs.openjdk.java.net/browse/JDK-8253064
    - integrated in jdk-16+24

    - Not much to add here about 8253064. We wanted to get rid of TSM to
      make other work simpler. Getting rid of TSM allowed us to switch to
      much simpler list management logic. So we got rid of a lot of the
      logic that we added with async monitor deflation (8153224).
    - Lots of benchmarking done with the perf group and those 
pre-integration
      benchmarks showed no worrisome regressions and some speedups.
    - Later benchmarking showed some regressions (that have since been 
fixed)
      and we never figured out why we had different results later 
compared to
      the pre-integration benchmarks. One theory is that we see different
      results in AMD-X64 versus Intel-X64.

>> You can always create a benchmark to show contention in the monitor
>> inflation code. I don't recall now whether this was a real issue or a
>> microbenchmark issue. As the code stated:
>>
>> ObjectMonitor * ATTR ObjectSynchronizer::omAlloc (Thread * Self) {
>>       // A large MAXPRIVATE value reduces both list lock contention
>>       // and list coherency traffic, but also tends to increase the
>>       // number of objectMonitors in circulation as well as the STW
>>       // scavenge costs.  As usual, we lean toward time in space-time
>>       // tradeoffs.
>>
>>       const int MAXPRIVATE = 1024 ;
>>
>> so general performance was a consideration.
>>
>>> OTOH the new code just uses malloc, which also may lock depending on the
>>> malloc allocator internals and the used libc settings. Therefore I
>>> wonder whether OM allocation is still a problem, not a problem with
>>> real-life malloc, or maybe never really had been a problem and the old
>>> code was just overly cautious?
>> Whenever we make significant changes to a subsystem we always
>> investigate the performance profile of the changes. We're prepared to
>> accept some performance loss if we have a good improvement in code
>> complexity/maintainability etc, but if a significant performance issue
>> arose we would revisit it. See for example discussion in:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8263864
>>
>> and related.

We have not seen any performance regressions that we can attribute to
malloc lock contention. In fact, I had to put upper limits on my personal
ObjectMonitor inflation stress programs because they could allocate
inflated ObjectMonitors so fast that the MonitorDeflationThread could
not keep up. You'll notice that the deflation code now stops its
current loop at 1 million objects and takes a check-for-a-safepoint
breather. So I would say that malloc() of ObjectMonitor is not a
performance issue any longer. I believe Dice said that it was before
back in the JDK6 days, but no longer...

Hopefully some of this history stuff is useful.

Dan

>>
>> Cheers,
>> David
>> -----
>>
>>> Thanks, Thomas
>>>
>>>