Small question about JDK-8253064 and ObjectMonitor allocation

Wed Feb 2 10:37:58 UTC 2022

Hi Dan,

thanks a lot for the very thorough archeological dig!

The background for my question is that in the context of Lilliput I
experiment with replacing OM pointers with smaller-sized references. This
would mean we'd have to allocate them in an array or a confined address
space. That would require some form of global storage and raise the
question of allocation and deallocation contention. My current plan is to
deal with that using thread-local bulk operations, and so I arrive at
something that looks and feels similar to what existed before JDK-8253064.
It looks suspiciously simple, but the amount of work that went into the
original code feels like I may re-discover a lot of pitfalls. I also have
to understand why TSM stood in the way of JDK-8247281.

I'm still very much at the code reading and experimenting phase, so I don't
have more specific questions. If some come up, I'll ask again. Your history
lesson is very helpful.

Thanks!

Cheers, Thomas

On Mon, Jan 31, 2022 at 11:59 PM <daniel.daugherty at oracle.com> wrote:

> Greetings,
>
> I'm going to try and add some historical context here...
>
>
> On 1/31/22 3:09 AM, Thomas Stüfe wrote:
> > Thanks a lot for your answers, David.
>
> Yes David, thanks for jumping in on this thread.
>
>
> > On Mon, Jan 31, 2022 at 8:35 AM David Holmes <david.holmes at oracle.com>
> > wrote:
> >
> >> On 31/01/2022 3:54 pm, Thomas Stüfe wrote:
> >>> Hi David,
> >>>
> >>> Thank you for the answer!
> >>>
> >>> On Mon, Jan 31, 2022 at 6:23 AM David Holmes <david.holmes at oracle.com
> >>> <mailto:david.holmes at oracle.com>> wrote:
> >>>
> >>>      Hi Thomas,
> >>>
> >>>      On 31/01/2022 2:32 pm, Thomas Stüfe wrote:
> >>>       > Hi,
> >>>       >
> >>>       > I have a small question about a detail of JDK-8253064.
> >>>       >
> >>>       > IIUC, before this patch, the VM kept thread-local freelists of
> >>>       > pre-allocated ObjectMonitors to reduce allocation contention.
> Now we just
> >>>       > malloc monitors right away.
> >>>       >
> >>>       > I looked through the issue and the associated PR, but could
> find no
> >>>       > information on why this was done. Dan describes what he did
> very well:
> >>>       > https://github.com/openjdk/jdk/pull/642#issuecomment-720753946,
> but not
> >>>       > why.
>
> Thank you and I'm sorry that my words on 8253064 did not explain the why.
>
>
> >>>       > I assume that the complexity and memory overhead of the free
> lists was not
> >>>       > worth it? That you found that malloc() is on our platforms
> "uncontented"
> >>>       > enough?
> >>>
> >>>      The issue was not about freelists and contention it was about
> requiring
> >>>      type-stable-memory: that once a piece of memory was allocated as
> an
> >>>      ObjectMonitor it remained forever after an ObjectMonitor. This
> allowed
> >>>      for various race conditions in the old monitor code maintaining
> safety.
>
> Erik Osterlund did have some specific reasons for getting rid of
> type-stable-memory.
> At least on reason was that it was complicating the work he wanted to do
> on:
>
>      JDK-8247281 migrate ObjectMonitor::_object to OopStorage
>      https://bugs.openjdk.java.net/browse/JDK-8247281
>
> He and I did work together to split this work into the various pieces that
> were integrated separately. So TSM didn't prevent work on JDK-8247281, but
> it did make it more difficult.
>
> Erik may have had other reasons for getting rid of TSM. I've added him to
> this email thread directly so he has a better chance of seeing this query.
>
>
>
> >>>      Over time that code changed substantially and the need for
> >>>      type-stable-memory for ObjectMonitors disappeared, so we finally
> got rid
> >>>      of it and just moved to a direct allocation scheme.
> >>>
> >>>
> >>> I think I understand, but I was specifically concerned with the
> question
> >>> of allocation contention of ObjectMonitors. That is somewhat
> independent
> >>> from the question of where OMs are allocated.
> >>>
> >>> Can it happen that lock inflation happens clustered, or does that not
> >>> occur in reality?
> >>>
> >>> AFAIU the old code managed OM storage itself, used global data
> >>> structures to do so, and guarded access with a mutex. To reduce
> >>> contention, it used a surprisingly large thread-local freelist of 1024
> >>> entries. This looks like contention was once a real problem.
>
> We need to take a step back and look at how we got where we are. There
> are (at least) four different evolutions or seismic events to this
> subsystem:
>
> 1) type-stable-memory and the global lists (block and free)
>    - JDK-5030359 "Back-end" synchronization improvements for 1.5.1 or 1.6
>      https://bugs.openjdk.java.net/browse/JDK-5030359
>    - integrated in jdk-6+50 (partial), jdk-6+53 (finished)
>
>    - This integration introduced type-stable-memory (TSM) and global block
>      and free-lists. Dice chose to not use malloc and use TSM for a few
>      reasons:
>      - direct control of the placement of the ObjectMonitor on a cache-line
>        boundary; couldn't trust malloc to keep memory properly aligned.
>      - malloc was slower than direct management of blocks of
> ObjectMonitors.
>      - Dice did lots of benchmarking to prove the new code was faster in
>        things we cared about at the time on the platforms we cared about.
>    - This predates my joining the Runtime team so I don't have exhaustive
>      records on this work.
>
> 2a) per-thread free list
>    - JDK-6468954 Generic synchronization cleanups for J2SE 1.7
>      https://bugs.openjdk.java.net/browse/JDK-6468954
>    - integrated in jdk-7+12
>
>    - This integration introduced the per-thread free list. This addition
>      was to deal with contention with allocating an ObjectMonitor from
>      the global free-list.
>    - Again, Dice did targeted benchmarking to show that adding a per-thread
>      free list  performed better for some specific platforms/configs.
>    - This predates my joining the Runtime team so I don't have exhaustive
>      records on this work.
>
> 2b) per-thread in-use list
>    - JDK-6852873 Increase in delta between application stopped time and
> ParNew GC time over application lifetime
>      https://bugs.openjdk.java.net/browse/JDK-6852873
>    - integrated in jdk-7+99
>
>    - This integration added the MonitorInUseLists concept and was done
>      as part of some work to deal with a safepoint performance problem.
>    - MonitorInUseLists was disabled by default.
>    - The benchmarking done here was focused on safepointing.
>    - I joined the Runtime team in Jan 2010, but I don't have exhaustive
>      records on this work.
>
> 2c) global in-use list
>    - JDK-6964164 +MonitorInUseLists can cause leakage of contended objects
>      https://bugs.openjdk.java.net/browse/JDK-6964164
>    - integrated in jdk-7+102
>
>    - This integration added the global in-use list since having just a
>      global block-list and global free-list had some races that caused
>      leaks.
>    - I joined the Runtime team in Jan 2010, but I don't have exhaustive
>      records of this work.
>
> 2d) MonitorInUseLists on by default
>    - JDK-8149442 MonitorInUseLists should be on by default, deflate idle
> monitors taking too long
>      https://bugs.openjdk.java.net/browse/JDK-8149442
>    - integrated in jdk-9+120
>
>    - This default switch flip was done to improve the safepoint time
>      it takes to deflate idle monitors. The benchmarks done were again
>      safepoint focused.
>
> 3) async monitor deflation
>     - JDK-8153224 Monitor deflation prolong safepoints
>       https://bugs.openjdk.java.net/browse/JDK-8153224
>     - This evolution switched to async deflation and lock-free lists and
>       was very complicated.
>     -
> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
>     - integrated in jdk-15+26
>
>     - Again the focus was on reducing safepoint time. Lots of benchmarking
>       done with the perf group. We got faster safepoints, but had to really
>       fight to avoid regressions in artifically contended benchmarks which
>       is where the lock-free lists stuff came from.
>
> 4) monitor list simplifications and getting rid of TSM
>     - JDK-8253064 monitor list simplifications and getting rid of TSM
>       https://bugs.openjdk.java.net/browse/JDK-8253064
>     - integrated in jdk-16+24
>
>     - Not much to add here about 8253064. We wanted to get rid of TSM to
>       make other work simpler. Getting rid of TSM allowed us to switch to
>       much simpler list management logic. So we got rid of a lot of the
>       logic that we added with async monitor deflation (8153224).
>     - Lots of benchmarking done with the perf group and those
> pre-integration
>       benchmarks showed no worrisome regressions and some speedups.
>     - Later benchmarking showed some regressions (that have since been
> fixed)
>       and we never figured out why we had different results later
> compared to
>       the pre-integration benchmarks. One theory is that we see different
>       results in AMD-X64 versus Intel-X64.
>
>
> >> You can always create a benchmark to show contention in the monitor
> >> inflation code. I don't recall now whether this was a real issue or a
> >> microbenchmark issue. As the code stated:
> >>
> >> ObjectMonitor * ATTR ObjectSynchronizer::omAlloc (Thread * Self) {
> >>       // A large MAXPRIVATE value reduces both list lock contention
> >>       // and list coherency traffic, but also tends to increase the
> >>       // number of objectMonitors in circulation as well as the STW
> >>       // scavenge costs.  As usual, we lean toward time in space-time
> >>       // tradeoffs.
> >>
> >>       const int MAXPRIVATE = 1024 ;
> >>
> >> so general performance was a consideration.
> >>
> >>> OTOH the new code just uses malloc, which also may lock depending on
> the
> >>> malloc allocator internals and the used libc settings. Therefore I
> >>> wonder whether OM allocation is still a problem, not a problem with
> >>> real-life malloc, or maybe never really had been a problem and the old
> >>> code was just overly cautious?
> >> Whenever we make significant changes to a subsystem we always
> >> investigate the performance profile of the changes. We're prepared to
> >> accept some performance loss if we have a good improvement in code
> >> complexity/maintainability etc, but if a significant performance issue
> >> arose we would revisit it. See for example discussion in:
> >>
> >> https://bugs.openjdk.java.net/browse/JDK-8263864
> >>
> >> and related.
>
> We have not seen any performance regressions that we can attribute to
> malloc lock contention. In fact, I had to put upper limits on my personal
> ObjectMonitor inflation stress programs because they could allocate
> inflated ObjectMonitors so fast that the MonitorDeflationThread could
> not keep up. You'll notice that the deflation code now stops its
> current loop at 1 million objects and takes a check-for-a-safepoint
> breather. So I would say that malloc() of ObjectMonitor is not a
> performance issue any longer. I believe Dice said that it was before
> back in the JDK6 days, but no longer...
>
>
> Hopefully some of this history stuff is useful.
>
> Dan
>
>
>
> >>
> >> Cheers,
> >> David
> >> -----
> >>
> >>> Thanks, Thomas
> >>>
> >>>
>
>