Small question about JDK-8253064 and ObjectMonitor allocation

daniel.daugherty at oracle.com daniel.daugherty at oracle.com
Fri Mar 4 16:38:52 UTC 2022


Sorry for the delay in replying to this thread...


On 3/3/22 9:45 PM, David Holmes wrote:
> Hi Thomas,
>
> On 4/03/2022 1:06 am, Thomas Stüfe wrote:
>> Hi Dan,
>>
>> sorry for being quiet so long, I'm really thankful for all the 
>> explanations.
>>
>> I have one more question, if I may.
>>
>> In ObjectMonitor, we carefully shepherd different groups of members 
>> into different cache lines. But what prevents the end of one OM and 
>> the start of the next OM to share one cacheline? If the underlying 
>> libc allocator feels inclined to place them side by side?
>
> I think nothing prevents it but why do we care? The padding is to 
> prevent sharing of hot fields in the same OM, it does nothing for 
> sharing with different OMs (or any other object).

Time for a bit more history...

Prior to the fix for:

     JDK-8253064 monitor list simplifications and getting rid of TSM
     https://bugs.openjdk.java.net/browse/JDK-8253064

we not only had padding in between certain fields in the ObjectMonitor,
but we also made sure that each ObjectMonitor was carefully allocated
on a cache line boundary and that it did not share a cache line with
anything else.

This was accomplished by this code:

 > #ifndef OM_CACHE_LINE_SIZE
 > // Use DEFAULT_CACHE_LINE_SIZE if not already specified for
 > // the current build platform.
 > #define OM_CACHE_LINE_SIZE DEFAULT_CACHE_LINE_SIZE
 > #endif
 >
 > typedef PaddedEnd<ObjectMonitor, OM_CACHE_LINE_SIZE> PaddedObjectMonitor;

and:

 >     // 3: allocate a block of new ObjectMonitors
 >     // Both the local and global free lists are empty -- resort to malloc().
 >     // In the current implementation ObjectMonitors are TSM - immortal.
 >     // Ideally, we'd write "new ObjectMonitor[_BLOCKSIZE]", but we want
 >     // each ObjectMonitor to start at the beginning of a cache line,
 >     // so we use align_up().
 >     // A better solution would be to use C++ placement-new.
 >     // BEWARE: As it stands currently, we don't run the ctors!
 >     assert(_BLOCKSIZE > 1, "invariant");
 >     size_t neededsize = sizeof(PaddedObjectMonitor) * _BLOCKSIZE;
 >     PaddedObjectMonitor* temp;
 >     size_t aligned_size = neededsize + (OM_CACHE_LINE_SIZE - 1);
 >     void* real_malloc_addr = NEW_C_HEAP_ARRAY(char, aligned_size, mtInternal);
 >     temp = (PaddedObjectMonitor*)align_up(real_malloc_addr, OM_CACHE_LINE_SIZE);
 >     (void)memset((void *) temp, 0, neededsize);


So we used to allocate a block of PaddedObjectMonitors and we made sure that
each new block started on a cache line boundary. Since PaddedObjectMonitor
adds padding to the end of the ObjectMonitor it contains, each ObjectMonitor
in the block also started on a cache line boundary, so every ObjectMonitor
was isolated from cache line interference with every other ObjectMonitor.

Because we allocated ObjectMonitors in blocks, ObjectMonitors were guaranteed
to sit immediately next to other ObjectMonitors, so we had to do something to
prevent cache line overlaps.

By switching to individual ObjectMonitor allocation with malloc(), there's no
guarantee that ObjectMonitor-1 will be allocated next to ObjectMonitor-2, but
there's also no guarantee that two ObjectMonitors won't be allocated next to
each other.

So, yes, it is now possible for an ObjectMonitor to share a cache line with
another ObjectMonitor. Our performance testing prior to integration of the
fix for:

     JDK-8253064 monitor list simplifications and getting rid of TSM
     https://bugs.openjdk.java.net/browse/JDK-8253064

did not give us any results indicating that this issue is a real problem.
However, absence of evidence is not evidence of absence. The removal of end
padding, and padding in general, was discussed during the 8253064 code review.
In fact, I have a follow-up bug assigned to me:

     JDK-8256303 revisit ObjectMonitor padding between fields
     https://bugs.openjdk.java.net/browse/JDK-8256303

Please let me know if this helps in your investigation or if you need more
information.

Dan
-


>
> David
> -----
>
>> I see the last member group inside an OM spilling over into the next 
>> cache line:
>>
>> ```
>> 0  ObjectMonitor* _next_om;
>> 8 volatile intx _recursions;
>> 16  ObjectWaiter* volatile _EntryList;
>> 24  ObjectWaiter* volatile _cxq;
>> 32  JavaThread* volatile _succ;
>> 40  JavaThread* volatile _Responsible;
>> 48  volatile int _Spinner;
>> 52?  volatile int _SpinDuration;
>> 56?  int _contentions;
>> 64?  ObjectWaiter* volatile _WaitSet;
>>    volatile int  _waiters;
>>    volatile int _WaitSetLock;
>> ```
>>
>> Depending on whether the compiler packs 32bits into words or not, I 
>> assume members starting at _WaitSet will start at a new cache line. 
>> The next OM may follow immediately after that. Is that not a problem? 
>> Are we just lucky that the natural malloc allocation granularity 
>> prevents this here? Or should we also add padding at that point?
>>
>> Thanks!
>>
>> Thomas
>>
>>
>>
>>
>> On Wed, Feb 9, 2022 at 7:04 PM <daniel.daugherty at oracle.com 
>> <mailto:daniel.daugherty at oracle.com>> wrote:
>>
>>     On 2/9/22 1:34 AM, Thomas Stüfe wrote:
>>>     Hi,
>>>
>>>     I wonder if I understand this correctly:
>>>
>>>     - Monitors are only deflated by the MonitorDeflationThread, if I
>>>     ignore races on inflation
>>
>>     ObjectSynchronizer::deflate_idle_monitors() is called by:
>>
>>     - the MonitorDeflationThread in its work loop
>>     - the VMThread at VM exit time via
>>     ObjectSynchronizer::do_final_audit_and_print_stats()
>>     - the VMThread via VM_ThreadDump::doit() when locked_monitors are
>>     requested in the dump
>>
>>
>>>     - The MonitorDeflationThread processes them in intervals of 250ms
>>>     (AsyncDeflationInterval),
>>
>>     The MonitorDeflationThread wakes up every 250ms to check whether there
>>     is work to do, unless a deflation is requested, in which case it wakes
>>     up after the notify.
>>
>>
>>>     but only if there are more than 90 (MonitorUsedDeflationThreshold)
>>>     monitors in use,
>>
>>     Not a count of 90; it is a percentage. Here's the option description:
>>
>>        product(intx, MonitorUsedDeflationThreshold, 90, DIAGNOSTIC,         \
>>                "Percentage of used monitors before triggering deflation (0 is "  \
>>                "off). The check is performed on GuaranteedSafepointInterval "    \
>>                "or AsyncDeflationInterval.")                                     \
>>                range(0, 100)                                                     \
>>
>>     So if the used monitor count is > 90% of the "ceiling", we trigger a
>>     deflation. The initial ceiling estimate is based on a per-thread
>>     estimate, but if the in-use list has had a maximum value greater than
>>     that estimate, then we use the list maximum value.
>>
>>     See src/hotspot/share/runtime/synchronizer.cpp:
>>     monitors_used_above_threshold().
>>
>>
>>>     and only in bulks of 1mio (MonitorDeflationMax).
>>
>>     Not quite in bulks of 1 million. Here's the option description:
>>
>>        product(intx, MonitorDeflationMax, 1000000, DIAGNOSTIC,              \
>>                "The maximum number of monitors to deflate, unlink and delete "   \
>>                "at one time (minimum is 1024).")                                 \
>>                range(1024, max_jint)
>>
>>     so we'll deflate up to MonitorDeflationMax at one time. If there's
>>     only 1 to deflate, then we'll do 1.
>>
>>
>>>
>>>     Is there any upward bound preventing the creation of
>>>     ObjectMonitors, since creation races with this one thread, and we
>>>     also cap its ability to process more than 1mio?
>>
>>     No, there is no upward bound. When safepoint cleanup used to deflate
>>     idle monitors, that was a "natural" limit on the creation rate, by
>>     virtue of the safepoint pausing the JavaThreads that were creating
>>     inflated ObjectMonitors while the VMThread deflated all the monitors
>>     created during the previous interval.
>>
>>
>>>     The background of my question is that I try to get a feel for the
>>>     maximum number of live monitors.
>>
>>     There is no maximum number. If you have a badly behaving program that
>>     simply creates inflated ObjectMonitors as fast as it can, then you'll
>>     eventually bring the VM to its knees. As I've mentioned before, I've
>>     had to change the way that my ObjectMonitor stress programs work
>>     because they would create more inflated ObjectMonitors than could be
>>     dealt with by the MonitorDeflationThread.
>>
>>
>>>     When running renaissance philosophers benchmark I see that the VM
>>>     has, depending on machine speed and optimization level, > 2.5
>>>     million live ObjectMonitors, taking > 400MB heap space. Is this by
>>>     design?
>>
>>     Sounds like a perfect use for a lower MonitorUsedDeflationThreshold
>>     value, and possibly a smaller AsyncDeflationInterval value, but I
>>     suspect that lowering MonitorUsedDeflationThreshold would be "enough".
>>     Of course, that's just a blind guess on my part.
>>
>>     Is this behavior by the renaissance philosophers benchmark considered
>>     a bug or a feature?
>>
>>>
>>>     The annoying thing is that these peaks in C-heap usage can be
>>>     sticky, depending on libc implementation. The glibc especially has
>>>     poor reclaim behavior. Also, it is hidden in NMT under
>>>     "mtInternal", which is why I propose to at least give it its own
>>>     category (https://bugs.openjdk.java.net/browse/JDK-8281450
>>>     <https://bugs.openjdk.java.net/browse/JDK-8281450>).
>>
>>     I think you mean:
>>
>>          JDK-8281460 Let ObjectMonitor have its own NMT category
>>     https://bugs.openjdk.java.net/browse/JDK-8281460
>>     <https://bugs.openjdk.java.net/browse/JDK-8281460>
>>
>>     instead of:
>>
>>          JDK-8281450 Remove unnecessary operator new and delete from
>>     ObjectMonitor
>>     https://bugs.openjdk.java.net/browse/JDK-8281450
>>     <https://bugs.openjdk.java.net/browse/JDK-8281450>
>>
>>     Dan
>>
>>
>>
>>>
>>>     Thanks, Thomas
>>>
>>>     On Mon, Jan 31, 2022 at 11:59 PM <daniel.daugherty at oracle.com
>>>     <mailto:daniel.daugherty at oracle.com>> wrote:
>>>
>>>         Greetings,
>>>
>>>         I'm going to try and add some historical context here...
>>>
>>>
>>>         On 1/31/22 3:09 AM, Thomas Stüfe wrote:
>>>         > Thanks a lot for your answers, David.
>>>
>>>         Yes David, thanks for jumping in on this thread.
>>>
>>>
>>>         > On Mon, Jan 31, 2022 at 8:35 AM David Holmes
>>>         <david.holmes at oracle.com <mailto:david.holmes at oracle.com>>
>>>         > wrote:
>>>         >
>>>         >> On 31/01/2022 3:54 pm, Thomas Stüfe wrote:
>>>         >>> Hi David,
>>>         >>>
>>>         >>> Thank you for the answer!
>>>         >>>
>>>         >>> On Mon, Jan 31, 2022 at 6:23 AM David Holmes
>>>         <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
>>>         >>> <mailto:david.holmes at oracle.com
>>>         <mailto:david.holmes at oracle.com>>> wrote:
>>>         >>>
>>>         >>>      Hi Thomas,
>>>         >>>
>>>         >>>      On 31/01/2022 2:32 pm, Thomas Stüfe wrote:
>>>         >>>       > Hi,
>>>         >>>       >
>>>         >>>       > I have a small question about a detail of 
>>> JDK-8253064.
>>>         >>>       >
>>>         >>>       > IIUC, before this patch, the VM kept thread-local
>>>         freelists of
>>>         >>>       > pre-allocated ObjectMonitors to reduce allocation
>>>         contention. Now we just
>>>         >>>       > malloc monitors right away.
>>>         >>>       >
>>>         >>>       > I looked through the issue and the associated PR,
>>>         but could find no
>>>         >>>       > information on why this was done. Dan describes
>>>         what he did very well:
>>>         >>>       >
>>> https://github.com/openjdk/jdk/pull/642#issuecomment-720753946
>>> <https://github.com/openjdk/jdk/pull/642#issuecomment-720753946>,
>>>         but not
>>>         >>>       > why.
>>>
>>>         Thank you and I'm sorry that my words on 8253064 did not
>>>         explain the why.
>>>
>>>
>>>         >>>       > I assume that the complexity and memory overhead
>>>         of the free lists was not
>>>         >>>       > worth it? That you found that malloc() is on our
>>>         platforms "uncontented"
>>>         >>>       > enough?
>>>         >>>
>>>         >>>      The issue was not about freelists and contention it
>>>         was about requiring
>>>         >>>      type-stable-memory: that once a piece of memory was
>>>         allocated as an
>>>         >>>      ObjectMonitor it remained forever after an
>>>         ObjectMonitor. This allowed
>>>         >>>      for various race conditions in the old monitor code
>>>         maintaining safety.
>>>
>>>         Erik Osterlund did have some specific reasons for getting rid
>>>         of type-stable-memory. At least one reason was that it was
>>>         complicating the work he wanted to do on:
>>>
>>>              JDK-8247281 migrate ObjectMonitor::_object to OopStorage
>>>         https://bugs.openjdk.java.net/browse/JDK-8247281
>>> <https://bugs.openjdk.java.net/browse/JDK-8247281>
>>>
>>>         He and I worked together to split this work into the various
>>>         pieces that were integrated separately. So TSM didn't prevent
>>>         work on JDK-8247281, but it did make it more difficult.
>>>
>>>         Erik may have had other reasons for getting rid of TSM. I've
>>>         added him to this email thread directly so he has a better
>>>         chance of seeing this query.
>>>
>>>
>>>
>>>         >>>      Over time that code changed substantially and the
>>>         need for
>>>         >>>      type-stable-memory for ObjectMonitors disappeared, so
>>>         we finally got rid
>>>         >>>      of it and just moved to a direct allocation scheme.
>>>         >>>
>>>         >>>
>>>         >>> I think I understand, but I was specifically concerned
>>>         with the question
>>>         >>> of allocation contention of ObjectMonitors. That is
>>>         somewhat independent
>>>         >>> from the question of where OMs are allocated.
>>>         >>>
>>>         >>> Can it happen that lock inflation happens clustered, or
>>>         does that not
>>>         >>> occur in reality?
>>>         >>>
>>>         >>> AFAIU the old code managed OM storage itself, used 
>>> global data
>>>         >>> structures to do so, and guarded access with a mutex. To
>>>         reduce
>>>         >>> contention, it used a surprisingly large thread-local
>>>         freelist of 1024
>>>         >>> entries. This looks like contention was once a real 
>>> problem.
>>>
>>>         We need to take a step back and look at how we got where we
>>>         are. There
>>>         are (at least) four different evolutions or seismic events to
>>>         this
>>>         subsystem:
>>>
>>>         1) type-stable-memory and the global lists (block and free)
>>>            - JDK-5030359 "Back-end" synchronization improvements for
>>>         1.5.1 or 1.6
>>>         https://bugs.openjdk.java.net/browse/JDK-5030359
>>> <https://bugs.openjdk.java.net/browse/JDK-5030359>
>>>            - integrated in jdk-6+50 (partial), jdk-6+53 (finished)
>>>
>>>            - This integration introduced type-stable-memory (TSM) and
>>>         global block
>>>              and free-lists. Dice chose to not use malloc and use TSM
>>>         for a few
>>>              reasons:
>>>              - direct control of the placement of the ObjectMonitor on
>>>         a cache-line
>>>                boundary; couldn't trust malloc to keep memory properly
>>>         aligned.
>>>              - malloc was slower than direct management of blocks of
>>>         ObjectMonitors.
>>>              - Dice did lots of benchmarking to prove the new code was
>>>         faster in
>>>                things we cared about at the time on the platforms we
>>>         cared about.
>>>            - This predates my joining the Runtime team so I don't have
>>>         exhaustive
>>>              records on this work.
>>>
>>>         2a) per-thread free list
>>>            - JDK-6468954 Generic synchronization cleanups for J2SE 1.7
>>>         https://bugs.openjdk.java.net/browse/JDK-6468954
>>> <https://bugs.openjdk.java.net/browse/JDK-6468954>
>>>            - integrated in jdk-7+12
>>>
>>>            - This integration introduced the per-thread free list.
>>>         This addition
>>>              was to deal with contention with allocating an
>>>         ObjectMonitor from
>>>              the global free-list.
>>>            - Again, Dice did targeted benchmarking to show that adding
>>>         a per-thread
>>>              free list  performed better for some specific
>>>         platforms/configs.
>>>            - This predates my joining the Runtime team so I don't have
>>>         exhaustive
>>>              records on this work.
>>>
>>>         2b) per-thread in-use list
>>>            - JDK-6852873 Increase in delta between application stopped
>>>         time and
>>>         ParNew GC time over application lifetime
>>>         https://bugs.openjdk.java.net/browse/JDK-6852873
>>> <https://bugs.openjdk.java.net/browse/JDK-6852873>
>>>            - integrated in jdk-7+99
>>>
>>>            - This integration added the MonitorInUseLists concept and
>>>         was done
>>>              as part of some work to deal with a safepoint performance
>>>         problem.
>>>            - MonitorInUseLists was disabled by default.
>>>            - The benchmarking done here was focused on safepointing.
>>>            - I joined the Runtime team in Jan 2010, but I don't have
>>>         exhaustive
>>>              records on this work.
>>>
>>>         2c) global in-use list
>>>            - JDK-6964164 +MonitorInUseLists can cause leakage of
>>>         contended objects
>>>         https://bugs.openjdk.java.net/browse/JDK-6964164
>>> <https://bugs.openjdk.java.net/browse/JDK-6964164>
>>>            - integrated in jdk-7+102
>>>
>>>            - This integration added the global in-use list since
>>>         having just a
>>>              global block-list and global free-list had some races
>>>         that caused
>>>              leaks.
>>>            - I joined the Runtime team in Jan 2010, but I don't have
>>>         exhaustive
>>>              records of this work.
>>>
>>>         2d) MonitorInUseLists on by default
>>>            - JDK-8149442 MonitorInUseLists should be on by default,
>>>         deflate idle
>>>         monitors taking too long
>>>         https://bugs.openjdk.java.net/browse/JDK-8149442
>>> <https://bugs.openjdk.java.net/browse/JDK-8149442>
>>>            - integrated in jdk-9+120
>>>
>>>            - This default switch flip was done to improve the
>>>         safepoint time
>>>              it takes to deflate idle monitors. The benchmarks done
>>>         were again
>>>              safepoint focused.
>>>
>>>         3) async monitor deflation
>>>             - JDK-8153224 Monitor deflation prolong safepoints
>>>         https://bugs.openjdk.java.net/browse/JDK-8153224
>>> <https://bugs.openjdk.java.net/browse/JDK-8153224>
>>>             - This evolution switched to async deflation and lock-free
>>>         lists and
>>>               was very complicated.
>>>             -
>>> https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation
>>> <https://wiki.openjdk.java.net/display/HotSpot/Async+Monitor+Deflation>
>>>             - integrated in jdk-15+26
>>>
>>>             - Again the focus was on reducing safepoint time. Lots of
>>>         benchmarking
>>>               done with the perf group. We got faster safepoints, but
>>>         had to really
>>>         fight to avoid regressions in artificially contended benchmarks,
>>>         which is where the lock-free lists stuff came from.
>>>
>>>         4) monitor list simplifications and getting rid of TSM
>>>             - JDK-8253064 monitor list simplifications and getting rid
>>>         of TSM
>>>         https://bugs.openjdk.java.net/browse/JDK-8253064
>>> <https://bugs.openjdk.java.net/browse/JDK-8253064>
>>>             - integrated in jdk-16+24
>>>
>>>             - Not much to add here about 8253064. We wanted to get rid
>>>         of TSM to
>>>               make other work simpler. Getting rid of TSM allowed us
>>>         to switch to
>>>               much simpler list management logic. So we got rid of a
>>>         lot of the
>>>               logic that we added with async monitor deflation 
>>> (8153224).
>>>             - Lots of benchmarking done with the perf group and those
>>>         pre-integration
>>>               benchmarks showed no worrisome regressions and some
>>>         speedups.
>>>             - Later benchmarking showed some regressions (that have
>>>               since been fixed) and we never figured out why we had
>>>               different results later compared to the pre-integration
>>>               benchmarks. One theory is that we see different results
>>>               on AMD-X64 versus Intel-X64.
>>>
>>>
>>>         >> You can always create a benchmark to show contention in the
>>>         monitor
>>>         >> inflation code. I don't recall now whether this was a real
>>>         issue or a
>>>         >> microbenchmark issue. As the code stated:
>>>         >>
>>>         >> ObjectMonitor * ATTR ObjectSynchronizer::omAlloc (Thread *
>>>         Self) {
>>>         >>       // A large MAXPRIVATE value reduces both list lock
>>>         contention
>>>         >>       // and list coherency traffic, but also tends to
>>>         increase the
>>>         >>       // number of objectMonitors in circulation as well as
>>>         the STW
>>>         >>       // scavenge costs.  As usual, we lean toward time in
>>>         space-time
>>>         >>       // tradeoffs.
>>>         >>
>>>         >>       const int MAXPRIVATE = 1024 ;
>>>         >>
>>>         >> so general performance was a consideration.
>>>         >>
>>>         >>> OTOH the new code just uses malloc, which also may lock
>>>         depending on the
>>>         >>> malloc allocator internals and the used libc settings.
>>>         Therefore I
>>>         >>> wonder whether OM allocation is still a problem, not a
>>>         problem with
>>>         >>> real-life malloc, or maybe never really had been a problem
>>>         and the old
>>>         >>> code was just overly cautious?
>>>         >> Whenever we make significant changes to a subsystem we 
>>> always
>>>         >> investigate the performance profile of the changes. We're
>>>         prepared to
>>>         >> accept some performance loss if we have a good improvement
>>>         in code
>>>         >> complexity/maintainability etc, but if a significant
>>>         performance issue
>>>         >> arose we would revisit it. See for example discussion in:
>>>         >>
>>>         >> https://bugs.openjdk.java.net/browse/JDK-8263864
>>> <https://bugs.openjdk.java.net/browse/JDK-8263864>
>>>         >>
>>>         >> and related.
>>>
>>>         We have not seen any performance regressions that we can
>>>         attribute to malloc lock contention. In fact, I had to put upper
>>>         limits on my personal ObjectMonitor inflation stress programs
>>>         because they could allocate inflated ObjectMonitors so fast that
>>>         the MonitorDeflationThread could not keep up. You'll notice that
>>>         the deflation code now stops its current loop at 1 million
>>>         objects and takes a check-for-a-safepoint breather. So I would
>>>         say that malloc() of ObjectMonitor is no longer a performance
>>>         issue. I believe Dice said that it was, back in the JDK6 days,
>>>         but no longer...
>>>
>>>
>>>         Hopefully some of this history stuff is useful.
>>>
>>>         Dan
>>>
>>>
>>>
>>>         >>
>>>         >> Cheers,
>>>         >> David
>>>         >> -----
>>>         >>
>>>         >>> Thanks, Thomas
>>>         >>>
>>>         >>>
>>>
>>



More information about the hotspot-runtime-dev mailing list