RFR: 8339114: DaCapo xalan performance with -XX:+UseObjectMonitorTable [v3]

Fri Mar 28 16:28:26 UTC 2025

On Wed, 26 Mar 2025 07:54:43 GMT, Roman Kennke <rkennke at openjdk.org> wrote:

>> - When successfully looked up an OM in the OMCache, then make it avaiable in the BasicLock cache. Use that cache whenever possible.
>> - When successfully looked up an OM in the OMCache, then don't push-back the OM on that cache to avoid shuffling the cache on each monitor enter.
>> - In the runtime path of monitorexit, attempt to use the BasicLock cache, then the OMCache, and only look up in the table when the caches failed.
>> - Some small code shufflings.
>> 
>> I did 50 runs of xalan, each of which do 10 warmup iterations before doing one measurement. The following results are the averages of the measurement iterations across the 50 runs:
>> without-OMT: 773.3 ms
>> with-OMT: 797.28 ms
>> 
>> That is still a regression of ~3%
>
> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Include objectMonitor.inline.hpp

My testing also showed that the try_spin was the most beneficial of these changes to getting UseObjectMonitorTable performance mostly up to without UseObjectMonitorTable.

I've just started a Oracle internal benchmark run comparing mainline (base) with a version with try_spin instead of try_lock to see if that change has any adverse effects on any of the other benchmarks without UseObjectMonitorTable.  xalan does benefit from adaptive spinning, so it makes sense moving this extra level of spinning earlier.

The BasicLock cache and OMCache changes are really good, so we want these and the try_spin changes if they don't regress any benchmarks (will update with results).  That's this PR.

The cleanup to add read_caches() to encapsulate both BasicLock and OMCache is nice but I don't really like the new read_monitor().  LightweightSynchronizer::read_caches() can be private in lightweightSynchronizer.cpp and added to this change.  Not sure about the CacheSetter changes to Axel's changes.  And the successor changes should be in another PR.

I still think this PR is a good improvement, as is, pending try_spin performance testing.  The additional work should be done as a set of separate PRs.  1. Fix CacheSetter to be only used for C2 (possibly, but do we want more conditional code?) 2. Experiment with setting successor in ObjectMonitor::try_spin.  I did a little of this once and didn't have any good numbers to show for it, so I think it should be a new investigation. Your numbers are better for this.

Lastly, thank you Axel for running these performance tests. It looks like the performance for the PR change is quite good on the Oracle benchmarking machines, but less good on your personal cloud instance.  The performance tests that we've (Eric Caspole and I have) done show that the PR changes are mostly good with some expected variation that we always see with locking changes.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24098#issuecomment-2761849268