RFR: 8339114: DaCapo xalan performance with -XX:+UseObjectMonitorTable [v3]

Fri Mar 28 09:20:38 UTC 2025

On Wed, 26 Mar 2025 07:54:43 GMT, Roman Kennke <rkennke at openjdk.org> wrote:

>> - When successfully looked up an OM in the OMCache, then make it avaiable in the BasicLock cache. Use that cache whenever possible.
>> - When successfully looked up an OM in the OMCache, then don't push-back the OM on that cache to avoid shuffling the cache on each monitor enter.
>> - In the runtime path of monitorexit, attempt to use the BasicLock cache, then the OMCache, and only look up in the table when the caches failed.
>> - Some small code shufflings.
>> 
>> I did 50 runs of xalan, each of which do 10 warmup iterations before doing one measurement. The following results are the averages of the measurement iterations across the 50 runs:
>> without-OMT: 773.3 ms
>> with-OMT: 797.28 ms
>> 
>> That is still a regression of ~3%
>
> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Include objectMonitor.inline.hpp

The exit and hashcode changes seem good, I see no issue there.

The enter changes require more thought and testing I think.

# First the caches:

There are two fundamental differences between the two caches. 

First the thread stack cache (BasicLock) is assumed to be guarded from use after free by only containing a monitor if it is locked by the thread. So deflation cannot occur. There is a small window where it exists because C2, due to register pressure, saves a potential value to the stack. Currently we never read this value in the slow path, and the CacheSetter overwrites it with the monitor we actually lock on or nullptr in the case of fast locking.

Second exit assumes that whatever is in the cache is what is locked on. So if we do not do inflated locking in enter for whatever reason, it must be cleared. 

We must also verify that all paths that create a BasicLock also clears the memory, so we do not read garbage. And that no platform relies on the CacheSetter, I do not think that is the case. I see you fixed the C++ constructions with a constructor. But need to look at (interpreter, c1, c2, jvmci) x (all supported platforms). I think the CacheSetter in quick_enter was added specifically for x86_32 which used to share the C2 code with x86_64. However after [JDK-8338383](https://bugs.openjdk.org/browse/JDK-8338383) / #21565 x86_32 C2 support was disabled due to 64bit CAS requirements. So this might work, and the x86_32 is deprecated and being actively removed, but I think it would be good to have the clear the thread stack cache (BasicLock) for NOT_LP64. 

# Second the spinning:

`try_enter` -> `spin_enter`

This changes the monitor spinning quite dramatically. For virtual threads the number of `try_spin` calls increase from 1 -> 2 (+100%). And for non virtual threads the number of `try_spin` calls increase from 2 -> 3 (+50%). This seems significant enough to warrant looking at a large suite of benchmarks and see how they behave. And remember multiple calls to `try_spin` affect both the total spin duration and the velocity of the adaptive spinning change.

# Performance evaluation

I spent some time yesterday trying to understand what in this patch is important for performance. Xalan seems to be a workload which wants to spend its time in `try_spin`. When playing around with the patch, the `try_enter` -> `spin_enter` change seems to be the most important change, at least in our (Oracle) java performance system. But have not exhaustively tested all changes. And the behavior in our java performance system and what I see on my virtualized sky lake xeon based cloud instance differs quite a bit. 

I suspect it boils down to the sort of "effective" scheduling of our successor windows (times when there is a successor set). Which avoids exit slow paths and re-locking.

This made me do some experiments with the successor and evaluate the performance.

Explanation of the different versions
* Base: b891bfa7e67c21478475642e2bfa2cdc65a3bffe
  * Mainline commit used as a baseline (the current PRs base)
* PR: fc5062a48767646414f9fd7e57a9370c766bda3e
  * The PR branch as of fc5062a: Include objectMonitor.inline.hpp
  * Personal instance:
    * Reduced regression by 27% with table (-XX:+UseObjectMonitorTable)
    * 16% regression without table (-XX:-UseObjectMonitorTable)
  * Oracle Java Performance System:
    * Reduced regression by 73% with table (-XX:+UseObjectMonitorTable) on Linux x64
    * Regression gone with table (-XX:+UseObjectMonitorTable) on Linux aarch64
      * Not significant, needs more runs.
    * Run without table (-XX:-UseObjectMonitorTable) is still in the queue.
* V3: https://github.com/xmas92/jdk/tree/JDK-8339114-review-v3
  * This is just my review comments / cleanups on the PR. 
  * Has the same performance characteristics as the PR.
* V4: https://github.com/xmas92/jdk/tree/JDK-8339114-review-v4
  * Adhoc experiment with successor. Only clear the successor after the final spin, before parking.
  * Most likely incorrect / contains holes, but should only lead to stranding. (Looking at GHA there seem to be some stranding occurring). Though it is an interesting idea to evaluate.
  * Personal instance:
    * Similar performance as the PR / V3 branch
  * Oracle Java Performance System:
    * Regression gone with table (-XX:+UseObjectMonitorTable) on both Linux aarch64 and x64
      * Not significant, needs more runs.
* V5: https://github.com/xmas92/jdk/tree/JDK-8339114-review-v5
  * Instead of only clearing before the parking. It clears the successor more often, every 256 iteration of the `try_spin` spin loop. The idea is to instead of prologing the successor window, create more small windows where the successor is clear so that other threads that may be earlier in their spinning can take over the successor. 
  * This is probably correct, there might be that a fence is needed after clearing the successor inside the loop. The reason that it is not there is because the spinner will always either get the lock or retry setting the successor, which if it succeeds will lead to a fence later, or another thread have become the successor, and that thread will fence.
  * Personal instance:
    * Marginal improvements over V4.
  * Oracle Java Performance System:
    * Is still in the queue.
* V6: https://github.com/xmas92/jdk/tree/JDK-8339114-review-v6
  * Based on V4, but it sets the successor unconditionally in `try_spin` which may overwrite another successor. The idea is similar to V5 but instead of sometimes clearing, we simply allow for as long as a thread is in `try_spin` there is a successor, and reducing the number of clear successors (and fences) required.
  * The branch has stranding issues because it is based on V4. Will probably rerun this experiment but without the V4 changes to the successor.
  * Personal instance:
    * Marginal improvements over V4.
  * Oracle Java Performance System:
    * Is still in the queue.
* V7: https://github.com/xmas92/jdk/tree/JDK-8339114-review-v7
  * Based on V6. But now clearing the successor is done with a CAS, and only if it succeeds do the thread fence.
  * The branch has stranding issues because it is based on V4.
  * Combining this with V6 is probably a bad idea, because all threads are writing to this cache line, and at the same time some thread is trying to CAS. It is probably more effective do just do the load check store fence, and sometimes overwrite another threads successor claiming. (The idea was to not allow any time slice where there are no successor if there is a thread inside, but seems not worth the tradeoff).
  * Personal instance:
    * No improvement over V3, (effectively negates the improvements gained by V4+V6)
  * Oracle Java Performance System:
    * Is still in the queue.

# Conclusion

I think parts of this PR could go in. But the enter changes should be held of on until we understand the impact on other benchmarks (even if we can just use `try_enter` over `spin_enter` for -XX:-UseObjectMonitorTable to avoid the -XX:-UseObjectMonitorTable regressions).

I also think there is some room for improvements with this CacheSetter and C2. Using the thread stack cache is (currently) only relevant for C2. If keep track in the runtime that we came from a context where this thread stack cache is preloaded with what is in the thread local stack, or nullptr. We can optimize away a lot of the CacheSetter logic. Have only be something we do if we retry due to deflation or if we come from a another context. From other contexts clear the thread stack cache is required and setting the thread stack cache is beneficial due to OSR. But it may not be worth setting the the thread local cache unless we are locking from C2, because on C2 uses this cache for lookups, and C2 is what we want to be fast. So limit the OMCache to only C2 locked monitors is probably beneficial. 

I will update the this post with the rest of the benchmark results as they finished, maybe something I do next week. Similarly I might experiment some more with a better version of V6. 

Cheers.

# Benchmarks results

## On Personal Intel cloud instance:

* Difference vs Base Without Table (-XX:-UseObjectMonitorTable) b891bfa7e67c21478475642e2bfa2cdc65a3bffe
  * With Table (-XX:+UseObjectMonitorTable)
    * Columns
      * `Name               Cnt     Base    Error      Test    Error Unit  Change (* = significant)`
    * Base b891bfa7e67c21478475642e2bfa2cdc65a3bffe
      * `DaCapo-xalan-large  20 1111.150 ± 49.733  1474.200 ± 50.614   ms   0.75x (p = 0.000*)`
    * PR branch fc5062a48767646414f9fd7e57a9370c766bda3e
      * `DaCapo-xalan-large  20 1111.150 ± 49.733  1407.650 ± 37.815   ms   0.79x (p = 0.000*)`
    * v3 branch b18c83a8118d58ef615405de494234971336878d
      * `DaCapo-xalan-large  20 1111.150 ± 49.733  1414.650 ± 50.783   ms   0.79x (p = 0.000*)`
    * v4 branch ec91413b81a410cc9f00920f6987c5c3ca4d1ed2
      * `DaCapo-xalan-large  20 1111.150 ± 49.733  1423.000 ± 39.226   ms   0.78x (p = 0.000*)`
    * v5 branch 19340a0f8930fa916154f2f7549faedcf1f20f02
      * `DaCapo-xalan-large  20 1111.150 ± 49.733  1405.600 ± 78.450   ms   0.79x (p = 0.000*)`
    * v6 branch 9a332cb090fb5797907052533f41556dd7cdeb94
      * `DaCapo-xalan-large  20 1111.150 ± 49.733  1371.750 ± 36.835   ms   0.81x (p = 0.000*)`
    * v7 branch 13297b13bd2a9e89e1a51c7a4e27ad39928a8b00
      * `DaCapo-xalan-large  20 1111.150 ± 49.733  1474.700 ± 72.045   ms   0.75x (p = 0.000*)`
  * Without Table (-XX:-UseObjectMonitorTable)
    * Columns
      * `Name               Cnt     Base    Error      Test    Error Unit  Change (* = significant)`
    * PR branch fc5062a48767646414f9fd7e57a9370c766bda3e
      * `DaCapo-xalan-large  20 1111.150 ± 49.733  1324.650 ± 46.528   ms   0.84x (p = 0.000*)` 
    * v3 branch b18c83a8118d58ef615405de494234971336878d
      * `DaCapo-xalan-large  20 1111.150 ± 49.733  1322.600 ± 46.727   ms   0.84x (p = 0.000*)`
    * v4 branch ec91413b81a410cc9f00920f6987c5c3ca4d1ed2
      * `Not run yet`
    * v5 branch 19340a0f8930fa916154f2f7549faedcf1f20f02
      * `DaCapo-xalan-large  20 1111.150 ± 49.733  1309.450 ± 31.553   ms   0.85x (p = 0.000*)`
    * v6 branch 9a332cb090fb5797907052533f41556dd7cdeb94
      * `Not run yet`
    * v7 branch 13297b13bd2a9e89e1a51c7a4e27ad39928a8b00
      * `Not run yet`

## In Oracle Java Performance System:

### Linux x64
* Difference vs Base Without Table (-XX:-UseObjectMonitorTable) b891bfa7e67c21478475642e2bfa2cdc65a3bffe
  * With Table (-XX:+UseObjectMonitorTable)
    * Columns
      * `Name                Cnt    Base    Error      Test    Error Unit  Change (* = significant)`
    * Base b891bfa7e67c21478475642e2bfa2cdc65a3bffe
      * `DaCapo-xalan-large   20 1172.50 ±  39.45   1344.15 ±  83.49   ms   0.75x (p = 0.000*)`
      * `DaCapo23-xalan-large  5 9763.60 ± 479.44  10803.20 ± 864.69   ms   0.89x (p = 0.0553)`
    * PR branch fc5062a48767646414f9fd7e57a9370c766bda3e
      * `DaCapo-xalan-large   20 1172.50 ±  39.45   1218.40 ±  64.03   ms   0.96x (p = 0.0103)`
      * `DaCapo23-xalan-large  5 9763.60 ± 479.44  10182.40 ± 391.79   ms   0.96x (p = 0.1703)`
    * v3 branch b18c83a8118d58ef615405de494234971336878d
      * `DaCapo-xalan-large   20 1172.50 ±  39.45   1235.75 ±  60.57   ms   0.95x (p = 0.0004*)`
      * `DaCapo23-xalan-large  5 9763.60 ± 479.44  10111.40 ± 441.37   ms   0.96x (p = 0.2671)`
    * v4 branch ec91413b81a410cc9f00920f6987c5c3ca4d1ed2
      * `DaCapo-xalan-large   20 1172.50 ±  39.45   1161.15 ±  47.45   ms   1.01x (p = 0.4161)`
      * `DaCapo23-xalan-large  5 9763.60 ± 479.44   9812.00 ± 419.23   ms   1.00x (p = 0.8694)`
    * v5 branch 19340a0f8930fa916154f2f7549faedcf1f20f02
      * `In queue`
    * v6 branch 9a332cb090fb5797907052533f41556dd7cdeb94
      * `In queue`
    * v7 branch 13297b13bd2a9e89e1a51c7a4e27ad39928a8b00
      * `In queue`
  * Without Table (-XX:-UseObjectMonitorTable)
    * Columns
      * `Name                Cnt     Base    Error      Test    Error Unit  Change (* = significant)`
    * PR branch fc5062a48767646414f9fd7e57a9370c766bda3e
      * `In queue`

### Linux aarch64

* Difference vs Base Without Table (-XX:-UseObjectMonitorTable) b891bfa7e67c21478475642e2bfa2cdc65a3bffe
  * With Table (-XX:+UseObjectMonitorTable)
    * Columns
      * `Name                Cnt     Base    Error      Test    Error Unit  Change (* = significant)`
    * Base b891bfa7e67c21478475642e2bfa2cdc65a3bffe
      * `DaCapo-xalan-large   20  1999.55 ±  81.08   2176.85 ± 100.63   ms   0.91x (p = 0.000*)`
      * `DaCapo23-xalan-large  5 15595.20 ± 524.15  15326.60 ± 442.64   ms   1.02x (p = 0.4075)`
    * PR branch fc5062a48767646414f9fd7e57a9370c766bda3e
      * `DaCapo-xalan-large   20  1999.55 ±  81.08   2003.25 ±  83.89   ms   1.00x (p = 0.8880)`
      * `DaCapo23-xalan-large  5 15595.20 ± 524.15  15960.80 ± 826.91   ms   0.98x (p = 0.4322)`
    * v3 branch b18c83a8118d58ef615405de494234971336878d
      * `DaCapo-xalan-large   20  1999.55 ±  81.08   2020.60 ±  70.67   ms   0.99x (p = 0.3870)`
      * `DaCapo23-xalan-large  5 15595.20 ± 524.15  15382.00 ± 385.40   ms   1.01x (p = 0.4864)`
    * v4 branch ec91413b81a410cc9f00920f6987c5c3ca4d1ed2
      * `DaCapo-xalan-large   20  1999.55 ±  81.08   1987.55 ±  73.91   ms   1.01x (p = 0.6276)`
      * `DaCapo23-xalan-large  5 15595.20 ± 524.15  15601.00 ± 996.23   ms   1.00x (p = 0.9912)`
    * v5 branch 19340a0f8930fa916154f2f7549faedcf1f20f02
      * `In queue`
    * v6 branch 9a332cb090fb5797907052533f41556dd7cdeb94
      * `In queue`
    * v7 branch 13297b13bd2a9e89e1a51c7a4e27ad39928a8b00
      * `In queue`
  * Without Table (-XX:-UseObjectMonitorTable)
    * Columns
      * `Name                Cnt     Base    Error      Test    Error Unit  Change (* = significant)`
    * PR branch fc5062a48767646414f9fd7e57a9370c766bda3e
      * `In queue`

-------------

Changes requested by aboldtch (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/24098#pullrequestreview-2724780944