RFR: 8320318: ObjectMonitor Responsible thread

Fredrik Bredberg fbredberg at openjdk.org
Thu Sep 12 09:36:05 UTC 2024


On Thu, 12 Sep 2024 08:33:12 GMT, Martin Doerr <mdoerr at openjdk.org> wrote:

>> I've done basic testing on ppc64le, riscv64 and s390x using QEMU, but would appreciate if @TheRealMDoerr, @RealFYang and @offamitkumar could take it for a real test drive.
>
>> > @fbredber, @dholmes-ora: I got a substantial performance drop on our 96 Thread Xeon server:
>> 
>> What OS for the Xeon? We have only seen issues with Windows.
> 
> Sorry, I forgot to mention that it's linux (SUSE Linux Enterprise Server 15 SP4).

@TheRealMDoerr, @dholmes-ora 
> Works with `micro:LockUnlock` on real PPC64 hardware, too. However, we need to run more tests and also check performance. Please note that this PR has conflicts with other changes (#20922 and recent developments in the loom repo).

Good that it works on real PPC64 hardware, but please run more tests. I'll sync with loom, and make sure to resolve any conflicts before integrating.

> The JBS issue refers to "memory barriers (not a fence)", but you're using `StoreLoad` barriers which are nothing else than a "fence". I don't agree with the general statement that they have become significantly cheaper. That may be true for single chip designs, but not for large server systems (multi-socket). Did you run benchmarks which stress monitors on any large multi-socket system?

I've run a substantial number of the performance tests available on our performance site. This PR has shown a significant performance increase on several tests and platforms (Windows being the exception, but that is handled as a separate [issue](https://bugs.openjdk.org/browse/JDK-8339730)). As an example, the DaCapo-xalan-large test showed a 36% performance improvement on Linux-aarch64.

I asked our performance team what kind of systems they run on, and was told that they do run multi-socket systems, though probably not what you would call large.

> I got a substantial performance drop on our 96 Thread Xeon server: `LockUnlock.testContendedLock` seems to be less than half as fast as without this patch. Also, some of the `LockUnlock.testInflated*` seem to be affected. (Large PPC64 servers as well.) Can you reproduce this on your side?

Please note that I've changed `LockUnlock.testContendedLock` from `@Threads(2)` to `@Threads(3)`, which might be the reason for your substantial performance drop. I made this change because it increased code coverage, exercising all(?) of the corner cases in ObjectMonitor locking.
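For illustration, here's a minimal, hypothetical sketch (not the actual JMH benchmark in `test/micro`) of what the `@Threads(3)` change makes the benchmark exercise: three threads contending on a single Java monitor, which forces inflation to an ObjectMonitor. Class and method names are mine, not from the JDK sources:

```java
// Hypothetical sketch of three-thread monitor contention, loosely
// mirroring the scenario LockUnlock.testContendedLock covers with
// @Threads(3). Not the real benchmark.
public class ContendedLockSketch {
    private static final Object LOCK = new Object();
    private static long counter;

    // Run three threads that each take the contended lock 'iterations' times.
    public static long run(int iterations) {
        counter = 0;
        Runnable work = () -> {
            for (int i = 0; i < iterations; i++) {
                synchronized (LOCK) {   // contention inflates the lock to an ObjectMonitor
                    counter++;
                }
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work), t3 = new Thread(work);
        t1.start(); t2.start(); t3.start();
        try {
            t1.join(); t2.join(); t3.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(run(100_000)); // 300000: increments serialized by the monitor
    }
}
```

With only two threads, one thread can often reacquire the lock before the other parks, so some enter/exit paths are never taken; a third contender makes the parking and wake-up paths much more likely to execute.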

I can see how an added `StoreLoad` barrier will decrease performance in certain micro benchmarks. Then again, it's only there if you have an inflated monitor (i.e. you are experiencing contended locking). In a real-world application where you inflate, park and unpark, one added `StoreLoad` might not change the overall performance that much, which is probably why we don't see any real regression when we run our performance tests (like DaCapo, Renaissance, SPECjvm, etc.).
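To illustrate what such a fence buys on a monitor's exit path, here is a hedged sketch in plain Java using `VarHandle.fullFence()` as the StoreLoad barrier. The field and method names are hypothetical, not the actual HotSpot `ObjectMonitor` code; the point is only the store-fence-load shape, which prevents the exiting thread from missing a waiter that enqueued itself right after ownership was released:

```java
import java.lang.invoke.VarHandle;

// Hypothetical sketch of an exit-path handshake that motivates a
// StoreLoad fence. Not the actual ObjectMonitor implementation.
public class StoreLoadSketch {
    volatile Thread owner;       // stand-in for the monitor's owner field
    volatile Thread queueHead;   // stand-in for the monitor's entry queue

    // Simplified unlock: release ownership, then re-check for waiters.
    // Without a StoreLoad barrier, the load of queueHead could be
    // reordered before the store to owner, and a newly arrived waiter
    // could be missed (never woken).
    boolean exitAndCheckSuccessor() {
        owner = null;             // store: the monitor is now free
        VarHandle.fullFence();    // StoreLoad: order the store above before the load below
        return queueHead != null; // load: is anyone waiting to be woken?
    }

    public static void main(String[] args) {
        StoreLoadSketch m = new StoreLoadSketch();
        System.out.println(m.exitAndCheckSuccessor()); // false: no waiter enqueued
        m.queueHead = Thread.currentThread();
        System.out.println(m.exitAndCheckSuccessor()); // true: a waiter is visible
    }
}
```

On x86 a `fullFence` typically compiles down to a single `lock`-prefixed instruction or `mfence`, which is cheap on one socket but can get costlier as cross-socket coherence traffic grows, which is consistent with the multi-socket concern raised above.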

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19454#issuecomment-2345755769


More information about the hotspot-dev mailing list