RFR: 8350285: Regression caused by ShenandoahLock under extreme contention

Wed Feb 19 21:23:54 UTC 2025

On Wed, 19 Feb 2025 15:58:01 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:

> We have noticed a significant regression in at-safepoint time caused by recent changes to ShenandoahLock, specifically this [PR](https://github.com/openjdk/jdk/pull/19570). A local reproducer was written for the issue; here is a comparison of the top N at-safepoint times in `ns`:
> 
> Tip:
> 
> 94069776
> 50993550
> 49321667
> 33903446
> 32291313
> 30587810
> 27759958
> 25080997
> 24657404
> 23874338
> 
> Tip + reverting [PR](https://github.com/openjdk/jdk/pull/19570):
> 
> 58428998
> 44410618
> 30788370
> 20636942
> 15986465
> 15307468
> 9686426
> 9432094
> 7473938
> 6854014
> 
> Note: command line for the test:
> 
> java -Xms256m -Xmx256m -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC -XX:-ShenandoahPacing  -XX:-UseTLAB -Xlog:gc -Xlog:safepoint ~/Alloc.java | grep -Po "At safepoint: \d+ ns" | grep -Po "\d+" | sort -nr
> 
> 
> With further digging, we found that the real problem is that having more runnable threads after the [PR](https://github.com/openjdk/jdk/pull/19570) makes it take longer for VM_Thread to call `futex(FUTEX_WAKE_PRIVATE)` to disarm the wait barrier when leaving the safepoint. Fixing the issue in VM_Thread would benefit other GCs as well, but it is more complicated (see the details here: https://bugs.openjdk.org/browse/JDK-8350324).
> With some tweaks in ShenandoahLock, we can mitigate the regression caused by the [PR](https://github.com/openjdk/jdk/pull/19570) and also improve the long tail of at-safepoint time by more than 10x. Here is the result from the same test with the changes in this PR:
> 
> 
> 1890706
> 1222180
> 1042758
> 853157
> 792057
> 785697
> 780627
> 757817
> 740607
> 736646
> 725727
> 725596
> 724106
> 
> 
> ### Other tests
> - [x] `make test TEST=hotspot_gc_shenandoah`
> - [x] Tier 2

Yielding 5x for every 1 nanosleep seems a bit "arbitrary".  I assume you found that the number 5 delivered the "best performance" compared to other numbers you might have chosen.  I wonder if different architectures with different numbers of cores, different operating systems, and/or different test applications that have different numbers of runnable threads would also perform best with this same magic number 5.

Could we at least add a comment explaining how/why we chose 5 here?
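
To make the question concrete, here is a minimal standalone sketch of a contended-lock loop that yields 5 times for every 1 sleep. It is not the ShenandoahLock code from the PR: the names `BackoffLockSketch`, `kYieldsPerSleep`, and `run_contended_increments`, the 1 ns sleep duration, and the use of `std::this_thread` instead of HotSpot's own primitives are all assumptions made for illustration. The constant is where I would expect the explanatory comment to live.

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical sketch only; this is NOT the HotSpot ShenandoahLock code.
class BackoffLockSketch {
public:
  // Assumed tunable mirroring the 5:1 yield-to-sleep ratio under discussion.
  // Nothing in this sketch shows that 5 is the best value on every platform.
  static constexpr int kYieldsPerSleep = 5;

  void lock() {
    int yields = 0;
    // exchange() returns the previous value: true means someone else holds it.
    while (locked_.exchange(true, std::memory_order_acquire)) {
      if (yields < kYieldsPerSleep) {
        // Cheap backoff: stay runnable, let another thread use the CPU.
        std::this_thread::yield();
        ++yields;
      } else {
        // Heavier backoff: leave the run queue briefly, then start over.
        // The 1 ns duration is an assumption, not taken from the PR.
        std::this_thread::sleep_for(std::chrono::nanoseconds(1));
        yields = 0;
      }
    }
  }

  void unlock() { locked_.store(false, std::memory_order_release); }

private:
  std::atomic<bool> locked_{false};
};

// Hypothetical driver: several threads increment a plain counter under the lock.
long run_contended_increments(int thread_count, int iterations) {
  BackoffLockSketch lock;
  long counter = 0;
  std::vector<std::thread> workers;
  for (int t = 0; t < thread_count; ++t) {
    workers.emplace_back([&] {
      for (int i = 0; i < iterations; ++i) {
        lock.lock();
        ++counter;  // protected by the lock
        lock.unlock();
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return counter;
}
```

With the ratio isolated in one named constant like this, it would be straightforward to rerun the reproducer with other values on other core counts and operating systems and record the outcome next to the constant.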

-------------

PR Review: https://git.openjdk.org/jdk/pull/23701#pullrequestreview-2628009211