RFR: 8331411: Shenandoah: Reconsider spinning duration in ShenandoahLock [v4]
Aleksey Shipilev
shade at openjdk.org
Tue Jun 25 07:19:14 UTC 2024
On Tue, 25 Jun 2024 02:07:20 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:
>> ### Notes
>> While doing CAS to acquire the lock, the original implementation sleeps/yields once after spinning 0xFFF times, and repeats this until the lock is acquired, i.e. a ```(N spins + sleep/yield)``` loop. Based on test results, doing more spins seems to result in worse performance, so we decided to change the algorithm to ```(N spins) + (yield loop)```, and to block the thread immediately if a safepoint is pending. We still need to determine the best value of N for the spins, so I tested multiple possible values: 0, 0x01, 0x7, 0xF, 0x1F, 0x3F, 0x7F, 0xFF, and compared the results with the baseline data (the original implementation).
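>>
>> Roughly, the new shape looks like this (an illustrative sketch only, not the exact patch; the block-for-safepoint path for Java threads is omitted):
>>
>> int ctr = os::is_MP() ? 0x1F : 0;   // spin budget N; no spinning on a single CPU
>> while (Atomic::load(&_state) == locked ||   // TTAS: test before CAS
>>        Atomic::cmpxchg(&_state, unlocked, locked) != unlocked) {
>>   if (ctr > 0 && !SafepointSynchronize::is_synchronizing()) {
>>     ctr--;                          // burn the spin budget first
>>     SpinPause();
>>   } else {
>>     os::naked_yield();              // then fall back to yielding
>>   }
>> }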
>>
>> #### Test code
>>
>> public class Alloc {
>>     static final int THREADS = 1280; // 32 threads per CPU core, 40 cores
>>     static final Object[] sinks = new Object[64 * THREADS];
>>     static volatile boolean start;
>>     static volatile boolean stop;
>>
>>     public static void main(String... args) throws Throwable {
>>         for (int t = 0; t < THREADS; t++) {
>>             int ft = t;
>>             new Thread(() -> work(ft * 64)).start();
>>         }
>>
>>         Thread.sleep(1000);
>>         start = true;
>>         Thread.sleep(30_000);
>>         stop = true;
>>     }
>>
>>     public static void work(int idx) {
>>         while (!start) { Thread.onSpinWait(); }
>>         while (!stop) {
>>             sinks[idx] = new byte[128];
>>         }
>>     }
>> }
>>
>>
>> Run it like this and observe TTSP times:
>>
>>
>> java -Xms256m -Xmx256m -XX:+UseShenandoahGC -XX:-UseTLAB -Xlog:gc -Xlog:safepoint Alloc.java
>>
>>
>> #### Metrics from tests (TTSP, allocation rate)
>> ##### Heavy contention (1280 threads, 32 per CPU core)
>> | Test | SP polls | Average TTSP | 2% TRIMMEAN | MAX | MIN |
>> | -------- | -------- | ------------ | ----------- | -------- | ----- |
>> | baseline | 18 | 3882361 | 3882361 | 43310117 | 49197 |
>> | 0x00 | 168 | 861677 | 589036 | 46937732 | 44005 |
>> | 0x01 | 164 | 627056 | 572697 | 10004767 | 55472 |
>> | 0x07 | 163 | 650578 | 625329 | 5312631 | 53734 |
>> | 0x0F | 164 | 590398 | 557325 | 6481761 | 56794 |
>> | 0x1F | 144 | 814400 | 790089 | 5024881 | 56041 |
>> | 0x3F | 137 | 830288 | 801192 | 5533538 | 54982 |
>> | 0x7F | 132 | 1101625 | 845626 | 35425614 | 57492 |
>> | 0xFF | 125 | 1005433 | 970988 | 6193342 | 54362 |
>>
>>
>> ##### Light contention (40 threads, 1 per CPU core)
>> | Spins | SP polls | Average TTSP | 2% T...
>
> Xiaolong Peng has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix linux-x64-hs-zero build failure
Good work! I have only stylistic nits.
src/hotspot/share/gc/shenandoah/shenandoahLock.cpp line 37:
> 35: void ShenandoahLock::contended_lock(bool allow_block_for_safepoint) {
> 36: Thread* thread = Thread::current();
> 37: if (thread->is_Java_thread() && allow_block_for_safepoint) {
Please flip `&&` back. It is easier to short-circuit the test after a boolean test we know is likely to constant-fold.
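For illustration, the flipped condition would read (same identifiers as in the quoted code):

if (allow_block_for_safepoint && thread->is_Java_thread()) {
  // ... Java-thread path that may block for a safepoint ...
}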
src/hotspot/share/gc/shenandoah/shenandoahLock.cpp line 49:
> 47: // Spin this much on multi-processor, do not spin on single-processor.
> 48: int ctr = os::is_MP() ? 0x1F : 0;
> 49: // Apply TTAS to avoid more expenseive CAS calls if the lock is still held by other thread.
`expenseive` -> `expensive`
src/hotspot/share/gc/shenandoah/shenandoahLock.cpp line 51:
> 49: // Apply TTAS to avoid more expenseive CAS calls if the lock is still held by other thread.
> 50: while (Atomic::load(&_state) == locked ||
> 51: Atomic::cmpxchg(&_state, unlocked, locked) != unlocked) {
Please indent it as the old code: `Atomic::` should align.
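For reference, aligned it would look like:

while (Atomic::load(&_state) == locked ||
       Atomic::cmpxchg(&_state, unlocked, locked) != unlocked) {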
src/hotspot/share/gc/shenandoah/shenandoahLock.cpp line 71:
> 69: // VM thread to arm the poll sooner.
> 70: while (SafepointSynchronize::is_synchronizing() &&
> 71: !SafepointMechanism::local_poll_armed(java_thread)) {
Same: indent so that `SafepointSynchronize::` and `!SafepointMechanism` align.
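Likewise:

while (SafepointSynchronize::is_synchronizing() &&
       !SafepointMechanism::local_poll_armed(java_thread)) {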
src/hotspot/share/gc/shenandoah/shenandoahLock.hpp line 59:
> 57: if (Atomic::cmpxchg(&_state, unlocked, locked) != unlocked) {
> 58: contended_lock(allow_block_for_safepoint);
> 59: }
I am thinking we can increase the chances for inlining if we collapse both branches like:

// Dive into the contended locking handling if there is a safepoint pending and we can
// block the thread, or if the fast-path locking has failed.
if ((allow_block_for_safepoint && SafepointSynchronize::is_synchronizing()) ||
    (Atomic::cmpxchg(&_state, unlocked, locked) != unlocked)) {
  contended_lock(allow_block_for_safepoint);
}
-------------
PR Review: https://git.openjdk.org/jdk/pull/19570#pullrequestreview-2137588815
PR Review Comment: https://git.openjdk.org/jdk/pull/19570#discussion_r1652118100
PR Review Comment: https://git.openjdk.org/jdk/pull/19570#discussion_r1652118951
PR Review Comment: https://git.openjdk.org/jdk/pull/19570#discussion_r1652119621
PR Review Comment: https://git.openjdk.org/jdk/pull/19570#discussion_r1652120574
PR Review Comment: https://git.openjdk.org/jdk/pull/19570#discussion_r1652128318