RFR: 8340547: Starting many threads can delay safepoints [v3]
Oli Gillespie
ogillespie at openjdk.org
Mon Sep 23 12:37:36 UTC 2024
On Mon, 23 Sep 2024 11:28:10 GMT, Oli Gillespie <ogillespie at openjdk.org> wrote:
>> Mitigate the impact of JVM_StartThread on safepoint synchronization, by adding a new ThreadStart_lock which limits the number of JVM_StartThread invocations competing for Threads_lock at any given time to 1.
>> This gives a VM thread trying to call a safepoint a much better chance of acquiring Threads_lock when there are many JVM_StartThread invocations in flight, at the cost of one extra lock/unlock for every new thread.
>>
>> Can be disabled with new diagnostic flag `-XX:-UseExtraThreadStartLock`.
>>
>> Before (ThreadStartTtsp.java is shared in JDK-8340547):
>>
>> java -Xlog:safepoint ThreadStartTtsp.java | grep -o 'Reaching safepoint: [0-9]* ns'
>> Reaching safepoint: 1291591 ns
>> Reaching safepoint: 59962 ns
>> Reaching safepoint: 1958065 ns
>> Reaching safepoint: 14456666258 ns <-- 14 seconds!
>> ...
>>
>>
>> After:
>>
>> java -Xlog:safepoint ThreadStartTtsp.java | grep -o 'Reaching safepoint: [0-9]* ns'
>> Reaching safepoint: 214269 ns
>> Reaching safepoint: 60253 ns
>> Reaching safepoint: 2040680 ns
>> Reaching safepoint: 3089284 ns
>> Reaching safepoint: 2998303 ns
>> Reaching safepoint: 4433713 ns <-- 4.4ms
>> Reaching safepoint: 3368436 ns
>> Reaching safepoint: 2986519 ns
>> Reaching safepoint: 3269102 ns
>> ...
>>
>>
>>
>> **Alternatives**
>>
>> I considered some other options for mitigating this. For example, could we reduce the time spent holding the lock in StartThread? Most of the time is spent managing the threads list for ThreadSMR support, and each time we add a thread to that list we need to copy the whole list and free every entry in the original, which is slow. But I didn't see an easy way to avoid this.
>> I also looked at some kind of signal from the VM thread that it is ready to start synchronizing that StartThread could check before trying to grab Threads_lock, but I didn't find anything better than this extra lock.
>
> Oli Gillespie has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix lock ranking
With a slightly modified version of my reproducer (removing the GC triggers):
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadLocalRandom;
public class ThreadStartTtsp {
static int NUM_THREADS = 1_000;
public static void main(String[] args) throws InterruptedException {
Semaphore s = new Semaphore(NUM_THREADS);
// Start lots of threads at the same time to cause contention on freelist
for (int i = 0; i < NUM_THREADS; i++) {
new Thread(() -> {
while (true) {
try { s.acquire(); } catch (Exception e) { throw new RuntimeException(e); }
new Thread().start();
}
}).start();
}
// Periodically 'trigger' the thread creation
for (int i = 0; i < 100; i++) {
s.release(NUM_THREADS);
long start = System.nanoTime();
while (s.availablePermits() != 0) { }
System.out.printf("Started %d threads in %.3fms%n", NUM_THREADS, (System.nanoTime() - start)/1_000_000.0);
}
}
}
```
I see higher throughput with the new lock enabled compared to disabled.
```
java -XX:+UnlockDiagnosticVMOptions -XX:-UseThreadStartLock ThreadStartTtsp.java
...
Started 1000 threads in 134.874ms
Started 1000 threads in 134.799ms
Started 1000 threads in 132.662ms
...
./build/release/images/jdk/bin/java -XX:+UnlockDiagnosticVMOptions -XX:+UseThreadStartLock ThreadStartTtsp.java
...
Started 1000 threads in 89.469ms
Started 1000 threads in 90.030ms
Started 1000 threads in 89.855ms
...
I think that's because of reduced contention with `JavaThread::exit`, not sure if it's a fair comparison.
Probably fairer is this:
import java.util.concurrent.atomic.*;
public class ThreadStartTtsp {
static int NUM_THREADS = 1_00;
static AtomicInteger started = new AtomicInteger(0);
static long start = System.nanoTime();
public static void main(String[] args) throws InterruptedException {
for (int i = 0; i < NUM_THREADS; i++) {
new Thread(() -> {
while (true) {
new Thread(() -> {
if (started.incrementAndGet() % 10_000 == 0) {
System.out.printf("Started %d threads in %.3fms%n", 10_000, (System.nanoTime() - start)/1_000_000.0);
start = System.nanoTime();
}
}).start();
}
}).start();
}
}
}
For which I still see about 4% improvement from adding the new lock, and the gap seems to widen as concurrency increases.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/21111#issuecomment-2368086158
More information about the hotspot-dev
mailing list