RFR: 8318986: Improve GenericWaitBarrier performance [v3]
Robbin Ehn
rehn at openjdk.org
Tue Nov 7 09:09:35 UTC 2023
On Thu, 2 Nov 2023 21:00:16 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
>> See the symptoms, reproducer and analysis in the bug.
>>
>> Current code waits on `disarm()`, which effectively stalls leaving the safepoint if some threads lag behind. Having more runnable threads than CPUs nearly guarantees that we would wait for quite some time, but it also reproduces well if you have enough threads near the CPU count. Just waiting at `arm()` is insufficient, but we can have several `Semaphores` to do what we want.
>>
>> This PR implements a more efficient `GenericWaitBarrier` to recover the performance. Most of the implementation discussion is in the code comments. The key observation that drives this work is that we want to reuse `Semaphore` and related counters without being stuck waiting for threads to leave.
>>
>> (AFAICS, futex-based `LinuxWaitBarrier` does roughly the same, but handles this reuse on futex side, by assigning the "address" per futex.)
>>
>> This issue affects everything except Linux. I initially found this on my M1 Mac, but pretty sure it is easy to reproduce on Windows as well. The safepoints from the reproducer in the bug improved dramatically on a Mac. Note not only the orders of magnitude better safepoint times, but also the several times more GC safepoints in the time-bound allocation test, which means the attainable GC throughput is similarly better, since we don't waste time at this wait barrier.
>>
>> 
>>
>> Additional testing:
>> - [x] MacOS AArch64 server fastdebug, `tier1`
>> - [x] Linux x86_64 server fastdebug, `tier1 tier2 tier3` (generic wait barrier enabled explicitly)
>> - [x] Linux AArch64 server fastdebug, `tier1 tier2 tier3` (generic wait barrier enabled explicitly)
>> - [x] MacOS AArch64 server fastdebug, `tier2 tier3`
>> - [x] Linux x86_64 server fastdebug, `tier4` (generic wait barrier enabled explicitly)
>> - [x] Linux AArch64 server fastdebug, `tier4` (generic wait barrier enabled explicitly)
>
> Aleksey Shipilev has updated the pull request incrementally with four additional commits since the last revision:
>
> - Touchups
> - More comments work
> - Tight up the comments
> - Rework to a single atomic counter per cell
Thanks for addressing this!
I had some comment.
src/hotspot/share/utilities/waitBarrier_generic.cpp line 120:
> 118: SpinYield sp;
> 119: while (Atomic::load_acquire(&_state) < -1) {
> 120: sp.wait();
A warning once would be helpful that cells might be too few.
src/hotspot/share/utilities/waitBarrier_generic.cpp line 151:
> 149: _sem.signal();
> 150:
> 151: if (wakeups++ > max) {
I would assume max = 2, would call signal() max 2 times ?
Here we end when wakeups are larger than max with post inc, so isn't that 4 times? (0->3)
src/hotspot/share/utilities/waitBarrier_generic.cpp line 163:
> 161: int s = Atomic::load_acquire(&_state);
> 162: assert(s > 0, "Mid disarm: Should be armed. State: %d", s);
> 163: if (Atomic::cmpxchg(&_state, s, -s) == s) {
When we hit this branch we have effectively left the outer loop.
I think it will read easier if this scope actually was outside the scope of the outer loop, no?
src/hotspot/share/utilities/waitBarrier_generic.cpp line 192:
> 190: return;
> 191: }
> 192: assert(s > 0, "Before wait: Should be armed. State: %d", s);
If we are context switched here until this cell is re-used ?
Hence we have the wrong barrier tag ?
For safepoint this can't happen since the safepoint id for this safepoint safe thread will be wrong.
Thus we can't re-use cells until this thread returns and change his safepoint id.
But it seems like this is what saves us, so if there was another use-case it could happen, no?
-------------
Changes requested by rehn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/16404#pullrequestreview-1717132378
PR Review Comment: https://git.openjdk.org/jdk/pull/16404#discussion_r1384571746
PR Review Comment: https://git.openjdk.org/jdk/pull/16404#discussion_r1384577576
PR Review Comment: https://git.openjdk.org/jdk/pull/16404#discussion_r1384580535
PR Review Comment: https://git.openjdk.org/jdk/pull/16404#discussion_r1384588931
More information about the hotspot-dev
mailing list