RFR: 8340490: Shenandoah: Optimize ShenandoahPacer

Aleksey Shipilev shade at openjdk.org
Fri Sep 20 18:31:56 UTC 2024


On Thu, 19 Sep 2024 23:32:14 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:

> In a simple latency benchmark for memory allocation, I found ShenandoahPacer contributed quite a lot to the long tail latency > 10ms. When multiple mutator threads fail at the fast path to claim budget [here](https://github.com/openjdk/jdk/blob/fdc16a373459cb2311316448c765b1bee5c73694/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L230), all of them will forcefully claim and then wait for up to 10ms ([code link](https://github.com/openjdk/jdk/blob/fdc16a373459cb2311316448c765b1bee5c73694/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L239-L277)).
> 
> The change in this PR makes ShenandoahPacer impact long tail latency much less: instead of forcefully claiming budget and then waiting, it attempts to claim after waiting for 1ms, and keeps doing this until either 1/ it has spent 10ms waiting in total, or 2/ it has successfully claimed the budget.
> 
> Here is the latency comparison for the optimization:
> ![hdr-histogram-optimize-pacer](https://github.com/user-attachments/assets/811f48c5-87eb-462d-8b27-d15bd08be7b0)
> 
> With the optimization, long tail latency from the test code below has improved considerably, from over 20ms to ~10ms on macOS with an M3 chip:
> 
>     static final int threadCount = Runtime.getRuntime().availableProcessors();
>     static final LongAdder totalCount = new LongAdder();
>     static volatile byte[] sink;
>     public static void main(String[] args) {
>         runAllocationTest(100000);
>     }
>     static void recordTimeToAllocate(final int dataSize, final Histogram histogram) {
>         long startTime = System.nanoTime();
>         sink = new byte[dataSize];
>         long endTime = System.nanoTime();
>         histogram.recordValue(endTime - startTime);
>     }
> 
>     static void runAllocationTest(final int dataSize) {
>         final long endTime = System.currentTimeMillis() + 30_000;
>         final CountDownLatch startSignal = new CountDownLatch(1);
>         final CountDownLatch finished = new CountDownLatch(threadCount);
>         final Thread[] threads = new Thread[threadCount];
>         final Histogram[] histograms = new Histogram[threadCount];
>         final Histogram totalHistogram = new Histogram(3600000000000L, 3);
>         for (int i = 0; i < threadCount; i++) {
>             final var histogram = new Histogram(3600000000000L, 3);
>             histograms[i] = histogram;
>             threads[i] = new Thread(() -> {
>                 wait(startSignal);
>                 do {
>                     recordTimeToAllocate(dataSize, histogram);
>                 } while (System.currentTimeMillis() < e...
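
The retry-in-slices scheme described in the PR summary can be sketched roughly as follows. This is an illustrative Java stand-in, not the actual ShenandoahPacer code: `budget`, `tryClaim`, and `forceClaim` are hypothetical names for the C++ internals.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the slice-based pacing described above: instead of
// force-claiming up front and then sleeping for up to 10ms, retry the
// fast-path claim after each 1ms wait until the total time budget is
// spent, and only then force the claim.
class PacerSketch {
    final AtomicLong budget = new AtomicLong(0);

    boolean tryClaim(long tax) {
        long cur;
        do {
            cur = budget.get();
            if (cur < tax) {
                return false;          // not enough budget: fast path fails
            }
        } while (!budget.compareAndSet(cur, cur - tax));
        return true;
    }

    void forceClaim(long tax) {
        budget.addAndGet(-tax);        // may go negative: mutators outpace GC
    }

    void paceForAlloc(long tax, long maxDelayMs) {
        long deadline = System.nanoTime() + maxDelayMs * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (tryClaim(tax)) {
                return;                // claimed within the time budget
            }
            try {
                Thread.sleep(1);       // wait one 1ms slice, then retry
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        forceClaim(tax);               // give up waiting; force the claim
    }
}
```

Each thread now gets up to `maxDelayMs / 1` claim attempts instead of one, which is where the tail-latency win comes from.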

I am trying to remember why we even bothered to go into negative budget on this path, and then waited for it to recover. I think it is from here: https://mail.openjdk.org/pipermail/shenandoah-dev/2018-April/005559.html. AFAICS, the intent for that fix was to make sure that unsuccessful pacing claims the budget, which this patch also does. And given it apparently improves performance, I don't mind it going in.

Comprehension question: the actual improvement comes from waiting in 1ms slices, not from anything else? In retrospect, _it is_ silly to wait until the deadline before attempting to claim the pacing budget.

I am good with this, assuming performance runs show good results.

src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp line 191:

> 189:   _need_notify_waiters.try_set();
> 190: }
> 191: template<bool FORCE>

Newline before `template`, please.

src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp line 206:

> 204:     }
> 205:     new_val = cur - tax;
> 206:   } while (Atomic::load(&_budget) == cur &&

I don't think we need this load, since we have _just_ done another load nearby. That should be enough to resolve the contention issues the TTAS pattern tries to avoid.
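
For context, the test-and-test-and-set (TTAS) shape being discussed looks roughly like this (Java stand-in for the C++ claim loop; names are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative TTAS-style claim loop. The plain read of `cur` already
// serves as the cheap "test" before the expensive CAS; re-loading the
// field a second time right before the CAS, as in the quoted diff,
// adds nothing.
class TtasClaim {
    final AtomicLong budget = new AtomicLong();

    boolean claim(long tax) {
        long cur = budget.get();                    // test: cheap load first
        while (cur >= tax) {
            if (budget.compareAndSet(cur, cur - tax)) {  // then test-and-set
                return true;
            }
            cur = budget.get();                     // reload only after CAS fails
        }
        return false;
    }
}
```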

src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp line 256:

> 254:   double total_delay = 0;
> 255: 
> 256:   double start = os::elapsedTime();

While we are here, let's avoid some integer divisions and floating-point math. Try to rewrite this using `jlong os::elapsed_counter()`, which returns integer nanoseconds? Do the math in `jlong`-s.
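
The same idea in Java terms, purely for illustration (`System.nanoTime()` playing the role of `jlong os::elapsed_counter()`): keep the deadline arithmetic in integer nanoseconds and convert to floating point once, only when reporting.

```java
// Sketch: do the waiting-loop math in long nanoseconds; convert to
// double seconds a single time at the end. This keeps per-iteration
// work to integer compares, with no FP math or divisions in the loop.
class ElapsedTimer {
    static final long NANOS_PER_SEC = 1_000_000_000L;
    final long start = System.nanoTime();

    boolean withinBudget(long maxDelayNanos) {
        return (System.nanoTime() - start) < maxDelayNanos;  // integer compare
    }

    double elapsedSeconds() {
        // single long-to-double conversion, only when reporting
        return (double) (System.nanoTime() - start) / NANOS_PER_SEC;
    }
}
```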

src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp line 257:

> 255: 
> 256:   double start = os::elapsedTime();
> 257:   while (!claimed) {

I suggest we common some exit paths by writing the loop like this:


double start_time = os::elapsedTime();
while (!claimed && (os::elapsedTime() - start_time) < max_delay) {
  // We could instead assist GC, but this would suffice for now.
  wait(1);
  claimed = claim_for_alloc<false>(words);
}
if (!claimed) {
  // Spent local time budget to wait for enough GC progress.
  // Force allocating anyway, which may mean we outpace GC,
  // and start Degenerated GC cycle.
  claimed = claim_for_alloc<true>(words);
  assert(claimed, "Should always succeed");
}
ShenandoahThreadLocalData::add_paced_time(JavaThread::current(), os::elapsedTime() - start_time);

src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp line 265:

> 263:     // and start Degenerated GC cycle.
> 264:     claimed = claim_for_alloc<true>(words);
> 265:     assert(claimed, "Should always succeed");

Come to think of it, we don't need to check the return value here. We don't check it in the other place where we call `claim_for_alloc<true>(words);`.

src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp line 267:

> 265:     assert(claimed, "Should always succeed");
> 266:   }
> 267:   ShenandoahThreadLocalData::add_paced_time(JavaThread::current(), (double)(os::elapsed_counter() - start_time) / (double) NANOSECS_PER_SEC);

We already have `current` (`JavaThread::current()`) in scope here, use that :)
I also think the second cast, `(double) NANOSECS_PER_SEC`, is redundant.

-------------

PR Review: https://git.openjdk.org/jdk/pull/21099#pullrequestreview-2317722311
Marked as reviewed by shade (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/21099#pullrequestreview-2318988155
PR Review Comment: https://git.openjdk.org/jdk/pull/21099#discussion_r1768272092
PR Review Comment: https://git.openjdk.org/jdk/pull/21099#discussion_r1768271671
PR Review Comment: https://git.openjdk.org/jdk/pull/21099#discussion_r1768281644
PR Review Comment: https://git.openjdk.org/jdk/pull/21099#discussion_r1768291970
PR Review Comment: https://git.openjdk.org/jdk/pull/21099#discussion_r1769025976
PR Review Comment: https://git.openjdk.org/jdk/pull/21099#discussion_r1769027052


More information about the hotspot-gc-dev mailing list