RFR: 8340490: Shenandoah: Optimize ShenandoahPacer
Xiaolong Peng
xpeng at openjdk.org
Fri Sep 20 18:31:55 UTC 2024
In a simple memory-allocation latency benchmark, I found that ShenandoahPacer contributes quite a lot to long tail latencies above 10ms: when multiple mutator threads fail to claim budget on the fast path [here](https://github.com/openjdk/jdk/blob/fdc16a373459cb2311316448c765b1bee5c73694/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L230), all of them forcefully claim budget and then wait for up to 10ms ([code link](https://github.com/openjdk/jdk/blob/fdc16a373459cb2311316448c765b1bee5c73694/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L239-L277)).
The change in this PR greatly reduces ShenandoahPacer's impact on long tail latency: instead of forcefully claiming budget and then waiting, it waits for 1ms and then attempts the claim again, repeating until it has either 1/ spent 10ms waiting in total, or 2/ successfully claimed the budget.
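Roughly, the new pacing loop looks like the sketch below. This is a simplified illustration rather than the exact patch; `claim_for_alloc()`, `wait()`, `ShenandoahPacingMaxDelay` and `os::elapsedTime()` come from the existing pacer code, but the exact signatures and bookkeeping in the PR may differ:

```c++
void ShenandoahPacer::pace_for_alloc(size_t words) {
  // Fast path: try to claim the budget without any waiting.
  if (claim_for_alloc(words, false)) {
    return;
  }

  // Slow path: rather than forcefully claiming the budget up front and then
  // sleeping for the full delay, wait in short ~1ms slices and retry the
  // normal claim after each slice.
  const double start = os::elapsedTime();
  const double max_delay = ShenandoahPacingMaxDelay / 1000.0; // flag is in ms
  while (true) {
    wait(1); // block for ~1ms on the pacer monitor
    if (claim_for_alloc(words, false)) {
      return; // claimed the budget within the delay allowance
    }
    if (os::elapsedTime() - start > max_delay) {
      // Total delay budget (10ms by default) exhausted: forcefully claim
      // so the mutator can proceed, possibly driving the budget negative.
      claim_for_alloc(words, true);
      return;
    }
  }
}
```

The point is that a thread only falls back to the forceful claim after the whole delay allowance is spent, instead of claiming forcefully up front and then sleeping unconditionally.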
Here is the latency comparison for the optimization:

With the optimization, the long tail latency of the test code below improves from over 20ms to ~10ms on macOS with an M3 chip:
```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.LongAdder;

import org.HdrHistogram.Histogram;

// Class wrapper and imports added to make the benchmark self-contained;
// Histogram is org.HdrHistogram.Histogram.
public class AllocationLatencyTest {

    static final int threadCount = Runtime.getRuntime().availableProcessors();
    static final LongAdder totalCount = new LongAdder();
    static volatile byte[] sink;

    public static void main(String[] args) {
        runAllocationTest(100000);
    }

    // Time a single allocation and record it in the per-thread histogram.
    static void recordTimeToAllocate(final int dataSize, final Histogram histogram) {
        long startTime = System.nanoTime();
        sink = new byte[dataSize];
        long endTime = System.nanoTime();
        histogram.recordValue(endTime - startTime);
    }

    // Allocate continuously from all threads for 30 seconds, then merge and
    // print the latency percentile distribution.
    static void runAllocationTest(final int dataSize) {
        final long endTime = System.currentTimeMillis() + 30_000;
        final CountDownLatch startSignal = new CountDownLatch(1);
        final CountDownLatch finished = new CountDownLatch(threadCount);
        final Thread[] threads = new Thread[threadCount];
        final Histogram[] histograms = new Histogram[threadCount];
        final Histogram totalHistogram = new Histogram(3600000000000L, 3);
        for (int i = 0; i < threadCount; i++) {
            final var histogram = new Histogram(3600000000000L, 3);
            histograms[i] = histogram;
            threads[i] = new Thread(() -> {
                wait(startSignal);
                do {
                    recordTimeToAllocate(dataSize, histogram);
                } while (System.currentTimeMillis() < endTime);
                finished.countDown();
            });
            threads[i].start();
        }
        startSignal.countDown(); // Start the test
        wait(finished);
        for (Histogram histogram : histograms) {
            totalHistogram.add(histogram);
        }
        totalHistogram.outputPercentileDistribution(System.out, 1000.0);
    }

    public static void wait(final CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```
### Additional test
- [x] MacOS AArch64 server fastdebug, hotspot_gc_shenandoah
-------------
Commit messages:
- use const
- refactor
- Clean code
- try claim_for_alloc before calculating total_delay
- try claim_for_alloc before calculating total_delay
- clean up
- 8340490: Shenandoah: Optimize ShenandoahPacer
Changes: https://git.openjdk.org/jdk/pull/21099/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21099&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8340490
Stats: 41 lines in 3 files changed: 8 ins; 16 del; 17 mod
Patch: https://git.openjdk.org/jdk/pull/21099.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/21099/head:pull/21099
PR: https://git.openjdk.org/jdk/pull/21099