RFR: 8335356: Shenandoah: Improve concurrent cleanup locking

Aleksey Shipilev shade at openjdk.org
Wed Jul 10 09:02:21 UTC 2024


On Mon, 8 Jul 2024 21:38:41 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:

> Hi all,
>    This PR improves the usage of the heap lock in ShenandoahFreeSet::recycle_trash. The original observation mentioned in the [bug](https://bugs.openjdk.org/browse/JDK-8335356) was likely caused by an uncommitted/reverted change I added while Aleksey and I worked on [JDK-8331411](https://bugs.openjdk.org/browse/JDK-8331411). Even though that change was never committed, the way ShenandoahFreeSet::recycle_trash uses the heap lock is still inefficient, and we think it should be improved.
>     With the logs added in this commit https://github.com/openjdk/jdk/pull/20086/commits/5688ee2c0754483818a89bb7915f58a7464c2df2, I got some key metrics: the average time to acquire the heap lock is about 450~900 ns, and the average time to recycle one trash region is about 600 ns (if not batched, it is 1000+ ns, which might be related to branch prediction). The current implementation takes the heap lock once for every trash region; assuming there are 1000 regions to recycle, the time spent just acquiring the heap lock is more than 0.6 ms.
>     
>    The PR splits the recycling process into two steps: 1. filter out all the trash regions; 2. recycle the trash regions in batches. I can see some benefits which improve performance:
>     1. Less time spent acquiring the heap lock, and less contention with mutators/allocators
>     2. Simpler loops in filtering and batch recycling, which presumably benefit CPU branch prediction
>    
> Here are some logs from a test run of the h2 benchmark:
> 
> TIP with debug log, [code link](https://github.com/openjdk/jdk/compare/master...pengxiaolong:jdk:JDK-8335356-baseline?expand=1), Average time per region: 2312 ns
> 
> [6.013s][info][gc] GC(0) Recycled 0 regions in 58675ns, break down: acquiring lock -> 0, recycling -> 0.
> [6.093s][info][gc] GC(0) Recycled 641 regions in 3025757ns, break down: acquiring lock -> 260016, recycling -> 548345.
> [9.354s][info][gc] GC(1) Recycled 1 regions in 61793ns, break down: acquiring lock -> 481, recycling -> 1141.
> [9.428s][info][gc] GC(1) Recycled 600 regions in 1083206ns, break down: acquiring lock -> 256578, recycling -> 511334.
> [12.145s][info][gc] GC(2) Recycled 35 regions in 118390ns, break down: acquiring lock -> 13703, recycling -> 27438.
> [12.202s][info][gc] GC(2) Recycled 553 regions in 911747ns, break down: acquiring lock -> 209511, recycling -> 426575.
> [15.086s][info][gc] GC(3) Recycled 106 regions in 218396ns, break down: acquiring lock -> 39089, recycling -> 80520.
> [15.164s][info][gc] GC(3) Recycled 454 regions in 762128ns, break down: acquiring lock -> 172583, recycl...
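For illustration, here is a minimal standalone Java model of the two-step split described in the quoted PR. It is a sketch of the pattern only, not the actual patch: the real change is in the C++ ShenandoahFreeSet::recycle_trash, and Region, recycleTrash, the batch size of 32, and the ReentrantLock standing in for the Shenandoah heap lock are all hypothetical stand-ins.


import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class BatchedRecycleSketch {
        // Hypothetical stand-in for the Shenandoah heap lock.
        static final ReentrantLock HEAP_LOCK = new ReentrantLock();

        static class Region {
                boolean trash;
                boolean recycled;
        }

        static void recycleTrash(Region[] regions, int batchSize) {
                // Step 1: filter out the trash regions; no lock is held here.
                List<Region> trash = new ArrayList<>();
                for (Region r : regions) {
                        if (r.trash) {
                                trash.add(r);
                        }
                }

                // Step 2: recycle in batches, taking the lock once per batch
                // instead of once per region.
                int i = 0;
                while (i < trash.size()) {
                        HEAP_LOCK.lock();
                        try {
                                for (int b = 0; b < batchSize && i < trash.size(); b++, i++) {
                                        trash.get(i).recycled = true; // per-region recycle work
                                }
                        } finally {
                                HEAP_LOCK.unlock();
                        }
                }
        }

        public static void main(String... args) {
                Region[] regions = new Region[1000];
                for (int i = 0; i < regions.length; i++) {
                        regions[i] = new Region();
                        regions[i].trash = (i % 3 == 0);
                }
                recycleTrash(regions, 32); // batch size of 32 is illustrative
        }
}


The point of the batching is that the lock acquisition cost (about 450~900 ns per the measurements quoted above) is paid once per batch rather than once per region, and releasing the lock between batches keeps allocators from waiting behind the whole recycling pass.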

Found an easy workload to demonstrate the impact on max latencies on the allocation path.


public class TimedAlloc {
        // volatile sink keeps the allocations from being optimized away
        static volatile Object sink;

        public static void main(String... args) throws Throwable {
                for (int c = 0; c < 10; c++) {
                        run();
                }
        }

        public static void run() {
                long cur = System.nanoTime();
                long end = cur + 3_000_000_000L; // each run lasts 3 seconds

                long sum = 0;
                long max = 0;
                long allocs = 0;

                while (cur < end) {
                        // Time each array allocation (~40-80 KB, depending on
                        // reference size) individually; Max captures the
                        // worst-case stall on the allocation path.
                        long start = System.nanoTime();
                        sink = new Object[10000];
                        cur = System.nanoTime();
                        long v = (cur - start);
                        sum += v;
                        max = Math.max(max, v);
                        allocs++;
                }

                System.out.printf("Allocs: %15d; Avg: %8d, Max: %9d%n", allocs, (sum / allocs), max);
        }
}



$ java -Xms30g -Xmx30g -XX:+AlwaysPreTouch -XX:+UseShenandoahGC ../TimedAlloc.java

# Baseline
Allocs:         3303444; Avg:      869, Max:    572869
Allocs:         3331117; Avg:      864, Max:    543211
Allocs:         3368134; Avg:      854, Max:    504501
Allocs:         3369721; Avg:      854, Max:    380658
Allocs:         3370053; Avg:      854, Max:    471238
Allocs:         3370687; Avg:      854, Max:    409657
Allocs:         3370261; Avg:      854, Max:    371527
Allocs:         3370627; Avg:      854, Max:    452230
Allocs:         3369164; Avg:      854, Max:    473957
Allocs:         3370499; Avg:      854, Max:    450008

# Patched
Allocs:         3311185; Avg:      866, Max:     58489
Allocs:         3338693; Avg:      861, Max:     81620
Allocs:         3378208; Avg:      852, Max:     81913
Allocs:         3379441; Avg:      852, Max:     69214
Allocs:         3380807; Avg:      851, Max:    186430
Allocs:         3380362; Avg:      851, Max:     61498
Allocs:         3380142; Avg:      851, Max:     84395
Allocs:         3379811; Avg:      851, Max:     81258
Allocs:         3380572; Avg:      851, Max:     55142
Allocs:         3379696; Avg:      852, Max:     81612

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20086#issuecomment-2219949530

