RFR: 8325553: Parallel: Use per-marker cache for marking stats during Full GC

Thomas Schatzl tschatzl at openjdk.org
Wed Feb 14 11:18:04 UTC 2024


On Fri, 9 Feb 2024 12:06:04 GMT, Albert Mingkun Yang <ayang at openjdk.org> wrote:

> Add a cache to avoid highly contended writes to the global marking stats.
> 
> Test: tier1-6; using the attached bm, a ~30% reduction in pause time is observed. Only a marginal improvement (probably just noise) is observed in BigRamTester.

Changes requested by tschatzl (Reviewer).

src/hotspot/share/gc/parallel/psCompactionManager.hpp line 114:

> 112:   OopTaskQueue*  oop_stack()       { return &_oop_stack; }
> 113: 
> 114:   class MarkingStatsCache : public CHeapObj<mtGC> {

A comment on the class describing its purpose and the reason for its existence would be nice.
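For illustration only (exact wording is of course up to you), something along these lines, based on my understanding of the change:

    // Per-marker (per-worker) cache for the global marking statistics
    // (live words per region) gathered during Full GC marking. Buffering
    // the updates here avoids contended writes to the shared
    // ParallelCompactData from all marking threads; the cached values
    // are flushed into the global data after marking finishes.
    class MarkingStatsCache : public CHeapObj<mtGC> {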

src/hotspot/share/gc/parallel/psCompactionManager.hpp line 145:

> 143:   public:
> 144:     void push(oop obj, size_t live_words) {
> 145:       ParallelCompactData& data = PSParallelCompact::summary_data();

Please move the definitions of all these methods into the .inline.hpp file where necessary. They clog up the .hpp file.
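I.e. keep only the declarations here (sketch using the names from the patch):

    class MarkingStatsCache : public CHeapObj<mtGC> {
      ...
    public:
      void push(oop obj, size_t live_words);
    };

and define them in psCompactionManager.inline.hpp:

    inline void ParCompactionManager::MarkingStatsCache::push(oop obj, size_t live_words) {
      ParallelCompactData& data = PSParallelCompact::summary_data();
      ...
    }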

src/hotspot/share/gc/parallel/psCompactionManager.hpp line 169:

> 167:       const size_t end_offset = data.region_offset(addr + live_words - 1);
> 168:       data.region(end_region_id)->set_partial_obj_size(end_offset + 1);
> 169:       data.region(end_region_id)->set_partial_obj_addr(addr);

I'm wondering whether not caching humongous object statistics would be interesting for the G1 mark stats cache as well: it avoids the code that later distributes the live bytes across the spanned regions, which is imo more complicated than just not caching the single update for the humongous object.
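In pseudo-code, with `is_humongous()`, `add_to_global_stats()` and `add_to_cache()` as hypothetical helpers, the idea would be something like:

    void add_live_words(oop obj, size_t live_words) {
      if (is_humongous(obj)) {
        // A single direct (atomic) update of the global statistics per
        // humongous object is cheap, and not caching it removes the need
        // to distribute the cached live bytes across the spanned regions
        // when flushing.
        add_to_global_stats(obj, live_words);
        return;
      }
      add_to_cache(obj, live_words);
    }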

src/hotspot/share/gc/parallel/psCompactionManager.hpp line 189:

> 187: 
> 188:   MarkingStatsCache* _marking_stats_cache;
> 189: public:

Suggestion:

  MarkingStatsCache* _marking_stats_cache;

public:

src/hotspot/share/gc/parallel/psCompactionManager.hpp line 281:

> 279:   }
> 280: 
> 281:   void destroy_marking_stats_cache() {

This should be named something like `flush_and_destroy...` at least, because `destroy` (which is very unusual for g1/parallel/serial code, but I'm seeing it a lot in ZGC code) implies to me just memory freeing, not the additional cache eviction work.

Actually, it is kind of unusual that `ParCompactionManager` contains anything that is freed and released every GC. Most data is static or pseudo-static (instance members that are never freed and recreated every GC, like `_deferred_obj_array`). I would prefer this change to keep that style for Parallel GC unless there is a compelling reason not to.
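For the rename, something like this (sketch; `flush()` is a hypothetical name for the cache's eviction step):

    void flush_and_destroy_marking_stats_cache() {
      _marking_stats_cache->flush();  // evict all cached entries into the global stats
      delete _marking_stats_cache;
      _marking_stats_cache = nullptr;
    }

or, keeping the cache pseudo-static like the other members, only a `flush_marking_stats_cache()` without the delete.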

src/hotspot/share/gc/parallel/psCompactionManager.hpp line 285:

> 283:     delete _marking_stats_cache;
> 284:     _marking_stats_cache = nullptr;
> 285:   }

These do not need to be in the .hpp file either; please move them to the .cpp file if kept.

src/hotspot/share/gc/parallel/psParallelCompact.cpp line 2059:

> 2057:       }
> 2058:     } task;
> 2059:     ParallelScavengeHeap::heap()->workers().run_task(&task);

Is it really worth potentially spinning up (hundreds of) worker threads for this very short operation? Either way, I would prefer hiding this in a helper method.
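E.g. a serial variant could look like this (sketch; `flush_marking_stats_cache()` is hypothetical, the rest uses existing names as far as I remember them):

    static void flush_marking_stats_caches() {
      GCTraceTime(Debug, gc, phases) tm("Flush Marking Stats", &_gc_timer);
      for (uint i = 0; i < ParallelGCThreads; i++) {
        ParCompactionManager::gc_thread_compaction_manager(i)->flush_marking_stats_cache();
      }
    }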

For testing, I added some code that shows the costs of spinning up threads for such a negligible operation:

2 threads (parallel version takes almost twice the time):


Parallel version:
$ java -Xmx3g -Xms3g -XX:-ScavengeBeforeFullGC -XX:MarkSweepDeadRatio=0 -XX:ParallelGCThreads=2 -XX:+UseParallelGC -XX:NewSize=1g "-Xlog:gc,gc+phases=debug" -XX:+UnlockDiagnosticVMOptions -XX:-UseNewCode fullgc | grep lush
[0,200s][debug][gc,phases] GC(0) Flush Marking Stats 0,032ms
[0,534s][debug][gc,phases] GC(1) Flush Marking Stats 0,022ms
[0,819s][debug][gc,phases] GC(2) Flush Marking Stats 0,020ms
[1,159s][debug][gc,phases] GC(3) Flush Marking Stats 0,021ms
[1,461s][debug][gc,phases] GC(4) Flush Marking Stats 0,019ms
[1,784s][debug][gc,phases] GC(5) Flush Marking Stats 0,022ms

Serial version:
$ java -Xmx3g -Xms3g -XX:-ScavengeBeforeFullGC -XX:MarkSweepDeadRatio=0 -XX:ParallelGCThreads=2 -XX:+UseParallelGC -XX:NewSize=1g "-Xlog:gc,gc+phases=debug" -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode fullgc | grep lush
[0,247s][debug][gc,phases] GC(0) Flush Marking Stats 0,012ms
[0,597s][debug][gc,phases] GC(1) Flush Marking Stats 0,014ms
[0,881s][debug][gc,phases] GC(2) Flush Marking Stats 0,034ms
[1,208s][debug][gc,phases] GC(3) Flush Marking Stats 0,016ms
[1,513s][debug][gc,phases] GC(4) Flush Marking Stats 0,017ms
[1,843s][debug][gc,phases] GC(5) Flush Marking Stats 0,017ms

8 threads, showing a speedup of ~2:

Parallel version:

$ java -Xmx3g -Xms3g -XX:-ScavengeBeforeFullGC -XX:MarkSweepDeadRatio=0 -XX:ParallelGCThreads=8 -XX:+UseParallelGC -XX:NewSize=1g "-Xlog:gc,gc+phases=debug" -XX:+UnlockDiagnosticVMOptions -XX:-UseNewCode fullgc | grep lush
[0,135s][debug][gc,phases] GC(0) Flush Marking Stats 0,028ms
[0,266s][debug][gc,phases] GC(1) Flush Marking Stats 0,029ms
[0,378s][debug][gc,phases] GC(2) Flush Marking Stats 0,037ms
[0,509s][debug][gc,phases] GC(3) Flush Marking Stats 0,030ms
[0,628s][debug][gc,phases] GC(4) Flush Marking Stats 0,027ms
[0,747s][debug][gc,phases] GC(5) Flush Marking Stats 0,027ms

Serial version:

$ java -Xmx3g -Xms3g -XX:-ScavengeBeforeFullGC -XX:MarkSweepDeadRatio=0 -XX:ParallelGCThreads=8 -XX:+UseParallelGC -XX:NewSize=1g "-Xlog:gc,gc+phases=debug" -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode fullgc | grep lush
[0,142s][debug][gc,phases] GC(0) Flush Marking Stats 0,043ms
[0,265s][debug][gc,phases] GC(1) Flush Marking Stats 0,051ms
[0,376s][debug][gc,phases] GC(2) Flush Marking Stats 0,061ms
[0,528s][debug][gc,phases] GC(3) Flush Marking Stats 0,059ms
[0,649s][debug][gc,phases] GC(4) Flush Marking Stats 0,022ms
[0,772s][debug][gc,phases] GC(5) Flush Marking Stats 0,033ms

30 threads, showing inverse scaling for the serial version:

Parallel version:

$ java -Xmx3g -Xms3g -XX:-ScavengeBeforeFullGC -XX:MarkSweepDeadRatio=0 -XX:ParallelGCThreads=30 -XX:+UseParallelGC -XX:NewSize=1g "-Xlog:gc,gc+phases=debug" -XX:+UnlockDiagnosticVMOptions -XX:-UseNewCode fullgc | grep lush
[0,122s][debug][gc,phases] GC(0) Flush Marking Stats 0,041ms
[0,220s][debug][gc,phases] GC(1) Flush Marking Stats 0,046ms
[0,345s][debug][gc,phases] GC(2) Flush Marking Stats 0,047ms
[0,449s][debug][gc,phases] GC(3) Flush Marking Stats 0,046ms
[0,580s][debug][gc,phases] GC(4) Flush Marking Stats 0,045ms
[0,708s][debug][gc,phases] GC(5) Flush Marking Stats 0,046ms


Serial version:
$ java -Xmx3g -Xms3g -XX:-ScavengeBeforeFullGC -XX:MarkSweepDeadRatio=0 -XX:ParallelGCThreads=30 -XX:+UseParallelGC -XX:NewSize=1g "-Xlog:gc,gc+phases=debug" -XX:+UnlockDiagnosticVMOptions -XX:+UseNewCode fullgc | grep lush
[0,121s][debug][gc,phases] GC(0) Flush Marking Stats 0,152ms
[0,216s][debug][gc,phases] GC(1) Flush Marking Stats 0,092ms
[0,328s][debug][gc,phases] GC(2) Flush Marking Stats 0,090ms
[0,445s][debug][gc,phases] GC(3) Flush Marking Stats 0,104ms
[0,569s][debug][gc,phases] GC(4) Flush Marking Stats 0,091ms
[0,687s][debug][gc,phases] GC(5) Flush Marking Stats 0,102ms

Pause times (at 30 threads):

[0,167s][info ][gc       ] GC(0) Pause Full (System.gc()) 146M->115M(2944M) 63,913ms
[0,283s][info ][gc       ] GC(1) Pause Full (System.gc()) 192M->191M(2944M) 83,758ms
[0,386s][info ][gc       ] GC(2) Pause Full (System.gc()) 268M->191M(2944M) 87,499ms
[0,496s][info ][gc       ] GC(3) Pause Full (System.gc()) 268M->191M(2944M) 95,813ms
[0,612s][info ][gc       ] GC(4) Pause Full (System.gc()) 268M->191M(2944M) 91,299ms
[0,738s][info ][gc       ] GC(5) Pause Full (System.gc()) 268M->191M(2944M) 92,403ms

I.e., in all cases the time this operation takes is in the noise (still good to measure it), so parallelizing it does not seem like a good investment; if it is kept parallel, do not spin up all threads.

-------------

PR Review: https://git.openjdk.org/jdk/pull/17788#pullrequestreview-1875129839
PR Review Comment: https://git.openjdk.org/jdk/pull/17788#discussion_r1489249707
PR Review Comment: https://git.openjdk.org/jdk/pull/17788#discussion_r1489178153
PR Review Comment: https://git.openjdk.org/jdk/pull/17788#discussion_r1486299365
PR Review Comment: https://git.openjdk.org/jdk/pull/17788#discussion_r1489180530
PR Review Comment: https://git.openjdk.org/jdk/pull/17788#discussion_r1489264838
PR Review Comment: https://git.openjdk.org/jdk/pull/17788#discussion_r1489178872
PR Review Comment: https://git.openjdk.org/jdk/pull/17788#discussion_r1489220996

