RFR: JDK-8297796 GenShen: instrument the remembered set scan [v12]

Mon Dec 19 23:42:04 UTC 2022

> **Note:**
> This pull request is a draft to share the diffs with the project team. The following additional work is planned before this is ready to commit. (Thanks to Kevin, Roman, William etc. for feedback & suggestions.)
> 
> (1) Collect performance data from SpecJBB and from the pipeline to assess the impact of instrumentation on concurrent remembered set scanning and concurrent update refs phase durations, in addition to the existing data from Extermem mentioned in the ticket.
> 
> (2) Make available the instrumentation only in non-product (optimized) mode until better performance is achieved.
> 
> (3) Any improvements that come from further feedback on this draft (e.g. better or different logging of the metrics data), or other suggestions that I may have missed mentioning above.
> 
> The fix to ShenandoahNumberSeq will be separated out and made into a separate pull request on mainline.
> 
> **Summary:**
> The main change is card stats collection during RS scanning. The code is protected by a new diagnostic flag `ShenandoahEnableCardStats`, which is off by default. With the flag disabled there is a small performance impact (measured with extremem; more data will be collected, see above). With the flag enabled there is a larger performance impact because of the large number of clusters, with shared stats updates at the end of each cluster processed. Since we expect the loops in process_clusters() to change in the near future, informed by the learnings from these stats, we expect to work further on reducing the cost of the stats collection as well. Currently the stats are logged per thread at the end of each RS scan. I'm happy to refine both the stats that we collect as well as how frequently we log the data once we have gathered some experience on how we use this.
> 
> **Details of files changed:**
> 
> 1. shenandoahGeneration.cpp: add a call to log info at the end of remembered set scan when card stats are enabled.
> 2. shenandoahHeap.cpp: minor retsructuring of a loop for task claiming during update refs; introduce a worker id option to downstream code for card stats
> 3. shenandoahNumberSeq.cpp: fix a minor issue with a boundary condition check in code that tries to find the right bucket to increment. This was triggering an assert in the update code.
> 4. shenandoahNumberSeq.hpp: provide missing allocation spec for BinaryMagnitudeSeq
> 5. shenandoahScanRemembered.cpp: new class ShenandoahCardStats methods. Minor restructure of loop for task claiming during RS scanning (akin to the one for update refs in 2 above).
> 6. shenandoahScanRemembered.hpp: Diff looks large because of git-diff'ism having issues with indentation change in restructured if-else branches. Not sure how to make the diffs more easily readable. Updated some documentation comments that were slightly obsolete. New class ShenandoahCardStats and implementation of inline methods. Class ShenandoahScanRemembered keeps cumuative running histograms. Remove some inline declarations for larger methods that we shouldn't force inlining on. Update some old comments.
> 7. shenandoahScanRemembered.inline.hpp: As in 6, diff looks larger than it should because of the same indentation change. ShenandoahScanRemembered::process_clusters() is the method where the instrumentation probes have been inserted. A couple of variables were renamed for clarity, as well as ti update local variables rather than method arguments. The large diffs at (old) line 589 onwards is the git-diff'ism to do with indentation change. Delete some unused methods.
> 8. shenandoah_globals.hpp: new diagnostic flag `ShenandoahEnableCardStats` protects the stats collection code and is disabled by default.
> 
> **Format of stats produced and how to interpret them: (sample)**
> 
> 
> [1211.515s][info][gc,task     ] GC(7069) Using 10 of 20 workers for Concurrent remembered set scanning
> [1211.529s][info][gc,remset   ] GC(7069) Worker 0 Card Stats Histo: 
> [1211.529s][info][gc,remset   ] GC(7069)          dirty_run: [     0.00     0.00     0.00     0.00   100.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)          clean_run: [     0.00     0.00     0.00    99.61   100.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)        dirty_cards: [     0.00     0.00     0.00     0.00   100.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)        clean_cards: [     0.00    99.61    99.61    99.61   100.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)      max_dirty_run: [     0.00     0.00     0.00     0.00   100.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)      max_clean_run: [     0.00    99.61    99.61    99.61   100.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)         dirty_objs: [     0.00     0.00     0.00     0.00  1245.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)         clean_objs: [     0.00     0.00     0.00     0.00    64.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)        dirty_scans: [     0.00     0.00     0.00     0.00  1157.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)        clean_scans: [     0.00     0.00     0.00     0.00    17.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)       alternations: [     0.00     0.00     0.00     0.00    39.00 ]
> [1211.529s][info][gc,remset   ] GC(7069) Worker 1 Card Stats Histo: 
> [1211.529s][info][gc,remset   ] GC(7069)          dirty_run: [     0.00     0.00     0.00     0.00   100.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)          clean_run: [     0.00     0.00     0.00    99.61   100.00 ]   
> [1211.529s][info][gc,remset   ] GC(7069)        dirty_cards: [     0.00     0.00     0.00     0.00   100.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)        clean_cards: [     0.00    99.61    99.61    99.61   100.00 ]   
> [1211.529s][info][gc,remset   ] GC(7069)      max_dirty_run: [     0.00     0.00     0.00     0.00   100.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)      max_clean_run: [     0.00    99.61    99.61    99.61   100.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)         dirty_objs: [     0.00     0.00     0.00     0.00  1257.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)         clean_objs: [     0.00     0.00     0.00     0.00    64.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)        dirty_scans: [     0.00     0.00     0.00     0.00  1197.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)        clean_scans: [     0.00     0.00     0.00     0.00    17.00 ]
> [1211.529s][info][gc,remset   ] GC(7069)       alternations: [     0.00     0.00     0.00     0.00    39.00 ]
> ...
> 
> 
> The rows represent the metric that's being tracked, and the columns are, respectively, minimum, the 3 quartiles (25%, 50%, 75%) and the maximum. The metrics are:
> 
> - dirty_run: the length of an uninterrupted run of dirty cards, interpretedas a percentage of a chunk of work assignment (cluster) processed by a thread
> - clean_run: as above, but the length of an uninterrupted run of clean cards
> - dirty_cards, clean_cards: as above, but counts of cards as a percentage of chunk
> - max_dirty_run & max_clean_run: Similarly for the maximum of each.
> - dirty_objs, clean_objs: these are numbers of objects in any chunk walked, or scanned
> - dirty_scans, clean_scans: numbers of objects scanned by the closure
> - alternations: the number of times that we transitioned from clean to dirty or dirty to clean in a chunk
> 
> The data above indicates that at least 75% of the chunks have no alternations at all,
> and cards are almost always mostly clean for this specific benchmark config (extremem).
> 
> Comparing worker stats from worker 0 and worker 9 indicates very little difference between
> their statistics, as one might typically expect for well-balanced RS scans.
> 
> **Questions:**
> 
> 1. Would it make sense to print also, for example, the 1, 10, 90 and 99 percentiles for these metrics as well, in addition to the quartiles?
> 2. The distributions are per worker for the cumulative history of the run. Would data per RS scan or per Refs Update phase provide more useful information?
> 3. Any suggestions for a more easily consumable format?
> 4. I welcome any other feedback on the pull request.

Y. Srinivas Ramakrishna has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 65 commits:

 - Merge branch 'master' into JVM-1264-dependent
 - Add a previously missed ticket#. Doing it here rather than in parent to
   avoid an otherwise unnecessary re-review touchpoint.
 - Merge branch 'stats_merge' into JVM-1264-dependent
 - Merge branch 'master' into stats_merge
 - jcheck space fix
 - Fix compiler error on windows.
 - Fix some tier1 tests.
 - Remove an unnecessary include, fix some type incorrectness.
 - Merge branch 'JVM-1264' into JVM-1264-dependent
 - Merge branch 'master' into JVM-1264
 - ... and 55 more: https://git.openjdk.org/shenandoah/compare/bbd4ef34...9c5c741f

-------------

Changes: https://git.openjdk.org/shenandoah/pull/176/files
 Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=176&range=11
  Stats: 864 lines in 9 files changed: 495 ins; 206 del; 163 mod
  Patch: https://git.openjdk.org/shenandoah/pull/176.diff
  Fetch: git fetch https://git.openjdk.org/shenandoah pull/176/head:pull/176

PR: https://git.openjdk.org/shenandoah/pull/176