RFC: Instrumentation to Help Understand Shenandoah Performance and Latency

Wed May 27 20:15:49 UTC 2020

In trying to understand Shenandoah performance under different workload conditions, it would be useful to have some additional GC logging output.  Perhaps this is already available.  Please let me know if there is a way to capture this information currently.

The information that would be of value includes:

1. Number of GC pacing pauses imposed on thread A
2. Total number of accumulated ms of pacing pauses imposed on thread A
3. Number of "slow" paths taken through the reference load barrier for thread A.  Of these:
     a) How many required thread A to copy the referenced object?
     b) How many total bytes of data were copied?
     c) How many times did thread A have to abandon its copy of the referenced object?
     d) How many total bytes of data were abandoned?
4. Can my application thread "know" about transitions between GC phases?  It would be nice if a thread could ask:
      a) What is the sequence number of the most recently initiated GC cycle?
      b) What is the current phase of GC within this cycle?

As I study performance of certain critical code components, I might want to ask at the start of the code for all of this information about a particular thread, and then ask for the updated information at the end of the critical code.
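
To illustrate the before/after measurement described above, here is a minimal Java sketch.  The class GcThreadMetrics and all of its fields are hypothetical names invented for this example; as far as I know no such API exists in the JDK today, and how the snapshots would actually be obtained from the JVM is left open.

```java
// Hypothetical snapshot of the per-thread counters requested in items 1-3.
// Nothing here is an existing JDK or Shenandoah API; it only models the idea.
final class GcThreadMetrics {
    final long pacingPauses;       // item 1: number of pacing pauses
    final long pacingPauseMillis;  // item 2: accumulated ms of pacing pauses
    final long slowPathBarriers;   // item 3: slow paths through the load barrier
    final long objectsCopied;      // item 3a: slow paths that copied the object
    final long bytesCopied;        // item 3b: total bytes copied
    final long copiesAbandoned;    // item 3c: copies that were abandoned
    final long bytesAbandoned;     // item 3d: total bytes abandoned

    GcThreadMetrics(long pacingPauses, long pacingPauseMillis,
                    long slowPathBarriers, long objectsCopied, long bytesCopied,
                    long copiesAbandoned, long bytesAbandoned) {
        this.pacingPauses = pacingPauses;
        this.pacingPauseMillis = pacingPauseMillis;
        this.slowPathBarriers = slowPathBarriers;
        this.objectsCopied = objectsCopied;
        this.bytesCopied = bytesCopied;
        this.copiesAbandoned = copiesAbandoned;
        this.bytesAbandoned = bytesAbandoned;
    }

    // Returns the change in each counter since an earlier snapshot,
    // i.e. what the GC imposed on this thread during the critical code.
    GcThreadMetrics minus(GcThreadMetrics start) {
        return new GcThreadMetrics(
            pacingPauses - start.pacingPauses,
            pacingPauseMillis - start.pacingPauseMillis,
            slowPathBarriers - start.slowPathBarriers,
            objectsCopied - start.objectsCopied,
            bytesCopied - start.bytesCopied,
            copiesAbandoned - start.copiesAbandoned,
            bytesAbandoned - start.bytesAbandoned);
    }
}
```

The intended use would be to take one snapshot before the critical code and one after it, then call after.minus(before) to see only the costs incurred in between.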

For my purposes, these metrics are primarily for use during "research" on performance of particular workloads and of particular alternative GC implementation approaches.  I would personally be satisfied if these metrics were only available in a JVM that is compiled "with instrumentation".  Whether it might be appropriate to deploy "production releases" with this instrumentation in place would depend on how much performance overhead it incurs.

I am suggesting that these metrics might be gathered on a per-thread basis to reduce synchronization overheads.  When desired, user code can combine the results of individual threads: each thread adds its thread-local data to a global accumulator, using synchronization to coordinate access to that accumulator.
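
The accumulation scheme above could look roughly like the following Java model.  This is only a sketch under my own assumptions: the class PacingStats and its methods are hypothetical, only two of the counters are shown, and a real implementation would presumably live inside the JVM rather than in Java code.

```java
// Hypothetical model of per-thread counters merged into a global accumulator.
final class PacingStats {
    // Global totals; every access is guarded by LOCK.
    private static final Object LOCK = new Object();
    private static long totalPauses;
    private static long totalPauseMillis;

    // Per-thread counters: [0] = pause count, [1] = accumulated ms.
    // Only the owning thread touches its array, so updates need no lock.
    private static final ThreadLocal<long[]> LOCAL =
        ThreadLocal.withInitial(() -> new long[2]);

    // Fast path: record one pacing pause for the calling thread.
    static void recordPause(long millis) {
        long[] c = LOCAL.get();
        c[0]++;
        c[1] += millis;
    }

    // Slow path: add the calling thread's counters to the global totals,
    // then reset the thread-local counters so nothing is counted twice.
    static void flush() {
        long[] c = LOCAL.get();
        synchronized (LOCK) {
            totalPauses += c[0];
            totalPauseMillis += c[1];
        }
        c[0] = 0;
        c[1] = 0;
    }

    // Returns a copy of the global totals as {pauses, pauseMillis}.
    static long[] totals() {
        synchronized (LOCK) {
            return new long[] { totalPauses, totalPauseMillis };
        }
    }
}
```

The point of the design is that the hot path (recordPause) touches only thread-local state, and the lock is taken only when a thread chooses to publish its results.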