RFC: Instrumentation to Help Understand Shenandoah Performance and Latency

Mon Jun 1 13:28:46 UTC 2020

Hi,

On 5/27/20 10:15 PM, Nilsen, Kelvin wrote:
> 1. Number of GC pacing pauses imposed on thread A
> 2. Total number of accumulated ms of pacing pauses imposed on thread A

These are not available, but doable. You can hack it in the current source like this:
  http://cr.openjdk.java.net/~shade/shenandoah/pacing-stats.patch

That is probably doable in the product configuration, since it is not on a performance-sensitive
path. I'll try to massage it in.

> 3. Number of "slow" paths taken through the reference load barrier for thread A.
> Of these: a) How many required thread A to copy the referenced object?
> b) How many total bytes of data were copied?
> c) How many times did thread A have to abandon its copy of the referenced object?
> d) How many total bytes of data were abandoned?

These are not available, but doable. The LRB slowpath is quite performance-sensitive, so I would be
wary of introducing accidental bottlenecks there. The counters above do not seem too intrusive,
though. Any LRB slowpath counter that involves timestamps would be a no-go.
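
To illustrate the kind of counter that stays cheap, here is a minimal standalone sketch. It is not
HotSpot code: the struct, the thread-local variable, and record_lrb_slowpath() are hypothetical
names made up for this example. The point is only that counters 3 and 3a-3d reduce to plain
integer adds on thread-local data, with no clock reads, atomics, or locks.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-thread LRB slow-path counters; illustrative names only.
struct LrbSlowpathStats {
    std::uint64_t slowpaths = 0;        // 3.  slow paths taken
    std::uint64_t copies = 0;           // 3a. slow paths that copied the object
    std::uint64_t bytes_copied = 0;     // 3b. total bytes copied
    std::uint64_t abandoned = 0;        // 3c. copies that were abandoned
    std::uint64_t bytes_abandoned = 0;  // 3d. total bytes abandoned
};

// One instance per thread, so updates need no synchronization.
thread_local LrbSlowpathStats tl_lrb;

// Record one slow-path visit using only integer adds (no timestamps).
// `copied`: this visit copied the object; `abandoned_copy`: the copy was
// later abandoned; `obj_bytes`: size of the object in bytes.
inline void record_lrb_slowpath(bool copied, bool abandoned_copy,
                                std::size_t obj_bytes) {
    tl_lrb.slowpaths += 1;
    if (!copied) {
        return;
    }
    tl_lrb.copies += 1;
    tl_lrb.bytes_copied += obj_bytes;
    if (abandoned_copy) {
        tl_lrb.abandoned += 1;
        tl_lrb.bytes_abandoned += obj_bytes;
    }
}
```

In this sketch an abandoned copy is counted in both the "copied" and the "abandoned" totals; whether
that is the accounting you want is a choice for whoever does the real patch.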

> 4. Can my application thread "know" about transitions between GC phases?
> Would be nice if a thread could ask:
> a) What is the sequence number of the most recently initiated GC cycle? 
> b) What is the current phase of GC within this cycle?

These require more thorough hacking in through JMX. I believe at some point the hassle of pushing
the data across to Java level would not pay off for the small observability benefits it gives. This
one seems a bit over that arbitrary line.

> For my purposes, these metrics are primarily for use during "research" on performance of
> particular workloads and of particular alternative GC implementation approaches.  I would
> personally be satisfied if these metrics were only available in a JVM that is compiled "with
> instrumentation".  Whether it might be appropriate to deploy "production releases" with this
> instrumentation in place would depend on how much performance overhead it incurs.

Technically, there is the "optimized" build configuration that compiles like "release", but enables
the blocks of code guarded by the OPTIMIZED macro. But I doubt we want to support yet another
build configuration. Counters that are not available in release VMs are known to bit-rot very
quickly.

So it seems to me there are realistically two types of counters we should be doing:
  a) Those we can make arbitrarily low-overhead, especially when disabled with a runtime flag;
  b) Those we can only provide in a self-patched/self-built VM.
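
For type (a), a rough sketch of what "low-overhead when disabled" can look like. This is a
standalone illustration, not HotSpot code: g_counters_enabled stands in for a runtime flag and is
not an existing VM option, and count_event() is a made-up helper. When the flag is off, the cost
per call is one relaxed load and a branch.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical runtime switch standing in for a VM flag (not a real option).
std::atomic<bool> g_counters_enabled{false};

// Per-thread event counter; touched only when the switch is on.
thread_local std::uint64_t tl_event_count = 0;

// Count one event if counters are enabled; otherwise do nothing.
inline void count_event() {
    if (!g_counters_enabled.load(std::memory_order_relaxed)) {
        return;  // disabled path: a load and a branch, no stores
    }
    tl_event_count += 1;
}
```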

> I am suggesting that these metrics might be gathered on a per-thread basis to reduce
> synchronization overheads.  The results of individual threads can be accumulated by user code
> when desired.  Each thread accumulates its thread-local data into a global accumulator, using
> synchronization to coordinate access to the global accumulator.

See the patch above, that's almost exactly how our usual hacks work.
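
For readers without the patch at hand, the scheme described in the quoted paragraph can be sketched
roughly as follows. This is not the contents of the patch: all names here (PacingStats,
record_pacing_pause, flush_pacing_stats, pacing_totals) are hypothetical and the code is standalone
C++, assuming only that each thread flushes its own counters at a point of its choosing, for
example before it exits.

```cpp
#include <cstdint>
#include <mutex>

// Hypothetical pacing counters for items 1 and 2 in the list above.
struct PacingStats {
    std::uint64_t pauses = 0;     // number of pacing pauses
    std::uint64_t paused_ms = 0;  // accumulated pacing pause time, in ms
};

// Per-thread counters: each thread updates its own copy without locking.
thread_local PacingStats tl_pacing;

// Global accumulator, guarded by a mutex.
std::mutex g_pacing_lock;
PacingStats g_pacing_total;

// Hot-ish path: record one pacing pause on the current thread.
inline void record_pacing_pause(std::uint64_t ms) {
    tl_pacing.pauses += 1;
    tl_pacing.paused_ms += ms;
}

// Cold path: fold the current thread's counters into the global totals
// and reset them, so a repeated flush does not double-count.
inline void flush_pacing_stats() {
    std::lock_guard<std::mutex> guard(g_pacing_lock);
    g_pacing_total.pauses += tl_pacing.pauses;
    g_pacing_total.paused_ms += tl_pacing.paused_ms;
    tl_pacing = PacingStats{};
}

// Read a consistent copy of the global totals.
inline PacingStats pacing_totals() {
    std::lock_guard<std::mutex> guard(g_pacing_lock);
    return g_pacing_total;
}
```

The lock is taken only on flush and on read, so the per-pause cost stays at two thread-local adds.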

-- 
Thanks,
-Aleksey