Timing Differences Between JFR and GC Logs

Sat Aug 15 19:48:56 UTC 2020

> On Aug 14, 2020, at 6:09 PM, Ana Marsh <t-maana at microsoft.com> wrote:
> 
> Hi everyone,
> 
> My name is Ana Marsh and I am interning at Microsoft on the Java Engineering Team (under Martijn Verburg) and this summer I have been diving into JFR in the JVM and analyzing how it gathers information on garbage collection events. I have noticed over the past two months of digging that the start time of a garbage collection event returned by JFR varies widely from the start time of the same event given with Unified Logging. I decided to explore this further by looking at the specifics of OpenJDK code and found some interesting points to highlight.
> 
> 
> […]
> 
> For JFR files, the start time is set to the RDTSC value while in GC logs it uses the OS clock as it prints out the log. This is obviously a further point of concern for timing differences, so I wanted to see how much of a contrast there was in the times these clocks reported.
> 
> […]
> Observations on Results:
> Both ElapsedCounterSource (OS) and FastUnorderedElapsedCounterSource (RDTSC) have their own way of determining seconds so there could be some issues in there as well. Additionally in this example the clocks are not being set instantaneously causing some numbers to be off, however at the beginning of my test the OS clock ends up reporting a later time even though it is called first. Over time though RDTSC seems to begin reporting later times. Overall these clocks get far very off, in degrees of seconds which seems big in the clock world.
> 
> Any insight or thoughts on this matter would be helpful. Thank you!

What you are seeing is the result of a lot of history and struggles
with the time subsystems provided by various platforms.  Search JBS
for JDK issues with the label "time" for an idea of that history.

One of the primary goals of JFR is to be low-overhead. If the overhead is
too high, developers will be reluctant to add the events, limiting the
effectiveness of the facility.  At least in the past, the OS level time
facilities have often been found to have a pretty high cost; using them in
JFR was found to account for a substantial fraction of the overall cost of
JFR.  Hence the decision to use a low-level and not always reliable
platform-specific solution on x86, based on TSC.  One reason for this is
that even the OS level time facilities are or have been problematic.  See,
for example, the discussion of QueryPerformanceCounter here:
https://docs.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps

On the other side of this we have the GC, which also needs to collect timing
information. And while the GC also wants low overhead, it is using some of
the timing information it gathers as part of a control system. That makes it
much more sensitive to the kinds of problems using TSC as a time source has
historically had. Basically GC can't really cope with those kinds of
problems, but also (by design) mostly tries to avoid needing quite as
fine-grained timing information as might be wanted for JFR. So the GC uses
the OS level time facilities that account for and avoid a problematic TSC.

And because GC and JFR are dealing with different clock sources that aren't
necessarily synchronized, one sees the kinds of problems you describe.

It might be that the benefit for JFR of directly using TSC is no longer
worth the costs. I personally think that is probably the case. But that's
just a somewhat informed opinion. The question needs careful investigation,
including questions about what platforms need to be supported and at what
level of quality.  Are people running the latest version of Java on old
hardware where TSC is problematic?  Is the HPET fallback sufficient for such
folks?  It can be hard to get good data for questions like that.