Timing Differences Between JFR and GC Logs
Markus Gronlund
markus.gronlund at oracle.com
Mon Aug 17 10:41:09 UTC 2020
Hi Ana,
Thank you, Kim and David, for your responses, with Kim alluding to some of the historical aspects.
JFR performance as it relates to event generation, which is also a functional aspect of JFR, reduces to a large extent to how quickly a timestamp can be acquired. Since everything in JFR is an event, and an event has at least one timestamp (two for events that represent durations), event generation leans heavily on the clock source. Clock access latency is therefore of central importance for JFR, maybe even more so than correctness. And historically, operating systems have varied quite a bit when it comes to access latency and resolution for the performance timers they expose.
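To make the timestamp counts concrete, here is a minimal sketch of a user-defined event using the jdk.jfr API (the event name and payload field are made up for illustration); committing an event directly takes one clock read, while a begin()/end() pair for a duration event takes two:

    import jdk.jfr.Event;
    import jdk.jfr.Label;
    import jdk.jfr.Name;

    @Name("example.Work")   // hypothetical event name
    @Label("Work")
    class WorkEvent extends Event {
        @Label("Items")
        int items;          // hypothetical payload field
    }

    class Demo {
        static void doWork() {
            WorkEvent e = new WorkEvent();
            e.begin();      // first clock read
            e.items = 42;   // ... the actual work ...
            e.end();        // second clock read, for the duration
            e.commit();     // event written to the current chunk
        }
    }

Every doWork() call pays for two clock reads, which is why the latency of the clock source dominates event generation cost.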
What you see in your example is that os::elapsed_counter() (which on Windows maps to QueryPerformanceCounter() with a JVM-relative epoch offset) and the rdtsc() counter have disjoint epochs, and they are treated as such in Hotspot. Attempting to compare the raw counter values is therefore not semantically valid.
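A rough analogy at the Java level, assuming nothing about Hotspot internals: System.nanoTime() uses an arbitrary, JVM-relative origin, while System.currentTimeMillis() uses the Unix epoch, so only deltas within the same source mean anything:

    class EpochDemo {
        public static void main(String[] args) throws InterruptedException {
            long n0 = System.nanoTime();          // arbitrary origin
            long w0 = System.currentTimeMillis(); // Unix epoch origin
            Thread.sleep(100);
            long n1 = System.nanoTime();
            long w1 = System.currentTimeMillis();
            // Comparing n0 with w0 is meaningless: different origins, different units.
            // Deltas within one source are meaningful:
            System.out.println("nanoTime delta (ms):   " + (n1 - n0) / 1_000_000);
            System.out.println("wall-clock delta (ms): " + (w1 - w0));
        }
    }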
Relying on rdtsc() comes with disclaimers and problems and is generally not recommended. Apart from the historical and performance-related aspects already detailed, here is a short description of how it is treated in JFR:
JFR will only attempt to use this source if it has the InvariantTSC property, with timestamp values only treated relative to some other, more stable, clock source. Each "chunk" (file) in JFR reifies a relative epoch, with the chunk start time anchored to a stable timestamp (on Windows this is UTC nanoseconds). rdtsc() timestamps for events generated during that epoch are only treated relative to this start time during post-processing, which gives very high resolution to JFR events. As JFR runs, new "chunks", and therefore new time epochs, are constructed continuously, each anchored anew to a stable timestamp.
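In sketch form, the reconstruction a post-processor performs looks roughly like the following; the field names here are hypothetical, not JFR's actual chunk format, though the real format likewise records a counter frequency used for the conversion:

    // Hedged sketch of recovering an event timestamp from a chunk-relative epoch.
    final class ChunkEpoch {
        final long startNanos;       // stable anchor: chunk start, e.g. UTC nanoseconds
        final long startTicks;       // rdtsc() value sampled at chunk start
        final double ticksPerSecond; // counter frequency recorded with the chunk

        ChunkEpoch(long startNanos, long startTicks, double ticksPerSecond) {
            this.startNanos = startNanos;
            this.startTicks = startTicks;
            this.ticksPerSecond = ticksPerSecond;
        }

        // Raw event ticks are only meaningful relative to the chunk anchor.
        long eventNanos(long eventTicks) {
            double elapsedSeconds = (eventTicks - startTicks) / ticksPerSecond;
            return startNanos + (long) (elapsedSeconds * 1_000_000_000L);
        }
    }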
The nature of rdtsc(), which reads per-core / per-socket counters with no guarantee that they have been synchronized, is of course a problem with this mechanism. However, over the years, possible skews have proven less problematic in JFR than one might think. In general, the relative relations between the recorded JFR events give enough information to understand a situation and to solve a problem. Of course, there are exceptions, for example when analyzing low-level aspects where high accuracy is expected, usually involving correlation with some other, non-JFR component. For these situations, an alternative is to turn off rdtsc() usage in JFR using the flags: -XX:+UnlockExperimentalVMOptions -XX:-UseFastUnorderedTimeStamps. JFR will then use os::elapsed_counter() as the time source. This comes with higher overhead, but if that overhead is not deemed problematic in an environment, then this is of course a better solution.
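For example (a sketch; the recording options and application name are placeholders):

    java -XX:+UnlockExperimentalVMOptions -XX:-UseFastUnorderedTimeStamps \
         -XX:StartFlightRecording=filename=recording.jfr MyApp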
As others have already pointed out, operating systems have evolved in recent years in how they provide performance counter information to user mode. It may very well be that access latencies are now within acceptable overhead, combined with high timer resolution. If that is the case, the rdtsc() usage should be phased out due to its inherent problems. This requires a systematic investigation and some policy on how to handle older HW/SW combinations - if there needs to be a fallback to continue using rdtsc(), it follows that it is not feasible to phase it out completely.
Glad to see you interning on work related to JFR.
If you have more questions specifically related to JFR, you can post them to hotspot-jfr-dev at openjdk.java.net
Thank you
Markus
-----Original Message-----
From: Kim Barrett
Sent: den 15 augusti 2020 21:49
To: Ana Marsh <t-maana at microsoft.com>
Cc: hotspot-gc-dev at openjdk.java.net; Monica Beckwith <Monica.Beckwith at microsoft.com>; Kirk Pepperdine <Kirk.Pepperdine at microsoft.com>
Subject: Re: Timing Differences Between JFR and GC Logs
> On Aug 14, 2020, at 6:09 PM, Ana Marsh <t-maana at microsoft.com> wrote:
>
> Hi everyone,
>
> My name is Ana Marsh and I am interning at Microsoft on the Java Engineering Team (under Martijn Verburg) and this summer I have been diving into JFR in the JVM and analyzing how it gathers information on garbage collection events. I have noticed over the past two months of digging that the start time of a garbage collection event returned by JFR varies widely from the start time of the same event given with Unified Logging. I decided to explore this further by looking at the specifics of OpenJDK code and found some interesting points to highlight.
>
>
> […]
>
> For JFR files, the start time is set to the RDTSC value while in GC logs it uses the OS clock as it prints out the log. This is obviously a further point of concern for timing differences, so I wanted to see how much of a contrast there was in the times these clocks reported.
>
> […]
> Observations on Results:
> Both ElapsedCounterSource (OS) and FastUnorderedElapsedCounterSource (RDTSC) have their own way of determining seconds, so there could be some issues in there as well. Additionally, in this example the clocks are not being read at the same instant, causing some numbers to be off; however, at the beginning of my test the OS clock ends up reporting a later time even though it is called first. Over time, though, RDTSC seems to begin reporting later times. Overall these clocks drift very far apart, on the order of seconds, which seems big in the clock world.
>
> Any insight or thoughts on this matter would be helpful. Thank you!
What you are seeing is the result of a lot of history and struggles with the time subsystems provided by various platforms. Search JBS for JDK issues with the label "time" for an idea of that history.
One of the primary goals of JFR is to be low-overhead. If the overhead is too high, developers will be reluctant to add events, limiting the effectiveness of the facility. At least in the past, the OS-level time facilities have often been found to have a pretty high cost; using them in JFR was found to account for a substantial fraction of the overall cost of JFR. Hence the decision to use a low-level, not always reliable, platform-specific solution on x86, based on TSC. One reason for this is that even the OS-level time facilities are or have been problematic. See, for example, the discussion of QueryPerformanceCounter here:
https://docs.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps
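As a rough way to get a feel for those costs on a given machine, one can time repeated timestamp acquisition from Java (a crude sketch, not a rigorous benchmark; a harness like JMH would be the serious tool):

    class ClockCost {
        public static void main(String[] args) {
            final int N = 10_000_000;
            long sink = 0;                    // consume results so the loop isn't elided
            long t0 = System.nanoTime();
            for (int i = 0; i < N; i++) {
                sink += System.nanoTime();    // repeated clock reads
            }
            long t1 = System.nanoTime();
            System.out.println("avg ns per System.nanoTime(): "
                    + (double) (t1 - t0) / N + "  (sink=" + sink + ")");
        }
    }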
On the other side of this we have the GC, which also needs to collect timing information. And while the GC also wants low overhead, it uses some of the timing information it gathers as part of a control system. That makes it much more sensitive to the kinds of problems that using TSC as a time source has historically had. Basically, GC can't really cope with those kinds of problems, but also (by design) mostly tries to avoid needing timing information quite as fine-grained as might be wanted for JFR. So the GC uses the OS-level time facilities, which account for and avoid a problematic TSC.
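To illustrate why a control system is more sensitive (a hypothetical sketch, not actual Hotspot GC code): if a pause-time estimate feeds back into sizing decisions, a single backwards-jumping clock read produces a negative "pause" that poisons the estimate:

    // Hypothetical pause-time predictor; any clock skew between the start and
    // end reads flows directly into the control decision.
    class PausePredictor {
        private double avgPauseMs = 0.0;

        void record(long startNanos, long endNanos) {
            double pauseMs = (endNanos - startNanos) / 1e6; // negative if the clock skewed
            avgPauseMs = 0.7 * avgPauseMs + 0.3 * pauseMs;  // decaying average
        }

        double predictedPauseMs() { return avgPauseMs; }    // feeds sizing decisions
    }

A profiler, by contrast, can tolerate the occasional skewed event; the analyst simply reads around it.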
And because GC and JFR are dealing with different clock sources that aren't necessarily synchronized, one sees the kinds of problems you describe.
It might be that the benefit for JFR of directly using TSC is no longer worth the costs. I personally think that is probably the case. But that's just a somewhat informed opinion. The question needs careful investigation, including questions about what platforms need to be supported and at what level of quality. Are people running the latest version of Java on old hardware where TSC is problematic? Is the HPET fallback sufficient for such folks? It can be hard to get good data for questions like that.