RFR: 8342818: Implement CPU Time Profiling for JFR

Johannes Bechberger jbechberger at openjdk.org
Tue Oct 29 11:00:23 UTC 2024


On Wed, 28 Aug 2024 16:47:21 GMT, Johannes Bechberger <jbechberger at openjdk.org> wrote:

> This is the code for the [JEP draft: CPU Time based profiling for JFR].

I couldn't figure that out for now.

> Why are events with stack trace = null emitted? Isn't CPUTimeExecutionSampleLoss sufficient?

@egahlin CPUTimeExecutionSampleLoss is emitted when our internal queue is full and we cannot obtain any information for these samples (other than that they exist). On the other hand, `stackTrace=null,failed=null` is emitted when we can't walk the stack (or resolve any of the methods), but we still have enough space in the internal queue so that we can record the important thread and time information. As @jbachorik explained before, this is really important, as we know that the thread did some work at this point in time on the CPU.

We really want to capture as much information as possible. CPUTimeExecutionSampleLoss happens usually not that often (< 1% of the samples are typically lost), but stack walking is impossible in around 10% of the cases.

The semantics are different, as `maxRate` is just upper bounding, while `throttle` is used to sub-sample (not using the  throttle code under the hood), this is why I use a new name. Else I would have to check all over JFR which meaning `throttle` has for every event.

The only option for throttling (or rate-limiting) that I see is to upper bound the amount of events generated by lower-bounding the allowed interval, computing the lower-bound of the interval as `throttle/#cores` (as explained before). This is the only fast and unbiased way to upper-bounding the rate of CPU-time sample. Every time for the interval is set, it is checked against the computed lower-bound. This also happens when setting the throttle.

The two options for setting the throttle for this event are now:

1. Use the existing `throttle` setting: The semantics of upper-bounding the event emission rate are similar enough that this should not lead to confusion (description of `@Throttle`: "Event annotation, determines the event emission rate in events per time unit."). So we could use the pre-existing `throttle` setting and compute the lower-bound of the interval based on that. The main disadvantage is that the setting is currently bound to the [JfrEventThrottler](https://github.com/openjdk/jdk/blob/73ebb848fdb66861e912ea747c039ddd1f7a5f48/src/hotspot/share/jfr/recorder/service/jfrEventThrottler.hpp) (see [`JfrEvent#evaluate`](https://github.com/openjdk/jdk/blob/73ebb848fdb66861e912ea747c039ddd1f7a5f48/src/hotspot/share/jfr/recorder/service/jfrEvent.hpp#L164). So we would have to modify the code to add a special case for the CPU-time-based even, because we can't use the throttler (biasing and more).
2. Use a new `maxRate` setting: This reduces the amount of existing functionality that is affected by this PR and signifies that `maxRate` and `throttle` have slightly different semantics.

I can see that the first option leads to less confusion. But I'm happy to hear whether the first option is ok for you.

I've settled for option 1, which significantly reduces the amount of code for this PR. Furthermore, there are now tests for the expected throttle behavior.

How should I treat the `off` setting? I disabled the rate setting for now because it doesn't make any sense when the period can be arbitrarily small. Of course, I could replace it with the smallest reasonable value of "1ms". Any opinions?

And how should I document the new behaviour? Add it to the documentation of the throttle annotation?

The only place where the current throttle syntax is defined seems to be the throttle annotation.

Ok, so is the following ok?


<Event name="CPUTimeSample" category="Java Virtual Machine, Profiling" label="CPU Time Method Sample"
    description="Snapshot of a threads state from the CPU time sampler. The throttle can be either an upper bound for the event emission rate, e.g. 100/s, or the cpu-time period, e.g. 10ms, with s, ms, us and ns supported as time units."
    period="everyChunk" throttle="true" experimental="true">
    <Field type="Thread" name="sampledThread" label="Thread" />
    <Field type="StackTrace" name="stackTrace" label="Stack Trace" />
    <Field type="boolean" name="failed" label="Failed" description="Failed to obtain the stack trace" />
    <Field type="Tickspan" name="sampleTime" label="CPU time sampling period for this event"/>
  </Event>

The tests failed because I forgot to add two files; I will do that this evening.

But the ergonomics of the throttle options feels good.

There are fewer changes in tools that already process the two existing events. But yes, we can of course change it. I never understood why the old events had a special thread property anyway.

Regarding the field `samplingPeriod`: @apangin mentioned in a recent meeting that naming it `cpuTime` and storing the thread's CPU time at the point of sampling would be better, giving us more flexibility. Not only does it convey more information, but it also allows profiler writers to merge adjacent traces if they are similar, making it possible to store the traces in a more compact format without breaking the API.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20752#issuecomment-2365140305
PR Comment: https://git.openjdk.org/jdk/pull/20752#issuecomment-2366890705
PR Comment: https://git.openjdk.org/jdk/pull/20752#issuecomment-2379712271
PR Comment: https://git.openjdk.org/jdk/pull/20752#issuecomment-2380550554
PR Comment: https://git.openjdk.org/jdk/pull/20752#issuecomment-2386016365
PR Comment: https://git.openjdk.org/jdk/pull/20752#issuecomment-2416348809
PR Comment: https://git.openjdk.org/jdk/pull/20752#issuecomment-2416409643
PR Comment: https://git.openjdk.org/jdk/pull/20752#issuecomment-2416781819
PR Comment: https://git.openjdk.org/jdk/pull/20752#issuecomment-2416994088
PR Comment: https://git.openjdk.org/jdk/pull/20752#issuecomment-2418868037
PR Comment: https://git.openjdk.org/jdk/pull/20752#issuecomment-2421954045
PR Comment: https://git.openjdk.org/jdk/pull/20752#issuecomment-2443890244


More information about the hotspot-dev mailing list