RFR: 8241160: Concurrent class unloading reports GCTraceTime events as JFR pause sub-phase events
Stefan Karlsson
stefan.karlsson at oracle.com
Thu Mar 26 09:55:37 UTC 2020
On 2020-03-26 10:54, Roman Kennke wrote:
> Hey Stefan,
>
> Sorry, this went under my radar. Give us half a day or so, yes?
Sure.
StefanK
>
> Thanks,
> Roman
>
>> Shenandoah devs, any comments w.r.t. to the Shenandoah section below?
>>
>> Thanks,
>> StefanK
>>
>> On 2020-03-19 10:44, Stefan Karlsson wrote:
>>> Hi all,
>>>
>>> Please review this patch to rewrite the GCTimer, and associated
>>> classes, to not allow nested phases of different types (pause or
>>> concurrent).
>>>
>>> https://cr.openjdk.java.net/~stefank/8241160/webrev.01/
>>> https://bugs.openjdk.java.net/browse/JDK-8241160
>>>
>>> A bug was found when I was looking at JFR events from ZGC. A
>>> GCPhasePauseLevel1 event was nested within a GCPhaseConcurrent. The
>>> only valid parent is a GCPhasePause event. The reason why this
>>> happened was that the we use a GCTraceTime class inside the class
>>> unloading code. Previously, we only used GCTraceTimes inside pauses,
>>> but ever since class unloading was moved out to a concurrent phase,
>>> this isn't true anymore. GCTraceTime used
>>> GCTimer::register_gc_phase_start(name, Ticks, phase = <pause>), and
>>> therefore always reported pauses and pause sub-phases.
>>>
>>> With this patch, I suggest that we become stricter in our usages of
>>> the GCTimer. The effects of the patch are:
>>>
>>> 1) When a top-level pause (or concurrent) phase is created, the code
>>> must be explicit about what type of phase is created. The code will
>>> now assert if this is abused. Most places were already explicit, but I
>>> had to change two places:
>>>
>>> a) Shenandoah type-erased ConcurrentGCTimer and therefore didn't have
>>> access to register_gc_pause_start. I made that function public,
>>> instead of protected, so that we didn't have to deal with that problem.
>>>
>>> b) G1 used GCTraceTime to note the Remark/Cleanup pauses (in
>>> VM_G1Concurrent). This is the only place that uses GCTraceTime to
>>> start a pause. All other places use GCTraceTime to create sub-phases.
>>> I could have copy-n-pasted the entire
>>> GCTraceTime/GCTraceTimeWrapper/GCTraceTimeWrapper implementation and
>>> create a version that calls register_gc_pause_start instead of
>>> register_gc_phase_start. Instead of doing that I opted for creating a
>>> system where the code code register a set of callbacks to be called
>>> when the start and end time is registered. This is used in the backend
>>> of GCTraceTime, but then also used by G1 to allow us to not have to
>>> copy-n-paste a lot of the code.
>>>
>>> I would have liked to make GCTraceTimeImpl/GCTraceTimeWrapper agnostic
>>> to the default callbacks (unfied logging and GCTimer) but couldn't
>>> find a nice way to express that, because of the way we macro-expand
>>> the UL tags. Maybe something we can consider for a future investigation.
>>>
>>> 2) sub-phases now inherit the type from the parent phase, and there's
>>> no possibility to incorrectly nest phases anymore. This also removed
>>> the need for ConcurrentGCTimer::_is_concurrent_phase_active.
>>>
>>> 3) This allows (and encourages concurrent sub-phases). When the JFR
>>> events were ported to HotSpot, only pauses got sub-phases, because
>>> there wasn't a big need for concurrent sub-phases. In this patch I
>>> added level of sub-phases to JFR. Maybe it would be better to add more
>>> right away? (I'm not a fan of having the explicit sub-phase level
>>> events, instead of a counter in *the* phase event, but the JMC team at
>>> that time needed it to be logged as separate events. Maybe something
>>> that could be reconsidered some time)
>>>
>>> 4) The different consumers of the timestamps are separated into their
>>> own classes.
>>>
>>> 5) Shenandoah devs need to consider what to do about this change:
>>>
>>> - unloading_occurred = SystemDictionary::do_unloading(heap->gc_timer());
>>> + // FIXME: This turns off the previously broken JFR events. If we
>>> want to keep reporting them,
>>> + // but with the correct type (Concurrent) then a top-level
>>> concurrent phase is required.
>>> + unloading_occurred = SystemDictionary::do_unloading(NULL /* gc_timer
>>> */);
>>>
>>> Where this code caused GCPhasePauseLevel1 events for ZGC, this used to
>>> create GCPhasePause events for Shenandoah. It uses GCTraceTime to log
>>> sub-phases, but the current Shenandoah code hasn't registered a
>>> top-level phase at this point. Either we keep this code with the
>>> removal of the gc_timer argument, or we add a top-level phase
>>> somewhere. If we want the latter, then I need suggestions on where to
>>> add them. Or maybe push the current code, and fix it as a follow-up
>>> patch?
>>>
>>> What do you think? An alternative is to (continue?) completely forbid
>>> concurrent sub-phases, and remove the gc_timers passed to GCTraceTimes
>>> during concurrent phases. Even if we decide to do that, I think
>>> there's some merit to the stricter GCTimer code, and the slight
>>> separation of concern in GCTraceTime.
>>>
>>> Tested tier1-3
>>>
>>> Thanks,
>>> StefanK
More information about the hotspot-gc-dev
mailing list