A followup for JDK-8258414 (OldObjectSample event usability)
Markus Gronlund
markus.gronlund at oracle.com
Wed Jan 13 12:26:41 UTC 2021
Hi,
Just to have some common background on this discussion:
The event jdk.OldObjectSample is the outward manifestation of the JFR memory leak profiler, the subsystem for tracking memory leaks by taking samples of memory allocations (objects). The profiler keeps the sampled objects in a small priority queue and stands in tight collaboration with the GCs (as the samples can be reclaimed at any time). When we introduced the memory leak profiling subsystem, it imposed new challenges on JFR, mainly because the lifetime of an object allocation sample can, and most likely will, span JFR epochs (chunks), which are the metadata / constant pool boundaries for all other events in JFR.
The tricky problem is how to efficiently and minimally save metadata for jdk.OldObjectSample events across epoch / chunk boundaries. This is a harder problem than one might think at first – it reduces to the fact that after the object sample has been taken, any external reference could already have been unloaded by the JVM.
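To ground the discussion, here is a highly simplified, illustrative-only Java sketch of the sampling side (all names are hypothetical; the real profiler is HotSpot C++): a small, bounded priority queue of samples whose referents the GC may reclaim at any time, with the stacktrace hash saved on each sample for later resolution.

    import java.lang.ref.WeakReference;
    import java.util.PriorityQueue;

    // Illustrative sketch only: a bounded priority queue of allocation samples.
    public class LeakSamplerSketch {

        static final class Sample {
            final WeakReference<Object> object;   // the GC can clear this at any time
            final long span;                      // bytes allocated since the last sample
            final long stackTraceHash;            // saved for a later reverse lookup
            Sample(Object o, long span, long stackTraceHash) {
                this.object = new WeakReference<>(o);
                this.span = span;
                this.stackTraceHash = stackTraceHash;
            }
        }

        private final int capacity;
        // Smallest span first, so the "least interesting" sample is evicted when full.
        private final PriorityQueue<Sample> queue =
                new PriorityQueue<>((a, b) -> Long.compare(a.span, b.span));

        LeakSamplerSketch(int capacity) { this.capacity = capacity; }

        void sample(Object obj, long span, long stackTraceHash) {
            if (queue.size() == capacity) {
                if (queue.peek().span >= span) {
                    return;                       // not interesting enough to displace a sample
                }
                queue.poll();                     // evict the smallest-span sample
            }
            queue.add(new Sample(obj, span, stackTraceHash));
        }
    }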
To support all of this, there is a lot of special handling in place for jdk.OldObjectSample:
1. Metadata / constant tracking uses a separate tag category, the “leakp” tag bit in the traceid field.
2. The hash of the stacktrace is saved to the sampled object representation upon creation so that, later at the epoch / chunk boundary, if the object has not been GC'ed, a reverse lookup can transform the metadata associated with the stacktrace into binary large objects (blobs in the JFR binary file format) that are installed on the sampled objects. These blobs are reference counted, so that the serialized metadata (now in blob form) can be released if / when the sampled object is reclaimed by the GC. Multiple blobs are needed for each surviving sample object, each corresponding to a part of the metadata needed to fully resolve the sampled object “post” epoch / chunk rotation: there are blobs for the thread and thread group, the stacktrace, and the transitive constant set (class, method, CLDs, packages, modules, etc.), as sketched below.
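As a rough illustration of the blob idea only (not the actual HotSpot code, which is C++; the names here are made up), a reference-counted blob could look something like this:

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative sketch only: serialized metadata shared by surviving samples,
    // released when the last sample holding it has been reclaimed.
    public class BlobSketch {

        static final class Blob {
            final byte[] serializedMetadata;      // e.g. a stacktrace or type-set constants
            private final AtomicInteger refs = new AtomicInteger();

            Blob(byte[] serializedMetadata) { this.serializedMetadata = serializedMetadata; }

            Blob acquire() {                      // called when installed on a sample
                refs.incrementAndGet();
                return this;
            }

            void release() {                      // called when the sample is reclaimed
                if (refs.decrementAndGet() == 0) {
                    // last reference gone: the serialized metadata can be dropped
                }
            }
        }
    }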
As for configuration, in the JDK we only enable stacktraces for jdk.OldObjectSample in profile.jfc – and the reason for this is mainly the size overhead being discussed here.
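For reference, a recording started from the "profile" configuration via the jdk.jfr API picks this up; the explicit enable() call below just shows the equivalent setting (the dump path is arbitrary):

    import jdk.jfr.Configuration;
    import jdk.jfr.Recording;
    import java.nio.file.Path;

    public class LeakProfilerConfig {
        public static void main(String[] args) throws Exception {
            // The "profile" configuration enables jdk.OldObjectSample with stacktraces.
            Recording r = new Recording(Configuration.getConfiguration("profile"));
            // Equivalent explicit setting:
            r.enable("jdk.OldObjectSample").withStackTrace();
            r.start();
            // ... run the workload ...
            r.stop();
            r.dump(Path.of("leak.jfr"));
            r.close();
        }
    }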
The way it is currently done, with jdk.OldObjectSample candidates taking a stacktrace before being selected by the memory leak profiler, is an optimization based on the profile.jfc configuration:
In profile.jfc, the memory allocation tracking events jdk.ObjectAllocationInNewTLAB and jdk.ObjectAllocationOutsideTLAB are, or at least were up to JDK 16, also enabled with stacktraces turned on.
There is an optimization in place that shares the stacktrace id for all three memory allocation events at the common allocation site, based on the assumption that they are enabled together as a unit in profile.jfc.
Of course, the memory allocation events are known to be very high traffic and produce a lot of data through the sheer number of events alone; if stacktraces are added on top of that, the associated metadata can become huge as well.
The shortcomings of the existing memory allocation events, and the lack of any means to control the amount of data produced, are what led to the introduction of event throttling as per https://bugs.openjdk.java.net/browse/JDK-8257602. With JDK-8257602, the old memory allocation events (jdk.ObjectAllocationInNewTLAB and jdk.ObjectAllocationOutsideTLAB) have been disabled in profile.jfc, and the memory allocation tracing functionality has been replaced with a new, now throttled, jdk.ObjectAllocationSample event.
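For completeness, this is roughly how the new event is configured from the jdk.jfr API; the throttle value below is only illustrative:

    import jdk.jfr.Recording;
    import java.nio.file.Path;

    public class AllocationSampling {
        public static void main(String[] args) throws Exception {
            Recording r = new Recording();
            // jdk.ObjectAllocationSample (JDK-8257602) is rate-limited via its
            // "throttle" setting instead of emitting an event for every allocation
            // that crosses a TLAB boundary.
            r.enable("jdk.ObjectAllocationSample")
             .with("throttle", "150/s")           // illustrative rate
             .withStackTrace();
            r.start();
            // ... run the workload ...
            r.stop();
            r.dump(Path.of("alloc.jfr"));
            r.close();
        }
    }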
So, with JDK-8257602, a basic premise is no longer valid (that a stacktrace will unconditionally be taken at the allocation site), and we need to update the immediate environment affected by this, most notably jdk.OldObjectSample and its stacktraces. Granted, this is also a problem that exists when custom profiles are used, where arbitrary event configurations are possible.
I would also like to state, from the start, that there is work planned for stacktraces in general – this involves the way they are captured, processed and represented – so things will be changing in this area.
For the sake of discussion, let’s talk about ways this can be done better with the existing situation.
Jaroslav wrote:
“Employing throttling, on the other hand, will guarantee that only used stacktraces will be in the constant pool - at the price of not being able to get N most recent events as the events would be randomly distributed across a time period.”
I don’t think throttling will solve this problem – the reason being the volatile nature of the sampled object candidates, as they can be reclaimed by the GC at any time. When they are, they are removed from the priority queue, but their stacktraces would still be in the “constant pool”. For that to work, there would need to be some kind of reference counting scheme back to the just-produced stacktrace – and not only that: since methods have also been tagged for use, a pruning approach would also need to de-tag constants, but an invariant in the JFR system is that no de-tagging happens within the same epoch.
What I can think of that might be feasible without too much invention is something like this (a rough sketch follows the notes below):
1. Add a separate stacktrace table, only to be used by jdk.OldObjectSample events. Sample candidates register their stacktraces into this table as before, outside the try_lock().
2. When a stacktrace is being generated for insertion, do _not_ tag the methods at that point (as is done today). Save the stacktrace hash to the sample candidate as before.
3. At chunk rotation, inspect all the surviving sample candidates and perform a reverse lookup (just like today), but do it against this new stacktrace table.
4. At the point of creating the stacktrace blob, set the “leakp” tag for methods in the trace (just like is done today).
5. The constant pool serialization (JFR type set) then saves “leakp” tagged constants and installs a constant pool metadata blob to the surviving candidates (just like today).
Note that this new stacktrace table is not serialized wholesale (as is done with the regular stacktrace table) – it is only used to hold stacktraces that may eventually be transformed into blobs for surviving candidates.
Some details would need to be tuned: for example, some logic needs to be added back to de-optimize parts of the object sample checkpoint serialization process (it currently skips parts because it relies on the current rotation to serialize for the most recent candidates).
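To make the shape of the proposal concrete, here is a rough, illustrative-only Java sketch (the real code would be HotSpot C++, and all names here are hypothetical):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the proposed separate stacktrace table for jdk.OldObjectSample.
    public class OldObjectStackTraceTableSketch {

        record Frame(String method) {}                 // stands in for a JFR stack frame
        record StackTrace(long hash, List<Frame> frames) {}

        private final Map<Long, StackTrace> table = new HashMap<>();

        // Steps 1-2: register the trace, keep only the hash on the sample candidate.
        long register(StackTrace trace) {
            table.putIfAbsent(trace.hash(), trace);    // no method tagging here
            return trace.hash();
        }

        // Steps 3-5: at chunk rotation, resolve only the surviving candidates.
        void onChunkRotation(List<Long> survivingSampleHashes) {
            for (long hash : survivingSampleHashes) {
                StackTrace trace = table.get(hash);    // reverse lookup in the new table
                if (trace != null) {
                    tagLeakp(trace);                   // set the "leakp" bit only now
                    installBlob(trace);                // serialize and install on the sample
                }
            }
            table.clear();                             // the table is never serialized wholesale
        }

        private void tagLeakp(StackTrace trace) { /* tag each method in the trace */ }
        private void installBlob(StackTrace trace) { /* create a refcounted blob for the sample */ }
    }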
The main drawbacks are the lack of stacktrace re-use between ObjectAllocationSample and OldObjectSample (though maybe something smart can be done to enable it), and somewhat more complexity added to an already complex situation.
Thanks
Markus
-----Original Message-----
From: Jaroslav Bachorík <jaroslav.bachorik at datadoghq.com>
Sent: den 12 januari 2021 15:35
To: hotspot-jfr-dev <hotspot-jfr-dev at openjdk.java.net>
Cc: Florian David <florian.david at datadoghq.com>; Marcus Hirt <marcus.hirt at datadoghq.com>
Subject: A followup for JDK-8258414 (OldObjectSample event usability)
Hello,
I would like to pick up the thread from the JIRA to pre-discuss the possible course of action here.
As I've already mentioned in JIRA, the performance hit was significantly reduced by the stacktrace hashcode fix coming in the upcoming updates. Unfortunately, it does not solve the gigantic recording size – at >100MB per recording (1 minute) we must turn off OldObjectSample event stacktrace collection for our customers :(
The reason is that the stacktrace is collected for all OldObjectSample event instantiations and is never removed from the constant pool, even if there are no events referring to it.
Unfortunately, the option of delaying the stacktrace collection until after a successful try-lock will only reduce the overhead by limiting the number of collections; eventually, the constant pool will still grow significantly if we are to keep the limit of the most recent N old object samples.
Employing throttling, on the other hand, will guarantee that only used stacktraces will be in the constant pool - at the price of not being able to get N most recent events as the events would be randomly distributed across a time period.
At this point I am trying to understand the importance of having 'most recent' events vs. a sampled event set. If most recent events are required and inevitable we will have to come up with a way to purge unused entries from the stacktrace constant pool.
Thanks and looking for feedback!
-JB-