Priority redistribution in Old Object Sample event

Wed May 15 15:08:18 UTC 2024

 Hi,

I'm having a hard time understanding the priority redistribution code for
OldObjectSample. Some months ago, there was a thread about it at
https://mail.openjdk.org/pipermail/hotspot-jfr-dev/2023-June/005066.html,
but it didn't conclude one way or the other whether this code looked
correct. I'm hoping to understand the intent, and maybe also discuss
whether the current implementation follows that intent. I'll be describing
my understanding in order to surface any misunderstandings, so apologies
for the incoming wall of text. Please correct me if I'm getting things
wrong.

# The intent

If we think of allocations as a "timeline", where each sampled object
extends the timeline by adding more bytes to the right side of the line, my
understanding is that the goal of this code is to ensure that the samples
we keep around in the priority list are as evenly distributed across this
timeline as possible. We'd presumably like a good distribution whether
objects are garbage collected or not.

Going by the earlier mailing list comment at
https://mail.openjdk.org/pipermail/hotspot-jfr-dev/2024-January/005847.html,
it seems like the intent was that the sample spans should cover the entire
allocation timeline when summed.

I think the invariant that's being aimed for is that a sample always covers
the span of bytes allocated since the older neighboring sample was taken.
When a node is removed, the span of the removed node should be given to the
younger neighbor, which ensures the neighbor will cover the span left
missing by the node being removed. When a node is removed that has no
younger neighbor, the span should be given to the next sample added to the
queue, which ensures the new sample will cover the span to the older
neighbor.

The total spans of the nodes in the queue will then nearly sum to the total
allocated bytes, and any missing bit (due to removal of the youngest
sample/rejection of incoming samples because their spans are too small)
will be covered by the next sample added to the queue.

I think the idea is that by continually adjusting sample spans as neighbors
are GC'ed or otherwise removed, and by keeping the samples with the largest
spans, we'll end up with samples that are well distributed across the
timeline.

# The current implementation

When an object is added, the sampler calculates a new span by counting how
many bytes have been sampled so far (including the sample potentially being
added), and subtracting the spans of the items already in the queue.
https://github.com/openjdk/jdk/blob/32c7681cf310c87669c502c4a8b62a7fecc93360/src/hotspot/share/jfr/leakprofiler/sampling/objectSampler.cpp#L252
.

This calculated span is used only for deciding whether to add the new
sample to the queue. If added, the sample gets a span equal to the byte
size of the object being allocated.
https://github.com/openjdk/jdk/blob/32c7681cf310c87669c502c4a8b62a7fecc93360/src/hotspot/share/jfr/leakprofiler/sampling/objectSampler.cpp#L281

When a sample is dropped due to GC, the span it covers is handed to the
`prev` sample in the list. As far as I can tell, the list is ordered
youngest-sample-first, so this adds the span to the younger neighbor
sample. If there is no younger neighbor, the span is discarded.
https://github.com/openjdk/jdk/blob/32c7681cf310c87669c502c4a8b62a7fecc93360/src/hotspot/share/jfr/leakprofiler/sampling/objectSampler.cpp#L309

When a sample is dropped in favor of a new sample with a larger span, the
old sample's span is discarded.
https://github.com/openjdk/jdk/blob/32c7681cf310c87669c502c4a8b62a7fecc93360/src/hotspot/share/jfr/leakprofiler/sampling/objectSampler.cpp#L261

# The bits I think might not fit the intent

* The span stored in each sample is not the calculated span, it's just the
object's byte size (`allocated`). That means as soon as any object falls
out of the queue, the spans in the queue no longer sum to cover the
allocation timeline. This causes all future samples to be added to be
unduly prioritized for adding to the queue, because they are given an
artificially high span. In effect, future samples are weighted as if they
cover both the interval between themselves and the older neighbor sample,
plus all "missing spans" from nodes that have been discarded since the
program started.

Just to demonstrate that there might be a problem, I tried limiting the
sampler queue size to 3, and calling the `ObjectSampler::add` method 10
times in a row with a dummy object and a fixed byte size of 10.

The result is that the 1st, 2nd and 10th allocations are kept. This is
because after the first 3 inserts to fill the queue, the 4th sample is
discarded because the calculated span is 40 - 30 = 10. The 5th item will
calculate the span 50 - 30 = 20, and so replace the smallest sample (in
span terms), which will be the youngest since all samples have a stored
span of 10. This will repeat for the 6th-10th items (the calculated span
increasing by 10 per item), always replacing the most recently inserted
sample.

I think the span stored in a sample should be the span calculated when
comparing to existing samples, and not the object's byte size.

* When a sample is removed from the queue because a sample with a larger
span is being added, the span of the removed node is not handed to the
younger neighbor, this only happens when a sample is removed due to GC.
This means that the span will be given to the next sample added to the
queue. When the sample being removed is the youngest sample, this is fine,
but when it's a sample that has a younger neighbor, the span should
probably be given to that neighbor rather than the newcomer. Handing it to
the newcomer gives the new sample a high weight it doesn't deserve. It ends
up covering not just the span to the older neighbor, but also the span of
the removed node, which doesn't seem to be what we want?

I think when a sample is removed because it's being evicted in favor of a
new sample, its span should be given to the younger neighbor. If there is
no younger neighbor, the span should instead be given to the sample being
added.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-jfr-dev/attachments/20240515/671489e5/attachment-0001.htm>