Priority redistribution in Old Object Sample event

Tue May 21 18:41:01 UTC 2024

Hi Erik,

Sent a patch here https://github.com/openjdk/jdk/pull/19334

Den man. 20. maj 2024 kl. 12.58 skrev Erik Gahlin <erik.gahlin at oracle.com>:

> Hi Stig,
>
> Thanks for splitting the problem up into intent and implementation. I
> think you have understood the intent correctly.
>
> When the queue is filled, we can't just stop adding entries. One of the
> earlier entries needs to be removed to give room to a new candidate. The
> entry with the smallest allocations span is removed, and its span given to
> neighboring entries. That way entries are removed evenly over the
> application's "allocation timeline".
>
> What you say sounds reasonable, but I will have to check the
> implementation. If you have a patch, it will make it easier to evaluate.
> Otherwise I will look into it during the week and get back to you.
>
> Thanks
> Erik
>
> ------------------------------
> *From:* hotspot-jfr-dev <hotspot-jfr-dev-retn at openjdk.org> on behalf of
> Stig Rohde Døssing <stigdoessing at gmail.com>
> *Sent:* Wednesday, May 15, 2024 5:08 PM
> *To:* hotspot-jfr-dev at openjdk.org <hotspot-jfr-dev at openjdk.org>
> *Subject:* Priority redistribution in Old Object Sample event
>
> Hi,
>
> I'm having a hard time understanding the priority redistribution code for
> OldObjectSample. Some months ago, there was a thread about it at
> https://mail.openjdk.org/pipermail/hotspot-jfr-dev/2023-June/005066.html,
> but it didn't conclude one way or the other whether this code looked
> correct. I'm hoping to understand the intent, and maybe also discuss
> whether the current implementation follows that intent. I'll be describing
> my understanding in order to surface any misunderstandings, so apologies
> for the incoming wall of text. Please correct me if I'm getting things
> wrong.
>
> # The intent
>
> If we think of allocations as a "timeline", where each sampled object
> extends the timeline by adding more bytes to the right side of the line, my
> understanding is that the goal of this code is to ensure that the samples
> we keep around in the priority list are as evenly distributed across this
> timeline as possible. We'd presumably like a good distribution whether
> objects are garbage collected or not.
>
> Going by the earlier mailing list comment at
> https://mail.openjdk.org/pipermail/hotspot-jfr-dev/2024-January/005847.html,
> it seems like the intent was that the sample spans should cover the entire
> allocation timeline when summed.
>
> I think the invariant that's being aimed for is that a sample always
> covers the span of bytes allocated since the older neighboring sample was
> taken. When a node is removed, the span of the removed node should be given
> to the younger neighbor, which ensures the neighbor will cover the span
> left missing by the node being removed. When a node is removed that has no
> younger neighbor, the span should be given to the next sample added to the
> queue, which ensures the new sample will cover the span to the older
> neighbor.
>
> The total spans of the nodes in the queue will then nearly sum to the
> total allocated bytes, and any missing bit (due to removal of the youngest
> sample/rejection of incoming samples because their spans are too small)
> will be covered by the next sample added to the queue.
>
> I think the idea is that by continually adjusting sample spans as
> neighbors are GC'ed or otherwise removed, and by keeping the samples with
> the largest spans, we'll end up with samples that are well distributed
> across the timeline.
>
> # The current implementation
>
> When an object is added, the sampler calculates a new span by counting how
> many bytes have been sampled so far (including the sample potentially being
> added), and subtracting the spans of the items already in the queue.
> https://github.com/openjdk/jdk/blob/32c7681cf310c87669c502c4a8b62a7fecc93360/src/hotspot/share/jfr/leakprofiler/sampling/objectSampler.cpp#L252
> .
>
> This calculated span is used only for deciding whether to add the new
> sample to the queue. If added, the sample gets a span equal to the byte
> size of the object being allocated.
> https://github.com/openjdk/jdk/blob/32c7681cf310c87669c502c4a8b62a7fecc93360/src/hotspot/share/jfr/leakprofiler/sampling/objectSampler.cpp#L281
>
> When a sample is dropped due to GC, the span it covers is handed to the
> `prev` sample in the list. As far as I can tell, the list is ordered
> youngest-sample-first, so this adds the span to the younger neighbor
> sample. If there is no younger neighbor, the span is discarded.
> https://github.com/openjdk/jdk/blob/32c7681cf310c87669c502c4a8b62a7fecc93360/src/hotspot/share/jfr/leakprofiler/sampling/objectSampler.cpp#L309
>
> When a sample is dropped in favor of a new sample with a larger span, the
> old sample's span is discarded.
> https://github.com/openjdk/jdk/blob/32c7681cf310c87669c502c4a8b62a7fecc93360/src/hotspot/share/jfr/leakprofiler/sampling/objectSampler.cpp#L261
>
> # The bits I think might not fit the intent
>
> * The span stored in each sample is not the calculated span, it's just the
> object's byte size (`allocated`). That means as soon as any object falls
> out of the queue, the spans in the queue no longer sum to cover the
> allocation timeline. This causes all future samples to be added to be
> unduly prioritized for adding to the queue, because they are given an
> artificially high span. In effect, future samples are weighted as if they
> cover both the interval between themselves and the older neighbor sample,
> plus all "missing spans" from nodes that have been discarded since the
> program started.
>
> Just to demonstrate that there might be a problem, I tried limiting the
> sampler queue size to 3, and calling the `ObjectSampler::add` method 10
> times in a row with a dummy object and a fixed byte size of 10.
>
> The result is that the 1st, 2nd and 10th allocations are kept. This is
> because after the first 3 inserts to fill the queue, the 4th sample is
> discarded because the calculated span is 40 - 30 = 10. The 5th item will
> calculate the span 50 - 30 = 20, and so replace the smallest sample (in
> span terms), which will be the youngest since all samples have a stored
> span of 10. This will repeat for the 6th-10th items (the calculated span
> increasing by 10 per item), always replacing the most recently inserted
> sample.
>
> I think the span stored in a sample should be the span calculated when
> comparing to existing samples, and not the object's byte size.
>
> * When a sample is removed from the queue because a sample with a larger
> span is being added, the span of the removed node is not handed to the
> younger neighbor, this only happens when a sample is removed due to GC.
> This means that the span will be given to the next sample added to the
> queue. When the sample being removed is the youngest sample, this is fine,
> but when it's a sample that has a younger neighbor, the span should
> probably be given to that neighbor rather than the newcomer. Handing it to
> the newcomer gives the new sample a high weight it doesn't deserve. It ends
> up covering not just the span to the older neighbor, but also the span of
> the removed node, which doesn't seem to be what we want?
>
> I think when a sample is removed because it's being evicted in favor of a
> new sample, its span should be given to the younger neighbor. If there is
> no younger neighbor, the span should instead be given to the sample being
> added.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-jfr-dev/attachments/20240521/5f079673/attachment.htm>