RFR: 8356372: JVMTI heap sampling not working properly with outside TLAB allocations [v2]
Stefan Johansson
sjohanss at openjdk.org
Wed May 14 12:26:54 UTC 2025
On Tue, 13 May 2025 08:33:38 GMT, Stefan Karlsson <stefank at openjdk.org> wrote:
>> While working on improving the TLAB sizing code for ZGC @kstefanj ran into an issue with the following tests failing:
>> serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorInterpreterObjectTest.java
>> serviceability/jvmti/HeapMonitor/MyPackage/HeapMonitorStatObjectCorrectnessTest.java
>>
>> The reason for seeing the problems now is that with the new sizing code ZGC used smaller TLABs at first, before resizing them to a proper size (to lower the waste). In the HeapMonitor tests we don't allocate enough to trigger GCs that will resize the TLABs so most of the tests will now run with small TLABs. This should not be a problem, but it turns out the current sampling code is not working properly when you get a lot of outside TLAB allocations. You get those when trying to allocate a fairly large object (~1400B) that won't fit into the TLAB, but there are still quite a bit of room in the TLAB so we don't want to throw it away and take a new one.
>>
>> The problem in the current code is that we keep track of when to sample with multiple variables and when getting out of TLAB allocations these get out of sync.
>>
>> The proposed patch is the result of a restructuring and fixing of the the code that me and @kstefanj did to fix this issue.
>>
>> The idea is to better account how much we have allocated in three different buckets:
>> * Outside of TLAB allocations
>> * Accounted TLAB allocations
>> * The last bit of TLAB allocations that hasn't been accounted yet
>>
>> And then use the sum of that to compare against a *non-changing* threshold to see if it is time to take a sample.
>>
>> There are a few things to think about when reading this code:
>> * The thread can allocate and retire multiple TLABs before we cross the sample threshold.
>> * The sampling can take multiple samples in a single TLAB
>> * Any overshooting of the sample threshold triggers only one sample and the extra bytes are ignored when checking for the next sample.
>>
>> There are some restructuring in the patch to confine the ThreadHeapSampler variables and code. For example:
>>
>> 1) Moved accounting variables out of TLAB and into the ThreadHeapSampler
>> 2) Moved thread allocation accounting and sampling code out of the TLAB
>> 3) Moved retiring of TLABs out of make_parseable (needed to support (2))
>>
>> Some of that could be extracted into a separate PR if that's deemed worthwhile.
>>
>> Tested with the HeapMonitor tests various TLAB sizes.
>>
>> If there are anyone using these APIs it would be nice to get feedba...
>
> Stefan Karlsson has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>
> - Re-enable tests after merge
> - Merge remote-tracking branch 'upstream/master' into 8356372_thread_heap_sampler
> - 8356372: JVMTI heap sampling not working properly with outside TLAB allocations
Looks great :)
Suggestion to name it `reset_after_sample(...)` instead of `reset_after_sampling(...)` to be more consistent.
src/hotspot/share/runtime/threadHeapSampler.cpp line 440:
> 438: pick_next_sample();
> 439:
> 440: reset_after_sampling(tlab_top);
Suggestion:
reset_after_sample(tlab_top);
src/hotspot/share/runtime/threadHeapSampler.hpp line 110:
> 108: }
> 109:
> 110: void reset_after_sampling(HeapWord* tlab_top) {
Suggestion:
void reset_after_sample(HeapWord* tlab_top) {
-------------
Marked as reviewed by sjohanss (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/25114#pullrequestreview-2839968906
PR Review Comment: https://git.openjdk.org/jdk/pull/25114#discussion_r2088811246
PR Review Comment: https://git.openjdk.org/jdk/pull/25114#discussion_r2088810407
More information about the hotspot-dev
mailing list