RFR: 8373096: JFR leak profiler: path-to-gc-roots search should be non-recursive [v7]
Robert Toyonaga
duke at openjdk.org
Thu Dec 18 20:31:07 UTC 2025
On Thu, 18 Dec 2025 10:11:20 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:
>> A customer reported a crash when producing a JFR recording with `path-to-gc-roots=true`. It was a native stack overflow that occurred during the recursive path-to-gc-root search performed in the context of PathToGcRootsOperation.
>>
>> We try to avoid this by limiting the maximum search depth (DFSClosure::max_dfs_depth). That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small.
>>
>> This RFE changes the algorithm to be non-recursive.
>>
>> Note that as a result of this change, the order in which oop maps are walked per oop is reversed : last oops are processed first. That should not matter for the end result, however. The search is still depth-first.
>>
>> Note that after this patch, we could easily remove the max_depth limitation altogether. I left it in however since this was not the scope of this RFE.
>>
>> Testing:
>>
>> - Tested manually with very small (256K) thread stack size for the VMThread - the patched version works where the old version crashes out
>> - Compared JFR recordings from both an unpatched version (with a large enough VMThread stack size) and a patched version; made sure that the content of "Old Object Sample" was identical
>> - Ran locally all jtreg tests in jdk/jfr
>> - GHAs
>
> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
>
> do strides for arrays
I think adding this array striding is a good idea to avoid flooding the stack due to large arrays. I left a comment about a possible problem below.
src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp line 134:
> 132: ProbeStackItem psi2 = psi;
> 133: psi2.chunk ++;
> 134: _probe_stack.push(psi2);
Could it be a problem that `pointee` has already been marked? To accomplish the striding, the same `pointee` needs to be revisited with the new chunk count to evaluate the next range. However, the next time it's popped off the stack, it will get skipped over on line 108 since it's already been marked.
I ran `TestJcmdDumpPathToGCRootsBFSDFS.java` and the test passes even after I add `assert(psi.chunk==0 )` in this block, which might indicate only the first range of each array is ever getting evaluated.
-------------
PR Review: https://git.openjdk.org/jdk/pull/28659#pullrequestreview-3594914234
PR Review Comment: https://git.openjdk.org/jdk/pull/28659#discussion_r2632525732
More information about the hotspot-jfr-dev
mailing list