RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive
This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659.

----

A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly; see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427, etc.).

We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g. compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small.

This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read.

### Memory usage of old vs new algorithm:

The new algorithm uses, on average, a bit less memory than the old one. The old algorithm cost (avg stack frame size in bytes) * depth. As we have seen, e.g. in JDK-8371630, a depth of 3200 can max out ~1MB of stack space.

The new algorithm costs (avg number of outgoing refs per instanceKlass oop) * depth * 16 bytes. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap the probe stack size, similar to how we cap the maximum graph depth.

In any case, these numbers are nothing to worry about. For a more in-depth explanation of the memory cost, please see the comment in dfsClosure.cpp.

### Possible improvements/simplifications in the future:

DFS now works perfectly well on its own. It no longer depends on stack size, and its memory usage is typically smaller than that of BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later.
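For readers unfamiliar with the transformation, the shape of the rewrite can be sketched in isolation. The toy below (all names - `Graph`, `Probe`, `find_path` - are illustrative, not the actual HotSpot code) replaces recursion with an explicit probe stack of small fixed-size entries - the analogue of the ~16-byte entries in the cost estimate above - keeps a reference-stack analogue for path reconstruction, and shows why children get visited in reverse push order:

```cpp
#include <vector>

// Toy object graph: refs[i] lists the outgoing references of node i.
struct Graph {
    std::vector<std::vector<int>> refs;
};

// Non-recursive DFS from `root` to `target`. Returns the found path
// (root..target) or an empty vector if target is unreachable within
// `max_depth`. Each probe entry is a fixed-size (node, depth) pair
// instead of a native stack frame.
std::vector<int> find_path(const Graph& g, int root, int target, int max_depth) {
    struct Probe { int node; int depth; };
    std::vector<Probe> probe_stack;                 // replaces the call stack
    std::vector<int> path;                          // reference-stack analogue
    std::vector<bool> visited(g.refs.size(), false);

    probe_stack.push_back({root, 0});
    while (!probe_stack.empty()) {
        Probe p = probe_stack.back();
        probe_stack.pop_back();
        path.resize(p.depth);                       // unwind to this probe's depth
        if (visited[p.node]) continue;
        visited[p.node] = true;
        path.push_back(p.node);
        if (p.node == target) return path;
        if (p.depth + 1 >= max_depth) continue;     // depth cap, as in the tracer
        // Children are popped in reverse push order: the last outgoing ref
        // is explored first, unlike in the recursive version.
        for (int r : g.refs[p.node])
            probe_stack.push_back({r, p.depth + 1});
    }
    return {};
}
```

The real tracer additionally chunks large arrays, caps the probe stack size, and marks visited objects via mark words rather than a bitmap; the sketch only captures the control-flow transformation.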
I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult.

### Observable differences

There is one observable side effect of the changed algorithm: the non-recursive algorithm processes oops and roots in reverse order. That means we may see different GC roots in JMC now compared to the recursive version of this algorithm. For example, for a static main class member:

- One order of root processing may process the CLDG first; this causes the CLDs to be iterated, which iterates the Klass, which iterates the associated j.l.Class object. The reference stack starts at the j.l.Class object, and the root displayed in JMC says "ClassLoader".
- Another root processing order may process the global handles first; this causes the AppClassLoader to be processed first, which causes the main class mirror to be processed, and we also hit the static member, but now the root in JMC is displayed as "Global Object Handle: VM Global".

I think that effect is benign - a root is a root. But, as a future improvement, I think displaying the first (or first n) objects of the reference stack would be very helpful to the analyst (possibly more so than just printing the roots). But maybe that feature exists already; I am no JMC expert.

## Testing

- I ran extensive tests manually to make sure the resulting jfr files showed the same information (apart from the observable differences mentioned above).
- I manually tested that we handle max_depth reached and probe stack exhaustion gracefully.
- I added a new test to execute DFS-only on a very small stack size. Works fine (crashes with a stock JVM).
- I added a new test that exercises the new array chunking in the tracer.
- I made sure the patch fixes https://bugs.openjdk.org/browse/JDK-8371630 by running TestWaste.java with a tiny stack size - it crashes without the patch and works with it.

-------------

Commit messages:
 - fix after JDK-8375040
 - tests
 - wip
 - revert unnecessary changes
 - wip
 - small stack test
 - tweaking DFS-BFS test
 - fix windows build warning
 - reduce diff
 - fix performance problem on bfs-dfs mixed mode
 - ... and 8 more: https://git.openjdk.org/jdk/compare/cba7d88c...5d79624b

Changes: https://git.openjdk.org/jdk/pull/29382/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8373096
  Stats: 543 lines in 8 files changed: 502 ins; 8 del; 33 mod
  Patch: https://git.openjdk.org/jdk/pull/29382.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382

PR: https://git.openjdk.org/jdk/pull/29382
On Fri, 23 Jan 2026 10:18:05 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
Ping @roberttoyonaga, @egahlin

-------------

PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3805038334
On Fri, 23 Jan 2026 10:18:05 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
This looks good to me! And it fixes the [problem we talked about earlier](https://github.com/openjdk/jdk/pull/28659#discussion_r2632525732). I have left one minor comment below.

src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp line 217:
215:   _current_pointee = _current_ref.dereference();
216:
217:   _num_objects_processed++;
For large arrays that have many chunks, each chunk will count as another "object" processed. Is this okay?

-------------

Marked as reviewed by roberttoyonaga@github.com (no known OpenJDK username).

PR Review: https://git.openjdk.org/jdk/pull/29382#pullrequestreview-3712943595
PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2733410597
On Tue, 27 Jan 2026 18:57:56 GMT, Robert Toyonaga <duke@openjdk.org> wrote:
Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
dont incremement _num_objects_processed for follow-up chunks
src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp line 217:
215:   _current_pointee = _current_ref.dereference();
216:
217:   _num_objects_processed++;
For large arrays that have many chunks, each chunk will count as another "object" processed. Is this okay?
Thank you for catching that; I fixed it.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2735291019
On Tue, 27 Jan 2026 20:00:51 GMT, Robert Toyonaga <duke@openjdk.org> wrote:
Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
dont incremement _num_objects_processed for follow-up chunks
This looks good to me! And it fixes the [problem we talked about earlier](https://github.com/openjdk/jdk/pull/28659#discussion_r2632525732). I have left one minor comment below.
Thank you, @roberttoyonaga!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3809764432
On Fri, 23 Jan 2026 10:18:05 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp line 125:
123: // smaller. Not a problem at all.
124: //
125: // But we could run into weird pathological object graphs. Therfore we also
Suggestion:

// But we could run into weird pathological object graphs. Therefore we also

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2733627437
Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:

  Update src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp

  Co-authored-by: Robert Toyonaga <rtoyonag@redhat.com>

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/29382/files
  - new: https://git.openjdk.org/jdk/pull/29382/files/5d79624b..af5c5585

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=00-01

Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/29382.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382

PR: https://git.openjdk.org/jdk/pull/29382
Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:

  dont incremement _num_objects_processed for follow-up chunks

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/29382/files
  - new: https://git.openjdk.org/jdk/pull/29382/files/af5c5585..66276985

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=01-02

Stats: 6 lines in 1 file changed: 4 ins; 2 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/29382.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382

PR: https://git.openjdk.org/jdk/pull/29382
On Wed, 28 Jan 2026 07:53:47 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
dont incremement _num_objects_processed for follow-up chunks
src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp line 152:
150:     assert(_probe_stack.is_empty(), "We should have drained the probe stack?");
151:   }
152:   log_info(jfr, system, dfs)("DFS: objects processed: " UINT64_FORMAT ","
I think it would be better to use the already existing oldobject log tag instead of a new dfs tag. Also, info is a bit verbose; debug might be better.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2736092382
On Wed, 28 Jan 2026 07:53:47 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
dont incremement _num_objects_processed for follow-up chunks
test/jdk/jdk/jfr/jcmd/TestJcmdDumpPathToGCRootsDFSBase.java line 53:
51:         try (Recording r = new Recording()) {
52:             Map<String, String> p = new HashMap<>(settings);
53:             p.put(EventNames.OldObjectSample + "#" + Enabled.NAME, "true");
No need to set disk to true, it's true by default. It is much easier to enable an event this way:

r.enable(EventNames.OldObjectSample);

test/jdk/jdk/jfr/jcmd/TestJcmdDumpPathToGCRootsDFSBase.java line 67:
65:             File recording = new File(jfrFileName + r.getId() + ".jfr");
66:             recording.delete();
67:             JcmdHelper.jcmd("JFR.dump", "name=dodo", pathToGcRoots, "filename=" + recording.getAbsolutePath());
Why do we need to do this using jcmd?

test/jdk/jdk/jfr/jcmd/TestJcmdDumpPathToGCRootsDFSBase.java line 68:
66:             recording.delete();
67:             JcmdHelper.jcmd("JFR.dump", "name=dodo", pathToGcRoots, "filename=" + recording.getAbsolutePath());
68:             r.setSettings(Collections.emptyMap());
No need to clear settings.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2738506314
PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2738511154
PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2738513128
On Wed, 28 Jan 2026 20:36:34 GMT, Erik Gahlin <egahlin@openjdk.org> wrote:
Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
dont incremement _num_objects_processed for follow-up chunks
test/jdk/jdk/jfr/jcmd/TestJcmdDumpPathToGCRootsDFSBase.java line 67:
65:             File recording = new File(jfrFileName + r.getId() + ".jfr");
66:             recording.delete();
67:             JcmdHelper.jcmd("JFR.dump", "name=dodo", pathToGcRoots, "filename=" + recording.getAbsolutePath());
Why do we need to do this using jcmd?
Hmm, I just followed the same approach as `TestJcmdDumpPathToGCRoots`. I guess the alternative would be to start a child-process JVM with -XX:StartFlightRecording dumponexit=true?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2741088293
On Thu, 29 Jan 2026 11:00:49 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
test/jdk/jdk/jfr/jcmd/TestJcmdDumpPathToGCRootsDFSBase.java line 67:
65:             File recording = new File(jfrFileName + r.getId() + ".jfr");
66:             recording.delete();
67:             JcmdHelper.jcmd("JFR.dump", "name=dodo", pathToGcRoots, "filename=" + recording.getAbsolutePath());
Why do we need to do this using jcmd?
Hmm, I just followed the same approach as `TestJcmdDumpPathToGCRoots`. I guess the alternative would be to start a child process JVM with -XX:StartFlightRecording dumponexit=true?
This is how I would implement it: [patch.txt](https://github.com/user-attachments/files/24935579/patch.txt)

- Put all relevant information in the test file so that, when it fails, it can be easily analyzed without having to flip back and forth between a base class during triage, for example.
- Dump data programmatically, as it is quicker and easier to understand.
- Use the Events class to dump and extract recording data.
- Place helper methods in the OldObjects class for reuse.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2741301803
On Wed, 28 Jan 2026 07:53:47 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659.
----
A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc).
We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small.
This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read.
### Memory usage of old vs new algorithm:
The new algorithm uses, on average, a bit less memory than the old one. The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space.
The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth.
In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp.
### Possible improvements/simplifications in the future:
DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later.
I kept the _max_dfs_depth_ parameter for now, but, to be honest, it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult.
### Observable differences
There is one observable side effect to the changed algorithm. The non-recursive algorithm processes oops a...
Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
dont incremement _num_objects_processed for follow-up chunks
@egahlin thank you for reviewing this. I changed the UL tag as suggested and changed the test (also removed the cutoff setting, since infinity is the default). ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3817180949
Thomas Stuefe has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 23 additional commits since the last revision: - erics test remarks - different ul log tag - Merge branch 'master' into JFR-leak-profiler-path-to-gc-roots-non-recursive-take2-with-tracing - dont incremement _num_objects_processed for follow-up chunks - Update src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp Co-authored-by: Robert Toyonaga <rtoyonag@redhat.com> - fix after JDK-8375040 - tests - wip - revert unnecessary changes - wip - ... and 13 more: https://git.openjdk.org/jdk/compare/dd83f006...dd8c1bfd ------------- Changes: - all: https://git.openjdk.org/jdk/pull/29382/files - new: https://git.openjdk.org/jdk/pull/29382/files/66276985..dd8c1bfd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=02-03 Stats: 5269 lines in 146 files changed: 3544 ins; 904 del; 821 mod Patch: https://git.openjdk.org/jdk/pull/29382.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382 PR: https://git.openjdk.org/jdk/pull/29382
Thomas Stuefe has updated the pull request incrementally with two additional commits since the last revision: - remove unnecessary diff - copyrights ------------- Changes: - all: https://git.openjdk.org/jdk/pull/29382/files - new: https://git.openjdk.org/jdk/pull/29382/files/dd8c1bfd..ecc2bb74 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=03-04 Stats: 8 lines in 7 files changed: 0 ins; 1 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/29382.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382 PR: https://git.openjdk.org/jdk/pull/29382
On Thu, 29 Jan 2026 12:01:20 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
Thomas Stuefe has updated the pull request incrementally with two additional commits since the last revision:
- remove unnecessary diff - copyrights
@egahlin yes, that is nicer and simpler. I adapted your approach. Not sure if this helps, but back in December, when I rewrote this, I drew up a quick visio to help with reviews. I'll attach the pdf. [JFR-leakprofiler-DFS.pdf](https://github.com/user-attachments/files/24940854/JFR-leakprofiler-DFS.pdf) ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3818189519
On Thu, 29 Jan 2026 14:53:01 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
@egahlin yes, that is nicer and simpler. I adapted your approach.
Not sure if this helps, but back in December, when I rewrote this, I drew up a quick visio to help with reviews. I'll attach the pdf. [JFR-leakprofiler-DFS.pdf](https://github.com/user-attachments/files/24940854/JFR-leakprofiler-DFS.pdf)
Thanks, I need some more time to review your PR, but the change is now more confined and should not be hard to backport. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3823663593
Thomas Stuefe has updated the pull request incrementally with three additional commits since the last revision: - remove unnecessary copyright change - remove debug output - Erics test suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/29382/files - new: https://git.openjdk.org/jdk/pull/29382/files/ecc2bb74..957e001a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=04-05 Stats: 442 lines in 9 files changed: 195 ins; 245 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/29382.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382 PR: https://git.openjdk.org/jdk/pull/29382
On Thu, 29 Jan 2026 14:57:04 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
Thomas Stuefe has updated the pull request incrementally with three additional commits since the last revision:
- remove unnecessary copyright change - remove debug output - Erics test suggestions
There is still a log_info in the DFSClosure destructor that should be log_debug. I wonder if you have looked at performance. For example, in which order is it best to check for nullptr, whether the oop has been visited, or whether the stack is full? You added a _num_objects_processed variable, but it's never used, so you might want to remove it. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3841536590
On Tue, 3 Feb 2026 14:05:51 GMT, Erik Gahlin <egahlin@openjdk.org> wrote:
I wonder if you have looked at performance. For example, in which order is it best to check for nullptr, oop has been visited or whether the stack is full? You added a _num_objects_processed variable, but it’s never used, so you might want to remove it.
Will do.
For the future, I think we want to keep BFS. Initially, I only had DFS, but the chains became so weird that I had to implement BFS.
Sure; you are the maintainer, after all.
Regarding ordering, I think we want an order that makes the most sense to the user. ClassLoader is easier for users to understand than Global Object Handle. I'm not sure if that code is still present, but we had a specific order in which we processed roots.
Did you mean this? https://github.com/openjdk/jdk/blob/99bc98357dab78bef2cce7a10c98d13d1e5730e3... I can reverse the order in there and thus get the (roughly) reversed order. I already tested that, but refrained from adding it to the patch. What do you think about printing the actual (first, first and second, ...) objects that are referenced by the roots? I think that would often help; it helped me a lot in understanding what happened. In fact, I just traced the whole chain during development, hence the logging added to add_chain(). ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3841744021
On Tue, 3 Feb 2026 14:42:03 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
Did you mean this? https://github.com/openjdk/jdk/blob/99bc98357dab78bef2cce7a10c98d13d1e5730e3...
Yes, we start with classes first because that typically makes the most sense to users. The problem with starting with threads is that you can get odd chains if one thread holds references to other threads. Generally, stacks are harder to understand compared to an inner class (e.g. a listener) holding a reference to the outer class.
I can reverse the order in there and thus get the (roughly) reversed order. I already tested that, but refrained from adding it to the patch.
I think we want to process roots in the same order as in BFS.
What do you think about printing the actual (first, first and second ...) objects that are referenced by the roots? I think that would often help a lot. It helped me a lot in understanding what happened.
I have used 'jfr print --events OldObjectSample' when debugging issues. At one point, I had a JavaFX GUI that allowed me to explore chains, but I think JMC can do that today. I have not studied non-leaks that much. I don't have a strong opinion on what to show in trace mode, as long as it doesn't slow down product builds. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3847232784
On Wed, 4 Feb 2026 12:43:54 GMT, Erik Gahlin <egahlin@openjdk.org> wrote:
@egahlin @roberttoyonaga I put this back into draft and propose a minimally invasive patch that "works well enough", either until I get the non-recursiveness bulletproof for all corner cases or until we get rid of DFS.
Reason: I keep running into very tricky corner cases in the BFS+DFS mixed mode at the handoff point to DFS, caused by:
- The fact that the new non-recursive DFS can (in theory) stop at any point if the DFS probe stack runs full; that creates a new stop condition that did not exist before. Previously, DFS would stop at a maximum stack depth >>> 1. Now, in theory, DFS can stop at recursion level 1. That means we cannot guarantee that the oop handed down to DFS is fully iterated.
- That causes problems since the BFS->DFS handoff oop is iterated twice: once in BFS, once in DFS. If the handoff oop is a large object array and it does not get fully iterated by DFS, BFS will reenter DFS with the same oop. This causes the runtime to scale quadratically with the array size (similar to https://bugs.openjdk.org/browse/JDK-8373490).
All of this is solvable, but no solution appeals to me, and they all make the patch even more complex. In light of the impending DFS removal, I will pull this back to draft and propose a simpler patch (the main concern is to get a patch out the door quickly for our customer). ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3907529456
On Tue, 3 Feb 2026 14:42:03 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
I wonder if you have looked at performance. For example, in which order is it best to check for nullptr, oop has been visited or whether the stack is full? You added a _num_objects_processed variable, but it’s never used, so you might want to remove it.
It is better to check if the stack is full first. 10 million references take about 10.7 seconds (median user CPU time) with the mark-bit check first, and about 9.7 seconds with the stack-fullness check first. That makes sense, since the stack-fullness check just compares two integers, whereas the mark-bit query is more involved. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3852622930
On Thu, 5 Feb 2026 10:27:19 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
And note that the new solution is faster than the old recursive one, which takes about 11.5 seconds user CPU time on my machine. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3852705760
On Thu, 29 Jan 2026 14:57:04 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
Thomas Stuefe has updated the pull request incrementally with three additional commits since the last revision:
- remove unnecessary copyright change - remove debug output - Erics test suggestions
For the future, I think we want to keep BFS. Initially, I only had DFS, but the chains became so weird that I had to implement BFS. Regarding ordering, I think we want an order that makes the most sense to the user. ClassLoader is easier for users to understand than Global Object Handle. I'm not sure if that code is still present, but we had a specific order in which we processed roots. I think BFS will be able to cover most of the heap, and if there is a "linked list" where DFS would need to take over, the order is unlikely to matter at that point. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3841633851
Thomas Stuefe has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 31 additional commits since the last revision:
- nullness check first
- Merge branch 'master' into JFR-leak-profiler-path-to-gc-roots-non-recursive-take2-with-tracing
- Merge branch 'master' into JFR-leak-profiler-path-to-gc-roots-non-recursive-take2-with-tracing
- remove unnecessary copyright change
- remove debug output
- Erics test suggestions
- remove unnecessary diff
- copyrights
- erics test remarks
- different ul log tag
- ... and 21 more: https://git.openjdk.org/jdk/compare/2817ec85...df13e722

-------------

Changes:
- all: https://git.openjdk.org/jdk/pull/29382/files
- new: https://git.openjdk.org/jdk/pull/29382/files/957e001a..df13e722

Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=06
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=05-06

Stats: 50301 lines in 667 files changed: 24375 ins; 16740 del; 9186 mod
Patch: https://git.openjdk.org/jdk/pull/29382.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382
PR: https://git.openjdk.org/jdk/pull/29382
On Thu, 5 Feb 2026 10:31:37 GMT, Thomas Stuefe <stuefe@openjdk.org> wrote:
src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp line 153:
151:     assert(_probe_stack.is_empty(), "We should have drained the probe stack?");
152: }
153:     log_info(jfr, system, oldobject)("DFS: objects processed: " UINT64_FORMAT ","
log_trace or log_debug ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2782546225
participants (3)
- Erik Gahlin
- Robert Toyonaga
- Thomas Stuefe