From ysuenaga at openjdk.org Thu Jan 1 09:24:59 2026 From: ysuenaga at openjdk.org (Yasumasa Suenaga) Date: Thu, 1 Jan 2026 09:24:59 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Sat, 29 Nov 2025 06:06:16 GMT, Yasumasa Suenaga wrote: > The jtreg test TestEmergencyDumpAtOOM.java runs into the following error on ppc64 platforms. > > JFR emergency dump would be kicked at `VMError::report_and_die()`, then Java thread for JFR would not work due to secondary signal handler for error reporting. > > Passed all of jdk_jfr tests on Linux AMD64. I'm still waiting for second reviewer. @mgronlun Can you take a look? ------------- PR Comment: https://git.openjdk.org/jdk/pull/28563#issuecomment-3703460604 From mbaesken at openjdk.org Fri Jan 2 08:48:02 2026 From: mbaesken at openjdk.org (Matthias Baesken) Date: Fri, 2 Jan 2026 08:48:02 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: <-Vb3RjzTqS7Lo9tMqVYqEbRQ1nYLg20mfcehWEe6Vm4=.e4dba3af-59af-42c9-b0bc-0d6e57122121@github.com> On Sat, 29 Nov 2025 06:06:16 GMT, Yasumasa Suenaga wrote: > The jtreg test TestEmergencyDumpAtOOM.java runs into the following error on ppc64 platforms. > > JFR emergency dump would be kicked at `VMError::report_and_die()`, then Java thread for JFR would not work due to secondary signal handler for error reporting. > > Passed all of jdk_jfr tests on Linux AMD64. Marked as reviewed by mbaesken (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/28563#pullrequestreview-3622449994 From haosun at openjdk.org Fri Jan 2 09:35:58 2026 From: haosun at openjdk.org (Hao Sun) Date: Fri, 2 Jan 2026 09:35:58 GMT Subject: [jdk26] RFR: 8373122: JFR build failure with CDS disabled due to -Werror=unused-function after JDK-8365400 In-Reply-To: References: Message-ID: <2VwhhoYJeHgX4snjrhAmQsQD9kLrIIBhobUh4-M7kNo=.43508487-ee4c-44cd-b31e-fe4624b7da66@github.com> On Wed, 24 Dec 2025 03:46:49 GMT, Hao Sun wrote: > Hi all, > > This pull request contains a backport of commit [e1d81c09](https://github.com/openjdk/jdk/commit/e1d81c0946364a266a006481a8fbbac24c7e6c6a) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Hao Sun on 23 Dec 2025 and was reviewed by Markus Grönlund, Jie Fu and Francesco Andreuzzi. > > Thanks! Hi, I would appreciate it if you could help review this backport patch again. Thanks. @fandreuz @DamonFool and @mgronlun ------------- PR Comment: https://git.openjdk.org/jdk/pull/28976#issuecomment-3704861285 From fandreuzzi at openjdk.org Fri Jan 2 11:43:55 2026 From: fandreuzzi at openjdk.org (Francesco Andreuzzi) Date: Fri, 2 Jan 2026 11:43:55 GMT Subject: [jdk26] RFR: 8373122: JFR build failure with CDS disabled due to -Werror=unused-function after JDK-8365400 In-Reply-To: References: Message-ID: On Wed, 24 Dec 2025 03:46:49 GMT, Hao Sun wrote: > Hi all, > > This pull request contains a backport of commit [e1d81c09](https://github.com/openjdk/jdk/commit/e1d81c0946364a266a006481a8fbbac24c7e6c6a) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Hao Sun on 23 Dec 2025 and was reviewed by Markus Grönlund, Jie Fu and Francesco Andreuzzi. > > Thanks! Marked as reviewed by fandreuzzi (Committer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/28976#pullrequestreview-3622980963 From kbarrett at openjdk.org Sat Jan 3 08:29:29 2026 From: kbarrett at openjdk.org (Kim Barrett) Date: Sat, 3 Jan 2026 08:29:29 GMT Subject: RFR: 8374445: Fix -Wzero-as-null-pointer-constant warnings in JfrSet Message-ID: Please review this change to fix JfrSet to avoid triggering -Wzero-as-null-pointer-constant warnings when that warning is enabled. The old code uses an entry value with representation 0 to indicate the entry doesn't have a value. It compares an entry value against literal 0 to check for that. If the key type is a pointer type, this involves an implicit 0 => null pointer constant conversion, so we get a warning when that warning is enabled. Instead we initialize entry values to a value-initialized key, and compare against a value-initialized key. This changes the (currently undocumented) requirements on the key type. The key type is no longer required to be trivially constructible (to permit memset-based initialization), but is now required to be value-initializable. That's currently a wash, since all of the in-use key types are fundamental types (traceid (u8) and Klass*). 
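The pattern described in the message above can be sketched in a few lines. This is a minimal, hypothetical miniature (`TinySet`, `is_free`, and `put` are invented names for illustration, not the actual JfrSet API): comparing an entry against literal 0 is an implicit null-pointer-constant conversion when the key is a pointer type, while comparing against a value-initialized `Key()` behaves identically for both `traceid` (u8) and `Klass*` keys.

```cpp
#include <cassert>

// Minimal sketch of the value-initialized sentinel pattern; the names
// below are invented for illustration and differ from the real JfrSet.
template <typename Key>
class TinySet {
  static const int N = 8;
  Key _table[N];
public:
  TinySet() {
    for (int i = 0; i < N; i++) {
      // Old style was: _table[i] = 0; when Key is a pointer type, the
      // literal 0 converts to a null pointer constant and warns under
      // -Wzero-as-null-pointer-constant.
      _table[i] = Key();  // value-initialization: 0 for u8, nullptr for Klass*
    }
  }
  bool is_free(int i) const {
    // Compare against a value-initialized Key instead of literal 0.
    return _table[i] == Key();
  }
  void put(int i, Key k) { _table[i] = k; }
};
```

As the message notes, this shifts the requirement on the key type from trivially constructible (memset-friendly) to value-initializable, which is a wash for the key types currently in use.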
Testing: mach5 tier1-3 (tier3 is where most jfr tests are run) ------------- Commit messages: - fix -Wzero-as-null-poniter-constant warnings in jfrSet.hpp Changes: https://git.openjdk.org/jdk/pull/29022/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29022&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8374445 Stats: 10 lines in 1 file changed: 3 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/29022.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29022/head:pull/29022 PR: https://git.openjdk.org/jdk/pull/29022 From mgronlun at openjdk.org Mon Jan 5 09:05:27 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Mon, 5 Jan 2026 09:05:27 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Sat, 29 Nov 2025 06:06:16 GMT, Yasumasa Suenaga wrote: > The jtreg test TestEmergencyDumpAtOOM.java runs into the following error on ppc64 platforms. > > JFR emergency dump would be kicked at `VMError::report_and_die()`, then Java thread for JFR would not work due to secondary signal handler for error reporting. > > Passed all of jdk_jfr tests on Linux AMD64. This will not work because there is still a race against the JFR Recorder Thread flushing concurrently with LeakProfiler::emit_events(). This can place the checkpoints and events in a segment before the corresponding classes and methods that were tagged as part of emit_events(). This will break the parser, since constant artifacts will not be resolvable (an invariant is that a flushed segment is self-contained). 
------------- PR Comment: https://git.openjdk.org/jdk/pull/28563#issuecomment-3709513877 From mgronlun at openjdk.org Mon Jan 5 09:09:03 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Mon, 5 Jan 2026 09:09:03 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Sat, 29 Nov 2025 06:06:16 GMT, Yasumasa Suenaga wrote: > The jtreg test TestEmergencyDumpAtOOM.java runs into the following error on ppc64 platforms. > > JFR emergency dump would be kicked at `VMError::report_and_die()`, then Java thread for JFR would not work due to secondary signal handler for error reporting. > > Passed all of jdk_jfr tests on Linux AMD64. This is a very tricky problem to solve correctly, because a VM operation has been introduced as part of error reporting and the VM shutdown sequence. ------------- PR Comment: https://git.openjdk.org/jdk/pull/28563#issuecomment-3709521319 From mdoerr at openjdk.org Mon Jan 5 10:41:16 2026 From: mdoerr at openjdk.org (Martin Doerr) Date: Mon, 5 Jan 2026 10:41:16 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Sat, 29 Nov 2025 06:06:16 GMT, Yasumasa Suenaga wrote: > The jtreg test TestEmergencyDumpAtOOM.java runs into the following error on ppc64 platforms. > > JFR emergency dump would be kicked at `VMError::report_and_die()`, then Java thread for JFR would not work due to secondary signal handler for error reporting. > > Passed all of jdk_jfr tests on Linux AMD64. Would it be a better solution to avoid replacing the signal handler? We could keep the Java compatible handler and change it such that it calls `crash_handler` only for the thread which is reporting the error. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/28563#issuecomment-3709875107 From mgronlun at openjdk.org Mon Jan 5 11:04:54 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Mon, 5 Jan 2026 11:04:54 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: <429t5MAqdyAQ7wFsxgYdUa2YPZm3GCI7SU0KwGDzcCQ=.46240518-0447-4e08-97d2-7522ebc1aacf@github.com> On Mon, 5 Jan 2026 10:37:51 GMT, Martin Doerr wrote: > Would it be a better solution to avoid replacing the signal handler? We could keep the Java compatible handler and change it such that it calls `crash_handler` only for the thread which is reporting the error. I am thinking about some alternatives. ------------- PR Comment: https://git.openjdk.org/jdk/pull/28563#issuecomment-3709957825 From krk at openjdk.org Mon Jan 5 14:55:52 2026 From: krk at openjdk.org (Kerem Kat) Date: Mon, 5 Jan 2026 14:55:52 GMT Subject: RFR: 8373096: JFR leak profiler: path-to-gc-roots search should be non-recursive [v7] In-Reply-To: References: Message-ID: On Thu, 18 Dec 2025 10:11:20 GMT, Thomas Stuefe wrote: >> A customer reported a crash when producing a JFR recording with `path-to-gc-roots=true`. It was a native stack overflow that occurred during the recursive path-to-gc-root search performed in the context of PathToGcRootsOperation. >> >> We try to avoid this by limiting the maximum search depth (DFSClosure::max_dfs_depth). That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. >> >> This RFE changes the algorithm to be non-recursive. >> >> Note that as a result of this change, the order in which oop maps are walked per oop is reversed : last oops are processed first. That should not matter for the end result, however. 
The search is still depth-first. >> >> Note that after this patch, we could easily remove the max_depth limitation altogether. I left it in however since this was not the scope of this RFE. >> >> Testing: >> >> - Tested manually with very small (256K) thread stack size for the VMThread - the patched version works where the old version crashes out >> - Compared JFR recordings from both an unpatched version (with a large enough VMThread stack size) and a patched version; made sure that the content of "Old Object Sample" was identical >> - Ran locally all jtreg tests in jdk/jfr >> - GHAs > > Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: > > do strides for arrays This could fix https://bugs.openjdk.org/browse/JDK-8371630, which I couldn't reproduce outside the Oracle environments detailed in the ticket. ------------- PR Comment: https://git.openjdk.org/jdk/pull/28659#issuecomment-3710759355 From duke at openjdk.org Tue Jan 6 23:49:26 2026 From: duke at openjdk.org (Robert Toyonaga) Date: Tue, 6 Jan 2026 23:49:26 GMT Subject: RFR: 8367949: JFR: MethodTrace double-counts methods that catch their own exceptions In-Reply-To: References: Message-ID: On Sun, 21 Dec 2025 16:22:25 GMT, Erik Gahlin wrote: > Could I have a review of a PR that changes how the instrumentation of the MethodTrace and MethodTiming events is implemented, so they handle exceptions in a better way? > > For constructors, the current implementation is still used in certain corner cases. A proper implementation would require data-flow analysis, but for all practical purposes this code should work fine. > > Testing: jdk/jdk/jfr > > Thanks > Erik src/jdk.jfr/share/classes/jdk/jfr/internal/tracing/Transform.java line 176: > 174: } > 175: TryBlock last = tryBlocks.getLast(); > 176: if (tryBlocks.getLast().end == null) { Suggestion: if (last.end == null) { Is it important to read `tryBlocks.getLast()` again here? 
test/jdk/jdk/jfr/event/tracing/TestConstructors.java line 116: > 114: } > 115: try { > 116: new Zebra(true); This results in `Zebra(int)` getting traced but not `Zebra(boolean)` because an exception is thrown and the `Zebra(boolean)` constructor call [is outside the `try` block](https://github.com/openjdk/jdk/pull/28947/files#diff-68a37600bc91d54808ea1ca427ade6af8a600889877f262e20782c550eded410R160). Is this intended? Shouldn't a method be traced every time it is called? In contrast, `new Zebra(false);` causes both `Zebra(int)` and `Zebra(boolean)` to be traced. Additionally, with the old approach, `new Cat();` would not cause `Cat()` to be traced at all, since its callee, `methodThatThrows()`, prevents execution ever reaching `Cat()`'s `return` statement. I did a quick check on this by hardcoding `simplifiedInstrumentation = true`. Now, with the new approach in this PR, `new Cat();` causes `Cat()` to be traced exactly once. This makes sense to me, but is different than before. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/28947#discussion_r2666613894 PR Review Comment: https://git.openjdk.org/jdk/pull/28947#discussion_r2666618915 From egahlin at openjdk.org Wed Jan 7 00:21:33 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Wed, 7 Jan 2026 00:21:33 GMT Subject: RFR: 8367949: JFR: MethodTrace double-counts methods that catch their own exceptions In-Reply-To: References: Message-ID: <3GdoIv47UZL2mViNWedMrfbXGorNe_mDLJEVg7lJ0VQ=.cbb57a8b-eed8-411d-b83f-1f52c9f3f84c@github.com> On Tue, 6 Jan 2026 23:46:40 GMT, Robert Toyonaga wrote: >> Could I have a review of a PR that changes how the instrumentation of the MethodTrace and MethodTiming events is implemented, so they handle exceptions in a better way? >> >> For constructors, the current implementation is still used in certain corner cases. A proper implementation would require data-flow analysis, but for all practical purposes this code should work fine. 
>> >> Testing: jdk/jdk/jfr >> >> Thanks >> Erik > > test/jdk/jdk/jfr/event/tracing/TestConstructors.java line 116: > >> 114: } >> 115: try { >> 116: new Zebra(true); > > This results in `Zebra(int)` getting traced but not `Zebra(boolean)` because the `Zebra(int)` constructor call throws but [is outside the `try` block](https://github.com/openjdk/jdk/pull/28947/files#diff-68a37600bc91d54808ea1ca427ade6af8a600889877f262e20782c550eded410R160) so execution never reaches the `catch` block that applies tracing. Is this intended? Shouldn't a method be traced every time it is called? In contrast, `new Zebra(false);` causes both `Zebra(int)` and `Zebra(boolean)` to be traced. > > Additionally, with the old approach, `new Cat();` would not cause `Cat()` to be traced at all, since its callee, `methodThatThrows()`, prevents execution ever reaching `Cat()`'s `return` statement. I did a quick check on this by hardcoding `simplifiedInstrumentation = true`. Now, with the new approach in this PR, `new Cat();` causes `Cat()` to be traced exactly once. This makes sense to me, but is different than before. We can't place a try block around a call to super(...) or this(...). This is why two try blocks are used, one before and one after the call to this(...) or super(...). With try blocks, we can now also track when an exception occurs in a callee. This is a behavioral change, but I believe it is for the better. I was aware of this limitation when I did the initial implementation, but I didn't think it was worth the added complexity that try blocks bring. What I didn't realize at the time was the double-count issue, so now that we have the mechanics for try blocks, I decided to fix the exception-in-a-callee case as well. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/28947#discussion_r2666661004 From haosun at openjdk.org Wed Jan 7 00:49:00 2026 From: haosun at openjdk.org (Hao Sun) Date: Wed, 7 Jan 2026 00:49:00 GMT Subject: [jdk26] RFR: 8373122: JFR build failure with CDS disabled due to -Werror=unused-function after JDK-8365400 In-Reply-To: References: Message-ID: On Wed, 24 Dec 2025 03:46:49 GMT, Hao Sun wrote: > Hi all, > > This pull request contains a backport of commit [e1d81c09](https://github.com/openjdk/jdk/commit/e1d81c0946364a266a006481a8fbbac24c7e6c6a) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Hao Sun on 23 Dec 2025 and was reviewed by Markus Grönlund, Jie Fu and Francesco Andreuzzi. > > Thanks! I would appreciate it if someone could help review this backport patch. Thanks. ------------- PR Comment: https://git.openjdk.org/jdk/pull/28976#issuecomment-3716887426 From jiefu at openjdk.org Wed Jan 7 00:55:02 2026 From: jiefu at openjdk.org (Jie Fu) Date: Wed, 7 Jan 2026 00:55:02 GMT Subject: [jdk26] RFR: 8373122: JFR build failure with CDS disabled due to -Werror=unused-function after JDK-8365400 In-Reply-To: References: Message-ID: On Wed, 24 Dec 2025 03:46:49 GMT, Hao Sun wrote: > Hi all, > > This pull request contains a backport of commit [e1d81c09](https://github.com/openjdk/jdk/commit/e1d81c0946364a266a006481a8fbbac24c7e6c6a) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Hao Sun on 23 Dec 2025 and was reviewed by Markus Grönlund, Jie Fu and Francesco Andreuzzi. > > Thanks! LGTM ------------- Marked as reviewed by jiefu (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/28976#pullrequestreview-3632992405 From haosun at openjdk.org Wed Jan 7 01:08:48 2026 From: haosun at openjdk.org (Hao Sun) Date: Wed, 7 Jan 2026 01:08:48 GMT Subject: [jdk26] RFR: 8373122: JFR build failure with CDS disabled due to -Werror=unused-function after JDK-8365400 In-Reply-To: References: Message-ID: <58efiiX1cRmDco5VZyabSIFcgxVYoA30NAUNkfqtmi8=.928a57dd-f690-4b90-85a2-a9e18e459767@github.com> On Fri, 2 Jan 2026 11:40:12 GMT, Francesco Andreuzzi wrote: >> Hi all, >> >> This pull request contains a backport of commit [e1d81c09](https://github.com/openjdk/jdk/commit/e1d81c0946364a266a006481a8fbbac24c7e6c6a) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. >> >> The commit being backported was authored by Hao Sun on 23 Dec 2025 and was reviewed by Markus Grönlund, Jie Fu and Francesco Andreuzzi. >> >> Thanks! > > Marked as reviewed by fandreuzzi (Committer). Thanks a lot for your reviews. @fandreuz @DamonFool ------------- PR Comment: https://git.openjdk.org/jdk/pull/28976#issuecomment-3716921157 From haosun at openjdk.org Wed Jan 7 01:08:50 2026 From: haosun at openjdk.org (Hao Sun) Date: Wed, 7 Jan 2026 01:08:50 GMT Subject: [jdk26] Integrated: 8373122: JFR build failure with CDS disabled due to -Werror=unused-function after JDK-8365400 In-Reply-To: References: Message-ID: On Wed, 24 Dec 2025 03:46:49 GMT, Hao Sun wrote: > Hi all, > > This pull request contains a backport of commit [e1d81c09](https://github.com/openjdk/jdk/commit/e1d81c0946364a266a006481a8fbbac24c7e6c6a) from the [openjdk/jdk](https://git.openjdk.org/jdk) repository. > > The commit being backported was authored by Hao Sun on 23 Dec 2025 and was reviewed by Markus Grönlund, Jie Fu and Francesco Andreuzzi. > > Thanks! This pull request has now been integrated. 
Changeset: 3103fa08
Author: Hao Sun
URL: https://git.openjdk.org/jdk/commit/3103fa08bba95ec2c60458d1c5f128243e5ff5bc
Stats: 28 lines in 1 file changed: 14 ins; 14 del; 0 mod

8373122: JFR build failure with CDS disabled due to -Werror=unused-function after JDK-8365400

Reviewed-by: fandreuzzi, jiefu
Backport-of: e1d81c0946364a266a006481a8fbbac24c7e6c6a

-------------

PR: https://git.openjdk.org/jdk/pull/28976

From ozanctn at amazon.com Wed Jan 7 09:37:43 2026
From: ozanctn at amazon.com (Cetin, Ozan)
Date: Wed, 7 Jan 2026 09:37:43 +0000
Subject: [jdk21] JDK-8337994 REDO backport failure analysis - Missing prerequisite changes from JDK-8316241
Message-ID: 

Hi,

I've been investigating the test failures that caused JDK-8346108 (the revert of JDK-8337994 REDO in JDK21). This is related to the native memory leak when not recording any JFR events (JDK-8335121).

Summary

Based on our investigation, we believe the JDK-8337994 (REDO) backport to JDK21 failed because it appears to depend on API changes introduced in the original JDK-8316241 fix that were never backported to JDK21. Our theory is that the REDO fix assumes the existence of infrastructure that only exists in later mainline releases.

Root Cause Analysis

The Missing Prerequisite

The original JDK-8316241 fix (commit b2a39c576706622b624314c89fa6d10d0b422f86) introduced several key changes to jfrTypeSetUtils.hpp/.cpp:

1. API Change: should_do_loader_klass(const Klass* k) -> should_do_cld_klass(const Klass* k, bool leakp)
2. New Data Structure: Added _klass_loader_leakp_set for separate tracking of leakp (leak profiler) path klasses
3. New Function: get_cld_klass(CldPtr cld, bool leakp) in jfrTypeSet.cpp that properly enqueues CLD klasses via JfrTraceId::load()

What Happens Without These Changes

The REDO fix attempts to use get_cld_klass(), which calls should_do_cld_klass(klass, leakp), but in the JDK21 backport:

* JDK21 still has the old API: should_do_loader_klass(const Klass* k) (no leakp parameter)
* JDK21 lacks _klass_loader_leakp_set for separate tracking
* The get_cld_klass() function doesn't exist in the JDK21 codebase

This causes the assert(IS_SERIALIZED(class_loader_klass)) to fail in write_cld() because the CLD's class_loader_klass is never properly enqueued for serialization during the leakp path.

Test Failure Mechanism (TestChunkIntegrity.java)

1. TestClassLoader loads MyClass
2. Event commits with clazz = MyClass
3. JFR rotation writes MyClass to chunk
4. MyClass's CLD references TestClassLoader Klass
5. BUG: TestClassLoader Klass not serialized (leakp path broken)
6. Chunk written with broken reference
7. In slowdebug: assert(IS_SERIALIZED(class_loader_klass)) fails
8. In release: "Events don't match" when comparing chunks

The Fix

I've been able to get a local jdk21 build passing all tests (including slowdebug) by backporting JDK-8316241 and resolving the resulting conflicts. The key changes are:

1. jfrTypeSetUtils.hpp

    // OLD (JDK21 current)
    bool should_do_loader_klass(const Klass* k);

    // NEW (with leakp support)
    bool should_do_cld_klass(const Klass* k, bool leakp);

2. jfrTypeSetUtils.cpp

    // Added _klass_loader_leakp_set member
    GrowableArray* _klass_loader_leakp_set;

    // Updated implementation
    bool JfrArtifactSet::should_do_cld_klass(const Klass* k, bool leakp) {
      assert(k != nullptr, "invariant");
      assert(_klass_loader_set != nullptr, "invariant");
      assert(_klass_loader_leakp_set != nullptr, "invariant");
      return not_in_set(leakp ? _klass_loader_leakp_set : _klass_loader_set, k);
    }

3. jfrTypeSet.cpp - Added get_cld_klass()

    static inline KlassPtr get_cld_klass(CldPtr cld, bool leakp) {
      if (cld == nullptr) {
        return nullptr;
      }
      assert(leakp ? IS_LEAKP(cld) : used(cld), "invariant");
      KlassPtr cld_klass = cld->class_loader_klass();
      if (cld_klass == nullptr) {
        return nullptr;
      }
      if (should_do_cld_klass(cld_klass, leakp)) {
        if (current_epoch()) {
          // KEY FIX: Enqueue the klass for serialization
          JfrTraceId::load(cld_klass);
        } else {
          artifact_tag(cld_klass, leakp);
        }
        return cld_klass;
      }
      return nullptr;
    }

Proposed Action

Based on this, it appears that backporting JDK-8337994 (REDO) alone may not be sufficient, and that some or all of the prerequisite infrastructure changes from JDK-8316241 may also need to be backported. Additionally, there may be other upstream commits (such as 8323631) in JDK24 that were made on top of JDK-8316241 and that could also be required for the fix not to cause other possible errors.

We would appreciate guidance on identifying any additional changes that might need to be included in the backport. If this direction makes sense, I'm happy to prepare a proper patch for review.

References

* JDK-8335121: Native memory leak when JFR is enabled but no events are emitted
* JDK-8316241: Test jdk/jdk/jfr/jvm/TestChunkIntegrity.java failed (original fix)
* JDK-8337994: [REDO] Native memory leak when not recording any events
* JDK-8346108: Revert of REDO in JDK21u due to test failures

Best Regards,
Ozan

From egahlin at openjdk.org Wed Jan 7 10:08:11 2026
From: egahlin at openjdk.org (Erik Gahlin)
Date: Wed, 7 Jan 2026 10:08:11 GMT
Subject: RFR: 8367949: JFR: MethodTrace double-counts methods that catch their own exceptions [v2]
In-Reply-To: 
References: 
Message-ID: 

> Could I have a review of a PR that changes how the instrumentation of the MethodTrace and MethodTiming events is implemented, so they handle exceptions in a better way?
> > For constructors, the current implementation is still used in certain corner cases. A proper implementation would require data-flow analysis, but for all practical purposes this code should work fine. > > Testing: jdk/jdk/jfr > > Thanks > Erik Erik Gahlin has updated the pull request incrementally with one additional commit since the last revision: Formatting + reuse of local variable ------------- Changes: - all: https://git.openjdk.org/jdk/pull/28947/files - new: https://git.openjdk.org/jdk/pull/28947/files/6b2473dc..f97e2ad3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=28947&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=28947&range=00-01 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/28947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/28947/head:pull/28947 PR: https://git.openjdk.org/jdk/pull/28947 From fandreuzzi at openjdk.org Wed Jan 7 12:42:14 2026 From: fandreuzzi at openjdk.org (Francesco Andreuzzi) Date: Wed, 7 Jan 2026 12:42:14 GMT Subject: RFR: 8374713: PredicatedConcurrentWriteOp is unused Message-ID: Trivial removal of `PredicatedConcurrentWriteOp`, which is not used anymore after [8284161](https://bugs.openjdk.org/browse/JDK-8284161). 
------------- Commit messages: - nn Changes: https://git.openjdk.org/jdk/pull/29088/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29088&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8374713 Stats: 12 lines in 1 file changed: 0 ins; 12 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/29088.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29088/head:pull/29088 PR: https://git.openjdk.org/jdk/pull/29088 From duke at openjdk.org Wed Jan 7 14:08:30 2026 From: duke at openjdk.org (Robert Toyonaga) Date: Wed, 7 Jan 2026 14:08:30 GMT Subject: RFR: 8367949: JFR: MethodTrace double-counts methods that catch their own exceptions [v2] In-Reply-To: References: Message-ID: On Wed, 7 Jan 2026 10:08:11 GMT, Erik Gahlin wrote: >> Could I have a review of a PR that changes how the instrumentation of the MethodTrace and MethodTiming events is implemented, so they handle exceptions in a better way? >> >> For constructors, the current implementation is still used in certain corner cases. A proper implementation would require data-flow analysis, but for all practical purposes this code should work fine. >> >> Testing: jdk/jdk/jfr >> >> Thanks >> Erik > > Erik Gahlin has updated the pull request incrementally with one additional commit since the last revision: > > Formatting + reuse of local variable Marked as reviewed by roberttoyonaga at github.com (no known OpenJDK username). 
------------- PR Review: https://git.openjdk.org/jdk/pull/28947#pullrequestreview-3635059497 From duke at openjdk.org Wed Jan 7 14:08:34 2026 From: duke at openjdk.org (Robert Toyonaga) Date: Wed, 7 Jan 2026 14:08:34 GMT Subject: RFR: 8367949: JFR: MethodTrace double-counts methods that catch their own exceptions [v2] In-Reply-To: <3GdoIv47UZL2mViNWedMrfbXGorNe_mDLJEVg7lJ0VQ=.cbb57a8b-eed8-411d-b83f-1f52c9f3f84c@github.com> References: <3GdoIv47UZL2mViNWedMrfbXGorNe_mDLJEVg7lJ0VQ=.cbb57a8b-eed8-411d-b83f-1f52c9f3f84c@github.com> Message-ID: On Wed, 7 Jan 2026 00:15:10 GMT, Erik Gahlin wrote: >> test/jdk/jdk/jfr/event/tracing/TestConstructors.java line 116: >> >>> 114: } >>> 115: try { >>> 116: new Zebra(true); >> >> This results in `Zebra(int)` getting traced but not `Zebra(boolean)` because the `Zebra(int)` constructor call throws but [is outside the `try` block](https://github.com/openjdk/jdk/pull/28947/files#diff-68a37600bc91d54808ea1ca427ade6af8a600889877f262e20782c550eded410R160) so execution never reaches the `catch` block that applies tracing. Is this intended? Shouldn't a method be traced every time it is called? In contrast, `new Zebra(false);` causes both `Zebra(int)` and `Zebra(boolean)` to be traced. >> >> Additionally, with the old approach, `new Cat();` would not cause `Cat()` to be traced at all, since its callee, `methodThatThrows()`, prevents execution ever reaching `Cat()`'s `return` statement. I did a quick check on this by hardcoding `simplifiedInstrumentation = true`. Now, with the new approach in this PR, `new Cat();` causes `Cat()` to be traced exactly once. This makes sense to me, but is different than before. > > We can't place a try block around a call to super(...) or this(...). This is why two try blocks are used, one before and one after the call to this(...) or super(...). > > With try blocks, we can now also track when an exception occurs in a callee. This is a behavioral change, but I believe it is for the better. 
I was aware of this limitation when I did the initial implementation, but I didn't think it was worth the added complexity that try blocks bring. What I didn't realize at the time was the double-count issue, so now that we have the mechanics for try blocks, I fixed the exception-in-a-callee case as well. Okay, this makes sense to me. Thank you for your explanation. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/28947#discussion_r2668595080 From mgronlun at openjdk.org Wed Jan 7 14:23:52 2026 From: mgronlun at openjdk.org (Markus Grönlund) Date: Wed, 7 Jan 2026 14:23:52 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Sat, 29 Nov 2025 06:06:16 GMT, Yasumasa Suenaga wrote: > The jtreg test TestEmergencyDumpAtOOM.java runs into the following error on ppc64 platforms. > > JFR emergency dump would be kicked at `VMError::report_and_die()`, then Java thread for JFR would not work due to secondary signal handler for error reporting. > > Passed all of jdk_jfr tests on Linux AMD64. Alternative implementation suggestion PR (in draft state) https://github.com/openjdk/jdk/pull/29094 Includes also a solution to [JDK-8373257](https://bugs.openjdk.org/browse/JDK-8373257) @tstuefe Please take a look, and also if you can, submit for testing on your platforms. Markus ------------- PR Comment: https://git.openjdk.org/jdk/pull/28563#issuecomment-3719105625 From mgronlun at openjdk.org Wed Jan 7 17:31:12 2026 From: mgronlun at openjdk.org (Markus Grönlund) Date: Wed, 7 Jan 2026 17:31:12 GMT Subject: RFR: 8367949: JFR: MethodTrace double-counts methods that catch their own exceptions [v2] In-Reply-To: References: Message-ID: On Wed, 7 Jan 2026 10:08:11 GMT, Erik Gahlin wrote: >> Could I have a review of a PR that changes how the instrumentation of the MethodTrace and MethodTiming events is implemented, so they handle exceptions in a better way? 
>> >> For constructors, the current implementation is still used in certain corner cases. A proper implementation would require data-flow analysis, but for all practical purposes this code should work fine. >> >> Testing: jdk/jdk/jfr >> >> Thanks >> Erik > > Erik Gahlin has updated the pull request incrementally with one additional commit since the last revision: > > Formatting + reuse of local variable Marked as reviewed by mgronlun (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/28947#pullrequestreview-3635976864 From mgronlun at openjdk.org Wed Jan 7 17:38:46 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Wed, 7 Jan 2026 17:38:46 GMT Subject: RFR: 8374445: Fix -Wzero-as-null-pointer-constant warnings in JfrSet In-Reply-To: References: Message-ID: On Sat, 3 Jan 2026 08:21:15 GMT, Kim Barrett wrote: > Please review this change to fix JfrSet to avoid triggering > -Wzero-as-null-pointer-constant warnings when that warning is enabled. > > The old code uses an entry value with representation 0 to indicate the entry > doesn't have a value. It compares an entry value against literal 0 to check > for that. If the key type is a pointer type, this involves an implicit 0 => > null pointer constant conversion, so we get a warning when that warning is > enabled. > > Instead we initialize entry values to a value-initialized key, and compare > against a value-initialized key. This changes the (currently undocumented) > requirements on the key type. The key type is no longer required to be > trivially constructible (to permit memset-based initialization), but is now > required to be value-initializable. That's currently a wash, since all of the > in-use key types are fundamental types (traceid (u8) and Klass*). > > Testing: mach5 tier1-3 (tier3 is where most jfr tests are run) Will review this later Kim, sorry for the delay (26 stuff). 
------------- PR Comment: https://git.openjdk.org/jdk/pull/29022#issuecomment-3719971111 From mgronlun at openjdk.org Wed Jan 7 17:45:22 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Wed, 7 Jan 2026 17:45:22 GMT Subject: RFR: 8374713: PredicatedConcurrentWriteOp is unused In-Reply-To: References: Message-ID: On Wed, 7 Jan 2026 12:34:33 GMT, Francesco Andreuzzi wrote: > Trivial removal of `PredicatedConcurrentWriteOp`, which is not used anymore after [8284161](https://bugs.openjdk.org/browse/JDK-8284161). I would prefer to keep this building block so I don't have to devise it again, when / if needed next time. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29088#issuecomment-3720004639 From ysuenaga at openjdk.org Thu Jan 8 01:38:00 2026 From: ysuenaga at openjdk.org (Yasumasa Suenaga) Date: Thu, 8 Jan 2026 01:38:00 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Wed, 7 Jan 2026 14:20:34 GMT, Markus Gr?nlund wrote: >> The jtreg test TestEmergencyDumpAtOOM.java runs into the following error on ppc64 platforms. >> >> JFR emergency dump would be kicked at `VMError::report_and_die()`, then Java thread for JFR would not work due to secondary signal handler for error reporting. >> >> Passed all of jdk_jfr tests on Linux AMD64. > > Alternative implementation suggestion PR (in draft state) https://github.com/openjdk/jdk/pull/29094 > > Includes also a solution to [JDK-8373257](https://bugs.openjdk.org/browse/JDK-8373257) > @tstuefe > > Please take a look, and also if you can, submit for testing on your platforms. > > Markus Thanks a lot @mgronlun ! I think JDK-8371014 (and JDK-8373257) should be tackled in #29094 . So should I close this PR? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/28563#issuecomment-3721527637 From fandreuzzi at openjdk.org Thu Jan 8 08:34:05 2026 From: fandreuzzi at openjdk.org (Francesco Andreuzzi) Date: Thu, 8 Jan 2026 08:34:05 GMT Subject: RFR: 8374713: PredicatedConcurrentWriteOp is unused In-Reply-To: References: Message-ID: On Wed, 7 Jan 2026 17:41:51 GMT, Markus Gr?nlund wrote: > I would prefer to keep this building block so I don't have to devise it again, when / if needed next time. Sure, I'll close this PR then. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29088#issuecomment-3722778900 From fandreuzzi at openjdk.org Thu Jan 8 08:34:07 2026 From: fandreuzzi at openjdk.org (Francesco Andreuzzi) Date: Thu, 8 Jan 2026 08:34:07 GMT Subject: Withdrawn: 8374713: PredicatedConcurrentWriteOp is unused In-Reply-To: References: Message-ID: On Wed, 7 Jan 2026 12:34:33 GMT, Francesco Andreuzzi wrote: > Trivial removal of `PredicatedConcurrentWriteOp`, which is not used anymore after [8284161](https://bugs.openjdk.org/browse/JDK-8284161). This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/29088 From mgronlun at openjdk.org Thu Jan 8 09:47:20 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Thu, 8 Jan 2026 09:47:20 GMT Subject: RFR: 8374713: PredicatedConcurrentWriteOp is unused In-Reply-To: References: Message-ID: <2RGUAJjq1XAngtw2vGN9xRo6XDW8DbNq5xmr6KGij3A=.538c42fc-f2b4-4c52-963e-aea497163cb5@github.com> On Thu, 8 Jan 2026 08:30:37 GMT, Francesco Andreuzzi wrote: > > I would prefer to keep this building block so I don't have to devise it again, when / if needed next time. > > Sure, I'll close this PR then. Thanks @fandreuz ! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/29088#issuecomment-3723052062 From egahlin at openjdk.org Thu Jan 8 11:26:54 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Thu, 8 Jan 2026 11:26:54 GMT Subject: RFR: 8367949: JFR: MethodTrace double-counts methods that catch their own exceptions [v3] In-Reply-To: References: Message-ID: > Could I have a review of a PR that changes how the instrumentation of the MethodTrace and MethodTiming events is implemented, so they handle exceptions in a better way? > > For constructors, the current implementation is still used in certain corner cases. A proper implementation would require data-flow analysis, but for all practical purposes this code should work fine. > > Testing: jdk/jdk/jfr > > Thanks > Erik Erik Gahlin has updated the pull request incrementally with one additional commit since the last revision: Use simplified instrumentation for java.lang.Object:: ------------- Changes: - all: https://git.openjdk.org/jdk/pull/28947/files - new: https://git.openjdk.org/jdk/pull/28947/files/f97e2ad3..4897a25e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=28947&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=28947&range=01-02 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/28947.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/28947/head:pull/28947 PR: https://git.openjdk.org/jdk/pull/28947 From mgronlun at openjdk.org Thu Jan 8 12:24:06 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Thu, 8 Jan 2026 12:24:06 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Wed, 7 Jan 2026 14:20:34 GMT, Markus Gr?nlund wrote: >> The jtreg test TestEmergencyDumpAtOOM.java runs into the following error on ppc64 platforms. 
>> >> JFR emergency dump would be kicked at `VMError::report_and_die()`, then Java thread for JFR would not work due to secondary signal handler for error reporting. >> >> Passed all of jdk_jfr tests on Linux AMD64. > > Alternative implementation suggestion PR (in draft state) https://github.com/openjdk/jdk/pull/29094 > > Includes also a solution to [JDK-8373257](https://bugs.openjdk.org/browse/JDK-8373257) > @tstuefe > > Please take a look, and also if you can, submit for testing on your platforms. > > Markus > Thanks a lot @mgronlun ! I think JDK-8371014 (and JDK-8373257) should be tackled in #29094 . So should I close this PR? Yes, I think we should do it in https://github.com/openjdk/jdk/pull/29094. You can close this one, and I will officially publish https://github.com/openjdk/jdk/pull/29094. ------------- PR Comment: https://git.openjdk.org/jdk/pull/28563#issuecomment-3723622934 From mgronlun at openjdk.org Thu Jan 8 12:29:48 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Thu, 8 Jan 2026 12:29:48 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented Message-ID: Alternative for solving [JDK-8371014](https://bugs.openjdk.org/browse/JDK-8371014) Also includes a fix for [JDK-8373257](https://bugs.openjdk.org/browse/JDK-8373257) ------------- Commit messages: - reordering - update comment - is_recording() conditional - remove tautology - 8371014 Changes: https://git.openjdk.org/jdk/pull/29094/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29094&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8371014 Stats: 292 lines in 15 files changed: 219 ins; 18 del; 55 mod Patch: https://git.openjdk.org/jdk/pull/29094.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29094/head:pull/29094 PR: https://git.openjdk.org/jdk/pull/29094 From mdoerr at openjdk.org Thu Jan 8 12:29:49 2026 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 8 Jan 2026 12:29:49 GMT Subject: RFR: 8371014: 
Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: <8LD4JmIZnVSwmhLeVZROok-0h-nCD1TxlaSRHe586-E=.99a49bfa-444f-4ddf-b206-0a75fe1dad23@github.com> On Wed, 7 Jan 2026 14:14:19 GMT, Markus Gr?nlund wrote: > Alternative for solving [JDK-8371014](https://bugs.openjdk.org/browse/JDK-8371014) > > Also includes a fix for [JDK-8373257](https://bugs.openjdk.org/browse/JDK-8373257) TestEmergencyDumpAtOOM.java has passed on both, AIX and linux on PPC64. Thanks! ------------- PR Comment: https://git.openjdk.org/jdk/pull/29094#issuecomment-3719926832 From mgronlun at openjdk.org Thu Jan 8 12:29:51 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Thu, 8 Jan 2026 12:29:51 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: <8LD4JmIZnVSwmhLeVZROok-0h-nCD1TxlaSRHe586-E=.99a49bfa-444f-4ddf-b206-0a75fe1dad23@github.com> References: <8LD4JmIZnVSwmhLeVZROok-0h-nCD1TxlaSRHe586-E=.99a49bfa-444f-4ddf-b206-0a75fe1dad23@github.com> Message-ID: On Wed, 7 Jan 2026 17:23:25 GMT, Martin Doerr wrote: > TestEmergencyDumpAtOOM.java has passed on both, AIX and linux on PPC64. Thanks! Thanks Martin. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29094#issuecomment-3720022380 From ysuenaga at openjdk.org Thu Jan 8 12:29:52 2026 From: ysuenaga at openjdk.org (Yasumasa Suenaga) Date: Thu, 8 Jan 2026 12:29:52 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: <8LD4JmIZnVSwmhLeVZROok-0h-nCD1TxlaSRHe586-E=.99a49bfa-444f-4ddf-b206-0a75fe1dad23@github.com> Message-ID: On Wed, 7 Jan 2026 17:45:55 GMT, Markus Gr?nlund wrote: >> TestEmergencyDumpAtOOM.java has passed on both, AIX and linux on PPC64. Thanks! > >> TestEmergencyDumpAtOOM.java has passed on both, AIX and linux on PPC64. Thanks! > > Thanks Martin. Thanks a lot @mgronlun ! Looks good in general. 
Can we wait for `service.emit_leakprofiler_events()` to finish in the JFR recorder thread before the crash at `report_java_out_of_memory()` in debug.cpp? I am concerned that `abort()` may be called before the recorder thread finishes dumping events. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29094#issuecomment-3721524743 From mgronlun at openjdk.org Thu Jan 8 12:29:53 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Thu, 8 Jan 2026 12:29:53 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: <8LD4JmIZnVSwmhLeVZROok-0h-nCD1TxlaSRHe586-E=.99a49bfa-444f-4ddf-b206-0a75fe1dad23@github.com> Message-ID: On Wed, 7 Jan 2026 17:45:55 GMT, Markus Grönlund wrote: >> TestEmergencyDumpAtOOM.java has passed on both, AIX and linux on PPC64. Thanks! > >> TestEmergencyDumpAtOOM.java has passed on both, AIX and linux on PPC64. Thanks! > > Thanks Martin. > Thanks a lot @mgronlun ! Looks good in general. > > Can we wait for `service.emit_leakprofiler_events()` to finish in the JFR recorder thread before the crash at `report_java_out_of_memory()` in debug.cpp? I am concerned that `abort()` may be called before the recorder thread finishes dumping events. The solution is to avoid anyone calling abort() concurrently until at least one service.emit_leakprofiler_events() has completed. That's why the invocation is done by all threads coming into report_java_out_of_memory(), not only a cas-selected one. Why? Because it is only by taking the threads from thread state _thread_in_vm to state _thread_blocked (which we manage as part of posting the JFR msg), that a VM operation in service.emit_leakprofiler_events() can proceed.
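The coordination described here — every thread entering `report_java_out_of_memory()` blocks until the dump has completed, rather than a single cas-selected thread proceeding straight to `abort()` — can be sketched in miniature. This is a hypothetical, simplified illustration of the idea using Java latches, not the actual HotSpot code (which works through thread-state transitions and JFR messaging, not latches):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OomDumpCoordination {
    // Released once the "recorder" has finished emitting events,
    // standing in for the completion of emit_leakprofiler_events().
    static final CountDownLatch dumpDone = new CountDownLatch(1);

    // Every thread hitting the simulated OOM path parks here instead of
    // aborting immediately, analogous to _thread_in_vm -> _thread_blocked.
    static void reportOutOfMemory() throws InterruptedException {
        dumpDone.await();
        // In the real VM, abort() would only be reached after this point.
    }

    // Returns true when all "OOM" threads were released after the dump.
    static boolean demo() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> { reportOutOfMemory(); return null; });
        }
        Thread.sleep(100);   // stand-in for the recorder emitting events
        dumpDone.countDown();
        pool.shutdown();
        return pool.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("all OOM threads released after dump: " + demo());
    }
}
```

The point of the sketch is only the ordering: no simulated OOM thread can terminate the process while the dump is still in flight.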
------------- PR Comment: https://git.openjdk.org/jdk/pull/29094#issuecomment-3723611490 From mgronlun at openjdk.org Thu Jan 8 13:47:06 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Thu, 8 Jan 2026 13:47:06 GMT Subject: RFR: 8367949: JFR: MethodTrace double-counts methods that catch their own exceptions [v3] In-Reply-To: References: Message-ID: On Thu, 8 Jan 2026 11:26:54 GMT, Erik Gahlin wrote: >> Could I have a review of a PR that changes how the instrumentation of the MethodTrace and MethodTiming events is implemented, so they handle exceptions in a better way? >> >> For constructors, the current implementation is still used in certain corner cases. A proper implementation would require data-flow analysis, but for all practical purposes this code should work fine. >> >> Testing: jdk/jdk/jfr >> >> Thanks >> Erik > > Erik Gahlin has updated the pull request incrementally with one additional commit since the last revision: > > Use simplified instrumentation for java.lang.Object:: Marked as reviewed by mgronlun (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/28947#pullrequestreview-3639493888 From egahlin at openjdk.org Thu Jan 8 16:38:54 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Thu, 8 Jan 2026 16:38:54 GMT Subject: Integrated: 8367949: JFR: MethodTrace double-counts methods that catch their own exceptions In-Reply-To: References: Message-ID: On Sun, 21 Dec 2025 16:22:25 GMT, Erik Gahlin wrote: > Could I have a review of a PR that changes how the instrumentation of the MethodTrace and MethodTiming events is implemented, so they handle exceptions in a better way? > > For constructors, the current implementation is still used in certain corner cases. A proper implementation would require data-flow analysis, but for all practical purposes this code should work fine. > > Testing: jdk/jdk/jfr > > Thanks > Erik This pull request has now been integrated. 
Changeset: fa2eb626 Author: Erik Gahlin URL: https://git.openjdk.org/jdk/commit/fa2eb626478806dc64fe03d8729f53f7ed26a172 Stats: 363 lines in 5 files changed: 333 ins; 1 del; 29 mod 8367949: JFR: MethodTrace double-counts methods that catch their own exceptions Reviewed-by: mgronlun ------------- PR: https://git.openjdk.org/jdk/pull/28947 From egahlin at openjdk.org Thu Jan 8 17:20:55 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Thu, 8 Jan 2026 17:20:55 GMT Subject: RFR: 8372321: TestBackToBackSensitive fails intermittently after JDK-8365972 Message-ID: Could I have a review of a PR that attempts to harden a test? Sometimes, ClassLoaderStatistics events are dropped, probably due to a bug in the RecordingStream class when starting multiple recordings simultaneously. This is not a bug related to back-to-back chunks, so I decided to use an EventFileStream instead. I also use TestClassLoader for verification purposes. Using PlatformClassLoader shouldn't be a problem, but it seems more prudent to have an actual object/class on the heap for the class loader that needs to be checked. 
Testing: jdk/jdk/jfr Thanks Erik ------------- Commit messages: - Use EventStream - Remove empty line - Initial Changes: https://git.openjdk.org/jdk/pull/29117/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29117&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8372321 Stats: 47 lines in 1 file changed: 21 ins; 14 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/29117.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29117/head:pull/29117 PR: https://git.openjdk.org/jdk/pull/29117 From mgronlun at openjdk.org Thu Jan 8 19:01:30 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Thu, 8 Jan 2026 19:01:30 GMT Subject: RFR: 8372321: TestBackToBackSensitive fails intermittently after JDK-8365972 In-Reply-To: References: Message-ID: On Thu, 8 Jan 2026 14:28:54 GMT, Erik Gahlin wrote: > Could I have a review of a PR that attempts to harden a test? > > Sometimes, ClassLoaderStatistics events are dropped, probably due to a bug in the RecordingStream class when starting multiple recordings simultaneously. This is not a bug related to back-to-back chunks, so I decided to use an EventFileStream instead. > > I also use TestClassLoader for verification purposes. Using PlatformClassLoader shouldn't be a problem, but it seems more prudent to have an actual object/class on the heap for the class loader that needs to be checked. > > Testing: jdk/jdk/jfr > > Thanks > Erik Lets try this. ------------- Marked as reviewed by mgronlun (Reviewer). 
PR Review: https://git.openjdk.org/jdk/pull/29117#pullrequestreview-3640803287 From duke at openjdk.org Thu Jan 8 21:48:11 2026 From: duke at openjdk.org (Robert Toyonaga) Date: Thu, 8 Jan 2026 21:48:11 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Wed, 7 Jan 2026 14:14:19 GMT, Markus Gr?nlund wrote: > Alternative for solving [JDK-8371014](https://bugs.openjdk.org/browse/JDK-8371014) > > Also includes a fix for [JDK-8373257](https://bugs.openjdk.org/browse/JDK-8373257) > > Testing: jdk_jfr, stress testing, manual testing with CrashOnOutOfMemoryError, tier1-6 src/hotspot/share/jfr/recorder/repository/jfrEmergencyDump.cpp line 611: > 609: if (thread->is_VM_thread()) { > 610: const VM_Operation* const operation = VMThread::vm_operation(); > 611: if (operation != nullptr && operation->type() == VM_Operation::VMOp_JFROldObject) { Is it better/possible to directly check the rotation lock instead? Maybe it's possible the thread crashed before starting the vm operation, or the lock is held by something else. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/29094#discussion_r2674002541 From ysuenaga at openjdk.org Fri Jan 9 01:19:25 2026 From: ysuenaga at openjdk.org (Yasumasa Suenaga) Date: Fri, 9 Jan 2026 01:19:25 GMT Subject: Withdrawn: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Sat, 29 Nov 2025 06:06:16 GMT, Yasumasa Suenaga wrote: > The jtreg test TestEmergencyDumpAtOOM.java runs into the following error on ppc64 platforms. > > JFR emergency dump would be kicked at `VMError::report_and_die()`, then Java thread for JFR would not work due to secondary signal handler for error reporting. > > Passed all of jdk_jfr tests on Linux AMD64. This pull request has been closed without being integrated. 
------------- PR: https://git.openjdk.org/jdk/pull/28563 From mbaesken at openjdk.org Fri Jan 9 08:09:05 2026 From: mbaesken at openjdk.org (Matthias Baesken) Date: Fri, 9 Jan 2026 08:09:05 GMT Subject: RFR: 8372321: TestBackToBackSensitive fails intermittently after JDK-8365972 In-Reply-To: References: Message-ID: On Thu, 8 Jan 2026 14:28:54 GMT, Erik Gahlin wrote: > Could I have a review of a PR that attempts to harden a test? > > Sometimes, ClassLoaderStatistics events are dropped, probably due to a bug in the RecordingStream class when starting multiple recordings simultaneously. This is not a bug related to back-to-back chunks, so I decided to use an EventFileStream instead. > > I also use TestClassLoader for verification purposes. Using PlatformClassLoader shouldn't be a problem, but it seems more prudent to have an actual object/class on the heap for the class loader that needs to be checked. > > Testing: jdk/jdk/jfr > > Thanks > Erik If you want, I can put the change into our CI and leave it there for a few days, to check if the errors we faced are gone with the change. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29117#issuecomment-3727642942 From mgronlun at openjdk.org Fri Jan 9 11:07:17 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Fri, 9 Jan 2026 11:07:17 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Thu, 8 Jan 2026 21:42:16 GMT, Robert Toyonaga wrote: > Is it better/possible to directly check the rotation lock instead? Maybe it's possible the thread crashed before starting the vm operation, or the lock is held by something else. Lock testing is inherently racy, and would also include false negatives (i.e., say the rotation lock is currently held during a normal flush / rotation by the JFR Recorder Thread, then it's perfectly fine even for the VM Thread to block waiting for it to be released).
It is only the above implication that makes it impossible for the VM Thread to wait on rotation lock release. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/29094#discussion_r2675762122 From mgronlun at openjdk.org Sun Jan 11 12:45:02 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Sun, 11 Jan 2026 12:45:02 GMT Subject: RFR: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant Message-ID: Greetings, When sampling threads in state _thread_in_native, there is a missing memory barrier when UseSystemMemoryBarrier is used, because it must be emitted manually. Testing: jdk_jfr Thanks Markus PS "threads_lock" local variable was renamed to "lock" not to confuse with the global Threads_lock. ------------- Commit messages: - 8373485 Changes: https://git.openjdk.org/jdk/pull/29155/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29155&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8373485 Stats: 11 lines in 1 file changed: 7 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/29155.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29155/head:pull/29155 PR: https://git.openjdk.org/jdk/pull/29155 From egahlin at openjdk.org Sun Jan 11 16:18:52 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Sun, 11 Jan 2026 16:18:52 GMT Subject: RFR: 8372321: TestBackToBackSensitive fails intermittently after JDK-8365972 In-Reply-To: References: Message-ID: On Fri, 9 Jan 2026 08:06:47 GMT, Matthias Baesken wrote: > If you want , I can put the change into our CI and let it there for a few days, to check if the errors we faced are gone with the change. We have been able to reproduce this in our CI as well. Before the fix, I got about 10 failures in 1000 runs. After the fix, there were zero failures. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/29117#issuecomment-3734920165 From fabrice.bibonne at courriel.eco Sun Jan 11 18:23:04 2026 From: fabrice.bibonne at courriel.eco (Fabrice Bibonne) Date: Sun, 11 Jan 2026 19:23:04 +0100 Subject: Using JFR both with ZGC degrades application throughput Message-ID: Hi all, I would like to report a case where starting JFR for an application running with ZGC causes a significant throughput degradation (compared to when JFR is not started). My context: I was writing a little web app to illustrate a case where the use of ZGC gives better throughput than G1. I benchmarked my application with Grafana k6, running with G1 and then with ZGC: the runs with ZGC gave better throughput. I wanted to go a bit further in the explanation, so I reran my benchmarks with JFR to be able to illustrate the GC gains in JMC. When I ran my web app with ZGC+JFR, I noticed a significant throughput degradation in my benchmark (which was not the case with G1+JFR). Although I did not measure an increase in overhead as such, I still wanted to report this issue because the degradation in throughput with JFR is such that it would not be usable as is on a production service. I wrote a little application (not a web one) to reproduce the problem: the application calls a little conversion service 200 times with random numbers in parallel (to behave like a web app under load and to put pressure on the GC). The conversion service (a method named `convertNumberToWords`) converts the number to a String by looking the String up in a Map with the number as the key. In order to instantiate and destroy many objects at each call, the Map is rebuilt by parsing a huge String on every call. The application ends after 200 calls. Here are the steps to reproduce: 1. Clone https://framagit.org/FBibonne/poc-java/-/tree/jfr+zgc_impact (make sure you are on branch jfr+zgc_impact) 2.
Compile it (you must include numbers200k.zip in the resources: it contains a 36 MB text file whose contents are used to create the huge String variable)
3. In the root of the repository:
3a. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops -classpath target/classes poc.java.perf.write.TestPerf #ZGC without JFR`
3b. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops -XX:StartFlightRecording -classpath target/classes poc.java.perf.write.TestPerf #ZGC with JFR`
4. The real time of the second run (with JFR) will be considerably higher than that of the first.

I ran these tests on my laptop:
- Dell Inc. Latitude 5591
- openSUSE Tumbleweed 20260108
- Kernel: 6.18.3-1-default (64-bit)
- 12 × Intel(R) Core(tm) i7-8850H CPU @ 2.60GHz
- RAM 16 GiB
- openjdk version "25.0.1" 2025-10-21
- OpenJDK Runtime Environment (build 25.0.1+8-27)
- OpenJDK 64-Bit Server VM (build 25.0.1+8-27, mixed mode, sharing)
- many tabs opened in Firefox!

I also ran it in a container (eclipse-temurin:25) on my laptop and on a Windows laptop, and came to the same conclusions; here are the measurements from the container:

| Run with  | Real time (s) |
|-----------|---------------|
| ZGC alone | 7.473 |
| ZGC + JFR | 25.075 |
| G1 alone  | 10.195 |
| G1 + JFR  | 10.450 |

After all these tests I tried to run the app with another profiler tool in order to understand where the issue is. I attach the flamegraph from running JFR+ZGC: for the worker threads of the ForkJoinPool used by Stream, the stack traces of a majority of samples have the same top lines:
- PosixSemaphore::wait
- ZPageAllocator::alloc_page_stall
- ZPageAllocator::alloc_page_inner
- ZPageAllocator::alloc_page

So many threads seem to spend their time waiting in the method ZPageAllocator::alloc_page_stall when JFR is on. The JFR periodic tasks thread also has a few samples where it waits at ZPageAllocator::alloc_page_stall. I hope this will help you find the issue. Thank you very much for reading this email until the end.
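For readers without access to the repository, the allocation pattern described above can be approximated with a small self-contained sketch. This is not the actual code from the linked repository (only `convertNumberToWords` is a name taken from the email); the real reproducer parses a ~36 MB resource file and issues 200 parallel calls, while this scaled-down stand-in just demonstrates the same rebuild-a-Map-per-call garbage pressure:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.stream.IntStream;

// Minimal stand-in for the reproducer: each call rebuilds a large Map by
// parsing a big String, creating heavy short-lived allocation pressure.
public class AllocPressure {
    // The real reproducer loads ~36 MB from numbers200k.zip; a repeated
    // pattern keeps this sketch self-contained (and much smaller).
    static final String HUGE = "1=one\n2=two\n3=three\n".repeat(20_000);

    static String convertNumberToWords(int n) {
        Map<String, String> map = new HashMap<>();   // fresh garbage per call
        for (String line : HUGE.split("\n")) {
            int eq = line.indexOf('=');
            map.put(line.substring(0, eq), line.substring(eq + 1));
        }
        return map.getOrDefault(Integer.toString(n), "unknown");
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        long resolved = IntStream.range(0, 100).parallel() // parallel "requests"
                .mapToObj(i -> convertNumberToWords(1 + rnd.nextInt(3)))
                .filter(s -> !s.equals("unknown"))
                .count();
        System.out.println("resolved: " + resolved);       // prints "resolved: 100"
    }
}
```

Running this sketch with and without `-XX:StartFlightRecording` under ZGC should exercise the same allocation-stall path the flamegraph points at, albeit at a smaller scale than the original.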
I hope this is the right place for such feedback. Let me know if I should report my problem elsewhere. Feel free to ask me more questions if you need. Thank you all for this amazing tool! -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: withJfr.html Type: application/octet-stream Size: 76123 bytes Desc: not available URL: From shade at openjdk.org Mon Jan 12 09:46:40 2026 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 12 Jan 2026 09:46:40 GMT Subject: RFR: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant In-Reply-To: References: Message-ID: On Sun, 11 Jan 2026 12:39:06 GMT, Markus Grönlund wrote: > Greetings, > > When sampling threads in state _thread_in_native, there is a missing memory barrier when UseSystemMemoryBarrier is used, because it must be emitted manually. > > Testing: jdk_jfr > > Thanks > Markus > > PS "threads_lock" local variable was renamed to "lock" not to confuse with the global Threads_lock. Man, this is confusing. Looks to me like thread states are guarded specially. Looking at `Handshake::execute`, I see the pattern is:

```
// Separate the arming of the poll in add_operation() above from
// the read of JavaThread state in the try_process() call below.
if (UseSystemMemoryBarrier) {
  SystemMemoryBarrier::emit();
} else {
  OrderAccess::fence();
}
```

This follows `HandshakeState::add_operation` -> `SafepointMechanism::arm_local_poll_release`. `arm_local_poll_release` is what `JfrSampleThread::sample_native_thread` also does. So, the fix should follow what `Handshake` does. I think you are trying to do the same, but piggy-back on `OA::fence()` already done in `JfrMutexTryLock` when `-UseSystemMemoryBarrier`?
------------- PR Review: https://git.openjdk.org/jdk/pull/29155#pullrequestreview-3649901629 From erik.gahlin at oracle.com Mon Jan 12 09:56:06 2026 From: erik.gahlin at oracle.com (Erik Gahlin) Date: Mon, 12 Jan 2026 09:56:06 +0000 Subject: Using JFR both with ZGC degrades application throughput In-Reply-To: References: Message-ID: Hi Fabrice, Thanks for reporting! Could you post the source code for the reproducer here? The 36 MB file could probably be replaced with a String::repeat expression. JFR does use some memory, which could impact available heap and performance, although the degradation you're seeing seems awfully high. Thanks Erik
From mgronlun at openjdk.org Mon Jan 12 10:48:37 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Mon, 12 Jan 2026 10:48:37 GMT Subject: RFR: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant In-Reply-To: References: Message-ID: On Mon, 12 Jan 2026 09:43:14 GMT, Aleksey Shipilev wrote: > Man, this is confusing. Looks to me like thread states are guarded specially. Looking at `Handshake::execute`, I see the pattern is:
>
> ```
> // Separate the arming of the poll in add_operation() above from
> // the read of JavaThread state in the try_process() call below.
> if (UseSystemMemoryBarrier) { > SystemMemoryBarrier::emit(); > } else { > OrderAccess::fence(); > } > ``` > > This follows `HandshakeState::add_operation` -> `SafepointMechanism::arm_local_poll_release`. `arm_local_poll_release` is what `JfrSampleThread::sample_native_thread` also does. So, the fix should follow what `Handshake` does. I think you are trying to do the same, but piggy-back on `OA::fence()` already done in `JfrMutexTryLock` when `-UseSystemMemoryBarrier`? Exactly right. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29155#issuecomment-3737916202 From egahlin at openjdk.org Mon Jan 12 11:35:06 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Mon, 12 Jan 2026 11:35:06 GMT Subject: Integrated: 8372321: TestBackToBackSensitive fails intermittently after JDK-8365972 In-Reply-To: References: Message-ID: On Thu, 8 Jan 2026 14:28:54 GMT, Erik Gahlin wrote: > Could I have a review of a PR that attempts to harden a test? > > Sometimes, ClassLoaderStatistics events are dropped, probably due to a bug in the RecordingStream class when starting multiple recordings simultaneously. This is not a bug related to back-to-back chunks, so I decided to use an EventFileStream instead. > > I also use TestClassLoader for verification purposes. Using PlatformClassLoader shouldn't be a problem, but it seems more prudent to have an actual object/class on the heap for the class loader that needs to be checked. > > Testing: jdk/jdk/jfr > > Thanks > Erik This pull request has now been integrated. 
Changeset: 556bddfd
Author: Erik Gahlin
URL: https://git.openjdk.org/jdk/commit/556bddfd9439d1bad698ab5134317ce263a36b04
Stats: 47 lines in 1 file changed: 21 ins; 14 del; 12 mod

8372321: TestBackToBackSensitive fails intermittently after JDK-8365972

Reviewed-by: mgronlun

-------------

PR: https://git.openjdk.org/jdk/pull/29117

From thomas.schatzl at oracle.com Mon Jan 12 13:18:47 2026
From: thomas.schatzl at oracle.com (Thomas Schatzl) Date: Mon, 12 Jan 2026 14:18:47 +0100 Subject: Using JFR both with ZGC degrades application throughput In-Reply-To: References: Message-ID: Hi,

while not being able to answer the question about why using JFR takes so much additional time, when reading about your benchmark setup the following things came to my mind:

* -XX:+UseCompressedOops for ZGC does nothing (ZGC does not support compressed oops at all), and G1 will automatically use it. You can leave it off.

* G1 having a significantly worse throughput than ZGC is very rare; even so, the extent you show is quite large. Taking some of the content together (4g heap, Maps, huge String variables) indicates that you might have run into a well-known pathology of G1 with large objects: the application might waste up to 50% of the heap due to these humongous objects [0]. G1 might work better in JDK 26 too, as an enhancement for one particular case has been added. More is being worked on.

TL;DR: Your application might run much better with a large(r) G1HeapRegionSize setting. Or just upgrading to JDK 26.

* While ZGC does not have that, in some cases extreme, memory wastage for large allocations, there is still some. Adding JFR might just push it over the edge (the stacks you showed are about finding a new empty page/region for allocation, failing to do so, doing a GC, stalling and waiting).
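For example, the region-size suggestion above could be tried by reusing the run commands from the report (the 8m value is only an initial guess for this workload, not a general recommendation):

```shell
# Baseline from the report (ZGC; -XX:+UseCompressedOops dropped, it is a no-op for ZGC)
time java -Xmx4g -XX:+UseZGC -classpath target/classes poc.java.perf.write.TestPerf

# G1 with a fixed heap and larger regions to reduce humongous-object waste;
# pick a power of two comfortably above the large allocations
time java -Xms4g -Xmx4g -XX:+UseG1GC -XX:G1HeapRegionSize=8m \
    -classpath target/classes poc.java.perf.write.TestPerf
```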
Hth, Thomas [0] https://tschatzl.github.io/2021/11/15/heap-regions-x-large.html On 11.01.26 19:23, Fabrice Bibonne wrote: > Hi all, > > ?I would like to report a case where starting jfr for an application > running with zgc causes a significant throughput degradation (compared > to when JFR is not started). > > ?My context : I was writing a little web app to illustrate a case where > the use of ZGC gives a better throughput than with G1. I benchmarked > with grafana k6 my application running with G1 and my application > running with ZGC ?: the runs with ZGC gave better throughputs. I wanted > to go a bit further in explanation so I began again my benchmarks with > JFR to be able to illustrate GC gains in JMC. When I ran my web app with > ZGC+JFR, I noticed a significant throughput degradation in my benchmark > (which was not the case with G1+JFR). > > ?Although I did not measure an increase in overhead as such, I still > wanted to report this issue because the degradation in throughput with > JFR is such that it would not be usable as is on a production service. > > I wrote a little application (not a web one) to reproduce the problem : > the application calls a little conversion service 200 times with random > numbers in parallel (to be like a web app in charge and to pressure GC). > The conversion service (a method named `convertNumberToWords`) convert > the number in a String looking for the String in a Map with the number > as th key. In order to instantiate and destroy many objects at each > call, the map is built parsing a huge String at each call. Application > ends after 200 calls. > > Here are the step to reproduce : > 1. Clone https://framagit.org/FBibonne/poc-java/-/tree/jfr+zgc_impact > (be aware to be on branch jfr+zgc_impact) > 2. Compile it (you must include numbers200k.zip in resources : it > contains a 36 Mo text files whose contents are used to create the huge > String variable) > 3. in the root of repository : > 3a. 
Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops -classpath > target/classes poc.java.perf.write.TestPerf #ZGC without JFR` > 3b. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops - > XX:StartFlightRecording -classpath target/classes > poc.java.perf.write.TestPerf #ZGC with JFR` > 4. The real time of the second run (with JFR) will be considerably > higher than that of the first > > I ran these tests on my laptop : > - Dell Inc. Latitude 5591 > - openSUSE Tumbleweed 20260108 > - Kernel : 6.18.3-1-default (64-bit) > - 12 ? Intel? Core? i7-8850H CPU @ 2.60GHz > - RAM 16 Gio > - openjdk version "25.0.1" 2025-10-21 > - OpenJDK Runtime Environment (build 25.0.1+8-27) > - OpenJDK 64-Bit Server VM (build 25.0.1+8-27, mixed mode, sharing) > - many tabs opened in firefox ! > > I also ran it in a container (eclipse-temurin:25) on my laptop and with > a windows laptop and came to the same conclusions : here are the > measurements from the container : > > | Run with ?| Real time (s) | > |-----------|---------------| > | ZGC alone | 7.473 ? ? ? ? | > | ZGC + jfr | 25.075 ? ? ? ?| > | G1 alone ?| 10.195 ? ? ? ?| > | G1 + jfr ?| 10.450 ? ? ? ?| > > > After all these tests I tried to run the app with an other profiler tool > in order to understand where is the issue. I join the flamegraph when > running jfr+zgc : for the worker threads of the ForkJoinPool of Stream, > stack traces of a majority of samples have the same top lines : > - PosixSemaphore::wait > - ZPageAllocator::alloc_page_stall > - ZPageAllocator::alloc_page_inner > - ZPageAllocator::alloc_page > > So many thread seem to spent their time waiting in the method > ZPageAllocator::alloc_page_stall when the JFR is on. The JFR periodic > tasks threads has also a few samples where it waits at > ZPageAllocator::alloc_page_stall. I hope this will help you to find the > issue. > > Thank you very much for reading this email until the end. I hope this is > the good place for such a feedback. 
Let me know if I must report my
> problem elsewhere. Be free to ask me more questions if you need.
>
> Thank you all for this amazing tool !
>

From mgronlun at openjdk.org Mon Jan 12 13:52:58 2026
From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Mon, 12 Jan 2026 13:52:58 GMT Subject: RFR: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant In-Reply-To: References: Message-ID: On Sun, 11 Jan 2026 12:39:06 GMT, Markus Grönlund wrote:

> Greetings,
>
> When sampling threads in state _thread_in_native, there is a missing memory barrier when UseSystemMemoryBarrier is used, because it must be emitted manually.
>
> Testing: jdk_jfr
>
> Thanks
> Markus
>
> PS "threads_lock" local variable was renamed to "lock" not to confuse with the global Threads_lock.

I am going to have to move (back) the Threads_lock acquisition (as part of https://bugs.openjdk.org/browse/JDK-8373106) to where it was placed originally, before JFR Cooperative Sampling. Hence, I will update this to exactly mirror the Handshake pattern.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/29155#issuecomment-3738645122

From mgronlun at openjdk.org Mon Jan 12 14:00:23 2026
From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Mon, 12 Jan 2026 14:00:23 GMT Subject: RFR: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant [v2] In-Reply-To: References: Message-ID: > Greetings,
>
> When sampling threads in state _thread_in_native, there is a missing memory barrier when UseSystemMemoryBarrier is used, because it must be emitted manually.
>
> Testing: jdk_jfr
>
> Thanks
> Markus
>
> PS "threads_lock" local variable was renamed to "lock" not to confuse with the global Threads_lock.
Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision:

  explicit fences

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/29155/files
  - new: https://git.openjdk.org/jdk/pull/29155/files/9b8ed440..8e951ac8

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=29155&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=29155&range=00-01

Stats: 5 lines in 1 file changed: 2 ins; 2 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/29155.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/29155/head:pull/29155

PR: https://git.openjdk.org/jdk/pull/29155

From shade at openjdk.org Mon Jan 12 14:19:50 2026
From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 12 Jan 2026 14:19:50 GMT Subject: RFR: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant [v2] In-Reply-To: References: Message-ID: On Mon, 12 Jan 2026 14:00:23 GMT, Markus Grönlund wrote:

>> Greetings,
>>
>> When sampling threads in state _thread_in_native, there is a missing memory barrier when UseSystemMemoryBarrier is used, because it must be emitted manually.
>>
>> Testing: jdk_jfr
>>
>> Thanks
>> Markus
>>
>> PS "threads_lock" local variable was renamed to "lock" not to confuse with the global Threads_lock.
>
> Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision:
>
>   explicit fences

All right, that reads better, thanks.

-------------

Marked as reviewed by shade (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/29155#pullrequestreview-3650994267

From fabrice.bibonne at courriel.eco Mon Jan 12 15:59:09 2026
From: fabrice.bibonne at courriel.eco (Fabrice Bibonne) Date: Mon, 12 Jan 2026 16:59:09 +0100 Subject: Using JFR both with ZGC degrades application throughput In-Reply-To: References: Message-ID: <80f97dba0b628057de3b7cd2ef4c3bea@courriel.eco> Here is a single source code file for the reproducer (the big String is generated at startup, as you suggested). It changes the results a little, but the run with ZGC + JFR is still taking a lot of time.

Thank you for having a look.

Fabrice

Le 2026-01-12 10:56, Erik Gahlin a écrit :

> Hi Fabrice,
>
> Thanks for reporting!
>
> Could you post the source code for the reproducer here? The 36 MB file
> could probably be replaced with a String::repeat expression.
>
> JFR does use some memory, which could impact available heap and
> performance, although the degradation you're seeing seems awfully high.
>
> Thanks
> Erik
>
> ________________________________________
> From: hotspot-jfr-dev on behalf of
> Fabrice Bibonne
> Sent: Sunday, January 11, 2026 7:23 PM
> To: hotspot-jfr-dev at openjdk.org
> Subject: Using JFR both with ZGC degrades application throughput
>
> Hi all,
>
> I would like to report a case where starting jfr for an application
> running with zgc causes a significant throughput degradation (compared
> to when JFR is not started).
>
> My context : I was writing a little web app to illustrate a case where
> the use of ZGC gives a better throughput than with G1. I benchmarked
> with grafana k6 my application running with G1 and my application
> running with ZGC : the runs with ZGC gave better throughputs. I wanted
> to go a bit further in explanation so I began again my benchmarks with
> JFR to be able to illustrate GC gains in JMC. When I ran my web app
> with ZGC+JFR, I noticed a significant throughput degradation in my
> benchmark (which was not the case with G1+JFR).
> > Although I did not measure an increase in overhead as such, I still > wanted to report this issue because the degradation in throughput with > JFR is such that it would not be usable as is on a production service. > > I wrote a little application (not a web one) to reproduce the problem : > the application calls a little conversion service 200 times with random > numbers in parallel (to be like a web app in charge and to pressure > GC). The conversion service (a method named `convertNumberToWords`) > convert the number in a String looking for the String in a Map with the > number as th key. In order to instantiate and destroy many objects at > each call, the map is built parsing a huge String at each call. > Application ends after 200 calls. > > Here are the step to reproduce : > 1. Clone https://framagit.org/FBibonne/poc-java/-/tree/jfr+zgc_impact > (be aware to be on branch jfr+zgc_impact) > 2. Compile it (you must include numbers200k.zip in resources : it > contains a 36 Mo text files whose contents are used to create the huge > String variable) > 3. in the root of repository : > 3a. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops -classpath > target/classes poc.java.perf.write.TestPerf #ZGC without JFR` > 3b. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops > -XX:StartFlightRecording -classpath target/classes > poc.java.perf.write.TestPerf #ZGC with JFR` > 4. The real time of the second run (with JFR) will be considerably > higher than that of the first > > I ran these tests on my laptop : > - Dell Inc. Latitude 5591 > - openSUSE Tumbleweed 20260108 > - Kernel : 6.18.3-1-default (64-bit) > - 12 ? Intel(R) Core(tm) i7-8850H CPU @ 2.60GHz > - RAM 16 Gio > - openjdk version "25.0.1" 2025-10-21 > - OpenJDK Runtime Environment (build 25.0.1+8-27) > - OpenJDK 64-Bit Server VM (build 25.0.1+8-27, mixed mode, sharing) > - many tabs opened in firefox ! 
> > I also ran it in a container (eclipse-temurin:25) on my laptop and with > a windows laptop and came to the same conclusions : here are the > measurements from the container : > > | Run with | Real time (s) | > |-----------|---------------| > | ZGC alone | 7.473 | > | ZGC + jfr | 25.075 | > | G1 alone | 10.195 | > | G1 + jfr | 10.450 | > > After all these tests I tried to run the app with an other profiler > tool in order to understand where is the issue. I join the flamegraph > when running jfr+zgc : for the worker threads of the ForkJoinPool of > Stream, stack traces of a majority of samples have the same top lines : > - PosixSemaphore::wait > - ZPageAllocator::alloc_page_stall > - ZPageAllocator::alloc_page_inner > - ZPageAllocator::alloc_page > > So many thread seem to spent their time waiting in the method > ZPageAllocator::alloc_page_stall when the JFR is on. The JFR periodic > tasks threads has also a few samples where it waits at > ZPageAllocator::alloc_page_stall. I hope this will help you to find the > issue. > > Thank you very much for reading this email until the end. I hope this > is the good place for such a feedback. Let me know if I must report my > problem elsewhere. Be free to ask me more questions if you need. > > Thank you all for this amazing tool ! -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: TestPerf.java Type: text/x-c Size: 2744 bytes Desc: not available URL: From mgronlun at openjdk.org Mon Jan 12 21:38:18 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Mon, 12 Jan 2026 21:38:18 GMT Subject: RFR: 8373106: JFR suspend/resume deadlock on macOS in pthreads library Message-ID: Greetings, this change effectively reverts [JDK-8358429](https://bugs.openjdk.org/browse/JDK-8358429), which was an attempt to minimize the time the Threads_lock is held during JFR sampling. 
That change was premised on the two reasons, known at the time, for why we held the Threads_lock during the entire sampling interval. After this change, subtle deadlocks happened on macOS, very intermittently, in the pthreads library, in that a suspended thread could be the owner of an internal process lock, a process lock that was then needed when sending the pthread_kill signal to resume it.

By rolling back to holding the Threads_lock for the entire duration of the sampling interval (as we have done for many, many years in the era before JFR Cooperative Sampling), we prevent JavaThreads from calling os::create_thread().

I have decided to roll back to the version we know works, instead of attempting a more granular solution, perhaps using sigprocmask() to create a critical section around pthread_create in os_bsd.cpp. This is something we might want to do later, but more time is then needed for falsifying/verifying the correct fix.

Testing: jdk_jfr, stress testing

Thanks
Markus

PS Indirect barriers removed are explicitly re-inserted as per [JDK-8373485](https://bugs.openjdk.org/browse/JDK-8373485)

-------------

Commit messages:
 - 8373106

Changes: https://git.openjdk.org/jdk/pull/29178/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29178&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8373106
Stats: 62 lines in 1 file changed: 12 ins; 18 del; 32 mod
Patch: https://git.openjdk.org/jdk/pull/29178.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/29178/head:pull/29178

PR: https://git.openjdk.org/jdk/pull/29178

From ysuenaga at openjdk.org Mon Jan 12 23:44:59 2026
From: ysuenaga at openjdk.org (Yasumasa Suenaga) Date: Mon, 12 Jan 2026 23:44:59 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Wed, 7 Jan 2026 14:14:19 GMT, Markus Grönlund wrote:

> Alternative for solving [JDK-8371014](https://bugs.openjdk.org/browse/JDK-8371014)
>
> Also includes a fix
for [JDK-8373257](https://bugs.openjdk.org/browse/JDK-8373257)
>
> Testing: jdk_jfr, stress testing, manual testing with CrashOnOutOfMemoryError, tier1-6

Thanks a lot for working on this!

-------------

Marked as reviewed by ysuenaga (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/29094#pullrequestreview-3653157591

From fabrice.bibonne at courriel.eco Tue Jan 13 04:36:58 2026
From: fabrice.bibonne at courriel.eco (Fabrice Bibonne) Date: Tue, 13 Jan 2026 05:36:58 +0100 Subject: Using JFR both with ZGC degrades application throughput In-Reply-To: References: Message-ID: <504c670a6250f5c2ee5e27a8bed97980@courriel.eco> Thank you for your advice; let me just add a few clarifications:

* for `-XX:+UseCompressedOops`, I must admit I did not know this option: I added it because JDK Mission Control warned me about it in the "Automated analysis result" after a first try (<>)

* it is true that the application wastes time in GC pauses (46.6% of the time with G1): I wanted an example app which uses the GC a lot. Maybe this is a little too much compared to real apps (even if, for some of them, we may wonder...).

* the stack I showed about finding a new empty page/region for allocation is present in both cases (with JFR and without JFR). But in the case with JFR, it is much wider: it takes many more samples.

Best regards,

Fabrice

Le 2026-01-12 14:18, Thomas Schatzl a écrit :

> Hi,
>
> while not being able to answer the question about why using JFR takes
> so much additional time, when reading about your benchmark setup the
> following things came to my mind:
>
> * -XX:+UseCompressedOops for ZGC does nothing (ZGC does not support
> compressed oops at all), and G1 will automatically use it. You can
> leave it off.
>
> * G1 having a significantly worse throughput than ZGC is very rare:
> even then the extent you show is quite large.
Taking some of content > together (4g heap, Maps, huge string variables) indicates that you > might have run into a well-known pathology of G1 with large objects: > the application might waste up to 50% of your application due to these > humongous objects [0 [1]]. > G1 might work better in JDK 26 too as some enhancement to some > particular case has been added. More is being worked on. > > TL;DR: Your application might run much better with a large(r) > G1HeapRegionSize setting. Or just upgrading to JDK 26. > > * While ZGC does not have that in some cases extreme memory wastage for > large allocations, there is still some. Adding JFR might just push it > over the edge (the stack you showed are about finding a new empty > page/region for allocation, failing to do so, doing a GC, stalling and > waiting). > > Hth, > Thomas > > [0] https://tschatzl.github.io/2021/11/15/heap-regions-x-large.html > > On 11.01.26 19:23, Fabrice Bibonne wrote: > >> Hi all, >> >> I would like to report a case where starting jfr for an application >> running with zgc causes a significant throughput degradation (compared >> to when JFR is not started). >> >> My context : I was writing a little web app to illustrate a case where >> the use of ZGC gives a better throughput than with G1. I benchmarked >> with grafana k6 my application running with G1 and my application >> running with ZGC : the runs with ZGC gave better throughputs. I >> wanted to go a bit further in explanation so I began again my >> benchmarks with JFR to be able to illustrate GC gains in JMC. When I >> ran my web app with ZGC+JFR, I noticed a significant throughput >> degradation in my benchmark (which was not the case with G1+JFR). >> >> Although I did not measure an increase in overhead as such, I still >> wanted to report this issue because the degradation in throughput with >> JFR is such that it would not be usable as is on a production service. 
>> >> I wrote a little application (not a web one) to reproduce the problem >> : the application calls a little conversion service 200 times with >> random numbers in parallel (to be like a web app in charge and to >> pressure GC). The conversion service (a method named >> `convertNumberToWords`) convert the number in a String looking for the >> String in a Map with the number as th key. In order to instantiate and >> destroy many objects at each call, the map is built parsing a huge >> String at each call. Application ends after 200 calls. >> >> Here are the step to reproduce : >> 1. Clone https://framagit.org/FBibonne/poc-java/-/tree/jfr+zgc_impact >> (be aware to be on branch jfr+zgc_impact) >> 2. Compile it (you must include numbers200k.zip in resources : it >> contains a 36 Mo text files whose contents are used to create the huge >> String variable) >> 3. in the root of repository : >> 3a. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops >> -classpath target/classes poc.java.perf.write.TestPerf #ZGC without >> JFR` >> 3b. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops - >> XX:StartFlightRecording -classpath target/classes >> poc.java.perf.write.TestPerf #ZGC with JFR` >> 4. The real time of the second run (with JFR) will be considerably >> higher than that of the first >> >> I ran these tests on my laptop : >> - Dell Inc. Latitude 5591 >> - openSUSE Tumbleweed 20260108 >> - Kernel : 6.18.3-1-default (64-bit) >> - 12 ? Intel(R) Core(tm) i7-8850H CPU @ 2.60GHz >> - RAM 16 Gio >> - openjdk version "25.0.1" 2025-10-21 >> - OpenJDK Runtime Environment (build 25.0.1+8-27) >> - OpenJDK 64-Bit Server VM (build 25.0.1+8-27, mixed mode, sharing) >> - many tabs opened in firefox ! 
>> >> I also ran it in a container (eclipse-temurin:25) on my laptop and >> with a windows laptop and came to the same conclusions : here are the >> measurements from the container : >> >> | Run with | Real time (s) | >> |-----------|---------------| >> | ZGC alone | 7.473 | >> | ZGC + jfr | 25.075 | >> | G1 alone | 10.195 | >> | G1 + jfr | 10.450 | >> >> After all these tests I tried to run the app with an other profiler >> tool in order to understand where is the issue. I join the flamegraph >> when running jfr+zgc : for the worker threads of the ForkJoinPool of >> Stream, stack traces of a majority of samples have the same top lines >> : >> - PosixSemaphore::wait >> - ZPageAllocator::alloc_page_stall >> - ZPageAllocator::alloc_page_inner >> - ZPageAllocator::alloc_page >> >> So many thread seem to spent their time waiting in the method >> ZPageAllocator::alloc_page_stall when the JFR is on. The JFR periodic >> tasks threads has also a few samples where it waits at >> ZPageAllocator::alloc_page_stall. I hope this will help you to find >> the issue. >> >> Thank you very much for reading this email until the end. I hope this >> is the good place for such a feedback. Let me know if I must report my >> problem elsewhere. Be free to ask me more questions if you need. >> >> Thank you all for this amazing tool ! Links: ------ [1] https://tschatzl.github.io/2021/11/15/heap-regions-x-large.html -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From thomas.schatzl at oracle.com Tue Jan 13 10:06:07 2026 From: thomas.schatzl at oracle.com (Thomas Schatzl) Date: Tue, 13 Jan 2026 11:06:07 +0100 Subject: Using JFR both with ZGC degrades application throughput In-Reply-To: <504c670a6250f5c2ee5e27a8bed97980@courriel.eco> References: <504c670a6250f5c2ee5e27a8bed97980@courriel.eco> Message-ID: <948f3b50-80a2-4d89-a341-d9908a07f862@oracle.com> Hi, On 13.01.26 05:36, Fabrice Bibonne wrote: > Thank you for your advise, I just give a few precisions in a few lines? : > > * for `-XX:+UseCompressedOops`, I must admit I do not know this option : > I add it because JDK Mission control warned me about it in "Automated > analysis result" after a fisrt try (< [...].Use the JVM argument '-XX:+UseCompressedOops' to enable this > feature>>) Maybe JMC should not provide this hint for ZGC then (not directed towards you). > > * it is true that application waste time in GC pauses (46,6% of time > with G1) : I wanted an example app which uses GC a lot. Maybe this is a > little too much compared to real apps (even if for some of them, we may > wonder...). What I am saying is that while the results are as they are for you, I suspect that the result is not representative for G1 as it exercises a pathology that could (and unfortunately must if it is really the case) be resolved by a single command line switch by the user. The G1 GC algorithm would need prior knowledge of the application it is running to automatically resolve this. Having had a look at G1 behavior, the reason for the low performance is likely due to heap sizing heuristics issues, G1 does not expand the heap as aggressively as ZGC. The upside obviously is that it uses (much) less memory. ;) More technical explanation: * in presence of full collection ([0], in the process of being fixed right now), G1 does not expand the heap, running with maybe half of what ZGC uses. This is due to the behavior of the application. 
* even if the bug is worked around via command line options (-XX:MaxHeapFreeRatio=100), the runtime of the application is too short: even with that fix applied, it takes too long to get to the same heap size as ZGC (for reasons we can discuss if you want).

* the mentioned issue with the large objects, i.e. G1 wasting too much memory, also contributes.

Interestingly, I only observed this on slower systems; these issues do not show on faster ones, e.g. on some x64 workstation (limited to 10 threads). On that workstation, G1 is 2x faster than ZGC with the settings you gave already. However, on some smallish AArch64 VM it is around the same performance (slightly slower). This is probably what you are seeing on your laptop (which may also experience aggressive throttling without precautions).

TL;DR: If you set the minimum heap size and the region size, G1 is 2x faster than ZGC (with -Xms4g -Xmx4g -XX:G1HeapRegionSize=8m) on that slower AArch64 machine too.

(Fwiw, for maximum throughput we recommend setting minimum and maximum heap size to the same value irrespective of the garbage collector; see the recommendations in our performance guide [1]. It also describes the issue with humongous objects. We are working on improving both issues right now.)

Another observation is that with ZGC, although overall throughput is faster than with G1 in your original example, its allocation stalls are in the range of hundreds of milliseconds, while G1 pauses are at most 50ms. So the "experience" with that original web app may be better with G1 even if it is slower overall :P (We do not recommend running latency-oriented programs at that CPU load level either way, but just noticing.)

> * the stack I showed about finding a new empty page/region allocation is present in both cases (with jfr and without jfr). But in the case with jfr, it is much more wider : it takes much more samples.

Problems tend to exacerbate themselves, i.e.
after a certain threshold of allocation rate beyond what it can sustain, performance can quickly (non-linearly) deteriorate, e.g. because of the need to use different, slower algorithms. Without JFR I am already seeing that almost all GCs are caused by allocation stalls. Adding to that will not help.

When looking around in the ZGC logs a bit, with StartFlightRecording there seems to be much more so-called in-place object movement (i.e. instead of copying live objects to a new place and then freeing the old, now empty space, the objects are moved "down" the heap to fill gaps), which is a lot more expensive. This shows in garbage collection pauses, changing from hundreds of ms to seconds.

As mentioned above, it looks like just that little extra memory usage causes ZGC to go into some very slow mode to free memory and avoid OOME.

Hth,
  Thomas

[0] https://bugs.openjdk.org/browse/JDK-8238686
[1] https://docs.oracle.com/en/java/javase/25/gctuning/garbage-first-garbage-collector-tuning.html

>
> Best regards,
>
> Fabrice
>
>
> Le 2026-01-12 14:18, Thomas Schatzl a écrit :
>
>> Hi,
>>
>> while not being able to answer the question about why using JFR
>> takes so much additional time, when reading about your benchmark setup
>> the following things came to my mind:
>>
>> * -XX:+UseCompressedOops for ZGC does nothing (ZGC does not support
>> compressed oops at all), and G1 will automatically use it. You can
>> leave it off.
>>
>> * G1 having a significantly worse throughput than ZGC is very rare:
>> even then the extent you show is quite large. Taking some of content
>> together (4g heap, Maps, huge string variables) indicates that you
>> might have run into a well-known pathology of G1 with large objects:
>> the application might waste up to 50% of your application due to these
>> humongous objects [0].
>> G1 might work better in JDK 26 too as some enhancement to some
>> particular case has been added. More is being worked on.
>> >> TL;DR: Your application might run much better with a large(r) >> G1HeapRegionSize setting. Or just upgrading to JDK 26. >> >> * While ZGC does not have that in some cases extreme memory wastage >> for large allocations, there is still some. Adding JFR might just push >> it over the edge (the stack you showed are about finding a new empty >> page/region for allocation, failing to do so, doing a GC, stalling and >> waiting). >> >> Hth, >> ? Thomas >> >> [0] https://tschatzl.github.io/2021/11/15/heap-regions-x-large.html >> >> >> On 11.01.26 19:23, Fabrice Bibonne wrote: >>> Hi all, >>> >>> ??I would like to report a case where starting jfr for an application >>> running with zgc causes a significant throughput degradation >>> (compared to when JFR is not started). >>> >>> ??My context : I was writing a little web app to illustrate a case >>> where the use of ZGC gives a better throughput than with G1. I >>> benchmarked with grafana k6 my application running with G1 and my >>> application running with ZGC ?: the runs with ZGC gave better >>> throughputs. I wanted to go a bit further in explanation so I began >>> again my benchmarks with JFR to be able to illustrate GC gains in >>> JMC. When I ran my web app with ZGC+JFR, I noticed a significant >>> throughput degradation in my benchmark (which was not the case with >>> G1+JFR). >>> >>> ??Although I did not measure an increase in overhead as such, I still >>> wanted to report this issue because the degradation in throughput >>> with JFR is such that it would not be usable as is on a production >>> service. >>> >>> I wrote a little application (not a web one) to reproduce the >>> problem : the application calls a little conversion service 200 times >>> with random numbers in parallel (to be like a web app in charge and >>> to pressure GC). The conversion service (a method named >>> `convertNumberToWords`) convert the number in a String looking for >>> the String in a Map with the number as th key. 
In order to >>> instantiate and destroy many objects at each call, the map is built by >>> parsing a huge String at each call. The application ends after 200 calls. >>> >>> Here are the steps to reproduce: >>> 1. Clone https://framagit.org/FBibonne/poc-java/-/tree/jfr+zgc_impact >>> (make >>> sure you are on branch jfr+zgc_impact) >>> 2. Compile it (you must include numbers200k.zip in resources: it >>> contains a 36 MB text file whose contents are used to create the >>> huge String variable) >>> 3. In the root of the repository: >>> 3a. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops -classpath target/classes poc.java.perf.write.TestPerf #ZGC without JFR` >>> 3b. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops -XX:StartFlightRecording -classpath target/classes >>> poc.java.perf.write.TestPerf #ZGC with JFR` >>> 4. The real time of the second run (with JFR) will be considerably >>> higher than that of the first >>> >>> I ran these tests on my laptop: >>> - Dell Inc. Latitude 5591 >>> - openSUSE Tumbleweed 20260108 >>> - Kernel: 6.18.3-1-default (64-bit) >>> - 12 × Intel® Core™ i7-8850H CPU @ 2.60GHz >>> - RAM 16 GiB >>> - openjdk version "25.0.1" 2025-10-21 >>> - OpenJDK Runtime Environment (build 25.0.1+8-27) >>> - OpenJDK 64-Bit Server VM (build 25.0.1+8-27, mixed mode, sharing) >>> - many tabs open in Firefox! >>> >>> I also ran it in a container (eclipse-temurin:25) on my laptop and >>> on a Windows laptop and came to the same conclusions: here are the >>> measurements from the container: >>> >>> | Run with  | Real time (s) | >>> |-----------|---------------| >>> | ZGC alone | 7.473         | >>> | ZGC + jfr | 25.075        | >>> | G1 alone  | 10.195        | >>> | G1 + jfr  | 10.450        | >>> >>> >>> After all these tests I tried to run the app with another profiler >>> tool in order to understand where the issue is. 
I attach the flamegraph >>> from the jfr+zgc run: for the worker threads of the ForkJoinPool of >>> Stream, the stack traces of a majority of samples have the same top lines: >>> - PosixSemaphore::wait >>> - ZPageAllocator::alloc_page_stall >>> - ZPageAllocator::alloc_page_inner >>> - ZPageAllocator::alloc_page >>> >>> So many threads seem to spend their time waiting in the method >>> ZPageAllocator::alloc_page_stall when JFR is on. The JFR periodic >>> task thread also has a few samples where it waits at >>> ZPageAllocator::alloc_page_stall. I hope this will help you to find >>> the issue. >>> >>> Thank you very much for reading this email until the end. I hope this >>> is the right place for such feedback. Let me know if I should report >>> my problem elsewhere. Feel free to ask me more questions if you need. >>> >>> Thank you all for this amazing tool! >>> >>> From egahlin at openjdk.org Tue Jan 13 10:33:43 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Tue, 13 Jan 2026 10:33:43 GMT Subject: RFR: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant [v2] In-Reply-To: References: Message-ID: On Mon, 12 Jan 2026 14:00:23 GMT, Markus Grönlund wrote: >> Greetings, >> >> When sampling threads in state _thread_in_native, there is a missing memory barrier when UseSystemMemoryBarrier is used, because it must be emitted manually. >> >> Testing: jdk_jfr >> >> Thanks >> Markus >> >> PS The "threads_lock" local variable was renamed to "lock" so as not to be confused with the global Threads_lock. > > Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision: > > explicit fences Marked as reviewed by egahlin (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/29155#pullrequestreview-3655069713 From egahlin at openjdk.org Tue Jan 13 10:37:31 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Tue, 13 Jan 2026 10:37:31 GMT Subject: RFR: 8373106: JFR suspend/resume deadlock on macOS in pthreads library In-Reply-To: References: Message-ID: On Mon, 12 Jan 2026 21:29:26 GMT, Markus Grönlund wrote: > Greetings, > > this change effectively reverts [JDK-8358429](https://bugs.openjdk.org/browse/JDK-8358429), which was an attempt to minimize the time the Threads_lock is held during JFR sampling. That change was premised on the two reasons, known at the time, for why we held the Threads_lock during the entire sampling interval. > > After this change, subtle deadlocks happened on macOS, very intermittently, in the pthreads library, in that a suspended thread could be the owner of an internal process lock, a process lock that was then needed when sending the pthread_kill signal to resume it. > > By rolling back to holding the Threads_lock for the entire duration of the sampling interval (like we have done for many, many years in the era before JFR Cooperative Sampling), we prevent JavaThreads from calling os::create_thread(). > > I have decided to roll back the solution to the version we know works, instead of attempting a more granular solution, perhaps using sigprocmask() to create a critical section around pthread_create in os_bsd.cpp. This is something we might want to do later, but more time is then needed for falsifying / verifying the correct fix. > > Testing: jdk_jfr, stress testing > > Thanks > Markus > > PS Indirect barriers removed are explicitly re-inserted as per [JDK-8373485](https://bugs.openjdk.org/browse/JDK-8373485) Marked as reviewed by egahlin (Reviewer). 
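[Editor's note: for readers curious what the "more granular solution" mentioned above might look like, here is a rough, hypothetical sketch. The function name and the choice of signal are illustrative only; this is not the actual os_bsd.cpp change, just an illustration of blocking the sampler's suspend signal around pthread_create():]

```cpp
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

// Hypothetical sketch (not HotSpot code): block the sampler's
// suspend signal while pthread_create() runs, so this thread cannot
// be suspended while pthreads' internal process lock is held.
// 'suspend_sig' stands in for whatever signal the sampler uses.
static int create_thread_in_critical_section(pthread_t* t,
                                             void* (*fn)(void*),
                                             void* arg,
                                             int suspend_sig) {
  sigset_t block, old;
  sigemptyset(&block);
  sigaddset(&block, suspend_sig);
  pthread_sigmask(SIG_BLOCK, &block, &old);   // enter critical section
  int result = pthread_create(t, NULL, fn, arg);
  pthread_sigmask(SIG_SETMASK, &old, NULL);   // restore previous mask
  return result;
}

static void* trivial_worker(void* arg) {
  (void)arg;
  return NULL;
}
```

The masked window covers exactly the interval in which pthread_create() may hold the library-internal lock, which is what makes the suspend/resume handshake safe again.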
------------- PR Review: https://git.openjdk.org/jdk/pull/29178#pullrequestreview-3655081510 From mgronlun at openjdk.org Tue Jan 13 11:45:28 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 11:45:28 GMT Subject: RFR: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant [v2] In-Reply-To: References: Message-ID: <0QURWowNF1yaO7n1MYUHNS-MRVgvXrqX4GXP64R3vQE=.15dd2eea-5b51-41ef-9b19-5c9fd2da670b@github.com> On Mon, 12 Jan 2026 14:16:51 GMT, Aleksey Shipilev wrote: >> Markus Gr?nlund has updated the pull request incrementally with one additional commit since the last revision: >> >> explicit fences > > All right, that reads better, thanks. Thanks for your reviews @shipilev and @egahlin! ------------- PR Comment: https://git.openjdk.org/jdk/pull/29155#issuecomment-3743867509 From mgronlun at openjdk.org Tue Jan 13 11:48:07 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 11:48:07 GMT Subject: Integrated: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant In-Reply-To: References: Message-ID: On Sun, 11 Jan 2026 12:39:06 GMT, Markus Gr?nlund wrote: > Greetings, > > When sampling threads in state _thread_in_native, there is a missing memory barrier when UseSystemMemoryBarrier is used, because it must be emitted manually. > > Testing: jdk_jfr > > Thanks > Markus > > PS "threads_lock" local variable was renamed to "lock" not to confuse with the global Threads_lock. This pull request has now been integrated. 
Changeset: 543a9722 Author: Markus Gr?nlund URL: https://git.openjdk.org/jdk/commit/543a972222118155e4c72c6f2d32d154c5dfd442 Stats: 11 lines in 1 file changed: 7 ins; 0 del; 4 mod 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant Reviewed-by: shade, egahlin ------------- PR: https://git.openjdk.org/jdk/pull/29155 From mgronlun at openjdk.org Tue Jan 13 12:05:41 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 12:05:41 GMT Subject: RFR: 8373106: JFR suspend/resume deadlock on macOS in pthreads library [v2] In-Reply-To: References: Message-ID: > Greetings, > > this change effectively reverts [JDK-8358429](https://bugs.openjdk.org/browse/JDK-8358429), which was an attempt to minimize the time the Threads_lock is held during JFR sampling. That change was premised on the, at the time, two known reasons for why we held the Threads_lock during the entire sampling interval. > > After this change, subtle deadlocks happened on macOS, very intermittently, in the pthreads library, in that a suspended thread could be the owner of an internal process lock, a process lock that was then needed when sending pthread_kill signal to resume it. > > By rolling back to holding the Threads_lock for the entire duration of the sampling interval (like we have done for many many years in the era before JFR Cooperative Sampling), we prevent JavaThreads from calling os::create_thread(). > > I have decided to rollback the solution to the version we know work, instead of attempting a more granular solution, perhaps using sigprocmask() to create a critical section around pthread_create in os_bsd.cpp. This is something we might want to do later, but more time is then needed for falsifying / verifying the correct fix. 
> > Testing: jdk_jfr, stress testing > > Thanks > Markus > > PS Indirect barriers removed are explicitly re-inserted as per [JDK-8373485](https://bugs.openjdk.org/browse/JDK-8373485) Markus Gr?nlund has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: - remove extraneous assertion - Merge branch 'master' into 8373106 - 8373106 ------------- Changes: https://git.openjdk.org/jdk/pull/29178/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29178&range=01 Stats: 62 lines in 1 file changed: 12 ins; 18 del; 32 mod Patch: https://git.openjdk.org/jdk/pull/29178.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29178/head:pull/29178 PR: https://git.openjdk.org/jdk/pull/29178 From mgronlun at openjdk.org Tue Jan 13 12:24:38 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 12:24:38 GMT Subject: [jdk26] RFR: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant Message-ID: <34IxtvJSmrvH3eMEUTonmlsiz5eFXCmnAXymcvgo4jw=.51316cb5-4b87-49a7-9bf4-145cf984c940@github.com> 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant ------------- Commit messages: - Backport 543a972222118155e4c72c6f2d32d154c5dfd442 Changes: https://git.openjdk.org/jdk/pull/29189/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29189&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8373485 Stats: 11 lines in 1 file changed: 7 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/29189.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29189/head:pull/29189 PR: https://git.openjdk.org/jdk/pull/29189 From shade at openjdk.org Tue Jan 13 12:24:39 2026 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 13 Jan 2026 12:24:39 GMT Subject: [jdk26] RFR: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant In-Reply-To: 
<34IxtvJSmrvH3eMEUTonmlsiz5eFXCmnAXymcvgo4jw=.51316cb5-4b87-49a7-9bf4-145cf984c940@github.com> References: <34IxtvJSmrvH3eMEUTonmlsiz5eFXCmnAXymcvgo4jw=.51316cb5-4b87-49a7-9bf4-145cf984c940@github.com> Message-ID: On Tue, 13 Jan 2026 12:14:17 GMT, Markus Gr?nlund wrote: > 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant Marked as reviewed by shade (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/29189#pullrequestreview-3655493911 From egahlin at openjdk.org Tue Jan 13 13:21:54 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Tue, 13 Jan 2026 13:21:54 GMT Subject: RFR: 8373106: JFR suspend/resume deadlock on macOS in pthreads library [v2] In-Reply-To: References: Message-ID: On Tue, 13 Jan 2026 12:05:41 GMT, Markus Gr?nlund wrote: >> Greetings, >> >> this change effectively reverts [JDK-8358429](https://bugs.openjdk.org/browse/JDK-8358429), which was an attempt to minimize the time the Threads_lock is held during JFR sampling. That change was premised on the, at the time, two known reasons for why we held the Threads_lock during the entire sampling interval. >> >> After this change, subtle deadlocks happened on macOS, very intermittently, in the pthreads library, in that a suspended thread could be the owner of an internal process lock, a process lock that was then needed when sending pthread_kill signal to resume it. >> >> By rolling back to holding the Threads_lock for the entire duration of the sampling interval (like we have done for many many years in the era before JFR Cooperative Sampling), we prevent JavaThreads from calling os::create_thread(). >> >> I have decided to rollback the solution to the version we know work, instead of attempting a more granular solution, perhaps using sigprocmask() to create a critical section around pthread_create in os_bsd.cpp. This is something we might want to do later, but more time is then needed for falsifying / verifying the correct fix. 
>> >> Testing: jdk_jfr, stress testing >> >> Thanks >> Markus >> >> PS Indirect barriers removed are explicitly re-inserted as per [JDK-8373485](https://bugs.openjdk.org/browse/JDK-8373485) > > Markus Gr?nlund has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - remove extraneous assertion > - Merge branch 'master' into 8373106 > - 8373106 Marked as reviewed by egahlin (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/29178#pullrequestreview-3655758172 From egahlin at openjdk.org Tue Jan 13 13:30:16 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Tue, 13 Jan 2026 13:30:16 GMT Subject: [jdk26] RFR: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant In-Reply-To: <34IxtvJSmrvH3eMEUTonmlsiz5eFXCmnAXymcvgo4jw=.51316cb5-4b87-49a7-9bf4-145cf984c940@github.com> References: <34IxtvJSmrvH3eMEUTonmlsiz5eFXCmnAXymcvgo4jw=.51316cb5-4b87-49a7-9bf4-145cf984c940@github.com> Message-ID: On Tue, 13 Jan 2026 12:14:17 GMT, Markus Gr?nlund wrote: > 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant Marked as reviewed by egahlin (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/29189#pullrequestreview-3655791568 From mgronlun at openjdk.org Tue Jan 13 13:39:45 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 13:39:45 GMT Subject: [jdk26] RFR: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant In-Reply-To: References: <34IxtvJSmrvH3eMEUTonmlsiz5eFXCmnAXymcvgo4jw=.51316cb5-4b87-49a7-9bf4-145cf984c940@github.com> Message-ID: On Tue, 13 Jan 2026 12:18:46 GMT, Aleksey Shipilev wrote: >> 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant > > Marked as reviewed by shade (Reviewer). Thanks @shipilev and @egahlin for your reviews! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/29189#issuecomment-3744387302 From mgronlun at openjdk.org Tue Jan 13 13:43:07 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 13:43:07 GMT Subject: [jdk26] Integrated: 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant In-Reply-To: <34IxtvJSmrvH3eMEUTonmlsiz5eFXCmnAXymcvgo4jw=.51316cb5-4b87-49a7-9bf4-145cf984c940@github.com> References: <34IxtvJSmrvH3eMEUTonmlsiz5eFXCmnAXymcvgo4jw=.51316cb5-4b87-49a7-9bf4-145cf984c940@github.com> Message-ID: <1pwZMVoxLuPdhjOc72ousIFh0iH0gJwhzAXWQlDIX-Q=.6ea1592b-ca97-4fbf-a1f3-62f5254e93af@github.com> On Tue, 13 Jan 2026 12:14:17 GMT, Markus Gr?nlund wrote: > 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant This pull request has now been integrated. Changeset: 58be8702 Author: Markus Gr?nlund URL: https://git.openjdk.org/jdk/commit/58be8702d8b2434b810e8f142d631827ddf758a0 Stats: 11 lines in 1 file changed: 7 ins; 0 del; 4 mod 8373485: JFR Crash during sampling: assert(jt->has_last_Java_frame()) failed: invariant Reviewed-by: shade, egahlin Backport-of: 543a972222118155e4c72c6f2d32d154c5dfd442 ------------- PR: https://git.openjdk.org/jdk/pull/29189 From egahlin at openjdk.org Tue Jan 13 14:06:13 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Tue, 13 Jan 2026 14:06:13 GMT Subject: [jdk26] RFR: 8372321: TestBackToBackSensitive fails intermittently after JDK-8365972 Message-ID: <24Tldr_YSbD5O2w5hXET_zCFQb4G-ZwnVM9C4VdPykA=.3f29601b-14ac-47a0-b62f-744fb1095b69@github.com> 8372321: TestBackToBackSensitive fails intermittently after JDK-8365972 ------------- Commit messages: - Backport 556bddfd9439d1bad698ab5134317ce263a36b04 Changes: https://git.openjdk.org/jdk/pull/29191/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29191&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8372321 Stats: 47 lines in 1 file changed: 21 ins; 14 del; 12 mod 
Patch: https://git.openjdk.org/jdk/pull/29191.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29191/head:pull/29191 PR: https://git.openjdk.org/jdk/pull/29191 From mgronlun at openjdk.org Tue Jan 13 14:30:03 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 14:30:03 GMT Subject: [jdk26] RFR: 8372321: TestBackToBackSensitive fails intermittently after JDK-8365972 In-Reply-To: <24Tldr_YSbD5O2w5hXET_zCFQb4G-ZwnVM9C4VdPykA=.3f29601b-14ac-47a0-b62f-744fb1095b69@github.com> References: <24Tldr_YSbD5O2w5hXET_zCFQb4G-ZwnVM9C4VdPykA=.3f29601b-14ac-47a0-b62f-744fb1095b69@github.com> Message-ID: On Tue, 13 Jan 2026 13:51:44 GMT, Erik Gahlin wrote: > 8372321: TestBackToBackSensitive fails intermittently after JDK-8365972 Marked as reviewed by mgronlun (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/29191#pullrequestreview-3656094767 From mgronlun at openjdk.org Tue Jan 13 18:06:39 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 18:06:39 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: <8LD4JmIZnVSwmhLeVZROok-0h-nCD1TxlaSRHe586-E=.99a49bfa-444f-4ddf-b206-0a75fe1dad23@github.com> Message-ID: On Thu, 8 Jan 2026 01:32:44 GMT, Yasumasa Suenaga wrote: >>> TestEmergencyDumpAtOOM.java has passed on both AIX and Linux on PPC64. Thanks! >> >> Thanks Martin. > > Thanks a lot @mgronlun ! Looks good in general. > > Can we wait for `service.emit_leakprofiler_events()` to finish in the JFR recorder thread before the crash at `report_java_out_of_memory()` in debug.cpp? That is, is `abort()` called before the recorder thread finishes dumping events? Thanks for your review @YaSuenag. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/29094#issuecomment-3745680784 From mgronlun at openjdk.org Tue Jan 13 18:08:33 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 18:08:33 GMT Subject: Integrated: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Wed, 7 Jan 2026 14:14:19 GMT, Markus Gr?nlund wrote: > Alternative for solving [JDK-8371014](https://bugs.openjdk.org/browse/JDK-8371014) > > Also includes a fix for [JDK-8373257](https://bugs.openjdk.org/browse/JDK-8373257) > > Testing: jdk_jfr, stress testing, manual testing with CrashOnOutOfMemoryError, tier1-6 This pull request has now been integrated. Changeset: f23752a7 Author: Markus Gr?nlund URL: https://git.openjdk.org/jdk/commit/f23752a75ee3d3af0853eff9c678d2496bb1cf58 Stats: 292 lines in 15 files changed: 219 ins; 18 del; 55 mod 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented Reviewed-by: ysuenaga ------------- PR: https://git.openjdk.org/jdk/pull/29094 From mgronlun at openjdk.org Tue Jan 13 19:40:00 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 19:40:00 GMT Subject: RFR: 8373106: JFR suspend/resume deadlock on macOS in pthreads library [v2] In-Reply-To: References: Message-ID: On Tue, 13 Jan 2026 13:18:28 GMT, Erik Gahlin wrote: >> Markus Gr?nlund has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: >> >> - remove extraneous assertion >> - Merge branch 'master' into 8373106 >> - 8373106 > > Marked as reviewed by egahlin (Reviewer). Thanks @egahlin for the review! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/29178#issuecomment-3746145756 From mgronlun at openjdk.org Tue Jan 13 19:43:47 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 19:43:47 GMT Subject: Integrated: 8373106: JFR suspend/resume deadlock on macOS in pthreads library In-Reply-To: References: Message-ID: On Mon, 12 Jan 2026 21:29:26 GMT, Markus Gr?nlund wrote: > Greetings, > > this change effectively reverts [JDK-8358429](https://bugs.openjdk.org/browse/JDK-8358429), which was an attempt to minimize the time the Threads_lock is held during JFR sampling. That change was premised on the, at the time, two known reasons for why we held the Threads_lock during the entire sampling interval. > > After this change, subtle deadlocks happened on macOS, very intermittently, in the pthreads library, in that a suspended thread could be the owner of an internal process lock, a process lock that was then needed when sending pthread_kill signal to resume it. > > By rolling back to holding the Threads_lock for the entire duration of the sampling interval (like we have done for many many years in the era before JFR Cooperative Sampling), we prevent JavaThreads from calling os::create_thread(). > > I have decided to rollback the solution to the version we know work, instead of attempting a more granular solution, perhaps using sigprocmask() to create a critical section around pthread_create in os_bsd.cpp. This is something we might want to do later, but more time is then needed for falsifying / verifying the correct fix. > > Testing: jdk_jfr, stress testing > > Thanks > Markus > > PS Indirect barriers removed are explicitly re-inserted as per [JDK-8373485](https://bugs.openjdk.org/browse/JDK-8373485) This pull request has now been integrated. 
Changeset: b070367b Author: Markus Gr?nlund URL: https://git.openjdk.org/jdk/commit/b070367bdf980ef1c257cab485927db39b544241 Stats: 62 lines in 1 file changed: 12 ins; 18 del; 32 mod 8373106: JFR suspend/resume deadlock on macOS in pthreads library Reviewed-by: egahlin ------------- PR: https://git.openjdk.org/jdk/pull/29178 From mgronlun at openjdk.org Tue Jan 13 19:58:48 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 19:58:48 GMT Subject: [jdk26] RFR: 8373106: JFR suspend/resume deadlock on macOS in pthreads library Message-ID: 8373106: JFR suspend/resume deadlock on macOS in pthreads library ------------- Commit messages: - Backport b070367bdf980ef1c257cab485927db39b544241 Changes: https://git.openjdk.org/jdk/pull/29207/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29207&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8373106 Stats: 62 lines in 1 file changed: 12 ins; 18 del; 32 mod Patch: https://git.openjdk.org/jdk/pull/29207.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29207/head:pull/29207 PR: https://git.openjdk.org/jdk/pull/29207 From egahlin at openjdk.org Tue Jan 13 21:25:00 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Tue, 13 Jan 2026 21:25:00 GMT Subject: [jdk26] RFR: 8373106: JFR suspend/resume deadlock on macOS in pthreads library In-Reply-To: References: Message-ID: On Tue, 13 Jan 2026 19:50:57 GMT, Markus Gr?nlund wrote: > 8373106: JFR suspend/resume deadlock on macOS in pthreads library Marked as reviewed by egahlin (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/29207#pullrequestreview-3657990642 From mgronlun at openjdk.org Tue Jan 13 21:29:38 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 21:29:38 GMT Subject: [jdk26] RFR: 8373106: JFR suspend/resume deadlock on macOS in pthreads library In-Reply-To: References: Message-ID: On Tue, 13 Jan 2026 21:21:40 GMT, Erik Gahlin wrote: >> 8373106: JFR suspend/resume deadlock on macOS in pthreads library > > Marked as reviewed by egahlin (Reviewer). Thanks @egahlin for your review! ------------- PR Comment: https://git.openjdk.org/jdk/pull/29207#issuecomment-3746647223 From mgronlun at openjdk.org Tue Jan 13 21:31:29 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Tue, 13 Jan 2026 21:31:29 GMT Subject: [jdk26] Integrated: 8373106: JFR suspend/resume deadlock on macOS in pthreads library In-Reply-To: References: Message-ID: On Tue, 13 Jan 2026 19:50:57 GMT, Markus Gr?nlund wrote: > 8373106: JFR suspend/resume deadlock on macOS in pthreads library This pull request has now been integrated. 
Changeset: a45364a2 Author: Markus Gr?nlund URL: https://git.openjdk.org/jdk/commit/a45364a28b058739eb58bea24a219d7816d042e6 Stats: 62 lines in 1 file changed: 12 ins; 18 del; 32 mod 8373106: JFR suspend/resume deadlock on macOS in pthreads library Reviewed-by: egahlin Backport-of: b070367bdf980ef1c257cab485927db39b544241 ------------- PR: https://git.openjdk.org/jdk/pull/29207 From egahlin at openjdk.org Wed Jan 14 00:16:11 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Wed, 14 Jan 2026 00:16:11 GMT Subject: [jdk26] Integrated: 8372321: TestBackToBackSensitive fails intermittently after JDK-8365972 In-Reply-To: <24Tldr_YSbD5O2w5hXET_zCFQb4G-ZwnVM9C4VdPykA=.3f29601b-14ac-47a0-b62f-744fb1095b69@github.com> References: <24Tldr_YSbD5O2w5hXET_zCFQb4G-ZwnVM9C4VdPykA=.3f29601b-14ac-47a0-b62f-744fb1095b69@github.com> Message-ID: On Tue, 13 Jan 2026 13:51:44 GMT, Erik Gahlin wrote: > 8372321: TestBackToBackSensitive fails intermittently after JDK-8365972 This pull request has now been integrated. Changeset: 1bf35d7b Author: Erik Gahlin URL: https://git.openjdk.org/jdk/commit/1bf35d7bd0a8771e8656800a613de6a01057fc38 Stats: 47 lines in 1 file changed: 21 ins; 14 del; 12 mod 8372321: TestBackToBackSensitive fails intermittently after JDK-8365972 Reviewed-by: mgronlun Backport-of: 556bddfd9439d1bad698ab5134317ce263a36b04 ------------- PR: https://git.openjdk.org/jdk/pull/29191 From mgronlun at openjdk.org Wed Jan 14 09:04:48 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Wed, 14 Jan 2026 09:04:48 GMT Subject: [jdk26] RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Tue, 13 Jan 2026 18:14:09 GMT, Markus Gr?nlund wrote: > 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented Can you please also review the backport to 26 @YaSuenag ? 
------------- PR Comment: https://git.openjdk.org/jdk/pull/29203#issuecomment-3748504204 From ysuenaga at openjdk.org Wed Jan 14 09:53:31 2026 From: ysuenaga at openjdk.org (Yasumasa Suenaga) Date: Wed, 14 Jan 2026 09:53:31 GMT Subject: [jdk26] RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: <4cVV_QKVhvDsK5TIJ6xmux4INP9msOEl0W4MyQU8SV4=.68ccb9b0-5968-467b-a5f4-9c902cdd3fb0@github.com> On Tue, 13 Jan 2026 18:14:09 GMT, Markus Gr?nlund wrote: > 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented Marked as reviewed by ysuenaga (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/29203#pullrequestreview-3659821661 From mgronlun at openjdk.org Wed Jan 14 11:02:56 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Wed, 14 Jan 2026 11:02:56 GMT Subject: [jdk26] RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: <4cVV_QKVhvDsK5TIJ6xmux4INP9msOEl0W4MyQU8SV4=.68ccb9b0-5968-467b-a5f4-9c902cdd3fb0@github.com> References: <4cVV_QKVhvDsK5TIJ6xmux4INP9msOEl0W4MyQU8SV4=.68ccb9b0-5968-467b-a5f4-9c902cdd3fb0@github.com> Message-ID: <5Fl8Ynrvx4YIKLdJs85xf-6qk42Fkwt5XWsZXhF4caU=.e17f5327-6df9-4072-9d70-325aa5ef2249@github.com> On Wed, 14 Jan 2026 09:51:04 GMT, Yasumasa Suenaga wrote: >> 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented > > Marked as reviewed by ysuenaga (Reviewer). Thanks for your review @YaSuenag! 
------------- PR Comment: https://git.openjdk.org/jdk/pull/29203#issuecomment-3748984614 From mgronlun at openjdk.org Wed Jan 14 11:07:20 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Wed, 14 Jan 2026 11:07:20 GMT Subject: [jdk26] Integrated: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Tue, 13 Jan 2026 18:14:09 GMT, Markus Gr?nlund wrote: > 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented This pull request has now been integrated. Changeset: f3bdee89 Author: Markus Gr?nlund URL: https://git.openjdk.org/jdk/commit/f3bdee89ed1acd8a61989dd580f11ff184166520 Stats: 292 lines in 15 files changed: 219 ins; 18 del; 55 mod 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented Reviewed-by: ysuenaga Backport-of: f23752a75ee3d3af0853eff9c678d2496bb1cf58 ------------- PR: https://git.openjdk.org/jdk/pull/29203 From mdoerr at openjdk.org Wed Jan 14 12:07:27 2026 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 14 Jan 2026 12:07:27 GMT Subject: RFR: 8371014: Dump JFR recording on CrashOnOutOfMemoryError is incorrectly implemented In-Reply-To: References: Message-ID: On Wed, 7 Jan 2026 14:14:19 GMT, Markus Gr?nlund wrote: > Alternative for solving [JDK-8371014](https://bugs.openjdk.org/browse/JDK-8371014) > > Also includes a fix for [JDK-8373257](https://bugs.openjdk.org/browse/JDK-8373257) > > Testing: jdk_jfr, stress testing, manual testing with CrashOnOutOfMemoryError, tier1-6 Thanks for fixing and backporting it! I had taken a quick look and think it is good, but not a full review. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/29094#issuecomment-3749247796 From mgronlun at openjdk.org Wed Jan 14 15:19:27 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Wed, 14 Jan 2026 15:19:27 GMT Subject: RFR: 8374445: Fix -Wzero-as-null-pointer-constant warnings in JfrSet In-Reply-To: References: Message-ID: On Sat, 3 Jan 2026 08:21:15 GMT, Kim Barrett wrote: > Please review this change to fix JfrSet to avoid triggering > -Wzero-as-null-pointer-constant warnings when that warning is enabled. > > The old code uses an entry value with representation 0 to indicate the entry > doesn't have a value. It compares an entry value against literal 0 to check > for that. If the key type is a pointer type, this involves an implicit 0 => > null pointer constant conversion, so we get a warning when that warning is > enabled. > > Instead we initialize entry values to a value-initialized key, and compare > against a value-initialized key. This changes the (currently undocumented) > requirements on the key type. The key type is no longer required to be > trivially constructible (to permit memset-based initialization), but is now > required to be value-initializable. That's currently a wash, since all of the > in-use key types are fundamental types (traceid (u8) and Klass*). > > Testing: mach5 tier1-3 (tier3 is where most jfr tests are run) src/hotspot/share/jfr/utilities/jfrSet.hpp line 72: > 70: } > 71: for (unsigned i = 0; i < table_size; ++i) { > 72: ::new (&table[i]) K{}; Is this the new (placement pun intended) way to do a placement new, using the outer scope operator ::? Or is it because we don't know what Hotspot type this is? src/hotspot/share/jfr/utilities/jfrSet.hpp line 142: > 140: for (unsigned i = 0; i < old_table_size; ++i) { > 141: const K k = old_table[i]; > 142: if (k != K{}) { Are these K{}'s compile constant expressions? 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/29022#discussion_r2690861905 PR Review Comment: https://git.openjdk.org/jdk/pull/29022#discussion_r2690859072 From markus.gronlund at oracle.com Wed Jan 14 18:15:21 2026 From: markus.gronlund at oracle.com (Markus Gronlund) Date: Wed, 14 Jan 2026 18:15:21 +0000 Subject: Using JFR both with ZGC degrades application throughput In-Reply-To: <80f97dba0b628057de3b7cd2ef4c3bea@courriel.eco> References: <80f97dba0b628057de3b7cd2ef4c3bea@courriel.eco> Message-ID: Hi Fabrice, Thank you very much for reporting this and also for providing a great reproducer. We have made some progress towards understanding the problem space, at least. To help you continue with your demonstrations, explanations, and comparisons, I only need you to do the following: In the jdk/lib/jfr directory, there are two files that control the default and profile sets of JFR events: default.jfc and profile.jfc, respectively.

<event name="jdk.OldObjectSample">
  <setting name="enabled">false</setting>
  <setting name="stackTrace">false</setting>
  <setting name="cutoff">0 ns</setting>
</event>

Turn off the jdk.OldObjectSample event by setting enabled to false. This effectively turns off JFR's capability to monitor memory leaks in the background. With this small change, you should be back on track for proper comparisons, also when using JFR. Let me know if you have any questions. We will be thinking about how to solve this properly. Cheers for now Regards Markus From: hotspot-jfr-dev On Behalf Of Fabrice Bibonne Sent: Monday, 12 January 2026 16:59 To: hotspot-jfr-dev at openjdk.org Subject: Re: Using JFR both with ZGC degrades application throughput Here is a unique source code file for the reproducer (the big String is generated at startup, as you suggested). It changes the results a little, but the run with zgc + jfr still takes a lot of time. Thank you for having a look. Fabrice Le 2026-01-12 10:56, Erik Gahlin a écrit : Hi Fabrice, Thanks for reporting! Could you post the source code for the reproducer here? 
The 36 MB file could probably be replaced with a String::repeat expression. JFR does use some memory, which could impact available heap and performance, although the degradation you're seeing seems awfully high. Thanks Erik ________________________________________ From: hotspot-jfr-dev > on behalf of Fabrice Bibonne > Sent: Sunday, January 11, 2026 7:23 PM To: hotspot-jfr-dev at openjdk.org Subject: Using JFR both with ZGC degrades application throughput Hi all, I would like to report a case where starting JFR for an application running with ZGC causes a significant throughput degradation (compared to when JFR is not started). My context: I was writing a little web app to illustrate a case where the use of ZGC gives a better throughput than with G1. I benchmarked with grafana k6 my application running with G1 and my application running with ZGC: the runs with ZGC gave better throughputs. I wanted to go a bit further in explanation, so I ran my benchmarks again with JFR to be able to illustrate GC gains in JMC. When I ran my web app with ZGC+JFR, I noticed a significant throughput degradation in my benchmark (which was not the case with G1+JFR). Although I did not measure an increase in overhead as such, I still wanted to report this issue because the degradation in throughput with JFR is such that it would not be usable as is on a production service. I wrote a little application (not a web one) to reproduce the problem: the application calls a little conversion service 200 times with random numbers in parallel (to behave like a web app under load and to put pressure on the GC). The conversion service (a method named `convertNumberToWords`) converts the number into a String by looking up the String in a Map with the number as the key. In order to instantiate and destroy many objects at each call, the map is built by parsing a huge String at each call. The application ends after 200 calls. Here are the steps to reproduce: 1. 
Clone https://framagit.org/FBibonne/poc-java/-/tree/jfr+zgc_impact (make sure you are on branch jfr+zgc_impact) 2. Compile it (you must include numbers200k.zip in resources: it contains a 36 MB text file whose contents are used to create the huge String variable) 3. In the root of the repository: 3a. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops -classpath target/classes poc.java.perf.write.TestPerf #ZGC without JFR` 3b. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops -XX:StartFlightRecording -classpath target/classes poc.java.perf.write.TestPerf #ZGC with JFR` 4. The real time of the second run (with JFR) will be considerably higher than that of the first. I ran these tests on my laptop: - Dell Inc. Latitude 5591 - openSUSE Tumbleweed 20260108 - Kernel: 6.18.3-1-default (64-bit) - 12 × Intel® Core™ i7-8850H CPU @ 2.60GHz - RAM 16 GiB - openjdk version "25.0.1" 2025-10-21 - OpenJDK Runtime Environment (build 25.0.1+8-27) - OpenJDK 64-Bit Server VM (build 25.0.1+8-27, mixed mode, sharing) - many tabs open in Firefox! I also ran it in a container (eclipse-temurin:25) on my laptop and with a Windows laptop and came to the same conclusions: here are the measurements from the container: | Run with | Real time (s) | |-----------|---------------| | ZGC alone | 7.473 | | ZGC + jfr | 25.075 | | G1 alone | 10.195 | | G1 + jfr | 10.450 | After all these tests I tried to run the app with another profiler tool in order to understand where the issue is. I attach the flamegraph when running jfr+zgc: for the worker threads of the ForkJoinPool of Stream, stack traces of a majority of samples have the same top lines: - PosixSemaphore::wait - ZPageAllocator::alloc_page_stall - ZPageAllocator::alloc_page_inner - ZPageAllocator::alloc_page So many threads seem to spend their time waiting in the method ZPageAllocator::alloc_page_stall when JFR is on. The JFR periodic tasks thread also has a few samples where it waits at ZPageAllocator::alloc_page_stall. 
I hope this will help you to find the issue. Thank you very much for reading this email until the end. I hope this is the right place for such feedback. Let me know if I should report my problem elsewhere. Feel free to ask me more questions if you need. Thank you all for this amazing tool! -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.gronlund at oracle.com Wed Jan 14 18:30:15 2026 From: markus.gronlund at oracle.com (Markus Gronlund) Date: Wed, 14 Jan 2026 18:30:15 +0000 Subject: Using JFR both with ZGC degrades application throughput In-Reply-To: References: <80f97dba0b628057de3b7cd2ef4c3bea@courriel.eco> Message-ID: Hi again, I just remembered we have improved our ergonomics over the years. Therefore, there is a much easier way for you to do this without configuring anything in the .jfc files: you can simply override event settings on the command line. [1] -XX:StartFlightRecording:jdk.OldObjectSample#enabled=false Way easier! Cheers Markus [1] https://egahlin.github.io/2022/05/31/improved-ergonomics.html Confidential- Oracle Internal From: hotspot-jfr-dev On Behalf Of Markus Gronlund Sent: Wednesday, 14 January 2026 19:15 To: Fabrice Bibonne Cc: hotspot-jfr-dev at openjdk.org Subject: RE: Using JFR both with ZGC degrades application throughput Hi Fabrice, Thank you very much for reporting this and also for providing a great reproducer. We have made some progress towards understanding the problem space, at least. To help you continue with your demonstrations, explanations, and comparisons, I only need you to do the following: In the jdk/lib/jfr directory, there are two files that control the default and profile sets of JFR events: default.jfc and profile.jfc, respectively. The jdk.OldObjectSample entry in those files carries three settings: enabled (false), stackTrace (false), and cutoff (0 ns). Turn off the jdk.OldObjectSample event by setting enabled to false. This effectively turns off JFR's capability to monitor memory leaks in the background. 
With this small change, you should be back on track for proper comparisons, also when using JFR. Let me know if you have any questions. We will be thinking about how to solve this properly. Cheers for now Regards Markus Confidential- Oracle Internal From: hotspot-jfr-dev > On Behalf Of Fabrice Bibonne Sent: Monday, 12 January 2026 16:59 To: hotspot-jfr-dev at openjdk.org Subject: Re: Using JFR both with ZGC degrades application throughput Here is a unique source code file for the reproducer (the big String is generated when starting as you suggested). It changes a little the results but the run with zgc + jfr is still taking lot of time. Thanks you for having a look. Fabrice Le 2026-01-12 10:56, Erik Gahlin a ?crit : Hi Fabrice, Thanks for reporting! Could you post the source code for the reproducer here? The 36 MB file could probably be replaced with a String::repeat expression. JFR does use some memory, which could impact available heap and performance, although the degradation you?re seeing seems awfully high. Thanks Erik ________________________________________ From: hotspot-jfr-dev > on behalf of Fabrice Bibonne > Sent: Sunday, January 11, 2026 7:23 PM To: hotspot-jfr-dev at openjdk.org Subject: Using JFR both with ZGC degrades application throughput Hi all, I would like to report a case where starting jfr for an application running with zgc causes a significant throughput degradation (compared to when JFR is not started). My context : I was writing a little web app to illustrate a case where the use of ZGC gives a better throughput than with G1. I benchmarked with grafana k6 my application running with G1 and my application running with ZGC : the runs with ZGC gave better throughputs. I wanted to go a bit further in explanation so I began again my benchmarks with JFR to be able to illustrate GC gains in JMC. When I ran my web app with ZGC+JFR, I noticed a significant throughput degradation in my benchmark (which was not the case with G1+JFR). 
Although I did not measure an increase in overhead as such, I still wanted to report this issue because the degradation in throughput with JFR is such that it would not be usable as is on a production service. I wrote a little application (not a web one) to reproduce the problem : the application calls a little conversion service 200 times with random numbers in parallel (to be like a web app in charge and to pressure GC). The conversion service (a method named `convertNumberToWords`) convert the number in a String looking for the String in a Map with the number as th key. In order to instantiate and destroy many objects at each call, the map is built parsing a huge String at each call. Application ends after 200 calls. Here are the step to reproduce : 1. Clone https://framagit.org/FBibonne/poc-java/-/tree/jfr+zgc_impact (be aware to be on branch jfr+zgc_impact) 2. Compile it (you must include numbers200k.zip in resources : it contains a 36 Mo text files whose contents are used to create the huge String variable) 3. in the root of repository : 3a. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops -classpath target/classes poc.java.perf.write.TestPerf #ZGC without JFR` 3b. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops -XX:StartFlightRecording -classpath target/classes poc.java.perf.write.TestPerf #ZGC with JFR` 4. The real time of the second run (with JFR) will be considerably higher than that of the first I ran these tests on my laptop : - Dell Inc. Latitude 5591 - openSUSE Tumbleweed 20260108 - Kernel : 6.18.3-1-default (64-bit) - 12 ? Intel? Core? i7-8850H CPU @ 2.60GHz - RAM 16 Gio - openjdk version "25.0.1" 2025-10-21 - OpenJDK Runtime Environment (build 25.0.1+8-27) - OpenJDK 64-Bit Server VM (build 25.0.1+8-27, mixed mode, sharing) - many tabs opened in firefox ! 
I also ran it in a container (eclipse-temurin:25) on my laptop and with a windows laptop and came to the same conclusions : here are the measurements from the container : | Run with | Real time (s) | |-----------|---------------| | ZGC alone | 7.473 | | ZGC + jfr | 25.075 | | G1 alone | 10.195 | | G1 + jfr | 10.450 | After all these tests I tried to run the app with an other profiler tool in order to understand where is the issue. I join the flamegraph when running jfr+zgc : for the worker threads of the ForkJoinPool of Stream, stack traces of a majority of samples have the same top lines : - PosixSemaphore::wait - ZPageAllocator::alloc_page_stall - ZPageAllocator::alloc_page_inner - ZPageAllocator::alloc_page So many thread seem to spent their time waiting in the method ZPageAllocator::alloc_page_stall when the JFR is on. The JFR periodic tasks threads has also a few samples where it waits at ZPageAllocator::alloc_page_stall. I hope this will help you to find the issue. Thank you very much for reading this email until the end. I hope this is the good place for such a feedback. Let me know if I must report my problem elsewhere. Be free to ask me more questions if you need. Thank you all for this amazing tool ! -------------- next part -------------- An HTML attachment was scrubbed... URL: From fabrice.bibonne at courriel.eco Thu Jan 15 05:44:28 2026 From: fabrice.bibonne at courriel.eco (Fabrice Bibonne) Date: Thu, 15 Jan 2026 06:44:28 +0100 Subject: Using JFR both with ZGC degrades application throughput In-Reply-To: References: <80f97dba0b628057de3b7cd2ef4c3bea@courriel.eco> Message-ID: Hi, Yes turning off jdk.OldObjectSample event solved the issue : the real time execution of my sample with zgc and JFR recording with jdk.OldObjectSample turned off is now very close to that without JFR recording. Thank you very much. Best regards. 
Fabrice Le 2026-01-14 19:30, Markus Gronlund a ?crit : > Hi again, > > I just remembered we have improved our ergonomics over the years. > > Therefore, there is a much easier way for you to do this without > configuring anything in the .jfc files: you can simply override event > settings on the command line. [1] > > -XX:StartFlightRecording:jdk.OldObjectSample#enabled=false > > Way easier! > > Cheers > > Markus > > [1] https://egahlin.github.io/2022/05/31/improved-ergonomics.html [2] > > Confidential- Oracle Internal > > From: hotspot-jfr-dev On Behalf Of > Markus Gronlund > Sent: Wednesday, 14 January 2026 19:15 > To: Fabrice Bibonne > Cc: hotspot-jfr-dev at openjdk.org > Subject: RE: Using JFR both with ZGC degrades application throughput > > Hi Fabrice, > > Thank you very much for reporting this and also for providing a great > reproducer. > > We have made some progress towards understanding the problem space, at > least. > > To help you continue with your demonstrations, explanations, and > comparisons, I only need you to do the following: > > In the jdk/lib/jfr directory, there are two files that control the > default and profile sets of JFR events: default.jfc and profile.jfc, > respectively. > > > > false > > control="old-objects-stack-trace">false > > 0 ns > > > > Turn off the jdk.OldObjectSample event by setting enabled to false. > > This effectively turns off JFRs capability to monitor memory leaks in > the background. > > With this small change, you should be back on track for proper > comparisons, also when using JFR. > > Let me know if you have any questions. We will be thinking about how to > solve this properly. 
> > Cheers for now > > Regards > > Markus > > Confidential- Oracle Internal > > From: hotspot-jfr-dev On Behalf Of > Fabrice Bibonne > Sent: Monday, 12 January 2026 16:59 > To: hotspot-jfr-dev at openjdk.org > Subject: Re: Using JFR both with ZGC degrades application throughput > > Here is a unique source code file for the reproducer (the big String is > generated when starting as you suggested). It changes a little the > results but the run with zgc + jfr is still taking lot of time. > > Thanks you for having a look. > > Fabrice > > Le 2026-01-12 10:56, Erik Gahlin a ?crit : > >> Hi Fabrice, >> >> Thanks for reporting! >> >> Could you post the source code for the reproducer here? The 36 MB file >> could probably be replaced with a String::repeat expression. >> >> JFR does use some memory, which could impact available heap and >> performance, although the degradation you're seeing seems awfully >> high. >> >> Thanks >> Erik >> >> ________________________________________ >> From: hotspot-jfr-dev on behalf of >> Fabrice Bibonne >> Sent: Sunday, January 11, 2026 7:23 PM >> To: hotspot-jfr-dev at openjdk.org >> Subject: Using JFR both with ZGC degrades application throughput >> >> Hi all, >> >> I would like to report a case where starting jfr for an application >> running with zgc causes a significant throughput degradation (compared >> to when JFR is not started). >> >> My context : I was writing a little web app to illustrate a case where >> the use of ZGC gives a better throughput than with G1. I benchmarked >> with grafana k6 my application running with G1 and my application >> running with ZGC : the runs with ZGC gave better throughputs. I >> wanted to go a bit further in explanation so I began again my >> benchmarks with JFR to be able to illustrate GC gains in JMC. When I >> ran my web app with ZGC+JFR, I noticed a significant throughput >> degradation in my benchmark (which was not the case with G1+JFR). 
>> >> Although I did not measure an increase in overhead as such, I still >> wanted to report this issue because the degradation in throughput with >> JFR is such that it would not be usable as is on a production service. >> >> I wrote a little application (not a web one) to reproduce the problem >> : the application calls a little conversion service 200 times with >> random numbers in parallel (to be like a web app in charge and to >> pressure GC). The conversion service (a method named >> `convertNumberToWords`) convert the number in a String looking for the >> String in a Map with the number as th key. In order to instantiate and >> destroy many objects at each call, the map is built parsing a huge >> String at each call. Application ends after 200 calls. >> >> Here are the step to reproduce : >> 1. Clone https://framagit.org/FBibonne/poc-java/-/tree/jfr+zgc_impact >> [1] (be aware to be on branch jfr+zgc_impact) >> 2. Compile it (you must include numbers200k.zip in resources : it >> contains a 36 Mo text files whose contents are used to create the huge >> String variable) >> 3. in the root of repository : >> 3a. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops >> -classpath target/classes poc.java.perf.write.TestPerf #ZGC without >> JFR` >> 3b. Run `time java -Xmx4g -XX:+UseZGC -XX:+UseCompressedOops >> -XX:StartFlightRecording -classpath target/classes >> poc.java.perf.write.TestPerf #ZGC with JFR` >> 4. The real time of the second run (with JFR) will be considerably >> higher than that of the first >> >> I ran these tests on my laptop : >> - Dell Inc. Latitude 5591 >> - openSUSE Tumbleweed 20260108 >> - Kernel : 6.18.3-1-default (64-bit) >> - 12 ? Intel(R) Core(tm) i7-8850H CPU @ 2.60GHz >> - RAM 16 Gio >> - openjdk version "25.0.1" 2025-10-21 >> - OpenJDK Runtime Environment (build 25.0.1+8-27) >> - OpenJDK 64-Bit Server VM (build 25.0.1+8-27, mixed mode, sharing) >> - many tabs opened in firefox ! 
>> >> I also ran it in a container (eclipse-temurin:25) on my laptop and >> with a windows laptop and came to the same conclusions : here are the >> measurements from the container : >> >> | Run with | Real time (s) | >> |-----------|---------------| >> | ZGC alone | 7.473 | >> | ZGC + jfr | 25.075 | >> | G1 alone | 10.195 | >> | G1 + jfr | 10.450 | >> >> After all these tests I tried to run the app with an other profiler >> tool in order to understand where is the issue. I join the flamegraph >> when running jfr+zgc : for the worker threads of the ForkJoinPool of >> Stream, stack traces of a majority of samples have the same top lines >> : >> - PosixSemaphore::wait >> - ZPageAllocator::alloc_page_stall >> - ZPageAllocator::alloc_page_inner >> - ZPageAllocator::alloc_page >> >> So many thread seem to spent their time waiting in the method >> ZPageAllocator::alloc_page_stall when the JFR is on. The JFR periodic >> tasks threads has also a few samples where it waits at >> ZPageAllocator::alloc_page_stall. I hope this will help you to find >> the issue. >> >> Thank you very much for reading this email until the end. I hope this >> is the good place for such a feedback. Let me know if I must report my >> problem elsewhere. Be free to ask me more questions if you need. >> >> Thank you all for this amazing tool ! Links: ------ [1] https://framagit.org/FBibonne/poc-java/-/tree/jfr+zgc_impact [2] https://egahlin.github.io/2022/05/31/improved-ergonomics.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuefe at openjdk.org Thu Jan 15 07:51:36 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 15 Jan 2026 07:51:36 GMT Subject: RFR: 8373096: JFR leak profiler: path-to-gc-roots search should be non-recursive [v7] In-Reply-To: References: Message-ID: On Thu, 18 Dec 2025 10:11:20 GMT, Thomas Stuefe wrote: >> A customer reported a crash when producing a JFR recording with `path-to-gc-roots=true`. 
It was a native stack overflow that occurred during the recursive path-to-gc-root search performed in the context of PathToGcRootsOperation. >> >> We try to avoid this by limiting the maximum search depth (DFSClosure::max_dfs_depth). That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. >> >> This RFE changes the algorithm to be non-recursive. >> >> Note that as a result of this change, the order in which oop maps are walked per oop is reversed : last oops are processed first. That should not matter for the end result, however. The search is still depth-first. >> >> Note that after this patch, we could easily remove the max_depth limitation altogether. I left it in however since this was not the scope of this RFE. >> >> Testing: >> >> - Tested manually with very small (256K) thread stack size for the VMThread - the patched version works where the old version crashes out >> - Compared JFR recordings from both an unpatched version (with a large enough VMThread stack size) and a patched version; made sure that the content of "Old Object Sample" was identical >> - Ran locally all jtreg tests in jdk/jfr >> - GHAs > > Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: > > do strides for arrays I close this PR in favor of opening a fresh one later using a different approach ------------- PR Comment: https://git.openjdk.org/jdk/pull/28659#issuecomment-3753292176 From stuefe at openjdk.org Thu Jan 15 07:51:38 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 15 Jan 2026 07:51:38 GMT Subject: Withdrawn: 8373096: JFR leak profiler: path-to-gc-roots search should be non-recursive In-Reply-To: References: Message-ID: On Thu, 4 Dec 2025 15:54:04 GMT, Thomas Stuefe wrote: > A customer reported a crash when producing a JFR recording 
with `path-to-gc-roots=true`. It was a native stack overflow that occurred during the recursive path-to-gc-root search performed in the context of PathToGcRootsOperation. > > We try to avoid this by limiting the maximum search depth (DFSClosure::max_dfs_depth). That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. > > This RFE changes the algorithm to be non-recursive. > > Note that as a result of this change, the order in which oop maps are walked per oop is reversed : last oops are processed first. That should not matter for the end result, however. The search is still depth-first. > > Note that after this patch, we could easily remove the max_depth limitation altogether. I left it in however since this was not the scope of this RFE. > > Testing: > > - Tested manually with very small (256K) thread stack size for the VMThread - the patched version works where the old version crashes out > - Compared JFR recordings from both an unpatched version (with a large enough VMThread stack size) and a patched version; made sure that the content of "Old Object Sample" was identical > - Ran locally all jtreg tests in jdk/jfr > - GHAs This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/28659 From kbarrett at openjdk.org Thu Jan 15 16:24:45 2026 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 15 Jan 2026 16:24:45 GMT Subject: RFR: 8374445: Fix -Wzero-as-null-pointer-constant warnings in JfrSet In-Reply-To: References: Message-ID: On Wed, 14 Jan 2026 15:11:59 GMT, Markus Gr?nlund wrote: >> Please review this change to fix JfrSet to avoid triggering >> -Wzero-as-null-pointer-constant warnings when that warning is enabled. >> >> The old code uses an entry value with representation 0 to indicate the entry >> doesn't have a value. 
It compares an entry value against literal 0 to check >> for that. If the key type is a pointer type, this involves an implicit 0 => >> null pointer constant conversion, so we get a warning when that warning is >> enabled. >> >> Instead we initialize entry values to a value-initialized key, and compare >> against a value-initialized key. This changes the (currently undocumented) >> requirements on the key type. The key type is no longer required to be >> trivially constructible (to permit memset-based initialization), but is now >> required to be value-initializable. That's currently a wash, since all of the >> in-use key types are fundamental types (traceid (u8) and Klass*). >> >> Testing: mach5 tier1-3 (tier3 is where most jfr tests are run) > > src/hotspot/share/jfr/utilities/jfrSet.hpp line 72: > >> 70: } >> 71: for (unsigned i = 0; i < table_size; ++i) { >> 72: ::new (&table[i]) K{}; > > Is this the new (placement pun intended) way to do a placement new, using the outer scope operator ::? Or is it because we don't know what Hotspot type this is? It's the same old way one should always do it. If one wants global placement new, one should say so. An unqualified `new` expression does a class-based lookup of `operator new`, so if the class has one (and lots of ours do), that will be used. We don't want that here, regardless of the type of `K`. As it happens, for all current uses `K` is a fundamental type, so it doesn't matter. But it's clearer and future proof to be explicit. > src/hotspot/share/jfr/utilities/jfrSet.hpp line 142: > >> 140: for (unsigned i = 0; i < old_table_size; ++i) { >> 141: const K k = old_table[i]; >> 142: if (k != K{}) { > > Are these K{}'s compile constant expressions? For the types currently used, yes. This is a "value-initialized" (C++17 11.6/8) temporary. For fundamental types, that's a zero-initialized temporary. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/29022#discussion_r2695057339 PR Review Comment: https://git.openjdk.org/jdk/pull/29022#discussion_r2695088728 From kbarrett at openjdk.org Thu Jan 15 16:50:14 2026 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 15 Jan 2026 16:50:14 GMT Subject: RFR: 8374445: Fix -Wzero-as-null-pointer-constant warnings in JfrSet [v2] In-Reply-To: References: Message-ID: <90e6X30EBPUPpUf2EebvJD4TLqA0NZ0bE5vF_ib69Fs=.65981273-2420-428c-ab0d-ae8ee5548a85@github.com> > Please review this change to fix JfrSet to avoid triggering > -Wzero-as-null-pointer-constant warnings when that warning is enabled. > > The old code uses an entry value with representation 0 to indicate the entry > doesn't have a value. It compares an entry value against literal 0 to check > for that. If the key type is a pointer type, this involves an implicit 0 => > null pointer constant conversion, so we get a warning when that warning is > enabled. > > Instead we initialize entry values to a value-initialized key, and compare > against a value-initialized key. This changes the (currently undocumented) > requirements on the key type. The key type is no longer required to be > trivially constructible (to permit memset-based initialization), but is now > required to be value-initializable. That's currently a wash, since all of the > in-use key types are fundamental types (traceid (u8) and Klass*). > > Testing: mach5 tier1-3 (tier3 is where most jfr tests are run) Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: - Merge branch 'master' into jfrset-zero-as-null-pointer-warnings - fix -Wzero-as-null-poniter-constant warnings in jfrSet.hpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/29022/files - new: https://git.openjdk.org/jdk/pull/29022/files/d2ee55ab..d54334e0 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=29022&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=29022&range=00-01 Stats: 55051 lines in 1117 files changed: 27415 ins; 10797 del; 16839 mod Patch: https://git.openjdk.org/jdk/pull/29022.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29022/head:pull/29022 PR: https://git.openjdk.org/jdk/pull/29022 From mgronlun at openjdk.org Thu Jan 15 17:12:28 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Thu, 15 Jan 2026 17:12:28 GMT Subject: RFR: 8374445: Fix -Wzero-as-null-pointer-constant warnings in JfrSet [v2] In-Reply-To: <90e6X30EBPUPpUf2EebvJD4TLqA0NZ0bE5vF_ib69Fs=.65981273-2420-428c-ab0d-ae8ee5548a85@github.com> References: <90e6X30EBPUPpUf2EebvJD4TLqA0NZ0bE5vF_ib69Fs=.65981273-2420-428c-ab0d-ae8ee5548a85@github.com> Message-ID: On Thu, 15 Jan 2026 16:50:14 GMT, Kim Barrett wrote: >> Please review this change to fix JfrSet to avoid triggering >> -Wzero-as-null-pointer-constant warnings when that warning is enabled. >> >> The old code uses an entry value with representation 0 to indicate the entry >> doesn't have a value. It compares an entry value against literal 0 to check >> for that. If the key type is a pointer type, this involves an implicit 0 => >> null pointer constant conversion, so we get a warning when that warning is >> enabled. >> >> Instead we initialize entry values to a value-initialized key, and compare >> against a value-initialized key. This changes the (currently undocumented) >> requirements on the key type. 
The key type is no longer required to be >> trivially constructible (to permit memset-based initialization), but is now >> required to be value-initializable. That's currently a wash, since all of the >> in-use key types are fundamental types (traceid (u8) and Klass*). >> >> Testing: mach5 tier1-3 (tier3 is where most jfr tests are run) > > Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into jfrset-zero-as-null-pointer-warnings > - fix -Wzero-as-null-poniter-constant warnings in jfrSet.hpp Look good, thanks Kim. ------------- Marked as reviewed by mgronlun (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/29022#pullrequestreview-3666675208 From mgronlun at openjdk.org Thu Jan 15 17:12:41 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Thu, 15 Jan 2026 17:12:41 GMT Subject: RFR: 8374445: Fix -Wzero-as-null-pointer-constant warnings in JfrSet [v2] In-Reply-To: References: Message-ID: On Thu, 15 Jan 2026 16:15:52 GMT, Kim Barrett wrote: >> src/hotspot/share/jfr/utilities/jfrSet.hpp line 72: >> >>> 70: } >>> 71: for (unsigned i = 0; i < table_size; ++i) { >>> 72: ::new (&table[i]) K{}; >> >> Is this the new (placement pun intended) way to do a placement new, using the outer scope operator ::? Or is it because we don't know what Hotspot type this is? > > It's the same old way one should always do it. If one wants global placement > new, one should say so. An unqualified `new` expression does a class-based > lookup of `operator new`, so if the class has one (and lots of ours do), that > will be used. We don't want that here, regardless of the type of `K`. As it > happens, for all current uses `K` is a fundamental type, so it doesn't matter. > But it's clearer and future proof to be explicit. 
And the new location for global new is cppstdlib/new.hpp - ok. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/29022#discussion_r2695233003 From kbarrett at openjdk.org Thu Jan 15 19:15:48 2026 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 15 Jan 2026 19:15:48 GMT Subject: RFR: 8374445: Fix -Wzero-as-null-pointer-constant warnings in JfrSet In-Reply-To: References: Message-ID: On Wed, 7 Jan 2026 17:34:11 GMT, Markus Gr?nlund wrote: >> Please review this change to fix JfrSet to avoid triggering >> -Wzero-as-null-pointer-constant warnings when that warning is enabled. >> >> The old code uses an entry value with representation 0 to indicate the entry >> doesn't have a value. It compares an entry value against literal 0 to check >> for that. If the key type is a pointer type, this involves an implicit 0 => >> null pointer constant conversion, so we get a warning when that warning is >> enabled. >> >> Instead we initialize entry values to a value-initialized key, and compare >> against a value-initialized key. This changes the (currently undocumented) >> requirements on the key type. The key type is no longer required to be >> trivially constructible (to permit memset-based initialization), but is now >> required to be value-initializable. That's currently a wash, since all of the >> in-use key types are fundamental types (traceid (u8) and Klass*). >> >> Testing: mach5 tier1-3 (tier3 is where most jfr tests are run) > > Will review this later Kim, sorry for the delay (26 stuff). 
Thanks for reviewing @mgronlun ------------- PR Comment: https://git.openjdk.org/jdk/pull/29022#issuecomment-3756446182 From kbarrett at openjdk.org Thu Jan 15 19:17:35 2026 From: kbarrett at openjdk.org (Kim Barrett) Date: Thu, 15 Jan 2026 19:17:35 GMT Subject: Integrated: 8374445: Fix -Wzero-as-null-pointer-constant warnings in JfrSet In-Reply-To: References: Message-ID: On Sat, 3 Jan 2026 08:21:15 GMT, Kim Barrett wrote: > Please review this change to fix JfrSet to avoid triggering > -Wzero-as-null-pointer-constant warnings when that warning is enabled. > > The old code uses an entry value with representation 0 to indicate the entry > doesn't have a value. It compares an entry value against literal 0 to check > for that. If the key type is a pointer type, this involves an implicit 0 => > null pointer constant conversion, so we get a warning when that warning is > enabled. > > Instead we initialize entry values to a value-initialized key, and compare > against a value-initialized key. This changes the (currently undocumented) > requirements on the key type. The key type is no longer required to be > trivially constructible (to permit memset-based initialization), but is now > required to be value-initializable. That's currently a wash, since all of the > in-use key types are fundamental types (traceid (u8) and Klass*). > > Testing: mach5 tier1-3 (tier3 is where most jfr tests are run) This pull request has now been integrated. 
Changeset: a8b845e0 Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/a8b845e08ce2f1fbe7d807cd963cb6b5e4df5ce6 Stats: 10 lines in 1 file changed: 3 ins; 0 del; 7 mod 8374445: Fix -Wzero-as-null-pointer-constant warnings in JfrSet Reviewed-by: mgronlun ------------- PR: https://git.openjdk.org/jdk/pull/29022 From dholmes at openjdk.org Thu Jan 15 21:24:05 2026 From: dholmes at openjdk.org (David Holmes) Date: Thu, 15 Jan 2026 21:24:05 GMT Subject: RFR: 8374445: Fix -Wzero-as-null-pointer-constant warnings in JfrSet In-Reply-To: References: Message-ID: <0ey_3JltJ7Pde7KfS-Cot6PUAx-r6AMfUv0RXtIVA8U=.5ece4ed0-0ef9-4259-b801-f92f01d4cc97@github.com> On Thu, 15 Jan 2026 19:13:47 GMT, Kim Barrett wrote: >> Will review this later Kim, sorry for the delay (26 stuff). > > Thanks for reviewing @mgronlun @kimbarrett don't you also need to adjust this for completeness: void clear() { memset(_table, 0, _table_size * sizeof(K)); } ------------- PR Comment: https://git.openjdk.org/jdk/pull/29022#issuecomment-3756916457 From kbarrett at openjdk.org Fri Jan 16 15:22:13 2026 From: kbarrett at openjdk.org (Kim Barrett) Date: Fri, 16 Jan 2026 15:22:13 GMT Subject: RFR: 8374445: Fix -Wzero-as-null-pointer-constant warnings in JfrSet In-Reply-To: References: Message-ID: On Thu, 15 Jan 2026 19:13:47 GMT, Kim Barrett wrote: >> Will review this later Kim, sorry for the delay (26 stuff). > > Thanks for reviewing @mgronlun > @kimbarrett don't you also need to adjust this for completeness: > > ``` > void clear() { > memset(_table, 0, _table_size * sizeof(K)); > } > ``` Drat! Missed that. 
https://bugs.openjdk.org/browse/JDK-8375544 ------------- PR Comment: https://git.openjdk.org/jdk/pull/29022#issuecomment-3760548998 From fandreuzzi at openjdk.org Tue Jan 20 14:45:24 2026 From: fandreuzzi at openjdk.org (Francesco Andreuzzi) Date: Tue, 20 Jan 2026 14:45:24 GMT Subject: RFR: 8375717: Outdated link in jdk.jfr.internal.JVM javadoc Message-ID: I noticed that several javadoc entries in `jdk.jfr.internal.JVM` are still referencing `createNativeJFR`, which has been removed as part of [JDK-8310661](https://bugs.openjdk.org/browse/JDK-8310661). I propose to replace it with `JVMSupport#createJFR()`. ------------- Commit messages: - replace Changes: https://git.openjdk.org/jdk/pull/29322/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29322&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8375717 Stats: 6 lines in 1 file changed: 0 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/29322.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29322/head:pull/29322 PR: https://git.openjdk.org/jdk/pull/29322 From fandreuzzi at openjdk.org Tue Jan 20 15:50:35 2026 From: fandreuzzi at openjdk.org (Francesco Andreuzzi) Date: Tue, 20 Jan 2026 15:50:35 GMT Subject: RFR: 8375717: Outdated link in jdk.jfr.internal.JVM javadoc In-Reply-To: References: Message-ID: On Tue, 20 Jan 2026 15:43:25 GMT, Erik Gahlin wrote: > Curious, any particular reason for fixing this? Is it to get rid of all the warnings? The JVM class is internal, and its Javadoc is intended only for HotSpot developers, so it's not important. I was going through the code and could not find what the javadoc was referring to. I just found it confusing since the method does not exist anymore, but there's no warning about it indeed. 
------------- PR Comment: https://git.openjdk.org/jdk/pull/29322#issuecomment-3773571821 From egahlin at openjdk.org Tue Jan 20 15:50:34 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Tue, 20 Jan 2026 15:50:34 GMT Subject: RFR: 8375717: Outdated link in jdk.jfr.internal.JVM javadoc In-Reply-To: References: Message-ID: On Tue, 20 Jan 2026 14:37:11 GMT, Francesco Andreuzzi wrote: > I noticed that several javadoc entries in `jdk.jfr.internal.JVM` reference `createNativeJFR`, which has been removed as part of [JDK-8310661](https://bugs.openjdk.org/browse/JDK-8310661). I propose to replace it with `JVMSupport#createJFR()`. Curious, any particular reason for fixing this? Is it to get rid of all the warnings? The JVM class is internal, and its Javadoc is intended only for HotSpot developers, so it's not important. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29322#issuecomment-3773552812 From kbarrett at openjdk.org Tue Jan 20 17:45:13 2026 From: kbarrett at openjdk.org (Kim Barrett) Date: Tue, 20 Jan 2026 17:45:13 GMT Subject: RFR: 8375544: JfrSet::clear should not use memset Message-ID: <19EqFv_QUJiVGE-nFxUTfowQZvesS-oWGrE8SoAKTuU=.900c1b50-46bb-4a84-8dee-fd8db2a5a81b@github.com> Please review this change to JfrSet::clear to no longer use memset to clear the table data. Instead use value-initialization, similarly to what is done in the set initialization from JDK-8374445. 
Testing: mach5 tier1-3 ------------- Commit messages: - fix Changes: https://git.openjdk.org/jdk/pull/29327/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29327&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8375544 Stats: 9 lines in 1 file changed: 8 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/29327.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29327/head:pull/29327 PR: https://git.openjdk.org/jdk/pull/29327 From egahlin at openjdk.org Wed Jan 21 09:36:06 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Wed, 21 Jan 2026 09:36:06 GMT Subject: RFR: 8375717: Outdated link in jdk.jfr.internal.JVM javadoc In-Reply-To: References: Message-ID: On Tue, 20 Jan 2026 14:37:11 GMT, Francesco Andreuzzi wrote: > I noticed that several javadoc entries in `jdk.jfr.internal.JVM` reference `createNativeJFR`, which has been removed as part of [JDK-8310661](https://bugs.openjdk.org/browse/JDK-8310661). I propose to replace it with `JVMSupport#createJFR()`. Marked as reviewed by egahlin (Reviewer). ------------- PR Review: https://git.openjdk.org/jdk/pull/29322#pullrequestreview-3686206529 From mgronlun at openjdk.org Wed Jan 21 10:35:25 2026 From: mgronlun at openjdk.org (Markus =?UTF-8?B?R3LDtm5sdW5k?=) Date: Wed, 21 Jan 2026 10:35:25 GMT Subject: RFR: 8375544: JfrSet::clear should not use memset In-Reply-To: <19EqFv_QUJiVGE-nFxUTfowQZvesS-oWGrE8SoAKTuU=.900c1b50-46bb-4a84-8dee-fd8db2a5a81b@github.com> References: <19EqFv_QUJiVGE-nFxUTfowQZvesS-oWGrE8SoAKTuU=.900c1b50-46bb-4a84-8dee-fd8db2a5a81b@github.com> Message-ID: On Tue, 20 Jan 2026 17:38:57 GMT, Kim Barrett wrote: > Please review this change to JfrSet::clear to no longer use memset to clear > the table data. Instead use value-initialization, similarly to what is done > in the set initialization from JDK-8374445. > > Testing: mach5 tier1-3 Marked as reviewed by mgronlun (Reviewer). 
------------- PR Review: https://git.openjdk.org/jdk/pull/29327#pullrequestreview-3686482632 From fandreuzzi at openjdk.org Wed Jan 21 10:45:45 2026 From: fandreuzzi at openjdk.org (Francesco Andreuzzi) Date: Wed, 21 Jan 2026 10:45:45 GMT Subject: Integrated: 8375717: Outdated link in jdk.jfr.internal.JVM javadoc In-Reply-To: References: Message-ID: On Tue, 20 Jan 2026 14:37:11 GMT, Francesco Andreuzzi wrote: > I noticed that several javadoc entries in `jdk.jfr.internal.JVM` reference `createNativeJFR`, which has been removed as part of [JDK-8310661](https://bugs.openjdk.org/browse/JDK-8310661). I propose to replace it with `JVMSupport#createJFR()`. This pull request has now been integrated. Changeset: 5c7c2f09 Author: Francesco Andreuzzi URL: https://git.openjdk.org/jdk/commit/5c7c2f093b83a017970d9d05c258b4c0910bfc2c Stats: 6 lines in 1 file changed: 0 ins; 0 del; 6 mod 8375717: Outdated link in jdk.jfr.internal.JVM javadoc Reviewed-by: egahlin ------------- PR: https://git.openjdk.org/jdk/pull/29322 From egahlin at openjdk.org Wed Jan 21 14:10:41 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Wed, 21 Jan 2026 14:10:41 GMT Subject: RFR: 8373439: Deadlock between flight recorder & VMDeath [v2] In-Reply-To: References: Message-ID: On Tue, 16 Dec 2025 09:08:25 GMT, Bara' Hasheesh wrote: >> A simple `PlatformRecorder.isInShutDown` check is added to `PlatformRecording.start` to prevent any new recording from start after the JVM initiates it's shutdown hooks >> >> A new test was added that fails without the change & passes with it >> >> I also ran `tier1`, `tier2` as well as `jdk_jfr` on Linux x86 > > Bara' Hasheesh has updated the pull request incrementally with one additional commit since the last revision: > > flags I think a dummy may be hard to test and ensure that it works for all possible scenarios. There are also behavioral questions, e.g. what should happen if a user calls Recording::dump on a dummy? 
Throwing an exception creates a burden for users of the API. We will need to think some more on how to fix this. ------------- PR Comment: https://git.openjdk.org/jdk/pull/28767#issuecomment-3778311717 From kbarrett at openjdk.org Wed Jan 21 14:56:00 2026 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 21 Jan 2026 14:56:00 GMT Subject: RFR: 8375544: JfrSet::clear should not use memset In-Reply-To: References: <19EqFv_QUJiVGE-nFxUTfowQZvesS-oWGrE8SoAKTuU=.900c1b50-46bb-4a84-8dee-fd8db2a5a81b@github.com> Message-ID: <0CHEE1CXNWMqeY7deqrXgKOkVHBKFGgL8Jo7Wt68lLs=.8485bd0c-e83e-4a90-a852-df3bf0cd9041@github.com> On Wed, 21 Jan 2026 10:33:04 GMT, Markus Gr?nlund wrote: >> Please review this change to JfrSet::clear to no longer use memset to clear >> the table data. Instead use value-initialization, similarly to what is done >> in the set initialization from JDK-8374445. >> >> Testing: mach5 tier1-3 > > Marked as reviewed by mgronlun (Reviewer). Thanks for reviewing @mgronlun ------------- PR Comment: https://git.openjdk.org/jdk/pull/29327#issuecomment-3778552162 From kbarrett at openjdk.org Wed Jan 21 14:57:50 2026 From: kbarrett at openjdk.org (Kim Barrett) Date: Wed, 21 Jan 2026 14:57:50 GMT Subject: Integrated: 8375544: JfrSet::clear should not use memset In-Reply-To: <19EqFv_QUJiVGE-nFxUTfowQZvesS-oWGrE8SoAKTuU=.900c1b50-46bb-4a84-8dee-fd8db2a5a81b@github.com> References: <19EqFv_QUJiVGE-nFxUTfowQZvesS-oWGrE8SoAKTuU=.900c1b50-46bb-4a84-8dee-fd8db2a5a81b@github.com> Message-ID: <1tMbxgFIWRFc1n8gJE20YnmNtPLxoQXM9WOQo5AZiCs=.90bfa104-d5ff-489e-8b82-71433ba5528d@github.com> On Tue, 20 Jan 2026 17:38:57 GMT, Kim Barrett wrote: > Please review this change to JfrSet::clear to no longer use memset to clear > the table data. Instead use value-initialization, similarly to what is done > in the set initialization from JDK-8374445. > > Testing: mach5 tier1-3 This pull request has now been integrated. 
Changeset: 3033e6f4 Author: Kim Barrett URL: https://git.openjdk.org/jdk/commit/3033e6f421d0f6e0aea1d976a806d7abca7c6360 Stats: 9 lines in 1 file changed: 8 ins; 0 del; 1 mod 8375544: JfrSet::clear should not use memset Reviewed-by: mgronlun ------------- PR: https://git.openjdk.org/jdk/pull/29327 From duke at openjdk.org Wed Jan 21 15:37:21 2026 From: duke at openjdk.org (Bara' Hasheesh) Date: Wed, 21 Jan 2026 15:37:21 GMT Subject: RFR: 8373439: Deadlock between flight recorder & VMDeath [v2] In-Reply-To: References: Message-ID: On Wed, 21 Jan 2026 14:06:23 GMT, Erik Gahlin wrote: > There are also behavioral questions, e.g. what should happen if a user calls Recording::dump on a dummy? Yes, which is why I initially went with an exception, as it's much clearer in terms of behaviour. > Throwing an exception creates a burden for users of the API. > > We will need to think some more on how to fix this. As you're aware, the JFR backend/components are cleaned up during the shutdown hook via a call to `recorder.destroy()`, so trying to do any JFR work after that doesn't make much sense. I can't really think of a way to make JFR "work" without keeping JFR components alive during a JVM death, which would create its own problems (when would these resources eventually be cleared?). **IMO the first thing we should do is to define what the behaviour should be; after that we can investigate/discuss how that can be done/achieved.** For throwing an exception, the current documentation is the following:

/**
 * Starts this recording.
 * <p>
 * It's recommended that the recording options and event settings are configured
 * before calling this method. The benefits of doing so are a more consistent
 * state when analyzing the recorded data, and improved performance because the
 * configuration can be applied atomically.
 * <p>
 * After a successful invocation of this method, this recording is in the
 * {@code RUNNING} state.
 *
 * @throws IllegalStateException if recording is already started or is in the
 *         {@code CLOSED} state
 */

This documentation does not make any guarantees that the call will be successful, so I can see some freedom in defining some new errors here. ------------- PR Comment: https://git.openjdk.org/jdk/pull/28767#issuecomment-3778890869 From lmesnik at openjdk.org Wed Jan 21 16:02:08 2026 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Wed, 21 Jan 2026 16:02:08 GMT Subject: RFR: 8373439: Deadlock between flight recorder & VMDeath [v2] In-Reply-To: References: Message-ID: On Tue, 16 Dec 2025 09:08:25 GMT, Bara' Hasheesh wrote: >> A simple `PlatformRecorder.isInShutDown` check is added to `PlatformRecording.start` to prevent any new recording from starting after the JVM initiates its shutdown hooks. >> >> A new test was added that fails without the change & passes with it. >> >> I also ran `tier1`, `tier2` as well as `jdk_jfr` on Linux x86 > > Bara' Hasheesh has updated the pull request incrementally with one additional commit since the last revision: > > flags

src/jdk.jfr/share/classes/jdk/jfr/internal/PlatformRecorder.java line 560: > 558: copy.setStopTime(r.getStopTime()); > 559: copy.setFlushInterval(r.getFlushInterval()); > 560: copy.setDummyRecording(r.isDummyRecording()); Not a review, and I am not planning to review the fix itself. Could you please change the PR and bug summary? The VMDeath is a JVMTI event, so the bug title is misleading. Something like "Deadlock between flight recorder & VM shutdown" would be better.
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/28767#discussion_r2713227258 From jaroslav.bachorik at datadoghq.com Wed Jan 21 21:03:38 2026 From: jaroslav.bachorik at datadoghq.com (Jaroslav Bachorík) Date: Wed, 21 Jan 2026 22:03:38 +0100 Subject: RFC: Display contextual event fields in jfr view command Message-ID:

Hello,

I'd like to propose adding context display support to the `jfr view` command. This would allow users to see which @Contextual events were active when other events occurred, without requiring any changes to the JFR recording format or runtime.

Background

Back in 2021, there was a discussion on this list about adding a Recording Context concept to JFR (thread starting at 2021-June/002777). Erik suggested an alternative to modifying the event format: use dedicated context events with begin/end markers and correlate them during recording analysis. This proposal implements exactly that approach on the tooling side. When users have events with @Contextual annotated fields (such as trace IDs, span IDs, or request contexts), they can now view which contexts were active during any event - all computed at analysis time from the existing recording data.

---

Current State

The `jfr print` command already supports displaying contextual events. When printing events, it shows active context fields inline:

jfr print recording.jfr

jdk.ThreadSleep {
  Context: Trace.traceId = "abc-123-def"
  Context: Trace.service = "order-service"
  startTime = 12:00:01.000
  duration = 50 ms
  ...
}

This works well for detailed event inspection, but the `jfr view` command (which displays events in a tabular format) has no equivalent capability.

---

The Problem

When using `jfr view` to analyze recordings from distributed systems, users cannot see which contexts were active. The tabular format is often preferred for scanning many events quickly, but without context information users must:

1. Note the timestamp of the event of interest
2.
Switch to `jfr print` or manually search for overlapping contextual events
3. Match by thread ID to avoid cross-thread confusion
4. Repeat for every event they want to analyze

This breaks the workflow when trying to correlate events with their contexts at scale.

---

Proposed Solution

Add a `--show-context` flag to `jfr view` that automatically displays contextual event fields as additional columns:

jfr view --show-context jdk.ThreadSleep recording.jfr

ThreadSleep

Time      Sleep Time  Trace.traceId  Trace.service
----------------------------------------------------------------
12:00:01  50 ms       abc-123-def    order-service
12:00:02  100 ms      abc-123-def    order-service
12:00:03  25 ms       N/A            N/A

The context matching rule is: a contextual event is active when contextStart <= eventStart AND contextEnd >= eventStart. Users can optionally filter which context types to display:

jfr view --show-context=Span,Trace WorkEvent recording.jfr

---

Why This Approach?

1. No runtime overhead - context correlation happens entirely at analysis time
2. No format changes - works with existing recordings that have @Contextual events
3. Backward compatible - recordings remain readable by older tools
4. Flexible - users choose which contexts to display
5. Proven pattern - based on the timeline approach already used in PrettyWriter

---

[PoC] Implementation Notes

The implementation tracks context per-thread using a timeline-based approach similar to PrettyWriter.java. Events are buffered in a priority queue ordered by timestamp. Contextual events contribute both start and end timestamps, and active contexts are tracked per-thread to prevent cross-thread leakage. Memory is bounded (~1M events) to handle large recordings. Queries without --show-context bypass this entirely, so there's no overhead for existing usage.
I've also added support for referencing contextual fields in GROUP BY clauses for the `jfr query` command (debug builds), enabling aggregation queries like:

SELECT COUNT(*), Trace.traceId FROM WorkEvent GROUP BY Trace.traceId

---

Questions for Discussion

1. Is the matching rule (contextStart <= eventStart) correct? An alternative would be to require the event to fall entirely within the context.
2. Should there be a maximum number of context columns to prevent very wide output?
3. Is 1M events a reasonable buffer size? This balances memory (~100MB) with accuracy for long-running contexts.
4. The `jfr print` command already shows context - should there be a way to disable it for consistency, or is the current always-on behavior correct?

I'd welcome feedback on the approach before proceeding further.

Thanks,
Jaroslav

From erik.gahlin at oracle.com Mon Jan 26 12:34:17 2026 From: erik.gahlin at oracle.com (Erik Gahlin) Date: Mon, 26 Jan 2026 12:34:17 +0000 Subject: RFC: Display contextual event fields in jfr view command In-Reply-To: References: Message-ID:

Hi Jaroslav,

The 'jfr print' command is meant for presentations, demos, and debugging. It was never designed for application troubleshooting. The contextual support added in JDK 25 was included to demonstrate to application developers how the @Contextual annotation can be used, and to show third parties how contextual support can be implemented using the jdk.jfr.consumer API.

The 'view' command, on the other hand, was designed for troubleshooting and can be used on a live process, so it should not use excessive memory or CPU. You added a command-line flag, --show-context, perhaps to prevent additional overhead from contextual processing, but before adding flags, I think it might be a good time to step back and think about how we can best present contextual information to users. A flag is usually hard for users to find.
It would be better to add support in JMC, so users can discover contexts and then drill deeper by clicking in the GUI. I'm also wondering if contextual support belongs in the query language. It's not clear how columns of nested contexts should be identified. It may be better to create something like FormRenderer that only handles event types. We have also discussed adding a bit in the chunk header if a contextual event has been emitted. This would allow a parser to have a fast path when there are no contextual events. Thanks Erik ________________________________________ From: hotspot-jfr-dev on behalf of Jaroslav Bachor?k Sent: Wednesday, January 21, 2026 10:03 PM To: hotspot-jfr-dev Subject: RFC: Display contextual event fields in jfr view command Hello, I'd like to propose adding context display support to the `jfr view` command. This would allow users to see which @Contextual events were active when other events occurred, without requiring any changes to the JFR recording format or runtime. Background Back in 2021, there was a discussion on this list about adding a Recording Context concept to JFR (thread starting at 2021-June/002777). Erik suggested an alternative to modifying the event format: use dedicated context events with begin/end markers and correlate them during recording analysis. This proposal implements exactly that approach on the tooling side. When users have events with @Contextual annotated fields (such as trace IDs, span IDs, or request contexts), they can now view which contexts were active during any event - all computed at analysis time from the existing recording data. --- Current State The `jfr print` command already supports displaying contextual events. When printing events, it shows active context fields inline: jfr print recording.jfr jdk.ThreadSleep { Context: Trace.traceId = "abc-123-def" Context: Trace.service = "order-service" startTime = 12:00:01.000 duration = 50 ms ... 
} This works well for detailed event inspection, but the `jfr view` command (which displays events in a tabular format) has no equivalent capability. --- The Problem When using `jfr view` to analyze recordings from distributed systems, users cannot see which contexts were active. The tabular format is often preferred for scanning many events quickly, but without context information users must: 1. Note the timestamp of the event of interest 2. Switch to `jfr print` or manually search for overlapping contextual events 3. Match by thread ID to avoid cross-thread confusion 4. Repeat for every event they want to analyze This breaks the workflow when trying to correlate events with their contexts at scale. --- Proposed Solution Add a `--show-context` flag to `jfr view` that automatically displays contextual event fields as additional columns: jfr view --show-context jdk.ThreadSleep recording.jfr ThreadSleep Time Sleep Time Trace.traceId Trace.service ---------------------------------------------------------------- 12:00:01 50 ms abc-123-def order-service 12:00:02 100 ms abc-123-def order-service 12:00:03 25 ms N/A N/A The context matching rule is: a contextual event is active when contextStart <= eventStart AND contextEnd >= eventStart. Users can optionally filter which context types to display: jfr view --show-context=Span,Trace WorkEvent recording.jfr --- Why This Approach? 1. No runtime overhead - context correlation happens entirely at analysis time 2. No format changes - works with existing recordings that have @Contextual events 3. Backward compatible - recordings remain readable by older tools 4. Flexible - users choose which contexts to display 5. Proven pattern - based on the timeline approach already used in PrettyWriter --- [PoC] Implementation Notes The implementation tracks context per-thread using a timeline-based approach similar to PrettyWriter.java. Events are buffered in a priority queue ordered by timestamp. 
Contextual events contribute both start and end timestamps, and active contexts are tracked per-thread to prevent cross-thread leakage. Memory is bounded (~1M events) to handle large recordings. Queries without --show-context bypass this entirely, so there's no overhead for existing usage. I've also added support for referencing contextual fields in GROUP BY clauses for the `jfr query` command (debug builds), enabling aggregation queries like: SELECT COUNT(*), Trace.traceId FROM WorkEvent GROUP BY Trace.traceId --- Questions for Discussion 1. Is the matching rule (contextStart <= eventStart) correct? An alternative would be to require the event to fall entirely within the context. 2. Should there be a maximum number of context columns to prevent very wide output? 3. Is 1M events a reasonable buffer size? This balances memory (~100MB) with accuracy for long-running contexts. 4. The `jfr print` command already shows context - should there be a way to disable it for consistency, or is the current always-on behavior correct? I'd welcome feedback on the approach before proceeding further. Thanks, Jaroslacv From erik.gahlin at oracle.com Mon Jan 26 15:35:51 2026 From: erik.gahlin at oracle.com (Erik Gahlin) Date: Mon, 26 Jan 2026 15:35:51 +0000 Subject: RFC: Display contextual event fields in jfr view command In-Reply-To: References: Message-ID: I meant, something like PrettyWriter but with columns (not FormRenderer). Erik ________________________________________ From: hotspot-jfr-dev on behalf of Erik Gahlin Sent: Monday, January 26, 2026 1:34 PM To: Jaroslav Bachor?k; hotspot-jfr-dev Subject: Re: RFC: Display contextual event fields in jfr view command Hi Jaroslav, The 'jfr print' command is meant for presentations, demos, and debugging. It was never designed for application troubleshooting. 
The contextual support added in JDK 25 was included to demonstrate to application developers how the @Contextual annotation can be used, and to show third parties how contextual support can be implemented using the jdk.jfr.consumer API. The 'view' command, on the other hand, was designed for troubleshooting and can be used on a live process, so it should not use excessive memory or CPU. You added a command-line flag, --show-context, perhaps to prevent additional overhead from contextual processing, but before adding flags, I think it might be a good time to step back and think about how we best can present contextual information to users. A flag is usually hard for users to find. It would be better to add support in JMC, so users can discover contexts and then drill deeper by clicking in the GUI. I'm also wondering if contextual support belongs in the query language. It's not clear how columns of nested contexts should be identified. It may be better to create something like FormRenderer that only handles event types. We have also discussed adding a bit in the chunk header if a contextual event has been emitted. This would allow a parser to have a fast path when there are no contextual events. Thanks Erik ________________________________________ From: hotspot-jfr-dev on behalf of Jaroslav Bachor?k Sent: Wednesday, January 21, 2026 10:03 PM To: hotspot-jfr-dev Subject: RFC: Display contextual event fields in jfr view command Hello, I'd like to propose adding context display support to the `jfr view` command. This would allow users to see which @Contextual events were active when other events occurred, without requiring any changes to the JFR recording format or runtime. Background Back in 2021, there was a discussion on this list about adding a Recording Context concept to JFR (thread starting at 2021-June/002777). 
Erik suggested an alternative to modifying the event format: use dedicated context events with begin/end markers and correlate them during recording analysis. This proposal implements exactly that approach on the tooling side. When users have events with @Contextual annotated fields (such as trace IDs, span IDs, or request contexts), they can now view which contexts were active during any event - all computed at analysis time from the existing recording data. --- Current State The `jfr print` command already supports displaying contextual events. When printing events, it shows active context fields inline: jfr print recording.jfr jdk.ThreadSleep { Context: Trace.traceId = "abc-123-def" Context: Trace.service = "order-service" startTime = 12:00:01.000 duration = 50 ms ... } This works well for detailed event inspection, but the `jfr view` command (which displays events in a tabular format) has no equivalent capability. --- The Problem When using `jfr view` to analyze recordings from distributed systems, users cannot see which contexts were active. The tabular format is often preferred for scanning many events quickly, but without context information users must: 1. Note the timestamp of the event of interest 2. Switch to `jfr print` or manually search for overlapping contextual events 3. Match by thread ID to avoid cross-thread confusion 4. Repeat for every event they want to analyze This breaks the workflow when trying to correlate events with their contexts at scale. 
--- Proposed Solution Add a `--show-context` flag to `jfr view` that automatically displays contextual event fields as additional columns: jfr view --show-context jdk.ThreadSleep recording.jfr ThreadSleep Time Sleep Time Trace.traceId Trace.service ---------------------------------------------------------------- 12:00:01 50 ms abc-123-def order-service 12:00:02 100 ms abc-123-def order-service 12:00:03 25 ms N/A N/A The context matching rule is: a contextual event is active when contextStart <= eventStart AND contextEnd >= eventStart. Users can optionally filter which context types to display: jfr view --show-context=Span,Trace WorkEvent recording.jfr --- Why This Approach? 1. No runtime overhead - context correlation happens entirely at analysis time 2. No format changes - works with existing recordings that have @Contextual events 3. Backward compatible - recordings remain readable by older tools 4. Flexible - users choose which contexts to display 5. Proven pattern - based on the timeline approach already used in PrettyWriter --- [PoC] Implementation Notes The implementation tracks context per-thread using a timeline-based approach similar to PrettyWriter.java. Events are buffered in a priority queue ordered by timestamp. Contextual events contribute both start and end timestamps, and active contexts are tracked per-thread to prevent cross-thread leakage. Memory is bounded (~1M events) to handle large recordings. Queries without --show-context bypass this entirely, so there's no overhead for existing usage. I've also added support for referencing contextual fields in GROUP BY clauses for the `jfr query` command (debug builds), enabling aggregation queries like: SELECT COUNT(*), Trace.traceId FROM WorkEvent GROUP BY Trace.traceId --- Questions for Discussion 1. Is the matching rule (contextStart <= eventStart) correct? An alternative would be to require the event to fall entirely within the context. 2. 
Should there be a maximum number of context columns to prevent very wide output? 3. Is 1M events a reasonable buffer size? This balances memory (~100MB) with accuracy for long-running contexts. 4. The `jfr print` command already shows context - should there be a way to disable it for consistency, or is the current always-on behavior correct? I'd welcome feedback on the approach before proceeding further. Thanks, Jaroslav From stuefe at openjdk.org Tue Jan 27 12:53:01 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Tue, 27 Jan 2026 12:53:01 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive Message-ID: This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. ---- A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. ### Memory usage of old vs new algorithm: The new algorithm uses, on average, a bit less memory than the old one. The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16). For a depth of 3200, we get typical probe stack sizes of 100KB..200KB.
But we also cap probestack size, similar to how we cap the max. graph depth. In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. ### Possible improvements/simplifications in the future: DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. ### Observable differences There is one observable side effect to the changed algorithm. The non-recursive algorithm processes oops and roots in reverse order. That means we may see different GC roots in JMC now compared to the recursive version of this algorithm. For example, for a static main class member: - One order of root processing may process the CLDG first; this causes the CLDs to be iterated, which iterates the Klass, which iterates the associated j.l.Class object. The reference stack starts at the j.l.Class object, and the root displayed in JMC says "ClassLoader". - Another root processing order may process the global handles first; this causes the AppClassLoader to be processed first, which causes the main class Mirror to be processed, and we also hit the static member, but now the root in JMC is displayed as "Global Object Handle: VM Global" I think that effect is benign - a root is a root. 
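The recursion-to-iteration rewrite described above follows the textbook pattern of moving pending references from the native call stack onto an explicit, heap-allocated stack, which also explains the reversed visit order. A generic sketch of the pattern (illustrative Java only - the actual patch is C++ in dfsClosure.cpp, and all names here are hypothetical):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IterativeDfs {
    // Hypothetical stand-in for an object graph node with outgoing references.
    record Node(int id, List<Node> refs) {}

    // Depth-first traversal without recursion: pending references live on an
    // explicit stack whose size is bounded by the traversal state, not by the
    // thread's native stack.
    static List<Integer> visitOrder(Node root) {
        List<Integer> order = new ArrayList<>();
        Set<Integer> marked = new HashSet<>();
        Deque<Node> pending = new ArrayDeque<>();  // the "probe stack"
        pending.push(root);
        while (!pending.isEmpty()) {
            Node n = pending.pop();
            if (!marked.add(n.id())) {
                continue;  // already visited
            }
            order.add(n.id());
            // Push children in reverse so the first reference is popped first;
            // without this, the iterative form visits siblings in reverse,
            // which is the kind of ordering difference noted above.
            for (int i = n.refs().size() - 1; i >= 0; i--) {
                pending.push(n.refs().get(i));
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Node c = new Node(3, List.of());
        Node b = new Node(2, List.of(c));
        Node a = new Node(1, List.of(b, c));  // c is reachable twice
        System.out.println(visitOrder(a)); // [1, 2, 3]
    }
}
```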
But, as a future improvement, I think displaying the first (or first n) objects of the reference stack would be very helpful to the analyst (possibly more so than just printing the roots). But maybe that feature exists already; I am no JMC expert. ## Testing - I ran extensive tests manually to make sure the resulting jfr files showed the same information (apart from the observable differences mentioned above). - I manually tested that we handle max_depth reached and probestack exhaustion gracefully - I added a new test to execute DFS-only on a very small stack size. Works fine (crashes with stock JVM). - I added a new test that exercises the new array chunking in the tracer - I made sure the patch fixes https://bugs.openjdk.org/browse/JDK-8371630 by running the TestWaste.java with a tiny stack size - crashes without patch, works with patch ------------- Commit messages: - fix after JDK-8375040 - tests - wip - revert unnecessary changes - wip - small stack test - tweaking DFS-BFS test - fix windows build warning - reduce diff - fix performance problem on bfs-dfs mixed mode - ... and 8 more: https://git.openjdk.org/jdk/compare/cba7d88c...5d79624b Changes: https://git.openjdk.org/jdk/pull/29382/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8373096 Stats: 543 lines in 8 files changed: 502 ins; 8 del; 33 mod Patch: https://git.openjdk.org/jdk/pull/29382.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382 PR: https://git.openjdk.org/jdk/pull/29382 From stuefe at openjdk.org Tue Jan 27 12:53:02 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Tue, 27 Jan 2026 12:53:02 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive In-Reply-To: References: Message-ID: On Fri, 23 Jan 2026 10:18:05 GMT, Thomas Stuefe wrote: > This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. 
> > ---- > > A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). > > We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. > > This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. > > ### Memory usage of old vs new algorithm: > > The new algorithm uses, on average, a bit less memory than the old one. The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. > > The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth. > > In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. > > ### Possible improvements/simplifications in the future: > > DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. > > I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. 
Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. > > ### Observable differences > > There is one observable side effect to the changed algorithm. The non-recursive algorithm processes oops a... Ping @roberttoyonaga , @egahlin ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3805038334 From duke at openjdk.org Tue Jan 27 20:03:15 2026 From: duke at openjdk.org (Robert Toyonaga) Date: Tue, 27 Jan 2026 20:03:15 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive In-Reply-To: References: Message-ID: <9F7u-mBUQGEfA8o6PGa9XtHd0wa8ie6jBD444Hq2L5M=.701a8395-fc36-4cbc-b514-a589b93ceb14@github.com> On Fri, 23 Jan 2026 10:18:05 GMT, Thomas Stuefe wrote: > This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. > > ---- > > A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). > > We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. > > This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. > > ### Memory usage of old vs new algorithm: > > The new algorithm uses, on average, a bit less memory than the old one. 
The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. > > The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth. > > In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. > > ### Possible improvements/simplifications in the future: > > DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. > > I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. > > ### Observable differences > > There is one observable side effect to the changed algorithm. The non-recursive algorithm processes oops a... This looks good to me! And it fixes the [problem we talked about earlier](https://github.com/openjdk/jdk/pull/28659#discussion_r2632525732). I have left one minor comment below. src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp line 217: > 215: _current_pointee = _current_ref.dereference(); > 216: > 217: _num_objects_processed++; For large arrays that have many chunks, each chunk will count as another "object" processed. Is this okay? 
------------- Marked as reviewed by roberttoyonaga at github.com (no known OpenJDK username). PR Review: https://git.openjdk.org/jdk/pull/29382#pullrequestreview-3712943595 PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2733410597 From duke at openjdk.org Tue Jan 27 20:06:02 2026 From: duke at openjdk.org (Robert Toyonaga) Date: Tue, 27 Jan 2026 20:06:02 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive In-Reply-To: References: Message-ID: On Fri, 23 Jan 2026 10:18:05 GMT, Thomas Stuefe wrote: > This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. > > ---- > > A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). > > We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. > > This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. > > ### Memory usage of old vs new algorithm: > > The new algorithm uses, on average, a bit less memory than the old one. The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. > > The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth. 
> > In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. > > ### Possible improvements/simplifications in the future: > > DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. > > I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. > > ### Observable differences > > There is one observable side effect to the changed algorithm. The non-recursive algorithm processes oops a... src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp line 125: > 123: // smaller. Not a problem at all. > 124: // > 125: // But we could run into weird pathological object graphs. Therfore we also Suggestion: // But we could run into weird pathological object graphs. Therefore we also ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2733627437 From stuefe at openjdk.org Wed Jan 28 07:46:29 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Wed, 28 Jan 2026 07:46:29 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v2] In-Reply-To: References: Message-ID: > This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. > > ---- > > A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. 
This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). > > We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. > > This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. > > ### Memory usage of old vs new algorithm: > > The new algorithm uses, on average, a bit less memory than the old one. The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. > > The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth. > > In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. > > ### Possible improvements/simplifications in the future: > > DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. > > I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. 
But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. > > ### Observable differences > > There is one observable side effect to the changed algorithm. The non-recursive algorithm processes oops a... Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp Co-authored-by: Robert Toyonaga ------------- Changes: - all: https://git.openjdk.org/jdk/pull/29382/files - new: https://git.openjdk.org/jdk/pull/29382/files/5d79624b..af5c5585 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/29382.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382 PR: https://git.openjdk.org/jdk/pull/29382 From stuefe at openjdk.org Wed Jan 28 07:53:47 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Wed, 28 Jan 2026 07:53:47 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v3] In-Reply-To: References: Message-ID: > This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. > > ---- > > A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). > > We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. 
In this case, the VMThread's stack was too small. > > This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. > > ### Memory usage of old vs new algorithm: > > The new algorithm uses, on average, a bit less memory than the old one. The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. > > The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth. > > In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. > > ### Possible improvements/simplifications in the future: > > DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. > > I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. > > ### Observable differences > > There is one observable side effect to the changed algorithm. The non-recursive algorithm processes oops a... 
Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: dont incremement _num_objects_processed for follow-up chunks ------------- Changes: - all: https://git.openjdk.org/jdk/pull/29382/files - new: https://git.openjdk.org/jdk/pull/29382/files/af5c5585..66276985 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=01-02 Stats: 6 lines in 1 file changed: 4 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/29382.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382 PR: https://git.openjdk.org/jdk/pull/29382 From stuefe at openjdk.org Wed Jan 28 07:53:49 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Wed, 28 Jan 2026 07:53:49 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v3] In-Reply-To: <9F7u-mBUQGEfA8o6PGa9XtHd0wa8ie6jBD444Hq2L5M=.701a8395-fc36-4cbc-b514-a589b93ceb14@github.com> References: <9F7u-mBUQGEfA8o6PGa9XtHd0wa8ie6jBD444Hq2L5M=.701a8395-fc36-4cbc-b514-a589b93ceb14@github.com> Message-ID: On Tue, 27 Jan 2026 18:57:56 GMT, Robert Toyonaga wrote: >> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: >> >> dont incremement _num_objects_processed for follow-up chunks > > src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp line 217: > >> 215: _current_pointee = _current_ref.dereference(); >> 216: >> 217: _num_objects_processed++; > > For large arrays that have many chunks, each chunk will count as another "object" processed. Is this okay? Thank you for catching that; I fixed it. 
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2735291019 From stuefe at openjdk.org Wed Jan 28 08:33:00 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Wed, 28 Jan 2026 08:33:00 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v3] In-Reply-To: <9F7u-mBUQGEfA8o6PGa9XtHd0wa8ie6jBD444Hq2L5M=.701a8395-fc36-4cbc-b514-a589b93ceb14@github.com> References: <9F7u-mBUQGEfA8o6PGa9XtHd0wa8ie6jBD444Hq2L5M=.701a8395-fc36-4cbc-b514-a589b93ceb14@github.com> Message-ID: On Tue, 27 Jan 2026 20:00:51 GMT, Robert Toyonaga wrote: >> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: >> >> dont incremement _num_objects_processed for follow-up chunks > > This looks good to me! And it fixes the [problem we talked about earlier](https://github.com/openjdk/jdk/pull/28659#discussion_r2632525732). I have left one minor comment below. Thank you, @roberttoyonaga ! ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3809764432 From egahlin at openjdk.org Wed Jan 28 11:01:46 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Wed, 28 Jan 2026 11:01:46 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v3] In-Reply-To: References: Message-ID: On Wed, 28 Jan 2026 07:53:47 GMT, Thomas Stuefe wrote: >> This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. >> >> ---- >> >> A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). >> >> We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. 
That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. >> >> This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. >> >> ### Memory usage of old vs new algorithm: >> >> The new algorithm uses, on average, a bit less memory than the old one. The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. >> >> The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth. >> >> In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. >> >> ### Possible improvements/simplifications in the future: >> >> DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. >> >> I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. 
>> >> ### Observable differences >> >> There is one observable side effect to the changed a... > > Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: > > dont incremement _num_objects_processed for follow-up chunks src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp line 152: > 150: assert(_probe_stack.is_empty(), "We should have drained the probe stack?"); > 151: } > 152: log_info(jfr, system, dfs)("DFS: objects processed: " UINT64_FORMAT "," I think it would be better to use the already existing oldobject log tag instead of a new dfs tag. Also, info is a bit verbose, debug might be better. ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2736092382 From egahlin at openjdk.org Wed Jan 28 20:41:54 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Wed, 28 Jan 2026 20:41:54 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v3] In-Reply-To: References: Message-ID: On Wed, 28 Jan 2026 07:53:47 GMT, Thomas Stuefe wrote: >> This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. >> >> ---- >> >> A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). >> >> We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. >> >> This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. 
>> >> ### Memory usage of old vs new algorithm: >> >> The new algorithm uses, on average, a bit less memory than the old one. The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. >> >> The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth. >> >> In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. >> >> ### Possible improvements/simplifications in the future: >> >> DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. >> >> I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. >> >> ### Observable differences >> >> There is one observable side effect to the changed a... 
> > Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: > > dont incremement _num_objects_processed for follow-up chunks test/jdk/jdk/jfr/jcmd/TestJcmdDumpPathToGCRootsDFSBase.java line 53: > 51: try (Recording r = new Recording()) { > 52: Map p = new HashMap<>(settings); > 53: p.put(EventNames.OldObjectSample + "#" + Enabled.NAME, "true"); No need to set disk to true, it's true by default. It's much easier to enable an event this way: r.enable(EventNames.OldObjectSample); test/jdk/jdk/jfr/jcmd/TestJcmdDumpPathToGCRootsDFSBase.java line 67: > 65: File recording = new File(jfrFileName + r.getId() + ".jfr"); > 66: recording.delete(); > 67: JcmdHelper.jcmd("JFR.dump", "name=dodo", pathToGcRoots, "filename=" + recording.getAbsolutePath()); Why do we need to do this using jcmd? test/jdk/jdk/jfr/jcmd/TestJcmdDumpPathToGCRootsDFSBase.java line 68: > 66: recording.delete(); > 67: JcmdHelper.jcmd("JFR.dump", "name=dodo", pathToGcRoots, "filename=" + recording.getAbsolutePath()); > 68: r.setSettings(Collections.emptyMap()); No need to clear settings ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2738506314 PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2738511154 PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2738513128 From duke at openjdk.org Thu Jan 29 10:46:13 2026 From: duke at openjdk.org (Bara' Hasheesh) Date: Thu, 29 Jan 2026 10:46:13 GMT Subject: Withdrawn: 8373439: Deadlock between flight recorder & VM shutdown In-Reply-To: References: Message-ID: On Thu, 11 Dec 2025 15:19:51 GMT, Bara' Hasheesh wrote: > description will be added once the expected is agreed on This pull request has been closed without being integrated.
------------- PR: https://git.openjdk.org/jdk/pull/28767 From stuefe at openjdk.org Thu Jan 29 11:04:04 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 29 Jan 2026 11:04:04 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v3] In-Reply-To: References: Message-ID: On Wed, 28 Jan 2026 20:36:34 GMT, Erik Gahlin wrote: >> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: >> >> dont incremement _num_objects_processed for follow-up chunks > > test/jdk/jdk/jfr/jcmd/TestJcmdDumpPathToGCRootsDFSBase.java line 67: > >> 65: File recording = new File(jfrFileName + r.getId() + ".jfr"); >> 66: recording.delete(); >> 67: JcmdHelper.jcmd("JFR.dump", "name=dodo", pathToGcRoots, "filename=" + recording.getAbsolutePath()); > > Why do we need to do this using jcmd? Hmm, I just followed the same approach as `TestJcmdDumpPathToGCRoots`. I guess the alternative would be to start a child process JVM with -XX:StartFlightRecording dumponexit=true? ------------- PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2741088293 From stuefe at openjdk.org Thu Jan 29 11:56:23 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 29 Jan 2026 11:56:23 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v4] In-Reply-To: References: Message-ID: > This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. > > ---- > > A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). > > We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. 
That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. > > This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. > > ### Memory usage of old vs new algorithm: > > The new algorithm uses, on average, a bit less memory than the old one. The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. > > The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth. > > In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. > > ### Possible improvements/simplifications in the future: > > DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. > > I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. 
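The cost model quoted above (16 bytes per probe-stack entry, one entry per outgoing reference per depth level) can be sanity-checked with a quick calculation. The sketch below is only an illustration of that arithmetic; `probeStackBytes` is a name invented for the example:

```java
public class ProbeStackCost {
    // Cost model from the PR description: each probe-stack entry is 16 bytes,
    // and the stack holds (avg outgoing refs per object) entries per depth level.
    static long probeStackBytes(long avgOutgoingRefs, long depth) {
        return avgOutgoingRefs * depth * 16;
    }

    public static void main(String[] args) {
        // Depth 3200 (the JDK-8371630 case) with 2..4 outgoing refs per object
        // lands in the 100KB..200KB range mentioned in the description:
        System.out.println(probeStackBytes(2, 3200)); // 102400 bytes, ~100KB
        System.out.println(probeStackBytes(4, 3200)); // 204800 bytes, ~200KB
    }
}
```

Compare with the old cost of (avg stack frame size) * depth: at 3200 frames even modest frames approach the ~1MB figure cited for JDK-8371630.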
> > ### Observable differences > > There is one observable side effect to the changed algorithm. The non-recursive algorithm processes oops a... Thomas Stuefe has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 23 additional commits since the last revision: - erics test remarks - different ul log tag - Merge branch 'master' into JFR-leak-profiler-path-to-gc-roots-non-recursive-take2-with-tracing - dont incremement _num_objects_processed for follow-up chunks - Update src/hotspot/share/jfr/leakprofiler/chains/dfsClosure.cpp Co-authored-by: Robert Toyonaga - fix after JDK-8375040 - tests - wip - revert unnecessary changes - wip - ... and 13 more: https://git.openjdk.org/jdk/compare/dd83f006...dd8c1bfd ------------- Changes: - all: https://git.openjdk.org/jdk/pull/29382/files - new: https://git.openjdk.org/jdk/pull/29382/files/66276985..dd8c1bfd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=02-03 Stats: 5269 lines in 146 files changed: 3544 ins; 904 del; 821 mod Patch: https://git.openjdk.org/jdk/pull/29382.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382 PR: https://git.openjdk.org/jdk/pull/29382 From stuefe at openjdk.org Thu Jan 29 11:56:24 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 29 Jan 2026 11:56:24 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v3] In-Reply-To: References: Message-ID: On Wed, 28 Jan 2026 07:53:47 GMT, Thomas Stuefe wrote: >> This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. >> >> ---- >> >> A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. 
https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). >> >> We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. >> >> This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. >> >> ### Memory usage of old vs new algorithm: >> >> The new algorithm uses, on average, a bit less memory than the old one. The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. >> >> The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth. >> >> In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. >> >> ### Possible improvements/simplifications in the future: >> >> DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. >> >> I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. 
But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. >> >> ### Observable differences >> >> There is one observable side effect to the changed a... > > Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: > > dont incremement _num_objects_processed for follow-up chunks @egahlin thank you for reviewing this. I changed the UL tag as suggested and changed the test (also removed the cutoff setting since infinity is default) ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3817180949 From stuefe at openjdk.org Thu Jan 29 12:01:20 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 29 Jan 2026 12:01:20 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v5] In-Reply-To: References: Message-ID: > This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. > > ---- > > A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). > > We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. > > This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. > > ### Memory usage of old vs new algorithm: > > The new algorithm uses, on average, a bit less memory than the old one. 
The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. > > The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth. > > In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. > > ### Possible improvements/simplifications in the future: > > DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. > > I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. > > ### Observable differences > > There is one observable side effect to the changed algorithm. The non-recursive algorithm processes oops a... 
Thomas Stuefe has updated the pull request incrementally with two additional commits since the last revision: - remove unnecessary diff - copyrights ------------- Changes: - all: https://git.openjdk.org/jdk/pull/29382/files - new: https://git.openjdk.org/jdk/pull/29382/files/dd8c1bfd..ecc2bb74 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=03-04 Stats: 8 lines in 7 files changed: 0 ins; 1 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/29382.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382 PR: https://git.openjdk.org/jdk/pull/29382 From egahlin at openjdk.org Thu Jan 29 12:05:28 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Thu, 29 Jan 2026 12:05:28 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v3] In-Reply-To: References: Message-ID: On Thu, 29 Jan 2026 11:00:49 GMT, Thomas Stuefe wrote: >> test/jdk/jdk/jfr/jcmd/TestJcmdDumpPathToGCRootsDFSBase.java line 67: >> >>> 65: File recording = new File(jfrFileName + r.getId() + ".jfr"); >>> 66: recording.delete(); >>> 67: JcmdHelper.jcmd("JFR.dump", "name=dodo", pathToGcRoots, "filename=" + recording.getAbsolutePath()); >> >> Why do we need to do this using jcmd? > > Hmm, I just followed the same approach as `TestJcmdDumpPathToGCRoots`. I guess the alternative would be to start a child process JVM with -XX:StartFlightRecording dumponexit=true? This is how I would implement it: [patch.txt](https://github.com/user-attachments/files/24935579/patch.txt) - Put all relevant information in the test file so that, when it fails, it can be easily analyzed without having to flip back and forth between a base class during triage, for example. - Dump data programmatically, as it is quicker and easier to understand. - Use the Events class to dump and extract and recording data. - Place helper methods in the OldObjects class for reuse. 
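A minimal sketch of dumping programmatically instead of round-tripping through jcmd, along the lines suggested above. Only `jdk.jfr.Recording` and the `jdk.OldObjectSample` event name are real; the class and method names are invented for the example, and the workload is elided:

```java
import jdk.jfr.Recording;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ProgrammaticDump {
    // Dump a recording directly via the API rather than shelling out to jcmd.
    // Enabling the event by name is enough; the "disk" setting already
    // defaults to true (per the review comment earlier in the thread).
    static long recordAndDump(Path out) throws IOException {
        try (Recording r = new Recording()) {
            r.enable("jdk.OldObjectSample");
            r.start();
            // ... workload allocating long-lived objects would go here ...
            r.stop();
            r.dump(out); // programmatic equivalent of "jcmd <pid> JFR.dump"
        }
        return Files.size(out);
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("oos", ".jfr");
        System.out.println(recordAndDump(out) > 0);
    }
}
```

Even an empty recording produces a non-empty .jfr file (chunk header plus metadata), so the size check above is only a smoke test.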
------------- PR Review Comment: https://git.openjdk.org/jdk/pull/29382#discussion_r2741301803 From stuefe at openjdk.org Thu Jan 29 14:57:04 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 29 Jan 2026 14:57:04 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v6] In-Reply-To: References: Message-ID: > This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. > > ---- > > A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). > > We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. > > This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. > > ### Memory usage of old vs new algorithm: > > The new algorithm uses, on average, a bit less memory than the old one. The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. > > The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth. > > In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. 
> > ### Possible improvements/simplifications in the future: > > DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. > > I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. > > ### Observable differences > > There is one observable side effect to the changed algorithm. The non-recursive algorithm processes oops a... Thomas Stuefe has updated the pull request incrementally with three additional commits since the last revision: - remove unnecessary copyright change - remove debug output - Erics test suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/29382/files - new: https://git.openjdk.org/jdk/pull/29382/files/ecc2bb74..957e001a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=29382&range=04-05 Stats: 442 lines in 9 files changed: 195 ins; 245 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/29382.diff Fetch: git fetch https://git.openjdk.org/jdk.git pull/29382/head:pull/29382 PR: https://git.openjdk.org/jdk/pull/29382 From stuefe at openjdk.org Thu Jan 29 14:57:07 2026 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 29 Jan 2026 14:57:07 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v5] In-Reply-To: References: Message-ID: 
<_13pfLfGnYCREnI81qi-sM3dyeaCVyy6KHJBRxTfrqE=.5a239109-0190-4cd3-abec-d9a13354eb16@github.com> On Thu, 29 Jan 2026 12:01:20 GMT, Thomas Stuefe wrote: >> This is a continuation - second attempt - of https://github.com/openjdk/jdk/pull/28659. >> >> ---- >> >> A customer reported a native stack overflow when producing a JFR recording with path-to-gc-roots=true. This happens regularly, see similar cases in JBS (e.g. https://bugs.openjdk.org/browse/JDK-8371630, https://bugs.openjdk.org/browse/JDK-8282427 etc). >> >> We limit the maximum graph search depth (DFSClosure::max_dfs_depth) to prevent stack overflows. That solution is brittle, however, since recursion depth is not a good proxy for thread stack usage: it depends on many factors, e.g., compiler inlining decisions and platform specifics. In this case, the VMThread's stack was too small. >> >> This patch rewrites the DFS heap tracer to be non-recursive. This is mostly textbook stuff, but the devil is in the details. Nevertheless, the algorithm should be a straightforward read. >> >> ### Memory usage of old vs new algorithm: >> >> The new algorithm uses, on average, a bit less memory than the old one. The old algorithm did cost ((avg stackframe size in bytes) * depth). As we have seen, e.g., in JDK-8371630, a depth of 3200 can max out ~1MB of stack space. >> >> The new algorithm costs ((avg number of outgoing refs per instanceKlass oop) * depth * 16. For a depth of 3200, we get typical probe stack sizes of 100KB..200KB. But we also cap probestack size, similar to how we cap the max. graph depth. >> >> In any case, these numbers are nothing to worry about. For a more in-depth explanation about memory cost, please see the comment in dfsClosure.cpp. >> >> ### Possible improvements/simplifications in the future: >> >> DFS works perfectly well alone now. It no longer depends on stack size, and its memory usage is typically smaller than BFS. 
IMHO, it would be perfectly fine to get rid of BFS and rely solely on the non-recursive DFS. The benefit would be a decrease in complexity and fewer tests to run and maintain. It should also be easy to convert into a parallelized version later. >> >> I kept the _max_dfs_depth_ parameter for now, but tbh it is no longer very useful. Before, it prevented stack overflows. Now, it is just an indirect way to limit probe stack size. But we also explicitly cap the probe stack size, so _max_dfs_depth_ is redundant. Removing it would require changing the statically allocated reference stack to be dynamically allocated, but that should not be difficult. >> >> ### Observable differences >> >> There is one observable side effect to the changed a... > > Thomas Stuefe has updated the pull request incrementally with two additional commits since the last revision: > > - remove unnecessary diff > - copyrights @egahlin yes, that is nicer and simpler. I adapted your approach. Not sure if this helps, but back in December, when I rewrote this, I drew up a quick visio to help with reviews. I'll attach the pdf. [JFR-leakprofiler-DFS.pdf](https://github.com/user-attachments/files/24940854/JFR-leakprofiler-DFS.pdf) ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3818189519 From egahlin at openjdk.org Fri Jan 30 13:12:48 2026 From: egahlin at openjdk.org (Erik Gahlin) Date: Fri, 30 Jan 2026 13:12:48 GMT Subject: RFR: 8373096: JFR: Path-to-gc-roots search should be non-recursive [v5] In-Reply-To: <_13pfLfGnYCREnI81qi-sM3dyeaCVyy6KHJBRxTfrqE=.5a239109-0190-4cd3-abec-d9a13354eb16@github.com> References: <_13pfLfGnYCREnI81qi-sM3dyeaCVyy6KHJBRxTfrqE=.5a239109-0190-4cd3-abec-d9a13354eb16@github.com> Message-ID: On Thu, 29 Jan 2026 14:53:01 GMT, Thomas Stuefe wrote: > @egahlin yes, that is nicer and simpler. I adapted your approach. > > Not sure if this helps, but back in December, when I rewrote this, I drew up a quick visio to help with reviews. I'll attach the pdf. 
[JFR-leakprofiler-DFS.pdf](https://github.com/user-attachments/files/24940854/JFR-leakprofiler-DFS.pdf) Thanks, I need some more time to review your PR, but the change is now more confined and should not be hard to backport. ------------- PR Comment: https://git.openjdk.org/jdk/pull/29382#issuecomment-3823663593
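The recursion-free traversal discussed in this thread boils down to replacing the call stack with an explicit stack of (node, depth) probes, with the depth cap applied per probe. The sketch below is a generic textbook illustration of that transformation, not the actual `DFSClosure` code:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class IterativeDfs {
    // Non-recursive DFS over an adjacency list, using an explicit stack of
    // "probes" instead of the call stack. Returns nodes in visit order,
    // matching the preorder a recursive DFS would produce.
    static List<Integer> dfs(Map<Integer, List<Integer>> graph, int root, int maxDepth) {
        List<Integer> visited = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        Deque<int[]> probes = new ArrayDeque<>(); // {node, depth} pairs
        probes.push(new int[] {root, 0});
        while (!probes.isEmpty()) {
            int[] probe = probes.pop();
            int node = probe[0], depth = probe[1];
            if (depth > maxDepth || !seen.add(node)) {
                continue; // depth-capped or already visited
            }
            visited.add(node);
            // All outgoing refs are pushed at once; this is why memory grows
            // with (outgoing refs per node) * depth rather than frame size.
            List<Integer> refs = graph.getOrDefault(node, List.of());
            for (int i = refs.size() - 1; i >= 0; i--) { // reverse keeps preorder
                probes.push(new int[] {refs.get(i), depth + 1});
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> g = Map.of(
            1, List.of(2, 3),
            2, List.of(4),
            3, List.of(4));
        System.out.println(dfs(g, 1, 10)); // [1, 2, 4, 3]
    }
}
```

Capping the probe stack itself (rather than only the depth) would be the analogue of the explicit probe-stack size cap mentioned in the PR description.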