RFR: 8293864: Kitchensink24HStress.java fails with SIGSEGV in JfrCheckpointManager::lease
Greetings, With [JDK-8289692](https://bugs.openjdk.org/browse/JDK-8289692), a need arose to introduce a mutex mechanism for the buffers on the global list to prevent the JFR Recorder Thread from resetting buffers currently in use. The introduced mutex mechanism was adequate as far as preventing the race condition. However, its protective characteristic was overemphasized, leading to the mistake of introducing a release, or removal, operation onto a current epoch list, as part of periodic flushing. The problem is that although the callback operations are safe, the underlying list type used with the global mspace does not allow for concurrent excision of nodes. The crash occurs because a thread attempts to dereference the next pointer to traverse a node that is excised and deleted after the thread loaded it. I have reviewed how to handle better the introduction of virtual threads and their checkpoint data. I have concluded that the best way is to preserve the old system (pre-Loom) as much as possible, keeping the global and thread-local mspaces as before. Instead, an additional mspace dedicated solely to virtual threads is introduced. This categorization preserves the matching of operations against the underlying list infrastructure. It also provides flexibility in handling virtual threads distinct from regular, or "carrier", threads. The separate mspace of virtual threads is more dynamic as it does not need to preallocate buffers, a memory win for systems not using virtual threads. Also, the sizes of the different buffer types can be better controlled. With this restructuring, there are no longer any list concurrency issues; hence the mutex mechanism introduced with [JDK-8289692](https://bugs.openjdk.org/browse/JDK-8289692) becomes obsolete. Applying a release operation for the current epoch list, of the global mspace, as part of flushing is a broken invariant and is therefore removed. Testing: jdk_jfr, stress testing Thanks Markus ------------- Commit messages: - 8293864 Changes: https://git.openjdk.org/jdk/pull/10467/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10467&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8293864 Stats: 217 lines in 9 files changed: 100 ins; 33 del; 84 mod Patch: https://git.openjdk.org/jdk/pull/10467.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10467/head:pull/10467 PR: https://git.openjdk.org/jdk/pull/10467
Greetings,
With [JDK-8289692](https://bugs.openjdk.org/browse/JDK-8289692), a need arose to introduce a mutex mechanism for the buffers on the global list to prevent the JFR Recorder Thread from resetting buffers currently in use.
The introduced mutex mechanism was adequate as far as preventing the race condition. However, its protective characteristic was overemphasized, leading to the mistake of introducing a release, or removal, operation onto a current epoch list, as part of periodic flushing. The problem is that although the callback operations are safe, the underlying list type used with the global mspace does not allow for concurrent excision of nodes. The crash occurs because a thread attempts to dereference the next pointer to traverse a node that is excised and deleted after the thread loaded it.
I have reviewed how to handle better the introduction of virtual threads and their checkpoint data. I have concluded that the best way is to preserve the old system (pre-Loom) as much as possible, keeping the global and thread-local mspaces as before. Instead, an additional mspace dedicated solely to virtual threads is introduced. This categorization preserves the matching of operations against the underlying list infrastructure. It also provides flexibility in handling virtual threads distinct from regular, or "carrier", threads. The separate mspace of virtual threads is more dynamic as it does not need to preallocate buffers, a memory win for systems not using virtual threads. Also, the sizes of the different buffer types can be better controlled. With this restructuring, there are no longer any list concurrency issues; hence the mutex mechanism introduced with [JDK-8289692](https://bugs.openjdk.org/browse/JDK-8289692) becomes obsolete. Applying a release operation for the curren t epoch list, of the global mspace, as part of flushing is a broken invariant and is therefore removed.
Testing: jdk_jfr, stress testing
Thanks Markus
Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision: update ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10467/files - new: https://git.openjdk.org/jdk/pull/10467/files/e7aa82fa..4388532f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10467&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10467&range=00-01 Stats: 26 lines in 1 file changed: 5 ins; 16 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/10467.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10467/head:pull/10467 PR: https://git.openjdk.org/jdk/pull/10467
On Thu, 6 Oct 2022 16:51:23 GMT, Markus Grönlund <mgronlun@openjdk.org> wrote:
Greetings,
With [JDK-8289692](https://bugs.openjdk.org/browse/JDK-8289692), a need arose to introduce a mutex mechanism for the buffers on the global list to prevent the JFR Recorder Thread from resetting buffers currently in use.
The introduced mutex mechanism was adequate as far as preventing the race condition. However, its protective characteristic was overemphasized, leading to the mistake of introducing a release, or removal, operation onto a current epoch list, as part of periodic flushing. The problem is that although the callback operations are safe, the underlying list type used with the global mspace does not allow for concurrent excision of nodes. The crash occurs because a thread attempts to dereference the next pointer to traverse a node that is excised and deleted after the thread loaded it.
I have reviewed how to handle better the introduction of virtual threads and their checkpoint data. I have concluded that the best way is to preserve the old system (pre-Loom) as much as possible, keeping the global and thread-local mspaces as before. Instead, an additional mspace dedicated solely to virtual threads is introduced. This categorization preserves the matching of operations against the underlying list infrastructure. It also provides flexibility in handling virtual threads distinct from regular, or "carrier", threads. The separate mspace of virtual threads is more dynamic as it does not need to preallocate buffers, a memory win for systems not using virtual threads. Also, the sizes of the different buffer types can be better controlled. With this restructuring, there are no longer any list concurrency issues; hence the mutex mechanism introduced with [JDK-8289692](https://bugs.openjdk.org/browse/JDK-8289692) becomes obsolete. Applying a release operation for the curre nt epoch list, of the global mspace, as part of flushing is a broken invariant and is therefore removed.
Testing: jdk_jfr, stress testing
Thanks Markus
Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision:
update
Marked as reviewed by egahlin (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/10467
On Wed, 28 Sep 2022 12:33:20 GMT, Markus Grönlund <mgronlun@openjdk.org> wrote:
Greetings,
With [JDK-8289692](https://bugs.openjdk.org/browse/JDK-8289692), a need arose to introduce a mutex mechanism for the buffers on the global list to prevent the JFR Recorder Thread from resetting buffers currently in use.
The introduced mutex mechanism was adequate as far as preventing the race condition. However, its protective characteristic was overemphasized, leading to the mistake of introducing a release, or removal, operation onto a current epoch list, as part of periodic flushing. The problem is that although the callback operations are safe, the underlying list type used with the global mspace does not allow for concurrent excision of nodes. The crash occurs because a thread attempts to dereference the next pointer to traverse a node that is excised and deleted after the thread loaded it.
I have reviewed how to handle better the introduction of virtual threads and their checkpoint data. I have concluded that the best way is to preserve the old system (pre-Loom) as much as possible, keeping the global and thread-local mspaces as before. Instead, an additional mspace dedicated solely to virtual threads is introduced. This categorization preserves the matching of operations against the underlying list infrastructure. It also provides flexibility in handling virtual threads distinct from regular, or "carrier", threads. The separate mspace of virtual threads is more dynamic as it does not need to preallocate buffers, a memory win for systems not using virtual threads. Also, the sizes of the different buffer types can be better controlled. With this restructuring, there are no longer any list concurrency issues; hence the mutex mechanism introduced with [JDK-8289692](https://bugs.openjdk.org/browse/JDK-8289692) becomes obsolete. Applying a release operation for the curren t epoch list, of the global mspace, as part of flushing is a broken invariant and is therefore removed.
Testing: jdk_jfr, stress testing
Thanks Markus
This pull request has now been integrated. Changeset: 35d17a00 Author: Markus Grönlund <mgronlun@openjdk.org> URL: https://git.openjdk.org/jdk/commit/35d17a00ab4028071a8fc7cd781b3306e6811970 Stats: 206 lines in 9 files changed: 89 ins; 33 del; 84 mod 8293864: Kitchensink24HStress.java fails with SIGSEGV in JfrCheckpointManager::lease Reviewed-by: egahlin ------------- PR: https://git.openjdk.org/jdk/pull/10467
participants (2)
-
Erik Gahlin
-
Markus Grönlund