RFR: 8293864: Kitchensink24HStress.java fails with SIGSEGV in JfrCheckpointManager::lease

Markus Grönlund mgronlun at openjdk.org
Wed Sep 28 12:42:11 UTC 2022


Greetings,

With [JDK-8289692](https://bugs.openjdk.org/browse/JDK-8289692), a need arose to introduce a mutex mechanism for the buffers on the global list to prevent the JFR Recorder Thread from resetting buffers currently in use.

The introduced mutex mechanism was adequate as far as preventing the race condition. However, its protective characteristic was overemphasized, leading to the mistake of introducing a release, or removal, operation onto a current epoch list, as part of periodic flushing. The problem is that although the callback operations are safe, the underlying list type used with the global mspace does not allow for concurrent excision of nodes. The crash occurs because a thread attempts to dereference the next pointer to traverse a node that is excised and deleted after the thread loaded it.

I have reviewed how to handle better the introduction of virtual threads and their checkpoint data. I have concluded that the best way is to preserve the old system (pre-Loom) as much as possible, keeping the global and thread-local mspaces as before. Instead, an additional mspace dedicated solely to virtual threads is introduced. This categorization preserves the matching of operations against the underlying list infrastructure. It also provides flexibility in handling virtual threads distinct from regular, or "carrier", threads. The separate mspace of virtual threads is more dynamic as it does not need to preallocate buffers, a memory win for systems not using virtual threads. Also, the sizes of the different buffer types can be better controlled. With this restructuring, there are no longer any list concurrency issues; hence the mutex mechanism introduced with [JDK-8289692](https://bugs.openjdk.org/browse/JDK-8289692) becomes obsolete. Applying a release operation for the current 
 epoch list, of the global mspace, as part of flushing is a broken invariant and is therefore removed.

Testing: jdk_jfr, stress testing

Thanks
Markus

-------------

Commit messages:
 - 8293864

Changes: https://git.openjdk.org/jdk/pull/10467/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10467&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8293864
  Stats: 217 lines in 9 files changed: 100 ins; 33 del; 84 mod
  Patch: https://git.openjdk.org/jdk/pull/10467.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/10467/head:pull/10467

PR: https://git.openjdk.org/jdk/pull/10467


More information about the hotspot-jfr-dev mailing list