[jdk11u-dev] RFR: 8312182: THPs cause huge RSS due to thread start timing issue

Thomas Stuefe stuefe at openjdk.org
Sat Aug 19 06:40:28 UTC 2023


Unclean composite backport to fix JDK-8312182 - "THPs cause huge RSS due to thread start timing issue" (https://bugs.openjdk.org/browse/JDK-8312182)

Problem:

On a machine with transparent huge pages (THP) unconditionally enabled (/sys/kernel/mm/transparent_hugepage/enabled = "always"), the JVM may show a huge memory footprint (RSS) and degraded thread start performance.
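To check whether a machine falls into this category, the sysfs file named above can be read directly. The following is a minimal sketch, not code from the patch; the class and method names are hypothetical, and it assumes the usual kernel format in which the active mode is enclosed in square brackets (for example `[always] madvise never`).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only; not part of the backport.
class ThpModeCheck {
    // Kernel sysfs file named in the problem description.
    static final Path ENABLED = Paths.get("/sys/kernel/mm/transparent_hugepage/enabled");

    // Returns the bracketed token, e.g. "always" for "[always] madvise never",
    // or "unknown" if no bracketed token is present.
    static String parseMode(String content) {
        int open = content.indexOf('[');
        int close = content.indexOf(']', open + 1);
        if (open < 0 || close < 0) {
            return "unknown";
        }
        return content.substring(open + 1, close).trim();
    }

    // Reads the sysfs file; reports "unknown" if it is missing or unreadable
    // (for example on a kernel built without THP support).
    static String currentMode() {
        try {
            return parseMode(new String(Files.readAllBytes(ENABLED)));
        } catch (IOException | RuntimeException e) {
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("THP mode: " + currentMode());
    }
}
```

A result of "always" indicates the configuration described here; "madvise" and "never" do not enable THPs unconditionally.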

The following factors make the problem more severe and more likely:
- thread stack size of 2M (on arm64 or x64) or larger
- many threads, or high thread creation churn
- a slow or overloaded machine (since part of the problem is timing-dependent)

For a detailed discussion of the underlying problem, please see https://github.com/openjdk/jdk/pull/14919.

----------------

In JDK head, the issue was fixed by a sequence of patches:

- [JDK-8303215](https://bugs.openjdk.org/browse/JDK-8303215) "Make thread stacks not use huge pages"
- [JDK-8312182](https://bugs.openjdk.org/browse/JDK-8312182) "THPs cause huge RSS due to thread start timing issue"

However, JDK-8312182 itself needed one preparatory fix:
- [JDK-8310233](https://bugs.openjdk.org/browse/JDK-8310233) "Fix THP detection on Linux"

and then we had several corner-case test problems, which were fixed with:
- [JDK-8312394](https://bugs.openjdk.org/browse/JDK-8312394) "[linux] SIGSEGV if kernel was built without hugepage support"
- [JDK-8312620](https://bugs.openjdk.org/browse/JDK-8312620) "WSL Linux build crashes after JDK-8310233"
- [JDK-8314139](https://bugs.openjdk.org/browse/JDK-8314139) "TEST_BUG: runtime/os/THPsInThreadStackPreventionTest.java could fail on machine with large number of cores"

and finally, with a last patch, we renamed the switch that turns off the THP mitigation:
- [JDK-8312585](https://bugs.openjdk.org/browse/JDK-8312585) "Rename DisableTHPStackMitigation flag to THPStackMitigation"



Instead of downporting these 7 patches verbatim, I prepared a composite patch containing only the necessary mitigation and mitigation tests.

The patch does two things:
- it makes sure that all thread stacks have at least one glibc guard page, to prevent adjacent thread stacks from clustering into one VMA
- it changes the default stack size so that it is not aligned to 2MB, to prevent intra-stack THPs from forming
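The second point can be illustrated with a small arithmetic sketch. The real mitigation lives in the JVM's native thread-creation code; the Java below only models the idea and is not that code. The names are hypothetical, and bumping an aligned size by exactly one base page is an assumption made for illustration.

```java
// Hypothetical model of the stack-size adjustment; not the HotSpot code.
class StackSizeAdjust {
    static boolean isAligned(long value, long alignment) {
        return alignment > 0 && value % alignment == 0;
    }

    // If the requested stack size is a multiple of the THP page size, add one
    // base page so the resulting mapping is no longer huge-page aligned.
    // A thpPageSize of 0 stands for "THP page size unknown or THPs unavailable".
    static long adjust(long stackSize, long thpPageSize, long basePageSize) {
        if (thpPageSize > 0 && isAligned(stackSize, thpPageSize)) {
            return stackSize + basePageSize;
        }
        return stackSize;
    }

    public static void main(String[] args) {
        long twoMb = 2L * 1024 * 1024;
        // A 2M stack with 2M THPs and 4K base pages grows by one base page.
        System.out.println(adjust(twoMb, twoMb, 4096));
    }
}
```

Sizes that are already unaligned, such as 1M with 2M THPs, pass through unchanged in this model.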

The patch needs some infrastructure, but I downported only the necessary parts: the helper class "HugePages", which is used in head to scan the operating system for information about THP settings. I included only the THP-related parts and left the rest out.

The patch also includes a regression test.


---------------------

Testing:

I manually tested the JVM on Linux x64 with THP=always:

Without the patch (-Xmx1g -Xms1g -XX:+AlwaysPreTouch -Xss2m, 10000 threads started), I see slow thread startup and *11 GB - 14 GB* of RSS.

The patched version comes up a lot faster and only shows *1.3* GB of RSS.
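A manual measurement along these lines could be reproduced with a small program like the one below. This is a sketch under assumptions, not the test program actually used: the class name is hypothetical, the threads simply block on a latch, and RSS is read from the VmRSS line of /proc/self/status, which exists on Linux only.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical reproduction sketch; not the program used for the numbers above.
class ThreadRssProbe {
    // Extracts the VmRSS value in kB from the lines of /proc/self/status,
    // or returns -1 if the line is absent or carries no number.
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                String digits = line.replaceAll("[^0-9]", "");
                return digits.isEmpty() ? -1 : Long.parseLong(digits);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 10000;
        CountDownLatch started = new CountDownLatch(count);
        CountDownLatch release = new CountDownLatch(1);
        long t0 = System.nanoTime();
        for (int i = 0; i < count; i++) {
            Thread t = new Thread(() -> {
                started.countDown();
                // Keep the thread, and therefore its stack, alive until released.
                try { release.await(); } catch (InterruptedException ignored) { }
            });
            t.setDaemon(true);
            t.start();
        }
        started.await();
        long ms = (System.nanoTime() - t0) / 1_000_000;
        long rss = parseVmRssKb(Files.readAllLines(Paths.get("/proc/self/status")));
        System.out.println(count + " threads started in " + ms + " ms, VmRSS = " + rss + " kB");
        release.countDown();
    }
}
```

After compiling with `javac ThreadRssProbe.java`, it can be run with the flags quoted above, for example `java -Xmx1g -Xms1g -XX:+AlwaysPreTouch -Xss2m ThreadRssProbe 10000`, once with an unpatched and once with a patched JVM.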

GHAs: unfortunately broken due to infrastructure issues.

-------------

Commit messages:
 - Backport 84b325b844c08809448a9c073a11443d9e3c3f8e

Changes: https://git.openjdk.org/jdk11u-dev/pull/2086/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk11u-dev&pr=2086&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8312182
  Stats: 741 lines in 7 files changed: 737 ins; 2 del; 2 mod
  Patch: https://git.openjdk.org/jdk11u-dev/pull/2086.diff
  Fetch: git fetch https://git.openjdk.org/jdk11u-dev.git pull/2086/head:pull/2086

PR: https://git.openjdk.org/jdk11u-dev/pull/2086

