RFR: JDK-8312182: THPs cause huge RSS due to thread start timing issue

Thomas Stuefe stuefe at openjdk.org
Tue Jul 18 15:46:36 UTC 2023


If THP (Transparent Huge Pages) are enabled unconditionally on the system, Java applications that use many threads may see a huge Resident Set Size. That footprint is caused by thread stacks being mostly paged in. This page-in is caused by thread stack memory being transformed into huge pages by `khugepaged`; later, those huge pages usually shatter into small pages when Java guard pages are established at thread start, but the resulting small pages remain paged in.

[JDK-8303215](https://bugs.openjdk.org/browse/JDK-8303215) attempted to fix this problem by making it unlikely that thread stack boundaries are aligned to THP page size. Unfortunately, that was not sufficient. We still see JVMs with huge footprints, especially if they create many Java threads in rapid succession.

Note that this effect is independent of any JVM switches; in particular, it happens regardless of `-XX:+UseTransparentHugePages` or `-XX:+UseLargePages`.

#### Demonstration:

Linux 5.15 on x64, glibc 2.31: 10000 idle threads with a 100 MB pre-touched Java heap and `-Xss2M` will consume:

A) Baseline (THP disabled on system):  *369 MB*
B) THP="always", JDK-8303215 present: *1.5 GB .. >2 GB* (very wobbly)
C) THP="always", JDK-8303215 present, artificial delay after thread start: **20.6 GB** (!)
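The Java test itself is not included in this RFR. As an illustration only, the same mechanism can be reproduced outside the JVM with raw pthreads. A hedged, stand-alone C sketch (thread count and stack size chosen to match the scenario above; disabling the glibc guard page mimics what the JVM does for Java threads):

```c
// Illustration only: spawn many idle threads back-to-back, then compare RSS
// (e.g. in /proc/<pid>/status) with THP mode "always" vs. "never".
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 10000

static void *idle(void *arg) { (void)arg; pause(); return NULL; }

int main(void) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 2 * 1024 * 1024); // mirrors -Xss2M
    pthread_attr_setguardsize(&attr, 0); // mimic the JVM: no glibc guard page
    for (int i = 0; i < NTHREADS; i++) {
        pthread_t t;
        if (pthread_create(&t, &attr, idle, NULL) != 0) {
            perror("pthread_create");
            break;
        }
    }
    printf("threads created; check RSS in /proc/%d/status\n", (int)getpid());
    pause();
    return 0;
}
```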


#### Cause:

The problem is caused by timing. When we create multiple Java threads, the following sequence of actions happens:

In the parent thread:
- the parent thread calls `pthread_create(3)`
- `pthread_create(3)` creates the thread stack by calling `mmap(2)`
- `pthread_create(3)` calls `clone(2)` to start the child thread
- this repeats for every additional thread

Each child thread:
- queries its stack dimensions
- handshakes with the parent to signal liveness
- establishes guard pages at the low end of the stack

The thread stack mapping is established in the parent thread; the guard pages are placed by the child threads. There is a time window in which the thread stack is already mapped into the address space but its guard pages have not yet been placed.

If the parent is faster than the children, it creates new stack mappings faster than the children can place guard pages on them.
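A hedged C sketch of the two sides of this race (simplified; only the libc calls are real, the helper names are made up):

```c
// Simplified sketch of the race. Needs _GNU_SOURCE for pthread_getattr_np(3).
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>

static void *child(void *arg) {
    (void)arg;
    // Child: query the stack dimensions ...
    pthread_attr_t attr;
    void *stack_low; size_t stack_size;
    pthread_getattr_np(pthread_self(), &attr);
    pthread_attr_getstack(&attr, &stack_low, &stack_size);
    pthread_attr_destroy(&attr);
    // ... then place guard pages at the low end. Until this mprotect(2)
    // runs, the whole stack is one plain anonymous mapping that the kernel
    // may merge with its neighbors.
    mprotect(stack_low, 4 * 4096, PROT_NONE); // e.g. 16 KB of guards
    return NULL;
}

void spawn_many(int n) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setguardsize(&attr, 0); // as the JVM does: no glibc guard
    // Parent: one mmap(2) + clone(2) per pthread_create(3) call, possibly
    // much faster than the children can guard their new stacks.
    for (int i = 0; i < n; i++) {
        pthread_t t;
        pthread_create(&t, &attr, child, NULL);
    }
    pthread_attr_destroy(&attr);
}
```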

For the kernel, these thread stacks are just anonymous mappings. It places them adjacent to each other to reduce address space fragmentation. As long as no guard pages are placed yet, all these thread stack mappings (VMAs) have the same attributes - same permission bits, all anonymous. Hence, the kernel will fold them into a single large VMA. 

That VMA may be large enough to be eligible for huge pages. Now the JVM races with `khugepaged`: if `khugepaged` is faster than the JVM, it will have converted that large VMA partly or fully into huge pages before the child threads start creating guard pages:



```
glibc does              Kernel folds them        khugepaged creates
1 mmap/stack:           into one VMA (still      huge pages (paged in):
                        not paged in):
+---------+             +---------+              +---------+
|         |             |         |              |         |
| stack A |             |         |              | hugepage|
|         |             |         |              | ------- |
+---------+             |         |              |         |
|         |             |         |              | hugepage|
| stack B |    ====>    |  VMA    |    ====>     | ------- |
|         |             |         |              |         |
+---------+             |         |              | hugepage|
|         |             |         |              | ------- |
| stack C |             |         |              |         |
|         |             |         |              | hugepage|
+---------+             +---------+              +---------+
```


The child threads will catch up and create guard pages. That splinters the large VMA into several smaller VMAs: two per thread, one for the usable stack portion and one, protected, for the guard pages. Each of these VMAs will typically be smaller than a huge page, and typically not huge-page-aligned. The huge pages created by `khugepaged` will mostly shatter into small pages, but these small pages remain paged in. Effect: we pay memory for the entire thread stacks even though the threads have not used them yet.

This is a similar effect to the one described in [JDK-8303215](https://bugs.openjdk.org/browse/JDK-8303215); however, we had assumed it affected only individual threads, when in fact it affects whole regions of adjacent thread stacks.

#### Example:

Let's create three threads. Each thread stack, including guard pages, is 2 MB + 4 KB in size (the extra 4 KB is due to JDK-8303215).

Their thread stacks will be located at (`[base .. end .. guard]`):


```
T1: [7feea53ff000 .. 7feea5202000 .. 7feea51fe000]
T2: [7feea5600000 .. 7feea5403000 .. 7feea53ff000]
T3: [7feea5801000 .. 7feea5604000 .. 7feea5600000]
```
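As a sanity check: each stack spans `0x201000` bytes, e.g. for T1, `0x7feea53ff000 - 0x7feea51fe000 = 0x201000` = 2 MB + 4 KB. Together, the three stacks cover `0x7feea5801000 - 0x7feea51fe000 = 0x603000` bytes = 6156 KB, which matches the mapping size in the smaps output below.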


After `pthread_create(3)`, their thread stacks exist without JVM guard pages. The kernel merges the VMAs of the three thread stacks into a single mapping of more than 6 MB. `khugepaged` then coalesces their small pages into three huge pages:


```
7feea51fe000-7feea5801000 rw-p 00000000 00:00 0    <<<------- all three stacks as one VMA
Size:               6156 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                6148 kB
Pss:                6148 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      6148 kB
Referenced:         6148 kB
Anonymous:          6148 kB
LazyFree:              0 kB
AnonHugePages:      6144 kB      <<<---------- 3x2MB huge pages
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
VmFlags: rd wr mr mw me ac sd
```


Threads start and create their respective guard pages. The single VMA splinters into 6 smaller VMAs. The huge pages shatter into small pages that remain paged-in:


```
7feea51fe000-7feea5202000 ---p 00000000 00:00 0   <<----- guard pages for T1
Size:                 16 kB
...
7feea5202000-7feea53ff000 rw-p 00000000 00:00 0   <<------ thread stack for T1
Size:               2036 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                2036 kB
Pss:                2036 kB
Private_Dirty:      2036 kB   <<<--------  all pages resident
...
7feea53ff000-7feea5403000 ---p 00000000 00:00 0   <<----- guard pages for T2
Size:                 16 kB
...
7feea5403000-7feea5600000 rw-p 00000000 00:00 0   <<------ thread stack for T2
Size:               2036 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                2036 kB
Pss:                2036 kB
Private_Dirty:      2036 kB   <<<--------  all pages resident
...
7feea5600000-7feea5604000 ---p 00000000 00:00 0    <<----- guard pages for T3
Size:                 16 kB
...
7feea5604000-7feea5801000 rw-p 00000000 00:00 0    <<------ thread stack for T3
Size:               2036 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                2036 kB
Pss:                2036 kB
Private_Dirty:      2036 kB   <<<--------  all pages resident
...
```


#### Fix:

The mitigation for the huge RSS buildup is to let glibc allocate its own guard page for Java thread stacks. We usually avoid that, since Java thread stacks are already guarded by JVM guard pages. However, here we exploit the fact that the glibc guard page is established by `pthread_create(3)` **before** the child thread starts. Adjacent thread stack VMAs are therefore separated by a guard page from the beginning - we close the time window within which the kernel sees one large VMA.
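In pthread terms, the glibc guard is requested per thread via `pthread_attr_setguardsize(3)`. A minimal sketch of the idea (not the actual HotSpot patch; names and sizes are illustrative):

```c
// Sketch of the idea: a non-zero glibc guard size makes pthread_create(3)
// protect the low end of the new stack *before* clone(2), so adjacent
// stack VMAs can never merge into one large, THP-eligible VMA.
#include <pthread.h>
#include <unistd.h>

int create_thread_with_glibc_guard(pthread_t *tid,
                                   void *(*entry)(void *), void *arg) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 2 * 1024 * 1024 + 4096); // 2 MB + 4 KB
    // One page of glibc guard instead of the usual 0 for Java threads:
    pthread_attr_setguardsize(&attr, (size_t)sysconf(_SC_PAGESIZE));
    int rc = pthread_create(tid, &attr, entry, arg);
    pthread_attr_destroy(&attr);
    return rc;
}
```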

The fix works: the aforementioned 20.6 GB RSS example, with the fix applied, drops back to ~400 MB RSS.

As a byproduct, this patch fixes [JDK-8310687](https://bugs.openjdk.org/browse/JDK-8310687): the mitigation is now only activated if THPs are enabled unconditionally on the system (THP mode = "always"), and we now use the correct page size for THPs.
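For reference, the system-wide THP mode can be read from `/sys/kernel/mm/transparent_hugepage/enabled`, where the active mode appears in brackets. A small sketch of such a check (illustrative, not the patch's actual parsing code):

```c
// Sketch: returns true if the system THP mode is "always". The file contains
// e.g. "always [madvise] never"; the active mode is bracketed.
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool thp_mode_is_always(void) {
    char line[128] = {0};
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (f == NULL) {
        return false; // kernel built without THP support
    }
    const bool read_ok = fgets(line, sizeof line, f) != NULL;
    fclose(f);
    return read_ok && strstr(line, "[always]") != NULL;
}
```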

Finally, note that this solution has a slight disadvantage: threads now have two guards, the JVM guard pages and the glibc guard page. Currently, the kernel cannot merge these VMAs because their attributes differ. This may be solvable; I opened [JDK-8312211](https://bugs.openjdk.org/browse/JDK-8312211) to track it.

-------------

Commit messages:
 - 8312182-Huge-RSS-with-THPs-even-after-JDK-8303215

Changes: https://git.openjdk.org/jdk/pull/14919/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14919&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8312182
  Stats: 300 lines in 3 files changed: 291 ins; 9 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/14919.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/14919/head:pull/14919

PR: https://git.openjdk.org/jdk/pull/14919

