RFR: JDK-8310111: Shenandoah wastes memory when running with very large page sizes [v2]

Aleksey Shipilev shade at openjdk.org
Fri Jun 30 13:25:58 UTC 2023


On Tue, 20 Jun 2023 15:21:24 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> This proposal changes the reservation of bitmaps and region storage to reduce the wastage associated with running with very large page sizes (e.g. 1GB on x64, 512M on arm64), in both non-THP and THP mode.
>> 
>> This patch:
>> - introduces the notion of an "allowed overdraft factor": allocations for a given page size are rejected if they would cause more wastage than the factor allows
>> - if it makes sense, places the mark and aux bitmaps into a contiguous region to let them share a ginormous page. E.g. for a heap of 16G, both bitmaps would now share a single 1GB page.
>> 
>> Examples:
>> 
>> Note: annoyingly, huge page usage does not show up in RSS. I therefore use a script that parses /proc/pid/smaps and tallies hugepage usage to calculate the cost for the following examples:
>> 
>> Example 1:
>> 
>> A machine configured with 1G pages (and nothing else). The heap is allocated with 1G pages; the bitmaps fall back to 4K pages because the JVM figures too much of a 1GB page would be wasted:
>> 
>> 
>> thomas at starfish$ ./images/jdk/bin/java -Xmx4600m -Xlog:pagesize -XX:+UseShenandoahGC -XX:+UseLargePages
>> ...
>> [0.028s][info][pagesize] Mark Bitmap: req_size=160M req_page_size=4K base=0x00007f8149fff000 size=160M page_size=4K
>> [0.028s][info][pagesize] Aux Bitmap: req_size=160M req_page_size=4K base=0x00007f813fffe000 size=160M page_size=4K
>> [0.028s][info][pagesize] Region Storage: req_size=320K req_page_size=4K base=0x00007f817c06f000 size=320K page_size=4K
>> 
>> 
>> Cost before: 8GB. Cost now: 5GB + (2*160M)
>> 
>> Example 2: JVM with 14GB heap: mark and aux bitmap together are large enough to justify another 1G page, so they share it. Notice how we also place the region storage on this page:
>> 
>> 
>> thomas at starfish:/shared/projects/openjdk/jdk-jdk/output-release$ ./images/jdk/bin/java -Xmx14g -Xlog:pagesize 
>> -XX:+UseShenandoahGC -XX:+UseLargePages -cp $REPROS_JAR de.stuefe.repros.Simple
>> [0.003s][info][pagesize] Heap: req_size=14G req_page_size=1G base=0x0000000480000000 size=14G page_size=1G
>> [0.003s][info][pagesize] Mark+Aux Bitmap: req_size=896M req_page_size=1G base=0x00007fee00000000 size=1G page_size=1G
>> [0.003s][info][pagesize] Region Storage: piggy-backing on Mark Bitmap: base=0x00007fee38000000 size=1792
>> <press key>
>> 
>> 
>> Cost before: 17GB. Cost now: 15GB.
>> 
>> From a bang-for-hugepages-buck perspective, multiples of 16GB are a sweet spot here, since (on x64 with 1GB pages) this allows us to put both 512M bitmaps onto a single huge page.
>> 
>> -----------
>> 
>> No test yet, since I wanted to se...
>
> Thomas Stuefe has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:
> 
>   start

>     * cset is normally tiny. Therefore the chance to map in lower-based regions is very good. The combined-everything RS would be a lot larger, therefore diminishing the chance of mapping in lower regions. It would also reduce the chance of mapping the Java heap in lower-based regions, especially if we ever try to attach the heap for zero-based _and_ zero-shift oops mode, which at the moment we don't even try.

Right. Then maybe leave the cset bitmap alone, so it is allocatable before the heap, as it needs the most compact encoding possible to optimize the barrier fastpaths. There seems to be little reason to large-page-alloc it at all: for the normal region count, it is only about the size of a small page. Then do the GC-structures RS reservation after the heap reservation is done, so we are not interfering with heap base allocation.
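
For a rough sense of the sizes involved, a standalone back-of-the-envelope sketch (plain C++, not HotSpot code; the one-byte-per-region cset map and the ~2048-region layout are assumptions for illustration):

#include <cstddef>
#include <cstdio>

int main() {
  const size_t K = 1024, M = K * K, G = K * M;
  const size_t heap_size   = 16 * G;                   // example heap
  const size_t region_size = 8 * M;                    // assumed region size, giving ~2048 regions
  const size_t regions     = heap_size / region_size;
  const size_t cset_map    = regions * 1;              // assuming one byte per region
  printf("regions=%zu, cset map=%zu bytes (vs. a 4K small page)\n", regions, cset_map);
  return 0;
}

Under those assumptions the cset map is about half a 4K page, so reserving it with small pages before the heap costs essentially nothing.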

>     * Allocating everything together is an all-or-nothing deal: we need to have enough large pages for everyone, otherwise nobody gets large pages.

Seems a fair deal to me. I think we _definitely_ want to avoid entangling the Java heap large-page allocation with this. After the Java heap is allocated with large pages, all-or-nothing for the rest of the GC structures seems a reasonable compromise for implementation simplicity.
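
To make the shape of that compromise concrete, a minimal standalone Linux sketch (plain mmap, not the HotSpot reservation code; the 1G combined size is taken from Example 2 above, and MAP_HUGETLB here uses the system's default huge page size rather than explicitly selecting 1G): try huge pages for the whole combined reservation, and fall back to small pages for all of it if that fails.

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// Reserve (PROT_NONE) a combined GC-structures area, all-or-nothing on huge pages.
static void* try_reserve(size_t size, bool huge) {
  int flags = MAP_PRIVATE | MAP_ANONYMOUS | (huge ? MAP_HUGETLB : 0);
  void* p = mmap(nullptr, size, PROT_NONE, flags, -1, 0);
  return (p == MAP_FAILED) ? nullptr : p;
}

int main() {
  const size_t G = 1024UL * 1024 * 1024;
  const size_t combined = 1 * G;               // e.g. mark + aux bitmap sharing one huge-page-sized area
  void* base = try_reserve(combined, true);    // huge pages for everything...
  if (base == nullptr) {
    base = try_reserve(combined, false);       // ...or small pages for everything
  }
  printf("combined GC structures reserved at %p\n", base);
  return (base != nullptr) ? 0 : 1;
}

Real code would of course also pick the concrete huge page size and deal with alignment, but the control flow stays this simple.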

> My proposal would be: as you propose, use one RS. Then:
>     * static huge pages - no uncommit, no problem
>     * THP: never uncommit, all stays committed, no problem
>     * Non-LP: we use tiny system pages. Align the mark bitmap start to system page size, then all works as it did before.

It would still be nice to support slice uncommit with THP, at least for the smaller 2M pages. Aligning the marking bitmaps to 2M does not seem to waste a lot of memory, and our footprint is still much better with uncommits.
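
A tiny standalone sketch of the alignment arithmetic such a slice uncommit would need (plain C++; the offsets are made up, and the bitmap base is assumed to be 2M-aligned): shrink the requested range inward to 2M boundaries so only whole THP-sized slices of the bitmap are released.

#include <cstddef>
#include <cstdio>

int main() {
  const size_t M = 1024UL * 1024;
  const size_t granule = 2 * M;                   // THP page size on x64

  // Hypothetical bitmap slice that became unused, as byte offsets from a
  // bitmap base that is itself assumed to be 2M-aligned.
  const size_t req_start = 3 * M, req_end = 11 * M;

  const size_t start = (req_start + granule - 1) & ~(granule - 1);  // align up
  const size_t end   = req_end & ~(granule - 1);                    // align down
  if (start < end) {
    printf("uncommit [%zuM, %zuM) instead of [%zuM, %zuM)\n",
           start / M, end / M, req_start / M, req_end / M);
  }
  return 0;
}

The partial granules at either end simply stay committed; that is the modest waste the 2M alignment trades for being able to uncommit at all.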

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14559#issuecomment-1614647483

