[master] RFR: JDK-8325104: Lilliput: Shrink Classpointers

Wed Feb 7 14:22:00 UTC 2024

Hi,

I wanted to get input on the following improvement for Lilliput. Testing is still ongoing, but things look really good, so this patch is hopefully near its final form (barring any objections from reviewers, of course).

Note: I have a companion patch prepared for upstream, minus the markword changes. I will try to get that one upstream quickly to avoid a large delta between upstream and Lilliput, especially in Metaspace.

## High-Level Overview

(for a short sequence of slides, please see https://github.com/tstuefe/fosdem24/blob/master/classpointers-and-liliput.pdf - these accompanied a talk we held at FOSDEM 24).

We want to reduce the bit width of the narrow Klass pointer (nKlass) to free up bits in the MarkWord.

We cannot just reduce the Klass encoding range size (well, we could, and maybe we will later, but for now we decided not to). We instead increase the alignment Klass is stored at, and use that alignment shadow to store other information.

In other words, this patch changes the narrow Klass Pointer to a Klass ID, since now (almost) every value in its value range points to a different class. Therefore, we use the value range of nKlass much more efficiently.

We then use the newly freed bits in the MarkWord to restore the iHash to 31 bits: 

[ 22-bit nKlass | 31-bit iHash | 4 free bits | age | fwd | lck ]

nKlass is reduced to 22 bits. The identity hash is re-inflated to 31 bits. Between iHash and age there are now 4 unused bits. The rest is unchanged.
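To make the layout concrete, here is a minimal Python sketch that derives the bit offsets from the field order above and packs/extracts fields. The widths of age (4), fwd (1) and lck (2) are my assumption (they are what is left of 64 bits after 22 + 31 + 4); the helper names are made up for illustration and are not HotSpot code.

```python
# Fields from most significant to least significant, as in the layout above.
# Widths of age/fwd/lck are assumptions; the text only gives 22, 31 and 4.
FIELDS = [("nklass", 22), ("ihash", 31), ("free", 4),
          ("age", 4), ("fwd", 1), ("lck", 2)]

def field_shifts(fields=FIELDS):
    """Return {name: (shift, width)} with the last listed field at bit 0."""
    shifts = {}
    pos = 0
    for name, width in reversed(fields):
        shifts[name] = (pos, width)
        pos += width
    assert pos == 64, "fields must fill the 64-bit mark word exactly"
    return shifts

SHIFTS = field_shifts()

def pack(**values):
    """Build a 64-bit word from named field values; unnamed fields are 0."""
    word = 0
    for name, value in values.items():
        shift, width = SHIFTS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
    return word

def extract(word, name):
    """Read one named field back out of a packed word."""
    shift, width = SHIFTS[name]
    return (word >> shift) & ((1 << width) - 1)
```

Under these assumptions nKlass occupies bits 42..63 and iHash bits 11..41, so neither field starts at a 32-bit boundary.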

(Note: I originally wanted to swap iHash and nKlass so that each of them could be loaded with a 32-bit load, but I found that tricky since C2 seems to rely on the nKlass offset in the MarkWord being > 0.)

## nKlass reduction:

The reduction in nKlass size is achieved by storing Klass structures only at 10-bit aligned addresses. That alignment (1 KB) works well in practice since Klass, although variable-sized, is typically between 512 bytes and 1 KB in size. Outliers are possible, but the size distribution is roughly bell-shaped [1], so far-away outliers are very rare.
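As a rough illustration of what the alignment buys us, the sketch below encodes and decodes a Klass address with a 10-bit shift. The names (`base`, `encode_klass`, `decode_klass`) are hypothetical and the code is only a model of the arithmetic, not the actual encoding code in the patch.

```python
KLASS_SHIFT = 10                      # log2 of the 1 KB Klass alignment
NKLASS_BITS = 22                      # width of the nKlass field
KLASS_ALIGNMENT = 1 << KLASS_SHIFT

def encode_klass(addr, base):
    """Turn a 1 KB aligned Klass address into a 22-bit nKlass value."""
    offset = addr - base
    if offset < 0 or offset % KLASS_ALIGNMENT != 0:
        raise ValueError("Klass address must be 1 KB aligned relative to base")
    nklass = offset >> KLASS_SHIFT
    if nklass >= (1 << NKLASS_BITS):
        raise ValueError("address outside the encoding range")
    return nklass

def decode_klass(nklass, base):
    """Inverse of encode_klass: the shift is always applied."""
    return base + (nklass << KLASS_SHIFT)

# 22 value bits plus 10 alignment bits still span a 4 GB range.
ENCODING_RANGE = 1 << (NKLASS_BITS + KLASS_SHIFT)
```

Since the low 10 bits of every Klass offset are zero, dropping them loses no information, and consecutive nKlass values refer to consecutive 1 KB slots.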

To not lose memory to alignment waste, metaspace is reshaped to handle arbitrarily aligned allocations efficiently. Basically, we allow the non-Klass arena of a class loader to steal the alignment waste storage from the class arena. So, alignment waste blocks are filled with non-Klass metadata. That works very well in practice since non-Klass metadata is numerous and fine-granular compared to the big Klass blocks. Total footprint loss in metaspace is, therefore, almost zero (a few KB at most). We see class space to be used more, non-Klass space to be used less, but the total footprint is about the same.
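The idea can be modelled with a toy bump allocator. This is a deliberately simplified sketch and not how Metaspace is implemented in the patch; `ToyClassArena` and its methods are invented for illustration. Klass blocks are placed at 1 KB boundaries, and the gap in front of each block is handed to non-Klass allocations.

```python
KLASS_ALIGNMENT = 1024  # 10-bit alignment from the text

class ToyClassArena:
    """Toy model only: aligned Klass blocks plus donated alignment gaps."""

    def __init__(self):
        self.top = 0      # next unused offset in the class arena
        self.gaps = []    # (offset, size) gaps available to non-Klass data

    def allocate_klass(self, size):
        # Round the current top up to the next 1 KB boundary.
        aligned = -(-self.top // KLASS_ALIGNMENT) * KLASS_ALIGNMENT
        if aligned > self.top:
            # Remember the alignment gap instead of wasting it.
            self.gaps.append((self.top, aligned - self.top))
        self.top = aligned + size
        return aligned

    def allocate_non_klass(self, size):
        """Serve non-Klass metadata from the first gap that is large enough."""
        for i, (offset, gap) in enumerate(self.gaps):
            if gap >= size:
                if gap == size:
                    del self.gaps[i]
                else:
                    self.gaps[i] = (offset + size, gap - size)
                return offset
        return None  # a real allocator would fall back to the non-Klass arena

    def wasted_bytes(self):
        """Bytes lost to alignment that nobody has reused yet."""
        return sum(size for _, size in self.gaps)
```

In this model, a 600-byte Klass followed by a 700-byte Klass leaves a 424-byte gap, which a handful of small non-Klass allocations can then fill completely.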

The number of addressable classes is somewhat reduced: from the ~5 million we could previously fit into a maxed-out (3 GB) class space to ~3 million with 10-bit alignment. That is still a very high and acceptable number.
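The back-of-the-envelope arithmetic behind these numbers, as a small sketch (variable names are mine; the aligned figure is an upper bound, since a Klass larger than 1 KB occupies more than one slot):

```python
GB = 1 << 30
CLASS_SPACE_MAX = 3 * GB      # maxed-out class space from the text
KLASS_ALIGNMENT = 1 << 10     # one Klass start per 1 KB slot

# With 10-bit alignment every Klass occupies at least one 1 KB slot.
max_classes_aligned = CLASS_SPACE_MAX // KLASS_ALIGNMENT

# The earlier ~5 million estimate implies this average footprint per class.
implied_avg_klass_bytes = CLASS_SPACE_MAX / 5_000_000

print(max_classes_aligned)             # 3145728
print(round(implied_avg_klass_bytes))  # 644
```

The implied average of roughly 644 bytes is consistent with the typical Klass size of 512 bytes to 1 KB mentioned above.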

## CDS:

CDS is also modified to store Klass at 10-bit aligned addresses. That works well since CDS contains both non-Klass and Klass data, so things just shift around. As with Metaspace, there is very little footprint loss here.

We now create four CDS archives: with and without coops, and with and without UseCOH. Since that is potentially contentious, I guarded the +COH versions with the configure option --enable-cds-archive-coh. 

## Costs:

The costs of these improvements are:

- when decoding an nKlass, the shift is now mandatory. Before, it was usually optional (only used on certain platforms when CDS was disabled).
- we reduce the number of loadable classes from ~5 million to ~3 million
- In theory, hyper-aligning Klass to more than the cache line size could be detrimental. I attempted to measure that effect with a custom microbenchmark that aims to overwhelm L1, but so far I have not seen any effect above background noise. I will repeat these tests though, and have mitigations planned in case this should be an issue (basically, vary cadence by cache line size).
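To illustrate the theoretical concern in the last point: if every Klass starts on a 1 KB boundary, the first cache line of each Klass can only land in a small subset of L1 sets. The sketch below assumes a typical L1 data cache geometry (64-byte lines, 64 sets, e.g. 32 KB 8-way); those numbers and the function name are my assumptions, not measurements from the patch.

```python
def l1_sets_touched(alignment, line_size=64, num_sets=64, count=4096):
    """Set indices hit by the first cache line of `count` aligned blocks.

    Models a set-associative cache where the set index is taken from the
    address bits directly above the line offset.
    """
    return {((k * alignment) // line_size) % num_sets for k in range(count)}

# 1 KB aligned starts use only a few sets; 64-byte aligned starts use all.
sets_1k = l1_sets_touched(1024)
sets_64 = l1_sets_touched(64)
print(len(sets_1k), len(sets_64))  # 4 64
```

In this model the Klass starts compete for 4 of the 64 sets, which is the kind of effect the microbenchmark was trying to provoke.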

## Testing

I tested the default variant (COH disabled) and a variant with +COH as default. GHAs are clean. I ran tier 1 and 2 manually on Linux x64 and macOS.

Tests are currently being executed at Adoptium on Linux arm64 and x64.

## Further outlook

Further size reductions are possible. We can probably squeeze a bit or two out of reducing the number of loadable classes. To go further, e.g. to 16-bit class pointers, is possible but would probably need something like the variable-sized-header idea of John Rose.

In any case, this PR is a necessary prerequisite, since it allows us to use the bits in nKlass much more efficiently. So, a 16-bit nKlass can really address ~64k classes.

[1] https://github.com/tstuefe/metaspace-statistics/blob/master/Histogram.svg

-------------

Commit messages:
 - Fix Typo
 - Better CDS arch generation
 - Fix error in COH archive generation
 - Fixes
 - Fix Windows-Only bug that was caused by !INCLUDE_CDS_JAVA_HEAP
 - fix runtime/cds/appcds/TestCombinedCompressedFlags and runtime/cds/appcds/CommandLineFlagComboNegative.java
 - Fix CdsDifferentCompactObjectHeaders
 - hash->32bit
 - WIP
 - Markword changes
 - ... and 3 more: https://git.openjdk.org/lilliput/compare/c0496069...a86fb8fd

Changes: https://git.openjdk.org/lilliput/pull/128/files
 Webrev: https://webrevs.openjdk.org/?repo=lilliput&pr=128&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8325104
  Stats: 2690 lines in 81 files changed: 1916 ins; 417 del; 357 mod
  Patch: https://git.openjdk.org/lilliput/pull/128.diff
  Fetch: git fetch https://git.openjdk.org/lilliput.git pull/128/head:pull/128

PR: https://git.openjdk.org/lilliput/pull/128