[master] RFR: Smaller class pointers [v3]

Tue Dec 7 13:51:34 UTC 2021

> May I please have opinions on this proposal to reduce class pointer size? 
> 
> This patch prepares the VM to work with narrow Klass pointers smaller than 32 bits and - for now - reduces them to 22 bits.
> 
> The patch implements the proposal I made in September on lilliput-dev [1]. In that mail thread, John Rose outlined the state of the discussions at that time [2]. Discussions focus on reducing the number of Klass structures falling into the encoding range and/or reducing the size of Klass structures to make more of them fit into the encoding range (and as a bonus, I assume, making them fixed-sized). The former approach would see us splitting classes into two sets and letting only one of them use narrow Klass pointers. The latter would mean breaking Klass up into several structures and moving lesser-used ones out of the encoding range.
> 
> My approach is less invasive and requires fewer fundamental changes to Klass handling. We can use it as a stepping stone to get the ball rolling and later refine our approach. Or, maybe my patch already suffices. Or perhaps you don't like it. Let's see :)
> 
> ## Narrow Klass encoding changes
> 
> Like compressed oops, Klass pointer encoding uses an add+shift scheme. 
> 
> Since at least 2013, the shift value has been either zero or three: zero for CDS=off, three for CDS=on. Note that CDS has been active by default since JDK 15, and it usually makes no sense to deactivate it since it improves startup significantly. Therefore, in practice, the shift value was always three. Until recently, that is - in May, we switched to zero to solve an unrelated aarch64 problem [3].
> 
> The current scheme uses a non-zero shift to, in theory, enlarge the encoding range to 32GB - see `KlassEncodingMetaspaceMax`. But that does not do much good since we also deliberately cap the size of the relevant memory areas to 4GB (3GB for class space, the remainder for CDS). So in practice, the enlarged encoding range does not give us more breathing space for class data. Nor would it be needed since 4GB is a lot.
> 
> Enlarging the encoding range has an indirect positive effect since it increases the chance of getting zero-based encoding. However, CDS deliberately maps its archive and the class space outside the lower 32G address range. There are reasons for that, but they don't matter for this PR. So, again, in practice, we usually did not benefit from that advantage.
> 
> In conclusion: for years, we chiefly used a non-zero shift value. But I believe that shift had never been particularly useful.
> 
> ### New scheme
> 
> This patch proposes to revive the shift and give it a new meaning. It also increases its value to - for now - 9 bits. Instead of using it to enlarge the encoding range, we snip off those bits to make the narrow Klass pointer smaller. We save one additional bit by reducing the encoding range, for now, from 4 GB to 2 GB. Together, the two changes save us 10 bits, and we thus arrive at 22-bit narrow Klass pointers. These values are not set in stone; they are tweakable, with an increasing number of caveats.
> 
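A minimal sketch of the add+shift scheme with the values proposed above (9-bit shift, 2 GB range). All names here (`kShift`, `encode`, `decode`, `narrow_klass_t`) are hypothetical and chosen for illustration; this is not the code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants taken from the proposal text; names are hypothetical.
constexpr unsigned kShift = 9;                               // 512-byte Klass alignment
constexpr unsigned kRangeBits = 31;                          // 2 GB encoding range
constexpr uint64_t kEncodingRangeBytes = uint64_t(1) << kRangeBits;
constexpr unsigned kNarrowKlassBits = kRangeBits - kShift;   // 31 - 9 = 22

using narrow_klass_t = uint32_t;

// Encode: subtract the base, then drop the low bits that alignment guarantees are zero.
narrow_klass_t encode(uint64_t base, uint64_t klass_addr) {
  const uint64_t offset = klass_addr - base;
  assert(offset < kEncodingRangeBytes);                      // must lie inside the range
  assert((offset & ((uint64_t(1) << kShift) - 1)) == 0);     // must be 512-byte aligned
  return static_cast<narrow_klass_t>(offset >> kShift);
}

// Decode: shift the narrow value back up and add the base.
uint64_t decode(uint64_t base, narrow_klass_t nk) {
  return base + (uint64_t(nk) << kShift);
}
```

The largest encodable offset is 2 GB minus 512 bytes, which maps to the largest 22-bit value.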
> #### Klass structure sizes
> 
> Increasing the encoding shift increases alignment requirements for Klass structures and hence alignment-related waste. So it trades metaspace/CDS memory for heap memory. It does so expecting that heap memory reduction by Lilliput far outweighs the modest increase in metaspace/CDS footprint. 
> 
> To understand these tradeoffs, here are some facts:
> 
> - `Klass` structures are variable-sized. Their size ranges from about 480 bytes to just shy of a megabyte.
> - Their size distribution clusters heavily around the lower end of that range. Large Klass structures do exist and need to be supported, but they are scarce.
> - The overwhelming majority of all Klass structures are smaller than 1K. Of those, a large part is a wee bit smaller than 512 bytes.
> - Of particular relevance are classes generated for lambdas and reflection invocation. These get generated by the VM and can appear in masses. Their size typically hovers just a bit over 512 bytes.
> 
> I created some histograms of Klass size distribution for some common scenarios [4]. For example, this is spring petclinic after startup:
> 
> ![spring-petclinic-histo](https://user-images.githubusercontent.com/6041414/144208866-d672e287-4158-4ff0-851b-b75f8cdcc013.png)
> 
> We can see in all scenarios that Klass size distribution likes to spike around 512 bytes. Distribution then tapers out toward larger sizes, and we see very few Klass structures larger than 1 KB. Of course, this is not an exhaustive investigation, but enough to understand the reasoning that went into this patch.
> 
> From these numbers, we can see that either 9 bits (512 B alignment) or 10 bits (1 KB alignment) would be reasonable shift values. 
> 
> #### Memory cost of increased alignment
> 
> My patch takes measures to limit alignment waste due to the larger Klass alignment:
> 
> 1) Before this patch, metaspace used a global allocation alignment, which happened to be 8 bytes. Naively increasing that one to 512 bytes would have been very expensive since most metadata are tiny. So instead, my patch only changes class space to use the larger alignment. Storage of non-class metadata is unaffected.
> 
> 2) The patch does the same in CDS: Klass structures in the CDS archives (and only those) now get stored at 512-byte boundaries. CDS is a bit different from class space, though, in that it intersperses Klass structures with non-class metadata. Therefore, the larger alignment gaps preceding Klass structures mostly get filled with non-class metadata. Little fish among the big fish.
> 
> We can roughly calculate the footprint tradeoff I made in my patch:
> 
> A 512-byte alignment wastes 256 bytes per Klass structure on average in class space. In reality, this number will be skewed, but it is difficult to predict in which direction (see [4]). Assuming Lilliput manages to reduce object header size by just 4 bytes, we'd break even if we allocate 64 objects per class on average. I think that is realistic, especially for larger heaps.
> 
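The break-even estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the helper names (`align_up`, `padding_for`, `break_even_objects`) are hypothetical, and the 4-byte header saving is the assumption stated in the text, not a measured value.

```cpp
#include <cassert>
#include <cstddef>

constexpr size_t kKlassAlignment = 512;  // alignment implied by a 9-bit shift

// Round value up to the next multiple of alignment (alignment must be a power of two).
constexpr size_t align_up(size_t value, size_t alignment) {
  return (value + alignment - 1) & ~(alignment - 1);
}

// Padding that follows a Klass of the given size when the next one must start
// on a 512-byte boundary.
constexpr size_t padding_for(size_t klass_size) {
  return align_up(klass_size, kKlassAlignment) - klass_size;
}

// Number of objects per class needed so that the heap savings cover the
// alignment waste of that class (rounded up).
size_t break_even_objects(size_t waste_per_class, size_t header_bytes_saved) {
  assert(header_bytes_saved > 0);
  return (waste_per_class + header_bytes_saved - 1) / header_bytes_saved;
}
```

With the average waste of 256 bytes and a 4-byte saving, `break_even_objects(256, 4)` yields 64. A Klass just over 512 bytes, such as the lambda classes mentioned earlier, sits at the unfavorable end: `padding_for(520)` is 504 bytes.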
> And note that this is just a first step. I am sure we can reduce that alignment gap price in class space. Use it for something productive. I am playing around with some ideas for future improvements.
> 
> #### Using the alignment gap?
> 
> One idea - which may sound absurd at first - would be to eliminate the non-class metaspace and store everything in class space. At first glance, that sounds counterintuitive since it increases the memory range we need to encode. We don't want that, right? But coupled with a large shift for Klass structures - maybe even larger than 9 bits, let's say 12 (4 KB) - this could make sense. The large alignment gaps Klass structures cause would naturally fill up with their non-class brethren. And the large shift in turn would expand the encoding range. Such an approach would effectively use the implicit knowledge that Klass and non-class data belonging to one class are usually allocated around the same time and have the same life span.
> 
> It is just an idea, maybe it does not pan out. But it shows that the modest memory footprint increase this patch causes is not set in stone.
> 
> #### Class space arena becomes a table
> 
> One could say that by using a larger shift, we transform the class space from an arena with tight contiguous allocation into a table with 512-byte slots. In that table, Klass structures use up 1 to n slots. A narrow Klass pointer could be seen as an index into that table.
> 
> So we move toward a table-based scheme, but one that smoothly accommodates larger-sized Klass structures that cover multiple slots. And it does so while retaining all the metaspace goodies. We get to keep free-list management, defragmentation after unloading, memory reclamation to the OS, and so on, at no added complexity.
> 
> ## How many classes can we accommodate
> 
> From the Klass size distribution, we can see that Klass structures are overwhelmingly smaller than 1K, and a significant part is just short of 512 bytes. So Klass structures, on average, take about one-and-a-half 512-byte table slots. A narrow Klass pointer size of 22 bits gives us about 4 million slots (2^22), so we could hold ~3 million classes. 
> 
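A small sketch of the slot arithmetic behind this estimate, viewing class space as a table of 512-byte slots. The names (`slots_for`, `approx_max_classes`) are hypothetical, and the 1.5-slot average is the rough figure from the text, not a measurement.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kSlotSize = 512;                                // one table slot
constexpr unsigned kNarrowKlassBits = 22;
constexpr uint64_t kNumSlots = uint64_t(1) << kNarrowKlassBits;    // 4,194,304 slots

// Number of 512-byte slots a Klass of the given size occupies (rounded up).
constexpr uint64_t slots_for(uint64_t klass_size) {
  return (klass_size + kSlotSize - 1) / kSlotSize;
}

// Rough class capacity for an average slot usage given as the fraction num/den,
// e.g. 3/2 for one-and-a-half slots per class.
uint64_t approx_max_classes(uint64_t num, uint64_t den) {
  assert(num > 0 && den > 0);
  return kNumSlots * den / num;
}
```

Under these assumptions, `approx_max_classes(3, 2)` comes out just under 2.8 million, which is in line with the rough figure quoted above.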
> Note that this calculation is a bit fuzzy. Larger Klass structures do exist; they are just rare. And as we have seen, CDS stores data more tightly than class space. Note, however, that CDS is never the culprit in scenarios where we load tons of classes since those tend to be dynamically generated.
> 
> ## 16-bit Klass pointers, or how many classes we want to accommodate
> 
> The maximum number of classes Lilliput supports is important. The maximal reduction of the narrow Klass pointer depends on this answer.
> 
> I believe we have to be backward compatible and allow scenarios with tons of classes even after Lilliput. Those seem exclusively cases where classes are generated. One prominent example is JRuby. Anecdotally, according to Charles Nutter, JRuby can reach >500k classes. That is a lot. I am certain other examples exist too.
> 
> In the current JDK, we have no limit to the number of classes we can load, at least none imposed by class space. That is because a fallback exists with `UseCompressedClassPointers`. If one crazy app really maxes out 3 GB class space, it still could disable `UseCompressedClassPointers` and keep working. 
> 
> But if we remove `UseCompressedClassPointers` for good, we remove that fallback. That means whatever limit Lilliput imposes is final and needs to be good enough for outlandish cases like JRuby. And thus may be oversized for standard applications that need to load just a few thousand classes.
> 
> Were we to keep `UseCompressedClassPointers`, we'd have a fallback and therefore could set the class limit for Lilliput much lower. Lilliput would only need to cover 99%, not 100%, of all cases. A much lower maximum class limit could lead to a much smaller narrow Klass pointer. One even may arrive at 16-bit class pointers.
> 
> Anyway, for now my patch does not touch this question. 9-bit shift and 2 GB encoding range are tame enough that even workloads like JRuby should not be bothered.
> 
> ## The patch
> 
> 1) I changed the metaspace to work with an arena-specific alignment instead of dictating a global allocation alignment. I then switched the class space to use that new large Klass alignment. I also needed to rewrite guard handling for metaspace arenas. But I did that upstream [5] to simplify this patch.
> 2) I changed alignment handling in CDS.
> 3) I factored out narrow Klass encoding into its own header/implementation (compressedKlass.hpp/cpp).
> 4) Fixed narrow Klass encoding/decoding for both x64 and aarch64. 
> 5) I added new tests that check the Klass pointer encoding setup with different values for the encoding base address. In addition, I had to disable some older tests that made no sense after Lilliput (some of which were questionable even before).
> 
> More details:
> 
> - MacroAssembler:
>   - aarch64 figures out the decode mode once during initialization, then uses that info. x64 recalculates the decode mode anew on each decode/encode. I liked aarch64 more, since it seems cleaner and allows easier error handling, and used that scheme for both platforms.
>   - I introduced a new `MacroAssembler::klass_decode_mode(base address)` alongside the existing version that takes no parameter and instead uses the globally set encoding base. It can be used to negotiate a valid encoding scheme during initialization without having to outguess the assembler decoding (compare aarch64's old and new version of `CompressedKlassPointers::is_valid_base()`).
>   - Shift is never zero with my patch. So I removed handling of shift=0.
>   - For x64, I added a xor+shift mode to be used instead of add+shift if possible
>   - On x64, we cannot use `leaq`, since the scale factor is limited to at most 8. I removed that path and replaced it with add+shift.
>   - Encodings get tested by a new jtreg test, `runtime/CompressedOops/CompressedClassPointerEncodingTest.java`, which forces the encoding range to several arbitrary bases and then checks that the VM figures out the correct encoding mode. It should also serve as a primitive functional test since some Java code runs and some classes are loaded. It mostly tests x64, but I plan to beef up the aarch64 side later.
> 
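A sketch of how a decode mode could be picked once for a given base and then reused, contrasting xor+shift with add+shift. The condition used here (the base has no bits set inside the 2 GB range, so xor and add give the same result) is an assumption made for illustration; the names `DecodeMode`, `pick_mode` and `decode` are hypothetical and do not mirror the `MacroAssembler` code in the patch.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kShift = 9;       // narrow Klass pointer shift
constexpr unsigned kRangeBits = 31;  // 2 GB encoding range

enum class DecodeMode { Xor, Add };

// Assumed rule: if the base has no bits set within the encoding range, the
// shifted narrow pointer and the base cannot overlap, so xor equals add.
constexpr bool can_use_xor(uint64_t base) {
  return (base & ((uint64_t(1) << kRangeBits) - 1)) == 0;
}

// Decide the mode once, e.g. during initialization, for a given base.
constexpr DecodeMode pick_mode(uint64_t base) {
  return can_use_xor(base) ? DecodeMode::Xor : DecodeMode::Add;
}

// Decode a narrow Klass pointer using the previously chosen mode.
uint64_t decode(uint64_t base, DecodeMode mode, uint32_t nk) {
  assert(nk < (uint32_t(1) << (kRangeBits - kShift)));  // must fit in 22 bits
  const uint64_t offset = uint64_t(nk) << kShift;
  return mode == DecodeMode::Xor ? (base ^ offset) : (base + offset);
}
```

For a base that overlaps the offset bits, xor would produce a wrong address, which is why the fallback to add+shift is needed in this sketch.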
> - Metaspace: those changes are mostly straightforward, a lot of the work happened upstream. In this patch:
>   - I factored out the new alignment handling into its own header, `metaspaceAlignment.hpp`
>   - Metaspace arenas now get their own alignment, replacing the global metaspace allocation alignment. Arenas in class space then use the larger Klass alignment. 
>   - The more onerous changes were to the metaspace gtests, which needed to be made aware of arena-specific alignment.
> 
> - CDS:
>   - The comment in `archiveBuilder.hpp`, highlighting the various alignments CDS has to deal with, could help here
>   - In my first attempt to fix CDS, I was a bit too aggressive and found that CDS is hellishly easy to break - especially if one wants to get it right on both 32-bit and 64-bit platforms. The reason is that CDS has some implicit alignment assumptions which are not immediately clear. The comment in `DumpRegion::allocate` explains this. In the end, I threw my first version away and implemented a minimal approach.
> 
> - Narrow Klass encoding: 
>   - Before this patch, that code was intermixed with compressed oops encoding in `compressedOops.hpp`. I moved all of it into a separate set of files (`compressedKlass.cpp/hpp`) for clarity
>   - Shift is now a constant. There is no zero shift with Lilliput if we follow my proposal.
>   - I am still a bit unhappy with how the encoding base address gets negotiated at VM startup. I have been unhappy before too. But this only affects either the CDS=off case or the case where we cannot reserve space at the preferred CDS base address. In the big scheme of things, it probably does not matter. Certainly not for this PR.
>   - I am also unhappy with the fact that we compile a lot of code unnecessarily on 32-bit. In the future, maybe we could exclude all that code cleanly from compiling on 32-bit. That would also simplify things in metaspace. But not for this PR.
> 
> - I removed `COMPRESSED_CLASS_POINTERS_DEPENDS_ON_COMPRESSED_OOPS`, because we want to run with or without coops without tying narrow class pointers to that decision. If that causes problems for downstream projects, that is regrettable but not my focus atm.
> 
> I realize that this patch is largish, and I apologize for the review work. I tried to keep it small, but it kept growing on me. If we bring this into Lilliput, I will bring parts of it upstream, piecemeal, to ease merging pains.
> 
> ## Test state:
> 
> - All GHAs are green. That means tier1 successfully ran on x64 and x86 and the VM at least builds on the remaining platforms, including the oddballs.
> - I manually did some tests on aarch64 and made sure the basics worked. I ran `runtime/CompressedOops`, `runtime/Metaspace` and the various gtest variants, and I manually tested Klass encoding. However, I work on an underpowered Raspi here and testing is painful.
> - PPC, s390 don't work for now and remain broken.
> - Zero remains broken - it builds but won't run, as before.
> 
> I'd appreciate it if others could test too, especially on aarch64. I did what tests I could but do not have much hardware available atm.
> 
> Thank you,
> 
> Thomas
> 
> [1] https://mail.openjdk.java.net/pipermail/lilliput-dev/2021-September/000101.html
> [2] https://mail.openjdk.java.net/pipermail/lilliput-dev/2021-September/000102.html
> [3] https://bugs.openjdk.java.net/browse/JDK-8265705
> [4] https://github.com/tstuefe/metaspace-klass-size-analysis/blob/master/README.md
> [5] https://bugs.openjdk.java.net/browse/JDK-8273783

Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:

  feedback Ioi, fix CDS archive size estimate

-------------

Changes:
  - all: https://git.openjdk.java.net/lilliput/pull/13/files
  - new: https://git.openjdk.java.net/lilliput/pull/13/files/c7861b6a..da6c2272

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=lilliput&pr=13&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=lilliput&pr=13&range=01-02

  Stats: 9 lines in 1 file changed: 3 ins; 0 del; 6 mod
  Patch: https://git.openjdk.java.net/lilliput/pull/13.diff
  Fetch: git fetch https://git.openjdk.java.net/lilliput pull/13/head:pull/13

PR: https://git.openjdk.java.net/lilliput/pull/13