[master] RFR: JDK-8325104: Lilliput: Shrink Classpointers [v3]

Mon Mar 25 19:40:44 UTC 2024

On Mon, 25 Mar 2024 14:51:14 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> Hi,
>> 
>> I wanted to get input on the following improvement for Lilliput. Testing is still ongoing, but things look really good, so this patch is hopefully near its final form (barring any objections from reviewers, of course).
>> 
>> Note: I have a companion patch prepared for upstream, minus the markword changes. I will attempt to get that one upstream quickly in order to not have a large delta between upstream and lilliput, especially in Metaspace.
>> 
>> ## High-Level Overview
>> 
>> (for a short sequence of slides, please see https://github.com/tstuefe/fosdem24/blob/master/classpointers-and-liliput.pdf - these accompanied a talk we held at FOSDEM 24).
>> 
>> We want to reduce the bit size of narrow Klass to free up bits in the MarkWord. 
>> 
>> We cannot just reduce the Klass encoding range size (well, we could, and maybe we will later, but for now we decided not to). We instead increase the alignment Klass is stored at, and use that alignment shadow to store other information.
>> 
>> In other words, this patch changes the narrow Klass Pointer to a Klass ID, since now (almost) every value in its value range points to a different class. Therefore, we use the value range of nKlass much more efficiently.
>> 
>> We then use the newly freed bits in the MarkWord to restore the iHash to 31 bits: 
>> 
>> 
>> [ 22-bit nKlass | 31-bit iHash | 4 free bits | age | fwd | lck ]
>> 
>> nKlass gets reduced to 22 bits. Identity hash gets re-inflated to 31 bits. Preceding iHash are now 4 unused bits. Rest is unchanged.
>> 
>> (Note: I originally wanted to swap iHash and nKlass such that either of them could be loaded with a 32-bit load, but I found that tricky since C2 seems to rely on the nKlass offset in the Markword being > 0.)
>> 
>> ## nKlass reduction:
>> 
>> The reduction in nKlass size is made by only storing them at 10-bit aligned addresses. That alignment (1KB) works well in practice since Klass - although var sized - typically is between 512 bytes and 1KB in size. Outliers are possible, but the size distribution is bell-curvish [1], so far-away outliers are very rare. 
>> 
>> To not lose memory to alignment waste, metaspace is reshaped to handle arbitrarily aligned allocations efficiently. Basically, we allow the non-Klass arena of a class loader to steal the alignment waste storage from the class arena. So, alignment waste blocks are filled with non-Klass metadata. That works very well in practice since non-Klass metadata is numerous and fine-granular compared to the big Klass bloc...
>
> Thomas Stuefe has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 20 commits:
> 
>  - Roman feedback - small stuff
>  - Merge branch 'master' into Smaller-ClassPointers
>  - revert COH archive generation
>  - Remove files that accidentally slipped in
>  - Merge branch 'master' into Smaller-ClassPointers
>  - Merge
>  - Merge commit 'c1281e6b45ed167df69d29a6039d81854c145ae6~1' into Smaller-ClassPointers
>  - Fix Typo
>  - Better CDS arch generation
>  - Fix error in COH archive generation
>  - ... and 10 more: https://git.openjdk.org/lilliput/compare/b2fcfb73...1260f2d6

One possible problem with hyper-aligned data structures (such as Klass* pointers allocated zero mod 1024) is interactions with data cache dynamics.  If your D$ maps blocks aligned zero mod 64, then your Klass blocks can occupy only 1/16 of possible D$ entries.  Sometimes this matters, when you are loading from many Klasses:  The effect is of making the D$, just for them, smaller by a factor of 16, causing more frequent collisions and evictions.

(This sort of effect has been observed, IIRC, with both stacks and TLABs, in the past.  I don’t know if Klass accesses are frequent enough to make similar effects be detectable or not.)

A solution is to perturb the hyper-alignment in such a way that the lower bits select all possible D$ frames, instead of just 1/16 of them.  The simplest way to do this is to make your hyper-aligned Klass* blocks aligned mod (1024 + 64), where 1024 is your intended size and 64 is your D$ entry size.  (Or maybe mod 1024+8, if it is faster to shift by 8 instead of by 64.).  Decoding a Klass index consists of multiplying by the sum of a pair of powers of two, instead of by a single power of two.  That means an extra instruction in the decode fast-path, unless you can hide the extra shift/add in an addressing mode.

Given that we work hard to make the Klass base address be zero, it is clear that every instruction counts, and even if the Klass base is zero, you still need an extra instruction or two to do the extra shift/add.  So that is sad, but the fact that we are sensitive to performance of these sequences means that we might possibly be sensitive to D$ collision effects also:  Saving one instruction but having the average number of cycles go up, due to D$ misses, could be a net loss.

So I suggest considering (even to the point of doing performance tests) the idea of assigning Klass addresses mod 1088 instead of 1024.  It might require some imaginative coding to adjust the allocator logic to hand out such addresses, but I don’t think it’s very bad.  For allocator performance, multiplication  by 1088 is trivial.  There will be some divisions by 1088, probably, but those are optimized (usually) to multiplications as well, if the divisor (1088) is a compile time constant.

The allocator could search for available blocks mod 1024, and then only at the end, before handing out the block, round up to the next multiple of 1088.  That would add a number from 0 to 63 to the base address which has already been selected.  If that shift makes the block overflow at the other end, then the allocator has to keep searching.

-------------

PR Comment: https://git.openjdk.org/lilliput/pull/128#issuecomment-2018769429