[master] RFR: JDK-8325104: Lilliput: Shrink Classpointers [v3]
Thomas Stuefe
stuefe at openjdk.org
Tue Mar 26 09:23:43 UTC 2024
On Mon, 25 Mar 2024 19:38:11 GMT, John R Rose <jrose at openjdk.org> wrote:
> One possible problem with hyper-aligned data structures (such as Klass* pointers allocated zero mod 1024) is interactions with data cache dynamics. If your D$ maps blocks aligned zero mod 64, then your Klass blocks can occupy only 1/16 of possible D$ entries. Sometimes this matters, when you are loading from many Klasses: The effect is of making the D$, just for them, smaller by a factor of 16, causing more frequent collisions and evictions.
>
> (This sort of effect has been observed, IIRC, with both stacks and TLABs, in the past. I don’t know if Klass accesses are frequent enough to make similar effects be detectable or not.)
>
> A solution is to perturb the hyper-alignment in such a way that the lower bits select all possible D$ frames, instead of just 1/16 of them. The simplest way to do this is to make your hyper-aligned Klass* blocks aligned mod (1024 + 64), where 1024 is your intended size and 64 is your D$ entry size. (Or maybe mod 1024+8, if it is faster to shift by 8 instead of by 64.). Decoding a Klass index consists of multiplying by the sum of a pair of powers of two, instead of by a single power of two. That means an extra instruction in the decode fast-path, unless you can hide the extra shift/add in an addressing mode.
>
> Given that we work hard to make the Klass base address be zero, it is clear that every instruction counts, and even if the Klass base is zero, you still need an extra instruction or two to do the extra shift/add. So that is sad, but the fact that we are sensitive to performance of these sequences means that we might possibly be sensitive to D$ collision effects also: Saving one instruction but having the average number of cycles go up, due to D$ misses, could be a net loss.
>
> So I suggest considering (even to the point of doing performance tests) the idea of assigning Klass addresses mod 1088 instead of 1024. It might require some imaginative coding to adjust the allocator logic to hand out such addresses, but I don’t think it’s very bad. For allocator performance, multiplication by 1088 is trivial. There will be some divisions by 1088, probably, but those are optimized (usually) to multiplications as well, if the divisor (1088) is a compile time constant.
>
> The allocator could search for available blocks mod 1024, and then only at the end, before handing out the block, round up to the next multiple of 1088. That would add a number from 0 to 63 to the base address which has already been selected. If that shift makes the block overflow at the other end, then the allocator has to keep searching.
Hi John,
I remember your point well from our last exchange (https://github.com/openjdk/lilliput/pull/13#issuecomment-988456995). Last summer, I ran a number of performance tests but found no clear result above background noise. I had planned to revisit this issue and potentially fix it as a follow-up task. Of course, my test methodology may have been off.
Like you, I was worried about the increased cost and complexity of Klass ID decoding. Last summer I investigated what a possible decoding would look like. Since I am not a compiler expert, I just looked at what C compilers do when referencing items in a 1088-stride array. x64, for example, uses imul; aarch64 uses some form of shift+add. Maybe it's not so bad, but the effects need to be measured.
My short-term contingency plan had been to use some of the four now-free bits to increase the number of Klass ID bits and thus bring the alignment back down to 64 bytes. That would work at least for 64-bit headers, as long as Valhalla does not need those bits too.
I will re-examine my performance runs, and double-check for methodology errors.
> I see acknowledgement of this issue in a comment in this PR that says "smallest shift possible that still covers the whole range”. This is a good solution as long as that smallest shift is 64 (or less). A shift of 128 would dilute the D$ by 50%, but would reach 2^(22+7-10) distinct Klasses, half a million, which is great. I guess VMs which configure themselves for a quarter million classes or fewer just use the cache line as the allocation modulus, and all is well. So my solution (mod 1088) is really useful only for huge workloads. I retract my suggestion that performance experiments are needed at this time, given the solution you have in place.
Yes, that was the underlying reason for this logic. Unfortunately, the class space defaults to 1G, with some additional space for CDS, so we end up with a 2G range, which buys us just one bit. There are knobs to play with: the CDS size is usually very small, so by default both CDS and class space together could easily fit into 1G. That buys us 2 bits, back to an alignment of 256. Still only every fourth cache line.
-------------
PR Comment: https://git.openjdk.org/lilliput/pull/128#issuecomment-2019911943