[External] : Solving the Klass hyperalignment problem
coleen.phillimore at oracle.com
Tue Jun 4 20:31:17 UTC 2024
Hi Thomas,
I'm sad and disappointed that the indirect klass pointer approach
performed poorly. Maybe someday we'll have hardware where loading from
memory is not so noticeably slower or algorithms where we don't load the
klass in tight loops. Until that day, your uneven cache line approach
sounds like the most feasible approach, despite the quite complicated
instructions to encode and decode the klass pointers. Luckily we don't
encode and decode klass pointers in too many places.
I do have a patch for allocating klasses without instances but I won't
get to it for a few weeks.
Thank you for the great writeup.
Coleen
On 5/23/24 11:20 AM, Thomas Stüfe wrote:
> Hi all,
>
> I would like help deciding on the best mitigation strategy for
> Lilliput's Klass hyperalignment problem. Since it has wide effects
> (e.g. a possible removal of class space), I'd like to base the next
> steps on consensus.
>
> (a more readable version of this, in markdown, is here:
> https://gist.github.com/tstuefe/6d8c4a40689c34b12f79442a8469504e
>
> 1. Background
>
> We store class information in Klass, and resolving Klass from oop is a
> hot path. One example is GC: During GCs, we churn through tons of
> objects and need to get at least object size (layout helper) and
> Oopmap from Klass frequently. Therefore, the way we resolve a nKlass
> matters for performance.
>
> Today (non-Lilliput), we go from Object to Klass (with compressed
> class pointers) this way: We pluck the nKlass from the word adjacent
> to the MW. We then calculate Klass* from nKlass by - typically - just
> adding the encoding base as immediate. We may or may not omit that
> add, and we may or may not shift the nKlass, but the most typical case
> (CDS enabled) is just the addition.
>
> Today's decoding does not need a single memory access; it can happen
> in registers only.
>
> In Lilliput, the nKlass lives in the MW (which allows us to load
> nKlass with the same load that loads the MW). Therefore, nKlass needs
> to shrink. The problem with the classic 32-bit nKlass is that its
> value range is not used effectively. Klass structures tend to be
> large, on average 500-700 bytes [1], and that means a lot of values in
> that 32-bit range are "blind" - point into the middle of a class - and
> are hence wasted. Ideally, one would want to use one nKlass value per
> class.
>
> In Lilliput, we reduced nKlass to 22-bit. We do this by placing Klass
> structures only on 1KB-aligned addresses. Therefore, the lower 10 bits
> are 0 and can be shifted out. 1KB was chosen as a middle-ground that
> allows us to use both the nKlass value range and the Klass encoding
> range (4GB) effectively.
>
>
> 2. The Problem
>
> By keeping Klass structures 1KB-aligned, we march in lockstep with
> respect to CPU caches. With a cache line size of 64 bytes (6 bits) and
> an alignment of 1KB (10 bits), we lose 4 bits of entropy. Therefore,
> loads from a Klass structure have a high chance of evicting earlier
> loads from other Klass structures. Depending on saturation and the
> number of cache ways, we may effectively use only a 16th of the cache.
>
> We see this effect especially in GC pause times. The bad cache
> behavior is clearly noticeable and needs to be solved.
>
>
> 3. The solutions
>
> 3.1. Short-term mitigation: Increasing the nKlass size to 26 bits
>
> A simple short-term mitigation for Lilliput Milestone 1 is just
> increasing the nKlass size. By reducing the nKlass size to 22 bits, we
> freed up 10 bits, but to date, we only use 6 of them. For now, we have
> 4 spare bits in the header. We could increase the nKlass to 26 bits
> and work with a shift of 6 bits instead of 10 bits. That would require
> no additional work. Klass would be aligned to just 64 bytes, so the
> cache performance problems would disappear.
>
> Note, however, that we earmarked those 4 spare bits for Valhalla's
> use. Therefore, reverting to a 26-bit nKlass can only be an
> intermediate step. And, obviously, it won't do for 32-bit headers.
>
>
> 3.2 Use a pointer indirection table
>
> This idea resurfaces every couple of years. The idea is to replace the
> class space with an indirection pointer table. In this approach, a
> nKlass would be an index into a global pointer table, and that pointer
> table contains the real Klass* pointers.
>
> The enticing part of this approach is that we could throw away the
> class space and a bunch of supporting code. Klass structures could
> live in normal Metaspace like all the other data. However, we would
> need some new code to maintain the global Klass* table, recycle empty
> pointer slots after class unloading, etc.
>
> Decoding a nKlass would mean:
> - load the nKlass from the object
> - load Klass* from the indirection table at the index nKlass points to
>
> The approach would solve the cache problem described above
> since removing any alignment requirement from Klass structures allows
> us to place them wherever we like (e.g., in standard Metaspace), and
> their start addresses would not march in lockstep.
>
> However, it introduces a new cache problem since we now have a new
> load in the hot decoding path. And the Klass pointer table can only be
> improved so much: only 8 uncompressed pointers fit into a cache line.
> Beyond a certain number of classes, subsequent table accesses will have
> little spatial locality.
>
> 3.3 Place Klass on alternating cache lines
>
> Originally brought up by John Rose [2] when we did the first iteration
> of 22-bit class pointers in Lilliput. The idea is to stagger the
> locations of Klass structures across cache lines. Nothing forces us to
> use a power-of-two stride; we can use any odd multiple of the cache
> line size. For example, a stride of 11 cache lines (704 bytes) means
> that consecutive Klass structures land on different cache lines.
>
> With a non-pow2 stride, decoding becomes a bit more complex. We cannot
> use a shift; we need an integer multiplication:
> - multiply the nKlass by 704
> - add the base.
>
> But all of this can still happen in registers only. No memory load is
> needed.
>
> 4. The prototypes
>
> I wanted to measure the performance impacts of all approaches. So I
> compared four JVMs:
>
> - A) (the unmitigated case) a Lilliput JVM as it is today: 22-bit
> nKlass, 10-bit shift
> - B) a Lilliput JVM that uses a 26-bit nKlass with a 6-bit shift
> - C) a Lilliput JVM that uses a 22-bit nKlass and a Klass pointer
> indirection table [3]
> - D) a Lilliput JVM that uses a 22-bit nKlass and a non-pow2 alignment
> of Klass of 704 [4]
>
> It turned out that Coleen also wrote a prototype with a Klass pointer
> indirection table [5], but that is identical to mine (C) in all
> relevant points. The only difference is that Coleen based it on the
> mainline JVM; mine is based on Lilliput. But I repeated all my tests
> with Coleen's prototype.
>
> 5. The Tests
>
> I did both SpecJBB2015 and a custom-written Microbenchmark [6].
>
> The microbenchmark was designed to stress Object-to-Klass
> dereferencing during GC. It fills the heap with many objects of
> randomly chosen classes. It keeps those objects alive in an array. It
> then executes several Full GCs and sums up all GC pause times. Walking
> these objects forces the GCs to de-reference many different nKlass values.
>
> The Microbenchmark was run on a Ryzen Zen 2, the SpecJBB on an older
> i7-4770, and Coleen's prototype I also tested on a Raspberry 4. The
> tests were isolated to 8 cores (well, apart from the Raspberry), and I
> tried to minimize scheduler interference. The microbenchmark results
> were pretty stable, but the SpecJBB2015 results fluctuated - despite
> my attempts at stabilizing them.
>
> I repeated the Microbenchmark for a number of classes (512..16384) and
> three different collectors (Serial, Parallel, G1).
>
> 6. The Results
>
> The microbenchmark shows clear and stable results. SpecJBB fluctuated
> more.
>
> 6.1 Microbenchmark, G1GC
>
> See graph [7].
>
> (A) - the unmitigated 22-bit version showed overall the worst
> performance, with a 41% increase over the best performer (B, the
> 26-bit version) at 16k classes.
> (B) - best performance
> (C) - the klass pointer table seemed overall the worst of the three
> mitigation prototypes. For 4k..8k classes, even worse than the
> unmitigated case (A). Maxes out at +36% over the best performer (B) at
> 16k classes.
> (D) - second best performance, for certain class ranges even best.
> Maxes out at +11% at 16k classes.
>
> I wondered why (D) could be better than (B). My assumption is that
> with (D), we go out of our way to choose different cache lines. With
> (B), the cache line chosen is "random" and may be subject to
> allocation pattern artifacts of the underlying allocator.
>
> 6.2 Microbenchmark, ParallelGC
>
> See graphs [8].
>
> Differences are less pronounced, but the results are similar. (B)
> better than (D) better than (C).
>
> 6.3 Microbenchmark, SerialGC
>
> See graphs [9].
>
> Again, the same result. Here the deltas are most pronounced; the cache
> inefficiency, measured via GC pauses, is most apparent.
>
> 6.4 Coleen's prototype
>
> I repeated the measurements with Coleen's prototype (with G1),
> comparing it against the same JVM with klass table switched off. No
> surprises, similar behavior to (C) vs (B). See
> [10]. I also did a run with perf to measure L1 misses, and we see up
> to 32% more L1 cache misses with the klass pointer table [11].
>
> 6.5 SpecJBB2015
>
> SpecJBB results were quite volatile despite my stabilization efforts.
> The deltas in maxJOps and critJOps did not rise above random noise.
> The GC pause times showed (percentages, relative to (A) == 100%):
>
> | Run | 1 | 2 | 3 |
> |-----|---------|---------|--------|
> | B | 108.68% | 95.88% | 97.01% |
> | C | 104.26% | 92.15% | 90.25% |
> | D | 96.20% | 94.31% | 86.44% |
>
> (all with G1GC)
>
> Again, (D) seems to perform best. I am unsure what the problem is with
> (B) here (the 26-bit class pointer approach) since it seems to perform
> worst in all cases. I will look into that.
>
> 7. Side considerations
>
> 7.1 Running out of class space/addressable classes?
>
> This issue does not affect the number of addressable classes much.
> That one is limited by the size of the nKlass. With a 22-bit nKlass,
> that is about 4 million. We can only go beyond that with the concept
> of near- vs far classes suggested by John Rose. In any case, that is
> not the focus here.
>
> The class-space-based approaches (A), (B), and (D) also have a soft
> limit in that the number of Klass we can store is limited by the size
> of the encoding range (4GB). However, that is a rather soft limit
> because no hard technical reason prevents us from having a larger
> encoding range. The 4G limitation exists only because we optimize the
> addition of the base immediate by using e.g. 16-bit moves on some
> platforms, so the nKlass must not extend beyond bit 31. We may be able
> to do that differently. Note that 4GB is also really large for Klass data.
>
> 7.2 Reducing the number of Klass structures addressable via nKlass
>
> Coleen had a great idea: classes that are never instantiated (e.g.,
> Lambda Forms) don't need an nKlass at all and therefore don't have to
> live within the Klass encoding range. That would be a great improvement
> since these classes are typically generated, and their number is
> unpredictable. Removing these classes from the equation makes the
> question of the number of addressable classes a lot more relaxed.
>
> 8. Conclusions
>
> For the moment, I prefer (D) (the uneven-cache-lines-approach). It
> shows the overall best performance, in parts even outperforming the
> 26-bit approach.
>
> The Klass pointer table approach (C) would be nice since we could
> eliminate class space and a lot of coding that goes with it. That
> would reduce complexity. But the additional load in hot decoding paths
> hurts. There is also the vague fear of not being future-proof. I am
> apprehensive about sacrificing Klass resolving performance since Klass
> lookup seems to be something we will always do.
>
> That said, all input (especially from you, Coleen!) is surely welcome.
>
> # Materials
>
> All test results, tests, etc. can be found here: [12]
>
>
> Thanks, Thomas
>
>
> - [1] [Allocation Histogram for Klass- and Non-Klass Allocations](https://raw.githubusercontent.com/tstuefe/metaspace-statistics/ab3625e041d42243039f37983969ac8b770a9f4a/Histogram.svg)
> - [2] https://github.com/openjdk/lilliput/pull/13#issuecomment-988456995
> - [3] [Klass table prototype](https://github.com/tstuefe/lilliput/tree/lilliput-with-Klass-indirection-table)
> - [4] [Uneven alignment prototype](https://github.com/tstuefe/lilliput/tree/lilliput-with-staggered-Klass-alignment)
> - [5] [Klass table prototype, Coleen](https://github.com/openjdk/jdk/pull/19272)
> - [6] [The Microbenchmark](https://github.com/tstuefe/test-hyperaligning-lilliput/blob/master/microbenchmark/the-test/src/main/java/de/stuefe/repros/metaspace/InterleaveKlassRefsInHeap.java)
> - [7] [Graph: G1 microbench, absolute GC pauses](https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/e55025622ec3574b0252ff82d860e63603a9f7df/microbenchmark/archived/results-2024-05-14T15-36-27-CEST/G1GC-abs-pauses.svg)
> - [8] [Graph: ParallelGC microbench, absolute GC pauses](https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/e55025622ec3574b0252ff82d860e63603a9f7df/microbenchmark/archived/results-2024-05-14T15-36-27-CEST/ParallelGC-abs-pauses.svg)
> - [9] [Graph: SerialGC microbench, absolute GC pauses](https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/e55025622ec3574b0252ff82d860e63603a9f7df/microbenchmark/archived/results-2024-05-14T15-36-27-CEST/SerialGC-abs-pauses.svg)
> - [10] [Graph: G1 microbench, absolute GC pauses, alternate klasstable prototype](https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/6fb9d64fff930e16154c4b3fab9a24169b021de8/microbenchmark/archived/coleen-kptable-results-2024-05-21T11-14-31-CEST/g1gc-pauses-absolute.svg)
> - [11] [Graph: G1 microbench, L1 misses, alternate klasstable prototype](https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/6fb9d64fff930e16154c4b3fab9a24169b021de8/microbenchmark/archived/coleen-kptable-results-2024-05-21T11-14-31-CEST/g1gc-l1misses-absolute.svg)
> - [12] https://github.com/tstuefe/test-hyperaligning-lilliput/tree/master
More information about the lilliput-dev mailing list