<!DOCTYPE html><html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <br>

    Hi Thomas,<br>

    <br>

    I'm sad and disappointed that the indirect klass pointer approach

    performed poorly.  Maybe someday we'll have hardware where loading

    from memory is not so noticeably slower or algorithms where we don't

    load the klass in tight loops.  Until that day, your uneven cache

    line approach sounds like the most feasible approach, despite the

    quite complicated instructions to encode and decode the klass

    pointers.  Luckily we don't encode and decode klass pointers in too

    many places.<br>

    <br>

    I do have a patch for allocating klasses without instances but I

    won't get to it for a few weeks.<br>

    <br>

    Thank you for the great writeup.<br>

    <br>

    Coleen<br>

    <br>

    <div class="moz-cite-prefix">On 5/23/24 11:20 AM, Thomas Stüfe

      wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:CAA-vtUxp6=khsb+iGo55SNWbE99W1418f_SighZUJGZx6TLeCQ@mail.gmail.com">

      <div dir="ltr">

        <div>Hi all,<br>

          <br>

          I would like help deciding on the best mitigation strategy for

          Lilliput's Klass hyperalignment problem. Since it has wide

          effects (e.g. a possible removal of class space), I'd like to

          base the next steps on consensus.<br>

          <br>

          (a more readable version of this, in markdown, is here: <a href="https://urldefense.com/v3/__https://gist.github.com/tstuefe/6d8c4a40689c34b12f79442a8469504e__;!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-cuTZl9fZ$" target="_blank" moz-do-not-send="true">https://gist.github.com/tstuefe/6d8c4a40689c34b12f79442a8469504e</a>).<br>

          <br>

        </div>

        <div>1. Background<br>

          <br>

          We store class information in Klass, and resolving Klass from

          oop is a hot path. One example is GC: During GCs, we churn

          through tons of objects and need to get at least object size

          (layout helper) and Oopmap from Klass frequently. Therefore,

          the way we resolve a nKlass matters for performance.<br>

          <br>

          Today (non-Lilliput), we go from Object to Klass (with

          compressed class pointers) this way: We pluck the nKlass from

          the word adjacent to the MW. We then calculate Klass* from

          nKlass by - typically - just adding the encoding base as

          immediate. We may or may not omit that add, and we may or may

          not shift the nKlass, but the most typical case (CDS enabled)

          is just the addition. <br>

          <br>

          Today's decoding does not need a single memory access, it can

          happen in registers only.<br>

          <br>

          In Lilliput, the nKlass lives in the MW (which allows us to

          load nKlass with the same load that loads the MW). Therefore,

          nKlass needs to shrink. The problem with the classic 32-bit

          nKlass is that its value range is not used effectively. Klass

          structures tend to be large, on average 500-700 bytes [1], and

          that means a lot of values in that 32-bit range are "blind" -

          point into the middle of a class - and are hence wasted.

          Ideally, one would want to use one nKlass value per class.</div>

        <div><br>

          In Lilliput, we reduced nKlass to 22-bit. We do this by

          placing Klass structures only on 1KB-aligned addresses.

          Therefore, the lower 10 bits are 0 and can be shifted out. 1KB

          was chosen as a middle-ground that allows us to use both the

          nKlass value range and the Klass encoding range (4GB)

          effectively.<br>

          <br>

          <br>

          2.<span> </span>The Problem<br>

          <br>

          By keeping Klass structures 1KB-aligned, we march in lockstep

          with respect to CPU caches. With a cache line size of 64 bytes

          (6 bits) and an alignment of 1KB (10 bits), we lose 4 bits of

          entropy. Therefore, loads from a Klass structure have a high

          chance of evicting earlier loads from other Klass structures.

          Depending on saturation and number of cache ways, we may only

          use a 16th of the caches.<br>

          <br>

          We see this effect clearly, especially in GC pause times. The

          bad cache behavior is clearly noticeable and needs to be

          solved.<br>

          <br>

          <br>

          3. The solutions<br>

          <br>

          3.1. Short-term mitigation: Increasing the nKlass size to 26

          bits<br>

          <br>

          A simple short-term mitigation for Lilliput Milestone 1 is

          just increasing the nKlass size. By reducing the nKlass size

          to 22 bits, we freed up 10 bits, but to date, we only use 6 of

          them. For now, we have 4 spare bits in the header. We could

          increase the nKlass to 26 bits and work with a shift of 6 bits

          instead of 10 bits. That would require no additional work.

          Klass would be aligned to just 64 bytes, so the cache

          performance problems would disappear.<br>

          <br>

          Note, however, that we earmarked those 4 spare bits for

          Valhalla's use. Therefore, reverting to a 26-bit nKlass can

          only be an intermediate step. And, obviously, it won't do for

          32-bit headers.<br>

          <br>

        </div>

        <div><br>

        </div>

        <div>3.2 Use a pointer indirection table<br>

        </div>

        <div><br>

          This idea resurfaces every couple of years. The idea is to

          replace the class space with an indirection pointer table. In

          this approach, a nKlass would be an index into a global

          pointer table, and that pointer table contains the real Klass*

          pointers.<br>

          <br>

          The enticing part of this approach is that we could throw away

          the class space and a bunch of supporting code. Klass

          structures could live in normal Metaspace like all the other

          data. However, we would need some new code to maintain the

          global Klass* table, recycle empty pointer slots after class

          unloading, etc.<br>

          <br>

          Decoding a nKlass would mean: <br>

          - load the nKlass from the object<br>

          - load Klass* from the indirection table at the index nKlass

          points to<br>

          <br>

          The approach would solve the cache problem described above

          since removing any alignment requirement from Klass structures

          allows us to place them wherever we like (e.g., in standard

          Metaspace), and their start addresses would not march in

          lockstep.<br>

          <br>

          However, it introduces a new cache problem since we now have a

          new load in the hot decoding path. And the Klasspointer table

          can only be improved so much: only 8 uncompressed pointers fit

          into a cache line. From a certain number of classes,

          subsequent table accesses will have little spatial locality.<br>

          <br>

          3.3 Place Klass on alternating cache lines<br>

          <br>

          Originally brought up by John Rose [2] when we did the first

          iteration of 22-bit class pointers in Lilliput. The idea is to

          alter the locations of Klass structures by cache line size.

          There is nothing that forces us to use a power-of-two stride.

          We can use any uneven multiple of cache lines that we like.

          For example, 11 cache lines (704 bytes) would mean that Klass

          structures would come to be located on different cache lines.<br>

          <br>

          With a non-pow2 stride, decoding becomes a bit more complex.

          We cannot use shift, we need to do integer multiplication:<br>

          - multiply nKlass with 704<br>

          - add base.<br>

          <br>

          But all of this can still happen in registers only. No memory

          load is needed.<br>

          <br>

          4. The prototypes<br>

          <br>

          I wanted to measure the performance impacts of all approaches.

          So I compared four JVMs:<br>

          <br>

          - A) (the unmitigated case) a Lilliput JVM as it is today:

          22-bit nKlass, 10-bit shift<br>

          - B) a Lilliput JVM that uses a 26-bit nKlass with a 6-bit

          shift<br>

          - C) a Lilliput JVM that uses a 22-bit nKlass and a Klass

          pointer indirection table [3]<br>

          - D) a Lilliput JVM that uses a 22-bit nKlass and a non-pow2

          alignment of Klass of 704 [4]<br>

          <br>

          It turned out that Coleen also wrote a prototype with a Klass

          pointer indirection table [5], but that is identical to mine

          (C) in all relevant points. The only difference is that Coleen

          based it on the mainline JVM; mine is based on Lilliput. But I

          repeated all my tests with Coleen's prototype.<br>

          <br>

          5. The Tests<br>

          <br>

          I did both SpecJBB2015 and a custom-written Microbenchmark

          [6]. <br>

          <br>

          The microbenchmark was designed to stress Object-to-Klass

          dereferencing during GC. It fills the heap with many objects

          of randomly chosen classes. It keeps those objects alive in an

          array. It then executes several Full GCs and sums up all GC

          pause times. Walking these objects forces the GCs to

          de-reference many different nKlass values.<br>

          <br>

          The Microbenchmark was run on a Ryzen Zen 2, the SpecJBB on an

          older i7-4770, and Coleen's prototype I also tested on a

          Raspberry 4. The tests were isolated to 8 cores (well, apart

          from the Raspberry), and I tried to minimize scheduler

          interference. The microbenchmark results were pretty stable,

          but the SpecJBB2015 results fluctuated - despite my attempts

          at stabilizing them.<br>

          <br>

          I repeated the Microbenchmark for a number of classes

          (512..16384) and three different collectors (Serial, Parallel,

          G1).<br>

          <br>

          6. The Results<br>

          <br>

          <div>The microbenchmark shows clear and stable results.

            SpecJBB fluctuated more.</div>

          <br>

          6.1 Microbenchmark, G1GC<br>

          <br>

        </div>

        <div>See graph [7].<br>

          <br>

          (A) - the unmitigated 22-bit version showed overall the worst

          performance, with a 41% increase over the best performer (B,

          the 26-bit version) at 16k classes.<br>

          (B) - best performance<br>

          (C) - the klass pointer table seemed overall the worst of the

          three mitigation prototypes. For 4k..8k classes, even worse

          than the unmitigated case (A). Maxes out at +36% over the best

          performer (B) at 16k classes.<br>

          (D) - second best performance, for certain class ranges even

          best. Maxes out at +11% at 16k classes.</div>

        <div><br>

        </div>

        <div>I wondered why (D) could be better than (B). My assumption

          is that with (D), we go out of our way to choose different

          cache lines. With (B), the cache line chosen is "random" and

          may be subject to allocation pattern artifacts of the

          underlying allocator.<br>

          <br>

        </div>

        <div>6.2 Microbenchmark, ParallelGC<br>

          <br>

          See graphs [8]. <br>

          <br>

          Differences are less pronounced, but the results are similar.

          (B) better than (D) better than (C).<br>

          <br>

          6.3 Microbenchmark, SerialGC<br>

          <br>

          See graphs [9].<br>

          <br>

          Again, the same result. Here are the deltas most pronounced,

          cache inefficiency measured via GC pauses is the most

          apparent.<br>

          <br>

          6.4 Coleen's prototype<br>

          <br>

          I repeated the measurements with Coleens prototype (with G1),

          comparing it against the same JVM with klass table switched

          off. No surprises, similar behavior to (C) vs (B). See<br>

          [10]. I also did a run with perf to measure L1 misses, and we

          see up to 32% more L1 cache misses with the klass pointer

          table [11].<br>

          <br>

          6.5 SPecJBB2015<br>

          <br>

          SpecJBB results were quite volatile despite my stabilization

          efforts. The deltas between maxJOps and critJOps did not rise

          above random noise. The GC pause times showed (Percentage

          numbers, compared with (A)==100%):<br>

          <br>

          | Run | 1       | 2       | 3      |<br>

          |-----|---------|---------|--------|<br>

          | B   | 108.68% | 95.88%  | 97.01% |<br>

          | C   | 104.26% | 92.15%  | 90.25% |<br>

          | D   | 96.20%  | 94.31%  | 86.44% |<br>

          <br>

          (all with G1GC)<br>

          <br>

          Again, (D) seems to perform best. I am unsure what the problem

          is with (B) here (the 26-bit class pointer approach) since it

          seems to perform worst in all cases. I will look into that.<br>

          <br>

          7. Side considerations<br>

          <br>

          7.1 Running out of class space/addressable classes?<br>

          <br>

          This issue does not affect the number of addressable classes

          much. That one is limited by the size of nKlass. With a 22-bit

          nKlass that is 4 mio. We can only work beyond that with the

          concept of near- vs far classes suggested by John Rose. In any

          case, that is not the focus here.<br>

          <br>

          The class-space-based approaches (A), (B), and (D) also have a

          soft limit in that the number of Klass we can store is limited

          by the size of the encoding range (4GB). However, that is a

          rather soft limit because no hard technical reason prevents us

          from having a larger encoding range. The 4G limitation exists

          only because we optimize the addition of the base immediate by

          using e.g. 16-bit moves on some platforms, so the nKlass must

          not extend beyond bit 31. We may be able to do that

          differently. Note that 4GB is also really large for Klass

          data.<br>

          <br>

          7.2 Reducing the number of Klass structures addressable via

          nKlass<br>

          <br>

          Coleen had a great idea that never instantiated classes (e.g.,

          Lambda Forms) don't need a nKlass at all. They, therefore,

          don't have to live within the Klass encoding range. That would

          be a great improvement since these classes are typically

          generated, and their number is unpredictable. By removing this

          kind of classes from the equation, the question of

          number-of-addressable classes becomes a lot more relaxed.<br>

          <br>

          8. Conclusions<br>

          <br>

          For the moment, I prefer (D) (the

          uneven-cache-lines-approach). It shows the overall best

          performance, in parts even outperforming the 26-bit approach.<br>

          <br>

          The Klasspointer-Table approach (C) would be nice since we

          could eliminate class space and a lot of coding that goes with

          it. That would reduce complexity. But the additional load in

          hot decoding paths hurts. There is also the vague fear of not

          being future-proof. I am apprehensive about sacrificing Klass

          resolving performance since Klass lookup seems to be something

          we will always do.</div>

        <div><br>

          That said, all input (especially from you, Coleen!) is surely

          welcome.<br>

          <br>

          # Materials<br>

          <br>

          All test results, tests, etc can be found here: [12] <br>

          <br>

          <br>

          Thanks, Thomas<br>

          <br>

          <br>

          - [1] [Allocation Histogram for Klass- and Non-Klass

          Allocations](<a href="https://urldefense.com/v3/__https://raw.githubusercontent.com/tstuefe/metaspace-statistics/ab3625e041d42243039f37983969ac8b770a9f4a/Histogram.svg__;!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-cngQfoPq$" target="_blank" moz-do-not-send="true">https://raw.githubusercontent.com/tstuefe/metaspace-statistics/ab3625e041d42243039f37983969ac8b770a9f4a/Histogram.svg</a>)<br>

          - [2] <a href="https://urldefense.com/v3/__https://github.com/openjdk/lilliput/pull/13*issuecomment-988456995__;Iw!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-cr9oA1s2$" target="_blank" moz-do-not-send="true">https://github.com/openjdk/lilliput/pull/13#issuecomment-988456995</a><br>

          - [3] [Klasstable prototype](<a href="https://urldefense.com/v3/__https://github.com/tstuefe/lilliput/tree/lilliput-with-Klass-indirection-table__;!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-cieRCPym$" target="_blank" moz-do-not-send="true">https://github.com/tstuefe/lilliput/tree/lilliput-with-Klass-indirection-table</a>)<br>

          - [4] [Uneven alignment prototype](<a href="https://urldefense.com/v3/__https://github.com/tstuefe/lilliput/tree/lilliput-with-staggered-Klass-alignment__;!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-cunpQuE_$" target="_blank" moz-do-not-send="true">https://github.com/tstuefe/lilliput/tree/lilliput-with-staggered-Klass-alignment</a>)<br>

          - [5] [Klass table prototype, Coleen](<a href="https://urldefense.com/v3/__https://github.com/openjdk/jdk/pull/19272__;!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-cvpuKMG0$" target="_blank" moz-do-not-send="true">https://github.com/openjdk/jdk/pull/19272</a>)<br>

          - [6] [The Microbenchmark](<a href="https://urldefense.com/v3/__https://github.com/tstuefe/test-hyperaligning-lilliput/blob/master/microbenchmark/the-test/src/main/java/de/stuefe/repros/metaspace/InterleaveKlassRefsInHeap.java__;!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-cqIbufDr$" target="_blank" moz-do-not-send="true">https://github.com/tstuefe/test-hyperaligning-lilliput/blob/master/microbenchmark/the-test/src/main/java/de/stuefe/repros/metaspace/InterleaveKlassRefsInHeap.java</a>)<br>

          - [7] [Graph: G1 microbench, absolute GC pauses](<a href="https://urldefense.com/v3/__https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/e55025622ec3574b0252ff82d860e63603a9f7df/microbenchmark/archived/results-2024-05-14T15-36-27-CEST/G1GC-abs-pauses.svg__;!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-csFNBmYb$" target="_blank" moz-do-not-send="true">https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/e55025622ec3574b0252ff82d860e63603a9f7df/microbenchmark/archived/results-2024-05-14T15-36-27-CEST/G1GC-abs-pauses.svg</a>)<br>

          - [8] [Graph: ParallelGC microbench, absolute GC pauses](<a href="https://urldefense.com/v3/__https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/e55025622ec3574b0252ff82d860e63603a9f7df/microbenchmark/archived/results-2024-05-14T15-36-27-CEST/ParallelGC-abs-pauses.svg__;!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-cvjyYiUL$" target="_blank" moz-do-not-send="true">https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/e55025622ec3574b0252ff82d860e63603a9f7df/microbenchmark/archived/results-2024-05-14T15-36-27-CEST/ParallelGC-abs-pauses.svg</a>)<br>

          - [9] [Graph: SerialGC microbench, absolute GC pauses](<a href="https://urldefense.com/v3/__https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/e55025622ec3574b0252ff82d860e63603a9f7df/microbenchmark/archived/results-2024-05-14T15-36-27-CEST/SerialGC-abs-pauses.svg__;!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-ct-eEsRy$" target="_blank" moz-do-not-send="true">https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/e55025622ec3574b0252ff82d860e63603a9f7df/microbenchmark/archived/results-2024-05-14T15-36-27-CEST/SerialGC-abs-pauses.svg</a>)<br>

          - [10] [Graph: G1 microbench, absolute GC pauses, alternate

          klasstable prototype](<a href="https://urldefense.com/v3/__https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/6fb9d64fff930e16154c4b3fab9a24169b021de8/microbenchmark/archived/coleen-kptable-results-2024-05-21T11-14-31-CEST/g1gc-pauses-absolute.svg__;!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-cgSsWHuA$" target="_blank" moz-do-not-send="true">https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/6fb9d64fff930e16154c4b3fab9a24169b021de8/microbenchmark/archived/coleen-kptable-results-2024-05-21T11-14-31-CEST/g1gc-pauses-absolute.svg</a>)<br>

          - [11] [Graph: G1 microbench, L1 misses, alternate klasstable

          prototype](<a href="https://urldefense.com/v3/__https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/6fb9d64fff930e16154c4b3fab9a24169b021de8/microbenchmark/archived/coleen-kptable-results-2024-05-21T11-14-31-CEST/g1gc-l1misses-absolute.svg__;!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-co1avvvg$" target="_blank" moz-do-not-send="true">https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/6fb9d64fff930e16154c4b3fab9a24169b021de8/microbenchmark/archived/coleen-kptable-results-2024-05-21T11-14-31-CEST/g1gc-l1misses-absolute.svg</a>)<br>

          - [12] <a href="https://urldefense.com/v3/__https://github.com/tstuefe/test-hyperaligning-lilliput/tree/master__;!!ACWV5N9M2RV99hQ!NYEJtMO0b0uXdTO64TV9yF49aPkmRbUTFD-z4C4cWxbub2Ba3KtGhCLLdGi2x29_KMBOo0IpYJwt-lgsXWc-ctCiBYUF$" target="_blank" moz-do-not-send="true">https://github.com/tstuefe/test-hyperaligning-lilliput/tree/master</a><br>

        </div>

      </div>

    </blockquote>

    <br>

  </body>

</html>