Solving the Klass hyperalignment problem
Thomas Stüfe
thomas.stuefe at gmail.com
Thu May 23 15:20:21 UTC 2024
Hi all,
I would like help deciding on the best mitigation strategy for Lilliput's
Klass hyperalignment problem. Since it has wide effects (e.g. a possible
removal of class space), I'd like to base the next steps on consensus.
(a more readable version of this, in markdown, is here:
https://gist.github.com/tstuefe/6d8c4a40689c34b12f79442a8469504e).
1. Background
We store class information in Klass, and resolving the Klass from an oop is a
hot path. One example is GC: during GCs, we churn through tons of objects and
frequently need at least the object size (layout helper) and the OopMap from
the Klass. Therefore, the way we resolve a nKlass matters for performance.
Today (non-Lilliput), with compressed class pointers, we go from object to
Klass this way: we pluck the nKlass from the word adjacent to the MW. We then
calculate the Klass* from the nKlass by - typically - just adding the encoding
base as an immediate. We may or may not omit that add, and we may or may not
shift the nKlass, but the most typical case (CDS enabled) is just the
addition.
Today's decoding does not need a single memory access; it can happen entirely
in registers.
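To make that concrete, here is a minimal sketch of the classic decode (not the
actual HotSpot code; `encoding_base` and `shift` are placeholder names for the
values the JVM chooses at startup):

```c++
#include <cstdint>

struct Klass; // opaque stand-in for HotSpot's Klass

// Placeholder encoding parameters; the real values are chosen at VM startup.
static const uintptr_t encoding_base = 0x0000000800000000ULL;
static const unsigned  shift         = 0; // often 0, sometimes 3

// Classic decode: optionally shift the 32-bit nKlass, then add the base.
// No memory access is needed; this happens entirely in registers.
static inline Klass* decode_klass(uint32_t nklass) {
  return reinterpret_cast<Klass*>(encoding_base + (uintptr_t(nklass) << shift));
}
```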
In Lilliput, the nKlass lives in the MW (which allows us to load nKlass
with the same load that loads the MW). Therefore, nKlass needs to shrink.
The problem with the classic 32-bit nKlass is that its value range is not
used effectively. Klass structures tend to be large, on average 500-700
bytes [1], and that means a lot of values in that 32-bit range are "blind"
- point into the middle of a class - and are hence wasted. Ideally, one
would want to use one nKlass value per class.
In Lilliput, we reduced nKlass to 22-bit. We do this by placing Klass
structures only on 1KB-aligned addresses. Therefore, the lower 10 bits are
0 and can be shifted out. 1KB was chosen as a middle-ground that allows us
to use both the nKlass value range and the Klass encoding range (4GB)
effectively.
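As a sketch of the bit budget (just the arithmetic, not JVM code): 22 value
bits plus 10 alignment bits span exactly the 4GB encoding range, which gives
one 1KB slot per addressable class:

```c++
#include <cstdint>

constexpr unsigned nklass_bits = 22; // value bits stored in the header
constexpr unsigned shift_bits  = 10; // 1KB alignment, shifted out

constexpr uint64_t addressable_classes = uint64_t(1) << nklass_bits;                // ~4.2 million
constexpr uint64_t encoding_range      = uint64_t(1) << (nklass_bits + shift_bits); // 4GB

static_assert(encoding_range == uint64_t(4) * 1024 * 1024 * 1024,
              "22 value bits + 10 alignment bits cover the full 4GB encoding range");
static_assert(encoding_range / 1024 == addressable_classes,
              "one 1KB-aligned slot per addressable nKlass value");
```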
2. The Problem
By keeping Klass structures 1KB-aligned, we march in lockstep with respect
to CPU caches. With a cache line size of 64 bytes (6 bits) and an alignment
of 1KB (10 bits), we lose 4 bits of entropy. Therefore, loads from a Klass
structure have a high chance of evicting earlier loads from other Klass
structures. Depending on saturation and the number of cache ways, we may
effectively use only a 16th of the cache.
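A back-of-the-envelope illustration (assuming a set-indexed cache where the
address bits above the six line-offset bits select the set; this is just the
arithmetic, not JVM code):

```c++
#include <cstdio>

int main() {
  const unsigned line_bits  = 6;  // 64-byte cache lines
  const unsigned align_bits = 10; // 1KB Klass alignment
  // Bits 6..9 of every Klass start address are zero, so the first cache line
  // of every Klass structure can map to only 1 in 2^(10-6) = 16 of the sets.
  printf("fraction of cache sets reachable by Klass start lines: 1/%u\n",
         1u << (align_bits - line_bits));
  return 0;
}
```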
We see this effect clearly, especially in GC pause times. The bad cache
behavior is very noticeable and needs to be solved.
3. The solutions
3.1. Short-term mitigation: Increasing the nKlass size to 26 bits
A simple short-term mitigation for Lilliput Milestone 1 is just increasing
the nKlass size. By reducing the nKlass size to 22 bits, we freed up 10
bits, but to date, we only use 6 of them. For now, we have 4 spare bits in
the header. We could increase the nKlass to 26 bits and work with a shift
of 6 bits instead of 10 bits. That would require no additional work. Klass
would be aligned to just 64 bytes, so the cache performance problems would
disappear.
Note, however, that we earmarked those 4 spare bits for Valhalla's use.
Therefore, reverting to a 26-bit nKlass can only be an intermediate step.
And, obviously, it won't do for 32-bit headers.
3.2 Use a pointer indirection table
This idea resurfaces every couple of years. The idea is to replace the
class space with an indirection pointer table. In this approach, a nKlass
would be an index into a global pointer table, and that pointer table
contains the real Klass* pointers.
The enticing part of this approach is that we could throw away the class
space and a bunch of supporting code. Klass structures could live in normal
Metaspace like all the other data. However, we would need some new code to
maintain the global Klass* table, recycle empty pointer slots after class
unloading, etc.
Decoding a nKlass would mean:
- load the nKlass from the object
- load Klass* from the indirection table at the index nKlass points to
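A minimal sketch of that two-load decode (hypothetical names; the actual
prototypes differ in the details of table management):

```c++
#include <cstdint>

struct Klass; // opaque stand-in

// Hypothetical global table with one Klass* slot per class; slots would be
// recycled after class unloading. A nKlass is simply an index into the table.
extern Klass* g_klass_table[];

// Decode now needs a memory access on the hot path: after loading the nKlass
// from the object header, we must load the Klass* from the table.
static inline Klass* decode_klass_via_table(uint32_t nklass) {
  return g_klass_table[nklass];
}
```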
The approach would solve the cache problem described above since removing
any alignment requirement from Klass structures allows us to place them
wherever we like (e.g., in standard Metaspace), and their start addresses
would not march in lockstep.
However, it introduces a new cache problem since we now have an additional
load in the hot decoding path. And the Klass pointer table itself can only be
optimized so much: only 8 uncompressed pointers fit into a cache line, so
beyond a certain number of classes, subsequent table accesses will have little
spatial locality.
3.3 Place Klass on alternating cache lines
Originally brought up by John Rose [2] when we did the first iteration of
22-bit class pointers in Lilliput. The idea is to stagger the locations of
Klass structures across cache lines. Nothing forces us to use a power-of-two
stride; we can use any odd multiple of the cache line size we like. For
example, a stride of 11 cache lines (704 bytes) means that consecutive Klass
structures start on different cache lines.
With a non-pow2 stride, decoding becomes a bit more complex. We cannot use a
shift; we need an integer multiplication:
- multiply the nKlass by 704
- add the base.
But all of this can still happen in registers only. No memory load is
needed.
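A sketch of that decode with a 704-byte stride (placeholder names again;
`encoding_base` stands for the base chosen at startup):

```c++
#include <cstdint>

struct Klass; // opaque stand-in

static const uintptr_t encoding_base = 0x0000000800000000ULL; // example base
static const uintptr_t klass_stride  = 704;                   // 11 * 64-byte cache lines

// Decode: multiply the nKlass by the stride, then add the base. This is one
// integer multiply plus one add (or whatever shift/add sequence the compiler
// prefers), still entirely in registers -- no memory load.
static inline Klass* decode_klass_staggered(uint32_t nklass) {
  return reinterpret_cast<Klass*>(encoding_base + uintptr_t(nklass) * klass_stride);
}
```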
4. The prototypes
I wanted to measure the performance impacts of all approaches. So I
compared four JVMs:
- A) (the unmitigated case) a Lilliput JVM as it is today: 22-bit nKlass,
10-bit shift
- B) a Lilliput JVM that uses a 26-bit nKlass with a 6-bit shift
- C) a Lilliput JVM that uses a 22-bit nKlass and a Klass pointer
indirection table [3]
- D) a Lilliput JVM that uses a 22-bit nKlass and a non-pow2 alignment of
Klass of 704 [4]
It turned out that Coleen also wrote a prototype with a Klass pointer
indirection table [5], but that is identical to mine (C) in all relevant
points. The only difference is that Coleen based it on the mainline JVM;
mine is based on Lilliput. But I repeated all my tests with Coleen's
prototype.
5. The Tests
I did both SpecJBB2015 and a custom-written Microbenchmark [6].
The microbenchmark was designed to stress Object-to-Klass dereferencing
during GC. It fills the heap with many objects of randomly chosen classes.
It keeps those objects alive in an array. It then executes several Full GCs
and sums up all GC pause times. Walking these objects forces the GCs to
de-reference many different nKlass values.
The microbenchmark was run on a Ryzen Zen 2 and SpecJBB on an older i7-4770;
Coleen's prototype I also tested on a Raspberry Pi 4. The tests were isolated
to 8 cores (well, apart from the Raspberry), and I tried to minimize scheduler
interference. The microbenchmark results were pretty stable, but the
SpecJBB2015 results fluctuated despite my attempts at stabilizing them.
I repeated the microbenchmark for varying class counts (512..16384) and three
different collectors (Serial, Parallel, G1).
6. The Results
The microbenchmark shows clear and stable results. SpecJBB fluctuated more.
6.1 Microbenchmark, G1GC
See graph [7].
(A) - the unmitigated 22-bit version showed overall the worst performance,
with a 41% increase over the best performer (B, the 26-bit version) at 16k
classes.
(B) - best performance
(C) - the klass pointer table seemed overall the worst of the three
mitigation prototypes. For 4k..8k classes, even worse than the unmitigated
case (A). Maxes out at +36% over the best performer (B) at 16k classes.
(D) - second best performance, for certain class ranges even best. Maxes
out at +11% at 16k classes.
I wondered why (D) could be better than (B). My assumption is that with
(D), we go out of our way to choose different cache lines. With (B), the
cache line chosen is "random" and may be subject to allocation pattern
artifacts of the underlying allocator.
6.2 Microbenchmark, ParallelGC
See graphs [8].
Differences are less pronounced, but the results are similar. (B) better
than (D) better than (C).
6.3 Microbenchmark, SerialGC
See graphs [9].
Again, the same result. Here the deltas are most pronounced; the cache
inefficiency, measured via GC pauses, is the most apparent.
6.4 Coleen's prototype
I repeated the measurements with Coleen's prototype (with G1), comparing it
against the same JVM with the klass table switched off. No surprises: the
behavior is similar to (C) vs. (B). See [10]. I also did a run with perf to
measure L1 misses, and we see up to 32% more L1 cache misses with the klass
pointer table [11].
6.5 SpecJBB2015
SpecJBB results were quite volatile despite my stabilization efforts. The
deltas in maxJOps and critJOps did not rise above random noise. The GC pause
times showed the following (percentage numbers, relative to (A) == 100%):
| JVM | Run 1  | Run 2  | Run 3  |
|-----|--------|--------|--------|
| B | 108.68% | 95.88% | 97.01% |
| C | 104.26% | 92.15% | 90.25% |
| D | 96.20% | 94.31% | 86.44% |
(all with G1GC)
Again, (D) seems to perform best. I am unsure what the problem is with (B)
here (the 26-bit class pointer approach), since it performs worst of the three
in all three runs. I will look into that.
7. Side considerations
7.1 Running out of class space/addressable classes?
This issue does not affect the number of addressable classes much. That number
is limited by the size of the nKlass; with a 22-bit nKlass, it is about 4
million. We can only go beyond that with the concept of near vs. far classes
suggested by John Rose. In any case, that is not the focus here.
The class-space-based approaches (A), (B), and (D) also have a soft limit in
that the number of Klass structures we can store is bounded by the size of the
encoding range (4GB). However, that is a rather soft limit because no hard
technical reason prevents us from having a larger encoding range. The 4GB
limitation exists only because we optimize the addition of the base immediate
by using, e.g., 16-bit moves on some platforms, so the nKlass must not extend
beyond bit 31. We may be able to do that differently. Note also that 4GB is a
lot of room for Klass data.
7.2 Reducing the number of Klass structures addressable via nKlass
Coleen had a great idea: classes that are never instantiated (e.g., Lambda
Forms) don't need a nKlass at all and therefore don't have to live within the
Klass encoding range. That would be a great improvement, since these classes
are typically generated and their number is unpredictable. By removing these
classes from the equation, the question of how many classes must be
addressable becomes a lot more relaxed.
8. Conclusions
For the moment, I prefer (D) (the uneven-cache-lines approach). It shows the
best overall performance, in parts even outperforming the 26-bit approach.
The Klass pointer table approach (C) would be nice since we could eliminate
the class space and a lot of the code that goes with it. That would reduce
complexity. But the additional load in hot decoding paths hurts. There is also
a vague fear of not being future-proof: I am apprehensive about sacrificing
Klass resolution performance, since Klass lookup seems to be something we will
always do.
That said, all input (especially from you, Coleen!) is surely welcome.
# Materials
All test results, tests, etc can be found here: [12]
Thanks, Thomas
- [1] [Allocation Histogram for Klass- and Non-Klass Allocations](
https://raw.githubusercontent.com/tstuefe/metaspace-statistics/ab3625e041d42243039f37983969ac8b770a9f4a/Histogram.svg
)
- [2] https://github.com/openjdk/lilliput/pull/13#issuecomment-988456995
- [3] [Klasstable prototype](
https://github.com/tstuefe/lilliput/tree/lilliput-with-Klass-indirection-table
)
- [4] [Uneven alignment prototype](
https://github.com/tstuefe/lilliput/tree/lilliput-with-staggered-Klass-alignment
)
- [5] [Klass table prototype, Coleen](
https://github.com/openjdk/jdk/pull/19272)
- [6] [The Microbenchmark](
https://github.com/tstuefe/test-hyperaligning-lilliput/blob/master/microbenchmark/the-test/src/main/java/de/stuefe/repros/metaspace/InterleaveKlassRefsInHeap.java
)
- [7] [Graph: G1 microbench, absolute GC pauses](
https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/e55025622ec3574b0252ff82d860e63603a9f7df/microbenchmark/archived/results-2024-05-14T15-36-27-CEST/G1GC-abs-pauses.svg
)
- [8] [Graph: ParallelGC microbench, absolute GC pauses](
https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/e55025622ec3574b0252ff82d860e63603a9f7df/microbenchmark/archived/results-2024-05-14T15-36-27-CEST/ParallelGC-abs-pauses.svg
)
- [9] [Graph: SerialGC microbench, absolute GC pauses](
https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/e55025622ec3574b0252ff82d860e63603a9f7df/microbenchmark/archived/results-2024-05-14T15-36-27-CEST/SerialGC-abs-pauses.svg
)
- [10] [Graph: G1 microbench, absolute GC pauses, alternate klasstable
prototype](
https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/6fb9d64fff930e16154c4b3fab9a24169b021de8/microbenchmark/archived/coleen-kptable-results-2024-05-21T11-14-31-CEST/g1gc-pauses-absolute.svg
)
- [11] [Graph: G1 microbench, L1 misses, alternate klasstable prototype](
https://raw.githubusercontent.com/tstuefe/test-hyperaligning-lilliput/6fb9d64fff930e16154c4b3fab9a24169b021de8/microbenchmark/archived/coleen-kptable-results-2024-05-21T11-14-31-CEST/g1gc-l1misses-absolute.svg
)
- [12] https://github.com/tstuefe/test-hyperaligning-lilliput/tree/master