From thomas.stuefe at gmail.com  Fri Sep 10 11:11:56 2021
From: thomas.stuefe at gmail.com (Thomas Stüfe)
Date: Fri, 10 Sep 2021 13:11:56 +0200
Subject: Reducing class pointer size useful?
Message-ID: <CAA-vtUwnMSt7K1bW1kzg6Cct-+S_9NK1Lv1K-vhb7qzbKweAmw@mail.gmail.com>

Hi,

Would it be useful for Lilliput to shrink the class pointer size below 32
bits? I did not follow the discussions closely, so I am not sure where the
current thinking is heading.

If yes, maybe we could reduce the pointer size not only by reducing the
encoding range but by using larger alignments.

We encode with add-and-shift, as we do with compressed oops. Traditionally
the shift was 3, since sizeof(void*) is the alignment requirement for
metaspace allocations. This shift was used to enlarge the coverage of class
pointer encoding from 4GB to 32GB (KlassEncodingMetaspaceMax). But we never
used this to my knowledge since we limit class space size to 3GB at most.
And nobody needs 32GB class space anyway. So there was never a reason to
cover more than (3GB + <cds size>). Unless I missed something, the shift
had been useless. In fact, we recently removed the shift if CDS is on
(JDK-8265705) to solve an unrelated aarch64 issue, and nothing bad happened.

But we could use the shift, not to enlarge the encoding range but to reduce
the class pointer size. And we could use a larger shift value. For example,
let's say we shift 8 bits. Then cut off those bits and reduce the class
pointer to 24 bits.
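
To make the arithmetic concrete, here is a minimal C++ sketch of
add-and-shift encoding with an 8-bit shift. The names (kShift,
encode_klass, decode_klass, fits_in_bits) and the base address in the note
below are hypothetical and are not HotSpot's actual code; addresses are
modelled as plain 64-bit integers.

```cpp
#include <cstdint>

// Hypothetical constants for illustration; not taken from HotSpot.
constexpr unsigned kShift = 8;                                    // proposed shift
constexpr std::uint64_t kAlignment = std::uint64_t{1} << kShift;  // 256 bytes

// Encode: subtract the encoding base, then drop the low kShift bits.
// Precondition (assumed): addr >= base and (addr - base) is 256-byte aligned.
constexpr std::uint32_t encode_klass(std::uint64_t base, std::uint64_t addr) {
  return static_cast<std::uint32_t>((addr - base) >> kShift);
}

// Decode: shift left again and add the base back.
constexpr std::uint64_t decode_klass(std::uint64_t base, std::uint32_t narrow) {
  return base + (static_cast<std::uint64_t>(narrow) << kShift);
}

// True if the encoded value fits into the given number of bits.
constexpr bool fits_in_bits(std::uint32_t narrow, unsigned bits) {
  return bits >= 32 || (narrow >> bits) == 0;
}
```

With an assumed base of 0x800000000, a Klass at base + 0x12300 encodes to
0x123, and the last 256-byte slot of a 4GB range (base + 0xFFFFFF00)
encodes to 0xFFFFFF, which still fits in 24 bits.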

The resulting alignment would be 256 bytes. Applied to all metaspace
allocations such an alignment would be prohibitively expensive, since most
allocations are very small. But if we apply this larger alignment to the
class space only, leave the rest of the metaspace alone, it is not so bad.
Before JEP 387, using different alignments would have been difficult to
implement, but metaspace coding is much more modular now, and using
different alignments for the different regions can be done.

So we apply the larger alignment only to Klass structures. Klass structures
are large, so the relative loss due to alignment would matter less. They
are variable-sized, but sizes are clustered between ~512 bytes and ~1K. They
can get much larger than that, but that is rare. Alignment loss would be
between 0 and 255 bytes, let's say 127 on average. For a typical larger app
of 10000 classes, this would waste ~1.2MB. Whether that is acceptable
depends on what positive effect the smaller compressed class pointer has on
project Lilliput.
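
The waste estimate above can be reproduced with a small C++ sketch. The
helper names are hypothetical, and the total assumes (as the text does)
that the average loss per class is about half the alignment.

```cpp
#include <cstdint>

// Round size up to the next multiple of alignment.
// Assumption: alignment is a power of two.
constexpr std::uint64_t align_up(std::uint64_t size, std::uint64_t alignment) {
  return (size + alignment - 1) & ~(alignment - 1);
}

// Bytes lost behind one Klass of the given size.
constexpr std::uint64_t padding(std::uint64_t size, std::uint64_t alignment) {
  return align_up(size, alignment) - size;
}

// Rough total, assuming an average loss of (alignment - 1) / 2 bytes per class.
constexpr std::uint64_t estimated_waste(std::uint64_t num_classes,
                                        std::uint64_t alignment) {
  return num_classes * ((alignment - 1) / 2);
}
```

For example, a 600-byte Klass is padded to 768 bytes (168 bytes lost), and
estimated_waste(10000, 256) gives 1,270,000 bytes, in line with the ~1.2MB
figure.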

---

One could argue that using an 8-bit shifted class pointer means it stops
being a pointer and becomes an index into a table of 256-byte slots,
populated with variable-sized Klass structures. With Klass sizes clustered
between 512 bytes and 1K, each Klass would occupy 2..4 slots on average. The
24-bit pointer is enough to address 16mio slots, hence on average 4..8
million Klass structures, still covering a 4G total range.

We could further slim down the class pointer if we agree on a lower maximum
number of classes. E.g. with 22 bits, we could address 4mio slots and house
about 500k...1mio classes, still allowing for a maximum encoding range of
1G.

We could play around with these variables. E.g. a larger shift of 10 bits -
1KB alignment - would mean most Klass structures occupy just one slot. We
would have to live with a somewhat higher alignment waste of 0...1023 bytes,
but could now reduce the encoded class pointer to 20 bits, still being able
to address 1mio slots, i.e. close to 1mio classes, with the total encoding
range still covering 1GB.
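
The trade-off between pointer width and shift can be explored with a short
C++ sketch. The Encoding struct and helper names are hypothetical; the
class counts are upper bounds under the simplifying assumption that every
Klass has the same size.

```cpp
#include <cstdint>

// Hypothetical parameter pair: width of the narrow class pointer and shift.
struct Encoding {
  unsigned bits;   // bits stored in the object header
  unsigned shift;  // log2 of the Klass alignment (slot size)
};

constexpr std::uint64_t slot_size(Encoding e) {
  return std::uint64_t{1} << e.shift;
}
constexpr std::uint64_t num_slots(Encoding e) {
  return std::uint64_t{1} << e.bits;
}
constexpr std::uint64_t encoding_range(Encoding e) {
  return std::uint64_t{1} << (e.bits + e.shift);
}

// Slots occupied by one Klass of the given size (rounded up).
constexpr std::uint64_t slots_per_klass(Encoding e, std::uint64_t klass_size) {
  return (klass_size + slot_size(e) - 1) >> e.shift;
}

// Upper bound on the number of classes if every Klass had this size.
constexpr std::uint64_t max_classes(Encoding e, std::uint64_t klass_size) {
  return num_slots(e) / slots_per_klass(e, klass_size);
}
```

Encoding{24, 8} yields 16,777,216 slots over 4GB and an upper bound of
4,194,304 to 8,388,608 classes for Klass sizes of 1K and 512 bytes.
Encoding{22, 8} covers 1GB with at most 1,048,576 classes of 1K each, and
Encoding{20, 10} covers 1GB with 1,048,576 one-slot classes.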

---

I think this approach is a variant of the
Klass-structures-in-a-table-and-store-the-index approach, but it allows for
those rare Klass structures to be larger than a single table slot and it
has a much larger max. cap on the number of classes than if we were just to
limit the encoding range. To me this matters somewhat because I have seen
production installations where the number of classes was in the low
100,000s. I don't think the 8192 limit cited in the Lilliput Wiki is
practical.

If I am right this approach should not require a lot of changes:
- we would need to modify metaspace to use separate alignments for the class
space
- may have to fix class pointer encoding for the various platforms if they
don't work with larger shifts out of the box, or are inefficient. E.g. on
x64, we use LEAQ to encode pointers, and LEAQ allows for a max. shift of 3,
so for shift=8 we may need to use separate add and shift.
- CDS may need some work too, since the Klass structures in the CDS region
need to be aligned to the larger alignment as well.

Hope I did not make some gross miscalculation somewhere, but that's my
idea. What do you think?

Thanks, Thomas

From john.r.rose at oracle.com  Sat Sep 11 01:30:57 2021
From: john.r.rose at oracle.com (John Rose)
Date: Sat, 11 Sep 2021 01:30:57 +0000
Subject: Reducing class pointer size useful?
In-Reply-To: <CAA-vtUwnMSt7K1bW1kzg6Cct-+S_9NK1Lv1K-vhb7qzbKweAmw@mail.gmail.com>
References: <CAA-vtUwnMSt7K1bW1kzg6Cct-+S_9NK1Lv1K-vhb7qzbKweAmw@mail.gmail.com>
Message-ID: <B6B6543D-219E-4E52-8481-A12ED374F1CE@oracle.com>

On Sep 10, 2021, at 4:11 AM, Thomas Stüfe <thomas.stuefe at gmail.com> wrote:

Would it be of use for Lilliput to shrink the class pointer size beyond 32
bit? I did not closely follow the discussions. Therefore I am not sure
where the current thinking goes.

Such things are on the table I think.

There are two parameters here:  Number of
bits |ki| in the klass-ID, and size in bytes |ks|
(usually power of 2) of a klass struct.

Both |ki| and |ks| can be freely varied, I think,
as a design and optimization parameter.

1. *Not all* klasses need to be addressed using the
klass-ID of size |ki|; put another way, the first
2^|ki| klasses are privileged to have a compact
object header representation while others may
require more bits (an extension field in the
object layout).

2. *Not all* of the bytes of a klass need to be
represented in the |ks| bytes.  You can add
a level of indirection, and it won't hurt much
as long as all the stuff the JVM needs fastest
access to is in the first |ks| bytes.

The second insight leads also to the concept
of "near klass" vs "far klass", and opens the
conversation about having *several* near klasses
for one far klass.  In some designs, that allows
you to subdivide or refine a system of classes
into a system of classes and "species", where
several species can share one class.

From rkennke at redhat.com  Mon Sep 13 14:19:06 2021
From: rkennke at redhat.com (Roman Kennke)
Date: Mon, 13 Sep 2021 16:19:06 +0200
Subject: Reducing class pointer size useful?
In-Reply-To: <CAA-vtUwnMSt7K1bW1kzg6Cct-+S_9NK1Lv1K-vhb7qzbKweAmw@mail.gmail.com>
References: <CAA-vtUwnMSt7K1bW1kzg6Cct-+S_9NK1Lv1K-vhb7qzbKweAmw@mail.gmail.com>
Message-ID: <ae84352f-c4b7-90cd-806a-43214df11554@redhat.com>

Hi Thomas,

Yes, indeed, this would be very helpful!

The current state of the prototype is that I'm putting the compressed 
Klass* in the upper 32bit of the header. (The original Klass* is still 
currently present in the 2nd word, but unused, except for verification 
purposes.) The layout is basically: 32bits for Klass*, 26bits for the 
hashcode, and 6 bits for the rest (locking and GC). Here it would be 
nice to have 32bit hashcodes instead, and 26bits for the Klass*.

I'm also working on moving the hashcode out of the header, requiring 
only 2 bits for managing the hashcode state, which makes it very 
reasonable to consider header sizes of 32bit: 24bits for the Klass* and 
8 bits for GC+locking+hashcode.
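
The 64-bit layout described above can be sketched as a few pack/unpack
helpers in C++. This is only an illustration: the compressed Klass* sits in
the upper 32 bits as stated, but the relative order of the hashcode and the
lock/GC bits in the lower half is an assumption made here, and all names
are hypothetical rather than taken from markWord.

```cpp
#include <cstdint>

// Field widths from the message; positions within the low 32 bits are assumed.
constexpr unsigned kRestBits   = 6;   // locking + GC
constexpr unsigned kHashBits   = 26;  // identity hashcode
constexpr unsigned kHashShift  = kRestBits;
constexpr unsigned kKlassShift = kRestBits + kHashBits;  // 32: Klass* in upper half

constexpr std::uint64_t kRestMask = (std::uint64_t{1} << kRestBits) - 1;
constexpr std::uint64_t kHashMask = (std::uint64_t{1} << kHashBits) - 1;

// Pack the three fields; hash and rest are truncated to their field widths.
constexpr std::uint64_t make_header(std::uint32_t narrow_klass,
                                    std::uint32_t hash,
                                    std::uint32_t rest) {
  return (static_cast<std::uint64_t>(narrow_klass) << kKlassShift) |
         ((hash & kHashMask) << kHashShift) |
         (rest & kRestMask);
}

constexpr std::uint32_t header_klass(std::uint64_t header) {
  return static_cast<std::uint32_t>(header >> kKlassShift);
}
constexpr std::uint32_t header_hash(std::uint64_t header) {
  return static_cast<std::uint32_t>((header >> kHashShift) & kHashMask);
}
constexpr std::uint32_t header_rest(std::uint64_t header) {
  return static_cast<std::uint32_t>(header & kRestMask);
}
```

Shrinking the Klass* field to 26 or 24 bits would free the corresponding
number of bits for a wider hashcode, or allow the whole header to fit in
32 bits once the hashcode moves out.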

So yes, any mechanism to reduce the Klass* to 24bits (maybe with some
flexibility in case we need more bits, e.g. for Loom or Valhalla) would
be very welcome. My thinking went in a very similar direction to what you
indicated (larger alignments for the Klass objects), and John Rose
sketched some more ideas in his reply.

Are you planning to work on this?

Thanks,
Roman


From thomas.stuefe at gmail.com  Tue Sep 14 05:31:16 2021
From: thomas.stuefe at gmail.com (Thomas Stüfe)
Date: Tue, 14 Sep 2021 07:31:16 +0200
Subject: Reducing class pointer size useful?
In-Reply-To: <B6B6543D-219E-4E52-8481-A12ED374F1CE@oracle.com>
References: <CAA-vtUwnMSt7K1bW1kzg6Cct-+S_9NK1Lv1K-vhb7qzbKweAmw@mail.gmail.com>
 <B6B6543D-219E-4E52-8481-A12ED374F1CE@oracle.com>
Message-ID: <CAA-vtUygwv00c0Xiz2VuJBvXMkjr+8YX1WXPHY3KfzZUeD1WxQ@mail.gmail.com>

Hi John,

thanks a lot for the concise recap. I think I understand now, after your
and Roman's mails, where we stand.

I hope that my proposal - if it works, and if I have not missed something -
would be a good first step toward shrinking the class pointer. Its
advantage is that it would not require deep changes. Neither splitting the
Klass nor dissecting classes into near and far ones would be required. It
may not be the best proposal memory-wise, but it would allow us to
experiment with smaller class pointers and refine the encoding process
later.

Thanks, Thomas


From thomas.stuefe at gmail.com  Tue Sep 14 05:36:03 2021
From: thomas.stuefe at gmail.com (Thomas Stüfe)
Date: Tue, 14 Sep 2021 07:36:03 +0200
Subject: Reducing class pointer size useful?
In-Reply-To: <ae84352f-c4b7-90cd-806a-43214df11554@redhat.com>
References: <CAA-vtUwnMSt7K1bW1kzg6Cct-+S_9NK1Lv1K-vhb7qzbKweAmw@mail.gmail.com>
 <ae84352f-c4b7-90cd-806a-43214df11554@redhat.com>
Message-ID: <CAA-vtUxe46JrR-Kf_vPyh8TDeiQEZSQ3pKL=7bffusRuUQWeig@mail.gmail.com>

Hi Roman,

On Mon, Sep 13, 2021 at 4:19 PM Roman Kennke <rkennke at redhat.com> wrote:

> Hi Thomas,
>
> Yes, indeed, this would be very helpful!
>
> The current state of the prototype is that I'm putting the compressed
> Klass* in the upper 32bit of the header. (The original Klass* is still
> currently present in the 2nd word, but unused, except for verification
> purposes.) The layout is basically: 32bits for Klass*, 26bits for the
> hashcode, and 6 bits for the rest (locking and GC). Here it would be
> nice to have 32bit hashcodes instead, and 26bits for the Klass*.
>
>
Understood.

> I'm also working on moving the hashcode out of the header, requiring
> only 2 bits for managing the hashcode state, which makes it very
> reasonable to consider header sizes of 32bit: 24bits for the Klass* and
> 8 bits for GC+locking+hashcode.
>
> So yes, any mechanism to reduce the Klass* to 24bits (maybe with some
> flexibility in case we need more bits, e.g. for Loom or Valhalla) would
> be very welcome. My thinking went in a very similar direction to what you
> indicated (larger alignments for the Klass objects), and John Rose
> sketched some more ideas in his reply.
>
> Are you planning to work on this?
>

Of the three steps I sketched out:
- getting the class space to use its separate alignment, and make it
tweakable
- getting the various encoding functions to work (again?) with arbitrary
shifts
- getting CDS to work with larger alignments

I'll tackle the first, then maybe the second. After that, we should
hopefully be able to work with arbitrary alignments for Klass, and it should
work at least with -Xshare:off.

(Note: metaspace-wise I think the largest alignment which would not cause
larger ripples would be 1K, since that is the smallest chunk size, but that
should be plenty)

Let's see how this goes. Maybe I'm wrong and it does not work, but then we
know.

Cheers, Thomas


From rkennke at openjdk.java.net  Thu Sep 16 10:52:50 2021
From: rkennke at openjdk.java.net (Roman Kennke)
Date: Thu, 16 Sep 2021 10:52:50 GMT
Subject: [master] RFR: Read class from object header
In-Reply-To: <865395C4-4DE2-45DF-B048-18E46A3B2472@oracle.com>
References: <FsRMxgy20qY_epkwxt-NPeM9SGjcH7WjBeLXcHun49A=.734e457c-2512-44cd-8bfa-2411f7b4b62d@github.com>
 <865395C4-4DE2-45DF-B048-18E46A3B2472@oracle.com>
Message-ID: <4wx58-Q4BEEGrvYBPgZa17yk7YQb1R5HCH1oRr5mtic=.3ec60908-dd33-497f-93fc-0b1cec386c1b@github.com>

On Thu, 5 Aug 2021 19:42:23 GMT, Dave Dice <dave.dice at oracle.com> wrote:

> The following, "Compact Java Monitors", might provide some relief: https://arxiv.org/pdf/2102.04188.pdf.
> 
> As described, it handles just the identity hashCode value, but it's trivial to extend the idea to displacing the whole lilliput header word.

Thanks, Dave! I will study it!
My current thinking wrt identityHashCode is to not store it in the header at all, but instead append it to the object on demand. That is, when identityHashCode() is called, it would get re-computed as long as the object does not move, and as soon as it moves, an extra field would get appended to the object (or rather, it often fits in the alignment gap at the end), and the hashcode is installed there. This is implemented in a prototype here: https://github.com/rkennke/lilliput/tree/compact-hashcode
But, as far as I can see, it doesn't help with the problem that concurrent GCs have with the thread-locks.

-------------

PR: https://git.openjdk.java.net/lilliput/pull/12

From rkennke at openjdk.java.net  Thu Sep 16 12:09:32 2021
From: rkennke at openjdk.java.net (Roman Kennke)
Date: Thu, 16 Sep 2021 12:09:32 GMT
Subject: [master] RFR: Read class from object header [v3]
In-Reply-To: <9IVtuzNAPCOn5oGNKk4acD8iFZvSqhjHmpzaz7GUw84=.f4406f05-c065-439d-bbc9-4c899aa5ddab@github.com>
References: <FsRMxgy20qY_epkwxt-NPeM9SGjcH7WjBeLXcHun49A=.734e457c-2512-44cd-8bfa-2411f7b4b62d@github.com>
 <_KjFB_7rJQ_bW_t30Q2eO7uvWExH3np7zqJ3QnaQ8h8=.d28df3c6-b690-4766-bc3d-8a63a67c5076@github.com>
 <9IVtuzNAPCOn5oGNKk4acD8iFZvSqhjHmpzaz7GUw84=.f4406f05-c065-439d-bbc9-4c899aa5ddab@github.com>
Message-ID: <ZOmL_2gU6QKe1hY7QAw8JhIoop8tVxNwFBiz5VIep2U=.d3d87894-3efc-4373-bc30-e580c95eec01@github.com>

On Fri, 30 Jul 2021 07:55:31 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

>> Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Add missing new markWord.inline.hpp
>
> src/hotspot/share/gc/shared/preservedMarks.inline.hpp line 58:
> 
>> 56:     header = header.displaced_mark_helper();
>> 57:   }
>> 58:   narrowKlass nklass = header.narrow_klass();
> 
> This assumes `UseCompressedClassPointers` is `true`, right? Needs to be asserted?

Yes, the whole of Lilliput assumes UseCompressedClassPointers==true. We enforce that in arguments.cpp. We might even go so far as to remove the flag and code paths altogether, but not now.

-------------

PR: https://git.openjdk.java.net/lilliput/pull/12

From rkennke at openjdk.java.net  Thu Sep 16 13:23:10 2021
From: rkennke at openjdk.java.net (Roman Kennke)
Date: Thu, 16 Sep 2021 13:23:10 GMT
Subject: [master] RFR: Read class from object header [v4]
In-Reply-To: <FsRMxgy20qY_epkwxt-NPeM9SGjcH7WjBeLXcHun49A=.734e457c-2512-44cd-8bfa-2411f7b4b62d@github.com>
References: <FsRMxgy20qY_epkwxt-NPeM9SGjcH7WjBeLXcHun49A=.734e457c-2512-44cd-8bfa-2411f7b4b62d@github.com>
Message-ID: <HSIOdt2LK44K_ViLLVsxg_mZsA8D3fBnOyIHLS3HeVY=.8b5f5500-73f8-4151-84cd-42bb948684f8@github.com>

> This changes the Hotspot runtime to load the Klass* from the header instead of the dedicated Klass* word. The dedicated word is only still used for verification and for access by generated code (the former will eventually go away, the latter will be implemented separately).
> 
> Currently, this means we need to coordinate with the ObjectSynchronizer: when encountering a header that is a stack lock or a monitor, the header is displaced. Worse, if it is a stack lock that is owned by a thread other than the calling thread, we must first inflate the lock to a full monitor. This is particularly bad for GCs. Luckily, most paths only do this at a safepoint, where we can actually safely access foreign stack locks and don't need to worry about inflation. A notable exception is concurrent marking by G1GC, which can cause inflation of locks, but it doesn't hurt very much.
> 
> It's really bad for Shenandoah and ZGC, though: when relocating objects, GC needs to know the object size of the from-space copy. However, this can cause inflation, and inflation creates a new WeakHandle in the resulting monitor, and that would be initialized with a from-space copy, which is a no-go during evacuation/relocation.
> 
> That said, I have been told that work is under way to get rid of displaced headers altogether, which would neatly solve all those problems. I have no desire to make complicated workarounds for Shenandoah GC and ZGC. I disabled both in my own builds for now, and will implement them as soon as the monitor changes arrive.
> 
> In a couple of places in GC we need to access the header carefully: when concurrently forwarding (by parallel GC threads), we need to ensure we access the Klass* from an unforwarded header, and must also ensure to avoid re-loading the Klass* once we have the good header (that is why so many asserts have been removed - they would potentially re-load the Klass* from a header that may now be forwarded).
> 
> Testing:
>  - [x] tier1
>  - [x] tier2
>  - [x] hotspot_gc
> (all without Shenandoah and ZGC, see above)

Roman Kennke has updated the pull request incrementally with four additional commits since the last revision:

 - Rename oopDesc::narrow_klass() to narrow_klass_legacy()
 - Formatting changes
 - Re-shape ObjectSynchronizer::stable_mark() and add assert
 - Assert +UseCompressedClassPointers

-------------

Changes:
  - all: https://git.openjdk.java.net/lilliput/pull/12/files
  - new: https://git.openjdk.java.net/lilliput/pull/12/files/1f9e5aef..f9c58407

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=lilliput&pr=12&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=lilliput&pr=12&range=02-03

  Stats: 28 lines in 5 files changed: 2 ins; 2 del; 24 mod
  Patch: https://git.openjdk.java.net/lilliput/pull/12.diff
  Fetch: git fetch https://git.openjdk.java.net/lilliput pull/12/head:pull/12

PR: https://git.openjdk.java.net/lilliput/pull/12

From rkennke at openjdk.java.net  Thu Sep 16 14:20:33 2021
From: rkennke at openjdk.java.net (Roman Kennke)
Date: Thu, 16 Sep 2021 14:20:33 GMT
Subject: [master] Integrated: Read class from object header
In-Reply-To: <FsRMxgy20qY_epkwxt-NPeM9SGjcH7WjBeLXcHun49A=.734e457c-2512-44cd-8bfa-2411f7b4b62d@github.com>
References: <FsRMxgy20qY_epkwxt-NPeM9SGjcH7WjBeLXcHun49A=.734e457c-2512-44cd-8bfa-2411f7b4b62d@github.com>
Message-ID: <druwviJVpPX8K-WL0OSf4KLXWDjG1NZk_Uqq5rDj6Vo=.5f0d438f-e498-44ab-b38d-4d5e5521eb24@github.com>

On Tue, 13 Jul 2021 12:20:56 GMT, Roman Kennke <rkennke at openjdk.org> wrote:

> This changes the Hotspot runtime to load the Klass* from the header instead of the dedicated Klass* word. The dedicated word is only still used for verification and for access by generated code (the former will eventually go away, the latter will be implemented separately).
> 
> Currently, this means we need to coordinate with the ObjectSynchronizer: when encountering a header that is a stack lock or a monitor, the header is displaced. Worse, if it is a stack lock that is owned by a thread other than the calling thread, we must first inflate the lock to a full monitor. This is particularly bad for GCs. Luckily, most paths only do this at a safepoint, where we can actually safely access foreign stack locks and don't need to worry about inflation. A notable exception is concurrent marking by G1GC, which can cause inflation of locks, but it doesn't hurt very much.
> 
> It's really bad for Shenandoah and ZGC, though: when relocating objects, GC needs to know the object size of the from-space copy. However, this can cause inflation, and inflation creates a new WeakHandle in the resulting monitor, and that would be initialized with a from-space copy, which is a no-go during evacuation/relocation.
> 
> That said, I have been told that work is under way to get rid of displaced headers altogether, which would neatly solve all those problems. I have no desire to make complicated workarounds for Shenandoah GC and ZGC. I disabled both in my own builds for now, and will implement them as soon as the monitor changes arrive.
> 
> In a couple of places in GC we need to access the header carefully: when concurrently forwarding (by parallel GC threads), we need to ensure we access the Klass* from an unforwarded header, and must also ensure to avoid re-loading the Klass* once we have the good header (that is why so many asserts have been removed - they would potentially re-load the Klass* from a header that may now be forwarded).
> 
> Testing:
>  - [x] tier1
>  - [x] tier2
>  - [x] hotspot_gc
> (all without Shenandoah and ZGC, see above)

This pull request has now been integrated.

Changeset: 02606f25
Author:    Roman Kennke <rkennke at openjdk.org>
URL:       https://git.openjdk.java.net/lilliput/commit/02606f251d69735a1550f06828f9b5ef3f10122d
Stats:     245 lines in 24 files changed: 180 ins; 35 del; 30 mod

Read class from object header

Reviewed-by: shade

-------------

PR: https://git.openjdk.java.net/lilliput/pull/12

From dave.dice at oracle.com  Thu Sep 16 16:41:54 2021
From: dave.dice at oracle.com (Dave Dice)
Date: Thu, 16 Sep 2021 16:41:54 +0000
Subject: [master] RFR: Read class from object header
In-Reply-To: <4wx58-Q4BEEGrvYBPgZa17yk7YQb1R5HCH1oRr5mtic=.3ec60908-dd33-497f-93fc-0b1cec386c1b@github.com>
References: <FsRMxgy20qY_epkwxt-NPeM9SGjcH7WjBeLXcHun49A=.734e457c-2512-44cd-8bfa-2411f7b4b62d@github.com>
 <865395C4-4DE2-45DF-B048-18E46A3B2472@oracle.com>
 <4wx58-Q4BEEGrvYBPgZa17yk7YQb1R5HCH1oRr5mtic=.3ec60908-dd33-497f-93fc-0b1cec386c1b@github.com>
Message-ID: <76606615-ADD2-44C2-91DB-463B8307E429@oracle.com>

On 2021-9-16, at 6:52 AM, Roman Kennke <rkennke at openjdk.java.net<mailto:rkennke at openjdk.java.net>> wrote:

On Thu, 5 Aug 2021 19:42:23 GMT, Dave Dice <dave.dice at oracle.com<mailto:dave.dice at oracle.com>> wrote:

The following -- "Compact Java Monitors" -- might provide some relief: https://arxiv.org/pdf/2102.04188.pdf.

As described, it handles just the identity hashCode value, but it's trivial to extend the idea to displacing the whole lilliput header word.

Thanks, Dave! I will study it!
My current thinking wrt identityHashcode is to not store it in the header at all, but instead append it to the object on-demand. That is, when identityHashCode() is called, it would get re-computed as long as the object does not move, and as soon as it moves, an extra field would get appended to the object (or rather, often it fits in alignment gap at the end), and the hashcode is installed there. This is implemented in prototype here: https://github.com/rkennke/lilliput/tree/compact-hashcode

Hi Roman,

I looked through compact-hashcode and, if I'm reading the definitions of "hashctrl" correctly, this appears to be the tri-state (2-bit) hashCode algorithm from Bacon et al.: https://doi.org/10.1007/3-540-47993-7_5. If that's actually the case, it'd likely be good to include a citation in the code. It was a good idea 20 years ago, and remains a good idea. The only downside I know of is that you can exhaust memory extending objects in a moving GC, but there are ways to guard against that condition.
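
As a rough illustration of the tri-state idea discussed here, the following Python sketch simulates an object whose hash is derived from its address until the object moves, at which point the hash is saved in an appended field. All names (`HashState`, `Obj`, `address_hash`, `relocate`) and the 4-byte field size are assumptions made for this sketch; it is not the code from the compact-hashcode branch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HashState(Enum):
    NOT_HASHED = 0    # identity hash was never requested
    HASHED = 1        # hash requested; still derived from the current address
    HASHED_MOVED = 2  # object moved after hashing; hash lives in an extra field


@dataclass
class Obj:
    address: int
    size: int
    state: HashState = HashState.NOT_HASHED
    stored_hash: Optional[int] = None


def address_hash(address: int) -> int:
    # Stand-in for a real address-based hash function (assumption).
    return (address >> 3) & 0x7FFFFFFF


def identity_hash(obj: Obj) -> int:
    if obj.state is HashState.HASHED_MOVED:
        return obj.stored_hash
    # Recomputed on every call for as long as the object has not moved.
    obj.state = HashState.HASHED
    return address_hash(obj.address)


def relocate(obj: Obj, new_address: int) -> None:
    if obj.state is HashState.HASHED:
        # Preserve the hash of the old address in an appended field.
        obj.stored_hash = address_hash(obj.address)
        obj.size += 4  # assumed field size; may fit into an alignment gap
        obj.state = HashState.HASHED_MOVED
    obj.address = new_address


o = Obj(address=0x1000, size=16)
before = identity_hash(o)
relocate(o, 0x2000)
after = identity_hash(o)
print(before == after, o.state.name, o.size)
```

Only objects that were hashed before a move grow; an object that was never hashed is relocated without any size change, which is why the memory-exhaustion concern is limited to copying collectors that move many hashed objects at once.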

The Compact Java Monitor approach is rather agnostic concerning the hashCode (and for that matter, anything else in the header, such as the class/klass information, and age bits). If you use the IBM 2-bit scheme, that's fine, and if you need to displace it, that's fine as well.

I hope to send out some results next week comparing the relative performance of a few potential "synchronized" implementations.

Regards
Dave

But, as far as I can see, it doesn't help with the problem that concurrent GCs have with the thread-locks.

-------------

PR: https://git.openjdk.java.net/lilliput/pull/12

From rkennke at openjdk.java.net  Thu Sep 16 17:01:57 2021
From: rkennke at openjdk.java.net (Roman Kennke)
Date: Thu, 16 Sep 2021 17:01:57 GMT
Subject: [master] RFR: Read class from object header
In-Reply-To: <76606615-ADD2-44C2-91DB-463B8307E429@oracle.com>
References: <FsRMxgy20qY_epkwxt-NPeM9SGjcH7WjBeLXcHun49A=.734e457c-2512-44cd-8bfa-2411f7b4b62d@github.com>
 <76606615-ADD2-44C2-91DB-463B8307E429@oracle.com>
Message-ID: <rfOXf1egnx-oJA21oa-wj9tDEZGDvA46RJDg-D1NbJQ=.212736ec-de0f-421d-8af5-a326b7bf09b0@github.com>

On Thu, 16 Sep 2021 16:43:30 GMT, Dave Dice <dave.dice at oracle.com> wrote:

> I looked through compact-hashcode and, if I'm reading the definitions of "hashctrl" correctly, this appears to be the tri-state (2-bit) hashCode algorithm from Bacon et al.: https://doi.org/10.1007/3-540-47993-7_5. If that's actually the case, it'd likely be good to include a citation in the code.

That is likely true. I can't access the paper, though, it asks me to pay 26€ ;-) I haven't read the paper, I've adopted the algorithm from talking to an OpenJ9 guy - OJ9 apparently uses a similar algorithm (haven't looked at their code, either).

> It was a good idea 20 years ago, and remains a good idea. The only downside I know of is that you can exhaust memory extending objects in a moving GC, but there are ways to guard against that condition.

Right. It can't happen with sliding GCs, but copying GCs could theoretically run into this problem. It seems very very unlikely, but not impossible.

> The Compact Java Monitor approach is rather agnostic concerning the hashCode (and for that matter, anything else in the header, such as the class/klass information, and age bits). If you use the IBM 2-bit scheme, that's fine, and if you need to displace it, that's fine as well.
> 
> I hope to send out some results next week comparing the relative performance of a few potential "synchronized" implementations.

I am a bit unsure about which direction to go. I also heard that there is work under way to remove JVM-side locking altogether, and use j.u.c instead, which would make the whole displaced-header problem go away. Getting rid of displaced headers would be a huge win. Otherwise we'll have to come up with a way to deal with it in concurrent GCs.

Thanks,
Roman

-------------

PR: https://git.openjdk.java.net/lilliput/pull/12

From dave.dice at oracle.com  Thu Sep 16 17:16:47 2021
From: dave.dice at oracle.com (Dave Dice)
Date: Thu, 16 Sep 2021 17:16:47 +0000
Subject: [master] RFR: Read class from object header
In-Reply-To: <rfOXf1egnx-oJA21oa-wj9tDEZGDvA46RJDg-D1NbJQ=.212736ec-de0f-421d-8af5-a326b7bf09b0@github.com>
References: <FsRMxgy20qY_epkwxt-NPeM9SGjcH7WjBeLXcHun49A=.734e457c-2512-44cd-8bfa-2411f7b4b62d@github.com>
 <76606615-ADD2-44C2-91DB-463B8307E429@oracle.com>
 <rfOXf1egnx-oJA21oa-wj9tDEZGDvA46RJDg-D1NbJQ=.212736ec-de0f-421d-8af5-a326b7bf09b0@github.com>
Message-ID: <829AF6D8-38A6-4BBD-B386-B9B81E5373AB@oracle.com>

On 2021-9-16, at 1:01 PM, Roman Kennke <rkennke at openjdk.java.net<mailto:rkennke at openjdk.java.net>> wrote:

On Thu, 16 Sep 2021 16:43:30 GMT, Dave Dice <dave.dice at oracle.com<mailto:dave.dice at oracle.com>> wrote:

I looked through compact-hashcode and, if I'm reading the definitions of "hashctrl" correctly, this appears to be the tri-state (2-bit) hashCode algorithm from Bacon et al.: https://doi.org/10.1007/3-540-47993-7_5. If that's actually the case, it'd likely be good to include a citation in the code.

That is likely true. I can't access the paper, though, it asks me to pay 26€ ;-) I haven't read the paper, I've adopted the algorithm from talking to an OpenJ9 guy - OJ9 apparently uses a similar algorithm (haven't looked at their code, either).

Here's a non-paywall version of the paper hosted by IBM:

https://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon02Space.pdf

It remains a good read.

...

It was a good idea 20 years ago, and remains a good idea. The only downside I know of is that you can exhaust memory extending objects in a moving GC, but there are ways to guard against that condition.

Right. It can't happen with sliding GCs, but copying GCs could theoretically run into this problem. It seems very very unlikely, but not impossible.

Agreed ...

The Compact Java Monitor approach is rather agnostic concerning the hashCode (and for that matter, anything else in the header, such as the class/klass information, and age bits). If you use the IBM 2-bit scheme, that's fine, and if you need to displace it, that's fine as well.

I hope to send out some results next week comparing the relative performance of a few potential "synchronized" implementations.

I am a bit unsure about which direction to go. I also heard that there is work under way to remove JVM-side locking altogether, and use j.u.c instead, which would make the whole displaced-header problem go away. Getting rid of displaced headers would be a huge win. Otherwise we'll have to come up with a way to deal with it in concurrent GCs.

I think the Loom folks are certainly interested in replacing synchronized with ReentrantLock-like constructs to avoid the current pinning of virtual threads to carrier threads.

I've experimented (mostly outside the JVM but in a fairly faithful C++ simulacrum) with a number of ideas, all the way from approaches that don't touch the header at all (aesthetically desirable, but costly) to ones that borrow just a few bits of the header, to ones that still use displacement, but make accessing the displaced value much more sane (CJM). As noted, I hope to send out a rough paper with some data next week.

Regards
Dave

Thanks,
Roman

-------------

PR: https://git.openjdk.java.net/lilliput/pull/12

From thomas.stuefe at gmail.com  Sun Sep 19 06:40:10 2021
From: thomas.stuefe at gmail.com (=?UTF-8?Q?Thomas_St=C3=BCfe?=)
Date: Sun, 19 Sep 2021 08:40:10 +0200
Subject: Reducing class pointer size useful?
In-Reply-To: <CAA-vtUwnMSt7K1bW1kzg6Cct-+S_9NK1Lv1K-vhb7qzbKweAmw@mail.gmail.com>
References: <CAA-vtUwnMSt7K1bW1kzg6Cct-+S_9NK1Lv1K-vhb7qzbKweAmw@mail.gmail.com>
Message-ID: <CAA-vtUyMacErjdSNLxS1e5Kn8d0+jAHzxq1y0JwYGr6fzkiWPA@mail.gmail.com>

Hi,

I built a prototype following my idea and tested it a bit.

For the prototype, I changed metaspace such that the class space portion
could run with a different alignment. I had to rewrite arena guard handling
(split out as a separate patch for main, [1]) and fix code generation for x64
klass pointer decoding for shift values >3. It's very likely that this has
to be done for at least aarch64 and s390 too.

I used a 10 bits shift value (1K) which seems excessive at first glance but
looks like a sweet spot (that, or 9 bits) since the vast majority of Klass
structures seem to hover around 600-700 bytes. Note that beyond the 10 bits
shift, I did not reduce the encoding range further but hard-coded it to be
4G. So, we still can cover 4G encoding range.

All this means that in my prototype compressed class pointers do not exceed
22 bits in size.
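
To make the 22-bit figure concrete, here is a minimal Python sketch of add-and-shift encoding. The function names and the sample Klass address are assumptions chosen for illustration, not HotSpot code; the base address is the one used in the tests in this mail.

```python
# Sketch of add-and-shift Klass pointer encoding (illustrative, not HotSpot code).

def narrow_klass_bits(encoding_range: int, shift: int) -> int:
    """Bits needed to encode every aligned offset inside encoding_range."""
    return (encoding_range - 1).bit_length() - shift


def encode(klass_addr: int, base: int, shift: int) -> int:
    offset = klass_addr - base
    # A Klass must start on a (1 << shift)-byte boundary relative to the base.
    assert offset >= 0 and offset % (1 << shift) == 0
    return offset >> shift


def decode(narrow: int, base: int, shift: int) -> int:
    return base + (narrow << shift)


BASE = 0x00000ABCDE000000  # class space base address used in the tests
SHIFT = 10                 # 1 KB alignment
RANGE = 4 << 30            # 4 GB encoding range

klass = BASE + 5 * 1024    # hypothetical Klass located in the sixth 1 KB slot
narrow = encode(klass, BASE, SHIFT)

print(narrow_klass_bits(RANGE, SHIFT))             # bits for 4 GB at shift 10
print(narrow, hex(decode(narrow, BASE, SHIFT)))
```

With a 4 GB range and a shift of 10 the encoded value needs 32 - 10 = 22 bits. On x64, LEA only scales by 1, 2, 4 or 8, so decoding with a shift larger than 3 needs a separate shift and add, which is the code generation change mentioned above.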

My modifications to lilliput live in a Draft PR in the lilliput repo [2],
but it is still a work in progress.

============

1) I ran a test where I loaded 40000 classes [3] (actually ~43000 including
the JDK itself).

VM args:
- `-Xshare:off` because CDS does not yet work with the modified alignment
- `-XX:+AlwaysPreTouch` to stabilize RSS somewhat
- `-Xmx512m -Xms512m` to limit heap size and reduce its effect on RSS
- `-XX:CompressedClassSpaceSize=1g` kind of pointless, it's the default
- `-XX:+UnlockDiagnosticVMOptions
-XX:CompressedClassSpaceBaseAddress=0xabcde000000` to test non-base-NULL
non-shift-0 encoding.

Program args:
`--num-generations=1 --num-loaders=1 --num-classes=40000`
Only one loader sequentially loading stuff, so minimal per-loader overhead,
which would have obfuscated the difference in memory consumption due to
Klass alignment.

Results:

3-bit alignment (base):

```
[0.058s][info][metaspace] Compressed class space mapped at:
0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824

[0.058s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
klass shift: 3, Narrow klass range: 0x40000000
```

RSS: 1,26..1,28 GB

Metaspace footprint:
  Non-class space:      400,00 MB reserved,     399,12 MB (>99%) committed,
 50 nodes.
      Class space:        1,00 GB reserved,      31,00 MB (  3%) committed,
 1 nodes.
             Both:        1,39 GB reserved,     430,12 MB ( 30%) committed.

------------

10-bit alignment:

```
[0.064s][info][metaspace] Compressed class space mapped at:
0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824
[0.064s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
klass shift: 10, Narrow klass range: 0x40000000
```

RSS: 1,24..1,27 GB

Metaspace footprint:
  Non-class space:      400,00 MB reserved,     398,44 MB (>99%) committed,
 50 nodes.
      Class space:        1,00 GB reserved,      42,38 MB (  4%) committed,
 1 nodes.
             Both:        1,39 GB reserved,     440,81 MB ( 31%) committed.

Class space increase compared with base: 11.38 MB
Average alignment loss per Klass compared with base: 277 bytes

-----------

Interpretation:

As expected, with 10 bits class space consumption went up somewhat, by
about 11 MB. Non-class space stayed stable since it was still using
standard metaspace alignment.
RSS wobbled too much for this small difference in class space size to be
noticeable above the background noise (despite +AlwaysPreTouch).

============

2) Since the first test used classes artificially generated by me and
therefore may be skewed, I also did a simple test with the Springboot
petclinic. I started the petclinic and measured after it came up. At that
point, the petclinic loaded about 15000 classes. I used the same VM
arguments as (1).

Results:

3 bit alignment (base):

[0.059s][info][metaspace] Compressed class space mapped at:
0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824

[0.059s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
klass shift: 3, Narrow klass range: 0x40000000

RSS: 841-866 MB

Metaspace footprint:

  Non-class space:       64,00 MB reserved,      62,56 MB ( 98%) committed,
 8 nodes.
      Class space:        1,00 GB reserved,       9,38 MB ( <1%) committed,
 1 nodes.
             Both:        1,06 GB reserved,      71,94 MB (  7%) committed.

------

10 bit alignment:

[0.060s][info][metaspace] Compressed class space mapped at:
0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824

[0.060s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
klass shift: 10, Narrow klass range: 0x40000000

RSS: 849-850 MB
Metaspace footprint:
  Non-class space:       64,00 MB reserved,      62,62 MB ( 98%) committed,
 8 nodes.
      Class space:        1,00 GB reserved,      15,69 MB (  2%) committed,
 1 nodes.
             Both:        1,06 GB reserved,      78,31 MB (  7%) committed.

Class space increase compared with base: 6.31 MB
Average alignment loss per Klass compared with base: 441 bytes

------

9 bit alignment (to see if avg alignment loss changes significantly with
512 bytes alignment):

[0.059s][info][metaspace] Compressed class space mapped at:
0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824

[0.059s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
klass shift: 9, Narrow klass range: 0x40000000

RSS: 836-861 MB
Metaspace footprint:
  Non-class space:       64,00 MB reserved,      62,62 MB ( 98%) committed,
 8 nodes.
      Class space:        1,00 GB reserved,      12,88 MB (  1%) committed,
 1 nodes.
             Both:        1,06 GB reserved,      75,50 MB (  7%) committed.

Class space increase compared with base: 3.5 MB
Average alignment loss per Klass compared with base: 245 bytes

-----------

Interpretation:

Mirroring the results from (1), class space consumption went up a bit, but
was not noticeable above RSS wobbling from run to run.
Reducing alignment from 10 to 9 bits reduces avg loss per Klass. So, the
jury may still be out on which alignment is more effective, but the
differences are small.
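
The per-Klass loss figures can be reproduced from the committed class space sizes reported above. This Python sketch is only a back-of-the-envelope check; the function name is made up for illustration, and since the class counts are approximate (~43000 and ~15000) the results agree with the reported numbers only to within about a byte.

```python
MB = 1024 * 1024


def avg_loss_per_klass(base_mb: float, aligned_mb: float, num_classes: int) -> float:
    """Extra committed class space per loaded class, in bytes."""
    return (aligned_mb - base_mb) * MB / num_classes


# (label, committed class space at shift 3, at larger shift, approx. class count)
runs = [
    ("synthetic test, 10-bit", 31.00, 42.38, 43000),
    ("petclinic, 10-bit", 9.38, 15.69, 15000),
    ("petclinic, 9-bit", 9.38, 12.88, 15000),
]

for label, base_mb, aligned_mb, count in runs:
    loss = avg_loss_per_klass(base_mb, aligned_mb, count)
    print(f"{label}: {aligned_mb - base_mb:.2f} MB extra, {loss:.1f} bytes per Klass")
```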

============

Conclusion:

I think this approach makes sense. The increased class space size should
easily be recoverable by reduced heap size if Lilliput is successful in
doing that.

In my opinion, this approach allows us to easily investigate other forms of
object headers. It allows for Klass structures to stay variable-sized, to
be huge if they want to be huge, we do not have to artificially limit the
number of classes, we do not have to split classes into near- and far
classes, and do not have to split Klass into several components. If we
want, we can do all that later, but for now, it could just work with the
existing hotspot and only minor modifications.

I also think that keeping Klass in metaspace has a number of advantages:
- fast, arena-style allocation
- despite being in an arena, we get free-block management
- we can continue to use the memory reclamation mechanism of metaspace on
class unloading
- we have monitoring tools in place

Basically, I feel that if we invented a different scheme to store Klass
structures, we would eventually re-invent most of that.

If you guys think this is a good avenue to investigate further, the next
steps would be:

- make Klass pointer encoding work on all 64-bit platforms with arbitrary
bases and shifts, and maybe optimize it further. For now this only works on
x64.
- Fix CDS to work with the new alignment

Thanks, Thomas

[1] https://github.com/openjdk/jdk/pull/5518
[2] https://github.com/openjdk/lilliput/pull/13
[3]
https://github.com/tstuefe/ojdk-repros/blob/master/repros8/src/main/java/de/stuefe/repros/metaspace/InterleavedLoaders.java

On Fri, Sep 10, 2021 at 1:11 PM Thomas Stüfe <thomas.stuefe at gmail.com>
wrote:

> Hi,
>
> Would it be of use for Lilliput to shrink the class pointer size beyond 32
> bit? I did not closely follow the discussions. Therefore I am not sure
> where the current thinking goes.
>
> If yes, maybe we could reduce the pointer size not only by reducing the
> encoding range but by using larger alignments.
>
> We encode with add-and-shift, as we do with compressed oops. Traditionally
> the shift was 3, since sizeof(void*) is the alignment requirement for
> metaspace allocations. This shift was used to enlarge the coverage of class
> pointer encoding from 4GB to 32GB (KlassEncodingMetaspaceMax). But we never
> used this to my knowledge since we limit class space size to 3GB at most.
> And nobody needs 32GB class space anyway. So there was never a reason to
> cover more than (3GB + <cds size>). Unless I missed something, the shift
> had been useless. In fact, we recently removed the shift if CDS is on
> (JDK-8265705) to solve an unrelated aarch64 issue, and nothing bad happened.
>
> But we could use the shift, not to enlarge the encoding range but to
> reduce the class pointer size. And we could use a larger shift value. For
> example, let's say we shift 8 bits. Then cut off those bits and reduce the
> class pointer to 24 bits.
>
> The resulting alignment would be 256 bytes. Applied to all metaspace
> allocations such an alignment would be prohibitively expensive, since most
> allocations are very small. But if we apply this larger alignment to the
> class space only, leave the rest of the metaspace alone, it is not so bad.
> Before JEP 387, using different alignments would have been difficult to
> implement, but metaspace coding is much more modular now, and using
> different alignments for the different regions can be done.
>
> So we apply the larger alignment only to Klass structures. Klass
> structures are large, and the relative loss due to alignment would matter
> less. They are variable-sized but sizes are clustered between ~512 bytes
> and ~1K. They can get much larger than that, but that is rare. Alignment
> loss would be between 0-255 bytes, let's say on average 127. For a typical
> larger app of 10000 classes, this would waste ~1.2MB. Whether that is acceptable
> depends on what positive effect the smaller compressed class pointer has on
> project Lilliput.
>
> ---
>
> One could argue that using an 8 bit shifted class pointer means it stops
> being a pointer and becomes an index into a table of 256-byte-slots,
> populated with variable-sized Klass structures. With Klass sizes clustered
> between 512 bytes..1K each Klass would populate 2..4 slots on average. The
> 24-bit pointer is enough to address 16mio slots, hence on average 4..8
> million Klass structures, still covering a 4G total range.
>
> We could further slim down the class pointer if we agree on a lower
> maximum number of classes. E.g. with 22 bits, we could address 4mio slots
> and house about 500k...1mio classes, still allowing for a maximum encoding
> range of 1G.
>
> We could play around with these variables. E.g. a larger shift of 10 bits
> - 1KB alignment - would mean most Klass structures occupy just one slot, we
> would have to live with a somewhat higher alignment waste of 0...1023, but now
> can reduce the encoded class pointer to 20 bits, still being able to
> address 1 mio slots resp. close to 1mio classes, with the total encoding
> range still covering a 1GB.
>
> ---
>
> I think this approach is a variant of the
> Klass-structures-in-a-table-and-store-the-index approach, but it allows for
> those rare Klass structures to be larger than a single table slot and it
> has a much larger max. cap on the number of classes than if we were just to
> limit the encoding range. To me this matters somewhat because I have seen
> productive installations where the number of classes was the low 100000's.
> I don't think the 8192 limit cited in the Lilliput Wiki is practical.
>
> If I am right this approach should not require a lot of changes:
> - we would need to modify metaspace to use separate alignments for the
> class space
> - may have to fix class pointer encoding for the various platforms if they
> don't work with larger shifts out of the box, or are inefficient. E.g. on
> x64, we use LEAQ to encode pointers, and LEAQ allows for a max. shift of 3,
> so for shift=8 we may need to use separate add and shift.
> - CDS may need some work too, since the Klass structures in the CDS region
> need to be aligned to the larger alignment as well.
>
> Hope I did not make some gross miscalculation somewhere, but that's my
> idea. What do you think?
>
> Thanks, Thomas
>

From rkennke at redhat.com  Mon Sep 20 14:56:29 2021
From: rkennke at redhat.com (Roman Kennke)
Date: Mon, 20 Sep 2021 16:56:29 +0200
Subject: Reducing class pointer size useful?
In-Reply-To: <CAA-vtUyMacErjdSNLxS1e5Kn8d0+jAHzxq1y0JwYGr6fzkiWPA@mail.gmail.com>
References: <CAA-vtUwnMSt7K1bW1kzg6Cct-+S_9NK1Lv1K-vhb7qzbKweAmw@mail.gmail.com>
 <CAA-vtUyMacErjdSNLxS1e5Kn8d0+jAHzxq1y0JwYGr6fzkiWPA@mail.gmail.com>
Message-ID: <be62c02a-24c3-3a45-2a5c-1a270dd450a8@redhat.com>

Hi Thomas,

This is very useful! As soon as the PR is ready, I would be happy to 
merge it to enable further experimentation. It would be good if we could 
make the number-of-bits-per-Klass* configurable, even for the header 
layout, so that we can trade some Klass* bits if we need them. Currently 
I think we should be good with 24bits.

Also, 4G encoding range seems excessive for the vast majority of 
applications, can we trade some encoding range for smaller alignment at 
the same number-of-bits too?
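
The trade-off asked about here can be sketched numerically: for a fixed number of narrow Klass bits, halving the encoding range halves the alignment needed to cover it. The helper below is a hypothetical illustration in Python, assuming power-of-two ranges; it is not taken from the prototype.

```python
GB = 1 << 30


def required_alignment(bits: int, encoding_range: int) -> int:
    """Smallest alignment in bytes that lets `bits` cover `encoding_range`."""
    slots = 1 << bits
    # Ceiling division, so a range that is not a multiple of the slot count
    # still fits; never below 1 byte.
    return max(1, -(-encoding_range // slots))


for bits in (22, 24):
    for range_gb in (4, 2, 1):
        align = required_alignment(bits, range_gb * GB)
        print(f"{bits} bits, {range_gb} GB range: {align}-byte alignment")
```

Under these assumptions, 22 bits cover 4 GB at 1 KB alignment but 1 GB at 256-byte alignment, while 24 bits cover 1 GB at 64-byte alignment.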

Thanks!
Roman

> Hi,
> 
> I built a prototype following my idea and tested it a bit.
> 
> For the prototype, I changed metaspace such that the class space portion
> could run with a different alignment. I had to rewrite arena guard handling
> (split out as a separate patch for main, [1]) and fix code generation for x64
> klass pointer decoding for shift values >3. It's very likely that this has
> to be done for at least aarch64 and s390 too.
> 
> I used a 10 bits shift value (1K) which seems excessive at first glance but
> looks like a sweet spot (that, or 9 bits) since the vast majority of Klass
> structures seem to hover around 600-700 bytes. Note that beyond the 10 bits
> shift, I did not reduce the encoding range further but hard-coded it to be
> 4G. So, we still can cover 4G encoding range.
> 
> All this means that in my prototype compressed class pointers do not exceed
> 22 bits in size.
> 
> My modifications to lilliput live in a Draft PR in the lilliput repo [2],
> but it is still a work in progress.
> 
> ============
> 
> 1) I ran a test where I loaded 40000 classes [3] (actually ~43000 including
> the JDK itself).
> 
> VM args:
> - `-Xshare:off` because CDS does not yet work with the modified alignment
> - `-XX:+AlwaysPreTouch` to stabilize RSS somewhat
> - `-Xmx512m -Xms512m` to limit heap size and reduce its effect on RSS
> - `-XX:CompressedClassSpaceSize=1g` kind of pointless, it's the default
> - `-XX:+UnlockDiagnosticVMOptions
> -XX:CompressedClassSpaceBaseAddress=0xabcde000000` to test non-base-NULL
> non-shift-0 encoding.
> 
> Program args:
> `--num-generations=1 --num-loaders=1 --num-classes=40000`
> Only one loader sequentially loading stuff, so minimal per-loader overhead,
> which would have obfuscated the difference in memory consumption due to
> Klass alignment.
> 
> Results:
> 
> 3-bit alignment (base):
> 
> ```
> [0.058s][info][metaspace] Compressed class space mapped at:
> 0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824
> 
> [0.058s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
> klass shift: 3, Narrow klass range: 0x40000000
> ```
> 
> RSS: 1,26..1,28 GB
> 
> Metaspace footprint:
>    Non-class space:      400,00 MB reserved,     399,12 MB (>99%) committed,
>   50 nodes.
>        Class space:        1,00 GB reserved,      31,00 MB (  3%) committed,
>   1 nodes.
>               Both:        1,39 GB reserved,     430,12 MB ( 30%) committed.
> 
> ------------
> 
> 10-bit alignment:
> 
> ```
> [0.064s][info][metaspace] Compressed class space mapped at:
> 0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824
> [0.064s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
> klass shift: 10, Narrow klass range: 0x40000000
> ```
> 
> RSS: 1,24..1,27 GB
> 
> Metaspace footprint:
>    Non-class space:      400,00 MB reserved,     398,44 MB (>99%) committed,
>   50 nodes.
>        Class space:        1,00 GB reserved,      42,38 MB (  4%) committed,
>   1 nodes.
>               Both:        1,39 GB reserved,     440,81 MB ( 31%) committed.
> 
> Class space increase compared with base: 11.38 MB
> Average alignment loss per Klass compared with base: 277 bytes
> 
> -----------
> 
> Interpretation:
> 
> As expected, with 10 bits class space consumption went up somewhat, by
> about 11 MB. Non-class space stayed stable since it was still using
> standard metaspace alignment.
> RSS wobbled too much for this small difference in class space size to be
> noticeable above the background noise (despite +AlwaysPreTouch).
> 
> ============
> 
> 2) Since the first test used classes artificially generated by me and
> therefore may be skewed, I also did a simple test with the Springboot
> petclinic. I started the petclinic and measured after it came up. At that
> point, the petclinic loaded about 15000 classes. I used the same VM
> arguments as (1).
> 
> Results:
> 
> 3 bit alignment (base):
> 
> [0.059s][info][metaspace] Compressed class space mapped at:
> 0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824
> 
> [0.059s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
> klass shift: 3, Narrow klass range: 0x40000000
> 
> 
> RSS: 841-866 MB
> 
> Metaspace footprint:
> 
>    Non-class space:       64,00 MB reserved,      62,56 MB ( 98%) committed,
>   8 nodes.
>        Class space:        1,00 GB reserved,       9,38 MB ( <1%) committed,
>   1 nodes.
>               Both:        1,06 GB reserved,      71,94 MB (  7%) committed.
> 
> 
> ------
> 
> 10 bit alignment:
> 
> [0.060s][info][metaspace] Compressed class space mapped at:
> 0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824
> 
> [0.060s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
> klass shift: 10, Narrow klass range: 0x40000000
> 
> RSS: 849-850 MB
> Metaspace footprint:
>    Non-class space:       64,00 MB reserved,      62,62 MB ( 98%) committed,
>   8 nodes.
>        Class space:        1,00 GB reserved,      15,69 MB (  2%) committed,
>   1 nodes.
>               Both:        1,06 GB reserved,      78,31 MB (  7%) committed.
> 
> Class space increase compared with base: 6.31 MB
> Average alignment loss per Klass compared with base: 441 bytes
> 
> ------
> 
> 9 bit alignment (to see if avg alignment loss changes significantly with
> 512 bytes alignment):
> 
> [0.059s][info][metaspace] Compressed class space mapped at:
> 0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824
> 
> [0.059s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
> klass shift: 9, Narrow klass range: 0x40000000
> 
> RSS: 836-861 MB
> Metaspace footprint:
>    Non-class space:       64,00 MB reserved,      62,62 MB ( 98%) committed,
>   8 nodes.
>        Class space:        1,00 GB reserved,      12,88 MB (  1%) committed,
>   1 nodes.
>               Both:        1,06 GB reserved,      75,50 MB (  7%) committed.
> 
> Class space increase compared with base: 3.5 MB
> Average alignment loss per Klass compared with base: 245 bytes
> 
> -----------
> 
> Interpretation:
> 
> Mirroring the results from (1), class space consumption went up a bit, but
> was not noticeable above RSS wobbling from run to run.
> Reducing alignment from 10 to 9 bits reduces avg loss per Klass. So, the
> jury may still be out on which alignment is more effective, but the
> differences are small.
> 
> ============
> 
> Conclusion:
> 
> I think this approach makes sense. The increased class space size should
> easily be recoverable by reduced heap size if Lilliput is successful in
> doing that.
> 
> In my opinion, this approach allows us to easily investigate other forms of
> object headers. It allows for Klass structures to stay variable-sized, to
> be huge if they want to be huge, we do not have to artificially limit the
> number of classes, we do not have to split classes into near- and far
> classes, and do not have to split Klass into several components. If we
> want, we can do all that later, but for now, it could just work with the
> existing hotspot and only minor modifications.
> 
> I also think that keeping Klass in metaspace has a number of advantages:
> - fast, arena-style allocation
> - despite being in an arena, we get free-block management
> - we can continue to use the memory reclamation mechanism of metaspace on
> class unloading
> - we have monitoring tools in place
> 
> Basically, I feel that if we invented a different scheme to store Klass
> structures, we would eventually re-invent most of that.
> 
> If you guys think this is a good avenue to investigate further, the next
> steps would be:
> 
> - make Klass pointer encoding work on all 64-bit platforms with arbitrary
> bases and shifts, and maybe optimize it further. For now this only works on
> x64.
> - Fix CDS to work with the new alignment
> 
> Thanks, Thomas
> 
> [1] https://github.com/openjdk/jdk/pull/5518
> [2] https://github.com/openjdk/lilliput/pull/13
> [3]
> https://github.com/tstuefe/ojdk-repros/blob/master/repros8/src/main/java/de/stuefe/repros/metaspace/InterleavedLoaders.java
> 
> 
> On Fri, Sep 10, 2021 at 1:11 PM Thomas Stüfe <thomas.stuefe at gmail.com>
> wrote:
> 
>> Hi,
>>
>> Would it be of use for Lilliput to shrink the class pointer size beyond 32
>> bit? I did not closely follow the discussions. Therefore I am not sure
>> where the current thinking goes.
>>
>> If yes, maybe we could reduce the pointer size not only by reducing the
>> encoding range but by using larger alignments.
>>
>> We encode with add-and-shift, as we do with compressed oops. Traditionally
>> the shift was 3, since sizeof(void*) is the alignment requirement for
>> metaspace allocations. This shift was used to enlarge the coverage of class
>> pointer encoding from 4GB to 32GB (KlassEncodingMetaspaceMax). But we never
>> used this to my knowledge since we limit class space size to 3GB at most.
>> And nobody needs 32GB class space anyway. So there was never a reason to
>> cover more than (3GB + <cds size>). Unless I missed something, the shift
>> had been useless. In fact, we recently removed the shift if CDS is on
>> (JDK-8265705) to solve an unrelated aarch64 issue, and nothing bad happened.
>>
>> But we could use the shift, not to enlarge the encoding range but to
>> reduce the class pointer size. And we could use a larger shift value. For
>> example, let's say we shift 8 bits. Then cut off those bits and reduce the
>> class pointer to 24 bits.
>>
>> The resulting alignment would be 256 bytes. Applied to all metaspace
>> allocations such an alignment would be prohibitively expensive, since most
>> allocations are very small. But if we apply this larger alignment to the
>> class space only, leave the rest of the metaspace alone, it is not so bad.
>> Before JEP 387, using different alignments would have been difficult to
>> implement, but metaspace coding is much more modular now, and using
>> different alignments for the different regions can be done.
>>
>> So we apply the larger alignment only to Klass structures. Klass
>> structures are large, and the relative loss due to alignment would matter
>> less. They are variable-sized but sizes are clustered between ~512 bytes
>> and ~1K. They can get much larger than that, but that is rare. Alignment
>> loss would be between 0-255 bytes, let's say on average 127. For a typical
>> larger app of 10000 classes, this would waste ~1.2MB. Whether that is acceptable
>> depends on what positive effect the smaller compressed class pointer has on
>> project Lilliput.
>>
>> ---
>>
>> One could argue that using an 8 bit shifted class pointer means it stops
>> being a pointer and becomes an index into a table of 256-byte-slots,
>> populated with variable-sized Klass structures. With Klass sizes clustered
>> between 512 bytes..1K each Klass would populate 2..4 slots on average. The
>> 24-bit pointer is enough to address 16mio slots, hence on average 4..8
>> million Klass structures, still covering a 4G total range.
>>
>> We could further slim down the class pointer if we agree on a lower
>> maximum number of classes. E.g. with 22 bits, we could address 4mio slots
>> and house about 500k...1mio classes, still allowing for a maximum encoding
>> range of 1G.
>>
>> We could play around with these variables. E.g. a larger shift of 10 bits
>> - 1KB alignment - would mean most Klass structures occupy just one slot, we
>> would have to live with a somewhat higher alignment waste of 0...1023, but now
>> can reduce the encoded class pointer to 20 bits, still being able to
>> address 1 mio slots resp. close to 1mio classes, with the total encoding
>> range still covering 1GB.
>>
>> ---
>>
>> I think this approach is a variant of the
>> Klass-structures-in-a-table-and-store-the-index approach, but it allows for
>> those rare Klass structures to be larger than a single table slot and it
>> has a much larger max. cap on the number of classes than if we were just to
>> limit the encoding range. To me this matters somewhat because I have seen
>> production installations where the number of classes was in the low 100000s.
>> I don't think the 8192 limit cited in the Lilliput Wiki is practical.
>>
>> If I am right this approach should not require a lot of changes:
>> - we would need to modify metaspace to use separate alignments for the
>> class space
>> - may have to fix class pointer encoding for the various platforms if they
>> don't work with larger shifts out of the box, or are inefficient. E.g. on
>> x64, we use LEAQ to encode pointers, and LEAQ allows for a max. shift of 3,
>> so for shift=8 we may need to use separate add and shift.
>> - CDS may need some work too, since the Klass structures in the CDS region
>> need to be aligned to the larger alignment as well.
>>
>> Hope I did not make some gross miscalculation somewhere, but that's my
>> idea. What do you think?
>>
>> Thanks, Thomas
>>
> 

From thomas.stuefe at gmail.com  Tue Sep 21 05:03:39 2021
From: thomas.stuefe at gmail.com (=?UTF-8?Q?Thomas_St=C3=BCfe?=)
Date: Tue, 21 Sep 2021 07:03:39 +0200
Subject: Reducing class pointer size useful?
In-Reply-To: <be62c02a-24c3-3a45-2a5c-1a270dd450a8@redhat.com>
References: <CAA-vtUwnMSt7K1bW1kzg6Cct-+S_9NK1Lv1K-vhb7qzbKweAmw@mail.gmail.com>
 <CAA-vtUyMacErjdSNLxS1e5Kn8d0+jAHzxq1y0JwYGr6fzkiWPA@mail.gmail.com>
 <be62c02a-24c3-3a45-2a5c-1a270dd450a8@redhat.com>
Message-ID: <CAA-vtUwrXSpJtdMqfybssvvxPe8XPZLcfGNAXGbXsq3uo4ft5A@mail.gmail.com>

Hi Roman,

On Mon, Sep 20, 2021 at 4:56 PM Roman Kennke <rkennke at redhat.com> wrote:

> Hi Thomas,
>
> This is very useful! As soon as the PR is ready, I would be happy to
> merge it to enable further experimentation. It would be good if we could
> make the number-of-bits-per-Klass* configurable, even for the header
> layout, so that we can trade some Klass* bits if we need them. Currently
> I think we should be good with 24bits.
>

It's already configurable as a compile-time constant:
https://github.com/openjdk/lilliput/blob/42fd0204a9fb427f3a93f595866640cb0481d3b5/src/hotspot/share/utilities/globalDefinitions.hpp#L537
.

>
> Also, 4G encoding range seems excessive for the vast majority of
> applications, can we trade some encoding range for smaller alignment at
> the same number-of-bits too?
>

I think yes. Easily down to 2 or 1G. Below that some fiddling is needed
since the range needs to encompass both class space and those CDS regions
which contain Klasses.
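
As a rough illustration of that trade-off (a sketch in Python, with
hypothetical helper names that are not HotSpot identifiers): with
add-and-shift encoding the narrow Klass pointer needs about
log2(range) - shift bits, so at a fixed 22 bits every halving of the
encoding range allows the alignment to be halved as well.

```python
# Illustrative sketch only; helper names are hypothetical, not HotSpot code.

def narrow_klass_bits(encoding_range: int, shift: int) -> int:
    """Bits needed to number every (1 << shift)-byte slot in the range."""
    slots = encoding_range >> shift
    return max(slots - 1, 0).bit_length()

def min_shift(encoding_range: int, bits: int) -> int:
    """Smallest shift (log2 of the Klass alignment) that fits in `bits`."""
    shift = 0
    while narrow_klass_bits(encoding_range, shift) > bits:
        shift += 1
    return shift

def encode(addr: int, base: int, shift: int) -> int:
    """Add-and-shift encoding: subtract the base, drop the alignment bits."""
    assert (addr - base) % (1 << shift) == 0, "Klass must be aligned"
    return (addr - base) >> shift

def decode(narrow: int, base: int, shift: int) -> int:
    """Inverse of encode: restore the alignment bits, add the base."""
    return base + (narrow << shift)

G = 1 << 30
for rng in (4 * G, 2 * G, 1 * G):
    s = min_shift(rng, 22)
    print(f"range {rng // G}G, 22 bits -> shift {s} ({1 << s}-byte alignment)")
```

Under these assumptions, 22 bits cover 4G at 1K alignment, 2G at 512-byte
alignment and 1G at 256-byte alignment. The sketch ignores the CDS regions
mentioned above, which the real range would also have to include.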

>
> Thanks!
> Roman
>

I'll prepare the patch.

Cheers, Thomas

>
> > Hi,
> >
> > I built a prototype following my idea and tested it a bit.
> >
> > For the prototype, I changed metaspace such that the class space portion
> > could run with a different alignment. I had to rewrite arena guard handling
> > (split out as a separate patch for mainline, [1]) and fix code generation
> > for x64 klass pointer decoding for shift values >3. It's very likely that
> > this has to be done for at least aarch64 and s390 too.
> >
> > I used a 10 bits shift value (1K) which seems excessive at first glance but
> > looks like a sweet spot (that, or 9 bits) since the vast majority of Klass
> > structures seem to hover around 600-700 bytes. Note that beyond the 10 bits
> > shift, I did not reduce the encoding range further but hard-coded it to be
> > 4G. So, we still can cover 4G encoding range.
> >
> > All this means that in my prototype compressed class pointers do not exceed
> > 22 bits in size.
> >
> > My modifications to lilliput live in a Draft PR in the lilliput repo [2],
> > but it is still a work in progress.
> >
> > ============
> >
> > 1) I ran a test where I loaded 40000 classes [3] (actually ~43000 including
> > the JDK itself).
> >
> > VM args:
> > - `-Xshare:off` because CDS does not yet work with the modified alignment
> > - `-XX:+AlwaysPreTouch` to stabilize RSS somewhat
> > - `-Xmx512m -Xms512m` to limit heap size and reduce its effect on RSS
> > - `-XX:CompressedClassSpaceSize=1g` kind of pointless, it's the default
> > - `-XX:+UnlockDiagnosticVMOptions
> > -XX:CompressedClassSpaceBaseAddress=0xabcde000000` to test non-base-NULL
> > non-shift-0 encoding.
> >
> > Program args:
> > `--num-generations=1 --num-loaders=1 --num-classes=40000`
> > Only one loader sequentially loading stuff, so minimal per-loader overhead,
> > which would have obscured the difference in memory consumption due to
> > Klass alignment.
> >
> > Results:
> >
> > 3-bit alignment (base):
> >
> > ```
> > [0.058s][info][metaspace] Compressed class space mapped at:
> > 0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824
> >
> > [0.058s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
> > klass shift: 3, Narrow klass range: 0x40000000
> > ```
> >
> > RSS: 1,26..1,28 GB
> >
> > Metaspace footprint:
> >    Non-class space:      400,00 MB reserved,     399,12 MB (>99%) committed,  50 nodes.
> >        Class space:        1,00 GB reserved,      31,00 MB (  3%) committed,  1 nodes.
> >               Both:        1,39 GB reserved,     430,12 MB ( 30%) committed.
> >
> > ------------
> >
> > 10-bit alignment:
> >
> > ```
> > [0.064s][info][metaspace] Compressed class space mapped at:
> > 0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824
> > [0.064s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
> > klass shift: 10, Narrow klass range: 0x40000000
> > ```
> >
> > RSS: 1,24..1,27 GB
> >
> > Metaspace footprint:
> >    Non-class space:      400,00 MB reserved,     398,44 MB (>99%) committed,  50 nodes.
> >        Class space:        1,00 GB reserved,      42,38 MB (  4%) committed,  1 nodes.
> >               Both:        1,39 GB reserved,     440,81 MB ( 31%) committed.
> >
> > Class space increase compared with base: 11.38 MB
> > Average alignment loss per Klass compared with base: 277 bytes
> >
> > -----------
> >
> > Interpretation:
> >
> > As expected, with 10 bits class space consumption went up somewhat, by
> > about 11 MB. Non-class space stayed stable since it was still using
> > standard metaspace alignment.
> > RSS wobbled too much for this small difference in class space size to be
> > noticeable above the background noise (despite +AlwaysPreTouch).
> >
> > ============
> >
> > 2) Since the first test used classes artificially generated by me and
> > therefore may be skewed, I also did a simple test with the Springboot
> > petclinic. I started the petclinic and measured after it came up. At that
> > point, the petclinic loaded about 15000 classes. I used the same VM
> > arguments as (1).
> >
> > Results:
> >
> > 3 bit alignment (base):
> >
> > [0.059s][info][metaspace] Compressed class space mapped at:
> > 0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824
> >
> > [0.059s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
> > klass shift: 3, Narrow klass range: 0x40000000
> >
> >
> > RSS: 841-866 MB
> >
> > Metaspace footprint:
> >
> >    Non-class space:       64,00 MB reserved,      62,56 MB ( 98%) committed,  8 nodes.
> >        Class space:        1,00 GB reserved,       9,38 MB ( <1%) committed,  1 nodes.
> >               Both:        1,06 GB reserved,      71,94 MB (  7%) committed.
> >
> >
> > ------
> >
> > 10 bit alignment:
> >
> > [0.060s][info][metaspace] Compressed class space mapped at:
> > 0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824
> >
> > [0.060s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
> > klass shift: 10, Narrow klass range: 0x40000000
> >
> > RSS: 849-850 MB
> > Metaspace footprint:
> >    Non-class space:       64,00 MB reserved,      62,62 MB ( 98%) committed,  8 nodes.
> >        Class space:        1,00 GB reserved,      15,69 MB (  2%) committed,  1 nodes.
> >               Both:        1,06 GB reserved,      78,31 MB (  7%) committed.
> >
> > Class space increase compared with base: 6.31 MB
> > Average alignment loss per Klass compared with base: 441 bytes
> >
> > ------
> >
> > 9 bit alignment (to see if avg alignment loss changes significantly with
> > 512 bytes alignment):
> >
> > [0.059s][info][metaspace] Compressed class space mapped at:
> > 0x00000abcde000000-0x00000abd1e000000, reserved size: 1073741824
> >
> > [0.059s][info][metaspace] Narrow klass base: 0x00000abcde000000, Narrow
> > klass shift: 9, Narrow klass range: 0x40000000
> >
> > RSS: 836-861 MB
> > Metaspace footprint:
> >    Non-class space:       64,00 MB reserved,      62,62 MB ( 98%) committed,  8 nodes.
> >        Class space:        1,00 GB reserved,      12,88 MB (  1%) committed,  1 nodes.
> >               Both:        1,06 GB reserved,      75,50 MB (  7%) committed.
> >
> > Class space increase compared with base: 3.5 MB
> > Average alignment loss per Klass compared with base: 245 bytes
> >
> > -----------
> >
> > Interpretation:
> >
> > Mirroring the results from (1), class space consumption went up a bit, but
> > was not noticeable above RSS wobbling from run to run.
> > Reducing alignment from 10 to 9 bits reduces avg loss per Klass. So, the
> > jury may still be out on which alignment is more effective, but the
> > differences are small.
> >
> > ============
> >
> > Conclusion:
> >
> > I think this approach makes sense. The increased class space size should
> > easily be recoverable by reduced heap size if Lilliput is successful in
> > doing that.
> >
> > In my opinion, this approach allows us to easily investigate other forms of
> > object headers. It allows for Klass structures to stay variable-sized, to
> > be huge if they want to be huge, we do not have to artificially limit the
> > number of classes, we do not have to split classes into near- and far
> > classes, and do not have to split Klass into several components. If we
> > want, we can do all that later, but for now, it could just work with the
> > existing hotspot and only minor modifications.
> >
> > I also think that keeping Klass in metaspace has a number of advantages:
> > - fast, arena-style allocation
> > - despite being in an arena, we get free-block management
> > - we can continue to use the memory reclamation mechanism of metaspace on
> > class unloading
> > - we have monitoring tools in place
> >
> > Basically, I feel that if we invented a different scheme to store Klass
> > structures, we would eventually re-invent most of that.
> >
> > If you guys think this is a good avenue to investigate further, the next
> > steps would be:
> >
> > - make Klass pointer encoding work on all 64-bit platforms with arbitrary
> > bases and shifts, and maybe optimize it further. For now this only works on
> > x64.
> > - Fix CDS to work with the new alignment
> >
> > Thanks, Thomas
> >
> > [1] https://github.com/openjdk/jdk/pull/5518
> > [2] https://github.com/openjdk/lilliput/pull/13
> > [3]
> > https://github.com/tstuefe/ojdk-repros/blob/master/repros8/src/main/java/de/stuefe/repros/metaspace/InterleavedLoaders.java
> >
> >
> > On Fri, Sep 10, 2021 at 1:11 PM Thomas Stüfe <thomas.stuefe at gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> Would it be of use for Lilliput to shrink the class pointer size beyond 32
> >> bit? I did not closely follow the discussions. Therefore I am not sure
> >> where the current thinking goes.
> >>
> >> If yes, maybe we could reduce the pointer size not only by reducing the
> >> encoding range but by using larger alignments.
> >>
> >> We encode with add-and-shift, as we do with compressed oops. Traditionally
> >> the shift was 3, since sizeof(void*) is the alignment requirement for
> >> metaspace allocations. This shift was used to enlarge the coverage of class
> >> pointer encoding from 4GB to 32GB (KlassEncodingMetaspaceMax). But we never
> >> used this to my knowledge since we limit class space size to 3GB at most.
> >> And nobody needs 32GB class space anyway. So there was never a reason to
> >> cover more than (3GB + <cds size>). Unless I missed something, the shift
> >> had been useless. In fact, we recently removed the shift if CDS is on
> >> (JDK-8265705) to solve an unrelated aarch64 issue, and nothing bad happened.
> >>
> >> But we could use the shift, not to enlarge the encoding range but to
> >> reduce the class pointer size. And we could use a larger shift value. For
> >> example, let's say we shift 8 bits. Then cut off those bits and reduce the
> >> class pointer to 24 bits.
> >>
> >> The resulting alignment would be 256 bytes. Applied to all metaspace
> >> allocations such an alignment would be prohibitively expensive, since most
> >> allocations are very small. But if we apply this larger alignment to the
> >> class space only, leave the rest of the metaspace alone, it is not so bad.
> >> Before JEP 387, using different alignments would have been difficult to
> >> implement, but metaspace coding is much more modular now, and using
> >> different alignments for the different regions can be done.
> >>
> >> So we apply the larger alignment only to Klass structures. Klass
> >> structures are large, and the relative loss due to alignment would matter
> >> less. They are variable-sized but sizes are clustered between ~512 bytes
> >> and ~1K. They can get much larger than that, but that is rare. Alignment
> >> loss would be between 0-255 bytes, let's say on average 127. For a typical
> >> larger app of 10000 classes, this would waste ~1.2MB. Whether that is
> >> acceptable depends on what positive effect the smaller compressed class
> >> pointer has on project Lilliput.
> >>
> >> ---
> >>
> >> One could argue that using an 8 bit shifted class pointer means it stops
> >> being a pointer and becomes an index into a table of 256-byte-slots,
> >> populated with variable-sized Klass structures. With Klass sizes clustered
> >> between 512 bytes..1K each Klass would populate 2..4 slots on average. The
> >> 24-bit pointer is enough to address 16mio slots, hence on average 4..8
> >> million Klass structures, still covering a 4G total range.
> >>
> >> We could further slim down the class pointer if we agree on a lower
> >> maximum number of classes. E.g. with 22 bits, we could address 4mio slots
> >> and house about 500k...1mio classes, still allowing for a maximum encoding
> >> range of 1G.
> >>
> >> We could play around with these variables. E.g. a larger shift of 10 bits
> >> - 1KB alignment - would mean most Klass structures occupy just one slot, we
> >> would have to live with a somewhat higher alignment waste of 0...1023, but
> >> now can reduce the encoded class pointer to 20 bits, still being able to
> >> address 1 mio slots resp. close to 1mio classes, with the total encoding
> >> range still covering 1GB.
> >>
> >> ---
> >>
> >> I think this approach is a variant of the
> >> Klass-structures-in-a-table-and-store-the-index approach, but it allows for
> >> those rare Klass structures to be larger than a single table slot and it
> >> has a much larger max. cap on the number of classes than if we were just to
> >> limit the encoding range. To me this matters somewhat because I have seen
> >> production installations where the number of classes was in the low 100000s.
> >> I don't think the 8192 limit cited in the Lilliput Wiki is practical.
> >>
> >> If I am right this approach should not require a lot of changes:
> >> - we would need to modify metaspace to use separate alignments for the
> >> class space
> >> - may have to fix class pointer encoding for the various platforms if they
> >> don't work with larger shifts out of the box, or are inefficient. E.g. on
> >> x64, we use LEAQ to encode pointers, and LEAQ allows for a max. shift of 3,
> >> so for shift=8 we may need to use separate add and shift.
> >> - CDS may need some work too, since the Klass structures in the CDS region
> >> need to be aligned to the larger alignment as well.
> >>
> >> Hope I did not make some gross miscalculation somewhere, but that's my
> >> idea. What do you think?
> >>
> >> Thanks, Thomas
> >>
> >
>
>

From rkennke at openjdk.java.net  Mon Sep 27 11:04:40 2021
From: rkennke at openjdk.java.net (Roman Kennke)
Date: Mon, 27 Sep 2021 11:04:40 GMT
Subject: [master] RFR: Load Klass* from header in interpreter (x86)
Message-ID: <PnZgSEcUGQNvXx-L6MSgAVGSmufJaVqTSPpFn9VS8I8=.b97312a0-84a8-4c88-b08d-bbfe6192cdad@github.com>

This implements loading the compressed Klass* from the object header, instead of the Klass* field in the x86 interpreter. It does the fast-path (unlocked object) in assembly, and calls into the runtime to deal with locked objects.

Testing:
 - [x] tier1
 - [x] tier2
 - [x] hotspot_gc

-------------

Commit messages:
 - Load Klass* from header in interpreter (x86)

Changes: https://git.openjdk.java.net/lilliput/pull/15/files
 Webrev: https://webrevs.openjdk.java.net/?repo=lilliput&pr=15&range=00
  Stats: 62 lines in 7 files changed: 50 ins; 1 del; 11 mod
  Patch: https://git.openjdk.java.net/lilliput/pull/15.diff
  Fetch: git fetch https://git.openjdk.java.net/lilliput pull/15/head:pull/15

PR: https://git.openjdk.java.net/lilliput/pull/15

From rkennke at openjdk.java.net  Mon Sep 27 12:04:22 2021
From: rkennke at openjdk.java.net (Roman Kennke)
Date: Mon, 27 Sep 2021 12:04:22 GMT
Subject: [master] RFR: Load Klass* from header in interpreter (x86) [v2]
In-Reply-To: <PnZgSEcUGQNvXx-L6MSgAVGSmufJaVqTSPpFn9VS8I8=.b97312a0-84a8-4c88-b08d-bbfe6192cdad@github.com>
References: <PnZgSEcUGQNvXx-L6MSgAVGSmufJaVqTSPpFn9VS8I8=.b97312a0-84a8-4c88-b08d-bbfe6192cdad@github.com>
Message-ID: <hX7g3XFBME57jigykSKL6PxoN4Z-u4Iyn6izVNGUjI0=.fc825e11-6419-46b7-bda1-91b08d353a0e@github.com>

> This implements loading the compressed Klass* from the object header, instead of the Klass* field in the x86 interpreter. It does the fast-path (unlocked object) in assembly, and calls into the runtime to deal with locked objects.
> 
> Testing:
>  - [x] tier1
>  - [x] tier2
>  - [x] hotspot_gc

Roman Kennke has updated the pull request incrementally with one additional commit since the last revision:

  Update comment about xorb/xorq encoding

-------------

Changes:
  - all: https://git.openjdk.java.net/lilliput/pull/15/files
  - new: https://git.openjdk.java.net/lilliput/pull/15/files/444464a1..99f9a36b

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=lilliput&pr=15&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=lilliput&pr=15&range=00-01

  Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/lilliput/pull/15.diff
  Fetch: git fetch https://git.openjdk.java.net/lilliput pull/15/head:pull/15

PR: https://git.openjdk.java.net/lilliput/pull/15