Far classes
    Thomas Stüfe 
    thomas.stuefe at gmail.com
       
    Sat Jul 13 07:57:51 UTC 2024
    
    
  
Hi John,
On Sat, Jul 13, 2024 at 1:44 AM John Rose <john.r.rose at oracle.com> wrote:
> On 12 Jul 2024, at 6:46, Thomas Stüfe wrote:
>
> > Hi John,
> >
> > Sorry for the late response, I have been busy with other Lilliput
> aspects.
> >
> > Thanks again for your great ideas!
>
> It’s a pleasure.
>
> > Let's see if I get this right. The base idea is to use joint encoding to
> > take the sting out of losing half of the whole nKlass value range for
> near
> > classes with the "is-far-class" signal bit; and the fact (obvious now but
> > it did not occur to me) that we can just insert the Klass* into *every*
> > large class, near or far, since the overhead paid depends on instance
> size.
>
> Yes, and your example is correct.  I would prefer to make one change
> in the joint encoding, because I think it would lead to faster decoding.
> See inline.
>
> > Let's say I have 16-bit nKlass - (I need a different name now for the
> datum
> > in the mark word since in case of a far class, this is not a nKlass ID.
> > klassWord?) - a 16-bit klassWord, and reserve 8 bits for the
> > offset-into-klass.
>
> The header bits are a union, a bit-encoded selection between far
> and near: if near, which near class, and if far, where the far
> pointer is.  Maybe a good name for that is klassSelector or
> klassCode or klassLocator.
>
> > In that model:
> > a) the largest offset I can represent is 0xFF; counting in words and
> adding
> > 1 since I don't need offset 0 (its the header), the farthest I can point
> at
> > is word offset 256
> > a) any (near and far) class that is larger than 256 words will have an
> > empty Klass* slot injected at word offset 256
> > b) a far class with a small instance size will have a Klass* slot
> injected
> > and populated at the end end of the object
> > c) a far class whose instance size is > 256 will populate the Klass* slot
> > prepared by (a)
> > d) all far classes will set their klassWord to: lower 8 bits zero,
> higher 8
> > bits the capped-at-256 Klass* slot offset
> > e) all near classes will set their klassWord to their nKlass ID as we do
> > today
> > f) No near class can have a nKlass ID with all eight lower bits zero -
> > aligned to 8 bit - which reduces the number of valid nKlass IDs to (64k -
> > 256)
>
> I would change d) in order to get faster decoding logic.  (You might
> have misread my suggested get_klass method?)
>
> For a far class, the UPPER 8 bits should be zero and the LOWER should
> be the word-scaled offset.
>
> Here are the two decoding methods for the two design choices:
>
> Klass* get_klass_1(uint16_t klassCode) {
>   if ((klassCode & ((1<<8)-1)) == 0) {  // LOWER bits zero
>     size_t wordOffset = klassCode >> 8;  //EXTRA SHIFT
>     return ((Klass**)this)[wordOffset];
>   } else {
>     size_t nKlass = klassCode;  //WASTED ENCODINGS
>     return NEAR_CLASSES[nKlass];
>   }
> }
>
> Klass* get_klass_2(uint16_t klassCode) {
>   if ((klassCode & (-1<<8)) == 0) {  // UPPER bits zero
>     size_t wordOffset = klassCode;
>     return ((Klass**)this)[wordOffset];
>   } else {
>     size_t nKlass = klassCode - (1<<8);
>     return NEAR_CLASSES[nKlass];
>   }
> }
>
> The second might also enable a compact flow-free decoding:
>
> Klass* get_klass_3(uint16_t klassCode) {
>   bool_t is_near = klassCode < (1<<8);  // 8 upper bits zero?
>   Klass** near_base = (Klass**) this;
>   constexpr Klass** far_base = NEAR_CLASSES[- (1<<8)];
>   Klass** base = is_near ? near_base : far_base;  //CMOV
>   return base[klassCode];
> }
>
Beautiful :)
Question though, should it not be reverse?
constexpr Klass** near_base = NEAR_CLASSES - 0x100;
Klass** far_base = this:
Just want to make sure I understand you.
> But that assumes a pointer-array NEAR_CLASSES, which you are not
> considering at present, AFAIK.
Things constantly change in my head.
The last two weeks I have been working on an idea to improve GC performance
in Lilliput. Only Roman knows this for now because I was not sure the work
would pan out.
The gist of the idea is to pre-calculate all information the GC needs to
know from Klass into a very dense, very cache-friendly array. Using this
table entry in hot GC paths to get Klass information all but eliminates the
need to dereference Klass*. It also sidesteps the Klass* hyperalignment
problem (which I still plan on solving) at least for GCs.
I am cautiously optimistic, seeing very good results already in artificial
workloads that suffered heavily from cpu cache contention. In some cases I
have close to 50% fewer L1 misses, and on average ~10% fewer L1 loads. In
these scenarios, it seems to outperform even a non-Lilliput JVM.
If this works out, this idea would combine well with Coleen's Klass pointer
table, since the cost for the load-Klass*-from-table could be mostly
avoided, at least during GCs. However, this is just the GC. E.g. I don't
understand the effect the additional indirection would have on hot non-GC
Klass* accesses, especially loads from vtable/itable.
> The “clever idea” is that the
> global pointer-array and the local “this” can both be viewed
> as having the same type, array of Klass* pointers.  The
> problem with that is probably that lookups through that
> global array would introduce delays, compared to “dead
> reckoning” in an appropriately scaled array of near-klass
> structs.  But it might be worth doing the experiment.
>
> An advantage would be, at the cost of the indexing array
> NEAR_CLASSES (one extra pointer per nKlass, to be chased
> every time) you can get (a) better density of the actual
> Klass structs, and (b) less D$ false sharing, because of
> a wider variety of nKlass base addresses.
>
Yes, the Klass table would indirectly solve the Klass* hyperalignment
problem. The way I would do it is to remove the Class space portion of
metaspace, and put Klass structures into normal Metaspace together with all
the other children. Its location would therefore be a lot more randomized
(and can even easily be randomized deliberately, since the underlying
allocators still are somewhat "rhythmic").
In general, decoupling allocation strategy from nKlass ID generation would
be nice. One example, in the context of my GC idea it would be nice to have
nKlass IDs of real classes numerically clustered for even better cache
density. So, shepherd them away from abstract/interface Klasses. Right now
I would have to convince the Metaspace arena to segregate these
allocations, which raises fragmentation overhead. With a Klass pointer
table, numerical games like this would be easy.
Some complexity would stay with us, though. E.g. All those optimizations we
currently have about how to perfectly place the class space so that its
immediate base can be materialized with very little fuss by the JIT would
still apply, but now for the NEAR_CLASSES base.
Klass pointer table slot allocation would face the same questions as
Metaspace allocations did back in the day of permgen removal. E.g. do you
have a global slot allocator that needs to be synchronized, or do you hand
out the slots in packets of N to loaders to minimize contention. But then
you face the same decisions, e.g. how large to make N. With a zillion
loaders, you don't want each loader to hog N Klass table slots. But the
boot loader would benefit from a very generous N. So, some complexity would
return in a new form.
Cheers, Thomas
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/lilliput-dev/attachments/20240713/1b2a8fa2/attachment-0001.htm>
    
    
More information about the lilliput-dev
mailing list