<div dir="ltr"><div dir="ltr">Hi John,</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jul 13, 2024 at 1:44 AM John Rose <<a href="mailto:john.r.rose@oracle.com">john.r.rose@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">On 12 Jul 2024, at 6:46, Thomas Stüfe wrote:<br>

<br>

> Hi John,<br>

><br>

> Sorry for the late response, I have been busy with other Lilliput aspects.<br>

><br>

> Thanks again for your great ideas!<br>

<br>

It’s a pleasure.<br>

<br>

> Let's see if I get this right. The base idea is to use joint encoding to<br>

> take the sting out of losing half of the whole nKlass value range for near<br>

> classes with the "is-far-class" signal bit; and the fact (obvious now but<br>

> it did not occur to me) that we can just insert the Klass* into *every*<br>

> large class, near or far, since the overhead paid depends on instance size.<br>

<br>

Yes, and your example is correct.  I would prefer to make one change<br>

in the joint encoding, because I think it would lead to faster decoding.<br>

See inline.<br>

<br>

> Let's say I have 16-bit nKlass - (I need a different name now for the datum<br>

> in the mark word since in case of a far class, this is not a nKlass ID.<br>

> klassWord?) - a 16-bit klassWord, and reserve 8 bits for the<br>

> offset-into-klass.<br>

<br>

The header bits are a union, a bit-encoded selection between far<br>

and near: if near, which near class, and if far, where the far<br>

pointer is.  Maybe a good name for that is klassSelector or<br>

klassCode or klassLocator.<br>

<br>

> In that model:<br>

> a) the largest offset I can represent is 0xFF; counting in words and adding<br>

> 1 since I don't need offset 0 (its the header), the farthest I can point at<br>

> is word offset 256<br>

> a) any (near and far) class that is larger than 256 words will have an<br>

> empty Klass* slot injected at word offset 256<br>

> b) a far class with a small instance size will have a Klass* slot injected<br>

> and populated at the end end of the object<br>

> c) a far class whose instance size is > 256 will populate the Klass* slot<br>

> prepared by (a)<br>

> d) all far classes will set their klassWord to: lower 8 bits zero, higher 8<br>

> bits the capped-at-256 Klass* slot offset<br>

> e) all near classes will set their klassWord to their nKlass ID as we do<br>

> today<br>

> f) No near class can have a nKlass ID with all eight lower bits zero -<br>

> aligned to 8 bit - which reduces the number of valid nKlass IDs to (64k -<br>

> 256)<br>

<br>

I would change d) in order to get faster decoding logic.  (You might<br>

have misread my suggested get_klass method?)<br>

<br>

For a far class, the UPPER 8 bits should be zero and the LOWER should<br>

be the word-scaled offset.<br>

<br>

Here are the two decoding methods for the two design choices:<br>

<br>

Klass* get_klass_1(uint16_t klassCode) {<br>

  if ((klassCode & ((1<<8)-1)) == 0) {  // LOWER bits zero<br>

    size_t wordOffset = klassCode >> 8;  //EXTRA SHIFT<br>

    return ((Klass**)this)[wordOffset];<br>

  } else {<br>

    size_t nKlass = klassCode;  //WASTED ENCODINGS<br>

    return NEAR_CLASSES[nKlass];<br>

  }<br>

}<br>

<br>

Klass* get_klass_2(uint16_t klassCode) {<br>

  if ((klassCode & (-1<<8)) == 0) {  // UPPER bits zero<br>

    size_t wordOffset = klassCode;<br>

    return ((Klass**)this)[wordOffset];<br>

  } else {<br>

    size_t nKlass = klassCode - (1<<8);<br>

    return NEAR_CLASSES[nKlass];<br>

  }<br>

}<br>

<br>

The second might also enable a compact flow-free decoding:<br>

<br>

Klass* get_klass_3(uint16_t klassCode) {<br>

  bool_t is_near = klassCode < (1<<8);  // 8 upper bits zero?<br>

  Klass** near_base = (Klass**) this;<br>

  constexpr Klass** far_base = NEAR_CLASSES[- (1<<8)];<br>

  Klass** base = is_near ? near_base : far_base;  //CMOV<br>

  return base[klassCode];<br>

}<br></blockquote><div><br></div><div>Beautiful :)</div><div><br></div><div>Question though, should it not be reverse? </div><div><br></div><div>constexpr Klass** near_base = NEAR_CLASSES - 0x100;</div><div>Klass** far_base = this:</div><div><br></div><div>Just want to make sure I understand you.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

<br>

But that assumes a pointer-array NEAR_CLASSES, which you are not<br>

considering at present, AFAIK.  </blockquote><div><br></div><div>Things constantly change in my head.</div><div><br></div><div>The last two weeks I have been working on an idea to improve GC performance in Lilliput. Only Roman knows this for now because I was not sure the work would pan out.</div><div><br></div><div>The gist of the idea is to pre-calculate all information the GC needs to know from Klass into a very dense, very cache-friendly array. Using this table entry in hot GC paths to get Klass information all but eliminates the need to dereference Klass*. It also sidesteps the Klass* hyperalignment problem (which I still plan on solving) at least for GCs.</div><div><br></div><div>I am cautiously optimistic, seeing very good results already in artificial workloads that suffered heavily from cpu cache contention. In some cases I have close to 50% fewer L1 misses, and on average ~10% fewer L1 loads. In these scenarios, it seems to outperform even a non-Lilliput JVM.</div><div><br></div><div>If this works out, this idea would combine well with Coleen's Klass pointer table, since the cost for the load-Klass*-from-table could be mostly avoided, at least during GCs. However, this is just the GC. E.g. I don't understand the effect the additional indirection would have on hot non-GC Klass* accesses, especially loads from vtable/itable.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">The “clever idea” is that the<br>

global pointer-array and the local “this” can both be viewed<br>

as having the same type, array of Klass* pointers.  The<br>

problem with that is probably that lookups through that<br>

global array would introduce delays, compared to “dead<br>

reckoning” in an appropriately scaled array of near-klass<br>

structs.  But it might be worth doing the experiment.<br>

<br>

An advantage would be, at the cost of the indexing array<br>

NEAR_CLASSES (one extra pointer per nKlass, to be chased<br>

every time) you can get (a) better density of the actual<br>

Klass structs, and (b) less D$ false sharing, because of<br>

a wider variety of nKlass base addresses.<br></blockquote><div><br></div><div>Yes, the Klass table would indirectly solve the Klass* hyperalignment problem. The way I would do it is to remove the Class space portion of metaspace, and put Klass structures into normal Metaspace together with all the other children. Its location would therefore be a lot more randomized (and can even easily be randomized deliberately, since the underlying allocators still are somewhat "rhythmic").</div><div><br></div><div>In general, decoupling allocation strategy from nKlass ID generation would be nice. One example, in the context of my GC idea it would be nice to have nKlass IDs of real classes numerically clustered for even better cache density. So, shepherd them away from abstract/interface Klasses. Right now I would have to convince the Metaspace arena to segregate these allocations, which raises fragmentation overhead. With a Klass pointer table, numerical games like this would be easy.</div><div><br></div><div>Some complexity would stay with us, though. E.g. All those optimizations we currently have about how to perfectly place the class space so that its immediate base can be materialized with very little fuss by the JIT would still apply, but now for the NEAR_CLASSES base.</div><div><br></div><div>Klass pointer table slot allocation would face the same questions as Metaspace allocations did back in the day of permgen removal. E.g. do you have a global slot allocator that needs to be synchronized, or do you hand out the slots in packets of N to loaders to minimize contention. But then you face the same decisions, e.g. how large to make N. With a zillion loaders, you don't want each loader to hog N Klass table slots. But the boot loader would benefit from a very generous N. So, some complexity would return in a new form.</div><div><br></div><div>Cheers, Thomas</div></div></div>