Sharing the markword (aka Valhalla's markword use)

Sun Mar 10 05:53:25 UTC 2024

On 6 Mar 2024, at 10:47, Kennke, Roman wrote:

> I shall add that this is possible because we know at class-loading time whether or not we’re going to exceed the class-addressing-limit, and reserve the extra Klass* field at class-loading time, during field layout, and allocate all affected instances with that extra field right from the start, no stunts needed like with the compact I-hash stuff.

(What follows is speculation oriented towards an indefinite future…  Apologies in advance to Valhalla-dev for irrelevant stuff.)

Yes, I think this could work.  You could say that if the narrow-klass field is all-zero-bits, or if some separate mode bit is set, then we have to fish the full-klass field out from side storage somewhere in the object.  It’s a new expanded mode of klass storage.

Where is that side storage?  Well, it can’t be a fixed location in the object, so there’s a gap in that design; just a mode bit or an all-zero-bits condition doesn’t help you.

(Why not a fixed location, like offset=4?  Because you might overflow the narrow-klass field’s encoding range, when loading a subclass of a class which already uses the desirable offset=4 slot for a regular instance field.  You can get around this by using offset=-4, but that opens different cans of worms. Better to assume instance fields will compete for offsets with the injected full-klass field.)

You might want to store the offset of the full-klass field in the narrow-klass field, and use your mode bit to decide which access mode to use.  Hence 15 bits for the narrow-klass field, plus 1 bit for the expansion mode.

But, the joint bit encoding trick could be helpful here, to reduce the impact of the mode bit on header density.  Use 16 bits for the narrow-klass field.  Then, if the top 8 bits are nonzero, interpret the narrow-klass field as the identifier of one of the pre-eminent 65280 (2^16-2^8) klasses.  If the top 8 bits are zero, the field encodes an offset to find the full-klass field.  With a separate mode bit, you would only be able to represent 32768 distinct klasses in the narrow-klass field.  This is a typical result for joint bit encodings.
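
To make the counting concrete, here is a minimal sketch of the decode step, assuming the literal layout from the paragraph above (16-bit field, top 8 bits as the joint bits).  The names is_narrow_klass, full_klass_offset and count_joint_ids are made up for illustration; none of this is existing HotSpot code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16-bit narrow-klass field with a joint bit encoding:
//   top 8 bits nonzero -> the whole field is a klass ID
//   top 8 bits zero    -> the low 8 bits are the offset of the full-klass field
constexpr std::uint16_t kJointMask = 0xFF00;

constexpr bool is_narrow_klass(std::uint16_t field) {
  return (field & kJointMask) != 0;
}

// Meaningful only when !is_narrow_klass(field).
constexpr unsigned full_klass_offset(std::uint16_t field) {
  return field & 0x00FFu;
}

// Count how many of the 65536 field values directly name a klass.
inline unsigned count_joint_ids() {
  unsigned n = 0;
  for (unsigned f = 0; f <= 0xFFFFu; ++f) {
    if (is_narrow_klass(static_cast<std::uint16_t>(f))) ++n;
  }
  return n;
}

// With a dedicated mode bit, only the other 15 bits carry the klass ID.
constexpr unsigned count_mode_bit_ids() { return 1u << 15; }
```

The joint encoding gives up only the 256 values whose top byte is zero, where a dedicated mode bit gives up half of the encoding space.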

If the bottom 8 bits of the narrow-klass field are available to represent an offset (in the second mode), then the full-klass field can be anywhere in the instance as long as the instance size is at most 256 words.  (Assume scaled access, or else say the limit is 256 bytes.)  Surely if an instance is larger than that, we will know when we load the class into the VM, and before we allocate a klass ID.  If the class is non-final there is a hazard of a subclass requiring a full-klass field.  Thus, loading a non-final class of size 255 or larger requires the VM to allocate a full-klass field, whether it uses that field or not.  But, that’s a 0.4% overhead in instance size, so it’s tolerable.  The 0.4% parameter is a function of the number of bits (8) chosen for the joint bit encoding.

(So don’t take the example too literally.  It might not be 8 bits, and it might not be the actual top or bottom bits.  If your CPU doesn’t have a good scaled-load instruction, maybe the zero bits you want are the bottom 3 and the top 5 of a 16-bit narrow-klass field.  Making any such masked test is easy on all our CPUs.  That’s why I think joint bit encodings can be interesting today.)
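
A sketch of that alternative bit choice, assuming the joint bits are the bottom 3 and top 5 of the 16-bit field, so that in expanded mode the field value is already a byte offset aligned to 8.  The mask constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Joint bits: top 5 (0xF800) and bottom 3 (0x0007) of the 16-bit field.
constexpr std::uint16_t kJointMask = 0xF807;

// Expanded mode is signaled by all joint bits being zero: one masked test.
constexpr bool is_expanded(std::uint16_t field) {
  return (field & kJointMask) == 0;
}

// In expanded mode the remaining bits 3..10 form a byte offset that is a
// multiple of 8, so no scaled load is needed: the field is the offset.
constexpr unsigned full_klass_byte_offset(std::uint16_t field) {
  return field;
}

// Count the field values that signal expanded mode.
inline unsigned count_expanded_values() {
  unsigned n = 0;
  for (unsigned f = 0; f <= 0xFFFFu; ++f) {
    if (is_expanded(static_cast<std::uint16_t>(f))) ++n;
  }
  return n;
}
```

The number of values given up to expanded mode is still 256, so the count of directly encoded klasses does not change; only the position of the tested bits does.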

However, it’s not clear to me whether the cost of a modal narrow-klass field is going to be bearable.  We’ll have to measure it when the time comes.  Testing one bit (or 8 bits) in an object header is one instruction, but it’s an extra instruction, and in general loading the klass of an instance will be a branchy operation with tiny klass fields, where it is not branchy today.  That could be a problem.  It will at least be a challenge to optimize.  If the increased branchiness is a problem, the solution might well be to stay with the larger 64-bit headers.  When we get there we can test.  Different CPUs will give different results.

Idea to reduce branchiness:  Always load the full-klass pointer, let’s say a 64-bit pointer, from an effective address computed from the object and its header.  The narrow-klass field ALWAYS, without branchiness, contributes the offset (scaled or not) to such a load.  The mode bit and/or joint bits select the base address.  The base address is produced by a conditional-move instruction whose inputs are the current object and a constant base address where the first 65280 klasses are on display.  This would push the modality into a single CMOV instruction, removing all branching, at the cost of an extra unconditional indirection.

 hdr = *(int *)obj;
 base0 = (Klass **)obj; // expanded mode: the full-klass field is inside the instance
 base1 = Universe::narrow_klass_table(); //constant display of the narrow klasses
 base = is_narrow_klass(hdr) ? base1 : base0; //CMOV
 offset = narrow_klass_field(hdr);
 klass = base[ offset ]; // look folks, no branches!
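
For anyone who wants to poke at it, here is a self-contained toy version of the same idea.  It is a sketch under several assumptions of mine: objects are modeled as arrays of pointer-sized words, the narrow-klass field is the low 16 bits of word 0, the top 8 bits are the joint bits, and the display table is a plain global array (the name narrow_klass_table and the Klass struct are stand-ins, not HotSpot's).  Whether a compiler actually emits a CMOV for the select is not guaranteed.

```cpp
#include <cassert>
#include <cstdint>

struct Klass { const char* name; };  // stand-in for the VM's klass metadata

using Word = std::uintptr_t;         // one pointer-sized slot

// Hypothetical display table, indexed by the whole 16-bit narrow-klass field.
// Entries 0..255 stay unused because those values mean "expanded mode".
static Word narrow_klass_table[1u << 16];

inline Klass* load_klass(const Word* obj) {
  // Narrow-klass field: assumed to be the low 16 bits of the header word.
  std::uint32_t field = static_cast<std::uint32_t>(obj[0]) & 0xFFFFu;
  bool is_narrow = (field & 0xFF00u) != 0;
  // Select the base address; this is the candidate for a conditional move.
  const Word* base = is_narrow ? narrow_klass_table : obj;
  // One unconditional indexed load, whichever base was selected.
  return reinterpret_cast<Klass*>(base[field]);
}
```

In narrow mode the load goes through the table; in expanded mode the same load reads the full-klass word out of the instance itself, at the word index given by the field.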

Example optimization:  If static type information in the JIT can prove that an object reference has no subclasses with non-narrow klass IDs, then the JIT can assume that the “is_narrow_klass” query above is always true, and fold up the CMOV.

Another example optimization:  If I am doing some x instanceof C operation, and C is final, then I can compare the narrow-klass field of x against the narrow encoding of C, and take that as my final answer.  It’s OK if the narrow-klass field is an expansion signal to look for the full-klass field; I don’t need to do that fetch to finish this particular operation.  There may be several more techniques of this kind to remove branchiness from code that looks at object klasses.
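
A sketch of that instanceof shortcut, assuming C is final and itself has a narrow encoding, and assuming (as an illustration only) that the narrow-klass field is the low 16 bits of the header.  The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// x instanceof C, for a final class C with a narrow klass ID.
// An expansion signal in the header (top 8 bits of the field zero) can never
// equal a narrow ID, so the comparison is false without fetching the
// full-klass field.
constexpr bool instanceof_final(std::uint32_t hdr, std::uint16_t narrow_id_of_C) {
  return (hdr & 0xFFFFu) == narrow_id_of_C;
}
```

The comparison is a single masked compare, with no load beyond the header and no mode test.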

As I said before, all this kind of trickery is relevant mainly to 32-bit headers, not 64-bit headers.  And 32-bit headers may well be in the future, relative to Valhalla.  At that future point, little or none of my present speculation may apply.  And yet I speculate, FTR.