Sharing the markword (aka Valhalla's markword use)

Mon Mar 11 12:58:34 UTC 2024

> 
> (What follows is speculation oriented towards an indefinite future…  Apologies in advance to Valhalla-dev for irrelevant stuff.)
> 

I removed valhalla-dev from this thread.

> Yes, I think this could work.  You could say that if the narrow-klass field is all-zero-bits, or if some separate mode bit is set, then we have to fish the full-klass field out from side storage somewhere in the object.  It’s a new expanded mode of klass storage.
> 
> Where is that side storage?  Well, it can’t be a fixed location in the object, so there’s a gap in that design; just a mode bit or an all-zero-bits condition doesn’t help you.
> 

Quite the opposite, I think. It *has* to be a fixed offset. Why? Because if it depends on the field layout, then we’d have to find that first, and we can only find it via the Klass.

> (Why not a fixed location, like offset=4?  Because you might overflow the narrow-klass field’s encoding range, when loading a subclass of a class which already uses the desirable offset=4 slot for a regular instance field.  You can get around this by using offset=-4, but that opens different cans of worms. Better to assume instance fields will compete for offsets with the injected full-klass field.)

I don’t think it’s a problem. Either we agree that the narrow 32-bit encoding range is enough, in which case we can put it at offset 4, or, if we *really* think we’d need the whole 64-bit range, we stick it at offset 8. Since we know all this at class-loading time, we can lay out the fields around it. Putting it at 4 seems preferable because of the smaller chance of a gap and the higher chance of saving memory, but I don’t think it matters all that much.

This way, we can use all-zeroes to indicate ‘look for Klass* elsewhere’, and not spend an extra bit on the special case (that is a neat idea!).

Unless I am missing something.
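
For concreteness, a minimal C sketch of the fixed-offset variant might look like the following. The 16-bit field width, the table name narrow_klass_table, and the choice of offset 8 for a full 64-bit Klass* are assumptions made for illustration only; this is not actual Lilliput or HotSpot code.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for HotSpot's Klass; only its identity matters here. */
typedef struct Klass { int id; } Klass;

/* Assumed layout (hypothetical): a 32-bit header at offset 0 whose low
 * 16 bits are the narrow-klass field and, only for objects whose narrow
 * field is zero, a full Klass* stored at fixed offset 8. */
enum { NARROW_KLASS_BITS = 16, FULL_KLASS_OFFSET = 8 };
#define NARROW_KLASS_MASK ((1u << NARROW_KLASS_BITS) - 1u)

/* Hypothetical table mapping narrow IDs to Klass*; slot 0 is never used. */
static Klass *narrow_klass_table[1u << NARROW_KLASS_BITS];

static Klass *load_klass(const void *obj) {
    uint32_t hdr;
    memcpy(&hdr, obj, sizeof hdr);          /* read the 32-bit header */
    uint32_t nk = hdr & NARROW_KLASS_MASK;
    if (nk != 0) {
        return narrow_klass_table[nk];      /* common case: narrow ID */
    }
    /* All-zero narrow field: the Klass* sits at a fixed offset, so no
     * field-layout lookup (which would itself need the Klass) is needed. */
    Klass *full;
    memcpy(&full, (const char *)obj + FULL_KLASS_OFFSET, sizeof full);
    return full;
}
```

The slow path is a single extra load from a constant offset, which is why it should stay cheap as long as the offset does not depend on the class.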

> However, it’s not clear to me whether the cost of a modal narrow-klass field is going to be bearable.

I don’t think it is a problem. Current Lilliput already tests the header bits, and branches when the object is monitor-locked. That cost is not measurable (we need to make sure that the test-and-branch is laid out in a way that does not mess up static branch prediction, but that is easy, using stubs). The monitor-test will go away, but we can have an all-zeroes check in that same place. Handling the all-zeroes case would be slightly more costly, but not much, if we can use a fixed offset.

> We’ll have to measure it when the time comes.  Testing one bit (or 8 bits) in an object header is one instruction, but it’s an extra instruction, and in general loading the klass of an instance will be a branchy operation with tiny klass fields, where it is not branchy today.  That could be a problem.  It will at least be a challenge to optimize.  If the increased branchiness is a problem, the solution might well be to stay with the larger 64-bit headers.  When we get there we can test.  Different CPUs will give different results.
> 
> Idea to reduce branchiness:  Always load the full-klass pointer, let’s say a 64-bit pointer, from an effective address computed from the object and its header.  The narrow-klass field ALWAYS, without branchiness, contributes the offset (scaled or not) to such a load.  The mode bit and/or joint bits contribute the base address.  The base address is a conditional-move instruction whose inputs are the current object and a constant base address where the first 65280 klasses are on display.  This would push the modality into a single CMOV instruction, removing all branching, at the cost of an extra unconditional indirection.
> 
> hdr = *(int *)obj;
> base0 = (Klass **)obj; // the object itself, viewed as an array of slots
> base1 = Universe::narrow_klass_table(); // constant
> base = is_narrow_klass(hdr) ? base1 : base0; // CMOV
> offset = narrow_klass_field(hdr);
> klass = base[ offset ]; // look folks, no branches!
> 
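
A compilable C rendering of this branch-free idea could look like the sketch below. It assumes a separate mode bit (bit 31, set for expanded mode) and a 16-bit narrow-klass field that holds either a table index or, in expanded mode, the pointer-sized slot index inside the object; all of these names and encodings are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef struct Klass { int id; } Klass;   /* stand-in for HotSpot's Klass */

/* Assumptions (hypothetical): bit 31 of the 32-bit header is the mode bit
 * (set = expanded), and the low 16 bits are the narrow-klass field. */
#define EXPANDED_MODE_BIT 0x80000000u
#define NARROW_KLASS_MASK 0x0000FFFFu

static Klass *narrow_klass_table[1u << 16];

static Klass *load_klass_branchless(const void *obj) {
    uint32_t hdr;
    memcpy(&hdr, obj, sizeof hdr);
    const char *base0 = (const char *)obj;                /* the object itself */
    const char *base1 = (const char *)narrow_klass_table; /* constant base */
    /* A pointer select like this is typically compiled to CMOV/CSEL,
     * although the C language does not guarantee it. */
    const char *base = (hdr & EXPANDED_MODE_BIT) ? base0 : base1;
    /* The narrow field always contributes the (scaled) offset. */
    size_t offset = (size_t)(hdr & NARROW_KLASS_MASK) * sizeof(Klass *);
    Klass *klass;
    memcpy(&klass, base + offset, sizeof klass);  /* one unconditional load */
    return klass;
}
```

Compared with the test-and-branch version, every klass load pays for the indirection through the table, even for objects with a narrow ID; whether that trade is worthwhile would have to be measured per CPU.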
> Example optimization:  If static type information in the JIT can prove that an object reference has no subclasses with non-narrow klass IDs, then the JIT can assume that the “is_narrow_klass” query above is always true, and fold up the CMOV.
> 
> Another example optimization:  If I am doing some x instanceof C operation, and C is final, then I can compare the narrow-klass field of x against the narrow encoding of C, and take that as my final answer.  It’s OK if the narrow-klass field is an expansion signal to look for the full-klass field; I don’t need to do that fetch to finish this particular operation.  There may be several more techniques of this kind to remove branchiness from code that looks at object klasses.
> 
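
The final-class case can be sketched in a few lines of C. The sketch assumes a 16-bit narrow-klass field, an all-zero escape value, and that C itself has a nonzero narrow ID; the function name is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NARROW_KLASS_MASK 0x0000FFFFu  /* assumed 16-bit narrow-klass field */

/* narrow_id_of_c must be the nonzero narrow encoding of a final class C.
 * An object whose field holds the escape value (assumed to be 0) can never
 * match, so the full-klass fetch is not needed to answer the query. */
static bool is_instance_of_final(const void *obj, uint32_t narrow_id_of_c) {
    uint32_t hdr;
    memcpy(&hdr, obj, sizeof hdr);
    return (hdr & NARROW_KLASS_MASK) == narrow_id_of_c;
}
```

If C only had an expanded encoding, this shortcut would not apply and the check would have to load the full klass first.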
> As I said before, all this kind of trickery is relevant mainly to 32-bit headers, not 64-bit headers.  And 32-bit headers may well be in the future, relative to Valhalla.  At that future point, little or none of my speculation at this point may apply.  And yet I speculate, FTR.

Let’s keep all of that in mind (or better yet, the wiki ;-) )

Thanks for all your suggestions and clarifications, that’s all very useful.

Roman
