Sharing the markword (aka Valhalla's markword use)
Kennke, Roman
rkennke at amazon.de
Mon Mar 11 12:58:34 UTC 2024
>
> (What follows is speculation oriented towards an indefinite future… Apologies in advance to Valhalla-dev for irrelevant stuff.)
>
I removed valhalla-dev from this thread.
> Yes, I think this could work. You could say that if the narrow-klass field is all-zero-bits, or if some separate mode bit is set, then we have to fish the full-klass field out from side storage somewhere in the object. It’s a new expanded mode of klass storage.
>
> Where is that side storage? Well, it can’t be a fixed location in the object, so there’s a gap in that design; just a mode bit or an all-zero-bits condition doesn’t help you.
>
Quite the opposite, I think. It *has* to be a fixed offset. Why? Because if it depends on the field layout, then we’d have to find that first, and we can only find it via the Klass.
> (Why not a fixed location, like offset=4? Because you might overflow the narrow-klass field’s encoding range, when loading a subclass of a class which already uses the desirable offset=4 slot for a regular instance field. You can get around this by using offset=-4, but that opens different cans of worms. Better to assume instance fields will compete for offsets with the injected full-klass field.)
I don’t think it’s a problem. Either we agree that the 32-bit narrow encoding range is enough, in which case we can put it at offset 4, or, if we *really* think we need the full 64-bit range, we stick it at offset 8. Since we know all this at class-loading time, we can lay out the fields around it. Putting it at 4 seems preferable because there is a smaller chance of a gap and a higher chance of saving memory, but I don’t think it matters all that much.
This way, we can use all-zeroes to indicate ‘look for Klass* elsewhere’, and not spend an extra bit on the special case (that is a neat idea!).
Unless I am missing something.
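To make the fixed-offset idea concrete, here is a minimal sketch, not actual Lilliput code. It assumes (hypothetically) that the first 32 bits of the object hold only the narrow-klass field, that the full Klass* sits at offset 8, and that narrow decoding is a plain table lookup; the names load_klass, kFullKlassOffset and g_narrow_klass_table are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Klass { const char* name; };            // stand-in for HotSpot's Klass

// Assumption: the full Klass* lives at this fixed offset when the narrow field is zero.
constexpr std::size_t kFullKlassOffset = 8;
// Assumption: narrow IDs are decoded through a small table (hypothetical).
constexpr std::size_t kTableSize = 16;
static Klass* g_narrow_klass_table[kTableSize];

inline Klass* load_klass(const void* obj) {
    std::uint32_t narrow;
    std::memcpy(&narrow, obj, sizeof narrow);  // read the 32-bit header
    if (narrow != 0) {                         // common case: narrow encoding
        assert(narrow < kTableSize);
        return g_narrow_klass_table[narrow];
    }
    // All-zeroes: fetch the full Klass* from the fixed offset, no Klass needed to find it.
    Klass* full;
    std::memcpy(&full, static_cast<const char*>(obj) + kFullKlassOffset, sizeof full);
    return full;
}
```

The point of the sketch is only that the slow path needs nothing but the object address and a constant, so it can live in an out-of-line stub.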
> However, it’s not clear to me whether the cost of a modal narrow-klass field is going to be bearable.
I don’t think it is a problem. Current Lilliput already tests the header bits and branches when the object is monitor-locked. That cost is not measurable (we need to make sure that the test-and-branch is laid out in a way that does not mess up static branch prediction, but that is easy, using stubs). The monitor-test will go away, but we can have an all-zeroes check in that same place. Handling the all-zeroes case would be slightly more costly, but not much, if we can use a fixed offset.
> We’ll have to measure it when the time comes. Testing one bit (or 8 bits) in an object header is one instruction, but it’s an extra instruction, and in general loading the klass of an instance will be a branchy operation with tiny klass fields, where it is not branchy today. That could be a problem. It will at least be a challenge to optimize. If the increased branchiness is a problem, the solution might well be to stay with the larger 64-bit headers. When we get there we can test. Different CPUs will give different results.
>
> Idea to reduce branchiness: Always load the full-klass pointer, let’s say a 64-bit pointer, from an effective address computed from the object and its header. The narrow-klass field ALWAYS, without branchiness, contributes the offset (scaled or not) to such a load. The mode bit and/or joint bits contribute the base address. The base address is a conditional-move instruction whose inputs are the current object and a constant base address where the first 65280 klasses are on display. This would push the modality into a single CMOV instruction, removing all branching, at the cost of an extra unconditional indirection.
>
> hdr = *(int *)obj;
> base0 = (Klass **)obj;                   // expanded case: full-klass slot lives in the object
> base1 = Universe::narrow_klass_table();  // constant; narrow case: table of Klass*
> base = is_narrow_klass(hdr) ? base1 : base0; //CMOV
> offset = narrow_klass_field(hdr);
> klass = base[ offset ]; // look folks, no branches!
>
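For reference, the quoted idea can be written as compilable C++ roughly as follows. This is only a sketch under assumptions that are not in the mail: a hypothetical mode bit (bit 15 of the header) marks the expanded case, the low 15 bits are the offset in pointer-sized slots, and the table is selected when the klass is narrow. Whether the ternary becomes a CMOV (or CSEL) is up to the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Klass { const char* name; };               // stand-in for HotSpot's Klass

constexpr std::uint32_t kExpandedBit = 0x8000;    // hypothetical mode bit
constexpr std::uint32_t kOffsetMask  = 0x7FFF;    // hypothetical offset bits
static Klass* g_narrow_klass_table[kOffsetMask + 1];  // hypothetical klass table

inline Klass* load_klass_branch_free(const void* obj) {
    std::uint32_t hdr;
    std::memcpy(&hdr, obj, sizeof hdr);
    const bool narrow = (hdr & kExpandedBit) == 0;
    const auto base0 = reinterpret_cast<std::uintptr_t>(obj);                   // expanded: the object
    const auto base1 = reinterpret_cast<std::uintptr_t>(g_narrow_klass_table);  // narrow: the table
    const std::uintptr_t base = narrow ? base1 : base0;  // a select; typically no branch
    const std::size_t offset = hdr & kOffsetMask;        // always contributes the offset
    Klass* k;
    std::memcpy(&k, reinterpret_cast<const char*>(base + offset * sizeof(Klass*)), sizeof k);
    return k;                                            // one unconditional load
}
```

In the narrow case the load reads a table slot; in the expanded case it reads a pointer-sized slot inside the object itself, so both modes share the same instruction sequence.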
> Example optimization: If static type information in the JIT can prove that an object reference has no subclasses with non-narrow klass IDs, then the JIT can assume that the “is_narrow_klass” query above is always true, and fold up the CMOV.
>
> Another example optimization: If I am doing some x instanceof C operation, and C is final, then I can compare the narrow-klass field of x against the narrow encoding of C, and take that as my final answer. It’s OK if the narrow-klass field is an expansion signal to look for the full-klass field; I don’t need to do that fetch to finish this particular operation. There may be several more techniques of this kind to remove branchiness from code that looks at object klasses.
>
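The instanceof-against-a-final-class case might look like the sketch below. It assumes, hypothetically, that the header is just the 32-bit narrow-klass field, that 0 is the expansion signal, and that C itself has a narrow encoding; instanceof_final is a made-up name.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Compare the narrow-klass field directly against C's narrow ID.
// If the field is the expansion signal (0 here), the object cannot be a C,
// because C has a narrow ID and, being final, has no subclasses.
inline bool instanceof_final(const void* obj, std::uint32_t narrow_id_of_C) {
    assert(narrow_id_of_C != 0);               // C must have a narrow encoding
    std::uint32_t field;
    std::memcpy(&field, obj, sizeof field);
    return field == narrow_id_of_C;            // no full-klass fetch needed
}
```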
> As I said before, all this kind of trickery is relevant mainly to 32-bit headers, not 64-bit headers. And 32-bit headers may well be in the future, relative to Valhalla. At that future point, little or none of my speculation at this point may apply. And yet I speculate, FTR.
Let’s keep all of that in mind (or better yet, in the wiki ;-) )
Thanks for all your suggestions and clarifications, that’s all very useful.
Roman