Adding Valhalla bits to Lilliput headers
john.r.rose at oracle.com
Thu May 11 18:23:09 UTC 2023
Since Lilliput is becoming very real (yay!) the question arises, how will Valhalla be impacted, if at all?
Since we use object header bits for special new state (4 at current count), we will need to rethink how those work with Lilliput’s reorganization of that same header.
Here are a few thoughts on this set of problems:
Comments? Anything I should add or remove?
And here’s an inline copy, with some formatting stripped:
Adding Valhalla bits to Lilliput headers
John Rose, 5/2023
Valhalla makes new fundamental distinctions between types, one between identity objects and value objects, and another between variables which are not (easily) flattenable and others which are null-free values that are flattenable.
Some of these distinctions are relevant on the hottest paths in the JVM, including the acmp instruction (which has a new meaning for value objects) and the aaload instruction (which has a special behavior for flattened arrays).
Valhalla prototyping work indicates that gating the new behaviors, with best performance along hot paths, requires allocation of new flag bits in the object header.
In fact, there are currently four header bits to gate:
One bit VAL declares that “this object has no identity because it is a value”. (This is the key bit; it gates the acmp fast path.)
Two bits ARFL declare that “this array object has special handling for flat data”. (One says “it’s all flat in here”, while one says “not flat, but convert nulls to flat zeroes”.)
One bit LARV declares that “this object is a larval buffer for Unsafe”. (That’s internal magic.)
That last magic bit encodes a subtle internal distinction used only within code that synthesizes value objects reflectively (e.g., during deserialization). LARV is temporary marker on a value object, to the effect that the object is operating as a non-shareable and mutable private buffer, instead of according to its standard semantics. This is internal “larval” state is hidden from all users, except clients of the Unsafe API.
Meanwhile, Lilliput is busy shrinking object footprint, under the heading “Compact Object Headers” (see JEP 450 for current work). A compact object header uses 64 bits (and eventually maybe even 32 bits) to hold per-object metadata that currently occupies 96 bits. The 64 bits are carefully laid out according to this division, from LSB to MSB:
2 bits for object locking state.
5 more bits (used with the first 2) for internal GC states.
25 bits to store the identity hash code (after it is computed).
32 bits for the class pointer, in a compressed state.
A future version of Lilliput may restore the hash code to its previous glory of 31 bits, at the direct cost of reducing the class pointer to 26 bits. This will probably require adjusting the class pointer compression scheme to use fewer bits for similar numbers of classes, so as to support those few Java applications which use dynamic class creation to make huge numbers of classes. It is a tricky tradeoff.
This leads to a conflict: In Lilliput’s current (and likely future) accounting, there are no more bits to hand out to Valhalla for it to use on its hot paths.
Mad props to David Simms, Fredric Parain, and Tobias Hartmann for inventing this stuff, and giving helpful discussions on the shape of this problem. And respect to Roman Kennke and his merry band of Lilliputians for reimagining object layouts. All errors in this document are sole property of the author, but he will part with them upon notification of their presence.
To maintain its performance model as prototyped, Valhalla must remove up to four bits from the other compressed header components, or else compromise performance.
The likely form of compromised performance is pushing the bits down into into the Klass structure, at the cost of a Klass decode and indirection on hot paths.
For the record, an unlikely form of compromised performance is adding a new field to Valhalla values, just to carry the extra bits. This move would directly conflict with a key Valhalla goal, shared with Lilliput, which is to reduce footprint. Lilliput shrinks object headers, and Valhalla removes them (for flattened or scalarized variables).
Let’s not compromise, but rather let us use the header bits, and reinterpret them creatively.
The fundamental distinction of value vs. identity must be either a stand-alone bit VAL or somehow derived from the class pointer, by decoding. (…Or by indirectly storing VAL in the Klass, as discussed just above.)
Is it a zero-sum game? Must we steal bandwidth from the compressed class ID field to encode VAL instead?
Yes, it’s true, we must steal bandwidth. And the simplest answer is to reduce the theoretical maximum number of identity classes by 50%.
If we believe that in the far future the mix of value vs. identity classes is 50/50, this is even an optimal step, since we could contrive to keep two spaces of classes, of equal size.
But there are gentler ways to siphon off bandwidth. If we believe that, in the foreseeable future, at most %12.5 of classes will be value classes (a reasonable limit) then we can take three bits to jointly encode the VAL condition. All class IDs with three low bits zero would be reserved to represent value classes. That’s 12.5% of them, and the other 87.5% would be reserved to represent identity classes (the legacy case).
if ((object->_header._klass & VAL_BITS3_MASK) != 0)
Such a decision can be internally adjusted as Valhalla is adopted, but there is likely a setting (3 bits? 2 bits? 1 bit?) which suffices for all workloads. It is unlikely that more than 3 bits is needed, simply because if an application has so many classes that it breaks the encoding limit, raising that limit by a few percent (by going to 4 or 5 bits) is unlikely to stave off OOM for long.
Once the VAL condition is encoded, the larval state bit (LARV) can be allocated easily, by observing that (a) value objects are never locked, and (b) larval objects are never shared. The simplest way to implement the LARV state is to identify it with a locking state. This state needs no associated monitor block; in some designs it corresponds to an “anonymously locked” state.
if ((object->_header._klass & VAL_BITS3_MASK) != 0)
const LARV_BIT_MASK = LOCK_MASK_IN_PLACE;
if ((object->_header._klass & LARV_BIT_MASK) != 0)
What about all those hash bits? A value object has no identity hash code (because it has no identity) but there is an overloading of the concept which assigns a structural hash code to a value object instead. This value is likely to be cached in the object header, just like a true identity hash code. If it is not cached in that way, then all those bits open up for use, just for value objects. The LARV bit could be placed there instead.
And the possibilities would be endless: The GC could cache layout summaries in those freed-up bits, for tracing value objects (when on heap) without loading class metadata. It’s tempting. But in the end, it’s likely that Valhalla will use a common “container” for caching both identity and structural hash codes. And that “container” will go somewhere else, anyway, if and when Lilliput reaches the 32 bit limit for headers.
This leaves the array bits to consider. Where to put them? There are two choices (besides the third choice of pushing them down into the Klass, at the cost of extra indirections in aaload).
First, try the “gentle bandwidth siphon” again: Play the 3-bit trick (for some value of “3”, as discussed above) to take another 12.5% of the class encodings and reserve them for specially-marked arrays. If there are two (or N) special states to worry about within specially-marked arrays, divide that space further into 2 (or N) parts.
In fact, it might be good to use the 3-bit encoding trick to regularly mark all arrays, with a distinctive encoding in their headers. That would make array checks even faster than the current fast path, which loads a descriptor word (the “layout helper”) from the Klass. That in turn might possibly make some array-sensitive hot paths go faster; it’s hard to tell without doing the prototyping and testing.
If we change the focus to treating all arrays as special, we can discuss taking some the header’s hash code field to encode array characteristics. This would degrade the quality of array hashes, since all arrays with a common set of characteristics would be required to share a common subset of hash codes.
This might seem bad, but note that array hashes are relatively useless, since they do not take array content into account, unlike List hashes.
If there were four equivalence classes, and an identity hash map had up to 2^25 arrays all of type (say) int, then it would see excess collisions after only 2^23 arrays were inserted.
This interesting discussion would also include the question: What workloads operate on large number of directly hashed arrays? And since those workloads will have the same failure mode after the full 2^25 arrays are entered, do we care that doom comes a little sooner if we steal bits for array characteristics?
If and when hash codes are put back to 31 bits, the same discussion moves out to that scale, which is even less likely.
If we “steal” array bits from the identity hash code, then array parts of the “layout helper” descriptor, which describes the scale factor of the array and its basic element type in about 4-5 bits, could be hoisted into the repurposed identity hash bits. Those bits could continue to be part of the hash as well. An extra SHIFT/XOR, perhaps from the class ID field, could help obscure their presence.
In any case, it seems likely that some special encoding of compressed class IDs for arrays will save the day, for array fast paths, and specifically for the special processing paths for arrays that are new in Valhalla. And stealing bits from the hash field, for arrays, seems worth further discussion as well.
More information about the valhalla-dev