<html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <font size="4"><font face="monospace">Here's a *first draft* of a

        document to go into the SoV docset on the performance model.  <br>

        <br>

      </font></font><br>

    <div style="color: #d4d4d4;background-color: #1e1e1e;font-family: Consolas, 'Courier New', monospace;font-weight: normal;font-size: 16px;line-height: 22px;white-space: pre;"><div><span style="color: #569cd6;font-weight: bold;"># State of Valhalla</span></div><div><span style="color: #569cd6;font-weight: bold;">## Part 5: Performance Model {.subtitle}</span></div>

<div><span style="color: #569cd6;font-weight: bold;">#### Brian Goetz {.author}</span></div><div><span style="color: #569cd6;font-weight: bold;">#### June 2022 {.date}</span></div>

<div><span style="color: #d4d4d4;">This document describes performance considerations for value classes under</span></div><div><span style="color: #d4d4d4;">Project Valhalla.  While we describe the optimizations that we expect the</span></div><div><span style="color: #d4d4d4;">HotSpot JVM to routinely make, other JVMs may make their own choices, and of</span></div><div><span style="color: #d4d4d4;">course these choices may vary over time and situations.</span></div>

<div><span style="color: #569cd6;font-weight: bold;">## Flattening</span></div>

<div><span style="color: #d4d4d4;">Project Valhalla has two broad classes of goals.  The first is </span><span style="color: #d4d4d4;font-style: italic;">_unification_</span><span style="color: #d4d4d4;">:</span></div><div><span style="color: #d4d4d4;">unifying the treatment of primitives and references in the Java type system.</span></div><div><span style="color: #d4d4d4;">The second is </span><span style="color: #d4d4d4;font-style: italic;">_performance_</span><span style="color: #d4d4d4;">: enabling the declaration of aggregates with</span></div><div><span style="color: #d4d4d4;font-style: italic;">_flatter_</span><span style="color: #d4d4d4;"> and </span><span style="color: #d4d4d4;font-style: italic;">_denser_</span><span style="color: #d4d4d4;"> layouts than the layout we get with today's identity</span></div><div><span style="color: #d4d4d4;">classes.</span></div>

<div><span style="color: #d4d4d4;">By </span><span style="color: #d4d4d4;font-style: italic;">_flatness_</span><span style="color: #d4d4d4;">, we mean the number of memory indirections that must be traversed</span></div><div><span style="color: #d4d4d4;">to get to the leaf data in an object graph.  If all object references are</span></div><div><span style="color: #d4d4d4;">implemented as pointers -- as they almost always are for identity objects --</span></div><div><span style="color: #d4d4d4;">then each object becomes an "island" of data, requiring indirections each time</span></div><div><span style="color: #d4d4d4;">we hop to another island.  Flatness begets density; each indirection requires an</span></div><div><span style="color: #d4d4d4;">object header, and eliminating indirections also reduces the number of object</span></div><div><span style="color: #d4d4d4;">headers.  Flatness also reduces garbage collection costs, since there are fewer</span></div><div><span style="color: #d4d4d4;">objects in the heap to process.  </span></div>

<div><span style="color: #569cd6;font-weight: bold;">### Heap flattening</span></div>

<div><span style="color: #d4d4d4;">The form of flattening that comes most readily to mind is flattening on the</span></div><div><span style="color: #d4d4d4;">heap, </span><span style="color: #d4d4d4;font-style: italic;">_inlining_</span><span style="color: #d4d4d4;"> the layout of some objects into that of other objects (and</span></div><div><span style="color: #d4d4d4;">arrays), which eliminates island-hopping.  If </span><span style="color: #ce9178;">`C`</span><span style="color: #d4d4d4;"> is an identity class, then a</span></div><div><span style="color: #d4d4d4;">field of type </span><span style="color: #ce9178;">`C`</span><span style="color: #d4d4d4;"> is laid out as a pointer, and an array of type </span><span style="color: #ce9178;">`C[]`</span><span style="color: #d4d4d4;"> is laid</span></div><div><span style="color: #d4d4d4;">out as an array of pointers.  If </span><span style="color: #ce9178;">`V`</span><span style="color: #d4d4d4;"> is a value type, we have the option to lay</span></div><div><span style="color: #d4d4d4;">out fields of type </span><span style="color: #ce9178;">`V`</span><span style="color: #d4d4d4;"> by inlining the fields of </span><span style="color: #ce9178;">`V`</span><span style="color: #d4d4d4;"> into the layout of the</span></div><div><span style="color: #d4d4d4;">enclosing type, and lay out arrays of type </span><span style="color: #ce9178;">`V[]`</span><span style="color: #d4d4d4;"> as repeating (aligned) groups</span></div><div><span style="color: #d4d4d4;">of the fields of </span><span style="color: #ce9178;">`V`</span><span style="color: #d4d4d4;">.  These layout choices reduce the number of indirections to</span></div><div><span style="color: #d4d4d4;">get to the fields of </span><span style="color: #ce9178;">`V`</span><span style="color: #d4d4d4;"> by one hop.  </span></div>
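<div><span style="color: #d4d4d4;">As a rough illustration in today's Java (not Valhalla syntax), the two array</span></div><div><span style="color: #d4d4d4;">layouts can be emulated by hand.  The </span><span style="color: #ce9178;">`Point`</span><span style="color: #d4d4d4;"> and </span><span style="color: #ce9178;">`FlatPoints`</span><span style="color: #d4d4d4;"> names are hypothetical,</span></div><div><span style="color: #d4d4d4;">and the interleaved </span><span style="color: #ce9178;">`int[]`</span><span style="color: #d4d4d4;"> only approximates what a JVM might choose for a</span></div><div><span style="color: #d4d4d4;">flattened </span><span style="color: #ce9178;">`V[]`</span><span style="color: #d4d4d4;">.</span></div>

```java
// Hypothetical names; this is plain Java, not a Valhalla API.
final class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

final class FlatPoints {
    // Repeating groups of the fields of Point: x0, y0, x1, y1, ...
    private final int[] data;
    private final int length;

    FlatPoints(int length) {
        this.length = length;
        this.data = new int[2 * length];
    }

    void set(int i, int x, int y) { data[2 * i] = x; data[2 * i + 1] = y; }
    int x(int i) { return data[2 * i]; }
    int y(int i) { return data[2 * i + 1]; }

    // Pointer layout: array slot -> pointer -> field.
    static long sumXPointers(Point[] points) {
        long sum = 0;
        for (Point p : points) sum += p.x;
        return sum;
    }

    // Flattened layout: the field is read straight out of the array.
    long sumXFlat() {
        long sum = 0;
        for (int i = 0; i != length; i++) sum += x(i);
        return sum;
    }
}
```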

<div><span style="color: #569cd6;font-weight: bold;">### Calling convention flattening</span></div>

<div><span style="color: #d4d4d4;">A less obvious but also important form of flattening is flattening in _method</span></div><div><span style="color: #d4d4d4;">calling conventions_.  If a method argument or return is a reference to </span><span style="color: #ce9178;">`C`</span><span style="color: #d4d4d4;">,</span></div><div><span style="color: #d4d4d4;">where </span><span style="color: #ce9178;">`C`</span><span style="color: #d4d4d4;"> is an identity class or polymorphic type (interfaces and abstract</span></div><div><span style="color: #d4d4d4;">classes), the argument will usually be passed as a pointer on the stack or in a</span></div><div><span style="color: #d4d4d4;">register.  If </span><span style="color: #ce9178;">`V`</span><span style="color: #d4d4d4;"> is a value type, we have another option: to </span><span style="color: #d4d4d4;font-style: italic;">_scalarize_</span><span style="color: #d4d4d4;"> </span><span style="color: #ce9178;">`V`</span></div><div><span style="color: #d4d4d4;">(explode it into its fields) and pass its fields on the stack or in registers.</span></div><div><span style="color: #d4d4d4;">(Perhaps surprisingly, in some situations we have the same option if </span><span style="color: #ce9178;">`V`</span><span style="color: #d4d4d4;"> is a</span></div><div><span style="color: #d4d4d4;">strongly typed reference to a value object (a </span><span style="color: #ce9178;">`V.ref`</span><span style="color: #d4d4d4;">) as well.)</span></div>
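<div><span style="color: #d4d4d4;">A hand-written analogue in today's Java may help picture this.  The </span><span style="color: #ce9178;">`Vec2`</span><span style="color: #d4d4d4;"> class</span></div><div><span style="color: #d4d4d4;">and both method names are hypothetical; the second method only sketches what a</span></div><div><span style="color: #d4d4d4;">scalarized calling convention amounts to, and is not how a JVM exposes it.</span></div>

```java
// Hypothetical names; plain Java used to mimic scalarization by hand.
final class Vec2 {
    final double x, y;
    Vec2(double x, double y) { this.x = x; this.y = y; }
}

final class Scalarize {
    // Source-level signature: the argument is a single aggregate.
    static double lengthSquared(Vec2 v) {
        return lengthSquaredScalarized(v.x, v.y);
    }

    // Scalarized form: the fields travel individually (in registers or on
    // the stack), and no pointer to a Vec2 is passed at all.
    static double lengthSquaredScalarized(double x, double y) {
        return x * x + y * y;
    }
}
```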

<div><span style="color: #d4d4d4;">Both heap layouts and calling convention are determined fairly early in the</span></div><div><span style="color: #d4d4d4;">execution of a program.  This means that any information needed to make these</span></div><div><span style="color: #d4d4d4;">flattening choices must be available early in the execution as well.  The </span><span style="color: #ce9178;">`Q`</span></div><div><span style="color: #d4d4d4;">descriptors used by value types act as a preload signal, as does the </span><span style="color: #ce9178;">`Preload`</span></div><div><span style="color: #d4d4d4;">attribute used for reference companions of value classes.</span></div>

<div><span style="color: #569cd6;font-weight: bold;">### Locals</span></div>

<div><span style="color: #d4d4d4;">Local variables have even more latitude over representation, because unlike</span></div><div><span style="color: #d4d4d4;">with layouts and calling conventions, there is no need for separately compiled</span></div><div><span style="color: #d4d4d4;">code to agree on a representation.  Values and references to values may be</span></div><div><span style="color: #d4d4d4;">routinely scalarized in local variables.  </span></div>

<div><span style="color: #569cd6;font-weight: bold;">### Which is more important?  </span></div>

<div><span style="color: #d4d4d4;">It is tempting to assume that heap flattening is more important, but this is a</span></div><div><span style="color: #d4d4d4;">bias we need to overcome.  Developers tend to be more aware of heap allocation</span></div><div><span style="color: #d4d4d4;">(we can see the </span><span style="color: #ce9178;">`new`</span><span style="color: #d4d4d4;"> in the code) and heap utilization is more easily measured</span></div><div><span style="color: #d4d4d4;">with monitoring tools.  But this is mostly observability bias.  Both are</span></div><div><span style="color: #d4d4d4;">important to performance, and serve complementary goals.  </span></div>

<div><span style="color: #d4d4d4;">Stack flattening is what makes much of the cost of using boxing and wrapper</span></div><div><span style="color: #d4d4d4;">classes like </span><span style="color: #ce9178;">`Optional`</span><span style="color: #d4d4d4;"> go away.  As developers, we all flinch a bit when we</span></div><div><span style="color: #d4d4d4;">have to return a wrapper like </span><span style="color: #ce9178;">`Optional`</span><span style="color: #d4d4d4;"> or a record type that the client is</span></div><div><span style="color: #d4d4d4;">just going to unpack; this feels like "needless motion".  Stack flattening</span></div><div><span style="color: #d4d4d4;">allows us to get the benefits of these abstractions without paying this cost,</span></div><div><span style="color: #d4d4d4;">which shows up as a streamlining of general computational costs.  </span></div>

<div><span style="color: #d4d4d4;">Heap flattening serves a different role; it is about flattening and compacting</span></div><div><span style="color: #d4d4d4;">object graphs.  This has a bigger impact on data-intensive code, allowing us to</span></div><div><span style="color: #d4d4d4;">pack more data into a given sized heap and traverse data in the heap more</span></div><div><span style="color: #d4d4d4;">efficiently.  It also means that the garbage collector has less work to do,</span></div><div><span style="color: #d4d4d4;">making more CPU cycles and memory bandwidth available for business calculation.  </span></div>

<div><span style="color: #569cd6;font-weight: bold;">## Additional considerations</span></div>

<div><span style="color: #d4d4d4;">There are two additional considerations that affect performance indirectly:</span></div><div><span style="color: #d4d4d4;">nullity and tearing.  </span></div>

<div><span style="color: #569cd6;font-weight: bold;">### Nulls</span></div>

<div><span style="color: #d4d4d4;">Nullity is a property of references; null is how a reference refers to no</span></div><div><span style="color: #d4d4d4;">instance at all.  Values (historically primitives, but now also value types) are</span></div><div><span style="color: #d4d4d4;">never null, so are directly amenable to scalarization.  Perhaps surprisingly,</span></div><div><span style="color: #d4d4d4;font-style: italic;">_references_</span><span style="color: #d4d4d4;"> to value types may also be scalarized by adjoining a synthetic</span></div><div><span style="color: #d4d4d4;">boolean </span><span style="color: #d4d4d4;font-style: italic;">_null channel_</span><span style="color: #d4d4d4;"> to represent whether or not the reference is null.  (If</span></div><div><span style="color: #d4d4d4;">the null channel indicates the reference is null, the data in the other channels</span></div><div><span style="color: #d4d4d4;">should be ignored.)  This null channel may require additional space, since many</span></div><div><span style="color: #d4d4d4;">value types (e.g., </span><span style="color: #ce9178;">`int`</span><span style="color: #d4d4d4;">) use all their bit patterns and therefore would need</span></div><div><span style="color: #d4d4d4;">additional bits to represent nullity.  However, the JVM has a variety of</span></div><div><span style="color: #d4d4d4;">possible tricks at its disposal to eliminate this extra footprint in some cases,</span></div><div><span style="color: #d4d4d4;">such as using slack in pointers, booleans, or the alignment shadow.</span></div>
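<div><span style="color: #d4d4d4;">The null channel can be pictured, very loosely, as an extra boolean carried</span></div><div><span style="color: #d4d4d4;">alongside the payload.  The </span><span style="color: #ce9178;">`ScalarizedIntRef`</span><span style="color: #d4d4d4;"> class below is a hypothetical</span></div><div><span style="color: #d4d4d4;">stand-in written in today's Java; a real JVM would keep the two channels in</span></div><div><span style="color: #d4d4d4;">registers or adjacent memory rather than in an object of its own.</span></div>

```java
// Hypothetical model of a scalarized, nullable reference to an int-like value.
final class ScalarizedIntRef {
    final boolean isNull; // the synthetic null channel
    final int value;      // data channel; meaningless when isNull is true

    private ScalarizedIntRef(boolean isNull, int value) {
        this.isNull = isNull;
        this.value = value;
    }

    static ScalarizedIntRef fromReference(Integer ref) {
        return ref == null
                ? new ScalarizedIntRef(true, 0)
                : new ScalarizedIntRef(false, ref);
    }

    Integer toReference() {
        return isNull ? null : Integer.valueOf(value);
    }

    // int uses all 32 of its bit patterns, so nullity needs at least one
    // extra bit unless it can be hidden in slack elsewhere.
    static int bitsNeeded() {
        return Integer.SIZE + 1;
    }
}
```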

<div><span style="color: #569cd6;font-weight: bold;">### Tearing</span></div>

<div><span style="color: #d4d4d4;">Whether or not to flatten heap-based variables has an additional consideration:</span></div><div><span style="color: #d4d4d4;">the possibility of </span><span style="color: #d4d4d4;font-style: italic;">_tearing_</span><span style="color: #d4d4d4;">.  Tearing occurs when a read of a logical quantity</span></div><div><span style="color: #d4d4d4;">(such as a 64-bit integer) is broken up into multiple physical reads (such as</span></div><div><span style="color: #d4d4d4;">two 32-bit reads), and the results of those reads correspond to different writes</span></div><div><span style="color: #d4d4d4;">of the logical quantity.  </span></div>

<div><span style="color: #d4d4d4;">The Java platform has always allowed for some form of tearing: reads and writes</span></div><div><span style="color: #d4d4d4;">of 64-bit primitives (</span><span style="color: #ce9178;">`long`</span><span style="color: #d4d4d4;"> and </span><span style="color: #ce9178;">`double`</span><span style="color: #d4d4d4;">) may be broken up into two 32-bit</span></div><div><span style="color: #d4d4d4;">reads and writes.  In the presence of a data race (which is a logic error),</span></div><div><span style="color: #d4d4d4;">these two reads could return data corresponding to two different logical writes.</span></div><div><span style="color: #d4d4d4;">This possible inconsistency was permitted because at the time, most hardware</span></div><div><span style="color: #d4d4d4;">lacked the ability to perform atomic 64-bit operations efficiently, and this</span></div><div><span style="color: #d4d4d4;">problem only occurs in concurrent programs that already have a serious</span></div><div><span style="color: #d4d4d4;">concurrency bug.  (The recommended cure is to declare the field </span><span style="color: #ce9178;">`volatile`</span><span style="color: #d4d4d4;">, but</span></div><div><span style="color: #d4d4d4;">any technique that eliminates the data race, such as guarding the data by a</span></div><div><span style="color: #d4d4d4;">lock, will also work.)</span></div>
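<div><span style="color: #d4d4d4;">The interleaving behind a torn 64-bit read can be replayed deterministically on</span></div><div><span style="color: #d4d4d4;">a single thread.  This is only a sketch: the </span><span style="color: #ce9178;">`TearingDemo`</span><span style="color: #d4d4d4;"> name is hypothetical,</span></div><div><span style="color: #d4d4d4;">and a big-endian </span><span style="color: #ce9178;">`ByteBuffer`</span><span style="color: #d4d4d4;"> stands in for a 64-bit variable on hardware that</span></div><div><span style="color: #d4d4d4;">can only store 32 bits at a time.</span></div>

```java
import java.nio.ByteBuffer;

// Hypothetical demo: replays, on one thread, an interleaving that a data
// race could produce on hardware without atomic 64-bit stores.
final class TearingDemo {
    static long tornRead() {
        ByteBuffer cell = ByteBuffer.allocate(8);   // one 64-bit variable
        long first = 0x1111111122222222L;
        long second = 0x3333333344444444L;

        // First logical write, performed as two 32-bit stores.
        cell.putInt(0, (int) (first >>> 32));
        cell.putInt(4, (int) first);

        // Second logical write: only the high half has landed so far...
        cell.putInt(0, (int) (second >>> 32));

        // ...when the racing reader loads both halves.
        long observed = cell.getLong(0);

        cell.putInt(4, (int) second);               // completes too late
        return observed;
    }

    public static void main(String[] args) {
        // Prints 3333333322222222: high half of one write, low half of the other.
        System.out.println(Long.toHexString(tornRead()));
    }
}
```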

<div><span style="color: #d4d4d4;">On modern hardware, most JVMs now use atomic 64-bit instructions for reads and</span></div><div><span style="color: #d4d4d4;">writes of </span><span style="color: #ce9178;">`long`</span><span style="color: #d4d4d4;"> and </span><span style="color: #ce9178;">`double`</span><span style="color: #d4d4d4;"> variables, as the performance of these</span></div><div><span style="color: #d4d4d4;">instructions has caught up and so JVMs can provide greater safety at negligible</span></div><div><span style="color: #d4d4d4;">cost.  However, with the advent of value classes, tearing under race again</span></div><div><span style="color: #d4d4d4;">becomes a possibility, since one can easily declare a value class whose layout</span></div><div><span style="color: #d4d4d4;">exceeds the maximum atomic load and store size of any hardware.</span></div>

<div><span style="color: #d4d4d4;">Because values are aggregates, some value classes may be less tolerant of</span></div><div><span style="color: #d4d4d4;">tearing than others.  Specifically, some value classes have representational</span></div><div><span style="color: #d4d4d4;">invariants across their fields (e.g., a </span><span style="color: #ce9178;">`Range`</span><span style="color: #d4d4d4;"> class that requires the lower</span></div><div><span style="color: #d4d4d4;">bound not exceed the upper bound), and exposing code to instances that do not</span></div><div><span style="color: #d4d4d4;">respect these invariants may be surprising or dangerous.  Accordingly, some</span></div><div><span style="color: #d4d4d4;">value classes may be declared with stronger or weaker atomicity requirements</span></div><div><span style="color: #d4d4d4;">(e.g., </span><span style="color: #ce9178;">`non-atomic`</span><span style="color: #d4d4d4;">) that affect whether or not instances may tear under race --</span></div><div><span style="color: #d4d4d4;">and which potentially constrain how these are flattened in the heap.  (Tearing</span></div><div><span style="color: #d4d4d4;">is not an issue for local variables or method parameters or returns, as these</span></div><div><span style="color: #d4d4d4;">are entirely within-thread and therefore are not at risk for data races.)  </span></div>
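<div><span style="color: #d4d4d4;">To make the </span><span style="color: #ce9178;">`Range`</span><span style="color: #d4d4d4;"> concern concrete, here is a single-threaded replay of an</span></div><div><span style="color: #d4d4d4;">interleaving that a race on flattened, non-atomic storage could produce.  The</span></div><div><span style="color: #d4d4d4;">class shapes and the two-slot array are assumptions made for illustration, not</span></div><div><span style="color: #d4d4d4;">a description of any actual JVM layout.</span></div>

```java
// Hypothetical Range with a cross-field invariant: lo must not exceed hi.
final class Range {
    final int lo, hi;
    Range(int lo, int hi) {
        if (lo > hi) throw new IllegalArgumentException("lo > hi");
        this.lo = lo;
        this.hi = hi;
    }
}

final class TornRangeDemo {
    // Returns the {lo, hi} pair a racing reader could observe.
    static int[] observeTornState() {
        int[] slots = new int[2];      // flattened fields of one Range
        Range a = new Range(0, 5);
        Range b = new Range(10, 20);

        slots[0] = a.lo;               // the write of a completes
        slots[1] = a.hi;

        slots[0] = b.lo;               // the write of b is only half done

        return slots.clone();          // {10, 5}: neither a nor b
    }
}
```

<div><span style="color: #d4d4d4;">Both writes stored valid ranges, yet the observed pair has a lower bound above</span></div><div><span style="color: #d4d4d4;">its upper bound, a state the constructor would have rejected.</span></div>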

<div><span style="color: #569cd6;font-weight: bold;">## Layout </span></div>

<div><span style="color: #d4d4d4;">Today, object layout is simple: reference types are represented as pointers, and</span></div><div><span style="color: #d4d4d4;">primitive types are represented directly (flat); similarly, arrays of reference</span></div><div><span style="color: #d4d4d4;">types are arrays of pointers, and arrays of primitives are flattened.  These</span></div><div><span style="color: #d4d4d4;">layout choices are common to heap, calling convention, and local representation.  </span></div>

<div><span style="color: #d4d4d4;">References to identity objects, and the built-in primitives, will surely</span></div><div><span style="color: #d4d4d4;">continue to use this layout.  But we have additional latitude with value types</span></div><div><span style="color: #d4d4d4;">and references to value objects.  The choice of layout for these new types will</span></div><div><span style="color: #d4d4d4;">depend on a number of factors: whether they have atomicity requirements, their</span></div><div><span style="color: #d4d4d4;">size, the context (heap, stack, or local), and mutability.</span></div>

<div><span style="color: #d4d4d4;">There are effectively three possible flattening strategies available, which</span></div><div><span style="color: #d4d4d4;">we'll call non-, low-, and full-flat.  Non-flat is the same old strategy as for</span></div><div><span style="color: #d4d4d4;">identity objects: pointers.  JVMs are free to fall back to non-flat in any</span></div><div><span style="color: #d4d4d4;">situation.   Full-flat is full inlining of the layout into the enclosing class</span></div><div><span style="color: #d4d4d4;">or array layout, as we get with primitives today.  Low-flat chooses between</span></div><div><span style="color: #d4d4d4;">these based on the size of the object layout and the hardware -- if the object</span></div><div><span style="color: #d4d4d4;">fits into cheap-enough atomic loads and stores, flatten as per full-flat,</span></div><div><span style="color: #d4d4d4;">otherwise fall back to non-flat. (In addition to requiring suitable atomic loads</span></div><div><span style="color: #d4d4d4;">and stores, the low-flat treatment may also require compiler heroics to support</span></div><div><span style="color: #d4d4d4;">reading and writing multiple fields in a single memory access.)</span></div>

<div><span style="color: #d4d4d4;">The following table outlines the best we can do based on the desired semantics:</span></div>

<div><span style="color: #d4d4d4;">| Kind                  | Stack                         | Heap                         |</span></div><div><span style="color: #d4d4d4;">| --------------------- | ----------------------------- | ---------------------------- |</span></div><div><span style="color: #d4d4d4;">| Identity object       | Non-flat                      | Non-flat                     |</span></div><div><span style="color: #d4d4d4;">| Primitive             | Full-flat                     | Full-flat                    |</span></div><div><span style="color: #d4d4d4;">| Non-atomic value type | Full-flat                     | Full-flat                    |</span></div><div><span style="color: #d4d4d4;">| Atomic value type     | Full-flat                     | Low-flat (unless final)      |</span></div><div><span style="color: #d4d4d4;">| Ref to value type     | Full-flat (with null channel) | Low-flat (with null channel) |</span></div>

<div><span style="color: #d4d4d4;">There are two significant attributes of this chart.  First, note that we can</span></div><div><span style="color: #d4d4d4;">still get full or partial flattening even for some reference types.  The other</span></div><div><span style="color: #d4d4d4;">is that we can flatten more uniformly on the stack (calling convention, locals)</span></div><div><span style="color: #d4d4d4;">than we can on the heap (fields, arrays).  This is because of the intrusion of</span></div><div><span style="color: #d4d4d4;">atomicity and the possibility of data races.  The stack is immune from data</span></div><div><span style="color: #d4d4d4;">races since it is confined to a single thread, but in the heap, there is always</span></div><div><span style="color: #d4d4d4;">the possibility of concurrent access.  If there are atomicity requirements</span></div><div><span style="color: #d4d4d4;">(atomic value types, and all references), then the best we can do is to</span></div><div><span style="color: #d4d4d4;">flatten up to the threshold of atomicity.  </span></div>

<div><span style="color: #d4d4d4;">For references to value types, the footprint cost may be larger to accommodate</span></div><div><span style="color: #d4d4d4;">the null channel (absent heroics to encode the null channel in slack bits.)</span></div>

<div><span style="color: #d4d4d4;">For final fields, we may be able to upgrade to full-flat even for atomic value</span></div><div><span style="color: #d4d4d4;">types and references to value types, because final fields are not subject to</span></div><div><span style="color: #d4d4d4;">tearing.  (Technically, they are if the receiver escapes construction, but in</span></div><div><span style="color: #d4d4d4;">this case the Java Memory Model voids the initialization safety guarantees that</span></div><div><span style="color: #d4d4d4;">would prevent tearing.)</span></div>

<div><span style="color: #d4d4d4;">For very large values, we would likely choose non-flat even when we could</span></div><div><span style="color: #d4d4d4;">otherwise flatten; flattening a 1000-field value type is likely to be</span></div><div><span style="color: #d4d4d4;">counterproductive.  </span></div>
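<div><span style="color: #d4d4d4;">The table and the remarks that follow it can be summarized as a small decision</span></div><div><span style="color: #d4d4d4;">function.  This is a sketch of the best case only: the enum and method names</span></div><div><span style="color: #d4d4d4;">are hypothetical, the size cutoff for very large values is left out, and a JVM</span></div><div><span style="color: #d4d4d4;">remains free to fall back to non-flat anywhere.</span></div>

```java
// Hypothetical names; encodes the best-case layouts described in this section.
enum Kind { IDENTITY_OBJECT, PRIMITIVE, NON_ATOMIC_VALUE, ATOMIC_VALUE, REF_TO_VALUE }

enum Layout { NON_FLAT, LOW_FLAT, FULL_FLAT }

final class LayoutTable {
    // Stack (calling convention, locals): only identity prevents flattening.
    static Layout onStack(Kind kind) {
        return kind == Kind.IDENTITY_OBJECT ? Layout.NON_FLAT : Layout.FULL_FLAT;
    }

    // Heap (fields, array elements): atomicity limits mutable variables.
    static Layout onHeap(Kind kind, boolean isFinalField) {
        switch (kind) {
            case IDENTITY_OBJECT:
                return Layout.NON_FLAT;
            case PRIMITIVE:
            case NON_ATOMIC_VALUE:
                return Layout.FULL_FLAT;
            default:
                // ATOMIC_VALUE and REF_TO_VALUE: final fields are not subject
                // to tearing, so they may be upgraded to full-flat.
                return isFinalField ? Layout.FULL_FLAT : Layout.LOW_FLAT;
        }
    }
}
```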

<div><span style="color: #569cd6;font-weight: bold;">### Expected performance</span></div>

<div><span style="color: #d4d4d4;">The table above reveals some insights.  It means that we can routinely expect</span></div><div><span style="color: #d4d4d4;">full flattening in locals and calling convention, regardless of atomicity,</span></div><div><span style="color: #d4d4d4;">nullity, or reference-ness -- the only thing we need to avoid is identity.  In</span></div><div><span style="color: #d4d4d4;">turn, this means the performance difference between </span><span style="color: #ce9178;">`V.ref`</span><span style="color: #d4d4d4;"> and </span><span style="color: #ce9178;">`V.val`</span><span style="color: #d4d4d4;"> for</span></div><div><span style="color: #d4d4d4;">method parameters, returns, and locals, is minimal (though there are some</span></div><div><span style="color: #d4d4d4;">second-order effects due to the null channel, such as register pressure and null</span></div><div><span style="color: #d4d4d4;">check instructions).  </span></div>

<div><span style="color: #d4d4d4;">The real difference between </span><span style="color: #ce9178;">`V.ref`</span><span style="color: #d4d4d4;"> and </span><span style="color: #ce9178;">`V.val`</span><span style="color: #d4d4d4;"> shows up in the heap, which</span></div><div><span style="color: #d4d4d4;">means fields and array elements.  (Arrays are particularly important because the</span></div><div><span style="color: #d4d4d4;">same structure is repeated many times, so any footprint and indirection costs</span></div><div><span style="color: #d4d4d4;">are multiplied by the array size.)  While identity was an absolute impediment to</span></div><div><span style="color: #d4d4d4;">flattening in the heap, once that is removed, the next impediment reveals</span></div><div><span style="color: #d4d4d4;">itself: atomicity.  Java has long offered a powerful safety guarantee of</span></div><div><span style="color: #d4d4d4;font-style: italic;">_initialization safety_</span><span style="color: #d4d4d4;"> for final fields, even when the object reference is</span></div><div><span style="color: #d4d4d4;">published via race.  This is where the oft-repeated wisdom of "immutable objects</span></div><div><span style="color: #d4d4d4;">are automatically thread-safe" comes from; it relies on the atomicity of loading</span></div><div><span style="color: #d4d4d4;">object references.  To avoid consistency surprises, value types provide</span></div><div><span style="color: #d4d4d4;">atomicity by default, but can be marked as </span><span style="color: #ce9178;">`non-atomic`</span><span style="color: #d4d4d4;"> to achieve greater</span></div><div><span style="color: #d4d4d4;">flattening (at the cost of tearing under race.)</span></div>

<div><span style="color: #d4d4d4;">Mutable variables of atomic types -- atomic value types, and references to all</span></div><div><span style="color: #d4d4d4;">value types -- will only be flattened in the heap if they are small enough to</span></div><div><span style="color: #d4d4d4;">fit into the atomic memory operations available.  For references, if there is</span></div><div><span style="color: #d4d4d4;">any additional footprint required to represent null, this additional footprint</span></div><div><span style="color: #d4d4d4;">is included in size for purposes of evaluating "small enough."</span></div>
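<div><span style="color: #d4d4d4;">The "small enough" test is simple arithmetic.  In the sketch below, the method</span></div><div><span style="color: #d4d4d4;">name and all three parameters are hypothetical; in particular, how many extra</span></div><div><span style="color: #d4d4d4;">bits the null channel costs, and how wide the cheapest atomic access is, depend</span></div><div><span style="color: #d4d4d4;">on the JVM and the hardware.</span></div>

```java
// Hypothetical helper; all three inputs are platform-dependent assumptions.
final class LowFlat {
    // payloadBits:   size of the value's fields
    // nullBits:      extra footprint for the null channel (0 for a value
    //                type, or when null can be hidden in slack bits)
    // maxAtomicBits: widest cheap-enough atomic load and store
    static boolean flattensInHeap(int payloadBits, int nullBits, int maxAtomicBits) {
        return maxAtomicBits >= payloadBits + nullBits;
    }
}
```

<div><span style="color: #d4d4d4;">Under these assumptions, an atomic value holding two </span><span style="color: #ce9178;">`int`</span><span style="color: #d4d4d4;"> fields fits a 64-bit</span></div><div><span style="color: #d4d4d4;">atomic access, but a reference to it that needs one more bit for null does not.</span></div>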

<div><span style="color: #569cd6;font-weight: bold;">### Coding guidelines</span></div>

<div><span style="color: #d4d4d4;">On the stack (method parameters and returns, locals) the performance difference</span></div><div><span style="color: #d4d4d4;">between using </span><span style="color: #ce9178;">`V.ref`</span><span style="color: #d4d4d4;"> and </span><span style="color: #ce9178;">`V.val`</span><span style="color: #d4d4d4;"> is minimal; disavowing identity is enough.</span></div><div><span style="color: #d4d4d4;">Among other things, this means that migrating value-based classes to true value</span></div><div><span style="color: #d4d4d4;">classes should provide an immediate boost with no code changes.  (Further gains</span></div><div><span style="color: #d4d4d4;">can be had by using the </span><span style="color: #ce9178;">`.val`</span><span style="color: #d4d4d4;"> companion in the heap, but it is generally not</span></div><div><span style="color: #d4d4d4;">necessary to use it in method signatures or locals.)</span></div>

<div><span style="color: #d4d4d4;">The most important consideration for heap flattening is whether the type</span></div><div><span style="color: #d4d4d4;">disavows not only identity, but atomicity.  If a type disavows atomicity, then</span></div><div><span style="color: #d4d4d4;">its value companion will get full flattening in the heap.  </span></div>


</div>

  </body>

</html>