Valhalla performance model

Brian Goetz brian.goetz at oracle.com
Thu Jun 30 21:24:46 UTC 2022


Here's a *first draft* of a document to go into the SoV docset on the 
performance model.


# State of Valhalla
## Part 5: Performance Model {.subtitle}
#### Brian Goetz {.author}
#### June 2022 {.date}
This document describes performance considerations for value classes under
Project Valhalla.  While we describe the optimizations that we expect the
HotSpot JVM to routinely make, other JVMs may make their own choices, and of
course these choices may vary over time and situations.
## Flattening
Project Valhalla has two broad classes of goals.  The first is 
_unification_:
unifying the treatment of primitives and references in the Java type system.
The second is _performance_: enabling the declaration of aggregates with
_flatter_ and _denser_ layouts than the layout we get with today's identity
classes.
By _flatness_, we mean the number of memory indirections that must be 
traversed
to get to the leaf data in an object graph.  If all object references are
implemented as pointers -- as they almost always are for identity objects --
then each object becomes an "island" of data, requiring indirections 
each time
we hop to another island.  Flatness begets density: each indirection leads
to a separate object with its own header, so eliminating indirections also
reduces the number of object headers.  Flatness also reduces garbage
collection costs, since there are fewer objects in the heap to process.
### Heap flattening
The form of flattening that comes most readily to mind is flattening on the
heap: _inlining_ the layout of some objects into that of other objects (and
arrays), which eliminates island-hopping.  If `C` is an identity class, then
a field of type `C` is laid out as a pointer, and an array of type `C[]` is
laid out as an array of pointers.  If `V` is a value type, we have the
option to lay out fields of type `V` by inlining the fields of `V` into the
layout of the enclosing type, and to lay out arrays of type `V[]` as
repeating (aligned) groups of the fields of `V`.  These layout choices
reduce the number of indirections to get to the fields of `V` by one hop.
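As a concrete sketch, using a plain `record` as a stand-in (the `Point` name
is illustrative, and the `value` modifier shown in the comment is Valhalla
syntax not accepted by today's javac):

```java
// A plain record today; under Valhalla this could become
//   value record Point(int x, int y) { }
// disavowing identity and becoming eligible for flattening.
record Point(int x, int y) { }

class Layouts {
    // With identity objects, pts is an array of pointers: each element is
    // an "island" with its own header, reached through an indirection.
    // With a flattened value class, the JVM may lay the array out as
    // repeating (x, y) pairs: no per-element headers, no hops.
    static long sumX(Point[] pts) {
        long sum = 0;
        for (Point p : pts) sum += p.x();
        return sum;
    }
}
```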
### Calling convention flattening
A less obvious, but also important, form of flattening is flattening in
_method calling conventions_.  If a method argument or return is a reference
to `C`, where `C` is an identity class or polymorphic type (interfaces and
abstract classes), the argument will usually be passed as a pointer on the
stack or in a register.  If `V` is a value type, we have another option: to
_scalarize_ `V` (explode it into its fields) and pass its fields on the
stack or in registers.  (Perhaps surprisingly, in some situations we have
the same option for a strongly typed reference to a value object (a `V.ref`)
as well.)
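A minimal sketch of a call that could be scalarized (the `Complex` class is
a hypothetical example, written as an ordinary record here; the
scalarization itself is a JIT decision, invisible in source):

```java
// An ordinary record standing in for a future value class.
record Complex(double re, double im) {
    // If Complex disavows identity, the JIT may pass `a` and `b` as four
    // doubles in registers, and return the result the same way -- the
    // Complex instances below need never be allocated on the heap at all.
    static Complex add(Complex a, Complex b) {
        return new Complex(a.re() + b.re(), a.im() + b.im());
    }
}
```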
Both heap layouts and calling conventions are determined fairly early in the
execution of a program.  This means that any information needed to make
these flattening choices must be available early in the execution as well.
The `Q` descriptors used by value types act as a preload signal, as does the
`Preload` attribute used for reference companions of value classes.
### Locals
Local variables have even more latitude over representation, because unlike
with layouts and calling conventions, there is no need for separately 
compiled
code to agree on a representation.  Values and references to values may be
routinely scalarized in local variables.
### Which is more important?
It is tempting to assume that heap flattening is more important, but this is
a bias we need to overcome.  Developers tend to be more aware of heap
allocation (we can see the `new` in the code), and heap utilization is more
easily measured with monitoring tools.  But this is mostly observability
bias.  Both are important to performance, and serve complementary goals.
Stack flattening is what makes much of the cost of using boxing and wrapper
classes like `Optional` go away.  As developers, we all flinch a bit when we
have to return a wrapper like `Optional` or a record type that the client is
just going to unpack; this feels like "needless motion".  Stack flattening
allows us to get the benefits of these abstractions without paying this
cost, which shows up as a streamlining of general computational costs.
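For instance, a lookup that returns `Optional` today pays for a wrapper
allocation on every call; under calling-convention scalarization that cost
can disappear (this is ordinary, runnable code -- the scalarization itself
would be done by the JIT, not expressed in source):

```java
import java.util.Optional;

class Lookup {
    // Today, every call that returns a non-empty result allocates an
    // Optional wrapper that the caller immediately unpacks.  Once Optional
    // can be treated as a value class, the JIT may pass its payload (plus
    // a null channel) in registers, making the wrapper effectively free.
    static Optional<String> find(String[] haystack, String needle) {
        for (String s : haystack) {
            if (s.equals(needle)) return Optional.of(s);
        }
        return Optional.empty();
    }
}
```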
Heap flattening serves a different role; it is about flattening and 
compacting
object graphs.  This has a bigger impact on data-intensive code, 
allowing us to
pack more data into a given sized heap and traverse data in the heap more
efficiently.  It also means that the garbage collector has less work to do,
making more CPU cycles and memory bandwidth available for business
calculations.
## Additional considerations
There are two additional considerations that affect performance indirectly:
nullity and tearing.
### Nulls
Nullity is a property of references; null is how a reference refers to no
instance at all.  Values (historically primitives, but now also value 
types) are
never null, so are directly amenable to scalarization.  Perhaps 
surprisingly,
_references_ to value types may also be scalarized by adjoining a synthetic
boolean _null channel_ to represent whether or not the reference is null.
(If the null channel indicates the reference is null, the data in the other
channels should be ignored.)  This null channel may require additional space,
since many
value types (e.g., `int`) use all their bit patterns and therefore would 
need
additional bits to represent nullity.  However, the JVM has a variety of
possible tricks at its disposal to eliminate this extra footprint in 
some cases,
such as using slack in pointers, booleans, or the alignment shadow.
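A hand-written analogue of this encoding (the `NullableInt` name and layout
are illustrative; the real null channel is synthesized by the JVM, not
written by the user):

```java
// A boolean "null channel" adjoined to the value's data channel.  Since
// int uses all 2^32 bit patterns, nullity needs an extra bit -- absent
// JVM tricks such as hiding the flag in the alignment shadow.
final class NullableInt {
    private final boolean present;  // the null channel
    private final int value;        // ignored when present == false

    private NullableInt(boolean present, int value) {
        this.present = present;
        this.value = value;
    }

    static NullableInt of(int v) { return new NullableInt(true, v); }
    static NullableInt nul()     { return new NullableInt(false, 0); }

    boolean isNull() { return !present; }

    int value() {
        if (!present) throw new NullPointerException();
        return value;
    }
}
```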
### Tearing
Whether or not to flatten heap-based variables involves an additional
consideration: the possibility of _tearing_.  Tearing occurs when a read of
a logical quantity (such as a 64-bit integer) is broken up into multiple
physical reads (such as two 32-bit reads), and the results of those reads
correspond to different writes of the logical quantity.
The Java platform has always allowed for some form of tearing: reads and
writes of 64-bit primitives (`long` and `double`) may be broken up into two
32-bit reads and writes.  In the presence of a data race (which is a logic
error), these two reads could return data corresponding to two different
logical writes.
This possible inconsistency was permitted because, at the time, most
hardware lacked the ability to perform atomic 64-bit operations efficiently,
and this problem only occurs in concurrent programs that already have a
serious concurrency bug.  (The recommended cure is to declare the field
`volatile`, but any technique that eliminates the data race, such as
guarding the data with a lock, will also work.)
On modern hardware, most JVMs now use atomic 64-bit instructions for reads
and writes of `long` and `double` variables, as the performance of these
instructions has caught up and so JVMs can provide greater safety at
negligible cost.  However, with the advent of value classes, tearing under
race again becomes a possibility, since one can easily declare a value class
whose layout exceeds the maximum atomic load and store size of any hardware.
Because values are aggregates, some value classes may be less tolerant of
tearing than others.  Specifically, some value classes have representational
invariants across their fields (e.g., a `Range` class that requires that the
lower bound not exceed the upper bound), and exposing code to instances that
do not respect these invariants may be surprising or dangerous.  Accordingly,
value classes may be declared with stronger or weaker atomicity requirements
(e.g., `non-atomic`) that affect whether or not instances may tear under
race -- and which potentially constrain how they are flattened in the heap.
(Tearing is not an issue for local variables or method parameters or
returns, as these are entirely within-thread and therefore not at risk for
data races.)
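The `Range` example can be sketched as follows (written as an ordinary
record; the `non-atomic` modifier mentioned in the comment is proposed
Valhalla syntax, not accepted by today's javac):

```java
// A value-class candidate with a cross-field invariant: lo <= hi.
record Range(int lo, int hi) {
    Range {
        // The constructor enforces the invariant -- but a torn read
        // could still surface a (lo, hi) pair mixing two different
        // writes, violating it without any constructor ever running.
        if (lo > hi) throw new IllegalArgumentException("lo > hi");
    }
    // Under Valhalla, leaving Range atomic (the default) forbids such
    // torn observations; declaring it non-atomic would trade that
    // guarantee for more aggressive flattening.
}
```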
## Layout
Today, object layout is simple: reference types are represented as 
pointers, and
primitive types are represented directly (flat); similarly, arrays of 
reference
types are arrays of pointers, and arrays of primitives are flattened.  These
layout choices are common to heap, calling convention, and local 
representation.
References to identity objects, and the built-in primitives, will surely
continue to use this layout.  But we have additional latitude with value 
types
and references to value objects.  The choice of layout for these new 
types will
depend on a number of factors: whether they have atomicity requirements, 
their
size, the context (heap, stack, or local), and mutability.
There are effectively three possible flattening strategies available, which
we'll call non-, low-, and full-flat.  Non-flat is the same old strategy 
as for
identity objects: pointers.  JVMs are free to fall back to non-flat in any
situation.   Full-flat is full inlining of the layout into the enclosing 
class
or array layout, as we get with primitives today.  Low-flat chooses between
these based on the size of the object layout and the hardware -- if the 
object
fits into cheap-enough atomic loads and stores, flatten as per full-flat,
otherwise fall back to non-flat. (In addition to requiring suitable 
atomic loads
and stores, the low-flat treatment may also require compiler heroics to 
support
reading and writing multiple fields in a single memory access.)
The following table outlines the best we can do based on the desired 
semantics:
| Kind                  | Stack                         | Heap                          |
| --------------------- | ----------------------------- | ----------------------------- |
| Identity object       | Non-flat                      | Non-flat                      |
| Primitive             | Full-flat                     | Full-flat                     |
| Non-atomic value type | Full-flat                     | Full-flat                     |
| Atomic value type     | Full-flat                     | Low-flat (unless final)       |
| Ref to value type     | Full-flat (with null channel) | Low-flat (with null channel)  |
There are two significant attributes of this chart.  First, note that we can
still get full or partial flattening even for some reference types.  The 
other
is that we can flatten more uniformly on the stack (calling convention, 
locals)
than we can on the heap (fields, arrays).  This is because of the 
intrusion of
atomicity and the possibility of data races.  The stack is immune from data
races since it is confined to a single thread, but in the heap, there is 
always
the possibility of concurrent access.  If there are atomicity requirements
(atomic value types, and all references), then the best we can do is to
flatten up to the threshold of atomicity.
For references to value types, the footprint cost may be larger to
accommodate the null channel (absent heroics to encode the null channel in
slack bits).
For final fields, we may be able to upgrade to full-flat even for atomic 
value
types and references to value types, because final fields are not subject to
tearing.  (Technically, they are if the receiver escapes construction, 
but in
this case the Java Memory Model voids the initialization safety 
guarantees that
would prevent tearing.)
For very large values, we would likely choose non-flat even when we could
otherwise flatten; flattening a 1000-field value type is likely to be
counterproductive.
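A hypothetical sketch of this decision logic (the 64-bit threshold and the
method shape are assumptions for illustration, not the actual HotSpot
heuristic, which consults the real hardware and may use wider accesses):

```java
class FlatteningChoice {
    // Assumed size of the largest cheap atomic load/store, in bits.
    static final int MAX_ATOMIC_BITS = 64;

    // Low-flat: flatten only if the payload -- plus any null channel --
    // fits within one cheap atomic access; otherwise fall back to
    // non-flat (a pointer).  Non-atomic variables are always eligible
    // for full-flat, since tearing is accepted.
    static boolean fullFlat(int payloadBits, boolean needsNullChannel,
                            boolean atomic) {
        if (!atomic) return true;
        int bits = payloadBits + (needsNullChannel ? 1 : 0);
        return bits <= MAX_ATOMIC_BITS;
    }
}
```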
### Expected performance
The table above reveals some insights.  It means that we can routinely
expect full flattening in locals and calling convention, regardless of
atomicity, nullity, or reference-ness -- the only thing we need to avoid is
identity.  In turn, this means the performance difference between `V.ref`
and `V.val` for method parameters, returns, and locals is minimal (though
there are some second-order effects due to the null channel, such as
register pressure and null-check instructions).
The real difference between `V.ref` and `V.val` shows up in the heap, which
means fields and array elements.  (Arrays are particularly important because
the same structure is repeated many times, so any footprint and indirection
costs are multiplied by the array size.)  While identity was an absolute
impediment to flattening in the heap, once that is removed, the next
impediment reveals itself: atomicity.  Java has long offered a powerful
safety guarantee of _initialization safety_ for final fields, even when the
object reference is published via a race.  This is where the oft-repeated
wisdom that "immutable objects are automatically thread-safe" comes from; it
relies on the atomicity of loading object references.  To avoid consistency
surprises, value types provide atomicity by default, but can be marked
`non-atomic` to achieve greater flattening (at the cost of tearing under
race).
Mutable variables of atomic types -- atomic value types, and references to
all value types -- will only be flattened in the heap if they are small
enough to fit into the atomic memory operations available.  For references,
if any additional footprint is required to represent null, that footprint is
included in the size for purposes of evaluating "small enough."
### Coding guidelines
On the stack (method parameters and returns, locals) the performance
difference between using `V.ref` and `V.val` is minimal; disavowing identity
is enough.  Among other things, this means that migrating value-based
classes to true value classes should provide an immediate boost with no code
changes.  (Further gains can be had by using the `.val` companion in the
heap, but it is generally not necessary to use it in method signatures or
locals.)
The most important consideration for heap flattening is whether the type
disavows not only identity, but atomicity.  If a type disavows atomicity,
then its value companion will get full flattening in the heap.