for the record: heat based sorting, driven by training runs

John Rose john.r.rose at oracle.com
Wed Aug 28 22:26:55 UTC 2024


The following thoughts are “for the record” not “for
immediate implementation”.  They are an example of good
ideas that come up in our conversations, are given some
thought, and are then prioritized down the list.  For later.
Maybe even for never.  The important thing is to have a
reasonably complete record, so we don’t have to rediscover
and rediscuss such points.  There are ever so many such
points that, without a record, we would be at it forever
rediscovering and rediscussing them.

With that said…

It is well known that some workloads (in many languages)
sometimes benefit from changing the order of fields in
structs (or objects or whatever they are called in whatever
language).  In simple terms, if X.A is accessed very often,
and X.Z is accessed seldom, then there is little or no
penalty to putting them on opposite ends of a struct
layout, even though the struct might then straddle a data
cache line boundary, so that X.A and X.Z incur distinct
data cache fill events.

The same point is true if the struct is used heavily in
two different ways (different program phases), so that
X.Z is sometimes accessed frequently, but never near the
time when X.A was accessed frequently.

A similar point holds if X.A and X.B are always accessed
together.  You don’t want them to be far away from each
other, far enough to make it likely that they end up on
distinct cache lines (requiring double the number of
fill events for a use of both).

For Java, the object header is, for some classes, accessed
very frequently indeed.  And for others it is almost never
accessed, except during creation and GC!  It depends on
whether the class in question is used solely as a static
type.  There are plenty of examples; the internal nodes
of hash tables work that way.  In any case, the header
works like just another field X.H, in this analysis,
and may be accessed seldom or frequently, and coupled
with other fields, or independently.

There is a distinct NP-hard feel about the optimization
problems here, and yet very simple tactics can lead to
solid benefits (often in the single digit percentage
range).  Sorting fields by frequency of access is well
known.  It covers the case where X.A and X.B are heavily
used together.  It doesn’t cover the case where both are
heavily used but at different times.  That doesn’t matter;
it covers enough cases to be beneficial as a standard
move.

A similar standard move is sorting functions in a code
heap by hotness.  You don’t necessarily have to look at
their caller/callee interrelationships, although that’s
a next step.  And a third similar standard move is to
sort the basic blocks within a function, again by hotness.
For that one, you need to respect their intrinsic in/out
relations, from the start.  What I say about struct layout
can be cross-applied to these other problem areas as well.
Let’s just call all of them “heat based sorting”, whether
the heat is the primary sort key or only a “secondary”
influence on the resulting order.

A good practical study of how heat-based sorting can help is
here: https://dl.acm.org/doi/abs/10.1109/CGO51591.2021.9370314
This is the Facebook HipHop VM, not a JVM.  But some of their
experiences are relevant to us.

For object layout, the need for holes to get word alignment
will make a pure heat-based sort impractical, but a simple
heat-based sorting pass, followed by some hole-filling that
violates the heat-based order, might work out fine.
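
To make that two-pass idea concrete, here is a minimal Java
sketch (not JVM code; the FieldInfo record, the heat numbers,
and the power-of-two size assumption are all invented for
illustration): sort by descending heat, then let a colder
field jump forward when it fits exactly into an alignment
hole.  The header stays at offset zero here, so the hottest
field lands right next to it.

  // Minimal sketch, not a real JVM layout pass.  FieldInfo and
  // the heat values are hypothetical inputs from a training run.
  import java.util.*;

  record FieldInfo(String name, int size, long heat) {}  // size in bytes, a power of two

  final class HeatLayout {
      // Returns field name -> byte offset, starting just past the header.
      static Map<String, Integer> layout(List<FieldInfo> fields,
                                         int headerSize) {
          List<FieldInfo> pending = new ArrayList<>(fields);
          pending.sort(Comparator.comparingLong(FieldInfo::heat).reversed());
          Map<String, Integer> offsets = new LinkedHashMap<>();
          int cursor = headerSize;
          while (!pending.isEmpty()) {
              FieldInfo next = pending.get(0);           // hottest remaining field
              int aligned = align(cursor, next.size());
              int hole = aligned - cursor;
              FieldInfo chosen = next;
              if (hole > 0) {
                  // Hole-filling: a colder field that fits at the current
                  // cursor with no padding may jump the heat-based queue.
                  for (FieldInfo f : pending) {
                      if (f.size() <= hole && align(cursor, f.size()) == cursor) {
                          chosen = f;
                          break;
                      }
                  }
              }
              int at = align(cursor, chosen.size());
              offsets.put(chosen.name(), at);
              cursor = at + chosen.size();
              pending.remove(chosen);
          }
          return offsets;
      }

      static int align(int offset, int alignment) {
          return (offset + alignment - 1) & -alignment;  // alignment is a power of two
      }
  }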

The object header should probably go next to the hottest
field, in most cases.  But in cases where the object header
is cold, the object layout could start with cold fields.

In order to make this work, the training run would have to
gather heat statistics.  I imagine a per-class array of
counters, indexed by offset (byte offset) within the class’s
current object layout.  Reading or writing a field (or the
header) increments the counter.  (Extra points for probabilistic
or indirect counters which don’t drive the memory fabric insane,
making the application dynamics non-representative of real,
non-instrumented deployments.  Heisenberg and all that.)
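
For concreteness, here is a minimal sketch of what such
instrumentation might record, assuming a per-class table of
counters indexed by byte offset and a crude sampling trick to
keep the counter traffic down.  None of these names correspond
to existing HotSpot structures.

  // Hypothetical per-class heat profile; slot 0 covers the header.
  import java.util.concurrent.ThreadLocalRandom;
  import java.util.concurrent.atomic.AtomicLongArray;

  final class FieldHeatProfile {
      private static final int SAMPLE_SHIFT = 6;   // count ~1 in 64 accesses
      private final AtomicLongArray countsByOffset;

      FieldHeatProfile(int instanceSizeInBytes) {
          this.countsByOffset = new AtomicLongArray(instanceSizeInBytes);
      }

      // Conceptually called on every field (or header) access at byteOffset.
      void recordAccess(int byteOffset) {
          // Probabilistic counting: most accesses skip the counter write,
          // so the instrumented run stays closer to real dynamics.
          int mask = (1 << SAMPLE_SHIFT) - 1;
          if ((ThreadLocalRandom.current().nextInt() & mask) == 0) {
              countsByOffset.incrementAndGet(byteOffset);
          }
      }

      long estimatedHeat(int byteOffset) {
          return countsByOffset.get(byteOffset) << SAMPLE_SHIFT;  // rescale
      }
  }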

The assembly phase will then take stock of the training run
outputs and re-load everything into a nicely curated bundle
for the production runs.  As it reloads classes, it evaluates
whether their layouts should change (because of interesting
observed dynamics).  Not all should change, but some might,
if they have enough variance across their field statistics.
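
As one hedged guess at what “enough variance” could mean in
practice: re-lay-out a class only when its hottest field is
some large factor hotter than the median field.  The ratio
test and its threshold below are placeholders, not a proposal.

  // Hypothetical relayout filter: skip classes whose fields are
  // all about equally warm, since sorting them would buy little.
  import java.util.Arrays;

  final class RelayoutPolicy {
      private static final double MIN_HOT_TO_MEDIAN_RATIO = 8.0;  // placeholder

      static boolean worthRelayingOut(long[] fieldHeats) {
          if (fieldHeats.length < 2) return false;
          long[] sorted = fieldHeats.clone();
          Arrays.sort(sorted);
          long hottest = sorted[sorted.length - 1];
          long median = Math.max(1, sorted[sorted.length / 2]);
          return (double) hottest / median >= MIN_HOT_TO_MEDIAN_RATIO;
      }
  }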

At that point it is a “mere matter of programming” to set
up the adjusted (heat-based) layouts.  The Valhalla project
is adding extra “programmability” to layout logic, so perhaps
we will have some new tools from them.  (Thanks, Fredric.)

A similar point goes for function ordering in the code cache,
on that day (when it comes) when the assembly phase gets
fine-grained control of function placement there.

Just as alignment considerations force us to modify a pure
heat-based sort for fields, here it may be better to put
related functions (of approximately similar heat) back-to-back,
even though a pure heat-based sort would shuffle unrelated
functions in between them.
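
One way to reconcile those two pulls, sketched under the
assumption that we had per-method heat counts and a call graph
to consult: bucket methods by the logarithm of their heat (so
methods within roughly 2x of each other count as peers), then
within each bucket pull callees in right behind their callers.
The MethodInfo shape is invented for the example.

  // Hypothetical code-cache ordering: heat buckets first,
  // caller/callee adjacency second.
  import java.util.*;

  final class CodeHeapOrder {
      record MethodInfo(String name, long heat, List<String> callees) {}

      static List<String> order(List<MethodInfo> methods) {
          // Hottest buckets first; methods within ~2x heat share a bucket.
          TreeMap<Integer, List<MethodInfo>> buckets =
                  new TreeMap<>(Comparator.reverseOrder());
          for (MethodInfo m : methods) {
              int bucket = 64 - Long.numberOfLeadingZeros(Math.max(1, m.heat()));
              buckets.computeIfAbsent(bucket, b -> new ArrayList<>()).add(m);
          }
          List<String> placed = new ArrayList<>();
          Set<String> done = new HashSet<>();
          for (List<MethodInfo> bucket : buckets.values()) {
              Set<String> inBucket = new HashSet<>();
              bucket.forEach(m -> inBucket.add(m.name()));
              for (MethodInfo m : bucket) {
                  if (!done.add(m.name())) continue;  // already placed behind a caller
                  placed.add(m.name());
                  // Keep similar-heat callees right behind their caller.
                  for (String callee : m.callees()) {
                      if (inBucket.contains(callee) && done.add(callee)) {
                          placed.add(callee);
                      }
                  }
              }
          }
          return placed;
      }
  }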

A similar point goes for basic-block sorting, again with caveats
about putting sequential blocks in sequence, unless there is
evidence that sequence has no benefit (because of large variance
in heat).  In the case of basic-block sorting, we have a sticky
problem (noted by the HipHop people) that regeneration of code
will probably obliterate basic block identities.  In order to
reorder basic blocks (in the output phase, for example, which
is where C2 chooses order), the ordering heuristics need to
access profile data for a previous occurrence (in a previous
run) of that SAME block.  We don’t currently have a block
“sameness” criterion.  HipHop does this by having a very
low level BB-based IR, from which they assemble twice,
once with profile counters, and once USING those counters.
We could do this (perhaps with deep C2 cuts) or we could
contrive to “name” our basic blocks in such a way that
they OFTEN (perhaps not always) correspond to IR generation
tasks that work from a common source (classfile, inline
tree) but do not necessarily create exactly the same set
of basic blocks.  That might be enough, depending on the
“naming scheme” we come up with to refer to basic
blocks as they are generated (during training for
profiling) and later on a second time (during assembly).
I don’t have a clear solution to propose yet, but that
is the way I see this particular problem space.
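
Purely to show the shape of that idea (not to claim a
solution): a block “name” that both the training run and the
assembly phase could compute from the compilation inputs,
rather than from C2’s internal block objects, might look
something like the following.  Whether such keys stay stable
across regeneration is exactly the open question; every name
here is hypothetical.

  // Hypothetical block identity derived from compilation inputs
  // (root method, inline path, bytecode index of the block head),
  // so two compilations of the same code can OFTEN agree on it.
  import java.util.List;

  record BlockName(String rootMethod,        // e.g. "java.util.HashMap::getNode"
                   List<String> inlinePath,  // callee chain reaching this block's method
                   int headBci) {            // bytecode index where the block starts
      // Records give value-based equals/hashCode, so this can key a
      // profile table written during training and read back at assembly.
  }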

Although it is not urgent yet, at some point somebody
(maybe a graduate student?) might wish to try out a targeted
experiment with field reordering, and see what pops out.
We don’t have time to depend on this now, and we have enough
high-priority (cheaper and more rewarding) tasks to fill our
plate.  But, these are my thoughts for the record…

— John

