for the record: heat-based sorting, driven by training runs
The following thoughts are “for the record”, not “for immediate implementation”. They are an example of good ideas that come up in our conversations, are given some thought, and are then prioritized down the list. For later; maybe even for never. The important thing is to have a reasonably complete record, so we don’t have to rediscover and rediscuss such points. There are ever so many such points; we would be at it forever rediscovering and rediscussing. With that said…

It is well known that some workloads (in many languages) sometimes benefit from changing the order of fields in structs (or objects, or whatever they are called in whatever language). In simple terms, if X.A is accessed very often, and X.Z is accessed seldom, then there is little or no penalty to putting them on opposite ends of a struct layout, even though the struct might lie across a data cache line boundary, so that X.A and X.Z must incur distinct data cache fill events. The same point is true if the struct is used heavily in two different ways (different program phases), so that X.Z is sometimes accessed frequently, but never near the time when X.A was accessed frequently. A similar point holds if X.A and X.B are always accessed together: you don’t want them far away from each other, far enough that they are likely to end up on distinct cache lines (requiring double the number of fill events for a use of both).

For Java, the object header is, for some classes, accessed very frequently indeed. And for others it is almost never accessed, except during creation and GC! It depends on whether the class in question is used solely as a static type. There are plenty of examples; the internal nodes of hash tables work that way. In any case, the header works like just another field X.H in this analysis, and may be accessed seldom or frequently, coupled with other fields or independently.
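To make the cache-line arithmetic concrete, here is a trivial sketch. The 64-byte line size and the field offsets are illustrative assumptions, not real HotSpot layout values:

```java
// Hypothetical sketch: which cache line does a field offset fall on?
public class CacheLines {
    static final int LINE = 64;  // assumed data-cache line size in bytes

    static long lineOf(long fieldOffset) {
        return fieldOffset / LINE;
    }

    // Using both fields costs two fill events iff they sit on different lines.
    static boolean sameLine(long a, long b) {
        return lineOf(a) == lineOf(b);
    }

    public static void main(String[] args) {
        long offsetA = 8;    // hot field X.A near the start of the layout
        long offsetZ = 120;  // cold field X.Z at the far end
        System.out.println(sameLine(offsetA, offsetZ));  // false: two lines
        System.out.println(sameLine(8, 16));             // true: one line
    }
}
```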
There is a distinct NP-hard feel about the optimization problems here, and yet very simple tactics can lead to solid benefits (often in the single-digit percentage range). Sorting fields by frequency of access is well known. It covers the case where X.A and X.B are heavily used together. It doesn’t cover the case where both are heavily used but at different times. That doesn’t matter; it covers enough cases to be beneficial as a standard move.

A similar standard move is sorting functions in a code heap by hotness. You don’t necessarily have to look at their caller/callee interrelationships, although that’s a next step. And a third similar standard move is to sort the basic blocks within a function, again by hotness. For that one, you need to respect their intrinsic in/out relations, from the start. What I say about struct layout can be cross-applied to these other problem areas as well. Let’s just call all of them “heat-based sorting”, whether the heat is the primary sort key or whether it is a “secondary” influence on the resulting order.

A good practical study of how heat-based sorting can help is here:

https://dl.acm.org/doi/abs/10.1109/CGO51591.2021.9370314

This is the Facebook HipHop VM, not a JVM, but some of their experiences are relevant to us.

For object layout, the need for holes to get word alignment will make a pure heat-based sort impractical, but a simple heat-based sorting pass, followed by some hole-filling that violates the heat-based order, might work out fine. The object header should probably go next to the hottest field, in most cases. But in cases where the object header is cold, the object layout could start with cold fields.

In order to make this work, the training run would have to gather heat statistics. I imagine a per-class array of counters, indexed by byte offset within the class’s current object layout. Reading or writing a field (or the header) increments the counter.
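As a concrete (and entirely hypothetical) illustration of the “heat-based sort, then hole-filling” idea, here is a sketch in Java. The field names, sizes, and heat values are made up, and each field’s size doubles as its alignment requirement, which is a simplification:

```java
import java.util.*;

// Hypothetical sketch: sort fields by heat, place them in order, and when a
// placement would create an alignment hole, fill the hole with the hottest
// remaining field that fits, violating pure heat order.
public class HeatLayout {
    record Field(String name, int size, long heat) {}  // size doubles as alignment

    // Returns field names in placement order.
    static List<String> layout(List<Field> fields) {
        List<Field> todo = new ArrayList<>(fields);
        todo.sort(Comparator.comparingLong((Field f) -> -f.heat()));  // hottest first
        List<String> order = new ArrayList<>();
        int offset = 0;
        while (!todo.isEmpty()) {
            Field next = todo.get(0);
            int pad = (next.size() - offset % next.size()) % next.size();
            if (pad > 0) {
                // Alignment hole: look for a cooler field that fits in it.
                for (Field f : todo) {
                    if (f.size() <= pad) { next = f; break; }  // violate heat order
                }
            }
            todo.remove(next);
            offset += (next.size() - offset % next.size()) % next.size() + next.size();
            order.add(next.name());
        }
        return order;
    }
}
```

For example, with fields a(8 bytes, hottest), b(4), c(8), d(4, coldest), a pure heat sort gives a, b, c, d, but placing c after b would leave a 4-byte hole, so d is pulled forward: a, b, d, c.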
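The counter machinery might be sketched as below; the slot granularity, exact-count threshold, and probabilistic damping (one way to keep the instrumentation from perturbing the memory fabric) are all invented values, not a real design:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: per-class heat counters indexed by byte offset within
// the current object layout. Past a threshold, increments become probabilistic
// so hot fields generate far less counter write traffic.
public class FieldHeat {
    static final int SLOT = 4;           // counter granularity in bytes (assumption)
    static final int EXACT_LIMIT = 256;  // count exactly up to here (assumption)
    static final int SHIFT = 6;          // then update only 1 time in 64

    final long[] counters;               // one counter per SLOT-byte offset

    FieldHeat(int instanceSizeInBytes) {
        counters = new long[instanceSizeInBytes / SLOT];
    }

    // Called (conceptually) on every read or write of the field at 'offset';
    // offset 0 would stand for the object header in this model.
    void onAccess(int offset) {
        int i = offset / SLOT;
        if (counters[i] < EXACT_LIMIT) {
            counters[i]++;
        } else if ((ThreadLocalRandom.current().nextInt() & ((1 << SHIFT) - 1)) == 0) {
            counters[i]++;  // rare update: ~64x fewer writes on hot fields
        }
    }

    // Recover an (approximate) access count from the stored value.
    long estimate(int offset) {
        long c = counters[offset / SLOT];
        return c <= EXACT_LIMIT ? c : EXACT_LIMIT + ((c - EXACT_LIMIT) << SHIFT);
    }
}
```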
(Extra points for probabilistic or indirect counters which don’t drive the memory fabric insane, making the application dynamics non-representative of real, non-instrumented deployments. Heisenberg and all that.)

The assembly phase will then take stock of the training run outputs and re-load everything into a nicely curated bundle for the production runs. As it reloads classes, it evaluates whether their layouts should change (because of interesting observed dynamics). Not all should change, but some might, if they have enough variance across their field statistics. At that point it is a “mere matter of programming” to set up the adjusted (heat-based) layouts. The Valhalla project is adding extra “programmability” to layout logic, so perhaps we will have some new tools from them. (Thanks, Fredric.)

A similar point goes for function ordering in the code cache, on that day (when it comes) when the assembly phase gets fine-grained control of function placement in the code cache. Just as alignment considerations require us to modify a pure heat-based sorting principle, it may be better to put related functions (of approximately similar heat) back to back, even if a pure heat-based sort would shuffle together unrelated functions.

A similar point goes for basic-block sorting, again with caveats about putting sequential blocks in sequence, unless there is evidence that sequence has no benefit (because of large variance in heat). In the case of basic-block sorting, we have a sticky problem (noted by the HipHop people): regeneration of code will probably obliterate basic block identities. In order to reorder basic blocks (in the output phase, for example, which is where C2 chooses order), the ordering heuristics need to access profile data for a previous occurrence (in a previous run) of that SAME block. We don’t currently have a block “sameness” criterion.
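One conceivable shape for such a sameness criterion, sketched here purely as an illustration (C2 has nothing like it today, and every name below is invented), would key a block on its originating method, the bytecode index where the block starts, and the inlining path that produced it:

```java
import java.util.Objects;

// Hypothetical "naming scheme" for basic blocks, stable across separate
// compilations that work from the same classfile and inline tree.
public class BlockName {
    final String methodDescriptor;  // owning classfile method, e.g. "Foo.bar()V"
    final int headBci;              // bytecode index of the block's first bytecode
    final String inlinePath;        // caller chain, so inlined copies get distinct names

    BlockName(String methodDescriptor, int headBci, String inlinePath) {
        this.methodDescriptor = methodDescriptor;
        this.headBci = headBci;
        this.inlinePath = inlinePath;
    }

    // Training and assembly compilations that start blocks at the same BCIs
    // under the same inline tree produce equal names; compilations that split
    // blocks differently simply fail to match, which may be acceptable
    // ("OFTEN, perhaps not always").
    @Override public boolean equals(Object o) {
        return o instanceof BlockName b
            && headBci == b.headBci
            && methodDescriptor.equals(b.methodDescriptor)
            && inlinePath.equals(b.inlinePath);
    }
    @Override public int hashCode() {
        return Objects.hash(methodDescriptor, headBci, inlinePath);
    }
}
```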
HipHop does this by having a very low-level BB-based IR, from which they assemble twice: once with profile counters, and once USING those counters. We could do this (perhaps with deep C2 cuts), or we could contrive to “name” our basic blocks in such a way that they OFTEN (perhaps not always) correspond to IR generation tasks that work from a common source (classfile, inline tree) but do not necessarily create exactly the same set of basic blocks. That might be enough, depending on the “naming scheme” we come up with to refer to basic blocks as they are generated (during training, for profiling) and later a second time (during assembly). I don’t have a clear solution to propose yet, but that is the way I see this particular problem space.

Although it is not urgent yet, at some point somebody (maybe a graduate student?) might wish to try out a targeted experiment with field reordering, and see what pops out. We don’t have time to depend on this now, and we have enough high-priority (cheaper and more rewarding) tasks to fill our plate. But these are my thoughts for the record…

— John
P.S. Also FTR, I sent a related set of ideas to Lilliput. The point there is that the dynamics of certain PARTS of object headers can be predicted by training run profiles, and there may be possible uses for such profiles. Those are far more speculative than the ideas of field, method, and BB ordering.

for the record: header bit dynamics could possibly change after Leyden training runs
https://mail.openjdk.org/pipermail/lilliput-dev/2024-August/001819.html