<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


</head>


<body style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">


Hi,


<div><br>


</div>


<div>I heard from Stefan Karlsson that there was some interest among several committers external to Oracle</div>


<div> (Thomas Stüfe, and some more) from OCW about what I’m currently doing with a new streaming object</div>


<div>loader for CDS. Sounded like some people would like a bit more visibility into it. So this isn’t at all a</div>


<div>review request, but more of a heads-up about what I’m working on, for any interested readers.</div>


<div><br>


</div>


<div>First of all, the state of my work is that it’s a prototype whose rough edges I’m trying to iron out before</div>


<div>upstreaming it to mainline. I’m getting closer, but I don’t mind giving a sneak peek at the current state right</div>


<div>now. It might look a bit messy though.</div>


<div>PR:  <a href="https://github.com/fisk/jdk/commits/8310823_object_streaming/">https://github.com/fisk/jdk/commits/8310823_object_streaming/</a></div>


<div>Bug tracking the changes: <a href="https://bugs.openjdk.org/browse/JDK-8310823">https://bugs.openjdk.org/browse/JDK-8310823</a></div>


<div><br>


</div>


<div>The spirit of the new mechanism is to be GC-agnostic and to load and link at object granularity, </div>


<div>letting the GC decide how to lay it all out in the heap. Indeed, it works for all our GCs. However, the</div>


<div> initial plan is to enable it for ZGC. We don’t have any object loader that works with ZGC at all </div>


<div>today, so this will be an obvious improvement when using ZGC. As for the other GCs with a loader, we</div>


<div>have something that already works, and I’d like to give the new mechanism some mileage, especially in </div>


<div>combination with things cooking in Leyden, before we make any decisions at all about what to do about</div>


<div> the existing solution. Heuristically though, I’m turning on the new mode when dumping an archive without</div>


<div> compressed oops, because then anyone can load from that format, regardless of GC choice, indeed</div>


<div> including ZGC. That seems to make sense to me.</div>


<div><br>


</div>


<div>


<div>The streaming archive heap loader loads Java objects using normal allocations. It requires the objects</div>


<div>to be ordered in DFS order already at dump time, given the set of roots into the archived heap.</div>


<div>Since the objects are ordered in DFS order, that means that walking them linearly through the archive</div>


<div>is equivalent to performing a DFS traversal, but without pushing and popping anything.</div>
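<div><br>
</div>
<div>To illustrate the equivalence, here is a toy Python model (not HotSpot code; the graph and all names are made up for the example). It lays out a small object graph in DFS order and then visits it with a plain front-to-back scan:</div>
<pre>
```python
# Toy model (not HotSpot code): each object name maps to the names it references.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def dfs(obj, seen, out):
    """Classic recursive DFS that appends objects in first-visit order."""
    if obj in seen:
        return
    seen.add(obj)
    out.append(obj)
    for child in graph[obj]:
        dfs(child, seen, out)

# Dump time: lay the objects out in the archive in DFS order from the roots.
archive = []
dfs("A", set(), archive)

# Load time: visit objects by walking the archive front to back.
# No stack and no visited set are needed, yet the visit order is the DFS order.
visited = []
for position in range(len(archive)):
    visited.append(archive[position])
```
</pre>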


<div><br>


</div>


<div>The advantage of this pre-ordering, other than the obvious locality improvement, is that we can have</div>


<div>a separate thread, the CDSThread, perform this walk, in a way that allows us to split the archived</div>


<div>heap into three separate zones. The first zone contains objects that have been transitively materialized,</div>


<div>the second zone contains objects that are currently being materialized, and the last zone contains</div>


<div>objects that have not been, and are not about to be, touched by the CDS thread.</div>


<div>Whenever a new root is traversed by the CDS thread, the zones are shifted atomically under a lock.</div>


<div><br>


</div>


<div>ASCII visualization of the three zones:</div>


<div><br>


</div>


<div>|      transitively materialized        |       currently materializing        |        not yet materialized         |</div>


<div><br>


</div>


<div>Splitting the memory into these three zones allows the bootstrapping thread, and potentially</div>


<div>other threads, to traverse a root under a lock and know how to coordinate with the</div>


<div>concurrent CDS thread. Whenever the traversal finds an object in the "transitively materialized"</div>


<div>zone, then we know such objects don't need any processing at all. As for "currently materializing",</div>


<div>we know that if we just stay out of the way and let the CDSThread finish its current root, then</div>


<div>the transitive closure of such objects will be materialized. And the CDSThread can materialize faster</div>


<div>than the rest as it doesn't need to perform any traversal. Finally, as for objects in the "not yet</div>


<div>materialized" zone, we know that we can trace through it without stepping on the toes of the CDSThread</div>


<div>which has published it won't be tracing anything in there.</div>
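<div><br>
</div>
<div>A rough sketch of how such zone bookkeeping could look, as a toy Python model rather than the actual implementation. It assumes that the objects first reached from each root occupy a contiguous range of DFS indices, so two boundaries are enough. The class and method names are hypothetical:</div>
<pre>
```python
import threading

class Zones:
    """Two boundaries over 1-based DFS indices split the archive in three.

    [1, done_end)           transitively materialized
    [done_end, active_end)  currently materializing (owned by the CDS thread)
    [active_end, ...)       not yet materialized
    """

    def __init__(self):
        self._lock = threading.Lock()
        self.done_end = 1
        self.active_end = 1

    def shift(self, next_root_end):
        """CDS thread: finish the current root and claim the next one."""
        with self._lock:
            self.done_end = self.active_end
            self.active_end = next_root_end

    def classify(self, dfs_index):
        """Other threads: decide what to do with an object they reached."""
        with self._lock:
            if dfs_index >= self.active_end:
                return "trace"  # safe to trace: the CDS thread stays out
            if dfs_index >= self.done_end:
                return "wait"   # let the CDS thread finish its root
            return "done"       # already materialized, nothing to do
```
</pre>
<div>In this sketch the lock only protects the two boundaries. A real loader would also have to act on the answer while still coordinating with the CDS thread.</div>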


<div><br>


</div>


<div>What we get from this is fast iterative traversal from the CDS thread (IterativeObjectLoader)</div>


<div>while allowing laziness and concurrency with the rest of the program (TracingObjectLoader).</div>


<div>This way the CDS thread can remove the bulk of the work of materializing the Java objects from</div>


<div>the critical bootstrapping thread.</div>


<div><br>


</div>


<div>When we start materializing objects, we have not yet come to the point in the bootstrapping where</div>


<div>GC is allowed. This is a double-edged sword. On the one hand, we can materialize objects faster</div>


<div>when we know there is no GC to coordinate with, but on the other hand, if we need to perform</div>


<div>a GC when allocating memory for archived objects, we will bring down the entire JVM. To deal with this,</div>


<div>the CDS thread asks the GC for a budget of bytes it is allowed to allocate before GC is allowed.</div>


<div>When we get to the point in the bootstrapping where GC is allowed, we resume materializing objects</div>


<div>that didn't fit in the budget. Before we let the application run, we force materialization of any</div>


<div>remaining objects that have not been materialized by the CDS thread yet, so that we don't get</div>


<div>surprising OOMs due to object materialization while the program is running.</div>
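<div><br>
</div>
<div>The budget idea can be sketched as follows. This is a toy Python model with made-up object sizes and a made-up budget; how the GC actually computes the budget is not modelled here:</div>
<pre>
```python
def materialize_with_budget(sizes, budget_bytes):
    """Allocate objects in archive order until the next one would exceed
    the budget. Returns how many were materialized before GC is allowed."""
    used = 0
    count = 0
    for size in sizes:
        if used + size > budget_bytes:
            break
        used += size
        count += 1
    return count

sizes = [16, 24, 32, 16, 48]  # hypothetical object sizes in bytes
early = materialize_with_budget(sizes, budget_bytes=80)

# Once GC is allowed, the remainder is materialized without a budget,
# and all of it is finished before application code runs.
late = len(sizes) - early
```
</pre>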


<div><br>


</div>


<div>The object format of the archived heap is similar to a normal object. However, references are encoded</div>


<div>as DFS indices, which ultimately map to the object's position in the buffer, as the objects are laid out</div>


<div>in DFS order. The DFS indices start at 1 for the first object, and hence the number 0 represents</div>


<div>null. The DFS index of objects is a core identifier of objects in this approach. From this index</div>


<div>it is possible to find out what offset the archived object has into the buffer, as well as finding</div>


<div>mappings to Java heap objects that have been materialized.</div>
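<div><br>
</div>
<div>As a small illustration of the indexing scheme (a toy Python model; the sizes and table names are assumptions made for the example), an offset table and a table of materialized objects can both be keyed by DFS index, with slot 0 reserved for null:</div>
<pre>
```python
NULL_INDEX = 0  # DFS index 0 encodes null; real objects start at 1

# Hypothetical archive: object sizes in bytes, stored in DFS order.
sizes = [16, 24, 32]

# offsets[i] is the buffer offset of the object with DFS index i.
offsets = [None]  # slot 0 is unused because index 0 means null
position = 0
for size in sizes:
    offsets.append(position)
    position += size

# Table from DFS index to the materialized heap object, filled in as
# objects are allocated.
materialized = [None] * (len(sizes) + 1)

def resolve(dfs_index):
    """Decode an archived reference into a heap object (or None)."""
    if dfs_index == NULL_INDEX:
        return None
    return materialized[dfs_index]
```
</pre>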


<div><br>


</div>


<div>The table mapping DFS indices to Java heap objects is filled in when an object is allocated.</div>


<div>Materializing objects involves allocating the object, initializing it, and linking it with other</div>


<div>objects. Since linking the object requires whatever is being referenced to be at least allocated,</div>


<div>the iterative traversal will first allocate all of the objects in its zone being worked on, and then</div>


<div>perform initialization and linking in a second pass. What these passes have in common is that they</div>


<div>are trivially parallelizable, should we ever need to do that. The tracing materialization links</div>


<div>objects when going "back" in the DFS traversal.</div>
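<div><br>
</div>
<div>The two-pass iterative scheme can be sketched like this, again as a toy Python model with hypothetical names. Because every object in the zone is allocated before any linking happens, cycles and forward references within the zone are not a problem:</div>
<pre>
```python
class Obj:
    """Stand-in for a Java heap object with reference fields."""
    def __init__(self):
        self.fields = []

def materialize_zone(archived, table, first, last):
    """archived[i] lists the DFS indices referenced by object i (0 = null).
    Materializes DFS indices first..last (inclusive) into table."""
    # Pass 1: allocate every object in the zone so each one exists.
    for i in range(first, last + 1):
        table[i] = Obj()
    # Pass 2: initialize and link; every in-zone target already exists.
    for i in range(first, last + 1):
        table[i].fields = [table[ref] if ref != 0 else None
                           for ref in archived[i]]

# Hypothetical archive with a cycle: 1 -> 2, 2 -> 3 and null, 3 -> 1.
archived = {1: [2], 2: [3, 0], 3: [1]}
table = [None] * 4
materialize_zone(archived, table, 1, 3)
```
</pre>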


<div><br>


</div>


<div>The forwarding information for the mechanism contains raw oops before GC is allowed, and as we</div>


<div>enable GC in the bootstrapping, all raw oops are handleified using OopStorage. All handles are</div>


<div>handed back from the CDS thread when materialization has finished. The switch from raw oops to</div>


<div>using OopStorage handles happens under a lock while neither iteration nor tracing is allowed.</div>
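<div><br>
</div>
<div>A minimal sketch of that switch, as a toy Python model. The Handle class here is only a stand-in for an OopStorage handle, and the table API is invented for the example:</div>
<pre>
```python
import threading

class Handle:
    """Stand-in for an OopStorage handle: a slot the GC could update."""
    def __init__(self, obj):
        self.obj = obj

class ForwardingTable:
    def __init__(self, size):
        self._lock = threading.Lock()
        self._slots = [None] * size
        self._handleified = False

    def put(self, index, obj):
        with self._lock:
            self._slots[index] = Handle(obj) if self._handleified else obj

    def get(self, index):
        with self._lock:
            slot = self._slots[index]
            return slot.obj if isinstance(slot, Handle) else slot

    def enable_gc(self):
        """Wrap every raw reference in a handle, all under the lock, so
        no reader or writer can observe a half-converted table."""
        with self._lock:
            self._slots = [None if s is None else Handle(s)
                           for s in self._slots]
            self._handleified = True
```
</pre>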


<div><br>


</div>


<div>Initialization is also performed in a faster way while GC is not yet allowed. In particular,</div>


<div>before GC is allowed, we perform raw memcpy of the archived object into the Java heap. Then the</div>


<div>object is initialized with IS_DEST_UNINITIALIZED stores. The assumption made here is that before</div>


<div>any GC activity is allowed, we shouldn't have to worry about concurrent GC threads scanning the</div>


<div>memory and getting tripped up by that. Once GC is enabled, we revert to a bit more careful approach</div>


<div>that uses a pre-computed bitmap to find the holes where oops go, and carefully copy only the</div>


<div>non-oop information with memcpy, while the oops are set separately with HeapAccess stores that</div>


<div>should be able to cope well with concurrent activity.</div>
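<div><br>
</div>
<div>The careful, bitmap-driven copy can be sketched as follows. This is a toy Python model: the 8-byte slot size, the byte layout, and the store_oop callback (standing in for a HeapAccess store) are all assumptions made for the example:</div>
<pre>
```python
SLOT = 8  # assumed slot size in bytes for this toy model

def copy_object(src, oop_bitmap, store_oop):
    """Copy src (bytes) into a new bytearray. Slots whose bit is set in
    oop_bitmap hold references and are written through store_oop; all
    other slots are copied as raw bytes."""
    dst = bytearray(len(src))
    for slot in range(len(src) // SLOT):
        lo, hi = slot * SLOT, (slot + 1) * SLOT
        if oop_bitmap[slot]:
            dfs_index = int.from_bytes(src[lo:hi], "little")
            dst[lo:hi] = store_oop(dfs_index)
        else:
            dst[lo:hi] = src[lo:hi]
    return dst

def fake_store(dfs_index):
    """Hypothetical reference store: pretend object i lives at 0x1000 * i."""
    return (0x1000 * dfs_index).to_bytes(SLOT, "little")

# One header slot, one reference slot (DFS index 2), one primitive slot.
src = b"".join(n.to_bytes(SLOT, "little") for n in (7, 2, 99))
dst = copy_object(src, [False, True, False], fake_store)
```
</pre>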


<div><br>


</div>


<div>The same bitmap that tracks where the oops are is also reused to signal which string</div>


<div>objects should be interned. At dump time, some of the referenced strings were interned. This is</div>


<div>really an identity property. We don't need to dump the entire string table as a way of communicating</div>


<div>this identity property. Instead we intern strings on-the-fly, exploiting the dynamic object</div>


<div>level linking that this approach has chosen to our advantage.</div>
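<div><br>
</div>
<div>On-the-fly interning can be sketched like this, as a toy Python model. JString and string_table are hypothetical stand-ins for a Java string object and the runtime string table:</div>
<pre>
```python
class JString:
    """Stand-in for a java.lang.String object with its own identity."""
    def __init__(self, value):
        self.value = value

string_table = {}  # stand-in for the runtime's interned string table

def materialize_string(value, should_intern):
    """Return the canonical object when the dump marked this string as
    interned, otherwise a fresh, distinct object."""
    if not should_intern:
        return JString(value)
    # setdefault returns the existing canonical object if one is present.
    return string_table.setdefault(value, JString(value))
```
</pre>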


</div>


<div><br>


</div>


<div>An advantage of separating the on-disk object format from the in-memory format in the</div>


<div>heap is that the objects can also be compressed in the disk format. I haven’t explored this yet, but</div>


<div>I suspect it can be quite beneficial for cold starts, at a cost of warm starts probably performing a bit worse.</div>




<div><br>


</div>


<div>I’m currently working my way through tests to get things looking green, as well as cleaning up the</div>


<div>code to prepare it for eventual upstreaming.</div>


<div><br>


</div>


<div>In terms of results, as long as my CDS thread gets a core to run on, startup latency for the</div>


<div>bootstrapping thread is ~1 ms, even when the archive gets inflated to ~5 MB. The concurrent</div>


<div>thread soaks up ~11 ms worth of work that is performed efficiently way before we are about to</div>


<div>start running any user code really.</div>


<div><br>


</div>


<div>If anyone has feedback, or wants to help explore compression or similar ideas,</div>


<div>feel free to jump in if it sounds interesting. Another thing I know is very helpful is to map the</div>


<div>CDS regions with MAP_POPULATE when available. That’s also something I wouldn’t mind help with</div>


<div>upstreaming, if anyone is interested in having a look at that.</div>
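<div><br>
</div>
<div>For readers unfamiliar with MAP_POPULATE, here is a small Unix-only Python illustration of the idea, not of the HotSpot mapping code. It asks the kernel to pre-fault the mapped pages when the flag exists (Linux, Python 3.10 or newer) and falls back to a plain mapping otherwise. The file is a throwaway stand-in for an archive region:</div>
<pre>
```python
import mmap
import os
import tempfile

def map_prefaulted(path):
    """Map a file read-only, asking the kernel to pre-fault its pages
    when MAP_POPULATE is available; otherwise map it normally."""
    flags = mmap.MAP_PRIVATE | getattr(mmap, "MAP_POPULATE", 0)
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, flags=flags, prot=mmap.PROT_READ)

# Usage with a throwaway file standing in for a CDS archive region.
fd, path = tempfile.mkstemp()
os.write(fd, b"archived heap bytes")
os.close(fd)
region = map_prefaulted(path)
first_bytes = region[:8]
region.close()
os.remove(path)
```
</pre>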


<div><br>


</div>


<div>Either way, I hope that making my current thoughts on this visible at this relatively early stage is helpful.</div>


<div>I’m open to any feedback. I hope people like the approach.</div>


<div><br>


</div>


<div>Thanks,</div>


<div>/Erik</div>


</body>


</html>