RFC: CDS Streaming object loader

Erik Osterlund erik.osterlund at oracle.com
Mon Feb 5 15:35:21 UTC 2024


Hi,

I heard from Stefan Karlsson that there was some interest among several committers external to Oracle
 (Thomas Stüfe, and some more) from OCW about what I’m currently doing with a new streaming object
loader for CDS. Sounded like some people would like a bit more visibility into it. So this isn’t at all a
review request, but more of a heads-up about what I’m working on, for any interested readers.

First of all, the state of my work is that it's a prototype that I'm trying to iron out the rough corners of
before upstreaming it to mainline. I'm getting closer, but I don't mind giving a sneak peek at the current
state right now. It might look a bit messy though.
PR:  https://github.com/fisk/jdk/commits/8310823_object_streaming/
Bug tracking the changes: https://bugs.openjdk.org/browse/JDK-8310823

The spirit of the new mechanism is to be GC agnostic and to load and link at object granularity,
letting the GC decide how to lay it all out in the heap. Indeed, it works for all our GCs. However, the
initial plan is to enable it for ZGC. We don't have any object loader that works with ZGC at all
today, so this will be an obvious improvement when using ZGC. As for the other GCs, which already have
a loader that works, I'd like to give the new mechanism some mileage, especially in
combination with things cooking in Leyden, before we make any decisions at all about what to do with
the existing solution. Heuristically though, I'm turning on the new mode when dumping an archive without
compressed oops, because then anyone can load from that format, regardless of GC choice, indeed
including ZGC. That seems to make sense to me.

The streaming archive heap loader loads Java objects using normal allocations. It requires the objects
to be laid out in DFS order already at dump time, given the set of roots into the archived heap.
Because of that ordering, walking the archive linearly is equivalent to performing a DFS traversal,
but without pushing or popping anything.
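
To illustrate the idea, here is a minimal sketch of how such a dump-time ordering could be computed;
the types and names are hypothetical and purely illustrative, not the code in the PR:

    #include <unordered_map>
    #include <vector>

    struct Obj { std::vector<Obj*> fields; };

    // Pre-order DFS numbering: each reachable object gets a 1-based index,
    // and the archive buffer is emitted in exactly this order.
    static void dfs_number(Obj* o, std::unordered_map<Obj*, int>& index, int& next) {
      if (o == nullptr || index.count(o) != 0) return;  // null or already numbered
      index[o] = next++;                                // number before visiting children
      for (Obj* f : o->fields) dfs_number(f, index, next);
    }

    std::unordered_map<Obj*, int> number_archive(const std::vector<Obj*>& roots) {
      std::unordered_map<Obj*, int> index;
      int next = 1;                                     // 0 is reserved for null
      for (Obj* r : roots) dfs_number(r, index, next);
      return index;
    }

Since the buffer is written in that order, a plain linear scan at load time visits objects in the same
order a DFS would, with no mark stack needed.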

The advantage of this pre-ordering, other than the obvious locality improvement, is that we can have
a separate thread, the CDSThread, perform this walk, in a way that allows us to split the archived
heap into three separate zones. The first zone contains objects that have been transitively materialized,
the second zone contains objects that are currently being materialized, and the last zone contains
objects that have not and are not about to be touched by the CDS thread.
Whenever a new root is traversed by the CDS thread, the zones are shifted atomically under a lock.

ASCII visualization of the three zones:

|      transitively materialized        |       currently materializing        |        not yet materialized         |

Being able to split the memory into these three zones allows the bootstrapping thread and potentially
other threads to, under a lock, traverse a root and know how to coordinate with the
concurrent CDS thread. Whenever the traversal finds an object in the "transitively materialized"
zone, we know such objects don't need any processing at all. As for "currently materializing",
we know that if we just stay out of the way and let the CDSThread finish its current root, then
the transitive closure of such objects will be materialized. And the CDSThread can materialize faster
than the rest as it doesn't need to perform any traversal. Finally, as for objects in the "not yet
materialized" zone, we know that we can trace through them without stepping on the feet of the CDSThread,
which has published that it won't be tracing anything in there.
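
As a rough sketch of that coordination (again with made-up names, not the code in the PR), a tracing
thread could classify an object by its DFS index against the two zone boundaries published by the CDS
thread:

    #include <cstdint>

    struct Zones {
      uint32_t transitively_materialized_end;  // DFS indices below this are fully done
      uint32_t currently_materializing_end;    // the CDS thread is working up to here
    };

    enum class Action { Nothing, WaitForCDSThread, TraceItOurselves };

    // Called under the lock that also guards shifting of the zone boundaries.
    Action classify(const Zones& z, uint32_t dfs_index) {
      if (dfs_index < z.transitively_materialized_end) {
        return Action::Nothing;            // transitive closure already materialized
      }
      if (dfs_index < z.currently_materializing_end) {
        return Action::WaitForCDSThread;   // its closure will be finished for us
      }
      return Action::TraceItOurselves;     // CDS thread has promised not to touch this zone
    }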

What we get from this is fast iterative traversal from the CDS thread (IterativeObjectLoader)
while allowing laziness and concurrency with the rest of the program (TracingObjectLoader).
This way the CDS thread can remove the bulk of the work of materializing the Java objects from
the critical bootstrapping thread.

When we start materializing objects, we have not yet come to the point in the bootstrapping where
GC is allowed. This is a double-edged sword. On the one hand, we can materialize objects faster
when we know there is no GC to coordinate with, but on the other hand, if we need to perform
a GC when allocating memory for archived objects, we will bring down the entire JVM. To deal with this,
the CDS thread asks the GC for a budget of bytes it is allowed to allocate before GC is allowed.
When we get to the point in the bootstrapping where GC is allowed, we resume materializing objects
that didn't fit in the budget. Before we let the application run, we force materialization of any
remaining objects that have not been materialized by the CDS thread yet, so that we don't get
surprising OOMs due to object materialization while the program is running.
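
A minimal sketch of the budget idea, with hypothetical names (the real budget comes from the GC and
the real allocation path is HotSpot's):

    #include <cstddef>
    #include <vector>

    // Materialize objects while they fit in the byte budget handed out by the GC.
    // Returns the index of the first object that did not fit, so materialization
    // can resume from there once GC is allowed.
    size_t materialize_within_budget(const std::vector<size_t>& object_sizes,
                                     size_t budget_bytes) {
      size_t i = 0;
      for (; i < object_sizes.size(); i++) {
        if (object_sizes[i] > budget_bytes) break;  // could otherwise force a GC too early
        budget_bytes -= object_sizes[i];
        // ... allocate and initialize archived object i with no GC coordination ...
      }
      return i;
    }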

The object format of the archived heap is similar to a normal object. However, references are encoded
as DFS indices, which in the end map to the position the object has in the buffer, as objects are laid out
in DFS order. The DFS indices start at 1 for the first object, and hence the number 0 represents
null. The DFS index is a core identifier of objects in this approach. From this index
it is possible to find the offset the archived object has in the buffer, as well as the
mapping to the Java heap object once it has been materialized.
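
A minimal sketch of how such an index could serve as the key for both lookups; the table names are
mine and purely illustrative:

    #include <cstdint>
    #include <vector>

    struct DFSIndexTables {
      std::vector<uint32_t> buffer_offset;  // slot i: archive offset of the object with DFS index i+1
      std::vector<void*>    heap_object;    // slot i: materialized heap object, or nullptr if not yet

      // DFS indices are 1-based; index 0 encodes a null reference.
      void* resolve(uint32_t dfs_index) const {
        return dfs_index == 0 ? nullptr : heap_object[dfs_index - 1];
      }

      uint32_t offset_of(uint32_t dfs_index) const {
        return buffer_offset[dfs_index - 1];  // precondition: dfs_index != 0
      }
    };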

The table mapping DFS indices to Java heap objects is filled in when an object is allocated.
Materializing objects involves allocating the object, initializing it, and linking it with other
objects. Since linking an object requires whatever is being referenced to be at least allocated,
the iterative traversal will first allocate all of the objects in the zone being worked on, and then
perform initialization and linking in a second pass. What these passes have in common is that they
are trivially parallelizable, should we ever need to do that. The tracing materialization links
objects when going "back" in the DFS traversal.
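
Here is a rough two-pass sketch of the iterative materialization; the Loader interface is hypothetical
and stands in for the real allocation and linking code:

    #include <cstdint>
    #include <vector>

    struct Loader {
      virtual void* allocate(uint32_t dfs_index) = 0;
      virtual void  initialize_and_link(uint32_t dfs_index, void* obj) = 0;
      virtual ~Loader() = default;
    };

    // Materialize the objects with DFS indices [first, last] in two passes.
    // Linking in pass 2 is safe because every referee in the zone was allocated
    // in pass 1, and objects in earlier zones are already fully materialized.
    void materialize_zone(Loader& loader, uint32_t first, uint32_t last,
                          std::vector<void*>& heap_object) {
      for (uint32_t i = first; i <= last; i++) {        // pass 1: allocation only
        heap_object[i - 1] = loader.allocate(i);
      }
      for (uint32_t i = first; i <= last; i++) {        // pass 2: initialize and link
        loader.initialize_and_link(i, heap_object[i - 1]);
      }
    }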

The forwarding information for the mechanism contains raw oops before GC is allowed, and as we
enable GC in the bootstrapping, all raw oops are handleified using OopStorage. All handles are
handed back from the CDS thread when materialization has finished. The switch from raw oops to
OopStorage handles happens under a lock while no iteration or tracing is allowed.
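
A minimal sketch of that switch, with stand-in types (the real code uses HotSpot's OopStorage rather
than the hypothetical allocate_handle below):

    #include <vector>

    struct Handle { void** slot; };          // stand-in for a GC-visible handle slot

    Handle allocate_handle(void* obj) {      // stand-in: the real code allocates an OopStorage slot
      return Handle{ new void*(obj) };
    }

    struct ForwardingTable {
      bool using_handles = false;
      std::vector<void*>  raw;               // valid only while GC is disallowed
      std::vector<Handle> handles;           // valid after the switch

      // Called under a lock, at the point where GC becomes allowed, while no
      // iteration or tracing is running.
      void handleify() {
        handles.reserve(raw.size());
        for (void* obj : raw) {
          handles.push_back(allocate_handle(obj));
        }
        using_handles = true;
      }
    };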

Initialization is also performed in a faster way while GC is not allowed. In particular,
before GC is allowed, we perform a raw memcpy of the archived object into the Java heap. Then the
object is initialized with IS_DEST_UNINITIALIZED stores. The assumption made here is that before
any GC activity is allowed, we shouldn't have to worry about concurrent GC threads scanning the
memory and getting tripped up by that. Once GC is enabled, we revert to a somewhat more careful approach
that uses a pre-computed bitmap to find the holes where oops go, and carefully copy only the
non-oop information with memcpy, while the oops are set separately with HeapAccess stores that
should be able to cope well with concurrent activity.
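
As a rough sketch of the careful path (my own simplified layout; the real code uses HotSpot's
HeapAccess and oop maps), the bitmap steers each slot either to a plain copy or to a GC-aware
reference store:

    #include <cstdint>
    #include <cstring>

    // Assumes 64-bit heap slots (no compressed oops). Bit i of the bitmap is set
    // when slot i of the object holds a reference rather than plain payload.
    void copy_object_carefully(void* dst, const void* src, size_t size_in_slots,
                               const uint8_t* oop_bitmap,
                               void (*oop_store)(void* slot, uint64_t dfs_index)) {
      const size_t slot = sizeof(uint64_t);
      for (size_t i = 0; i < size_in_slots; i++) {
        void* to = (char*)dst + i * slot;
        const void* from = (const char*)src + i * slot;
        if (oop_bitmap[i / 8] & (1u << (i % 8))) {
          uint64_t dfs_index;
          std::memcpy(&dfs_index, from, slot);  // the archived slot holds a DFS index
          oop_store(to, dfs_index);             // resolve and store with a GC-safe store
        } else {
          std::memcpy(to, from, slot);          // non-oop payload: a raw copy is fine
        }
      }
    }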

The same bitmap that tracks where the oops are is also reused for signalling which string
objects should be interned. At dump time, some of the referenced strings were interned. This is
really an identity property. We don't need to dump the entire string table as a way of communicating
that identity property. Instead we intern strings on-the-fly, exploiting to our advantage the dynamic
object-level linking that this approach has chosen.
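
A small sketch of the idea with a stand-in string table (the real code would go through HotSpot's
StringTable while linking a reference whose bitmap bit marks it as interned):

    #include <string>
    #include <unordered_set>

    struct StringTableSketch {
      std::unordered_set<std::string> table;
      const std::string* intern(const std::string& s) {
        return &*table.insert(s).first;   // canonical instance for this string value
      }
    };

    // While linking a string reference: if it is marked as interned in the archive,
    // store the canonical string rather than the freshly materialized copy, which
    // preserves the identity property without dumping the whole string table.
    const std::string* link_string_reference(StringTableSketch& strings,
                                             const std::string& materialized,
                                             bool marked_interned) {
      return marked_interned ? strings.intern(materialized) : &materialized;
    }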

An advantage of separating the on-disk object format from the in-memory format in the
heap is that the objects can also be compressed in the disk format. I haven't explored this yet, but
I suspect it can be quite beneficial for cold starts, at the cost of warm starts probably performing a bit worse.

I’m currently working my way through tests to get things looking green, as well as cleaning up the
code to prepare it for eventual upstreaming.

In terms of results, as long as my CDS thread gets a core to run on, startup latency for the
bootstrapping thread is ~1 ms, even when the archive gets inflated to ~5 MB. The concurrent
thread soaks up ~11 ms worth of work, which is performed efficiently well before we are about to
start running any user code.

If anyone has any feedback, or wants to help explore for example compression or something like that,
feel free to help out if you think it sounds interesting. Another thing I know is very helpful is to map the
CDS regions with MAP_POPULATE when available. That's also something I wouldn't mind help with
upstreaming, if anyone is interested in having a look at that.

Either way, I hope that making the current thoughts on this visible at this relatively early stage is helpful.
I'm open to any feedback, if there is any. Hope people like the approach.

Thanks,
/Erik