Loom implementation design update

Tue Mar 13 11:26:08 UTC 2018

A major change in the implementation approach of Project Loom was made recently,
and I want to update the list about it.

We've discussed various options regarding the management of continuation stacks.
There were two major questions: (1) where to store the stacks -- whether on the
Java heap, as Java objects, or on the C heap, using some specialized memory
management mechanism, and (2) whether the continuation stack layout should
mirror the ordinary layout used by frames on thread stacks, or use a different
one (for example, using separate objects to store primitives and references).
The two main requirements I outlined in a previous email have directed our
discussion, namely, (a) mounting/dismounting continuations must be very fast and
(b) we expect hundreds of thousands or even millions of continuations.

As far as question 1 is concerned, we concluded that relying on the existing
memory management facilities of the Java heap would be easier. Continuation
frames may hold references to the Java heap that the GC would need to traverse
somehow, and because the number of continuations is high, treating all
continuation stacks as roots (as we do for thread stacks) is not viable.
Managing continuation stacks in a separate memory region would therefore
require what amounts to a new and non-trivial garbage collection mechanism, and
so wouldn't simplify matters.

Question 2, that of layout, turned out to be more troublesome, as none of the
current or upcoming HotSpot GCs can easily support objects that store a
reference in some memory slot and then, at some later time, contain a primitive
in the same slot. This restriction on dynamic layout imposed by the GCs meant
that the continuation stack layout would need to be substantially different
from the ordinary stack layout [1].

Regardless of our preferences on either question, we believed that the
requirement for fast task-switching meant we must execute continuation code
"inside" the continuation stack, that is, by pointing the stack pointer (which
may have needed to be split into more than one pointer, depending on our chosen
layout) to the continuation stack. The need for a different layout combined with
the need to run "inside" the continuation stack would require drastically
different machine code to be generated (by the compilers and the interpreter)
for Java code running in a continuation. In addition, in order to avoid
performing various GC barriers whenever such code was executing (on the Java
heap), a rather complex handshake between the continuation and the GC would need
to be performed on each mount or dismount.

So executing code directly on the continuation stack would prove to be quite a
challenge, but executing code only on the thread stack and copying the
continuation stack back and forth between the heap and the stack on each
mount/dismount would likely be too costly, as it incurs a cost that is linear in
the entire depth of the continuation stack, regardless of how much work is done
while the continuation is mounted (even if no methods are pushed or popped).

As a result, we came up with a compromise solution, which we find appealing. We
call it "lazy copy", and the idea is as follows: Execution proceeds only on
ordinary thread stacks, and continuation frames are copied to and from the heap
when mounting and dismounting, but instead of copying the entire continuation stack when
mounting, we copy only the topmost frame (or some small batch of frames), and
install a "return barrier" that, when the bottom-most of the copied frames is
popped, copies over another frame from the heap. Upon dismount, those
continuation frames that are on the thread stack (and form the top portion of
the continuation stack) are copied over to the heap. While this approach is
still based on copying stack frames, we only need to copy however many frames
we've actually used.
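
To make the mechanics concrete, here is a toy Java model of the scheme. It is
only a sketch under heavy assumptions: frames are plain strings, the batch size
is fixed at one frame, and every name in it (LazyCopySketch, Continuation,
ThreadStack, thawOne) is made up for illustration and does not correspond to
any HotSpot code. The "return barrier" is modeled as a check in pop().

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: frames are strings, and all names here are hypothetical.
final class LazyCopySketch {

    // Stand-in for a continuation whose frames live on the Java heap.
    static final class Continuation {
        // Head of the deque is the topmost frame.
        final Deque<String> heapFrames = new ArrayDeque<>();
        // Number of frames copied in either direction so far.
        int framesCopied = 0;

        Continuation(String... bottomToTop) {
            for (String f : bottomToTop) heapFrames.addFirst(f);
        }
    }

    // Stand-in for the part of a thread stack used by the mounted continuation.
    static final class ThreadStack {
        final Deque<String> frames = new ArrayDeque<>(); // head = topmost frame
        private Continuation mounted;

        // Mount: copy only the topmost frame, not the whole stack.
        void mount(Continuation c) {
            mounted = c;
            thawOne();
        }

        // Copy one frame from the heap to the (empty) thread-stack portion.
        private void thawOne() {
            String f = mounted.heapFrames.pollFirst();
            if (f != null) {
                frames.addFirst(f);
                mounted.framesCopied++;
            }
        }

        // Ordinary call: a new frame is pushed on the thread stack only.
        void push(String frame) {
            frames.addFirst(frame);
        }

        // Ordinary return. When the bottom-most copied frame is popped,
        // the "return barrier" copies the next frame over from the heap.
        String pop() {
            String f = frames.removeFirst();
            if (frames.isEmpty()) thawOne();
            return f;
        }

        // Dismount: copy the frames on the thread stack back to the heap,
        // bottom-most first, so they end up on top of the heap copy in order.
        void dismount() {
            while (!frames.isEmpty()) {
                mounted.heapFrames.addFirst(frames.removeLast());
                mounted.framesCopied++;
            }
            mounted = null;
        }
    }

    public static void main(String[] args) {
        Continuation c = new Continuation("main", "a", "b", "c");
        ThreadStack s = new ThreadStack();
        s.mount(c);     // copies "c" only
        s.push("d");    // a call made while mounted
        s.pop();        // returns from "d"
        s.pop();        // returns from "c"; the barrier copies "b"
        s.dismount();   // copies "b" back to the heap
        System.out.println("frames copied: " + c.framesCopied);
    }
}
```

In this run the model copies three frames in total, whereas copying the whole
stack would have moved four frames on mount and three on dismount.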

With this solution, the question of the precise continuation stack layout
becomes secondary, as it only affects the mount/dismount code but not any
machine-code generation. It may also allow us to compress the stack frames and
store them on the heap in a less wasteful way. Finally, it may be possible to
reduce the task switching cost even further by storing some small number of
recently-used continuations in a cache of thread stacks (that would be treated
as ordinary thread stacks), to only be copied to the heap when evicted. The
effectiveness of caching depends, of course, on the cache-friendliness of the
manner in which continuations are used -- which would need to be studied.
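
As a rough illustration of the caching idea, the following Java sketch counts
how often a mount would find its continuation still resident in a cached thread
stack. It assumes a least-recently-used eviction policy, which is only one
possible choice and not something we have decided on; the class name
StackCacheSketch, the string continuation ids, and the counters are all
hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: an LRU policy is an assumption, and all names are hypothetical.
final class StackCacheSketch {
    private final int capacity;
    int hits = 0;       // mounts that found the continuation still on a cached stack
    int evictions = 0;  // each eviction stands for one copy of a stack to the heap

    // Access-ordered map: iteration order runs from least to most recently used.
    private final LinkedHashMap<String, Boolean> resident;

    StackCacheSketch(int capacity) {
        this.capacity = capacity;
        this.resident = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                if (size() <= StackCacheSketch.this.capacity) return false;
                evictions++; // the evicted stack would be copied to the heap here
                return true;
            }
        };
    }

    // Mount a continuation, identified here by a plain string id.
    void mount(String continuationId) {
        if (resident.get(continuationId) != null) {
            hits++; // still resident: no copying needed
        } else {
            resident.put(continuationId, Boolean.TRUE); // may evict the eldest entry
        }
    }

    public static void main(String[] args) {
        StackCacheSketch cache = new StackCacheSketch(2);
        for (String id : new String[] {"A", "B", "A", "C", "B"}) cache.mount(id);
        System.out.println("hits=" + cache.hits + " evictions=" + cache.evictions);
    }
}
```

How well such a cache performs depends entirely on the mount pattern, which is
the part that would need to be studied.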

Ron

[1]: Unless we were to use a data structure of linked frame objects, where a new
frame object would be allocated whenever the primitive/reference slots changed.
This would happen very often, and would entail GC pressure even when no
application objects are explicitly allocated.