Stack allocation prototype for C2

Thu Jul 9 19:28:01 UTC 2020

Hi Vladimir,

Thanks for reviewing the document and providing your feedback.

> From the design overview and the implementation, I'm concerned about 
> far-reaching consequences of the chosen approach. It's not limited just 
> to existing set of JVM features, but as Andrew noted will affect the 
> design of forthcoming functionality as well.
>
> I think it's worth to start a broad discussion (HotSpot-wide) and decide 
> how much JVM design complexity budget it is worth spending on such an 
> optimization.

This is a great suggestion. Where and how should we start this discussion
to get feedback from the broader community?

> As we discussed off-line (right after FOSDEM), I do see the benefits of 
> in-memory representation for non-escaping objects: memory aliasing 
> (either indeterminate base or indexed access) imposes inherent 
> constraints on the escape analysis (both partial and conservative 
> approaches suffer from it). Nevertheless, some of the problematic cases 
> can be addressed by improving existing approach or introducing a more 
> powerful analysis: covering more cases and making the analysis 
> control-sensitive should improve the situation.

We would like to work on improving escape analysis as you suggest above.
If improved escape analysis can achieve the same allocation reductions, it would be
the better long-term solution. We would like to continue reviewing stack allocation
and start a sandbox project as Dalibor suggested, while working on improving escape
analysis and measuring it against the sandbox as a baseline.

> Also, the alternative approach (called zone-based heap allocation) looks 
> very attractive to me. I haven't thought it through, but it looks like 
> keeping the objects on the Java heap can save us a lot of complexity on 
> the implementation side (more memory available for allocation - not 
> necessarily fixed amount, no need to migrate objects from stack to heap, 
> GC barriers are unaffected, etc.). For example, reserving a dedicated 
> TLAB (or a stack of TLABs?) and do nmethod-scoped allocations from C2 
> code looks attractive. It can simplify many aspects of the 
> implementation: much more space available, free migration of 
> non-escaping objects to heap on deoptimization.

We have been thinking about this idea since FOSDEM and we completely agree
with the pros of zone-based allocation. The biggest benefits are the removal of
the restrictions in compressed oops mode and that barriers would not have to be
modified. 

For this approach, were you envisioning that objects allocated in a stack zone are
pinned until the method returns, and that the GC would not reclaim memory in that
zone while it is pinned? That is what we were thinking, but we are worried about
the complexity of the changes and the restrictions it might add to the GC
implementations.

Another thought is about the added cost to method enter / exit. With the current
on-stack approach there are no added instructions for entering / exiting a method,
since the stack frame is just larger. For the zone-based approach we would need a
few more instructions on enter and exit to get the space from the zone TLAB and to
return it. If the current zone TLAB is full, we would need to do more work to get
another one. Hopefully the common case of satisfying the space requirements from
the current zone TLAB would on average cost the same as or less than the current
TLAB checks for fast path allocations.
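
To make the enter / exit cost concrete, here is a minimal sketch of the kind of
bookkeeping we have in mind. It is a plain Java model, not HotSpot code: the class
name ZoneTlab, its methods and the sizes are all hypothetical, and a real
implementation would emit the equivalent instructions from C2.

```java
// Hypothetical model of a zone TLAB; names and sizes are illustrative only.
final class ZoneTlab {
    private final int capacity; // bytes available in this zone buffer
    private int top;            // bump pointer: offset of the next free byte

    ZoneTlab(int capacity) {
        this.capacity = capacity;
    }

    // Method entry: remember the current top so exit can release with one store.
    int enter() {
        return top;
    }

    // Fast path: bump the pointer and return the offset of the new object.
    // Returns -1 when the zone is full and a slow path (fetching another
    // zone TLAB) would be needed.
    int allocate(int size) {
        if (size < 0 || size > capacity - top) {
            return -1;
        }
        int offset = top;
        top += size;
        return offset;
    }

    // Method exit: release everything allocated since the matching enter().
    void exit(int mark) {
        top = mark;
    }

    int used() {
        return top;
    }

    public static void main(String[] args) {
        ZoneTlab zone = new ZoneTlab(64);
        int mark = zone.enter();
        int a = zone.allocate(24);
        int b = zone.allocate(24);
        int c = zone.allocate(24); // does not fit: 48 + 24 > 64
        System.out.println(a + " " + b + " " + c + " used=" + zone.used());
        zone.exit(mark);
        System.out.println("used after exit=" + zone.used());
    }
}
```

In this model the extra work per method is one load on enter and one store on exit,
plus the capacity check on each allocation; the refill path is where the additional
cost would show up.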

A final consideration is the footprint cost for Project Loom. In the zone-based approach,
would each virtual thread (fibre) have its own zone TLAB (or stack of TLABs)? If each
virtual thread had a zone TLAB, it may lead to more frequent GCs because a significant
portion of the heap would be reserved for zone-based allocations.
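
As a back-of-the-envelope illustration of the footprint concern, the sketch below
multiplies a thread count by a per-thread zone size. All of the numbers are
hypothetical and chosen only to show the scaling, and it assumes the worst case
where every virtual thread holds exactly one zone TLAB at the same time.

```java
// Hypothetical footprint estimate; the inputs are illustrative, not measurements.
final class ZoneFootprint {
    // Total bytes reserved if every virtual thread owns one zone TLAB.
    static long reservedBytes(long virtualThreads, long zoneBytesPerThread) {
        return Math.multiplyExact(virtualThreads, zoneBytesPerThread);
    }

    // Share of the heap taken up by the reserved zones.
    static double heapFraction(long reservedBytes, long heapBytes) {
        return (double) reservedBytes / heapBytes;
    }

    public static void main(String[] args) {
        long threads = 100_000L;                // assumed number of virtual threads
        long zoneBytes = 64L * 1024;            // assumed 64 KiB zone TLAB per thread
        long heapBytes = 8L * 1024 * 1024 * 1024; // assumed 8 GiB heap

        long reserved = reservedBytes(threads, zoneBytes);
        System.out.println("reserved bytes = " + reserved);
        System.out.println("heap fraction  = " + heapFraction(reserved, heapBytes));
    }
}
```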

We do not see any of these as showstoppers, but we want to be sure we have the full picture.

> Another idea:
> 
> "When dealing with stack allocated objects in loops we need a lifetime 
> overlap check."
>
> It doesn't look specific to stack-allocated objects. Non-overlapping 
> live ranges can be coalesced the same way for on-heap freshly allocated 
> objects. It should get comparable reduction in allocation pressure 
> (single allocation per loop vs allocation per iteration) and doesn't 
> require stack allocation support at all (as an example [1]).
>
> If such improvements are enabled for non-escaping on-heap objects, how 
> much benefit will stack allocation bring on top of that? IMO the 
> performance gap should become much narrower.

We agree. It is one of the first things we wanted to try after we submitted the initial stack
allocation code for review. Again, our approach would be to keep the current stack allocation
prototype as a baseline and work to see if we can shrink the gap with other approaches.
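
For readers following along, the coalescing idea can be illustrated at the source level.
The sketch below does by hand what the compiler would do automatically; the class and
method names are hypothetical and it is not taken from the prototype. Because each Point
is dead before the next one is created, the live ranges never overlap and one object can
be reinitialized on every iteration.

```java
// Hypothetical source-level illustration of coalescing non-overlapping live ranges.
final class LoopCoalescing {
    // Mutable on purpose so that one instance can be reused across iterations.
    static final class Point {
        int x;
        int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Baseline: one allocation per iteration.
    static long perIteration(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, i + 1);
            sum += (long) p.x * p.y;
        }
        return sum;
    }

    // Coalesced by hand: a single allocation, reinitialized on each iteration.
    static long coalesced(int n) {
        long sum = 0;
        Point p = new Point(0, 0);
        for (int i = 0; i < n; i++) {
            p.x = i;
            p.y = i + 1;
            sum += (long) p.x * p.y;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(perIteration(1000) == coalesced(1000)); // prints true
    }
}
```

Both methods compute the same result; the second performs a single allocation for the
whole loop instead of one per iteration, which is the reduction in allocation pressure
described above.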

Thanks again for providing valuable feedback and insight.
Charlie and Nikola