Stack allocation prototype for C2
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Fri Jul 10 18:42:25 UTC 2020
Hi Charlie,
> Thanks for reviewing the document and providing your feedback.
One request about improving the document: please elaborate more on the
interactions with the EA implementation in C2.
For example, stack allocation can be used for both non-scalarizable
NoEscape and ArgEscape objects, but the latter requires GC barriers
everywhere to check for stack-allocated objects, while in the former
case the check can be limited to the current nmethod.
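To make the distinction concrete, here is a minimal Java sketch of the two escape states (class and method names are illustrative, not taken from the prototype):

```java
// Illustrative only: shows the two escape states discussed above.
public class EscapeExamples {

    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // NoEscape: 'p' never leaves this method, so references to a
    // stack-allocated 'p' could only exist within this one nmethod.
    static int noEscape(int a, int b) {
        Point p = new Point(a, b);
        return p.x + p.y;
    }

    static int sum(Point p) { return p.x + p.y; }

    // ArgEscape: 'p' is passed to a callee, so code outside the
    // allocating method can hold a reference to it. If 'p' were stack
    // allocated, GC barriers everywhere would have to recognize
    // stack-allocated objects.
    static int argEscape(int a, int b) {
        Point p = new Point(a, b);
        return sum(p);
    }
}
```

In the ArgEscape case a reference crosses the nmethod boundary, which is why the barrier check can no longer be confined to the allocating method.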
>> From the design overview and the implementation, I'm concerned about
>> far-reaching consequences of the chosen approach. It's not limited just
>> to the existing set of JVM features, but as Andrew noted will affect the
>> design of forthcoming functionality as well.
>>
>> I think it's worth starting a broad discussion (HotSpot-wide) and deciding
>> how much JVM design complexity budget is worth spending on such an
>> optimization.
>
> This is a great suggestion, where and how should we start this discussion
> to get feedback from the broader community?
I suggest initiating a new discussion on hotspot-dev at ojn, stressing
that it's not just about optimizations in JIT compilers, but a proposal
to enable object allocations on the thread stack, and discussing the
effects on other JVM subsystems and features.
>> As we discussed off-line (right after FOSDEM), I do see the benefits of
>> in-memory representation for non-escaping objects: memory aliasing
>> (either indeterminate base or indexed access) imposes inherent
>> constraints on the escape analysis (both partial and conservative
>> approaches suffer from it). Nevertheless, some of the problematic cases
>> can be addressed by improving existing approach or introducing a more
>> powerful analysis: covering more cases and making the analysis
>> control-sensitive should improve the situation.
>
> We would like to work to improve escape analysis as per your suggestions above.
> If we can achieve the same allocation reductions that way, it would be a
> better long-term solution. We would like to continue reviewing stack allocation
> and start a sandbox project as Dalibor suggested, but work on improving escape
> analysis and measure against the sandbox for a baseline.
Good idea! Keeping the up-to-date patches in a sandbox repository would
be very convenient.
>> Also, the alternative approach (called zone-based heap allocation) looks
>> very attractive to me. I haven't thought it through, but it looks like
>> keeping the objects on the Java heap can save us a lot of complexity on
>> the implementation side (more memory available for allocation - not
>> necessarily fixed amount, no need to migrate objects from stack to heap,
>> GC barriers are unaffected, etc.). For example, reserving a dedicated
>> TLAB (or a stack of TLABs?) and doing nmethod-scoped allocations from C2
>> code looks attractive. It can simplify many aspects of the
>> implementation: much more space available, free migration of
>> non-escaping objects to heap on deoptimization.
>
> We have been thinking about this idea since FOSDEM and we completely agree
> with the pros of zone-based allocation. The biggest benefits are the removal of
> the restrictions in compressed oops mode and that barriers would not have to be
> modified.
>
> For this approach, were you envisioning that objects allocated in a stack zone are
> pinned until the method returns? Also, while that zone memory is pinned the GC
> would not reclaim memory in that zone? That is what we were thinking, but we
> are worried about the complexity of the changes and restrictions it might add to
> the GC implementations.
Just want to reiterate that I haven't thought the idea through, but my
educated guess is that there should be a way to implement it
optimistically and mostly transparently to the runtime and GCs.
Just a sketch of the idea:
(1) JIT can optimistically use a dedicated TLAB in some scope (e.g.,
nmethod-based: record a watermark at nmethod entry for future use);
(2) when leaving the scope (e.g., on nmethod exit), JIT can try to
free the allocated space (down to the recorded watermark), but has to
verify that some per-thread invariant still holds;
(3) runtime can break the invariant at any time, but has to ensure
that all allocated objects end up in Java heap.
For example (assuming all TLABs are allocated on-heap): using "the same
zone TLAB is registered with the thread" as the invariant, and having the
runtime break it by de-registering the zone TLAB from the thread
(allocating a new TLAB / resetting it to NULL), should do the job. Plus,
there's an option to promote the zone TLAB to an ordinary TLAB, which may
reduce heap waste.
So far, I don't see any major problems, but the idea still needs
validation with an experiment to understand how effective the proposed
scheme is at reducing the allocation rate.
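The watermark/invariant scheme above could be modeled schematically like this (plain Java, single-threaded, with hypothetical names; a sketch of the idea, not HotSpot code):

```java
// Schematic model of the optimistic zone-TLAB scheme: bump-pointer
// allocation, a watermark recorded at scope entry, and a release on
// scope exit that is valid only while the registration invariant holds.
class ZoneTlab {
    private final int capacity;
    private int top = 0;             // bump pointer
    static ZoneTlab registered;      // models "registered with the thread"

    ZoneTlab(int capacity) {
        this.capacity = capacity;
        registered = this;           // register the zone TLAB with the thread
    }

    // (1) record a watermark at scope (nmethod) entry for future use
    int watermark() { return top; }

    // bump-pointer allocation from the zone TLAB; returns the offset
    int allocate(int size) {
        if (top + size > capacity)
            throw new IllegalStateException("slow path: need a new TLAB");
        int offset = top;
        top += size;
        return offset;
    }

    // (3) the runtime may break the invariant at any time by
    // de-registering the zone TLAB; already-allocated objects then
    // simply remain ordinary Java-heap objects.
    static void deregister() { registered = null; }

    // (2) on scope (nmethod) exit, free back down to the watermark,
    // but only if "the same zone TLAB is still registered with the
    // thread" -- otherwise do nothing and leave the objects on heap.
    boolean tryRelease(int watermark) {
        if (registered != this) return false;  // invariant broken
        top = watermark;
        return true;
    }

    int used() { return top; }
}
```

The point of the model: releasing space is a pure optimization that is skipped whenever the runtime has intervened, so correctness never depends on the release happening.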
> Another thought is about the added cost to method enter / exit. With the current
> on-stack approach there are no added instructions for entering / exiting a method,
> since the stack size is just larger. For the zone-based approach we would need to
> have a few more instructions on enter and exit to get the space from the zone TLAB
> and to return it. If the current zone TLAB is full we would need to do more work to
> get another one. Hopefully the common case of satisfying the space requirements
> from the current zone TLAB would on average be the same or less than the current
> TLAB checks for fast path allocations.
Allocating a TLAB per method looks wasteful: TLABs are normally quite
large (hence more heap waste for deep thread stacks and a large number of
threads) and their allocation is expensive (requires a CAS).
> A final consideration is the footprint cost for project Loom. In the zone-based approach
> would each virtual thread (fibre) have its own zone TLAB (or stack of TLABs)? If each
> virtual thread had a zone TLAB it may lead to more frequent GCs because a significant
> portion of the heap is reserved for zone-based allocations.
IMO having a TLAB per virtual thread may cause too much waste: the TLAB
size can easily outweigh the footprint of the virtual thread itself.
Sharing a TLAB from a carrier thread may help, but it can't be used
across possible freeze points.
So, I don't have a clear picture of what the best option will be there.
> We do not see any of these as showstoppers, but we just want to be sure we have the full picture.
>> Another idea:
>>
>> "When dealing with stack allocated objects in loops we need a lifetime
>> overlap check."
>>
>> It doesn't look specific to stack-allocated objects. Non-overlapping
>> live ranges can be coalesced the same way for on-heap freshly allocated
>> objects. It should get comparable reduction in allocation pressure
>> (single allocation per loop vs allocation per iteration) and doesn't
>> require stack allocation support at all (as an example [1]).
>>
>> If such improvements are enabled for non-escaping on-heap objects, how
>> much benefit will stack allocation bring on top of that? IMO the
>> performance gap should become much narrower.
>
> We agree, it’s one of the first things we wanted to try after we submitted the initial stack
> allocation code for review. Again, our approach would be to have the current stack allocation
> prototype as a baseline and work to see if we can shrink the gap with other approaches.
Sounds good!
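To illustrate the loop case from the quoted text, here is a hypothetical Java example (names are illustrative) of the transform: because each per-iteration object dies before the next one is created, the live ranges never overlap, and the compiler can coalesce them into a single allocation.

```java
// Illustrates coalescing non-overlapping live ranges of a
// non-escaping, freshly allocated object in a loop.
public class CoalesceExample {
    static class Accum { int value; }

    // Before: one allocation per iteration. Each Accum is dead
    // before the next iteration allocates, so live ranges are disjoint.
    static int perIteration(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            Accum a = new Accum();   // fresh object every iteration
            a.value = i;
            sum += a.value;
        }
        return sum;
    }

    // After: the disjoint live ranges are coalesced into a single
    // on-heap allocation reused across iterations, with fields
    // reinitialized each time -- same result, one allocation total.
    static int coalesced(int n) {
        int sum = 0;
        Accum a = new Accum();       // single allocation for the whole loop
        for (int i = 0; i < n; i++) {
            a.value = i;
            sum += a.value;
        }
        return sum;
    }
}
```

This is the "single allocation per loop vs allocation per iteration" reduction the quoted text describes, and it needs no stack allocation support at all.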
Best regards,
Vladimir Ivanov