Scoping the stack allocation prototype for C2
Erik Österlund
erik.osterlund at oracle.com
Tue Jul 21 10:48:34 UTC 2020
Hi Nikola and Charlie,
On 2020-07-16 23:31, Nikola Grcevski wrote:
> Hello hotspot-dev,
>
> We recently posted a proposal [1] to implement stack allocation in C2 on the hotspot-compiler dev mailing list.
> Vladimir Ivanov [2] asked that we broaden the discussion by posting here as it is more than a compiler optimization!
> That is, enabling object allocations on thread stacks will impact other JVM subsystems now and in future designs.
It sounds like the high-level goal is not "allocate on the stack", but
rather to allocate and free objects more efficiently based on compile-time
knowledge about their scoped lifetime. Allocating them on the stack is
one way of achieving that. The memory is allocated and reclaimed as part
of the frame.
However, this scheme is inherently tightly tied to the GC implementations.
Custom GC-specific code is required so that GC barriers that were
annotated in the C2 access API as IN_HEAP dynamically check whether the
object is in fact !IN_HEAP. This also comes with GC-specific barrier
elision code that pattern matches accesses on objects known to be stack
allocated, removing their barriers, which looks quite hairy. Such code
makes it increasingly painful to maintain the barriers when they need to
change.
Other than being a bit ugly and painful to maintain, some GCs might have
a harder time dealing with this than others.
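To make the shape of that concrete, here is a minimal sketch of the kind
of dynamic check I mean (all names are illustrative, not actual HotSpot
code): the barrier has to recognize a stack object, e.g. with an address
range test against the current thread's stack, before doing its normal
IN_HEAP work.

  #include <cstdint>

  struct ThreadStackBounds {
    uintptr_t stack_base;  // highest address of the thread stack
    uintptr_t stack_end;   // lowest usable address of the thread stack
  };

  // Returns true if 'obj' lives on the current thread's stack rather than in the heap.
  inline bool is_stack_oop(const void* obj, const ThreadStackBounds& t) {
    uintptr_t p = reinterpret_cast<uintptr_t>(obj);
    return p >= t.stack_end && p < t.stack_base;
  }

  // An IN_HEAP post-write barrier that must now tolerate !IN_HEAP base objects.
  // Unless C2 can prove statically that the base is (or is not) stack allocated
  // and elide this, every GC-specific barrier grows a guard of this shape.
  inline void post_write_barrier(void* base, const ThreadStackBounds& t) {
    if (is_stack_oop(base, t)) {
      return;  // stack objects are processed as roots; no card/remembered-set work
    }
    // ... existing GC-specific card marking / remembered-set logic ...
  }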
== G1 ==
For G1 it is okay, as there are some suitable slow paths you can inject
the stack object check into, before card table lookups are performed.
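Roughly, the check could hide behind the filters a G1-style post-write
barrier already has, something like this sketch (illustrative names and
layout, not the real G1 barrier code):

  #include <cstdint>

  void g1_style_post_write(uintptr_t field_addr, uintptr_t new_val,
                           unsigned region_size_log2,
                           bool (*is_stack_oop)(uintptr_t)) {
    // Fast-path filters a G1-style post-write barrier already performs:
    if (((field_addr ^ new_val) >> region_size_log2) == 0) return; // same region
    if (new_val == 0) return;                                      // null store
    // Proposed extra filter, only reached on the slow path:
    if (is_stack_oop(field_addr)) return;  // no card to dirty for a stack object
    // ... dirty the card and enqueue it for refinement (existing G1 logic) ...
  }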
== ParallelGC and SerialGC ==
Parallel and Serial do not have such a slow path to add the check to, so
a new filter would have to be written that makes every IN_HEAP object
write slightly slower by checking whether it is really not IN_HEAP. That
might still be a net performance win, but it is a performance tax that I
would associate with the choice of placing these objects on the stack.
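For comparison, a Serial/Parallel-style card mark is essentially one
shift and one byte store today, so the new branch lands on the hot path
of every reference store. A sketch with illustrative names:

  #include <cstdint>

  static const int kCardShift = 9;  // 512-byte cards, as in HotSpot

  inline void card_mark_post_write(uintptr_t field_addr,
                                   volatile uint8_t* card_table_base, // biased base
                                   bool (*is_stack_oop)(uintptr_t)) {
    // New, unconditional branch the proposal would add:
    if (is_stack_oop(field_addr)) return;           // a stack object has no card
    // Existing barrier: dirty the card covering the written field.
    card_table_base[field_addr >> kCardShift] = 0;  // 0 == dirty card
  }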
== ZGC ==
The ZGC IN_HEAP load barriers need oops to not be "bad". All the
information required for the check is in the loaded pointer itself.
Stack pointers are never bad, because they cannot have the reserved bad
bits set. Therefore, I think everything will
magically work well with ZGC... today. But we are designing new things
that change the barriers, and it isn't entirely clear to me what the new
constraints would mean.
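To spell out why it happens to work today: a ZGC-style load barrier only
looks at metadata bits in the loaded pointer, and a raw stack address can
never have those bits set, so stack oops always take the fast path. A
sketch (mask and names illustrative):

  #include <cstdint>

  inline void* z_style_load_barrier(void** field, uintptr_t bad_bit_mask) {
    uintptr_t ref = reinterpret_cast<uintptr_t>(*field);
    if ((ref & bad_bit_mask) == 0) {
      // Fast path: good-colored heap oops AND any stack oop end up here,
      // because stack addresses cannot carry the reserved metadata bits.
      return reinterpret_cast<void*>(ref);
    }
    // Slow path: remap/relocate the heap oop and heal the field (omitted here).
    return nullptr;
  }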
== Shenandoah ==
The IN_HEAP load barriers of Shenandoah need to check that the loaded
reference is not in the collection set. This is done with a table lookup
that only works for IN_HEAP references. I think you would have to add an
additional check here for whether it is a stack object. It seems like
adding such code to a load barrier could be quite nasty.
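The in-collection-set test is only meaningful for heap addresses, so the
barrier would have to divert stack oops first; roughly like this
(illustrative names and region arithmetic, not the actual Shenandoah
code):

  #include <cstdint>

  inline void* shen_style_load_barrier(void* obj,
                                       const uint8_t* cset_byte_map,  // one byte per heap region
                                       uintptr_t heap_base,
                                       unsigned region_size_log2,
                                       bool (*is_stack_oop)(uintptr_t)) {
    uintptr_t ref = reinterpret_cast<uintptr_t>(obj);
    if (ref == 0) return obj;
    // Proposed extra filter: the table lookup below is undefined for addresses
    // outside the heap, so stack oops must be caught here.
    if (is_stack_oop(ref)) return obj;
    if (cset_byte_map[(ref - heap_base) >> region_size_log2] != 0) {
      // Slow path: the object is in the collection set; evacuate/forward it (omitted).
    }
    return obj;
  }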
> If the proposal moves forward the current webrev [3] would need to be reviewed by multiple groups. This is a
> clear sign the wider OpenJDK community needs to provide input. An example of this proposal impacting future
> work, is Project Loom. Allocating objects on the stack would have to be limited in some scenarios. The implementation
> of stack allocated objects referencing other stack allocated objects would need to be changed *or* how Project Loom
> handles copying stacks would have to be modified.
Ron wants yielding to be very fast. Therefore, the common case does not
walk the stack that is being copied to perform any per-frame processing
at all. Instead, the raw stack is copied for the best possible
throughput. A lot of time was spent optimizing stack walking with crazy
tricks, but raw copy still turned out ~10x faster IIRC. So I think Ron
would be a bit sad if the yield has to perform any per-frame processing
to sort out the stack objects.
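Schematically (this is not Loom code, just an illustration of the shape
of the problem): today the used stack can be frozen with one bulk copy,
while stack-allocated objects would push the freeze towards per-frame
work.

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct FrameView {
    const uint8_t* start;  // lowest address of the frame
    size_t size;           // frame size in bytes
  };

  // Today's fast path: one bulk copy of the used stack region.
  void freeze_bulk(const uint8_t* used_stack, size_t used_bytes,
                   std::vector<uint8_t>& out) {
    out.assign(used_stack, used_stack + used_bytes);
  }

  // The shape stack objects would push towards: per-frame work on every yield.
  void freeze_per_frame(const std::vector<FrameView>& frames,
                        std::vector<uint8_t>& out) {
    out.clear();
    for (const FrameView& f : frames) {
      // ... find stack-allocated objects in this frame and migrate them or
      //     rewrite references before/while copying (omitted) ...
      out.insert(out.end(), f.start, f.start + f.size);
    }
  }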
> While all the various solutions for Project Loom would be a net win at the end, they all come with drawbacks and
> possible losses in performance compared to our current measured performance improvements. These decisions
> would need to be discussed at length to decide what was best for Hotspot and the OpenJDK community.
I think moving the storage of the objects to some stack-like TLAB in the
Java heap should be seriously considered, as proposed by Vladimir Ivanov
and Vladimir Kozlov. It can similarly provide optimized object
allocation and freeing, based on compile-time knowledge about the scoped
lifetime of objects.
I see the following advantages:
* No new GC-specific dynamic barrier code to check if IN_HEAP accesses
are in fact not IN_HEAP.
* Likely no impact or small impact on any upcoming projects, as the
scoped objects are still regular IN_HEAP objects.
* In particular, Loom gets no significant extra processing.
* Nothing weird about using compressed oops.
* Larger allocations can still be scoped; no need to worry about stacks
growing too much.
* No need for object migrations or special handling during deopt or when
using stack walking APIs.
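To make that a bit more concrete, here is a minimal sketch of what I have
in mind, under my own assumptions (all names are illustrative, not an
existing HotSpot API): a per-thread bump-pointer region in the Java heap
whose watermark is saved when entering an nmethod with scoped allocations
and restored when leaving it, so the objects stay ordinary IN_HEAP
objects but are freed in bulk when the scope ends.

  #include <cstddef>
  #include <cstdint>

  class ScopedObjectArena {
    uint8_t* _start;  // backing memory inside the Java heap
    uint8_t* _top;    // bump pointer
    uint8_t* _end;
  public:
    ScopedObjectArena(uint8_t* start, size_t size)
      : _start(start), _top(start), _end(start + size) {}

    // "Object stack bang": check on nmethod entry that there is room for the
    // nmethod's worst-case scoped allocations.
    bool ensure_capacity(size_t bytes_needed) const {
      return size_t(_end - _top) >= bytes_needed;
    }

    uint8_t* mark() const { return _top; }   // saved in the frame on entry

    void* allocate(size_t bytes) {           // a scoped allocation is just a bump
      uint8_t* obj = _top;
      _top += bytes;
      return obj;
    }

    void release_to(uint8_t* saved_mark) {   // the "extra decrement" on exit
      _top = saved_mark;
    }
  };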
When this was raised as an alternative in compiler-dev, Charlie had
concerns that the nmethod entry would need a few more instructions to do
something similar to stack banging to ensure that the IN_HEAP local
object stack is large enough for the nmethod (obviously only applicable
for nmethods that have such local objects). Similarly, there would be an
extra decrement when leaving the nmethod.
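In terms of the arena sketch above, that per-invocation cost is roughly
one capacity check on entry and one pointer restore on exit, and only for
nmethods that actually have scoped allocations; the sizes below are of
course made up:

  void nmethod_with_scoped_allocs(ScopedObjectArena& arena) {
    if (!arena.ensure_capacity(256)) {        // entry: the "object stack bang"
      // Fall back, e.g. allocate the scoped objects in the ordinary TLAB/heap.
      return;
    }
    uint8_t* saved = arena.mark();            // entry: save the watermark
    void* scoped_obj = arena.allocate(64);    // scoped allocations are plain bumps
    (void)scoped_obj;                         // ... use the object ...
    arena.release_to(saved);                  // exit: the "extra decrement"
  }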
This makes it sound like storing the objects in the stack would have a
performance edge, because you can elide the need for an extra object
stack bang check and decrement from the entry/exit. But I think it
should be stressed that this is not a free lunch; you get that
performance edge by injecting dynamic stack object checks into GC
barriers, with varying impact for different GCs, depending on how
impacted they are by the existence of stack objects. For ParallelGC, for
example, that would likely imply an extra check for every IN_HEAP object
write. It really is not obvious that the once-per-nmethod-invocation cost
of the object stack checks actually outweighs the overhead of more
expensive GC barriers.
Since I can see so many advantages with this approach, and not really
any disadvantages if done right, I wonder if there are any clear reasons
not to build it in that (general) way instead. I am curious to hear your
thoughts about this.
If you agree this is probably a better way of achieving the same thing,
then I have some ideas about how best to implement it that we could
discuss. But maybe we should settle on the direction first. Perhaps there
are some nice quirks of stack allocations that still make them more
desirable, even though I cannot currently see them.
Thanks,
/Erik
> The benefits of allocating objects on the stack, based on our proposal [1], need to be weighed against the costs on
> current and future features in components that will be impacted. These decisions will need to be discussed at length
> to decide what was best for Hotspot and the OpenJDK community.
>
> We are optimistic that this work can provide benefits to the JVM without restricting future designs too heavily.
>
> We are looking forward to receiving all feedback on this proposal.
>
> Thanks!
> Nikola and Charlie
>
> [1] https://github.com/microsoft/openjdk-proposals/blob/master/stack_allocation/Stack_Allocation_JEP.md
> [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-July/038969.html
> [3] https://cr.openjdk.java.net/~adityam/charlie/stack_alloc/