Scoping the stack allocation prototype for C2
Erik Österlund
erik.osterlund at oracle.com
Tue Jul 21 10:48:34 UTC 2020
Hi Nikola and Charlie,
On 2020-07-16 23:31, Nikola Grcevski wrote:
> Hello hotspot-dev,
>
> We recently posted a proposal [1] to implement stack allocation in C2 on the hotspot-compiler dev mailing list.
> Vladimir Ivanov [2] asked that we broaden the discussion by posting here as it is more than a compiler optimization!
> That is, enabling object allocations on thread stacks will impact other JVM subsystems now and in future designs.
It sounds like the high-level goal is not "allocate on the stack", but
rather to allocate and free objects more efficiently based on compile-time
knowledge about their scoped lifetime. Allocating them on the stack is
one way of achieving that. The memory is allocated and reclaimed as part
of the frame.
However, this scheme is inherently tightly tied to the GC implementations.
Custom GC-specific code is required so that GC barriers that were
annotated in the C2 access API as IN_HEAP dynamically check whether the
object is in fact !IN_HEAP. This also comes with GC-specific barrier
elision code that pattern matches accesses on objects known to be stack
allocated, removing their barriers, which looks quite hairy. Such code
makes it increasingly painful to maintain the barriers when they need to
change.
Other than being a bit ugly and painful to maintain, some GCs might have
a harder time dealing with this than others.
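To make the shape of that concrete, here is a minimal sketch of the kind
of dynamic check I mean (all names are illustrative, not actual HotSpot
code): the barrier has to recognize a stack object, e.g. with an address
range test against the current thread's stack, before doing its normal
IN_HEAP work.

  #include <cstdint>

  struct ThreadStackBounds {
    uintptr_t stack_base;  // highest address of the thread stack
    uintptr_t stack_end;   // lowest usable address of the thread stack
  };

  // Returns true if 'obj' lives on the current thread's stack rather than in the heap.
  inline bool is_stack_oop(const void* obj, const ThreadStackBounds& t) {
    uintptr_t p = reinterpret_cast<uintptr_t>(obj);
    return p >= t.stack_end && p < t.stack_base;
  }

  // An IN_HEAP post-write barrier that must now tolerate !IN_HEAP base objects.
  // Unless C2 can prove statically that the base is (or is not) stack allocated
  // and elide this, every GC-specific barrier grows a guard of this shape.
  inline void post_write_barrier(void* base, const ThreadStackBounds& t) {
    if (is_stack_oop(base, t)) {
      return;  // stack objects are processed as roots; no card/remembered-set work
    }
    // ... existing GC-specific card marking / remembered-set logic ...
  }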
== G1 ==
For G1 it is okay, as there are some suitable slow paths you can inject
the stack object check into, before card table lookups are performed.
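Roughly, the check could hide behind the filters a G1-style post-write
barrier already has, something like this sketch (illustrative names and
layout, not the real G1 barrier code):

  #include <cstdint>

  void g1_style_post_write(uintptr_t field_addr, uintptr_t new_val,
                           unsigned region_size_log2,
                           bool (*is_stack_oop)(uintptr_t)) {
    // Fast-path filters a G1-style post-write barrier already performs:
    if (((field_addr ^ new_val) >> region_size_log2) == 0) return; // same region
    if (new_val == 0) return;                                      // null store
    // Proposed extra filter, only reached on the slow path:
    if (is_stack_oop(field_addr)) return;  // no card to dirty for a stack object
    // ... dirty the card and enqueue it for refinement (existing G1 logic) ...
  }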
== ParallelGC and SerialGC ==
Parallel and Serial do not have such a slow path to add the check to, so
a new filter would have to be written that makes every IN_HEAP object
write slightly slower by checking whether it is really not IN_HEAP. That
might still be a net performance win, but it is a performance tax that I
would associate with the choice of placing these objects on the stack.
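For comparison, a Serial/Parallel-style card mark is essentially one
shift and one byte store today, so the new branch lands on the hot path
of every reference store. A sketch with illustrative names:

  #include <cstdint>

  static const int kCardShift = 9;  // 512-byte cards, as in HotSpot

  inline void card_mark_post_write(uintptr_t field_addr,
                                   volatile uint8_t* card_table_base, // biased base
                                   bool (*is_stack_oop)(uintptr_t)) {
    // New, unconditional branch the proposal would add:
    if (is_stack_oop(field_addr)) return;           // a stack object has no card
    // Existing barrier: dirty the card covering the written field.
    card_table_base[field_addr >> kCardShift] = 0;  // 0 == dirty card
  }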
== ZGC ==
The ZGC IN_HEAP load barriers need oops to not be "bad". All the
information required for the check is in the loaded pointer itself.
Stack pointers are never bad, because they cannot have the reserved bad
bits set. Therefore, I think everything will
magically work well with ZGC... today. But we are designing new things
that change the barriers, and it isn't entirely clear to me what the new
constraints would mean.
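To spell out why it happens to work today: a ZGC-style load barrier only
looks at metadata bits in the loaded pointer, and a raw stack address can
never have those bits set, so stack oops always take the fast path. A
sketch (mask and names illustrative):

  #include <cstdint>

  inline void* z_style_load_barrier(void** field, uintptr_t bad_bit_mask) {
    uintptr_t ref = reinterpret_cast<uintptr_t>(*field);
    if ((ref & bad_bit_mask) == 0) {
      // Fast path: good-colored heap oops AND any stack oop end up here,
      // because stack addresses cannot carry the reserved metadata bits.
      return reinterpret_cast<void*>(ref);
    }
    // Slow path: remap/relocate the heap oop and heal the field (omitted here).
    return nullptr;
  }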
== Shenandoah ==
The IN_HEAP load barriers of Shenandoah need to check that the loaded
reference is not in the collection set. This is done with a table lookup
that only works for IN_HEAP references. I think you would have to add an
additional check here for whether it is a stack object. It seems like
adding such code to a load barrier could be quite nasty.
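The in-collection-set test is only meaningful for heap addresses, so the
barrier would have to divert stack oops first; roughly like this
(illustrative names and region arithmetic, not the actual Shenandoah
code):

  #include <cstdint>

  inline void* shen_style_load_barrier(void* obj,
                                       const uint8_t* cset_byte_map,  // one byte per heap region
                                       uintptr_t heap_base,
                                       unsigned region_size_log2,
                                       bool (*is_stack_oop)(uintptr_t)) {
    uintptr_t ref = reinterpret_cast<uintptr_t>(obj);
    if (ref == 0) return obj;
    // Proposed extra filter: the table lookup below is undefined for addresses
    // outside the heap, so stack oops must be caught here.
    if (is_stack_oop(ref)) return obj;
    if (cset_byte_map[(ref - heap_base) >> region_size_log2] != 0) {
      // Slow path: the object is in the collection set; evacuate/forward it (omitted).
    }
    return obj;
  }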
> If the proposal moves forward the current webrev [3] would need to be reviewed by multiple groups. This is a
> clear sign the wider OpenJDK community needs to provide input. An example of this proposal impacting future
> work, is Project Loom. Allocating objects on the stack would have to be limited in some scenarios. The implementation
> of stack allocated objects referencing other stack allocated objects would need to be changed *or* how Project Loom
> handles copying stacks would have to be modified.
Ron wants yielding to be very fast. Therefore, the common case does not
walk the stack that is being copied to perform any per-frame processing
at all. Instead, the raw stack is copied for the best possible
throughput. A lot of time was spent optimizing stack walking with crazy
tricks, but raw copy still turned out ~10x faster IIRC. So I think Ron
would be a bit sad if the yield has to perform any per-frame processing
to sort out the stack objects.
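Schematically (this is not Loom code, just an illustration of the shape
of the problem): today the used stack can be frozen with one bulk copy,
while stack-allocated objects would push the freeze towards per-frame
work.

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct FrameView {
    const uint8_t* start;  // lowest address of the frame
    size_t size;           // frame size in bytes
  };

  // Today's fast path: one bulk copy of the used stack region.
  void freeze_bulk(const uint8_t* used_stack, size_t used_bytes,
                   std::vector<uint8_t>& out) {
    out.assign(used_stack, used_stack + used_bytes);
  }

  // The shape stack objects would push towards: per-frame work on every yield.
  void freeze_per_frame(const std::vector<FrameView>& frames,
                        std::vector<uint8_t>& out) {
    out.clear();
    for (const FrameView& f : frames) {
      // ... find stack-allocated objects in this frame and migrate them or
      //     rewrite references before/while copying (omitted) ...
      out.insert(out.end(), f.start, f.start + f.size);
    }
  }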
> While all the various solutions for Project Loom would be a net win at the end, they all come with drawbacks and
> possible losses in performance compared to our current measured performance improvements. These decisions
> would need to be discussed at length to decide what was best for Hotspot and the OpenJDK community.
I think moving the storage of the objects to some stack-like TLAB in the
Java heap should be seriously considered, as proposed by Vladimir Ivanov
and Vladimir Kozlov. It can similarly provide optimized object
allocation and freeing, based on compile-time knowledge about the scoped
lifetime of objects.
I see the following advantages:
* No new GC-specific dynamic barrier code to check if IN_HEAP accesses
are in fact not IN_HEAP.
* Likely no impact or small impact on any upcoming projects, as the
scoped objects are still regular IN_HEAP objects.
* In particular, Loom gets no significant extra processing.
* Nothing weird about using compressed oops.
* Larger allocations can still be scoped; no need to worry about stacks
growing too much.
* No need for object migrations or special handling during deopt or when
using stack walking APIs.
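To make that a bit more concrete, here is a minimal sketch of what I have
in mind, under my own assumptions (all names are illustrative, not an
existing HotSpot API): a per-thread bump-pointer region in the Java heap
whose watermark is saved when entering an nmethod with scoped allocations
and restored when leaving it, so the objects stay ordinary IN_HEAP
objects but are freed in bulk when the scope ends.

  #include <cstddef>
  #include <cstdint>

  class ScopedObjectArena {
    uint8_t* _start;  // backing memory inside the Java heap
    uint8_t* _top;    // bump pointer
    uint8_t* _end;
  public:
    ScopedObjectArena(uint8_t* start, size_t size)
      : _start(start), _top(start), _end(start + size) {}

    // "Object stack bang": check on nmethod entry that there is room for the
    // nmethod's worst-case scoped allocations.
    bool ensure_capacity(size_t bytes_needed) const {
      return size_t(_end - _top) >= bytes_needed;
    }

    uint8_t* mark() const { return _top; }   // saved in the frame on entry

    void* allocate(size_t bytes) {           // a scoped allocation is just a bump
      uint8_t* obj = _top;
      _top += bytes;
      return obj;
    }

    void release_to(uint8_t* saved_mark) {   // the "extra decrement" on exit
      _top = saved_mark;
    }
  };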
When this was raised as an alternative in compiler-dev, Charlie had
concerns that the nmethod entry would need a few more instructions to do
something similar to stack banging to ensure that the IN_HEAP local
object stack is large enough for the nmethod (obviously only applicable
for nmethods that have such local objects). Similarly, there would be an
extra decrement when leaving the nmethod.
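In terms of the arena sketch above, that per-invocation cost is roughly
one capacity check on entry and one pointer restore on exit, and only for
nmethods that actually have scoped allocations; the sizes below are of
course made up:

  void nmethod_with_scoped_allocs(ScopedObjectArena& arena) {
    if (!arena.ensure_capacity(256)) {        // entry: the "object stack bang"
      // Fall back, e.g. allocate the scoped objects in the ordinary TLAB/heap.
      return;
    }
    uint8_t* saved = arena.mark();            // entry: save the watermark
    void* scoped_obj = arena.allocate(64);    // scoped allocations are plain bumps
    (void)scoped_obj;                         // ... use the object ...
    arena.release_to(saved);                  // exit: the "extra decrement"
  }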
This makes it sound like storing the objects in the stack would have a
performance edge, because you can elide the need for an extra object
stack bang check and decrement from the entry/exit. But I think it
should be stressed that this is not a free lunch; you get that
performance edge by injecting dynamic stack object checks into GC
barriers, with varying impact for different GCs, depending on how
impacted they are by the existence of stack objects. For ParallelGC, for
example, that would likely imply an extra check for every IN_HEAP object
write. It really is not obvious that the once-per-nmethod-invocation cost
of the object stack checks actually outweighs the overhead of more
expensive GC barriers.
Since I can see so many advantages with this approach, and not really
any disadvantages if done right, I wonder if there are any clear reasons
not to build it in that (general) way instead. I am curious to hear your
thoughts about this.
If you agree this is probably a better way of achieving the same thing,
then I have some ideas about how best to implement it that we could
discuss. But maybe we should settle on the direction first. Perhaps there
are some nice quirks of stack allocations that still make them more
desirable, even though I cannot currently see them.
Thanks,
/Erik
> The benefits of allocating objects on the stack, based on our proposal [1], need to be weighed against the costs on
> current and future features in components that will be impacted. These decisions will need to be discussed at length
> to decide what was best for Hotspot and the OpenJDK community.
>
> We are optimistic that this work can provide benefits to the JVM without restricting future designs too heavily.
>
> We are looking forward to receiving all feedback on this proposal.
>
> Thanks!
> Nikola and Charlie
>
> [1] https://github.com/microsoft/openjdk-proposals/blob/master/stack_allocation/Stack_Allocation_JEP.md
> [2] https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2020-July/038969.html
> [3] https://cr.openjdk.java.net/~adityam/charlie/stack_alloc/