Stack allocation prototype for C2

Wed Jan 29 00:26:07 UTC 2020

Thanks for the explanation Ron, that makes sense. We'll change the protype code and run an experiment where we reject only safepoints that are nmethod calls. We should be able to produce some numbers soon and share with everyone.

Cheers,
Nikola

-----Original Message-----
From: Ron Pressler <ron.pressler at oracle.com> 
Sent: January 28, 2020 6:06 PM
To: Nikola Grcevski <Nikola.Grcevski at microsoft.com>; hotspot-compiler-dev at openjdk.java.net; Charlie Gracie <Charlie.Gracie at microsoft.com>
Subject: RE: Stack allocation prototype for C2

In the fast mode we cannot patch any frame contents because we don’t know what the frames are (or even where they begin and end) — that’s why that mode is fast. However, that mode is only relevant at nmethod calls, not any other kind of safepoint. So the only limitation is stack oops living across nmethod calls. If even that limitation significantly harms performance, we can weigh our options.

Ron

On 28 January 2020 at 22:52:38, Nikola Grcevski (nikola.grcevski at microsoft.com(mailto:nikola.grcevski at microsoft.com)) wrote:

> Hi Ron,
>  
> > Reading between the lines, the difference between scalar replacement and what you call "stack allocation" is that the latter creates an oop pointing into the stack frame. Is that correct?
>  
> Yes, you are correct. There will be oops on the stack that will point to stack locations in the same nmethod frame. We used the BoxLockNode in C2 and these stack allocated objects end up at the bottom of the frame.
>  
> > If so, there is another consideration to take into account. Loom continuations can relocate stack frames in memory, and currently, compiled frames do not contain pointers into the stack. This relocation can potentially happen at any safepoint, but the continuation algorithm has two modes. In the slow mode -- used when there are interpreted frames or, for forced preemption, at any arbitrary safepoint -- the stack frames are parsed and walked, and so pointers in the stack that point into the stack could be patched (I assume, from your description that you do not allow oops outside the stack to point into the stack). In the fast mode, we avoid parsing the stack and we just copy the frames as-is, and it can occur at any nmethod call. If your "stack-oops" can currently live across nmethod calls, how much would this optimization be compromised if you disallowed that (i.e. stack oops to stack-allocated objects would not live across nmethod calls)?
>  
> Thanks so much for this insight. It seems that we would have to add some code to fix-up/patch the stack oops only in the fast relocation mode, unless that's a dealbreaker? We currently describe the stack allocated object for deoptimization purposes on every safepoint where the objects are live, so perhaps we can use the same information to fix-up the stack on relocation. We initially disallowed stack allocating any objects that lived across safepoints until we implemented the deoptimization support, so yes the optimization would definitely be compromised with that limitation, quite a bit of opportunities will be rejected. Some of the benchmarks we quoted still improve with this limitation around safepoints, but we'd have to remeasure for final numbers.
>  
> > Also, if my understanding is correct, I'm not sure you need to limit the optimization to -UseCompressedOops. I believe that even when compressed oops are used, stack oops can be wide; i.e. we do find frames with mixed narrow and wide oops in them, and the oopmaps allow for that possibility.
>  
> Our present limitation was mainly related to autos that can point to either stack allocated object or heap object, depending on the context. With compressed oops enabled we saw the oop always undergo an EncodeP before it was stored on the stack, after a merge point. It would be great if there's a way to overcome this in the IR, we are both new to OpenJDK and could use guidance on how to approach the problem.
>  
> Thanks a lot for your insights and feedback!
>  
> Nikola
>  
> -----Original Message-----
> From: Ron Pressler
> Sent: January 28, 2020 8:46 AM
> To: hotspot-compiler-dev at openjdk.java.net; Charlie Gracie ; Nikola 
> Grcevski
> Subject: Re: Stack allocation prototype for C2
>  
>  
> Hi.
>  
> Reading between the lines, the difference between scalar replacement and what you call "stack allocation" is that the latter creates an oop pointing into the stack frame. Is that correct?
>  
> If so, there is another consideration to take into account. Loom continuations can relocate stack frames in memory, and currently, compiled frames do not contain pointers into the stack. This relocation can potentially happen at any safepoint, but the continuation algorithm has two modes. In the slow mode -- used when there are interpreted frames or, for forced preemption, at any arbitrary safepoint -- the stack frames are parsed and walked, and so pointers in the stack that point into the stack could be patched (I assume, from your description that you do not allow oops outside the stack to point into the stack). In the fast mode, we avoid parsing the stack and we just copy the frames as-is, and it can occur at any nmethod call. If your "stack-oops" can currently live across nmethod calls, how much would this optimization be compromised if you disallowed that (i.e. stack oops to stack-allocated objects would not live across nmethod calls)?
>  
> Also, if my understanding is correct, I'm not sure you need to limit the optimization to -UseCompressedOops. I believe that even when compressed oops are used, stack oops can be wide; i.e. we do find frames with mixed narrow and wide oops in them, and the oopmaps allow for that possibility.
>  
> Ron
>  
>  
> On 27 January 2020 at 15:58:43, Charlie Gracie wrote:
>  
> >
> > Hi,
> >
> > We have been prototyping Stack Allocation in C2. We are hoping to 
> > start a conversation on our approach and to see if there is interest 
> > in the community for us to submit our work as a patch set. We would appreciate any comments, concerns or feedback on our current approach or the idea in general. Thanks!
> >
> > Background on why we are investigating Stack Allocation:
> > While investigating the performance of different workloads (a lot of 
> > Scala but some plain Java), we noticed a lot of GC and heap pressure 
> > due to transient object allocations. Upon further digging we were 
> > able to see that C2’s Escape Analysis phase was correctly 
> > recognizing these allocations as Non Escaping. The allocations were 
> > not being Scalar Replaced due to a few different restrictions but the most common was due to control flow merge points before accessing the objects. Here is a simplified example of a case where an allocation cannot be Scalar Replaced even though it is recognized as Non Escaping.
> >
> > int foo(int x, MyObject[] vals) {
> > MyObject bar;
> > If (x > 127) {
> > bar = new MyObject(x);
> > } else {
> > bar = vals[x];
> > }
> > return bar.x;
> > }
> >
> > In the above example the allocation could successfully be replaced 
> > with a Stack Allocation. We believe that a lot of code has control 
> > flow merge points that cause Scalar Replacement to be rejected but 
> > Stack Allocation could work for many of them. The obvious benefit is the reduction in GC/heap pressure, but you also get the added benefit of improved cache hit rate since the stack allocated objects will likely be in the cache memory.
> > There are costs with Stack Allocation including increased compilation times and increased code cache size.
> > These would have to be carefully measured fully understand the benefits and costs.
> >
> > Our current approach:
> > 1. Only allow stack allocation in non-compressed oops mode
> > (-XX:-UseCompressedOops)
> > 1.1 This simplifies the prototype and we believe there are solutions 
> > for compressed oops 2. Focus on ParallelGC and G1GC for prototype
> > 2.1 Support for other collectors needs to be investigated 3. Current 
> > prototype is based on JDK 11
> > 3.1 We are going to work on moving it up to tip shortly
> > 3.2 We have been following the changes in tip and this work should 
> > apply easily 4. Use the results from Escape Analysis to decide which 
> > objects are eligible for Stack Allocation
> > 4.1 Scalar Replacement is the most important, so allow that to 
> > happen first
> > 4.2 Arrays cannot be stack allocated in the current prototype.
> > 4.2.1 Straight forward to implement
> > 4.3 Stack allocated objects cannot reference other stack allocated objects.
> > 4.3.1 More difficult than arrays but should be doable
> > 4.4 Reject allocations based on the reasons listed below in the 
> > details section 5. In macro expansion during expand allocation replace allocations eligible for Stack Allocation.
> > 5.1 Replace the AllocNode with a BoxLockNode with the correct stack 
> > location and remove the slow path and other unused IR
> > 5.2 Repurpose BoxLock nodes to represent the stack location for the 
> > object
> > 5.3 No Stack Allocation in methods that contain synchronization.
> > 5.3.1 This restriction should be easy to remove.
> > 5.4 Remove write barriers on stores into stack allocated objects 6.
> > Grow the frame size to handle the stack allocated objects 7. Modify 
> > OopMap scanning to handle stack allocated objects
> > 7.1 Currently handled via a generic OopClosure for all GCs
> > 7.2 More work required for full GC collector coverage 8. Safepoint 
> > support reuses SafePointScalarObject
> > 8.1 De-optimization is working and correctly allocating heap objects 
> > and initializing their fields
> >
> > Current Results:
> > 1. Microbenchmarks using JMH simply replace all allocations using 
> > Stack Allocation where Scalar Replacement 2. Workloads in DeCapo and 
> > Renaissance are showing benefits 3. We have not measured any 
> > negative impacts on application throughput 4. More detailed analysis 
> > on compilation times and code cache size are required
> >
> > Benchmark % Improvement
> > DeCapo tmt 15
> > Renaissance db-shootout 10
> > Renaissance fj-kmeans 5
> > Renaissance future-genetic 10
> > Renaissance movie-lens 5
> > Renaissance neo4j-analytics 45
> >
> > Details on why allocations should not be replaced with Stack Allocation:
> > Not all Non Escaping objects can be Stack Allocated so there will 
> > still be cases where Non Escaping allocations will need to be allocated on the heap:
> > 1. Large arrays and objects should likely be ignored as they could 
> > significantly impact the stack frame size 2. Similar to 1 limit the 
> > total amount of stack memory per frame that should be used for Stack 
> > Allocations 3. Objects and arrays without constant sizes/lengths 4.
> > Allocations with over lapping lifetimes from different loop 
> > iterations 5. A stack allocated object cannot be referenced by any 
> > other Non Escaping objects that are not scalar replaced or stack 
> > allocated as that would count as an escape
> >
> > Here is a simple example of what I was describing in 4.
> > int method(int[] vals) {
> > MyObject o1 = new MyObject(vals[0]); for (int i =1; i < vals.length; 
> > ++) { MyObject o2 = new MyObject(vals[i]); if (!o1.equals(o2)) {
> > o1 = o2;
> > …
> > }
> > …
> > }
> > …
> > }
> >
> > In the above example if o2 was stack allocated and you ever entered 
> > the “if” from that point on o1 and o2 would always be the same object. This is incorrect so o2 cannot be stack allocated.
> >
> > Thanks for reading and we look forward to your feedback!
> > Charlie Gracie & Nikola Grcevski
> >
> >
>  
>