RFC, C2 Partial Escape Analysis

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Mon May 31 13:35:16 UTC 2021


Hi Jason,

Thanks a lot for sharing your thoughts and plans about Partial Escape 
Analysis in C2. I enjoyed reading the document. Though it's clearly not 
finished yet, it already covers a lot of ground.

It sparkled discussion among Compiler group members at Oracle and I'll 
try to summarise the main takeaways.

=====

1. First of all, the need to move allocations around poses the main 
implementaiton challenge for PEA implemantation in C2. At the moment, 
there's no reliable way to recompute JVM state when it is missing. So, 
the only feasible way is to capture JVM state at all important points in 
the graph during parsing (e.g., in form of SafePointNodes and easily 
discoverable) and keep the information until EA is over.

That would enable EA to move allocations away from original place (or 
even split a single allocation into multiple ones).

=====

2. Move vs split decisions. Though it may be desireable (to keep 
generated code size small) to reduce the number of allocation sites in 
the graph, there's no such requirement. So, a single allocation is 
allowed to be split into a multiple ones, but the actual placement 
should abide the invariants imposed by the program order and JVMS.

Since it is possible to insert arbitrary number of allocations, it 
becomes an optimization (and not correctness) problem to choose the 
actual placement.

Consider:

   MyPair p = new MyPair(1, 2);
   if (cond1) {
     ...
     if (cond2) {
        ... = p; // escapes
     }
     ...
     if (cond3) {
        ... = p; // escapes
     }
     ...
}

It can be either turned into

   if (cond1) {
     MyPair p = new MyPair(1, 2);
     if (cond2) {
        ... = p; // escapes
     }
     ...
     if (cond3) {
        ... = p; // escapes
     }
     ...
   }

or

   if (cond1) {
     ...
     if (cond2) {
        ... = new MyPair(1, 2); // escapes
     }
     ...
     if (cond3) {
        ... = new MyPair(1, 2); // escapes
     }
     ...
   }

=====

3. As you mention, EA is not only about tracking escaping events. Even 
for non-escaping allocations identity sensitive opetaions (e.g., pointer 
comparison) and memory aliasing (e.g., indexed accesses) pose some 
challenges and have to be tracked. (Possibly, requiring a JVM state for 
each one as well.)

=====

4. Escape sites separate the graph into 2 parts: before and after the 
instance escapes. In order to preserve identity invariants (and avoid 
identity paradoxes), PEA can't just put an allocation at every escape 
site. It should respect the order of escape events and ensure that the 
very same object is observed when multiple escape events happen.

Dynamic invariant can be formulated as: there should never be more than 
1 allocation at runtime per 1 eliminated allocation.

Considering non-escaping operations can force materialization on their 
own, it poses additional constraints.

MyPair p = new MyPair(1, 2);
if (cond1) {
   a.f = p; // escapes here
}
if (cond2) {
   b.f = p; // escapes here
}
...
if (a.f == b.f) { // true when (cond1 && cond2)


A possible simplifying assumption would be to put a single allocation at 
the LCA block for all escape sites, but it doesn't cover simple cases like:

MyPair p = new ...
if (cond) {
   ...
   if (cond_i) { // unlikely
     ... = p; // escape site
   }
} else {
   ...
   if (cond_j) { // unlikely
      ... = p; // escape site
   }
}

=====

5. The document doesn't mention CFG merges at all. But handling those 
properly (instead of bailing out the analysis) seem to be very important 
to make the optimization practical.

MyPair p = new MyPair(1, 2);
...
if (cond1) {
   ... = p; // escapes
}
// merge point
// no escape points beyound it

Even though there's only single escape point, it is illegal to assume 
that the object didn't escape beyond merge point.

The invariant is that at every CFG merge point either ALL or NONE 
predecessor blocks have the object materialized.

Some possible allocation placements are:

MyPair p = NULL;
... // may benefit from scalarization
p = new MyPair(1, 2);
if (cond1) {
   ... = p; // escapes
}
// merge point


MyPair p = NULL;
... // may benefit from scalarization of p
if (cond1) {
   ... // may benefit from scalarization of p
   p = new MyPair(1, 2);
   ... = p; // escapes
} else {
   ... // may benefit from scalarization of p
   p = new MyPair(1, 2);
}

// merge point: no p materialization allowed beyond this point

// p = phi(p_cond1, p_cond2)

Also, it poses some challenges to optimizing cases with loops when loop 
body contains an escape site. But it's not clear how much can be done 
anyway without reshaping the loop and extracting the escape site out of 
it (e.g., with loop predication or loop versioning).

=====

Overall, the general consensus was that it doesn't look feasible to 
introduce new optimization pass specifically for PAE.

Extending existing EA implementation with PEA capabilities looks the 
most promising way forward.

It can be impelemented as an iterative process where:
   a) EA constructs connection graph;
   b) for globally escaping objects:
       based on escape site info, PEA chooses alternative allocation 
placement;
   c) based on PEA results, globally escaping allocations are moved/split;
   d) EA is run again on the modified graph;

(And the process is repeated until it stabilizes or runs out of time.)

Best regards,
Vladimir Ivanov

On 27.05.2021 15:08, Tatton, Jason wrote:
> RFC, C2 Partial Escape Analysis
> Hello everyone!
> 
> We would like to invite discussion on a proposal for Partial Escape Analysis (PEA) to be added to the JVM C2 hotspot compiler.
> 
> Specifically, we are looking to introduce a new compilation phase in order to enable optimizations such as scalar replacement in situations currently not encompassed by the existing control flow insensitive Escape Analysis (EA) phase of C2.
> 
> PEA is a control flow sensitive variant of EA. We believe that by adding this compilation phase we will be able to improve compiled code performance and reduce heap size requirements by up to 10% and 8% respectively.
> 
> At AWS, we have been prototyping a partial implementation of PEA with some success so far. We are at the stage now where we feel wider community involvement would be beneficial to the initiative.
> Background: What is Escape Analysis
> Escape analysis (EA) enables the HotSpot JIT C2 compiler to detect cases where objects do not "escape" the method/thread which created them. Whilst EA itself is not an optimization, the results of the analysis are used in order to enable a number of optimization techniques (as an alternative to conventional object allocation on the heap) including:
> 
>    *   Stack allocation. Instead of allocating an object on the heap, it is instead allocated on the stack. In this way it is not considered eligible for garbage collection. This reduces the amount of work which must be done by garbage collectors, improving performance. C2 does not currently implement this optimization, though some other JVM implementations do such as IBM J9.
>    *   Scalar replacement. This goes a step further than stack allocation, here the object allocation can be eliminated altogether and all object field interaction isreplaced by local variables. Eliminating object allocation, field referencing and reducing load on the garbage collector all improve performance and reduce memory footprint.
>    *   Lock Elision. If a Java object monitor is used but the object but does not escape, then the lock can never be contented. In this case synchronization on said lock can be removed.
> 
> Existing Limitations
> The current EA implementation is flow insensitive. Let's examine what this means in practice...
> 
> Here is an example of an object which does not escape the bounds of the method/thread creating it and is therefore considered non escaping and eligible for optimization. The current implementation of EA in C2 detects this case
> 
> public class Pair{
>      public int a;
>      public int b;
>      public Pair(int a, int b){
>          this.a = a;
>          this.b = b;
>       }
> }
> 
> public static int foo(int a, int b){
>      Pair mypair = new Pair(a, b);
>      return mypair.a;
> }
> and will enable scalar replacement such that the above code in foo can be transformed into the following form, which saves us from an instance of object allocation of Pair on every invocation of foo.
> 
> public class Pair{
>      public int a;
>      public int b;
>      public Pair(int a, int b){
>          this.a = a;
>          this.b = b;
>       }
> }
> 
> public static int foo(int a, int b){
>      // Pair mypair = new Pair(12, 13);
>      return a; // don't bother with object allocation of Pair
> }
> However, EA is control flow insensitive and will not enable scalar replacement of fields of an object where there is branching code with a path having the possibility of the object escaping the lifetime of the thread/method which created it, regardless of how unlikely that branch is to execute at runtime... For example:
> 
> public static int foo(){
>      Pair mypair = new Pair(12, 13);
>      if(something()){
>          // maybe this branch is hit only 0.01% of the time
>          return usesAndPersistsObject(mypair); // escapes here
>      }else{
>          return mypair.a; // no scalar replacement here
>      }
> }
> Above we see that in one branch the Pair instance object mypair escapes, therefore the mypair instance object is considered 'globally' escaping and must always be allocated on the heap. This means that scalar replacement cannot be used even when the escaping branch is only entered into very infrequently.
> 
> PEA aims to largely solve this problem.
> Partial Escape Analysis
> Partial Escape Analysis (PEA) is a control flow sensitive variant of EA. Like EA it acts upon the ideal graph IR. With PEA we propose introducing the concept of a differed object allocation (DOA). We propose considering only objects which have been identified as escaping in the EA phase of compilation. With PEA we propose to defer object allocation of those objects to the latest point in program/branch execution before which that object has been identified as escaping. Scalar replacement and lock elision can then be applied up to that point of deferred object allocation. Sometimes the latest point for a particular object may resolve to be at the point of initial object allocation, in which case PEA will offer no benefit.
> High level algorithm
> We are interested in only a subset of ideal graph nodes. Specifically nodes related to escaping objects:
> 
>    *   Allocation(A): Allocation and Initialization nodes.
>    *   Branch(B): We are interested in branching nodes; IfNode, IfTrue, IfFalse nodes and phi nodes (e.g. to capture the case of: staticfield = condition?obj1:obj2 as well as loops, jumps related to break/continue etc) .
>    *   Escape(E): Nodes concerning: object assigned to static field, object used as parameter in method invocation
>    *   Usage(U): Nodes concerning: store to object field, loading from object field.
>    *   Return(R): Behave similarly to Escape nodes, Return/ReThrow nodes.
> We propose PEA to operate in via the following steps:
> 
>    1.  Graph reduction. This stage visits every node in the ideal graph and for the aforementioned nodes of interest, produces a reduced graph structure (example below) consisting of a reduced control flow sensitive set of nodes. The non-branching nodes of this reduced graph form are structured into basic blocks with IfTrue/IfFalse nodes used to denote the branches. The subset of nodes is used in step 2 and 3 below.
>    2.  Differed Object Allocation Algorithm. This stage figures out where opportunities for DOA exist for each allocated object that can escape. It determines the very last point in which an object can be differed allocated, aka materialized on the heap in a branch so as to maintain program semantics by considering branches and object usage. There may be more than one branch path resulting in a materialization of an allocated object. There may also be instances where the object is required to be in a materialized state before a branch node has been encountered - in this case then no further optimization can be enabled by PEA and subsequent processing can be skipped.
>    3.  Graph transformation. After applying DOA in step 2 we know the latest points at which an object may be materialized on the heap. With this knowledge all usage nodes prior to these points can be scalar replaced and all usage/escaping nodes after a materialization of a node need to be repointed to the nodes resulting from the materializations. Some repointing of the internal state of differed objects on the heap may be necessary at the point of materialization so as to maintain program semantics. We must also ensure consistency in the case of deoptimization.
> Throughout all these mini phases, in the case of potential failure, the algorithm will skip subsequent processing.
> Example
> Let us now examine a case where PEA works well:
> 
> 0:  public static boolean RAND  = true; // set externally
> 1:  public static boolean RAND2 = true; // set externally
> 2:
> 3:  public static int infrequentBranch(int f1, int f2) {
> 4:     Pair intPair = new Pair(f1, f2);
> 5:
> 6:     if (RAND) { // external call sets this to true 10% of the time
> 7:         intPair.a += 1; // example 'Use' of intPair
> 8:         Global.lastThing = intPair; // escapes here
> 9:         return 1;
> 10:    }
> 11:
> 12:    // scalar replacable usage...
> 13:    intPair.a += 12;
> 14:
> 15:    if (RAND2) { // external call sets this to true 10% of the time
> 16:        intPair.a += 10; // example 'Use' of intPair
> 17:        Global.lastThing = intPair; // escapes here
> 18:        return 2;
> 19:    }
> 20:
> 21:    // if we get to this point there should have been
> 22:    // no object allocation (with PEA enabled)
> 23:    return 3;
> 24: }
> Above is an example of code containing two branching paths in which the intPair object escapes. If execution of the above code gets to the end of the method, with PEA enabled, intPair will not be allocated on the heap.
> 
> Let us now look at the reduced graph of the above method:
> 
> 4: A:Initialize[40]
> 6: B:If[110]
>   -> IfTrue: (bb: 1)
>   7: U:StoreI[121] of: Initialize[40]
>   8: E:StoreN[126] of: Initialize[40]
>   9: R:Return[130]
>   -> IfFalse: (bb: 2)
>   X
> 13: U:StoreI[147] of: Initialize[40]
> 15: B:If[158]
>   -> IfTrue: (bb: 3)
>   16: U:StoreI[170] of: Initialize[40]
>   17: E:StoreN[173] of: Initialize[40]
>   18: R:Return[180]
>   -> IfFalse: (bb: 4)
>   X
> 23: R:Return[194]
> The reduced graph is composed of a root basic block (bb) (with id of 0) and four child bb's (1,2,3,4) corresponding to the code in the two if statements. Each if statement generates an IfTrue and IfFalse node each having a basic block associated. For readability, line number information is provided above (our prototype, discussed shortly, also includes bci information).
> 
> The graph transformation mini phase is interested in the results of opportunities for differed allocation introduced by the DOA algorithm. In this example, that corresponds to the Initialize[40] node seen at the root bb. Observe that there are two DOA opportunities in each of the IfTrue branches at lines 8 and 17. Since both of these occur within branches which result in early termination of the method, the U:StoreI[147] node (corresponding to: intPair.a += 12;) may be scalar replaced. The nodes: U:StoreI[121] and U:StoreI[170] may optionally be scalar replaced, but there is little benefit because soon after the they are referenced the object which they reference will be materialized onto the heap.
> 
> The DOA algorithm of PEA is able to able to handle more complex cases, for example say we have the following code:
> 
> 0:  public static boolean RAND  = true; // set externally
> 1:  public static boolean RAND2 = true; // set externally
> 2:
> 3:  public static int reallyInfrequentBranch(int f1, int f2) {
> 4:     Pair intPair = new Pair(f1, f2);
> 5:
> 6:     if (RAND) { // external call sets this to true 1% of the time
> 7:        if (RAND2) { // external call sets this to true 1% of the time
> 8:           Global.lastThing = intPair; // escapes here
> 9:           unkownCall(); // non inlined call - maybe call triggering separate thread interacting with the state of Global.lastThing
> 10:          intPair.a += 12; // operates on materialized object
> 11:       }
> 12:    }
> 13:
> 16:    return intPair.a;
> 17: }
> At first glance it would appear that the above code would benefit from PEA since the escape of intPair occurs within not just one but two infrequently branching nodes. However, PEA cannot be used here because when the intPair object escapes at line 8, in this example the the unkownCall() is not inlined and maybe performs an operation on the state of IntPair (maybe in a separate thread), as such to maintain program semantics the object needs to be in a materialized state at the point of initial declaration (i.e. created on the heap as normal).
> 
> As far as the DOA algorithm is concerned for the above code, when it is processing the bb associated with the If statement on line 7 it will first mark the intPair object as escaping and materialized within the subsequent context of the branch, thus rendering the call on line 10 as not scalar replaceable. When DOA has finished processing this bb and moved on to the root bb it will come to the return statement on line 16, since the intPair is marked as being in a escaped and previously materialized state within a previous nested branch bb, this materialization will be "promoted" up to the root bb, which also just happens to be the location of the initial allocation thus eliminating this as an opportunity for PEA.
> Progress at AWS
> So far in our internal prototype, in the interests of failing fast, we have implemented up to step 2 of the outlined algorithm and are currently evaluating performance of a typical workload in which differed object allocation can be applied in an attempt to estimate the typical improvement in performance. Our initial prototype is implemented as a separate compilation phase which takes place after conventional, flow insensitive EA.
> Flags
> In support of this compilation phase we propose adding three JVM flags:
> Flag
> 
> Notes
> 
> Default
> 
> Notes
> 
> DoPartialEscapeAnalysis
> 
> perform partial escape analysis
> 
> true
> 
> global switch (set to false in preview build)
> 
> PartialEALogLevel
> 
> PEA log level
> 
> 0 - nothing
> 
> [0-5] more detail per log level - including the option to output the reduced graph representation above in a human readable format. this aids development and produciton debugging
> 
> PartialEAOnly
> 
> Restrict pea to only this method
> 
> '' - empty string
> 
> Equivalent to setting DoPartialEscapeAnalysis to true
> 
> Challenges
> 
>    *   Generally speaking, operating upon the ideal graph representation is challenging as it is a paradigm unique to the JVM and documentation is sparing. Tools such as the ideal graph visualizer are however excellent for improving engineer productivity when interacting with the ideal graph.
>    *   When speaking in terms of this initiative, the most challenging aspect we have faced so far is in building reduced graph representations of complex phi node interactions, e.g. code such as: (RAND?holder1:holder2).held = RAND2?intPair1:intPair2; . We expect the majority of bugs in step 2 of our outlined algorithm to reside in this space.
>    *   Another challenging area is that of "deadly embraces" where two objects have fields which point to one another and at least one of those objects is marked as escaping thus so rendering the other. For instance: object1.friend = object2; object2.friend = object1; We have not determined how to solve this problem yet but some mechanism to detect these cyclic relationships is required as we believe that these relationships are common in many of the java.util.* classes.
> 
> Impact
> We have seen that PEA can have a positive impact in JVM's,. The implementation in the GraalVM compiler<https://ssw.jku.at/Research/Papers/Stadler14/Stadler2014-CGO-PEA.pdf> is reported as having a positive benefit on improving performance and memory allocated in standardized benchmarks by 10% and 8% respectfully. For some benchmarks, the improvement in performance and memory allocated can be up to 58.5% and 33% respectively.
> Risks
> 
>    *   Increased C2 compilation time. Care must be applied to ensure an optimal implementation of this compilation phase so as to not impact JVM startup and code generation times.
>    *   The additional graph transformation adds complexity to the idea graph and can ultimately result in larger and more complex generated machine code.
> 
> Alternatives
> As an alternative to PEA, effort could be invested in making the current EA implementation of C2 be control flow sensitive. However, we estimate that this would in practice look very much like a separate compilation phase anyway, as such we recommend, at least initially, implementing PEA as a separate phase, and considering merging if it is proven to be successful.
> Relation to other projects
> 
>    *   As with many other compiler optimizations, operating upon a larger portion of a program can improve efficiency of the optimization. As such we envisage that improvements in inlining will have a direct benefit on opportunities to introduce differed object allocation.
>    *   If stack allocation is to be introduced to the JVM then PEA should be able to productively interact with this optimization.
>    *   We do not envisage any negative impact upon project Loom.
> 
> Further work
> We believe that the scheme presented here is relatively conservative and should have the aforementioned positive impact. One extension to this work would be the introduction of a fast/slow path to be used in locations where scalar replacement would be possible on fields of objects which are very infrequently escaped (such as in the reallyInfrequentBranch example above). This comes with two challenges however: 1). being additional size/complexity of code generated and 2). the slow/fast path logic itself will have a performance impact so this needs to be used in moderation - perhaps applying this approach for cases where the escaping probability is low (<10%) and monitoring this using ongoing branch entry profiling information would be a reasonable tradeoff.
> 
> Comments welcome!
> -Jason and Xin
> 


More information about the hotspot-compiler-dev mailing list