RFC, C2 Partial Escape Analysis

Tatton, Jason jptatton at amazon.com
Thu May 27 12:08:32 UTC 2021


Hello everyone!

We would like to invite discussion on a proposal for Partial Escape Analysis (PEA) to be added to the JVM C2 hotspot compiler.

Specifically, we are looking to introduce a new compilation phase in order to enable optimizations such as scalar replacement in situations currently not encompassed by the existing control flow insensitive Escape Analysis (EA) phase of C2.

PEA is a control flow sensitive variant of EA. We believe that by adding this compilation phase we will be able to improve compiled code performance and reduce heap size requirements by up to 10% and 8% respectively.

At AWS, we have been prototyping a partial implementation of PEA with some success so far. We are at the stage now where we feel wider community involvement would be beneficial to the initiative.
Background: What is Escape Analysis
Escape analysis (EA) enables the HotSpot JIT C2 compiler to detect cases where objects do not "escape" the method/thread which created them. Whilst EA itself is not an optimization, the results of the analysis are used in order to enable a number of optimization techniques (as an alternative to conventional object allocation on the heap) including:

  *   Stack allocation. Instead of allocating an object on the heap, it is allocated on the stack, where it is not eligible for garbage collection. This reduces the amount of work which must be done by garbage collectors, improving performance. C2 does not currently implement this optimization, though some other JVM implementations, such as IBM J9, do.
  *   Scalar replacement. This goes a step further than stack allocation: here the object allocation can be eliminated altogether and all interaction with the object's fields is replaced by local variables. Eliminating object allocation and field referencing, and reducing load on the garbage collector, all improve performance and reduce memory footprint.
  *   Lock elision. If a Java object's monitor is used but the object does not escape, then the lock can never be contended. In this case synchronization on said lock can be removed.
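To make the lock elision case concrete, here is a small self-contained sketch (our own example, not taken from the C2 source): the lock object is local to sum() and never escapes, so once EA proves this the monitor enter/exit operations can in principle be removed by the JIT.

```java
public class LockElisionDemo {
    // The Object created here never escapes sum(), so once EA proves this
    // the synchronized block's monitor operations become candidates for elision.
    static int sum(int n) {
        Object lock = new Object();   // thread-local, non-escaping
        int total = 0;
        for (int i = 1; i <= n; i++) {
            synchronized (lock) {     // lock can never be contended
                total += i;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(4)); // prints 10
    }
}
```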

Existing Limitations
The current EA implementation is flow insensitive. Let's examine what this means in practice...

Here is an example of an object which does not escape the bounds of the method/thread creating it and is therefore considered non-escaping and eligible for optimization. The current implementation of EA in C2 detects this case:

public class Pair{
    public int a;
    public int b;
    public Pair(int a, int b){
        this.a = a;
        this.b = b;
     }
}

public static int foo(int a, int b){
    Pair mypair = new Pair(a, b);
    return mypair.a;
}
and will enable scalar replacement such that the above code in foo can be transformed into the following form, saving an object allocation of Pair on every invocation of foo.

public class Pair{
    public int a;
    public int b;
    public Pair(int a, int b){
        this.a = a;
        this.b = b;
     }
}

public static int foo(int a, int b){
    // Pair mypair = new Pair(a, b);
    return a; // don't bother with object allocation of Pair
}
However, EA is control flow insensitive: it will not enable scalar replacement of an object's fields where there is branching code with any path on which the object may escape the lifetime of the thread/method which created it, regardless of how unlikely that branch is to execute at runtime. For example:

public static int foo(){
    Pair mypair = new Pair(12, 13);
    if(something()){
        // maybe this branch is hit only 0.01% of the time
        return usesAndPersistsObject(mypair); // escapes here
    }else{
        return mypair.a; // no scalar replacement here
    }
}
Above we see that in one branch the Pair instance mypair escapes; mypair is therefore considered 'globally' escaping and must always be allocated on the heap. This means that scalar replacement cannot be used even when the escaping branch is entered only very infrequently.
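To illustrate what deferring the allocation would buy here, the following is a hand-written sketch of code roughly equivalent to what a PEA transformation of foo() could produce (our own illustration; Pair, the rare flag and usesAndPersistsObject are stand-ins for the code above):

```java
public class PeaSketch {
    static class Pair { int a, b; Pair(int a, int b){ this.a = a; this.b = b; } }
    static Pair persisted;                       // stands in for the escape target
    static boolean rare = false;                 // stands in for something()

    static int usesAndPersistsObject(Pair p) { persisted = p; return p.a + p.b; }

    // Hand-written equivalent of a deferred-allocation transformation of foo():
    static int foo() {
        int a = 12, b = 13;                      // fields kept as scalars
        if (rare) {                              // maybe hit only 0.01% of the time
            Pair mypair = new Pair(a, b);        // allocation deferred to here
            return usesAndPersistsObject(mypair);
        }
        return a;                                // common path: no allocation at all
    }

    public static void main(String[] args) {
        System.out.println(foo()); // fast path: prints 12
        rare = true;
        System.out.println(foo()); // rare branch materializes: prints 25
    }
}
```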

PEA aims to largely solve this problem.
Partial Escape Analysis
Partial Escape Analysis (PEA) is a control flow sensitive variant of EA. Like EA it acts upon the ideal graph IR. With PEA we propose introducing the concept of a deferred object allocation (DOA). We propose considering only objects which have been identified as escaping in the EA phase of compilation, and deferring the allocation of those objects to the latest point in program/branch execution before the object escapes. Scalar replacement and lock elision can then be applied up to that point of deferred object allocation. Sometimes the latest point for a particular object may resolve to the point of initial object allocation, in which case PEA will offer no benefit.
High level algorithm
We are interested in only a subset of ideal graph nodes. Specifically nodes related to escaping objects:

  *   Allocation(A): Allocation and Initialization nodes.
  *   Branch(B): Branching nodes: IfNode, IfTrue and IfFalse nodes, and phi nodes (e.g. to capture cases such as staticfield = condition ? obj1 : obj2, as well as loops and jumps related to break/continue etc.).
  *   Escape(E): Nodes concerning an object assigned to a static field, or an object used as a parameter in a method invocation.
  *   Usage(U): Nodes concerning a store to an object field, or a load from an object field.
  *   Return(R): Return/ReThrow nodes; these behave similarly to Escape nodes.
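As a hypothetical illustration of such a phi interaction, the conditional below merges two allocations into a single value before it escapes to a static field (our own example; lastThing is a stand-in for a global holder field):

```java
public class PhiEscapeDemo {
    static class Pair { int a, b; Pair(int a, int b){ this.a = a; this.b = b; } }
    static Pair lastThing;                        // hypothetical escape target

    // The conditional merges p1 and p2 into one phi value in the ideal graph,
    // so it is the phi, not a single allocation, that escapes to the field.
    static int pick(boolean cond) {
        Pair p1 = new Pair(1, 2);
        Pair p2 = new Pair(3, 4);
        lastThing = cond ? p1 : p2;               // phi of two allocations escapes
        return lastThing.a;
    }

    public static void main(String[] args) {
        System.out.println(pick(true));  // prints 1
        System.out.println(pick(false)); // prints 3
    }
}
```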
We propose PEA to operate via the following steps:

  1.  Graph reduction. This stage visits every node in the ideal graph and, for the aforementioned nodes of interest, produces a reduced graph structure (example below) consisting of a reduced control flow sensitive set of nodes. The non-branching nodes of this reduced graph are structured into basic blocks, with IfTrue/IfFalse nodes used to denote the branches. This subset of nodes is used in steps 2 and 3 below.
  2.  Deferred Object Allocation algorithm. This stage figures out where opportunities for DOA exist for each allocated object that can escape. By considering branches and object usage, it determines the latest point to which an object's allocation can be deferred, aka the point at which it must be materialized on the heap in a branch so as to maintain program semantics. There may be more than one branch path resulting in a materialization of an allocated object. There may also be instances where the object is required to be in a materialized state before any branch node has been encountered - in this case no further optimization can be enabled by PEA and subsequent processing can be skipped.
  3.  Graph transformation. After applying DOA in step 2 we know the latest points at which an object may be materialized on the heap. With this knowledge, all usage nodes prior to these points can be scalar replaced, and all usage/escaping nodes after a materialization need to be repointed to the nodes resulting from that materialization. Some repointing of the internal state of deferred objects on the heap may be necessary at the point of materialization so as to maintain program semantics. We must also ensure consistency in the case of deoptimization.
Throughout all these mini phases, in the case of potential failure, the algorithm will skip subsequent processing.
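As a very rough illustration of the DOA idea in step 2 (our own simplification, not the prototype's actual data structures), the sketch below classifies the events along a single straight-line path and counts how many uses precede the first escape, i.e. how many could be scalar replaced if the allocation were deferred to that escape point:

```java
import java.util.List;

public class DoaPathSketch {
    enum Kind { ALLOC, USE, ESCAPE, RETURN }

    /** Index of the first event forcing materialization, or -1 if none. */
    static int materializationPoint(List<Kind> path) {
        for (int i = 0; i < path.size(); i++)
            if (path.get(i) == Kind.ESCAPE) return i;
        return -1;
    }

    /** Uses strictly before the materialization point can be scalar replaced. */
    static int scalarReplaceableUses(List<Kind> path) {
        int m = materializationPoint(path);
        int end = (m == -1) ? path.size() : m;
        int uses = 0;
        for (int i = 0; i < end; i++)
            if (path.get(i) == Kind.USE) uses++;
        return uses;
    }

    public static void main(String[] args) {
        // A fall-through path with no escape: every use is scalar replaceable
        // and the allocation is never needed at all.
        List<Kind> fallThrough = List.of(Kind.ALLOC, Kind.USE, Kind.RETURN);
        System.out.println(scalarReplaceableUses(fallThrough)); // prints 1

        // An escaping path: the use after the escape must go through the heap object.
        List<Kind> escaping = List.of(Kind.ALLOC, Kind.ESCAPE, Kind.USE, Kind.RETURN);
        System.out.println(scalarReplaceableUses(escaping)); // prints 0
    }
}
```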
Example
Let us now examine a case where PEA works well:

0:  public static boolean RAND  = true; // set externally
1:  public static boolean RAND2 = true; // set externally
2:
3:  public static int infrequentBranch(int f1, int f2) {
4:     Pair intPair = new Pair(f1, f2);
5:
6:     if (RAND) { // external call sets this to true 10% of the time
7:         intPair.a += 1; // example 'Use' of intPair
8:         Global.lastThing = intPair; // escapes here
9:         return 1;
10:    }
11:
12:    // scalar replacable usage...
13:    intPair.a += 12;
14:
15:    if (RAND2) { // external call sets this to true 10% of the time
16:        intPair.a += 10; // example 'Use' of intPair
17:        Global.lastThing = intPair; // escapes here
18:        return 2;
19:    }
20:
21:    // if we get to this point there should have been
22:    // no object allocation (with PEA enabled)
23:    return 3;
24: }
Above is an example of code containing two branching paths in which the intPair object escapes. If execution of the above code gets to the end of the method, with PEA enabled, intPair will not be allocated on the heap.

Let us now look at the reduced graph of the above method:

4: A:Initialize[40]
6: B:If[110]
 -> IfTrue: (bb: 1)
 7: U:StoreI[121] of: Initialize[40]
 8: E:StoreN[126] of: Initialize[40]
 9: R:Return[130]
 -> IfFalse: (bb: 2)
 X
13: U:StoreI[147] of: Initialize[40]
15: B:If[158]
 -> IfTrue: (bb: 3)
 16: U:StoreI[170] of: Initialize[40]
 17: E:StoreN[173] of: Initialize[40]
 18: R:Return[180]
 -> IfFalse: (bb: 4)
 X
23: R:Return[194]
The reduced graph is composed of a root basic block (bb) (with id of 0) and four child bb's (1,2,3,4) corresponding to the code in the two if statements. Each if statement generates an IfTrue and IfFalse node each having a basic block associated. For readability, line number information is provided above (our prototype, discussed shortly, also includes bci information).

The graph transformation mini phase is interested in the opportunities for deferred allocation identified by the DOA algorithm. In this example, that corresponds to the Initialize[40] node seen at the root bb. Observe that there are two DOA points, one in each of the IfTrue branches at lines 8 and 17. Since both of these occur within branches which result in early termination of the method, the U:StoreI[147] node (corresponding to: intPair.a += 12;) may be scalar replaced. The nodes U:StoreI[121] and U:StoreI[170] may optionally be scalar replaced, but there is little benefit because the object they reference is materialized onto the heap soon after they execute.

The DOA algorithm of PEA is able to handle more complex cases. For example, say we have the following code:

0:  public static boolean RAND  = true; // set externally
1:  public static boolean RAND2 = true; // set externally
2:
3:  public static int reallyInfrequentBranch(int f1, int f2) {
4:     Pair intPair = new Pair(f1, f2);
5:
6:     if (RAND) { // external call sets this to true 1% of the time
7:        if (RAND2) { // external call sets this to true 1% of the time
8:           Global.lastThing = intPair; // escapes here
9:           unknownCall(); // non-inlined call - may e.g. trigger a separate thread interacting with the state of Global.lastThing
10:          intPair.a += 12; // operates on materialized object
11:       }
12:    }
13:
16:    return intPair.a;
17: }
At first glance it would appear that the above code would benefit from PEA, since the escape of intPair occurs within not just one but two infrequently taken branches. However, PEA cannot be used here: when the intPair object escapes at line 8, the non-inlined unknownCall() may perform an operation on the state of intPair (perhaps from a separate thread), so to maintain program semantics the object needs to be in a materialized state from the point of initial declaration (i.e. created on the heap as normal).

As far as the DOA algorithm is concerned for the above code, when it processes the bb associated with the if statement on line 7 it will first mark the intPair object as escaping and materialized within the subsequent context of the branch, thus rendering the use on line 10 not scalar replaceable. When DOA has finished processing this bb and moved on to the root bb, it will come to the return statement on line 16. Since intPair is marked as being in an escaped and previously materialized state within a nested branch bb, this materialization will be "promoted" up to the root bb - which here happens to be the location of the initial allocation, thus eliminating this as an opportunity for PEA.
Progress at AWS
So far in our internal prototype, in the interests of failing fast, we have implemented up to step 2 of the outlined algorithm and are currently evaluating performance of a typical workload in which deferred object allocation can be applied, in an attempt to estimate the typical improvement in performance. Our initial prototype is implemented as a separate compilation phase which takes place after conventional, flow insensitive EA.
Flags
In support of this compilation phase we propose adding three JVM flags:
Flag                     Description                        Default             Notes

DoPartialEscapeAnalysis  perform partial escape analysis    true                global switch (set to false in preview build)

PartialEALogLevel        PEA log level                      0 - nothing         [0-5] more detail per log level, including the option to output the reduced graph representation above in a human-readable format; this aids development and production debugging

PartialEAOnly            restrict PEA to only this method   '' - empty string   equivalent to setting DoPartialEscapeAnalysis to true

Challenges

  *   Generally speaking, operating upon the ideal graph representation is challenging, as it is a paradigm unique to the JVM and documentation is sparse. Tools such as the Ideal Graph Visualizer are, however, excellent for improving engineer productivity when interacting with the ideal graph.
  *   For this initiative specifically, the most challenging aspect we have faced so far is building reduced graph representations of complex phi node interactions, e.g. code such as: (RAND ? holder1 : holder2).held = RAND2 ? intPair1 : intPair2;. We expect the majority of bugs in step 2 of our outlined algorithm to reside in this space.
  *   Another challenging area is that of "deadly embraces", where two objects have fields which point to one another and at least one of those objects is marked as escaping, thus rendering the other escaping as well. For instance: object1.friend = object2; object2.friend = object1;. We have not yet determined how to solve this problem, but some mechanism to detect these cyclic relationships is required, as we believe that such relationships are common in many of the java.util.* classes.
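Sketched in full (our own example; lastThing is a hypothetical escape target), the deadly embrace looks as follows: once object a escapes, b becomes reachable from the heap through a.friend and must be materialized as well.

```java
public class DeadlyEmbraceDemo {
    static class Node { Node friend; int v; Node(int v){ this.v = v; } }
    static Node lastThing;                        // hypothetical escape target

    static int demo() {
        Node a = new Node(1);
        Node b = new Node(2);
        a.friend = b;                             // cyclic relationship:
        b.friend = a;                             // each object points at the other
        lastThing = a;                            // a escapes, so b effectively does too
        return lastThing.friend.v;                // reaches b through the heap
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 2
    }
}
```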

Impact
We have seen that PEA can have a positive impact in JVMs. The implementation in the GraalVM compiler<https://ssw.jku.at/Research/Papers/Stadler14/Stadler2014-CGO-PEA.pdf> is reported as improving performance and memory allocation in standardized benchmarks by 10% and 8% respectively. For some benchmarks, the improvement in performance and memory allocated can be up to 58.5% and 33% respectively.
Risks

  *   Increased C2 compilation time. Care must be applied to ensure an optimal implementation of this compilation phase so as to not impact JVM startup and code generation times.
  *   The additional graph transformation adds complexity to the ideal graph and can ultimately result in larger and more complex generated machine code.

Alternatives
As an alternative to PEA, effort could be invested in making the current EA implementation of C2 control flow sensitive. However, we estimate that in practice this would look very much like a separate compilation phase anyway. As such we recommend, at least initially, implementing PEA as a separate phase, and considering a merge if it proves successful.
Relation to other projects

  *   As with many other compiler optimizations, operating upon a larger portion of a program can improve the efficiency of the optimization. As such, we envisage that improvements in inlining will have a direct benefit on opportunities to introduce deferred object allocation.
  *   If stack allocation is to be introduced to the JVM then PEA should be able to productively interact with this optimization.
  *   We do not envisage any negative impact upon project Loom.

Further work
We believe that the scheme presented here is relatively conservative and should have the aforementioned positive impact. One extension to this work would be the introduction of a fast/slow path in locations where scalar replacement would be possible on fields of objects which escape very infrequently (such as in the reallyInfrequentBranch example above). This comes with two challenges however: 1) the additional size/complexity of the generated code, and 2) the slow/fast path logic itself has a performance impact, so it needs to be used in moderation - perhaps applying this approach only where the escaping probability is low (<10%), and monitoring this using ongoing branch entry profiling information, would be a reasonable tradeoff.
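As a hand-written sketch of what such a fast/slow split might look like (our own illustration using the reallyInfrequentBranch shape; unknownCall() is stubbed out and lastThing stands in for Global.lastThing):

```java
public class FastSlowSketch {
    static class Pair { int a, b; Pair(int a, int b){ this.a = a; this.b = b; } }
    static Pair lastThing;                         // hypothetical escape target
    static boolean RAND = false, RAND2 = false;    // rarely both true

    static void unknownCall() { /* stub for the non-inlined call */ }

    static int reallyInfrequentBranch(int f1, int f2) {
        if (RAND && RAND2) {                       // slow path, rarely taken
            Pair intPair = new Pair(f1, f2);       // materialize only here
            lastThing = intPair;                   // escapes
            unknownCall();                         // may touch lastThing's state
            intPair.a += 12;                       // must go through the heap object
            return intPair.a;
        }
        return f1;                                 // fast path: field stays a scalar
    }

    public static void main(String[] args) {
        System.out.println(reallyInfrequentBranch(5, 6)); // fast path: prints 5
        RAND = RAND2 = true;
        System.out.println(reallyInfrequentBranch(5, 6)); // slow path: prints 17
    }
}
```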

Comments welcome!
-Jason and Xin



More information about the hotspot-compiler-dev mailing list