RFC: Partial Escape Analysis in HotSpot C2

Tue Nov 1 04:34:42 UTC 2022

Hi,

I would like to update on this. Last week, I posted Example-1 which is a
poster child partial Escape
case(https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79).
It shows that it is possible to conduct Standler's PEA in parser and we
can punt the obsolete object to C2 EA/SR.

I made some progress on
Example-2(https://gist.github.com/navyxliu/ee4465e2146ef99c5ae1fa1ba6b70e25).
It's not partial escaped case but parser still needs to handle it
correctly. I think it can explain better why the algorithm can guarantee
'dynamic variant'. 27 Allocate dominates 106/129(clones) like I
explained before.

A new application of PEA emerges from the prior discussion(JDK-8267532).
The Try-Catch idiom divides control-flow into hot-cold paths by nature.
Method calls such as ‘obj.close()’ in cold path may not be inlined.
There are 2 possibilities: (1)callee itself is too big (2) inliner is
out of budget in favor of the hot paths. Non-inlined callsites will
impede EA/SR because the object escapes as a receiver.

We found that we can leverage the technique ‘object cloning’ developed
in the PEA algorithm to tackle this issue. In essence, PEA would split
the original object to 2 copies: one is for the hot path and the other
one is for the cold path. This is like I did in Example-2 but along with
exceptional flow. Then C2 would conduct scalar replacement in the hot
path because the object in try-block is non-escaped!

Therefore, I would like to treat ResourceScopeCloseMin as 'Example-3'
and try it out. We know that there are different solutions for this
idiom. C2 parser could use uncommon_trap for the exceptional paths, or
we can adjust inliner policy for this case. I am not suggesting PEA is
the right way to solve it. I just want to explore a possibility.

Paul commented that I bloat code by cloning objects in Example-2. I need
to develop an optimize to recognize the 'common subexpression' and
eliminate them. I am aware of it and I mentioned it in the Risk section
of RFC. I would like to put aside of it because it's an optimization. I
think we can revert to the original allocation if the transformation
turns out not beneficial. I guess we can revert objects after Iterative
EA. JDK-8267532 is an example that too early restoration may miss the
chance of EA/SR.

thanks,
--lx

On 10/24/22 1:11 PM, Liu, Xin wrote:
> hi, Vladimir Ivanov,
> 
> Your email is my starting point. It's thorough and very insightful.
> We spent a lot of time trying to crack your questions. The RFC is the
> summary of what we got. I own your a big thank!
> 
> Sorry, I still haven't had a clear answer for "2. Move vs Split
> decision" yet. Stadler's algorithm just materialize a virtual object on
> demand. After then, its state changes from virtual to materialized.
> IMHO, I don't think it is optimal placement. I feel the optimal
> placement should be along domination frontier of the original
> AllocateNode and Exit. It is like the minimal phi construction and we
> might borrow the idea from it. I put aside it because it's indeed an
> optimization problem. Maybe it's not a big deal in common cases.
> 
> 
> I try to answer your questions inline.
> 
> On 10/20/22 5:26 PM, Vladimir Ivanov wrote:
>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>>
>>
>>
>> Hi,
>>
>>> I would like to update on this. I manage to get PEA work in Vladimir
>>> Ivanov's testcase. I put the testcase, assembly and graphs here[1].
>>>
>>> Even though it is a quite simple case, I think it demonstrates that the
>>> RFC is practical in C2. I proposed 3 major differences from Graal.
>>
>> Nice! Also, a very similar (but a much more popular case) should be
>> escape sites in catch blocks (as reported by JDK-8267532 [1]).
>>
>>> 1. The algorithm runs in parser instead of optimizer.
>>> 2. Prefer clone-and-eliminate strategy rather than
>>> virtualize-and-materialize.
>>> 3. Refrain from scalar replacement on-the-fly.
>>
>> I don't understand how you plan to implement it solely during parsing.
>> You could do some bookkeeping during parsing and capture JVM state, but
>> I don't see how to do EA that early.
>>
>> Also, please, elaborate on #3. It's not clear to me what do you mean there.
>>
> I added a PEAState for each basic block[2].
> 
> If we need to materialize a virtual object, we just create a new
> AllocateNode and its children in-place[3].
> 
> 
> In Stadler's algorithm, it inherently performs "scalar replacement". Its
> first step is to delete the original allocation node. PEA phase replaces
> LoadField nodes with scalars. Because I propose to focus on 'escaping
> object' only, we don't perform 'scalar replacement'. I leave them to the
> C2 EA/SR. that's why I say "I refrain from on-the-fly scalar replacement".
> 
> 
>>> The test excises them all. I pasted 3 graphs here[2]. When we
>>> materialize an object, we just clone it with the right JVMState. It
>>> shows that C2 IterEA can automatically picks up the obsolete object and
>>> get rid of it, as we expected.
>>>
>>> It turns out cloning an object isn't as complex as I thought. I mainly
>>> spent time on adjusting JVMState for the cloned AllocateNode. Not only
>>> to call sync_jvm(), I also need to 1) kill dead locals 2) clean stack
>>> and even avoid reexecution that bci.
>>>
>>>    JVMState* jvms = parser->sync_jvms();
>>>    SafePointNode* map = jvms->map();
>>>    parser->kill_dead_locals();
>>>    parser->clean_stack(jvms->sp());
>>>    jvms->set_should_reexecute(false);
>>>
>>> Clearly, the algorithm hasn't completed yet. I am still working on
>>> MergeProcessor, general classes fields and loop construct.
>>
>> There was a previous discussion on PEA for C2 back in 2021 [2] [3]. One
>> interesting observation related to your current experiments was:
>>
>> "4. Escape sites separate the graph into 2 parts: before and after the
>> instance escapes. In order to preserve identity invariants (and avoid
>> identity paradoxes), PEA can't just put an allocation at every escape
>> site. It should respect the order of escape events and ensure that the
>> very same object is observed when multiple escape events happen.
>>
>> Dynamic invariant can be formulated as: there should never be more than
>> 1 allocation at runtime per 1 eliminated allocation.
>>
>> Considering non-escaping operations can force materialization on their
>> own, it poses additional constraints."
>>
>> So, when you clone an allocation, you should ensure that only a single
>> instance can be observed. And safepoints can be escape points as well
>> (rematerialization in case of deoptimization event).
>>
> This is fairly complex. That's why I suggest to focus on 'escaping
> objects' only in C2 PEA. Please assume that all non-escape objects
> remain intact after our PEA.
> 
> I think we can guarantee dynamic invariant here. First of all, we
> traverse basic blocks in reverse-post-order(RPO). It's in the same
> direction of execution. An object allocation is virtual initially. Once
> a virtual object becomes materialized, it won't change back. We track
> allocation states so we know that. The following appearances of an
> materialized object won't cause 'materialization' again. It will be
> treated as a ordinary object.
> 
> Here is Example2 and I am still working on it. Beside the place where x
> escapes to _cache, we also need to materialize the virtual object before
> merging two basic blocks. this is described in 5.3 Merge nodes of
> Stadler's CGO paper.
> 
> 
> class Example2 {
>     private Object _cache;
>     public Object foo(boolean cond) {
>         Object x = new Object();
> 
>         blackhole();
> 
>         if (cond) {
>             _cache = x;
>         }
>         return x;
>     }
> 
>     public static void blackhole() {}
> ...
> }
> 
> We expect to see code as follows after PEA. "x2 = new Object()" is the
> result of materialization at merging point.
> 
>     public Object foo(boolean cond) {
>         Object x0 = new Object();
> 
>         blackhole();
> 
>         if (cond) {
>             x1 = new Object();
>             _cache = x1;
>         }
>         x3 = phi(x2 = new Object(), x1);
>         return x3;
>     }
> 
> We've proved that the obsolete object is either dead or scalar
> replaceable after PEA[4]. We expect C2 EA/SR to get rid of x0 down the
> road. Please note that x0(the obsolete obj) dominates all clones(x1 and
> x2) because we materialize them before merging, we can guarantee dynamic
> invariant.
> 
> I plan to mark the original AllocateNode 'obsolete' if PEA does
> materialize it. I am going to assert in MacroExpansion that it won't
> expand an obsolete object. If it did, it would introduce redundancy and
> violate the 'dynamic invariant'.
> 
> 
> [1]https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2021-May/047536.html
> [2]https://github.com/navyxliu/jdk/blob/PEA_parser/src/hotspot/share/opto/parse.hpp#L171
> 
> [3]
> https://github.com/navyxliu/jdk/blob/PEA_parser/src/hotspot/share/opto/parseHelper.cpp#L367
> 
> [4]https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2022-October/059432.html
> 
> 
>>> I haven't figured out how to test PEA in a reliable way. It is not easy
>>> for IR framework to capture node movement. If we measure allocation
>>> rate, it will be subject to CPU capability and also the sampling rate. I
>>> came up with an idea so-called 'Epsilon-Test'. We create a JVM with
>>> EpsilonGC and a fixed Java heap. Because EpsilonGC never replenish the
>>> java heap, we can count how many iterations a test can run before OOME.
>>> The less allocation made in a method, the more iterations HotSpot can
>>> execute the method. This isn't perfect either. I found that hotspot
>>> can't guarantee to execute the final-block in this case[3]. So far, I
>>> just measure execution time instead.
>>
>> It sounds more like a job for benchmarks, but focused on measuring
>> allocation rate (per iteration). ("-prof gc" mode in JMH terms.)
>>
>> Personally, I very much liked the IR framework-based approach Cesar used
>> in the unit test for allocation merges [4]. Do you see any problems with
>> that?
>>
>> Best regards,
>> Vladimir Ivanov
>>
> 
> Okay. I will follow this direction.
> 
> thanks,
> --lx
> 
>> [1] https://bugs.openjdk.org/browse/JDK-8267532
>> [2]
>> https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2021-May/047486.html
>> [3]
>> https://mail.openjdk.org/pipermail/hotspot-compiler-dev/2021-May/047536.html
>> [4] https://github.com/openjdk/jdk/pull/9073
>>
>>
>>>
>>> Appreciate your feedbacks or you spot any redflag.
>>>
>>> [1] https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79
>>>
>>> [2]
>>> https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79?permalink_comment_id=4341838#gistcomment-4341838
>>>
>>> [3] https://gist.github.com/navyxliu/9c325d5c445899c02a0d115c6ca90a79#file-example1-java-L43
>>>
>>> thanks,
>>> --lx
>>>
>>>
>>>
>>>
>>> On 10/12/22 11:17 AM, Vladimir Kozlov wrote:
>>>> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>>>>
>>>>
>>>>
>>>> On 10/12/22 7:58 AM, Liu, Xin wrote:
>>>>> hi, Vladimir,
>>>>>> You should show that your implementation can rematirealize an object
>>>>> at any escape site.
>>>>>
>>>>> My understanding is I suppose to 'materialize' an object at any escape site.
>>>>
>>>> Words ;^)
>>>>
>>>> Yes, I mistyped and misspelled.
>>>>
>>>> Vladimir K
>>>>
>>>>>
>>>>> 'rematerialize' refers to 'create an scalar-replaced object on heap' in
>>>>> deoptimization. It's for interpreter as if the object was created in the
>>>>> first place. It doesn't apply to an escaped object because it's marked
>>>>> 'GlobalEscaped' in C2 EA.
>>>>>
>>>>>
>>>>> Okay. I will try this idea!
>>>>>
>>>>> thanks,
>>>>> --lx
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 10/11/22 3:12 PM, Vladimir Kozlov wrote:
>>>>>> Also in your test there should be no merge at safepoint2 because `obj` is "not alive" (not referenced) anymore.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenPGP_0xB9D934C61E047B0D.asc
Type: application/pgp-keys
Size: 3675 bytes
Desc: OpenPGP public key
URL: <https://mail.openjdk.org/pipermail/hotspot-compiler-dev/attachments/20221031/f51fda5a/OpenPGP_0xB9D934C61E047B0D-0001.asc>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: OpenPGP_signature
Type: application/pgp-signature
Size: 665 bytes
Desc: OpenPGP digital signature
URL: <https://mail.openjdk.org/pipermail/hotspot-compiler-dev/attachments/20221031/f51fda5a/OpenPGP_signature-0001.sig>