RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents

Reingruber, Richard richard.reingruber at sap.com
Wed Jul 22 08:20:15 UTC 2020


Hi Goetz,

> I'll answer to the obvious things in this mail now.
> I'll go through the code thoroughly again and write 
> a review of my findings thereafter.

Sure. I trimmed my citations to the relevant parts.

> > The delta includes many changes in comments, renaming of names, etc. So
> > I'd like to summarize
> > functional changes:
> > 
> > * Collected all the code for the testing feature DeoptimizeObjectsALot in
> > compileBroker.cpp and reworked it.
> Thanks, this makes it much more compact.

> >   With DeoptimizeObjectsALot enabled internal threads are started that
> > deoptimize frames and
> >   objects. The number of threads started is given with
> > DeoptimizeObjectsALotThreadCountAll and
> >   DeoptimizeObjectsALotThreadCountSingle. The former targets all existing
> > threads whereas the
> >   latter operates on a single thread selected round robin.
> > 
> >   I removed the mode where deoptimizations were performed at every nth
> > exit from the runtime. I never used it.

> Do I get it right? You have a n:1 and a n:all test scenario.
>  n:1: n threads deoptimize 1 Java thread    where n = DOALThreadCountSingle
>  n:m: n threads deoptimize all Java threads where n = DOALThreadCountAll?

Not quite.

-XX:+DeoptimizeObjectsALot // required
-XX:DeoptimizeObjectsALotThreadCountAll=m
-XX:DeoptimizeObjectsALotThreadCountSingle=n

Will start m+n internal threads, all of them operating on the existing JavaThreads using
EscapeBarriers. The difference between the two thread types is the scope of a single EscapeBarrier:
it targets either just a single thread or all existing threads at once. If just one thread is
targeted per EscapeBarrier, it is not always the same thread; the targets are selected round robin.
So there will be n threads that independently select a single target thread round robin per
EscapeBarrier, and m threads that target all threads with every EscapeBarrier.
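
For illustration (the numbers are just an example):

-XX:+DeoptimizeObjectsALot
-XX:DeoptimizeObjectsALotThreadCountAll=2
-XX:DeoptimizeObjectsALotThreadCountSingle=3

starts 5 internal threads: 2 that target all existing JavaThreads with every EscapeBarrier, and 3
that each select a single target JavaThread round robin per EscapeBarrier.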


> > * EscapeBarrier::sync_and_suspend_one(): use a direct handshake and
> > execute it always independently
> >   of is_thread_fully_suspended().
> Is this also a performance optimization?

Maybe a minor one.

> > * JavaThread::wait_for_object_deoptimization():
> >   - Bugfix: the last check of is_obj_deopt_suspend() must be /after/ the
> > safepoint check! This
> >     caused issues with not walkable stacks with DeoptimizeObjectsALot.
> OK. As I understand, there was one safepoint check in the old version, 
> now there is one in each iteration.  I assume this is intended, right?

Yes, it is. The important points here are: (A) a safepoint check is needed /after/ leaving a safe
state (_thread_in_native, _thread_blocked), and (B) shared variables that are modified at safepoints
or in handshakes need to be reread /after/ the safepoint check.

BTW: I only noticed now that since JDK-8240918 JavaThreads themselves must disarm their polling
page. Originally (before handshakes) this was done by the VM thread. With handshakes it was done by
the thread executing the handshake op. This was changed for OrderAccess::cross_modify_fence(): the
poll is left armed if the thread is in native, and since JDK-8240918 it is always left armed. So
when a thread leaves a safe state (native, blocked) and there was a handshake/vm op, it will always
call SafepointMechanism::block_if_requested_slow(), even if the handshake/vm operation has already
been processed and everybody else is happily executing bytecodes :)

Still (A) and (B) hold.

> >   - Added limited spinning inspired by HandshakeSpinYield to fix regression in
> > microbenchmark [1]
> Ok.  Nice improvement, nice catch!

Yes. It certainly took some time to find out.

> > 
> > I refer to some more changes answering your questions and comments inline
> > below.
> > 
> > Thanks,
> > Richard.
> > 
> > [1] Microbenchmark:
> > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.6.microbenchmark/
> > 


> > > I understand you annotate at safepoints where the escape analysis
> > > finds out that an object is "better" than global escape.
> > > These are the cases where the analysis identifies optimization
> > > opportunities. These annotations are then used to deoptimize
> > > frames and the objects referenced by them.
> > > Doesn't this overestimate the optimized
> > > objects?  E.g., eliminate_alloc_node has many cases where it bails
> > > out.
> > 
> > Yes, the implementation is conservative, but it is comparatively simple and
> > the additional debug
> > info is just 2 flags per safepoint. 
> Thanks. It also helped that you explained to me offline that 
> there are more optimizations than only lock elimination and scalar
> replacement done based on the ea information.
> The ea refines the IR graph which allows follow-up optimizations 
> which can not easily be tracked back to the escaping objects or 
> the call sites where they do not escape. 
> Thus, if there are non-global escaping objects, you have to 
> deoptimize the frame.
> Did I repeat that correctly?

Mostly, but there are also cases where deoptimization is required if and only if ea-local objects
are passed as arguments. This is the case when values are not read directly from a frame, but from
a callee frame.
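
A minimal Java sketch of such a case (class and method names are invented; whether the lock really
gets eliminated depends on the compiler and on flags like EliminateLocks):

  class Example {
      static class Point { int x; }

      int caller() {
          Point p = new Point();  // ArgEscape: passed to a callee but does not escape globally
          synchronized (p) {      // lock may be eliminated because p is thread-local
              callee(p);          // not inlined
          }
          return p.x;
      }

      void callee(Point p) {
          // A JVMTI agent that reaches p here, e.g. a debugger reading the local 'p' of
          // this (callee) frame, or GetOwnedMonitorStackDepthInfo on the thread, can
          // observe the eliminated lock. The optimized state belongs to caller(), so
          // caller()'s frame must be deoptimized (and the lock re-acquired) although
          // the access happens in the callee.
      }
  }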

> With this understanding, a row of my proposed renamings/comments
> are obsolete.

Ok.


> > On the other hand, those JVMTI operations
> > that really trigger
> > deoptimizations are expected to be comparatively infrequent such that
> > switching to the interpreter
> > for a few microseconds will hardly have an effect.
> That sounds reasonable.

> > I've done microbenchmarking to check this.
> > 
> > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.6.microbenchmark/
> > 
> > I found that in the worst case performance can be impacted by 10%. If the
> > agent is extremely active
> > and does relevant JVMTI calls like GetOwnedMonitorStackDepthInfo() every
> > millisecond or more often,
> > then the performance impact can be 30%. But I would think that this is not
> > realistic. These calls
> > are issued in interactive sessions to analyze deadlocks.
> Ok. 
 
> > We could get more precise deoptimizations by adding a third flag per
> > safepoint for ea-local objects
> > among the owned monitors. This would help improve the worst case in the
> > benchmark. But I'm not
> > convinced it is worth it.
> > 
> > Refer to the README.txt of the microbenchmark for a more detailed
> > discussion.
 
> > > pcDesc.hpp
> > >
> > > I would like to see some documentation of the methods. 
> > Done. I didn't take your text, though, because I only noticed it after writing
> > my own. Let me know if you are not ok with it.
> That's fine. My texts were only proposals, you as author know better
> what goes on anyways.

Ok.

> > > scopeDesc.cpp
> > >
> > >   Besides refactoring copy escape info from pcDesc to scopeDesc
> > >   and add accessors. Trivial.
> > >
> > >   In scopeDesc.hpp you talk about NoEscape and ArgEscape.
> > >   These are opto terms, but scopeDesc is a shared data structure
> > >   that does not depend on a specific compiler.
> > >   Please explain what is going on without using these terms.
> > 
> > Actually these are not too opto specific terms. They are used in the paper
> > referenced in
> > escape.hpp. Also you can easily google them. I'd rather keep the comments
> > as they are.
> Hmm, I'm not really happy with this, as also the papers
> are for the compiler community, and probably not familiar to 
> others that work with HotSpot.
> But stay with your terms if you think it makes it clearer.
> Anyway, now that I understand why you use conservative
> information (see above), the descriptions I had in mind are not precise.

Ok.

> > > callnode.hpp
> > >
> > > You add functionality to annotate callnodes with escape information
> > > This is carried through code generation to final output where it is
> > > added to the compiled methods meta information.
> > >
> > > At Safepoints in general jvmti can access
> > >   - Objects that were scalar replaced. They must be reallocated.
> > >     (Flag EliminateAllocations)
> > >   - Objects that should be locked but are not because they never
> > >     escape the thread. They need to be relocked.
> > >
> > > At calls, Objects where locks have been removed escape to callees.
> > > We must persist this information so that if jvmti accesses the
> > > object in a callee, we can determine by looking at the caller that
> > > it needs to be relocked.
> > 
> > Note that the ea-optimization does not have to be at the current location, it can also
> > follow when control
> > returns to the caller. Lock elimination isn't the only relevant optimization.
> Yes, I understood now, see above. Thanks for explaining.

Ok.

> > Accesses to instance
> > members or array elements can be optimized as well.
> You mean the compiler can/will ignore volatile or memory ordering
> requirements for non-escaping objects? Sounds reasonable to do.

Yes, for instance. Even without volatile modifiers it will eliminate accesses. Here is an example:
method A has a NoEscape allocation O that is not scalar replaced. A calls method B, which is not
inlined. If you use your debugger to break in B and then modify a field of O, the modification
would have no effect without deoptimization, because the JIT assumes that B cannot modify O without
a reference to it.
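
To make that concrete, here is a minimal Java sketch of the scenario (names are invented for
illustration; whether O really stays NoEscape and the read is really optimized depends on the
compiler and flags):

  class Example {
      static class Point { int x; }

      int methodA() {
          Point o = new Point();  // NoEscape allocation O; assume it is not scalar replaced
          o.x = 1;
          methodB();              // not inlined; O is not passed, so compiled code for A may
                                  // keep o.x in a register or fold the read below to 1
          return o.x;
      }

      void methodB() {
          // Breaking here in a debugger and writing O's field (the object is reachable
          // through A's locals via JVMTI) has no effect on the value A returns unless
          // A's frame is deoptimized first.
      }
  }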

> > You are right, it is not correct how flags are checked. Especially if only
> > running with the JVMCI compiler.
> >
> > I changed Deoptimization::deoptimize_objects_internal() to make
> > reallocation and relocking dependent
> > on similar checks as in Deoptimization::fetch_unroll_info_helper().
> > Furthermore EscapeBarriers are
> > conditionally activated depending on the following (see EscapeBarrier ctors):
> > 
> > JVMCI_ONLY(UseJVMCICompiler) NOT_JVMCI(false)
> > COMPILER2_PRESENT(|| DoEscapeAnalysis)
> > 
> > So the enhancement can be practically completely disabled by disabling
> > DoEscapeAnalysis, which is
> > what C2 currently does if JVMTI capabilities that allow access to local
> > references are taken.
> Thanks for fixing. 

Thanks for finding :)

> > I went for the latter.
> > 
> > > In fetch_unroll_info_helper, I don't understand why you need
> > >  && !EscapeBarrier::objs_are_deoptimized(thread, deoptee.id())) {
> > > for eliminated locks, but not for scalar replaced objects?
> > 
> > In short reallocation is idempotent, relocking is not.
> > 
> > Without the enhancement Deoptimization::realloc_objects() can already be
> > called more than once for a frame:
> > 
> > First call in materializeVirtualObjects() (also iterateFrames()).
> > 
> > Second (indirect) call in fetch_unroll_info_helper().
> > 
> > The objects from the first call are saved as jvmti deferred updates when
> > realloc_objects()
> > returns. Note that there is no relationship to jvmti. The thing in common is
> > that updates cannot be
> > directly installed into a compiled frame, it is necessary to deoptimize the
> > frame and defer the
> > updates until the compiled frame gets replaced. Every time the vframes
> > corresponding to the owner
> > frame are iterated, they get the deferred updates. So in
> > fetch_unroll_info_helper() the
> > GrowableArray<compiledVFrame*>* chunk reference them too. All
> > references to the objects created by
> > the second (indirect) call to realloc_objects() are never used, because
> > compiledVFrame accessors to
> > locals, expressions, and monitors override them with the deferred updates.
> > The objects become
> > unreachable and get gc'ed.
> OK, so repeatedly computed vFrames always have the first version of 
> reallocated objects by construction, so it needs not be handled here.
> But also due to construction, objects might be allocated just to be
> discarded.

Yes.
 
> > materializeVirtualObjects() does not bother with relocking.
> > deoptimize_objects_internal(), which is
> > introduced by the enhancement, does relock objects, after all the lock
> > elimination becomes illegal 
> > with the change in escape state. Relocking twice does not work, so the
> > enhancement avoids it by
> > checking EscapeBarrier::objs_are_deoptimized(thread, deoptee.id()).
> > 
> > Note that materializeVirtualObjects() can be called more than once and will
> > always return the very
> > same objects, even though it calls realloc_objects() again.
> Ok.


> > > I would guess it is because the eliminated locks can be applied to
> > > argEscape, but scalar replacement only to noescape objects?
> > > I.e. it might have been done before?
> > >
> > > But why isn't this the case for eliminate_allocations?
> > > deoptimize_objects_internal does both unconditionally,
> > > so both can happen to inner frames, right?
> > 
> > Sorry, I don't quite understand. Hope the explanation above helps.
> Yes.  I was guessing wrong :)

Ok, good :)

> > 
> > > Code will get much more simple if BiasedLocking is removed.
> > >
> > > EscapeBarrier:: ...
> > >
> > > (This class maybe would qualify for a file of its own.)
> > >
> > > deoptimize_objects()
> > > I would mention escape analysis only as side remark.  Also, as I understand,
> > > there is only one frame at given depth?
> > > // Deoptimize frames with optimized objects. This can be omitted locks and
> > > // objects not allocated but replaced by scalars. In C2, these optimizations
> > > // are based on escape analysis.
> > > // Up to depth, deoptimize frames with any optimized objects.
> > > // From depth to entry_frame, deoptimize only frames that
> > > // pass optimized objects to their callees.
> > > (First part similar for the comment above
> > EscapeBarrier::deoptimize_objects_internal().)
> > 
> > I've reworked the comment. Let me know if you still think it needs to be
> > improved.
> Good now, thanks (maybe break the long line ...)

Ok. Will do in the next webrev (webrev.7).

> > > Synchronization: looks good. I think others had a look at this before.
> > >
> > > EscapeBarrier::deoptimize_objects_internal()
> > >   The method name is misleading, it is not used by
> > >   deoptimize_objects().
> > >   Also, a method with the same name is in Deoptimization.
> > >   Proposal: deoptimize_objects_thread() ?
> > 
> > Sorry, but I don't see, why it would be misleading.
> > What would be the meaning of 'deoptimize_objects_thread'? I don't
> > understand that name.
> 1. I have no idea why it's called "_internal". Because it is private?
>    By the name, I would expect that EscapeBarrier::deoptimize_objects()
>    calls it for some internal tasks. But it does not.

Well, I'd say what happens in that method is pretty internal. So IMHO the suffix _internal is a
match.

> 2. My proposal: deoptimize_objects_all_threads() iterates all threads
> and calls deoptimize_objects(_one)_thread(thread) for each of these.
> That's how I would have named it. 
> But no bike shedding, if you don't see what I mean it's not obvious.

Ok. We could have a quick call, too, if you like.

> > > Renaming deferred_locals to deferred_updates is good, as well as
> > > adding a datastructure for it.
> > > (Adding this data structure might be a breakout, too.)
> > >
> > > good.
> > >
> > > thread.cpp
> > >
> > > good.
> > >
> > > vframe.cpp
> > >
> > > Is this a bug in existing code?
> > > Makes sense.
> > 
> > Depends on your definition of bug. There are no references to
> > vframe::is_entry_frame() in the
> > existing code. I would think it is a bug.
> So it is :)

I'm just afraid it could get fixed by removing the class entryVFrame.

> > 
> > >
> > > vframe_hp.hpp
> > > (What does _hp stand for? helper? The file should be named
> > compiledVFrame ...)
> > >
> > > not_global_escape_in_scope() ...
> > > Again, you mention escape analysis here. Comments above hold, too.
> > 
> > I think it is the right name, because it is meaningful and simple.
> Ok, accepted ... given my understandings from above.

Ok.

> > 
> > > You introduce JvmtiDeferredUpdates. Good.
> > >
> > > vframe_hp.cpp
> > >
> > > Changes for JvmtiDeferredUpdates, escape state accessors,
> > >
> > > line 422:
> > > Would an assertion assert(!info->owner_is_scalar_replaced(), ...) hold here?
> > >
> > >
> > > macros.hpp
> > >   Good.
> > >
> > >
> > > Test coding
> > > ============
> > >
> > > compileBroker.h|cpp
> > >
> > > You introduce a third class of threads handled here and
> > > add a new flag to distinguish it. Before, the two kinds
> > > of threads were distinguished implicitly by passing in
> > > a compiler for compiler threads.
> > > The new thread kind is only used for testing in debug.
> > >
> > > make_thread:
> > > You could assert (comp != NULL...) to assure previous
> > > conditions.
> > 
> > I replaced the if-statements with a switch-statement, made sure all enum-
> > elements are covered, and
> > added the assertion you suggested.
> > 
> > > line 989 indentation broken
> > 
> > You are referring to this block I assume:
> > (from
> > http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.5/src/hotspot/share/compiler/compileBroker.cpp.frames.html)
> > 
> >  976   if (MethodFlushing) {
> >  977     // Initialize the sweeper thread
> >  978     Handle thread_oop = create_thread_oop("Sweeper thread", CHECK);
> >  979     jobject thread_handle = JNIHandles::make_local(THREAD,
> > thread_oop());
> >  980     make_thread(sweeper_t, thread_handle, NULL, NULL, THREAD);
> >  981   }
> >  982
> >  983 #if defined(ASSERT) && COMPILER2_OR_JVMCI
> >  984   if (DeoptimizeObjectsALot == 2) {
> >  985     // Initialize and start the object deoptimizer threads
> >  986     for (int thread_count = 0; thread_count <
> > DeoptimizeObjectsALotThreadCount; thread_count++) {
> >  987       Handle thread_oop = create_thread_oop("Deoptimize objects a lot
> > thread", CHECK);
> >  988       jobject thread_handle = JNIHandles::make_local(THREAD,
> > thread_oop());
> >  989       make_thread(deoptimizer_t, thread_handle, NULL, NULL, THREAD);
> >  990     }
> >  991   }
> >  992 #endif // defined(ASSERT) && COMPILER2_OR_JVMCI
> > 
> > I cannot really see broken indentation here. Am I looking at the wrong
> > location?
> I don't have the source version I reviewed last time any more, so 
> I can't check. But maybe an artefact from patching ... if there were
> tabs jcheck would have told you, so that's not it. No problem.

Ok.

Thanks again!

Cheers, Richard.

-----Original Message-----
From: Lindenmaier, Goetz <goetz.lindenmaier at sap.com> 
Sent: Thursday, 16 July 2020 18:30
To: Reingruber, Richard <richard.reingruber at sap.com>; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net
Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents

Hi Richard, 

I'll answer to the obvious things in this mail now.
I'll go through the code thoroughly again and write 
a review of my findings thereafter.

> So here is the new webrev.6
> 
> Webrev.6:
> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.6/
> Delta:
> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.6.inc/
Thanks for the incremental webrev, it's helpful!
 
> I spent most of the time running a microbenchmark [1] I wrote to answer
> questions from your
> review. At first I had trouble with variance in the results until I found out it
> was due to the NUMA
> architecture of the server I used. After that I noticed that there was a
> performance regression of
> about 5% even at low agent activity. I finally found out that it was due to the
> implementation of
> JavaThread::wait_for_object_deoptimization() which is called by the target
> of the JVMTI operation to
> self suspend for object deoptimization. I fixed this by adding limited spinning
> before calling
> wait() on the monitor.
> 
> The delta includes many changes in comments, renaming of names, etc. So
> I'd like to summarize
> functional changes:
> 
> * Collected all the code for the testing feature DeoptimizeObjectsALot in
> compileBroker.cpp and reworked it.
Thanks, this makes it much more compact.

>   With DeoptimizeObjectsALot enabled internal threads are started that
> deoptimize frames and
>   objects. The number of threads started is given with
> DeoptimizeObjectsALotThreadCountAll and
>   DeoptimizeObjectsALotThreadCountSingle. The former targets all existing
> threads whereas the
>   latter operates on a single thread selected round robin.
> 
>   I removed the mode where deoptimizations were performed at every nth
> exit from the runtime. I never used it.

Do I get it right? You have a n:1 and a n:all test scenario.
 n:1: n threads deoptimize 1 Java thread    where n = DOALThreadCountSingle
 n:m: n threads deoptimize all Java threads where n = DOALThreadCountAll?

> * EscapeBarrier::sync_and_suspend_one(): use a direct handshake and
> execute it always independently
>   of is_thread_fully_suspended().
Is this also a performance optimization?

> * Bugfix in EscapeBarrier::thread_added(): must not clear deopt flag. Found
> this testing with DeoptimizeObjectsALot.
Ok.

> * Added EscapeBarrier::thread_removed().
Ok.

> * EscapeBarrier constructors: barriers can now be entirely disabled by
> disabling DoEscapeAnalysis.
>   This effectively disables the enhancement.
Good!

> * JavaThread::wait_for_object_deoptimization():
>   - Bugfix: the last check of is_obj_deopt_suspend() must be /after/ the
> safepoint check! This
>     caused issues with not walkable stacks with DeoptimizeObjectsALot.
OK. As I understand, there was one safepoint check in the old version, 
now there is one in each iteration.  I assume this is intended, right?

>   - Added limited spinning inspired by HandshakeSpinYield to fix regression in
> microbenchmark [1]
Ok.  Nice improvement, nice catch!

> 
> I refer to some more changes answering your questions and comments inline
> below.
> 
> Thanks,
> Richard.
> 
> [1] Microbenchmark:
> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.6.microbenchmark/
> 


> > I understand you annotate at safepoints where the escape analysis
> > finds out that an object is "better" than global escape.
> > These are the cases where the analysis identifies optimization
> > opportunities. These annotations are then used to deoptimize
> > frames and the objects referenced by them.
> > Doesn't this overestimate the optimized
> > objects?  E.g., eliminate_alloc_node has many cases where it bails
> > out.
> 
> Yes, the implementation is conservative, but it is comparatively simple and
> the additional debug
> info is just 2 flags per safepoint. 
Thanks. It also helped that you explained to me offline that 
there are more optimizations than only lock elimination and scalar
replacement done based on the ea information.
> The ea refines the IR graph which allows follow-up optimizations 
which can not easily be tracked back to the escaping objects or 
the call sites where they do not escape. 
Thus, if there are non-global escaping objects, you have to 
deoptimize the frame.
Did I repeat that correctly?
With this understanding, a row of my proposed renamings/comments
are obsolete.


> On the other hand, those JVMTI operations
> that really trigger
> deoptimizations are expected to be comparatively infrequent such that
> switching to the interpreter
> for a few microseconds will hardly have an effect.
That sounds reasonable.

> I've done microbenchmarking to check this.
> 
> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.6.microbenchmark/
> 
> I found that in the worst case performance can be impacted by 10%. If the
> agent is extremely active
> and does relevant JVMTI calls like GetOwnedMonitorStackDepthInfo() every
> millisecond or more often,
> then the performance impact can be 30%. But I would think that this is not
> realistic. These calls
> are issued in interactive sessions to analyze deadlocks.
Ok. 
 
> We could get more precise deoptimizations by adding a third flag per
> safepoint for ea-local objects
> among the owned monitors. This would help improve the worst case in the
> benchmark. But I'm not
> convinced it is worth it.
> 
> Refer to the README.txt of the microbenchmark for a more detailed
> discussion.
 
> > pcDesc.hpp
> >
> > I would like to see some documentation of the methods. 
> Done. I didn't take your text, though, because I only noticed it after writing
> my own. Let me know if you are not ok with it.
That's fine. My texts were only proposals, you as author know better
what goes on anyways.

> > scopeDesc.cpp
> >
> >   Besides refactoring copy escape info from pcDesc to scopeDesc
> >   and add accessors. Trivial.
> >
> >   In scopeDesc.hpp you talk about NoEscape and ArgEscape.
> >   These are opto terms, but scopeDesc is a shared data structure
> >   that does not depend on a specific compiler.
> >   Please explain what is going on without using these terms.
> 
> Actually these are not too opto specific terms. They are used in the paper
> referenced in
> escape.hpp. Also you can easily google them. I'd rather keep the comments
> as they are.
Hmm, I'm not really happy with this, as also the papers
are for the compiler community, and probably not familiar to 
others that work with HotSpot.
But stay with your terms if you think it makes it clearer.
Anyway, now that I understand why you use conservative
information (see above), the descriptions I had in mind are not precise.

> > callnode.hpp
> >
> > You add functionality to annotate callnodes with escape information
> > This is carried through code generation to final output where it is
> > added to the compiled methods meta information.
> >
> > At Safepoints in general jvmti can access
> >   - Objects that were scalar replaced. They must be reallocated.
> >     (Flag EliminateAllocations)
> >   - Objects that should be locked but are not because they never
> >     escape the thread. They need to be relocked.
> >
> > At calls, Objects where locks have been removed escape to callees.
> > We must persist this information so that if jvmti accesses the
> > object in a callee, we can determine by looking at the caller that
> > it needs to be relocked.
> 
> Note that the ea-optimization does not have to be at the current location, it can also
> follow when control
> returns to the caller. Lock elimination isn't the only relevant optimization.
Yes, I understood now, see above. Thanks for explaining.
> Accesses to instance
> members or array elements can be optimized as well.
You mean the compiler can/will ignore volatile or memory ordering
requirements for non-escaping objects? Sounds reasonable to do.

> > // Returns true if at least one of the arguments to the call is an oop
> > // that does not escape globally.
> > bool ConnectionGraph::has_arg_escape(CallJavaNode* call) {
> 
> IMHO the method names are descriptive and don't need the comments. But I
> give in :) (only replaced
> "oop" with "object")
Thanks. Yes, object is better than oop.

> You are right, it is not correct how flags are checked. Especially if only
> running with the JVMCI compiler.
>
> I changed Deoptimization::deoptimize_objects_internal() to make
> reallocation and relocking dependent
> on similar checks as in Deoptimization::fetch_unroll_info_helper().
> Furthermore EscapeBarriers are
> conditionally activated depending on the following (see EscapeBarrier ctors):
> 
> JVMCI_ONLY(UseJVMCICompiler) NOT_JVMCI(false)
> COMPILER2_PRESENT(|| DoEscapeAnalysis)
> 
> So the enhancement can be practically completely disabled by disabling
> DoEscapeAnalysis, which is
> what C2 currently does if JVMTI capabilities that allow access to local
> references are taken.
Thanks for fixing. 

> I went for the latter.
> 
> > In fetch_unroll_info_helper, I don't understand why you need
> >  && !EscapeBarrier::objs_are_deoptimized(thread, deoptee.id())) {
> > for eliminated locks, but not for scalar replaced objects?
> 
> In short reallocation is idempotent, relocking is not.
> 
> Without the enhancement Deoptimization::realloc_objects() can already be
> called more than once for a frame:
> 
> First call in materializeVirtualObjects() (also iterateFrames()).
> 
> Second (indirect) call in fetch_unroll_info_helper().
> 
> The objects from the first call are saved as jvmti deferred updates when
> realloc_objects()
> returns. Note that there is no relationship to jvmti. The thing in common is
> that updates cannot be
> directly installed into a compiled frame, it is necessary to deoptimize the
> frame and defer the
> updates until the compiled frame gets replaced. Every time the vframes
> corresponding to the owner
> frame are iterated, they get the deferred updates. So in
> fetch_unroll_info_helper() the
> GrowableArray<compiledVFrame*>* chunk reference them too. All
> references to the objects created by
> the second (indirect) call to realloc_objects() are never used, because
> compiledVFrame accessors to
> locals, expressions, and monitors override them with the deferred updates.
> The objects become
> unreachable and get gc'ed.
OK, so repeatedly computed vFrames always have the first version of 
reallocated objects by construction, so it needs not be handled here.
But also due to construction, objects might be allocated just to be
discarded.
 
> materializeVirtualObjects() does not bother with relocking.
> deoptimize_objects_internal(), which is
> introduced by the enhancement, does relock objects, after all the lock
> elimination becomes illegal 
> with the change in escape state. Relocking twice does not work, so the
> enhancement avoids it by
> checking EscapeBarrier::objs_are_deoptimized(thread, deoptee.id()).
> 
> Note that materializeVirtualObjects() can be called more than once and will
> always return the very
> same objects, even though it calls realloc_objects() again.
Ok.


> > I would guess it is because the eliminated locks can be applied to
> > argEscape, but scalar replacement only to noescape objects?
> > I.e. it might have been done before?
> >
> > But why isn't this the case for eliminate_allocations?
> > deoptimize_objects_internal does both unconditionally,
> > so both can happen to inner frames, right?
> 
> Sorry, I don't quite understand. Hope the explanation above helps.
Yes.  I was guessing wrong :)

> >   I like if boolean operators are at the beginning of broken lines,
> >   but I think hotspot convention is to have them at the end.
> Ok, fixed.
Thanks.

> 
> > Code will get much more simple if BiasedLocking is removed.
> >
> > EscapeBarrier:: ...
> >
> > (This class maybe would qualify for a file of its own.)
> >
> > deoptimize_objects()
> > I would mention escape analysis only as side remark.  Also, as I understand,
> > there is only one frame at given depth?
> > // Deoptimize frames with optimized objects. This can be omitted locks and
> > // objects not allocated but replaced by scalars. In C2, these optimizations
> > // are based on escape analysis.
> > // Up to depth, deoptimize frames with any optimized objects.
> > // From depth to entry_frame, deoptimize only frames that
> > // pass optimized objects to their callees.
> > (First part similar for the comment above
> EscapeBarrier::deoptimize_objects_internal().)
> 
> I've reworked the comment. Let me know if you still think it needs to be
> improved.
Good now, thanks (maybe break the long line ...)


> > What is the check (cur_depth <= depth) good for? Can you
> > ever walk past entry_frame?
> 
> Yes (assuming you mean the outer while-statement), there are java frames
> beyond the entry frame if a
> native method calls java methods again. So we visit all frames up to the given
> depth and from there
> we continue to the entry frame. It is not necessary to continue beyond that
> entry frame, because
> escape analysis assumes that arguments to native functions escape globally.
> 
> Example: Let the java stack look like this:
> 
> +---------+
> | Frame A |
> +---------+
> | Frame N |
> +---------+
> | Frame B |
> +---------+ <- top of stack
> 
> Where java method A calls native method N and N calls java method B.
> 
> Very simplified the native stack will look like this
> 
> +-------------------------+
> | Frame of JIT Compiled A |
> +-------------------------+
> | Frame N                 |
> +-------------------------+
> | Entry Frame             |
> +-------------------------+
> | Frame B                 |
> +-------------------------+ <- top of stack
> 
> The entry frame is an activation of the call stub, which is a small assembler
> routine that
> translates from the native calling convention to the java calling convention.
> 
> There cannot be any ArgEscape that is passed to B (see above), therefore we
> can stop the stackwalk
> at the entry frame if depth is 1. If depth is 3 we have to continue to Frame A,
> as it is directly
> accessed. 
Ok, thanks, nice explanation!!

> > Isn't vf->is_compiled_frame() prerequisite that "Move to next physical
> frame"
> > is needed? You could move it into the other check.
> > If so, similar for deoptimize_objects_all_threads().
> 
> Only compiledVFrame require moving to the /top/ frame. Fixed.
Thanks, this looks better.

> > Synchronization: looks good. I think others had a look at this before.
> >
> > EscapeBarrier::deoptimize_objects_internal()
> >   The method name is misleading, it is not used by
> >   deoptimize_objects().
> >   Also, a method with the same name is in Deoptimization.
> >   Proposal: deoptimize_objects_thread() ?
> 
> Sorry, but I don't see, why it would be misleading.
> What would be the meaning of 'deoptimize_objects_thread'? I don't
> understand that name.
1. I have no idea why it's called "_internal". Because it is private?
   By the name, I would expect that EscapeBarrier::deoptimize_objects()
   calls it for some internal tasks. But it does not.
2. My proposal: deoptimize_objects_all_threads() iterates all threads 
and calls deoptimize_objects(_one)_thread(thread) for each of these.
That's how I would have named it. 
But no bike shedding, if you don't see what I mean it's not obvious.


> > C1 stubs: this really shows you tested all configurations, great!
> >
> >
> > mutexLocker: ok.
> > objectMonitor.cpp: ok
> > stackValue.hpp   Is this missing clearing a bug?
> 
> In short: that change is not needed anymore. I'll remove it again.
Good. Thanks for the details.

> > Renaming deferred_locals to deferred_updates is good, as well as
> > adding a datastructure for it.
> > (Adding this data structure might be a breakout, too.)
> >
> > good.
> >
> > thread.cpp
> >
> > good.
> >
> > vframe.cpp
> >
> > Is this a bug in existing code?
> > Makes sense.
> 
> Depends on your definition of bug. There are no references to
> vframe::is_entry_frame() in the
> existing code. I would think it is a bug.
So it is :)

> 
> >
> > vframe_hp.hpp
> > (What does _hp stand for? helper? The file should be named
> compiledVFrame ...)
> >
> > not_global_escape_in_scope() ...
> > Again, you mention escape analysis here. Comments above hold, too.
> 
> I think it is the right name, because it is meaningful and simple.
Ok, accepted ... given my understandings from above.

> 
> > You introduce JvmtiDeferredUpdates. Good.
> >
> > vframe_hp.cpp
> >
> > Changes for JvmtiDeferredUpdates, escape state accessors,
> >
> > line 422:
> > Would an assertion assert(!info->owner_is_scalar_replaced(), ...) hold here?
> >
> >
> > macros.hpp
> >   Good.
> >
> >
> > Test coding
> > ============
> >
> > compileBroker.h|cpp
> >
> > You introduce a third class of threads handled here and
> > add a new flag to distinguish it. Before, the two kinds
> > of threads were distinguished implicitly by passing in
> > a compiler for compiler threads.
> > The new thread kind is only used for testing in debug.
> >
> > make_thread:
> > You could assert (comp != NULL...) to assure previous
> > conditions.
> 
> I replaced the if-statements with a switch-statement, made sure all enum-
> elements are covered, and
> added the assertion you suggested.
> 
> > line 989 indentation broken
> 
> You are referring to this block I assume:
> (from
> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.5/src/hotspot/share/compiler/compileBroker.cpp.frames.html)
> 
>  976   if (MethodFlushing) {
>  977     // Initialize the sweeper thread
>  978     Handle thread_oop = create_thread_oop("Sweeper thread", CHECK);
>  979     jobject thread_handle = JNIHandles::make_local(THREAD,
> thread_oop());
>  980     make_thread(sweeper_t, thread_handle, NULL, NULL, THREAD);
>  981   }
>  982
>  983 #if defined(ASSERT) && COMPILER2_OR_JVMCI
>  984   if (DeoptimizeObjectsALot == 2) {
>  985     // Initialize and start the object deoptimizer threads
>  986     for (int thread_count = 0; thread_count <
> DeoptimizeObjectsALotThreadCount; thread_count++) {
>  987       Handle thread_oop = create_thread_oop("Deoptimize objects a lot
> thread", CHECK);
>  988       jobject thread_handle = JNIHandles::make_local(THREAD,
> thread_oop());
>  989       make_thread(deoptimizer_t, thread_handle, NULL, NULL, THREAD);
>  990     }
>  991   }
>  992 #endif // defined(ASSERT) && COMPILER2_OR_JVMCI
> 
> I cannot really see broken indentation here. Am I looking at the wrong
> location?
I don't have the source version I reviewed last time any more, so 
I can't check. But maybe an artefact from patching ... if there were
tabs jcheck would have told you, so that's not it. No problem.

Best regards,
  Goetz.

