RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents

Wed Jul 22 20:53:19 UTC 2020

Hi Goetz,

> Thanks for the quick reply.

Yes, this time it didn't take that long...

[... snip ...]

> > > > > I understand you annotate at safepoints where the escape analysis
> > > > > finds out that an object is "better" than global escape.
> > > > > This are the cases where the analysis identifies optimization
> > > > > opportunities. These annotations are then used to deoptimize
> > > > > frames and the objects referenced by them.
> > > > > Doesn't this overestimate the optimized
> > > > > objects?  E.g., eliminate_alloc_node has many cases where it bails
> > > > > out.
> > > >
> > > > Yes, the implementation is conservative, but it is comparatively simple
> > and
> > > > the additional debug
> > > > info is just 2 flags per safepoint.
> > > Thanks. It also helped that you explained to me offline that
> > > there are more optimizations than only lock elimination and scalar
> > > replacement done based on the ea information.
> > > The ea refines the IR graph with allows follow up optimizations
> > > which can not easily be tracked back to the escaping objects or
> > > the call sites where they do not escape.
> > > Thus, if there are non-global escaping objects, you have to
> > > deoptimize the frame.
> > > Did I repeat that correctly?
> > 
> > Mostly, but there are also cases where deoptimization is required if and only
> > if ea-local objects
> > are passed as arguments. This is the case when values are not read directly
> > from a frame, but from a callee frame.
> Hmm, don't get this completely, but ok.

Let C be a callee frame of B which is a callee of A. If you use JVMTI to read an
object reference from a local variable of C then the implementation of
JDK-8227745 deoptimizes A if it passes any ea-local as argument, because the
reference could be ea-local in A and there might be optimizations that are
invalid after the escape state change.

> > > > Accesses to instance
> > > > members or array elements can be optimized as well.
> > > You mean the compiler can/will ignore volatile or memory ordering
> > > requirements for non-escaping objects? Sounds reasonable to do.
> > 
> > Yes, for instance. Also without volatile modifiers it will eliminate accesses.
> > Here is an example:
> > Method A has a NoEscape allocation O that is not scalar replaced. A calls
> > Method B, which is not
> > inlined. When you use your debugger to break in B, then modify a field of O,
> > then this modification
> > would have no effect without deoptimization, because the jit assumes that B
> > cannot modify O without
> > a reference to it.
> Yes, A can keep O in a register, while the JVMTI thread would write to 
> the location in the stack where the local is held (if it was written back).

Not quite. It is the value of the field of O that is in a register not the
reference to O itself. The agent changes the field's value in the /java heap/
(remember: O is _not_ scalar replaced), but the fields value is not reloaded
after return from B.

> > > > > Syncronization: looks good. I think others had a look at this before.
> > > > >
> > > > > EscapeBarrier::deoptimize_objects_internal()
> > > > >   The method name is misleading, it is not used by
> > > > >   deoptimize_objects().
> > > > >   Also, method with the same name is in Deopitmization.
> > > > >   Proposal: deoptimize_objects_thread() ?
> > > >
> > > > Sorry, but I don't see, why it would be misleading.
> > > > What would be the meaning of 'deoptimize_objects_thread'? I don't
> > > > understand that name.
> > > 1. I have no idea why it's called "_internal". Because it is private?
> > >    By the name, I would expect that EscapeBarrier::deoptimize_objects()
> > >    calls it for some internal tasks. But it does not.
> > 
> > Well, I'd say it is pretty internal, what's happening in that method. So IMHO
> > the suffix _internal
> > is a match.
> > 
> > > 2. My proposal: deoptimize_objects_all_threads() iterates all threads
> > > and calls deoptimize_objects(_one)_thread(thread) for each of these.
> > > That's how I would have named it.
> > > But no bike shedding, if you don't see what I mean it's not obvious.
> > Ok. We could have a quick call, too, if you like.

> Ok, I think I have understood the remaining points.  I'm fine with this 
> so far.

Thanks again and best regards,
Richard.

-----Original Message-----
From: Lindenmaier, Goetz <goetz.lindenmaier at sap.com> 
Sent: Mittwoch, 22. Juli 2020 18:22
To: Reingruber, Richard <richard.reingruber at sap.com>; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net
Subject: RE: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents

Hi Richard,

Thanks for the quick reply.

> > >   With DeoptimizeObjectsALot enabled internal threads are started that
> > > deoptimize frames and
> > >   objects. The number of threads started are given with
> > > DeoptimizeObjectsALotThreadCountAll and
> > >   DeoptimizeObjectsALotThreadCountSingle. The former targets all
> existing
> > > threads whereas the
> > >   latter operates on a single thread selected round robin.
> > >
> > >   I removed the mode where deoptimizations were performed at every nth
> > > exit from the runtime. I never used it.
> 
> > Do I get it right? You have a n:1 and a n:all test scenario.
> >  n:1: n threads deoptimize 1 Jana thread    where n => DOALThreadCountSingle
> >  n:m: n threads deoptimize all Java threads where n = DOALThreadCountAll?
> 
> Not quite.
> 
> -XX:+DeoptimizeObjectsALot // required
> -XX:DeoptimizeObjectsALotThreadCountAll=m
> -XX:DeoptimizeObjectsALotThreadCountSingle=n
> 
> Will start m+n threads. Each operating on all existing JavaThreads using
> EscapeBarriers. The
> difference between the 2 thread types is that one distinct EscapeBarrier
> targets either just a
> single thread or all exisitng threads at onece. If just one single thread is
> targeted per
> EscapeBarrier, then it is not always the same thread, but threads are selected
> round robin. So there
> will be n threads selecting independently single threads round robin per
> EscapeBarrier and m threads
> that target all threads in every EscapeBarrier.
Ok, yes, that is how I understood it. 

> > > * EscapeBarrier::sync_and_suspend_one(): use a direct handshake and
> > > execute it always independently
> > >   of is_thread_fully_suspended().
> > Is this also a performance optimization?
> 
> Maybe a minor one.
OK

> > > * JavaThread::wait_for_object_deoptimization():
> > >   - Bugfix: the last check of is_obj_deopt_suspend() must be /after/ the
> > > safepoint check! This
> > >     caused issues with not walkable stacks with DeoptimizeObjectsALot.
> > OK. As I understand, there was one safepoint check in the old version,
> > now there is one in each iteration.  I assume this is intended, right?
> 
> Yes it is. The important thing here is (A) a safepoint check is needed /after/
> leaving a safe state
> (_thread_in_native, _thread_blocked). (B) Shared variables that are modified
> at safepoints or with handshakes need to be reread /after/ the safepoint check.
> 
> BTW: I only noticed now that since JDK-8240918 JavaThreads themselves
> must disarm their polling
> page. Originally (before handshakes) this was done by the VM thread. With
> handshakes it was done by
> the thread executing the handshake op. This was changed for
> OrderAccess::cross_modify_fence() where
> the poll is left armed if the thread is in native and sice JDK-8240918 it is
> always left armed. So
> when a thread leaves a safe state (native, blocked) and there was a
> handshake/vm op, it will always
> call SafepointMechanism::block_if_requested_slow(), even if the
> handshake/vm operation have been
> processed already and everybody else is happyly executing bytecodes :)
Ok.

> Still (A) and (B) hold.

> > >   - Added limited spinning inspired by HandshakeSpinYield to fix regression in
> > > microbenchmark [1]
> > Ok.  Nice improvement, nice catch!
> 
> Yes. It certainly took some time to find out.
> 
> > >
> > > I refer to some more changes answering your questions and comments
> inline
> > > below.
> > >
> > > Thanks,
> > > Richard.
> > >
> > > [1] Microbenchmark:
> > >
> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.6.microbe
> nchmark/
> > >
> 
> 
> > > > I understand you annotate at safepoints where the escape analysis
> > > > finds out that an object is "better" than global escape.
> > > > This are the cases where the analysis identifies optimization
> > > > opportunities. These annotations are then used to deoptimize
> > > > frames and the objects referenced by them.
> > > > Doesn't this overestimate the optimized
> > > > objects?  E.g., eliminate_alloc_node has many cases where it bails
> > > > out.
> > >
> > > Yes, the implementation is conservative, but it is comparatively simple
> and
> > > the additional debug
> > > info is just 2 flags per safepoint.
> > Thanks. It also helped that you explained to me offline that
> > there are more optimizations than only lock elimination and scalar
> > replacement done based on the ea information.
> > The ea refines the IR graph with allows follow up optimizations
> > which can not easily be tracked back to the escaping objects or
> > the call sites where they do not escape.
> > Thus, if there are non-global escaping objects, you have to
> > deoptimize the frame.
> > Did I repeat that correctly?
> 
> Mostly, but there are also cases where deoptimization is required if and only
> if ea-local objects
> are passed as arguments. This is the case when values are not read directly
> from a frame, but from a callee frame.
Hmm, don't get this completely, but ok.

> > > Accesses to instance
> > > members or array elements can be optimized as well.
> > You mean the compiler can/will ignore volatile or memory ordering
> > requirements for non-escaping objects? Sounds reasonable to do.
> 
> Yes, for instance. Also without volatile modifiers it will eliminate accesses.
> Here is an example:
> Method A has a NoEscape allocation O that is not scalar replaced. A calls
> Method B, which is not
> inlined. When you use your debugger to break in B, then modify a field of O,
> then this modification
> would have no effect without deoptimization, because the jit assumes that B
> cannot modify O without
> a reference to it.
Yes, A can keep O in a register, while the JVMTI thread would write to 
the location in the stack where the local is held (if it was written back).

> > > > Syncronization: looks good. I think others had a look at this before.
> > > >
> > > > EscapeBarrier::deoptimize_objects_internal()
> > > >   The method name is misleading, it is not used by
> > > >   deoptimize_objects().
> > > >   Also, method with the same name is in Deopitmization.
> > > >   Proposal: deoptimize_objects_thread() ?
> > >
> > > Sorry, but I don't see, why it would be misleading.
> > > What would be the meaning of 'deoptimize_objects_thread'? I don't
> > > understand that name.
> > 1. I have no idea why it's called "_internal". Because it is private?
> >    By the name, I would expect that EscapeBarrier::deoptimize_objects()
> >    calls it for some internal tasks. But it does not.
> 
> Well, I'd say it is pretty internal, what's happening in that method. So IMHO
> the suffix _internal
> is a match.
> 
> > 2. My proposal: deoptimize_objects_all_threads() iterates all threads
> > and calls deoptimize_objects(_one)_thread(thread) for each of these.
> > That's how I would have named it.
> > But no bike shedding, if you don't see what I mean it's not obvious.
> Ok. We could have a quick call, too, if you like.

Ok, I think I have understood the remaining points.  I'm fine with this 
so far.

Thanks,
  Goetz.