RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents

Thu Dec 19 02:11:59 UTC 2019

Hi Richard,

I think my issue is with the way EliminateNestedLocks works so I'm going 
to look into that more deeply.

Thanks for the explanations.

David

On 18/12/2019 12:47 am, Reingruber, Richard wrote:
> Hi David,
> 
>    > >    > Some further queries/concerns:
>    > >    >
>    > >    > src/hotspot/share/runtime/objectMonitor.cpp
>    > >    >
>    > >    > Can you please explain the changes to ObjectMonitor::wait:
>    > >    >
>    > >    > !   _recursions = save      // restore the old recursion count
>    > >    > !                 + jt->get_and_reset_relock_count_after_wait(); //
>    > >    > increased by the deferred relock count
>    > >    >
>    > >    > what is the "deferred relock count"? I gather it relates to
>    > >    >
>    > >    > "The code was extended to be able to deoptimize objects of a
>    > > frame that
>    > >    > is not the top frame and to let another thread than the owning
>    > > thread do
>    > >    > it."
>    > >
>    > > Yes, these relate. Currently EA based optimizations are reverted, when a compiled frame is
>    > > replaced with corresponding interpreter frames. Part of this is relocking objects with eliminated
>    > > locking. New with the enhancement is that we do this also just before object references are
>    > > acquired through JVMTI. In this case we deoptimize also the owning compiled frame C and we
>    > > register deoptimized objects as deferred updates. When control returns to C it gets deoptimized,
>    > > we notice that objects are already deoptimized (reallocated and relocked), so we don't do it again
>    > > (relocking twice would be incorrect of course). Deferred updates are copied into the new
>    > > interpreter frames.
>    > >
>    > > Problem: relocking is not possible if the target thread T is waiting on the monitor that needs to
>    > > be relocked. This happens only with non-local objects with EliminateNestedLocks. Instead relocking
>    > > is deferred until T owns the monitor again. This is what the piece of code above does.
>    >
>    >  Sorry I need some more detail here. How can you wait() on an object
>    >  monitor if the object allocation and/or locking was optimised away? And
>    >  what is a "non-local object" in this context? Isn't EA restricted to
>    >  thread-confined objects?
> 
> "Non-local object" is an object that escapes its thread. The issue I'm addressing with the changes
> in ObjectMonitor::wait are almost unrelated to EA. They are caused by EliminateNestedLocks, where C2
> eliminates recursive locking of an already owned lock. The lock owning object exists on the heap, it
> is locked and you can call wait() on it.
> 
> EliminateLocks is the C2 option that controls lock elimination based on EA.  Both optimizations have
> in common that objects with eliminated locking need to be relocked when deoptimizing a frame,
> i.e. when replacing a compiled frame with equivalent interpreter
> frames. Deoptimization::relock_objects does that job for /all/ eliminated locks in scope. /All/ can
> be a mix of eliminated nested locks and locks of not-escaping objects.
> 
> New with the enhancement: I call relock_objects earlier, just before objects pontentially
> escape. But then later when the owning compiled frame gets deoptimized, I must not do it again:
> 
> See call to EscapeBarrier::objs_are_deoptimized in deoptimization.cpp:
> 
>   373   if ((jvmci_enabled || ((DoEscapeAnalysis || EliminateNestedLocks) && EliminateLocks))
>   374       && !EscapeBarrier::objs_are_deoptimized(thread, deoptee.id())) {
>   375     bool unused;
>   376     eliminate_locks(thread, chunk, realloc_failures, deoptee, exec_mode, unused);
>   377   }
> 
> Now when calling relock_objects early it is quiet possible that I have to relock an object the
> target thread currently waits for. Obviously I cannot relock in this case, instead I chose to
> introduce relock_count_after_wait to JavaThread.
> 
>    >  Is it just that some of the locking gets optimized away e.g.
>    >
>    >  synchronised(obj) {
>    >     synchronised(obj) {
>    >       synchronised(obj) {
>    >         obj.wait();
>    >       }
>    >     }
>    >  }
>    >
>    >  If this is reduced to a form as-if it were a single lock of the monitor
>    >  (due to EA) and the wait() triggers a JVM TI event which leads to the
>    >  escape of "obj" then we need to reconstruct the true lock state, and so
>    >  when the wait() internally unblocks and reacquires the monitor it has to
>    >  set the true recursion count to 3, not the 1 that it appeared to be when
>    >  wait() was initially called. Is that the scenario?
> 
> Kind of... except that the locking is not eliminated due to EA and there is no JVM TI event
> triggered by wait.
> 
> Add
> 
> LocalObject l1 = new LocalObject();
> 
> in front of the synchrnized blocks and assume a JVM TI agent acquires l1. This triggers the code in
> question.
> 
> See that relocking/reallocating is transactional. If it is done then for /all/ objects in scope and it is
> done at most once. It wouldn't be quite so easy to split this in relocking of nested/EA-based
> eliminated locks.
> 
>    >  If so I find this truly awful. Anyone using wait() in a realistic form
>    >  requires a notification and so the object cannot be thread confined. In
> 
> It is not thread confined.
> 
>    >  which case I would strongly argue that upon hitting the wait() the deopt
>    >  should occur unconditionally and so the lock state is correct before we
>    >  wait and so we don't need to mess with the recursion count internally
>    >  when we reacquire the monitor.
>    >
>    > >
>    > >    > which I don't like the sound of at all when it comes to ObjectMonitor
>    > >    > state. So I'd like to understand in detail exactly what is going on here
>    > >    > and why.  This is a very intrusive change that seems to badly break
>    > >    > encapsulation and impacts future changes to ObjectMonitor that are under
>    > >    > investigation.
>    > >
>    > > I would not regard this as breaking encapsulation. Certainly not badly.
>    > >
>    > > I've added a property relock_count_after_wait to JavaThread. The property is well
>    > > encapsulated. Future ObjectMonitor implementations have to deal with recursion too. They are free
>    > > in choosing a way to do that as long as that property is taken into account. This is hardly a
>    > > limitation.
>    >
>    >  I do think this badly breaks encapsulation as you have to add a callout
>    >  from the guts of the ObjectMonitor code to reach into the thread to get
>    >  this lock count adjustment. I understand why you have had to do this but
>    >  I would much rather see a change to the EA optimisation strategy so that
>    >  this is not needed.
>    >
>    > > Note also that the property is a straight forward extension of the existing concept of deferred
>    > > local updates. It is embedded into the structure holding them. So not even the footprint of a
>    > > JavaThread is enlarged if no deferred updates are generated.
>    >
>    > [...]
>    >
>    > >
>    > > I'm actually duplicating the existing external suspend mechanism, because a thread can be
>    > > suspended at most once. And hey, and don't like that either! But it seems not unlikely that the
>    > > duplicate can be removed together with the original and the new type of handshakes that will be
>    > > used for thread suspend can be used for object deoptimization too. See today's discussion in
>    > > JDK-8227745 [2].
>    >
>    >  I hope that discussion bears some fruit, at the moment it seems not to
>    >  be possible to use handshakes here. :(
>    >
>    >  The external suspend mechanism is a royal pain in the proverbial that we
>    >  have to carefully live with. The idea that we're duplicating that for
>    >  use in another fringe area of functionality does not thrill me at all.
>    >
>    >  To be clear, I understand the problem that exists and that you wish to
>    >  solve, but for the runtime parts I balk at the complexity cost of
>    >  solving it.
> 
> I know it's complex, but by far no rocket science.
> 
> Also I find it hard to imagine another fix for JDK-8233915 besides changing the JVM TI specification.
>   
> Thanks, Richard.
> 
> -----Original Message-----
> From: David Holmes <david.holmes at oracle.com>
> Sent: Dienstag, 17. Dezember 2019 08:03
> To: Reingruber, Richard <richard.reingruber at sap.com>; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; Vladimir Kozlov (vladimir.kozlov at oracle.com) <vladimir.kozlov at oracle.com>
> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents
> 
> <resend as my mailer crashed during last send>
> 
> David
> 
> On 17/12/2019 4:57 pm, David Holmes wrote:
>> Hi Richard,
>>
>> On 14/12/2019 5:01 am, Reingruber, Richard wrote:
>>> Hi David,
>>>
>>>     > Some further queries/concerns:
>>>     >
>>>     > src/hotspot/share/runtime/objectMonitor.cpp
>>>     >
>>>     > Can you please explain the changes to ObjectMonitor::wait:
>>>     >
>>>     > !   _recursions = save      // restore the old recursion count
>>>     > !                 + jt->get_and_reset_relock_count_after_wait(); //
>>>     > increased by the deferred relock count
>>>     >
>>>     > what is the "deferred relock count"? I gather it relates to
>>>     >
>>>     > "The code was extended to be able to deoptimize objects of a
>>> frame that
>>>     > is not the top frame and to let another thread than the owning
>>> thread do
>>>     > it."
>>>
>>> Yes, these relate. Currently EA based optimizations are reverted, when
>>> a compiled frame is replaced
>>> with corresponding interpreter frames. Part of this is relocking
>>> objects with eliminated
>>> locking. New with the enhancement is that we do this also just before
>>> object references are acquired
>>> through JVMTI. In this case we deoptimize also the owning compiled
>>> frame C and we register
>>> deoptimized objects as deferred updates. When control returns to C it
>>> gets deoptimized, we notice
>>> that objects are already deoptimized (reallocated and relocked), so we
>>> don't do it again (relocking
>>> twice would be incorrect of course). Deferred updates are copied into
>>> the new interpreter frames.
>>>
>>> Problem: relocking is not possible if the target thread T is waiting
>>> on the monitor that needs to be
>>> relocked. This happens only with non-local objects with
>>> EliminateNestedLocks. Instead relocking is
>>> deferred until T owns the monitor again. This is what the piece of
>>> code above does.
>>
>> Sorry I need some more detail here. How can you wait() on an object
>> monitor if the object allocation and/or locking was optimised away? And
>> what is a "non-local object" in this context? Isn't EA restricted to
>> thread-confined objects?
>>
>> Is it just that some of the locking gets optimized away e.g.
>>
>> synchronised(obj) {
>>     synchronised(obj) {
>>       synchronised(obj) {
>>         obj.wait();
>>       }
>>     }
>> }
>>
>> If this is reduced to a form as-if it were a single lock of the monitor
>> (due to EA) and the wait() triggers a JVM TI event which leads to the
>> escape of "obj" then we need to reconstruct the true lock state, and so
>> when the wait() internally unblocks and reacquires the monitor it has to
>> set the true recursion count to 3, not the 1 that it appeared to be when
>> wait() was initially called. Is that the scenario?
>>
>> If so I find this truly awful. Anyone using wait() in a realistic form
>> requires a notification and so the object cannot be thread confined. In
>> which case I would strongly argue that upon hitting the wait() the deopt
>> should occur unconditionally and so the lock state is correct before we
>> wait and so we don't need to mess with the recursion count internally
>> when we reacquire the monitor.
>>
>>>
>>>     > which I don't like the sound of at all when it comes to
>>> ObjectMonitor
>>>     > state. So I'd like to understand in detail exactly what is going
>>> on here
>>>     > and why.  This is a very intrusive change that seems to badly break
>>>     > encapsulation and impacts future changes to ObjectMonitor that
>>> are under
>>>     > investigation.
>>>
>>> I would not regard this as breaking encapsulation. Certainly not badly.
>>>
>>> I've added a property relock_count_after_wait to JavaThread. The
>>> property is well
>>> encapsulated. Future ObjectMonitor implementations have to deal with
>>> recursion too. They are free in
>>> choosing a way to do that as long as that property is taken into
>>> account. This is hardly a
>>> limitation.
>>
>> I do think this badly breaks encapsulation as you have to add a callout
>> from the guts of the ObjectMonitor code to reach into the thread to get
>> this lock count adjustment. I understand why you have had to do this but
>> I would much rather see a change to the EA optimisation strategy so that
>> this is not needed.
>>
>>> Note also that the property is a straight forward extension of the
>>> existing concept of deferred
>>> local updates. It is embedded into the structure holding them. So not
>>> even the footprint of a
>>> JavaThread is enlarged if no deferred updates are generated.
>>>
>>>     > ---
>>>     >
>>>     > src/hotspot/share/runtime/thread.cpp
>>>     >
>>>     > Can you please explain why
>>> JavaThread::wait_for_object_deoptimization
>>>     > has to be handcrafted in this way rather than using proper
>>> transitions.
>>>     >
>>>
>>> I wrote wait_for_object_deoptimization taking
>>> JavaThread::java_suspend_self_with_safepoint_check
>>> as template. So in short: for the same reasons :)
>>>
>>> Threads reach both methods as part of thread state transitions,
>>> therefore special handling is
>>> required to change thread state on top of ongoing transitions.
>>>
>>>     > We got rid of "deopt suspend" some time ago and it is disturbing
>>> to see
>>>     > it being added back (effectively). This seems like it may be
>>> something
>>>     > that handshakes could be used for.
>>>
>>> Deopt suspend used to be something rather different with a similar
>>> name[1]. It is not being added back.
>>
>> I stand corrected. Despite comments in the code to the contrary
>> deopt_suspend didn't actually cause a self-suspend. I was doing a lot of
>> cleanup in this area 13 years ago :)
>>
>>>
>>> I'm actually duplicating the existing external suspend mechanism,
>>> because a thread can be suspended
>>> at most once. And hey, and don't like that either! But it seems not
>>> unlikely that the duplicate can
>>> be removed together with the original and the new type of handshakes
>>> that will be used for
>>> thread suspend can be used for object deoptimization too. See today's
>>> discussion in JDK-8227745 [2].
>>
>> I hope that discussion bears some fruit, at the moment it seems not to
>> be possible to use handshakes here. :(
>>
>> The external suspend mechanism is a royal pain in the proverbial that we
>> have to carefully live with. The idea that we're duplicating that for
>> use in another fringe area of functionality does not thrill me at all.
>>
>> To be clear, I understand the problem that exists and that you wish to
>> solve, but for the runtime parts I balk at the complexity cost of
>> solving it.
>>
>> Thanks,
>> David
>> -----
>>
>>> Thanks, Richard.
>>>
>>> [1] Deopt suspend was something like an async. handshake for
>>> architectures with register windows,
>>>       where patching the return pc for deoptimization of a compiled
>>> frame was racy if the owner thread
>>>       was in native code. Instead a "deopt" suspend flag was set on
>>> which the thread patched its own
>>>       frame upon return from native. So no thread was suspended. It got
>>> its name only from the name of
>>>       the flags.
>>>
>>> [2] Discussion about using handshakes to sync. with the target thread:
>>>       
>>> https://bugs.openjdk.java.net/browse/JDK-8227745?focusedCommentId=14306727&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14306727
>>>
>>>
>>> -----Original Message-----
>>> From: David Holmes <david.holmes at oracle.com>
>>> Sent: Freitag, 13. Dezember 2019 00:56
>>> To: Reingruber, Richard <richard.reingruber at sap.com>;
>>> serviceability-dev at openjdk.java.net;
>>> hotspot-compiler-dev at openjdk.java.net;
>>> hotspot-runtime-dev at openjdk.java.net
>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better
>>> Performance in the Presence of JVMTI Agents
>>>
>>> Hi Richard,
>>>
>>> Some further queries/concerns:
>>>
>>> src/hotspot/share/runtime/objectMonitor.cpp
>>>
>>> Can you please explain the changes to ObjectMonitor::wait:
>>>
>>> !   _recursions = save      // restore the old recursion count
>>> !                 + jt->get_and_reset_relock_count_after_wait(); //
>>> increased by the deferred relock count
>>>
>>> what is the "deferred relock count"? I gather it relates to
>>>
>>> "The code was extended to be able to deoptimize objects of a frame that
>>> is not the top frame and to let another thread than the owning thread do
>>> it."
>>>
>>> which I don't like the sound of at all when it comes to ObjectMonitor
>>> state. So I'd like to understand in detail exactly what is going on here
>>> and why.  This is a very intrusive change that seems to badly break
>>> encapsulation and impacts future changes to ObjectMonitor that are under
>>> investigation.
>>>
>>> ---
>>>
>>> src/hotspot/share/runtime/thread.cpp
>>>
>>> Can you please explain why JavaThread::wait_for_object_deoptimization
>>> has to be handcrafted in this way rather than using proper transitions.
>>>
>>> We got rid of "deopt suspend" some time ago and it is disturbing to see
>>> it being added back (effectively). This seems like it may be something
>>> that handshakes could be used for.
>>>
>>> Thanks,
>>> David
>>> -----
>>>
>>> On 12/12/2019 7:02 am, David Holmes wrote:
>>>> On 12/12/2019 1:07 am, Reingruber, Richard wrote:
>>>>> Hi David,
>>>>>
>>>>>      > Most of the details here are in areas I can comment on in detail,
>>>>> but I
>>>>>      > did take an initial general look at things.
>>>>>
>>>>> Thanks for taking the time!
>>>>
>>>> Apologies the above should read:
>>>>
>>>> "Most of the details here are in areas I *can't* comment on in detail
>>>> ..."
>>>>
>>>> David
>>>>
>>>>>      > The only thing that jumped out at me is that I think the
>>>>>      > DeoptimizeObjectsALotThread should be a hidden thread.
>>>>>      >
>>>>>      > +  bool is_hidden_from_external_view() const { return true; }
>>>>>
>>>>> Yes, it should. Will add the method like above.
>>>>>
>>>>>      > Also I don't see any testing of the DeoptimizeObjectsALotThread.
>>>>> Without
>>>>>      > active testing this will just bit-rot.
>>>>>
>>>>> DeoptimizeObjectsALot is meant for stress testing with a larger
>>>>> workload. I will add a minimal test
>>>>> to keep it fresh.
>>>>>
>>>>>      > Also on the tests I don't understand your @requires clause:
>>>>>      >
>>>>>      >   @requires ((vm.compMode != "Xcomp") & vm.compiler2.enabled &
>>>>>      > (vm.opt.TieredCompilation != true))
>>>>>      >
>>>>>      > This seems to require that TieredCompilation is disabled, but
>>>>> tiered is
>>>>>      > our normal mode of operation. ??
>>>>>      >
>>>>>
>>>>> I removed the clause. I guess I wanted to target the tests towards the
>>>>> code they are supposed to
>>>>> test, and it's easier to analyze failures w/o tiered compilation and
>>>>> with just one compiler thread.
>>>>>
>>>>> Additionally I will make use of
>>>>> compiler.whitebox.CompilerWhiteBoxTest.THRESHOLD in the tests.
>>>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>> -----Original Message-----
>>>>> From: David Holmes <david.holmes at oracle.com>
>>>>> Sent: Mittwoch, 11. Dezember 2019 08:03
>>>>> To: Reingruber, Richard <richard.reingruber at sap.com>;
>>>>> serviceability-dev at openjdk.java.net;
>>>>> hotspot-compiler-dev at openjdk.java.net;
>>>>> hotspot-runtime-dev at openjdk.java.net
>>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better
>>>>> Performance in the Presence of JVMTI Agents
>>>>>
>>>>> Hi Richard,
>>>>>
>>>>> On 11/12/2019 7:45 am, Reingruber, Richard wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I would like to get reviews please for
>>>>>>
>>>>>> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.3/
>>>>>>
>>>>>> Corresponding RFE:
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8227745
>>>>>>
>>>>>> Fixes also https://bugs.openjdk.java.net/browse/JDK-8233915
>>>>>> And potentially https://bugs.openjdk.java.net/browse/JDK-8214584 [1]
>>>>>>
>>>>>> Vladimir Kozlov kindly put webrev.3 through tier1-8 testing without
>>>>>> issues (thanks!). In addition the
>>>>>> change is being tested at SAP since I posted the first RFR some
>>>>>> months ago.
>>>>>>
>>>>>> The intention of this enhancement is to benefit performance wise from
>>>>>> escape analysis even if JVMTI
>>>>>> agents request capabilities that allow them to access local variable
>>>>>> values. E.g. if you start-up
>>>>>> with -agentlib:jdwp=transport=dt_socket,server=y,suspend=n, then
>>>>>> escape analysis is disabled right
>>>>>> from the beginning, well before a debugger attaches -- if ever one
>>>>>> should do so. With the
>>>>>> enhancement, escape analysis will remain enabled until and after a
>>>>>> debugger attaches. EA based
>>>>>> optimizations are reverted just before an agent acquires the
>>>>>> reference to an object. In the JBS item
>>>>>> you'll find more details.
>>>>>
>>>>> Most of the details here are in areas I can comment on in detail, but I
>>>>> did take an initial general look at things.
>>>>>
>>>>> The only thing that jumped out at me is that I think the
>>>>> DeoptimizeObjectsALotThread should be a hidden thread.
>>>>>
>>>>> +  bool is_hidden_from_external_view() const { return true; }
>>>>>
>>>>> Also I don't see any testing of the DeoptimizeObjectsALotThread.
>>>>> Without
>>>>> active testing this will just bit-rot.
>>>>>
>>>>> Also on the tests I don't understand your @requires clause:
>>>>>
>>>>>      @requires ((vm.compMode != "Xcomp") & vm.compiler2.enabled &
>>>>> (vm.opt.TieredCompilation != true))
>>>>>
>>>>> This seems to require that TieredCompilation is disabled, but tiered is
>>>>> our normal mode of operation. ??
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>> Thanks,
>>>>>> Richard.
>>>>>>
>>>>>> [1] Experimental fix for JDK-8214584 based on JDK-8227745
>>>>>> http://cr.openjdk.java.net/~rrich/webrevs/2019/8214584/experiment_v1.patch
>>>>>>
>>>>>>
>>>>>>