RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents
David Holmes
david.holmes at oracle.com
Thu Dec 19 02:11:59 UTC 2019
Hi Richard,
I think my issue is with the way EliminateNestedLocks works so I'm going
to look into that more deeply.
Thanks for the explanations.
David
On 18/12/2019 12:47 am, Reingruber, Richard wrote:
> Hi David,
>
> > > > Some further queries/concerns:
> > > >
> > > > src/hotspot/share/runtime/objectMonitor.cpp
> > > >
> > > > Can you please explain the changes to ObjectMonitor::wait:
> > > >
> > > > ! _recursions = save // restore the old recursion count
> > > > ! + jt->get_and_reset_relock_count_after_wait(); //
> > > > increased by the deferred relock count
> > > >
> > > > what is the "deferred relock count"? I gather it relates to
> > > >
> > > > "The code was extended to be able to deoptimize objects of a
> > > frame that
> > > > is not the top frame and to let another thread than the owning
> > > thread do
> > > > it."
> > >
> > > Yes, these relate. Currently EA based optimizations are reverted, when a compiled frame is
> > > replaced with corresponding interpreter frames. Part of this is relocking objects with eliminated
> > > locking. New with the enhancement is that we do this also just before object references are
> > > acquired through JVMTI. In this case we deoptimize also the owning compiled frame C and we
> > > register deoptimized objects as deferred updates. When control returns to C it gets deoptimized,
> > > we notice that objects are already deoptimized (reallocated and relocked), so we don't do it again
> > > (relocking twice would be incorrect of course). Deferred updates are copied into the new
> > > interpreter frames.
> > >
> > > Problem: relocking is not possible if the target thread T is waiting on the monitor that needs to
> > > be relocked. This happens only with non-local objects with EliminateNestedLocks. Instead relocking
> > > is deferred until T owns the monitor again. This is what the piece of code above does.
> >
> > Sorry I need some more detail here. How can you wait() on an object
> > monitor if the object allocation and/or locking was optimised away? And
> > what is a "non-local object" in this context? Isn't EA restricted to
> > thread-confined objects?
>
> "Non-local object" is an object that escapes its thread. The issue I'm addressing with the changes
> in ObjectMonitor::wait are almost unrelated to EA. They are caused by EliminateNestedLocks, where C2
> eliminates recursive locking of an already owned lock. The lock owning object exists on the heap, it
> is locked and you can call wait() on it.
>
> EliminateLocks is the C2 option that controls lock elimination based on EA. Both optimizations have
> in common that objects with eliminated locking need to be relocked when deoptimizing a frame,
> i.e. when replacing a compiled frame with equivalent interpreter
> frames. Deoptimization::relock_objects does that job for /all/ eliminated locks in scope. /All/ can
> be a mix of eliminated nested locks and locks of not-escaping objects.
>
> New with the enhancement: I call relock_objects earlier, just before objects pontentially
> escape. But then later when the owning compiled frame gets deoptimized, I must not do it again:
>
> See call to EscapeBarrier::objs_are_deoptimized in deoptimization.cpp:
>
> 373 if ((jvmci_enabled || ((DoEscapeAnalysis || EliminateNestedLocks) && EliminateLocks))
> 374 && !EscapeBarrier::objs_are_deoptimized(thread, deoptee.id())) {
> 375 bool unused;
> 376 eliminate_locks(thread, chunk, realloc_failures, deoptee, exec_mode, unused);
> 377 }
>
> Now when calling relock_objects early it is quiet possible that I have to relock an object the
> target thread currently waits for. Obviously I cannot relock in this case, instead I chose to
> introduce relock_count_after_wait to JavaThread.
>
> > Is it just that some of the locking gets optimized away e.g.
> >
> > synchronised(obj) {
> > synchronised(obj) {
> > synchronised(obj) {
> > obj.wait();
> > }
> > }
> > }
> >
> > If this is reduced to a form as-if it were a single lock of the monitor
> > (due to EA) and the wait() triggers a JVM TI event which leads to the
> > escape of "obj" then we need to reconstruct the true lock state, and so
> > when the wait() internally unblocks and reacquires the monitor it has to
> > set the true recursion count to 3, not the 1 that it appeared to be when
> > wait() was initially called. Is that the scenario?
>
> Kind of... except that the locking is not eliminated due to EA and there is no JVM TI event
> triggered by wait.
>
> Add
>
> LocalObject l1 = new LocalObject();
>
> in front of the synchrnized blocks and assume a JVM TI agent acquires l1. This triggers the code in
> question.
>
> See that relocking/reallocating is transactional. If it is done then for /all/ objects in scope and it is
> done at most once. It wouldn't be quite so easy to split this in relocking of nested/EA-based
> eliminated locks.
>
> > If so I find this truly awful. Anyone using wait() in a realistic form
> > requires a notification and so the object cannot be thread confined. In
>
> It is not thread confined.
>
> > which case I would strongly argue that upon hitting the wait() the deopt
> > should occur unconditionally and so the lock state is correct before we
> > wait and so we don't need to mess with the recursion count internally
> > when we reacquire the monitor.
> >
> > >
> > > > which I don't like the sound of at all when it comes to ObjectMonitor
> > > > state. So I'd like to understand in detail exactly what is going on here
> > > > and why. This is a very intrusive change that seems to badly break
> > > > encapsulation and impacts future changes to ObjectMonitor that are under
> > > > investigation.
> > >
> > > I would not regard this as breaking encapsulation. Certainly not badly.
> > >
> > > I've added a property relock_count_after_wait to JavaThread. The property is well
> > > encapsulated. Future ObjectMonitor implementations have to deal with recursion too. They are free
> > > in choosing a way to do that as long as that property is taken into account. This is hardly a
> > > limitation.
> >
> > I do think this badly breaks encapsulation as you have to add a callout
> > from the guts of the ObjectMonitor code to reach into the thread to get
> > this lock count adjustment. I understand why you have had to do this but
> > I would much rather see a change to the EA optimisation strategy so that
> > this is not needed.
> >
> > > Note also that the property is a straight forward extension of the existing concept of deferred
> > > local updates. It is embedded into the structure holding them. So not even the footprint of a
> > > JavaThread is enlarged if no deferred updates are generated.
> >
> > [...]
> >
> > >
> > > I'm actually duplicating the existing external suspend mechanism, because a thread can be
> > > suspended at most once. And hey, and don't like that either! But it seems not unlikely that the
> > > duplicate can be removed together with the original and the new type of handshakes that will be
> > > used for thread suspend can be used for object deoptimization too. See today's discussion in
> > > JDK-8227745 [2].
> >
> > I hope that discussion bears some fruit, at the moment it seems not to
> > be possible to use handshakes here. :(
> >
> > The external suspend mechanism is a royal pain in the proverbial that we
> > have to carefully live with. The idea that we're duplicating that for
> > use in another fringe area of functionality does not thrill me at all.
> >
> > To be clear, I understand the problem that exists and that you wish to
> > solve, but for the runtime parts I balk at the complexity cost of
> > solving it.
>
> I know it's complex, but by far no rocket science.
>
> Also I find it hard to imagine another fix for JDK-8233915 besides changing the JVM TI specification.
>
> Thanks, Richard.
>
> -----Original Message-----
> From: David Holmes <david.holmes at oracle.com>
> Sent: Dienstag, 17. Dezember 2019 08:03
> To: Reingruber, Richard <richard.reingruber at sap.com>; serviceability-dev at openjdk.java.net; hotspot-compiler-dev at openjdk.java.net; hotspot-runtime-dev at openjdk.java.net; Vladimir Kozlov (vladimir.kozlov at oracle.com) <vladimir.kozlov at oracle.com>
> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents
>
> <resend as my mailer crashed during last send>
>
> David
>
> On 17/12/2019 4:57 pm, David Holmes wrote:
>> Hi Richard,
>>
>> On 14/12/2019 5:01 am, Reingruber, Richard wrote:
>>> Hi David,
>>>
>>> > Some further queries/concerns:
>>> >
>>> > src/hotspot/share/runtime/objectMonitor.cpp
>>> >
>>> > Can you please explain the changes to ObjectMonitor::wait:
>>> >
>>> > ! _recursions = save // restore the old recursion count
>>> > ! + jt->get_and_reset_relock_count_after_wait(); //
>>> > increased by the deferred relock count
>>> >
>>> > what is the "deferred relock count"? I gather it relates to
>>> >
>>> > "The code was extended to be able to deoptimize objects of a
>>> frame that
>>> > is not the top frame and to let another thread than the owning
>>> thread do
>>> > it."
>>>
>>> Yes, these relate. Currently EA based optimizations are reverted, when
>>> a compiled frame is replaced
>>> with corresponding interpreter frames. Part of this is relocking
>>> objects with eliminated
>>> locking. New with the enhancement is that we do this also just before
>>> object references are acquired
>>> through JVMTI. In this case we deoptimize also the owning compiled
>>> frame C and we register
>>> deoptimized objects as deferred updates. When control returns to C it
>>> gets deoptimized, we notice
>>> that objects are already deoptimized (reallocated and relocked), so we
>>> don't do it again (relocking
>>> twice would be incorrect of course). Deferred updates are copied into
>>> the new interpreter frames.
>>>
>>> Problem: relocking is not possible if the target thread T is waiting
>>> on the monitor that needs to be
>>> relocked. This happens only with non-local objects with
>>> EliminateNestedLocks. Instead relocking is
>>> deferred until T owns the monitor again. This is what the piece of
>>> code above does.
>>
>> Sorry I need some more detail here. How can you wait() on an object
>> monitor if the object allocation and/or locking was optimised away? And
>> what is a "non-local object" in this context? Isn't EA restricted to
>> thread-confined objects?
>>
>> Is it just that some of the locking gets optimized away e.g.
>>
>> synchronised(obj) {
>> synchronised(obj) {
>> synchronised(obj) {
>> obj.wait();
>> }
>> }
>> }
>>
>> If this is reduced to a form as-if it were a single lock of the monitor
>> (due to EA) and the wait() triggers a JVM TI event which leads to the
>> escape of "obj" then we need to reconstruct the true lock state, and so
>> when the wait() internally unblocks and reacquires the monitor it has to
>> set the true recursion count to 3, not the 1 that it appeared to be when
>> wait() was initially called. Is that the scenario?
>>
>> If so I find this truly awful. Anyone using wait() in a realistic form
>> requires a notification and so the object cannot be thread confined. In
>> which case I would strongly argue that upon hitting the wait() the deopt
>> should occur unconditionally and so the lock state is correct before we
>> wait and so we don't need to mess with the recursion count internally
>> when we reacquire the monitor.
>>
>>>
>>> > which I don't like the sound of at all when it comes to
>>> ObjectMonitor
>>> > state. So I'd like to understand in detail exactly what is going
>>> on here
>>> > and why. This is a very intrusive change that seems to badly break
>>> > encapsulation and impacts future changes to ObjectMonitor that
>>> are under
>>> > investigation.
>>>
>>> I would not regard this as breaking encapsulation. Certainly not badly.
>>>
>>> I've added a property relock_count_after_wait to JavaThread. The
>>> property is well
>>> encapsulated. Future ObjectMonitor implementations have to deal with
>>> recursion too. They are free in
>>> choosing a way to do that as long as that property is taken into
>>> account. This is hardly a
>>> limitation.
>>
>> I do think this badly breaks encapsulation as you have to add a callout
>> from the guts of the ObjectMonitor code to reach into the thread to get
>> this lock count adjustment. I understand why you have had to do this but
>> I would much rather see a change to the EA optimisation strategy so that
>> this is not needed.
>>
>>> Note also that the property is a straight forward extension of the
>>> existing concept of deferred
>>> local updates. It is embedded into the structure holding them. So not
>>> even the footprint of a
>>> JavaThread is enlarged if no deferred updates are generated.
>>>
>>> > ---
>>> >
>>> > src/hotspot/share/runtime/thread.cpp
>>> >
>>> > Can you please explain why
>>> JavaThread::wait_for_object_deoptimization
>>> > has to be handcrafted in this way rather than using proper
>>> transitions.
>>> >
>>>
>>> I wrote wait_for_object_deoptimization taking
>>> JavaThread::java_suspend_self_with_safepoint_check
>>> as template. So in short: for the same reasons :)
>>>
>>> Threads reach both methods as part of thread state transitions,
>>> therefore special handling is
>>> required to change thread state on top of ongoing transitions.
>>>
>>> > We got rid of "deopt suspend" some time ago and it is disturbing
>>> to see
>>> > it being added back (effectively). This seems like it may be
>>> something
>>> > that handshakes could be used for.
>>>
>>> Deopt suspend used to be something rather different with a similar
>>> name[1]. It is not being added back.
>>
>> I stand corrected. Despite comments in the code to the contrary
>> deopt_suspend didn't actually cause a self-suspend. I was doing a lot of
>> cleanup in this area 13 years ago :)
>>
>>>
>>> I'm actually duplicating the existing external suspend mechanism,
>>> because a thread can be suspended
>>> at most once. And hey, and don't like that either! But it seems not
>>> unlikely that the duplicate can
>>> be removed together with the original and the new type of handshakes
>>> that will be used for
>>> thread suspend can be used for object deoptimization too. See today's
>>> discussion in JDK-8227745 [2].
>>
>> I hope that discussion bears some fruit, at the moment it seems not to
>> be possible to use handshakes here. :(
>>
>> The external suspend mechanism is a royal pain in the proverbial that we
>> have to carefully live with. The idea that we're duplicating that for
>> use in another fringe area of functionality does not thrill me at all.
>>
>> To be clear, I understand the problem that exists and that you wish to
>> solve, but for the runtime parts I balk at the complexity cost of
>> solving it.
>>
>> Thanks,
>> David
>> -----
>>
>>> Thanks, Richard.
>>>
>>> [1] Deopt suspend was something like an async. handshake for
>>> architectures with register windows,
>>> where patching the return pc for deoptimization of a compiled
>>> frame was racy if the owner thread
>>> was in native code. Instead a "deopt" suspend flag was set on
>>> which the thread patched its own
>>> frame upon return from native. So no thread was suspended. It got
>>> its name only from the name of
>>> the flags.
>>>
>>> [2] Discussion about using handshakes to sync. with the target thread:
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8227745?focusedCommentId=14306727&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14306727
>>>
>>>
>>> -----Original Message-----
>>> From: David Holmes <david.holmes at oracle.com>
>>> Sent: Freitag, 13. Dezember 2019 00:56
>>> To: Reingruber, Richard <richard.reingruber at sap.com>;
>>> serviceability-dev at openjdk.java.net;
>>> hotspot-compiler-dev at openjdk.java.net;
>>> hotspot-runtime-dev at openjdk.java.net
>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better
>>> Performance in the Presence of JVMTI Agents
>>>
>>> Hi Richard,
>>>
>>> Some further queries/concerns:
>>>
>>> src/hotspot/share/runtime/objectMonitor.cpp
>>>
>>> Can you please explain the changes to ObjectMonitor::wait:
>>>
>>> ! _recursions = save // restore the old recursion count
>>> ! + jt->get_and_reset_relock_count_after_wait(); //
>>> increased by the deferred relock count
>>>
>>> what is the "deferred relock count"? I gather it relates to
>>>
>>> "The code was extended to be able to deoptimize objects of a frame that
>>> is not the top frame and to let another thread than the owning thread do
>>> it."
>>>
>>> which I don't like the sound of at all when it comes to ObjectMonitor
>>> state. So I'd like to understand in detail exactly what is going on here
>>> and why. This is a very intrusive change that seems to badly break
>>> encapsulation and impacts future changes to ObjectMonitor that are under
>>> investigation.
>>>
>>> ---
>>>
>>> src/hotspot/share/runtime/thread.cpp
>>>
>>> Can you please explain why JavaThread::wait_for_object_deoptimization
>>> has to be handcrafted in this way rather than using proper transitions.
>>>
>>> We got rid of "deopt suspend" some time ago and it is disturbing to see
>>> it being added back (effectively). This seems like it may be something
>>> that handshakes could be used for.
>>>
>>> Thanks,
>>> David
>>> -----
>>>
>>> On 12/12/2019 7:02 am, David Holmes wrote:
>>>> On 12/12/2019 1:07 am, Reingruber, Richard wrote:
>>>>> Hi David,
>>>>>
>>>>> > Most of the details here are in areas I can comment on in detail,
>>>>> but I
>>>>> > did take an initial general look at things.
>>>>>
>>>>> Thanks for taking the time!
>>>>
>>>> Apologies the above should read:
>>>>
>>>> "Most of the details here are in areas I *can't* comment on in detail
>>>> ..."
>>>>
>>>> David
>>>>
>>>>> > The only thing that jumped out at me is that I think the
>>>>> > DeoptimizeObjectsALotThread should be a hidden thread.
>>>>> >
>>>>> > + bool is_hidden_from_external_view() const { return true; }
>>>>>
>>>>> Yes, it should. Will add the method like above.
>>>>>
>>>>> > Also I don't see any testing of the DeoptimizeObjectsALotThread.
>>>>> Without
>>>>> > active testing this will just bit-rot.
>>>>>
>>>>> DeoptimizeObjectsALot is meant for stress testing with a larger
>>>>> workload. I will add a minimal test
>>>>> to keep it fresh.
>>>>>
>>>>> > Also on the tests I don't understand your @requires clause:
>>>>> >
>>>>> > @requires ((vm.compMode != "Xcomp") & vm.compiler2.enabled &
>>>>> > (vm.opt.TieredCompilation != true))
>>>>> >
>>>>> > This seems to require that TieredCompilation is disabled, but
>>>>> tiered is
>>>>> > our normal mode of operation. ??
>>>>> >
>>>>>
>>>>> I removed the clause. I guess I wanted to target the tests towards the
>>>>> code they are supposed to
>>>>> test, and it's easier to analyze failures w/o tiered compilation and
>>>>> with just one compiler thread.
>>>>>
>>>>> Additionally I will make use of
>>>>> compiler.whitebox.CompilerWhiteBoxTest.THRESHOLD in the tests.
>>>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>> -----Original Message-----
>>>>> From: David Holmes <david.holmes at oracle.com>
>>>>> Sent: Mittwoch, 11. Dezember 2019 08:03
>>>>> To: Reingruber, Richard <richard.reingruber at sap.com>;
>>>>> serviceability-dev at openjdk.java.net;
>>>>> hotspot-compiler-dev at openjdk.java.net;
>>>>> hotspot-runtime-dev at openjdk.java.net
>>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better
>>>>> Performance in the Presence of JVMTI Agents
>>>>>
>>>>> Hi Richard,
>>>>>
>>>>> On 11/12/2019 7:45 am, Reingruber, Richard wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I would like to get reviews please for
>>>>>>
>>>>>> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.3/
>>>>>>
>>>>>> Corresponding RFE:
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8227745
>>>>>>
>>>>>> Fixes also https://bugs.openjdk.java.net/browse/JDK-8233915
>>>>>> And potentially https://bugs.openjdk.java.net/browse/JDK-8214584 [1]
>>>>>>
>>>>>> Vladimir Kozlov kindly put webrev.3 through tier1-8 testing without
>>>>>> issues (thanks!). In addition the
>>>>>> change is being tested at SAP since I posted the first RFR some
>>>>>> months ago.
>>>>>>
>>>>>> The intention of this enhancement is to benefit performance wise from
>>>>>> escape analysis even if JVMTI
>>>>>> agents request capabilities that allow them to access local variable
>>>>>> values. E.g. if you start-up
>>>>>> with -agentlib:jdwp=transport=dt_socket,server=y,suspend=n, then
>>>>>> escape analysis is disabled right
>>>>>> from the beginning, well before a debugger attaches -- if ever one
>>>>>> should do so. With the
>>>>>> enhancement, escape analysis will remain enabled until and after a
>>>>>> debugger attaches. EA based
>>>>>> optimizations are reverted just before an agent acquires the
>>>>>> reference to an object. In the JBS item
>>>>>> you'll find more details.
>>>>>
>>>>> Most of the details here are in areas I can comment on in detail, but I
>>>>> did take an initial general look at things.
>>>>>
>>>>> The only thing that jumped out at me is that I think the
>>>>> DeoptimizeObjectsALotThread should be a hidden thread.
>>>>>
>>>>> + bool is_hidden_from_external_view() const { return true; }
>>>>>
>>>>> Also I don't see any testing of the DeoptimizeObjectsALotThread.
>>>>> Without
>>>>> active testing this will just bit-rot.
>>>>>
>>>>> Also on the tests I don't understand your @requires clause:
>>>>>
>>>>> @requires ((vm.compMode != "Xcomp") & vm.compiler2.enabled &
>>>>> (vm.opt.TieredCompilation != true))
>>>>>
>>>>> This seems to require that TieredCompilation is disabled, but tiered is
>>>>> our normal mode of operation. ??
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>> Thanks,
>>>>>> Richard.
>>>>>>
>>>>>> [1] Experimental fix for JDK-8214584 based on JDK-8227745
>>>>>> http://cr.openjdk.java.net/~rrich/webrevs/2019/8214584/experiment_v1.patch
>>>>>>
>>>>>>
>>>>>>
More information about the serviceability-dev
mailing list