RFR(L) 8227745: Enable Escape Analysis for Better Performance in the Presence of JVMTI Agents

Tue Dec 17 07:03:00 UTC 2019

<resend as my mailer crashed during last send>

David

On 17/12/2019 4:57 pm, David Holmes wrote:
> Hi Richard,
> 
> On 14/12/2019 5:01 am, Reingruber, Richard wrote:
>> Hi David,
>>
>>    > Some further queries/concerns:
>>    >
>>    > src/hotspot/share/runtime/objectMonitor.cpp
>>    >
>>    > Can you please explain the changes to ObjectMonitor::wait:
>>    >
>>    > !   _recursions = save      // restore the old recursion count
>>    > !                 + jt->get_and_reset_relock_count_after_wait(); //
>>    > increased by the deferred relock count
>>    >
>>    > what is the "deferred relock count"? I gather it relates to
>>    >
>>    > "The code was extended to be able to deoptimize objects of a 
>> frame that
>>    > is not the top frame and to let another thread than the owning 
>> thread do
>>    > it."
>>
>> Yes, these relate. Currently EA based optimizations are reverted, when 
>> a compiled frame is replaced
>> with corresponding interpreter frames. Part of this is relocking 
>> objects with eliminated
>> locking. New with the enhancement is that we do this also just before 
>> object references are acquired
>> through JVMTI. In this case we deoptimize also the owning compiled 
>> frame C and we register
>> deoptimized objects as deferred updates. When control returns to C it 
>> gets deoptimized, we notice
>> that objects are already deoptimized (reallocated and relocked), so we 
>> don't do it again (relocking
>> twice would be incorrect of course). Deferred updates are copied into 
>> the new interpreter frames.
>>
>> Problem: relocking is not possible if the target thread T is waiting 
>> on the monitor that needs to be
>> relocked. This happens only with non-local objects with 
>> EliminateNestedLocks. Instead relocking is
>> deferred until T owns the monitor again. This is what the piece of 
>> code above does.
> 
> Sorry I need some more detail here. How can you wait() on an object 
> monitor if the object allocation and/or locking was optimised away? And 
> what is a "non-local object" in this context? Isn't EA restricted to 
> thread-confined objects?
> 
> Is it just that some of the locking gets optimized away e.g.
> 
> synchronised(obj) {
>    synchronised(obj) {
>      synchronised(obj) {
>        obj.wait();
>      }
>    }
> }
> 
> If this is reduced to a form as-if it were a single lock of the monitor 
> (due to EA) and the wait() triggers a JVM TI event which leads to the 
> escape of "obj" then we need to reconstruct the true lock state, and so 
> when the wait() internally unblocks and reacquires the monitor it has to 
> set the true recursion count to 3, not the 1 that it appeared to be when 
> wait() was initially called. Is that the scenario?
> 
> If so I find this truly awful. Anyone using wait() in a realistic form 
> requires a notification and so the object cannot be thread confined. In 
> which case I would strongly argue that upon hitting the wait() the deopt 
> should occur unconditionally and so the lock state is correct before we 
> wait and so we don't need to mess with the recursion count internally 
> when we reacquire the monitor.
> 
>>
>>    > which I don't like the sound of at all when it comes to 
>> ObjectMonitor
>>    > state. So I'd like to understand in detail exactly what is going 
>> on here
>>    > and why.  This is a very intrusive change that seems to badly break
>>    > encapsulation and impacts future changes to ObjectMonitor that 
>> are under
>>    > investigation.
>>
>> I would not regard this as breaking encapsulation. Certainly not badly.
>>
>> I've added a property relock_count_after_wait to JavaThread. The 
>> property is well
>> encapsulated. Future ObjectMonitor implementations have to deal with 
>> recursion too. They are free in
>> choosing a way to do that as long as that property is taken into 
>> account. This is hardly a
>> limitation.
> 
> I do think this badly breaks encapsulation as you have to add a callout 
> from the guts of the ObjectMonitor code to reach into the thread to get 
> this lock count adjustment. I understand why you have had to do this but 
> I would much rather see a change to the EA optimisation strategy so that 
> this is not needed.
> 
>> Note also that the property is a straight forward extension of the 
>> existing concept of deferred
>> local updates. It is embedded into the structure holding them. So not 
>> even the footprint of a
>> JavaThread is enlarged if no deferred updates are generated.
>>
>>    > ---
>>    >
>>    > src/hotspot/share/runtime/thread.cpp
>>    >
>>    > Can you please explain why 
>> JavaThread::wait_for_object_deoptimization
>>    > has to be handcrafted in this way rather than using proper 
>> transitions.
>>    >
>>
>> I wrote wait_for_object_deoptimization taking 
>> JavaThread::java_suspend_self_with_safepoint_check
>> as template. So in short: for the same reasons :)
>>
>> Threads reach both methods as part of thread state transitions, 
>> therefore special handling is
>> required to change thread state on top of ongoing transitions.
>>
>>    > We got rid of "deopt suspend" some time ago and it is disturbing 
>> to see
>>    > it being added back (effectively). This seems like it may be 
>> something
>>    > that handshakes could be used for.
>>
>> Deopt suspend used to be something rather different with a similar 
>> name[1]. It is not being added back.
> 
> I stand corrected. Despite comments in the code to the contrary 
> deopt_suspend didn't actually cause a self-suspend. I was doing a lot of 
> cleanup in this area 13 years ago :)
> 
>>
>> I'm actually duplicating the existing external suspend mechanism, 
>> because a thread can be suspended
>> at most once. And hey, and don't like that either! But it seems not 
>> unlikely that the duplicate can
>> be removed together with the original and the new type of handshakes 
>> that will be used for
>> thread suspend can be used for object deoptimization too. See today's 
>> discussion in JDK-8227745 [2].
> 
> I hope that discussion bears some fruit, at the moment it seems not to 
> be possible to use handshakes here. :(
> 
> The external suspend mechanism is a royal pain in the proverbial that we 
> have to carefully live with. The idea that we're duplicating that for 
> use in another fringe area of functionality does not thrill me at all.
> 
> To be clear, I understand the problem that exists and that you wish to 
> solve, but for the runtime parts I balk at the complexity cost of 
> solving it.
> 
> Thanks,
> David
> -----
> 
>> Thanks, Richard.
>>
>> [1] Deopt suspend was something like an async. handshake for 
>> architectures with register windows,
>>      where patching the return pc for deoptimization of a compiled 
>> frame was racy if the owner thread
>>      was in native code. Instead a "deopt" suspend flag was set on 
>> which the thread patched its own
>>      frame upon return from native. So no thread was suspended. It got 
>> its name only from the name of
>>      the flags.
>>
>> [2] Discussion about using handshakes to sync. with the target thread:
>>      
>> https://bugs.openjdk.java.net/browse/JDK-8227745?focusedCommentId=14306727&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14306727 
>>
>>
>> -----Original Message-----
>> From: David Holmes <david.holmes at oracle.com>
>> Sent: Freitag, 13. Dezember 2019 00:56
>> To: Reingruber, Richard <richard.reingruber at sap.com>; 
>> serviceability-dev at openjdk.java.net; 
>> hotspot-compiler-dev at openjdk.java.net; 
>> hotspot-runtime-dev at openjdk.java.net
>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better 
>> Performance in the Presence of JVMTI Agents
>>
>> Hi Richard,
>>
>> Some further queries/concerns:
>>
>> src/hotspot/share/runtime/objectMonitor.cpp
>>
>> Can you please explain the changes to ObjectMonitor::wait:
>>
>> !   _recursions = save      // restore the old recursion count
>> !                 + jt->get_and_reset_relock_count_after_wait(); //
>> increased by the deferred relock count
>>
>> what is the "deferred relock count"? I gather it relates to
>>
>> "The code was extended to be able to deoptimize objects of a frame that
>> is not the top frame and to let another thread than the owning thread do
>> it."
>>
>> which I don't like the sound of at all when it comes to ObjectMonitor
>> state. So I'd like to understand in detail exactly what is going on here
>> and why.  This is a very intrusive change that seems to badly break
>> encapsulation and impacts future changes to ObjectMonitor that are under
>> investigation.
>>
>> ---
>>
>> src/hotspot/share/runtime/thread.cpp
>>
>> Can you please explain why JavaThread::wait_for_object_deoptimization
>> has to be handcrafted in this way rather than using proper transitions.
>>
>> We got rid of "deopt suspend" some time ago and it is disturbing to see
>> it being added back (effectively). This seems like it may be something
>> that handshakes could be used for.
>>
>> Thanks,
>> David
>> -----
>>
>> On 12/12/2019 7:02 am, David Holmes wrote:
>>> On 12/12/2019 1:07 am, Reingruber, Richard wrote:
>>>> Hi David,
>>>>
>>>>     > Most of the details here are in areas I can comment on in detail,
>>>> but I
>>>>     > did take an initial general look at things.
>>>>
>>>> Thanks for taking the time!
>>>
>>> Apologies the above should read:
>>>
>>> "Most of the details here are in areas I *can't* comment on in detail 
>>> ..."
>>>
>>> David
>>>
>>>>     > The only thing that jumped out at me is that I think the
>>>>     > DeoptimizeObjectsALotThread should be a hidden thread.
>>>>     >
>>>>     > +  bool is_hidden_from_external_view() const { return true; }
>>>>
>>>> Yes, it should. Will add the method like above.
>>>>
>>>>     > Also I don't see any testing of the DeoptimizeObjectsALotThread.
>>>> Without
>>>>     > active testing this will just bit-rot.
>>>>
>>>> DeoptimizeObjectsALot is meant for stress testing with a larger
>>>> workload. I will add a minimal test
>>>> to keep it fresh.
>>>>
>>>>     > Also on the tests I don't understand your @requires clause:
>>>>     >
>>>>     >   @requires ((vm.compMode != "Xcomp") & vm.compiler2.enabled &
>>>>     > (vm.opt.TieredCompilation != true))
>>>>     >
>>>>     > This seems to require that TieredCompilation is disabled, but
>>>> tiered is
>>>>     > our normal mode of operation. ??
>>>>     >
>>>>
>>>> I removed the clause. I guess I wanted to target the tests towards the
>>>> code they are supposed to
>>>> test, and it's easier to analyze failures w/o tiered compilation and
>>>> with just one compiler thread.
>>>>
>>>> Additionally I will make use of
>>>> compiler.whitebox.CompilerWhiteBoxTest.THRESHOLD in the tests.
>>>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>> -----Original Message-----
>>>> From: David Holmes <david.holmes at oracle.com>
>>>> Sent: Mittwoch, 11. Dezember 2019 08:03
>>>> To: Reingruber, Richard <richard.reingruber at sap.com>;
>>>> serviceability-dev at openjdk.java.net;
>>>> hotspot-compiler-dev at openjdk.java.net;
>>>> hotspot-runtime-dev at openjdk.java.net
>>>> Subject: Re: RFR(L) 8227745: Enable Escape Analysis for Better
>>>> Performance in the Presence of JVMTI Agents
>>>>
>>>> Hi Richard,
>>>>
>>>> On 11/12/2019 7:45 am, Reingruber, Richard wrote:
>>>>> Hi,
>>>>>
>>>>> I would like to get reviews please for
>>>>>
>>>>> http://cr.openjdk.java.net/~rrich/webrevs/2019/8227745/webrev.3/
>>>>>
>>>>> Corresponding RFE:
>>>>> https://bugs.openjdk.java.net/browse/JDK-8227745
>>>>>
>>>>> Fixes also https://bugs.openjdk.java.net/browse/JDK-8233915
>>>>> And potentially https://bugs.openjdk.java.net/browse/JDK-8214584 [1]
>>>>>
>>>>> Vladimir Kozlov kindly put webrev.3 through tier1-8 testing without
>>>>> issues (thanks!). In addition the
>>>>> change is being tested at SAP since I posted the first RFR some
>>>>> months ago.
>>>>>
>>>>> The intention of this enhancement is to benefit performance wise from
>>>>> escape analysis even if JVMTI
>>>>> agents request capabilities that allow them to access local variable
>>>>> values. E.g. if you start-up
>>>>> with -agentlib:jdwp=transport=dt_socket,server=y,suspend=n, then
>>>>> escape analysis is disabled right
>>>>> from the beginning, well before a debugger attaches -- if ever one
>>>>> should do so. With the
>>>>> enhancement, escape analysis will remain enabled until and after a
>>>>> debugger attaches. EA based
>>>>> optimizations are reverted just before an agent acquires the
>>>>> reference to an object. In the JBS item
>>>>> you'll find more details.
>>>>
>>>> Most of the details here are in areas I can comment on in detail, but I
>>>> did take an initial general look at things.
>>>>
>>>> The only thing that jumped out at me is that I think the
>>>> DeoptimizeObjectsALotThread should be a hidden thread.
>>>>
>>>> +  bool is_hidden_from_external_view() const { return true; }
>>>>
>>>> Also I don't see any testing of the DeoptimizeObjectsALotThread. 
>>>> Without
>>>> active testing this will just bit-rot.
>>>>
>>>> Also on the tests I don't understand your @requires clause:
>>>>
>>>>     @requires ((vm.compMode != "Xcomp") & vm.compiler2.enabled &
>>>> (vm.opt.TieredCompilation != true))
>>>>
>>>> This seems to require that TieredCompilation is disabled, but tiered is
>>>> our normal mode of operation. ??
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>> [1] Experimental fix for JDK-8214584 based on JDK-8227745
>>>>> http://cr.openjdk.java.net/~rrich/webrevs/2019/8214584/experiment_v1.patch 
>>>>>
>>>>>
>>>>>