RFR(S) Contended Locking fast exit bucket (8073165)
Daniel D. Daugherty
daniel.daugherty at oracle.com
Tue Apr 14 01:58:23 UTC 2015
On 4/13/15 7:03 PM, David Holmes wrote:
> Hi Dan,
>
> Just one follow up comment on the recursive locking below ... with
> another email to follow for round 2.
>
> On 9/04/2015 4:30 AM, Daniel D. Daugherty wrote:
>> On 3/12/15 10:50 PM, David Holmes wrote:
>>> Hi Dan,
>>>
>>> Well that was fun ... :)
>>
>> Sorry for the delay in getting back to this review. Been caught with
>> other work issues and took a trip over Spring Break (which is not
>> normal for my family).
>>
>> Glad you had fun with this one! The "fast notify" and "adaptive spin"
>> buckets are chock full of even more fun!
>>
>> As usual, replies are embedded below...
>>
>>
>>>
>>> On 10/03/2015 3:11 AM, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> I have the Contended Locking fast exit bucket ready for review.
>>>>
>>>> The code changes in this bucket are primarily assembly language
>>>> tweaks that come into play when an inflated ObjectMonitor is in
>>>> use. There are tweaks that apply to the default code path and
>>>> there are tweaks that only apply when an experimental flag is
>>>> set. This code review is focused on the default code path
>>>> or the "out of the box" configuration.
>>>>
>>>> This work is being tracked by the following bug ID:
>>>>
>>>> JDK-8073165 Contended Locking fast exit bucket
>>>> https://bugs.openjdk.java.net/browse/JDK-8073165
>>>>
>>>> Here is the webrev URL:
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/
>>>>
>>>> Here is the JEP link:
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8046133
>>>>
>>>> Testing:
>>>>
>>>> - Aurora Adhoc RT/SVC baseline batch
>>>> - JPRT test jobs
>>>> - MonitorExitStresser micro-benchmark (in process)
>>>> - MonitorEnterExitStresser micro-benchmark (in process)
>>>> - CallTimerGrid stress testing (in process)
>>>> - Aurora performance testing:
>>>>   - out of the box for the "promotion" and 32-bit server configs
>>>>   - heavy weight monitors for the "promotion" and 32-bit server configs
>>>>     (-XX:-UseBiasedLocking -XX:+UseHeavyMonitors)
>>>>   (in process)
>>>>
>>>>
>>>> 8073165 summary of changes:
>>>>
>>>> macroAssembler_sparc.cpp: MacroAssembler::compiler_unlock_object()
>>>>
>>>> - default (OOTB) EmitSync value == 0
>>>> - an _owner field optimization is in the non-default path under
>>>> a (EmitSync & 1024) check
>>>>
>>>> - classic lock release without 1-0 support is in the non-default
>>>> path under a (EmitSync & 512) check
>>>>
>>>> - an EntryList, cxq and _succ field optimization is in the
>>>> non-default path under a (EmitSync & 16384) check
>>>>
>>>> - the default path is now the 1-0 form that avoids CAS and membar
>>>> in the common case; see the "1-0 exit" section in objectMonitor.cpp
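>>>>
>>>> Roughly, the default 1-0 exit decision looks like the following
>>>> standalone model (illustration only -- the real thing is the generated
>>>> assembly; only the ObjectMonitor field names are real, everything
>>>> else is made up):
>>>>
>>>>   #include <cstdint>
>>>>
>>>>   struct MonitorModel {          // stand-in for ObjectMonitor
>>>>     void* volatile _owner;
>>>>     intptr_t       _recursions;
>>>>     void* volatile _cxq;         // recently arrived waiters
>>>>     void* volatile _EntryList;   // waiters being managed
>>>>     void* volatile _succ;        // heir presumptive
>>>>   };
>>>>
>>>>   // returns true if the fast path finished the exit; false => slow path
>>>>   bool fast_exit_1_0(MonitorModel* m) {
>>>>     if (m->_recursions != 0) return false;   // recursive exit: slow path
>>>>     if (m->_cxq == nullptr && m->_EntryList == nullptr) {
>>>>       m->_owner = nullptr;                   // plain store: no CAS, no membar
>>>>       return true;                           // common, uncontended case
>>>>     }
>>>>     if (m->_succ != nullptr) {               // a successor is already active,
>>>>       m->_owner = nullptr;                   // so it will handle the waiters
>>>>       return true;
>>>>     }
>>>>     return false;                            // must wake a successor: slow path
>>>>   }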
>>>
>>> I don't pretend that I could follow all these changes (I don't read
>>> sparc very well) - the main thing of interest is the EmitSync==0 path
>>
>> Yes, the "EmitSync == 0" path is the "out-of-the-box" (OOTB) config
>> and that's what we care about for this review. I expect that we'll
>> eventually revisit the alternate optimizations in more detail,
>> but that's not in the cards for this project.
>>
>>
>>> - everything else is just experimental (did you actually test to any
>>> great extent with all these different possible EmitSync values?).
>>
>> I've done minimal testing of the experimental code paths mostly
>> just to satisfy my "what the heck does this do?" curiosity...
>>
>> In the past, Karen has done more experiments with the alternate
>> code paths, but my focus is on the OOTB config. Of course, Dave
>> Dice and Doug Lea use these experimental flags quite a bit as
>> part of their research work on locking. Dice has often used them
>> in customer situations to diagnose a system's Java monitor usage
>> and behavior patterns.
>>
>>
>>> So looking at the default path ... I could not figure out what logic
>>> this (pre-existing) code is effecting:
>>>
>>> 3071 ld_ptr(Address(Rmark, OM_OFFSET_NO_MONITOR_VALUE_TAG(recursions)), Rbox);
>>> 3072 orcc(Rbox, G0, G0);
>>> 3073 brx(Assembler::notZero, false, Assembler::pn, done);
>>>
>>> It seems to be a check for:
>>>
>>> if (recursions != 0) return;
>>>
>>> which would be fine except where is recursions decremented ??
>>
>> For the other readers of this thread, the above quoted
>> code is here:
>>
>> http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/src/cpu/sparc/vm/macroAssembler_sparc.cpp.frames.html
>>
>>
>>
>> in this function:
>>
>> 2964 void MacroAssembler::compiler_unlock_object(Register Roop, Register Rmark,
>> 2965                                             Register Rbox, Register Rscratch,
>> 2966                                             bool try_bias) {
>>
>> The key to understanding the "slow path" fall back for this
>> optimization is to look at this code:
>>
>> http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/src/cpu/sparc/vm/sharedRuntime_sparc.cpp.frames.html
>>
>>
>>
>> in this function:
>>
>> 1934 nmethod* SharedRuntime::generate_native_wrapper(MacroAssembler* masm,
>> 1935                                                 methodHandle method,
>> 1936                                                 int compile_id,
>> 1937                                                 BasicType* in_sig_bt,
>> 1938                                                 VMRegPair* in_regs,
>> 1939                                                 BasicType ret_type) {
>>
>>
>> 2638 // Unlock
>> 2639 if (method->is_synchronized()) {
>>
>> <snip>
>>
>> 2649 // Now unlock
>> 2650 // (Roop, Rmark, Rbox, Rscratch)
>> 2651 __ compiler_unlock_object(L4, L1, L3_box, L2);
>> 2652 __ br(Assembler::equal, false, Assembler::pt, done);
>> 2653 __ delayed()-> add(SP, lock_offset+STACK_BIAS, L3_box);
>> 2654
>> 2655 // save and restore any potential method result value around the unlocking
>> 2656 // operation. Will save in I0 (or stack for FP returns).
>> 2657 save_native_result(masm, ret_type, stack_slots);
>> 2658
>> 2659 // Must clear pending-exception before re-entering the VM. Since this is
>> 2660 // a leaf call, pending-exception-oop can be safely kept in a register.
>> 2661 __ st_ptr(G0, G2_thread, in_bytes(Thread::pending_exception_offset()));
>> 2662
>> 2663 // slow case of monitor enter. Inline a special case of call_VM that
>> 2664 // disallows any pending_exception.
>> 2665 __ mov(L3_box, O1);
>> 2666
>> 2667 // Pass in current thread pointer
>> 2668 __ mov(G2_thread, O2);
>> 2669
>> 2670 __ call(CAST_FROM_FN_PTR(address, SharedRuntime::complete_monitor_unlocking_C),
>>          relocInfo::runtime_call_type);
>> 2671 __ delayed()->mov(L4, O0); // Need oop in O0
>> 2672
>> 2673 __ restore_thread(L7_thread_cache); // restore G2_thread
>>
>> so we call compiler_unlock_object() on line 2651. If it
>> was able to do the unlock operation by itself, then we
>> branch to "done". If compiler_unlock_object() couldn't
>> do the unlock operation, then we have to fall back to
>> the slow path which results in a call to
>> SharedRuntime::complete_monitor_unlocking_C() on line 2670.
>>
>> In the case where we have a recursive monitor exit, that
>> qualifies as a "slow path" operation so the decrement of
>> the 'recursions' counter happens on the
>> complete_monitor_unlocking_C() code path. See
>>
>> http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/src/share/vm/runtime/sharedRuntime.cpp.frames.html
>>
>>
>>
>> 1822 JRT_LEAF(void, SharedRuntime::complete_monitor_unlocking_C(oopDesc* _obj, BasicLock* lock, JavaThread * THREAD))
>>
>> and the resulting call to ObjectSynchronizer::slow_exit().
>
> Thanks for the detailed map of the code!
That was just for the synchronized native case. The C2 node stuff
for regular Java code is harder to map (IMHO), but ends up in the
same place...
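
For anyone skimming the thread, here is a toy model of that flow
(illustration only; the names just mirror the real functions discussed
above, the bodies are made up):

  #include <cstdio>

  // stands in for the emitted fast path (compiler_unlock_object);
  // it only succeeds for the uncontended, non-recursive case
  static bool fast_path_model(bool contended, int recursions) {
    return !contended && recursions == 0;
  }

  // stands in for complete_monitor_unlocking_C ->
  // ObjectSynchronizer::slow_exit -> ObjectMonitor::exit();
  // this is where _recursions gets decremented
  static void slow_path_model(int* recursions) {
    if (*recursions > 0) (*recursions)--;
  }

  static void unlock_model(bool contended, int* recursions) {
    if (fast_path_model(contended, *recursions)) return;  // fast path done
    slow_path_model(recursions);                          // everything else
  }

  int main() {
    int recursions = 1;                      // a recursive exit...
    unlock_model(false, &recursions);        // ...always takes the slow path
    printf("recursions after exit: %d\n", recursions);  // prints 0
    return 0;
  }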
> I'm surprised that recursive unlocking is considered a slow-path
> operation! I would have expected it to be fast-path given it can be
> handled CAS free.
There are some comments about moving recursion into this code,
but the problem is code size. Something about making the C2
node code too big...
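
Conceptually the inline handling itself is tiny, something like this
(illustration only, not the real emitted code):

  #include <cstdint>

  struct MonitorModel { intptr_t _recursions; };  // only the field that matters here

  // returns true if this was an inner, recursive exit and is now complete;
  // false means this is the outermost exit and the normal 1-0 exit follows
  bool try_inline_recursive_exit(MonitorModel* m) {
    if (m->_recursions != 0) {
      m->_recursions--;        // the lock stays owned by the current thread
      return true;
    }
    return false;
  }

but C2 has to emit the equivalent at every unlock site, which (as I
understand it) is where the code-size concern comes from.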
Dan
>
> Thanks,
> David
>
>>
>>> But that aside the description seems reasonable.
>>>
>>>> sharedRuntime_sparc.cpp:
>>>>
>>>> - add JavaThread* param for
>>>> SharedRuntime::complete_monitor_unlocking_C call
>>>
>>> Ok.
>>>
>>>> macroAssembler_x86.cpp: MacroAssembler::fast_unlock()
>>>>
>>>> - with all the new/modified comments and all the optional
>>>> code path updates, it is hard to see that the only real
>>>> changes to the default code path are 1) the _owner==Self
>>>> check has been made optional and 2) the optional succ
>>>> field check for the slow path is now on the default path
>>>>
>>>> - default (OOTB) EmitSync value == 0
>>>> - remove optional (EmitSync & 8) code
>>>> - refactor code in MacroAssembler::fast_unlock
>>>> - remove old mfence() support
>>>
>>> So I had expected to see this change conditional on the runtime
>>> processor type. But then I checked Dave's blog and saw this was
>>> determined 5+ years ago (!!!) so those older processors are no longer
>>> a consideration. I do hope the performance tests have been re-run
>>> since then though ;-)
>>
>> We have been doing both targeted microbenchmarks and general
>> perf runs for every bucket that we've processed so far. In
>> general, the microbenchmarks show a performance gain and the
>> general perf runs don't show more than a minimal improvement
>> if there is any at all. This is in keeping with Dave Dice's
>> prediction that we won't see the big gains until all the
>> buckets are in play. We structured this project with the
>> distinct buckets because we wanted to figure out which of
>> the groups of changes caused the performance degradations
>> that Karen saw back when she was working with this code. So
>> far we haven't run into performance degradations, but we
>> have three buckets (including this one) to go...
>>
>>
>>>
>>>> - an _owner field optimization is in the non-default path under
>>>> a (EmitSync & 1024) check
>>>
>>> Doesn't the comment:
>>>
>>> 2163 if (EmitSync & 1024) {
>>> 2164 // Don't bother to ratify that m_Owner==Self
>>>
>>> apply to else part?
>>
>> Yes, that comment does appear to be misplaced. The comment
>> from the SPARC side is better:
>>
>> 3026 // Emit code to check that _owner == Self
>> 3027 // We could fold the _owner test into subsequent code more efficiently
>> 3028 // than using a stand-alone check, but since _owner checking is off by
>> 3029 // default we don't bother.
>>
>> And my description for this change should have been:
>>
>> - an _owner field sanity/paranoia check is in the non-default path under
>> a (EmitSync & 1024) check
>>
>>
>>>
>>>> - an _owner field optimization is in the non-default path under
>>>> a (EmitSync & 16) check
>>>> - a barrier optimization is in the non-default path under a
>>>> (EmitSync & 32) check
>>>> - remove old AMD optional work around
>>>
>>> 2229 // box is really RAX -- the following CMPXCHG depends on that binding
>>> 2230 // cmpxchg R,[M] is equivalent to rax = CAS(M,rax,R)
>>> 2231
>>> 2232 movptr(boxReg, (int32_t)NULL_WORD); // box is really RAX
>>>
>>> You've repeated the comment about box.
>>
>> I'll clean that up.
>>
>>
>>>
>>>> sharedRuntime_x86_32.cpp:
>>>> sharedRuntime_x86_64.cpp:
>>>>
>>>> - add thread param for SharedRuntime::complete_monitor_unlocking_C
>>>> call
>>>
>>> Ok.
>>>
>>>> macro.[ch]pp: PhaseMacroExpand::make_slow_call()
>>>>
>>>> - add param2 and update callers
>>>
>>> Ok.
>>>
>>>> runtime.[ch]pp: OptoRuntime::complete_monitor_exit_Type()
>>>>
>>>> - add JavaThread* parameter
>>>
>>> Ok
>>>
>>>> sharedRuntime.[ch]pp: SharedRuntime::complete_monitor_unlocking_C()
>>>>
>>>> - add JavaThread* parameter
>>>
>>> Ok. So all that passing through of the Thread was just to avoid a call
>>> to JavaThread::current()? I thought that was already optimized to just
>>> be a register load (at least on some platforms) ?
>>
>> Yes. I don't have a good history from Dave/Karen on what
>> motivated this change. I do remember from my Serviceability
>> days that JavaThread::current() can be slow and if you
>> already had the right JavaThread *, then you should use it.
>>
>> In this case, the callers of complete_monitor_unlocking_C()
>> have the right JavaThread * so passing it down is good
>> (and cheap!).
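>>
>> Roughly, the change amounts to this (fragment only, assuming the usual
>> HotSpot includes; the signature is the one quoted above, the body is my
>> paraphrase):
>>
>>   // before: the leaf entry point had to look the thread up itself, i.e.
>>   //   JavaThread* thread = JavaThread::current();
>>   // after: callers already hold the JavaThread* in a register and pass it in:
>>
>>   JRT_LEAF(void, SharedRuntime::complete_monitor_unlocking_C(
>>       oopDesc* _obj, BasicLock* lock, JavaThread* THREAD))
>>     oop obj(_obj);
>>     ObjectSynchronizer::slow_exit(obj, lock, THREAD);  // no current() lookup
>>   JRT_END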
>>
>>
>>>
>>>> synchronizer.cpp: ObjectSynchronizer::omRelease()
>>>>
>>>> - cleanup in use list traversal
>>>> - add assert that ObjectMonitor being released was
>>>> actually found and extracted
>>>>
>>>> synchronizer.[ch]pp:
>>>>
>>>> - add some comments in various places
>>>> - rename walk_monitor_list() -> deflate_monitor_list()
>>>> and update callers
>>>> - minor logic refactoring
>>>
>>> Generally okay.
>>>
>>> Minor nit: you changed inuse to "in use" but for the "inuse list" it
>>> should really be "in-use list" (and not in-use-list as appears in a
>>> couple of existing places)
>>
>> I'll clean that up a little bit more...
>>
>>
>>>
>>> 1507 // if deflate_monitor succeeded,/moribund
>>>
>>> Not sure what that comment is supposed to mean :)
>>
>> I'll have to ask Dave/Karen about that comment.
>>
>>
>>>
>>>> thread.hpp:
>>>>
>>>> - add _TSelf field for mfence() replacement
>>>
>>> Hmmmm, so we add a new pointer field to the Thread structure purely
>>> for use on x86 if running with a particular experimental option set ?
>>> That doesn't seem like it fits with our normal considerations for
>>> space/time trade-offs. Maybe the xchg T,T->TSelf stuff should be under
>>> a compile-flag instead?
>>
>> Claes' testing of this bucket showed a small but detectable
>> footprint regression here so I'm going to back out this change
>> from this bucket.
>>
>>
>>>
>>>>
>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>
>>> Cheers,
>>> David
>>>
>>>> Dan
>>
>> Thanks for the thorough review (as usual)! Please let me know
>> if I've addressed your comments satisfactorily...
>>
>> Again, sorry for the delay in getting back to this review.
>>
>> Dan
>>
>>