RFR(S) Contended Locking fast exit bucket (8073165)
David Holmes
david.holmes at oracle.com
Tue Apr 14 01:03:30 UTC 2015
Hi Dan,
Just one follow up comment on the recursive locking below ... with
another email to follow for round 2.
On 9/04/2015 4:30 AM, Daniel D. Daugherty wrote:
> On 3/12/15 10:50 PM, David Holmes wrote:
>> Hi Dan,
>>
>> Well that was fun ... :)
>
> Sorry for the delay in getting back to this review. Been caught with
> other work issues and took a trip over Spring Break (which is not
> normal for my family).
>
> Glad you had fun with this one! The "fast notify" and "adaptive spin"
> buckets are chock full of even more fun!
>
> As usual, replies are embedded below...
>
>
>>
>> On 10/03/2015 3:11 AM, Daniel D. Daugherty wrote:
>>> Greetings,
>>>
>>> I have the Contended Locking fast exit bucket ready for review.
>>>
>>> The code changes in this bucket are primarily assembly language
>>> tweaks that come into play when an inflated ObjectMonitor is in
>>> use. There are tweaks that apply to the default code path and
>>> there are tweaks that only apply when an experimental flag is
>>> set. This code review is focused on the default code path
>>> or the "out of the box" configuration.
>>>
>>> This work is being tracked by the following bug ID:
>>>
>>> JDK-8073165 Contended Locking fast exit bucket
>>> https://bugs.openjdk.java.net/browse/JDK-8073165
>>>
>>> Here is the webrev URL:
>>>
>>> http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/
>>>
>>> Here is the JEP link:
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8046133
>>>
>>> Testing:
>>>
>>> - Aurora Adhoc RT/SVC baseline batch
>>> - JPRT test jobs
>>> - MonitorExitStresser micro-benchmark (in process)
>>> - MonitorEnterExitStresser micro-benchmark (in process)
>>> - CallTimerGrid stress testing (in process)
>>> - Aurora performance testing:
>>>   - out of the box for the "promotion" and 32-bit server configs
>>>   - heavy weight monitors for the "promotion" and 32-bit server configs
>>>     (-XX:-UseBiasedLocking -XX:+UseHeavyMonitors)
>>>     (in process)
>>>
>>>
>>> 8073165 summary of changes:
>>>
>>> macroAssembler_sparc.cpp: MacroAssembler::compiler_unlock_object()
>>>
>>> - default (OOTB) EmitSync value == 0
>>> - an _owner field optimization is in the non-default path under
>>> a (EmitSync & 1024) check
>>>
>>> - classic lock release without 1-0 support is in the non-default
>>> path under a (EmitSync & 512) check
>>>
>>> - an EntryList, cxq and _succ field optimization is in the
>>> non-default path under a (EmitSync & 16384) check
>>>
>>> - the default path is now the 1-0 form that avoids CAS and membar
>>> in the common case; see the "1-0 exit" section in objectMonitor.cpp
>>
>> I don't pretend that I could follow all these changes (I don't read
>> sparc very well) - the main thing of interest is the EmitSync==0 path
>
> Yes, the "EmitSync == 0" path is the "out-of-the-box" (OOTB) config
> and that's what we care about for this review. I expect that we'll
> eventually revisit the alternate optimizations in more detail,
> but that's not in the cards for this project.
>
>
>> - everything else is just experimental (did you actually test to any
>> great extent with all these different possible EmitSync values?).
>
> I've done minimal testing of the experimental code paths mostly
> just to satisfy my "what the heck does this do?" curiosity...
>
> In the past, Karen has done more experiments with the alternate
> code paths, but my focus is on the OOTB config. Of course, Dave
> Dice and Doug Lea use these experimental flags quite a bit as
> part of their research work on locking. Dice has often used them
> in customer situations to diagnose a system's Java monitor usage
> and behavior patterns.
>
>
>> So looking at the default path ... I could not figure out what logic
>> this (pre-existing) code is effecting:
>>
>> 3071 ld_ptr(Address(Rmark, OM_OFFSET_NO_MONITOR_VALUE_TAG(recursions)), Rbox);
>> 3072 orcc(Rbox, G0, G0);
>> 3073 brx(Assembler::notZero, false, Assembler::pn, done);
>>
>> It seems to be a check for:
>>
>> if (recursions != 0) return;
>>
>> which would be fine except where is recursions decremented ??
>
> For the other readers of this thread, the above quoted
> code is here:
>
> http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/src/cpu/sparc/vm/macroAssembler_sparc.cpp.frames.html
>
>
> in this function:
>
> 2964 void MacroAssembler::compiler_unlock_object(Register Roop, Register Rmark,
> 2965                                              Register Rbox, Register Rscratch,
> 2966                                              bool try_bias) {
>
> The key to understanding the "slow path" fall back for this
> optimization is to look at this code:
>
> http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/src/cpu/sparc/vm/sharedRuntime_sparc.cpp.frames.html
>
>
> in this function:
>
> 1934 nmethod* SharedRuntime::generate_native_wrapper(MacroAssembler* masm,
> 1935                                                 methodHandle method,
> 1936                                                 int compile_id,
> 1937                                                 BasicType* in_sig_bt,
> 1938                                                 VMRegPair* in_regs,
> 1939                                                 BasicType ret_type) {
>
>
> 2638 // Unlock
> 2639 if (method->is_synchronized()) {
>
> <snip>
>
> 2649 // Now unlock
> 2650 // (Roop, Rmark, Rbox, Rscratch)
> 2651 __ compiler_unlock_object(L4, L1, L3_box, L2);
> 2652 __ br(Assembler::equal, false, Assembler::pt, done);
> 2653 __ delayed()-> add(SP, lock_offset+STACK_BIAS, L3_box);
> 2654
> 2655 // save and restore any potential method result value around the unlocking
> 2656 // operation. Will save in I0 (or stack for FP returns).
> 2657 save_native_result(masm, ret_type, stack_slots);
> 2658
> 2659 // Must clear pending-exception before re-entering the VM. Since this is
> 2660 // a leaf call, pending-exception-oop can be safely kept in a register.
> 2661 __ st_ptr(G0, G2_thread, in_bytes(Thread::pending_exception_offset()));
> 2662
> 2663 // slow case of monitor enter. Inline a special case of call_VM that
> 2664 // disallows any pending_exception.
> 2665 __ mov(L3_box, O1);
> 2666
> 2667 // Pass in current thread pointer
> 2668 __ mov(G2_thread, O2);
> 2669
> 2670 __ call(CAST_FROM_FN_PTR(address, SharedRuntime::complete_monitor_unlocking_C), relocInfo::runtime_call_type);
> 2671 __ delayed()->mov(L4, O0); // Need oop in O0
> 2672
> 2673 __ restore_thread(L7_thread_cache); // restore G2_thread
>
> so we call compiler_unlock_object() on line 2651. If it
> was able to do the unlock operation by itself, then we
> branch to "done". If compiler_unlock_object() couldn't
> do the unlock operation, then we have to fall back to
> the slow path which results in a call to
> SharedRuntime::complete_monitor_unlocking_C() on line 2670.
>
> In the case where we have a recursive monitor exit, that
> qualifies as a "slow path" operation so the decrement of
> the 'recursions' counter happens on the
> complete_monitor_unlocking_C() code path. See
>
> http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/src/share/vm/runtime/sharedRuntime.cpp.frames.html
>
>
> 1822 JRT_LEAF(void, SharedRuntime::complete_monitor_unlocking_C(oopDesc* _obj, BasicLock* lock, JavaThread * THREAD))
>
> and the resulting call to ObjectSynchronizer::slow_exit().
Thanks for the detailed map of the code! I'm surprised that recursive
unlocking is considered a slow-path operation! I would have expected it
to be fast-path, given it can be handled CAS-free.
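To spell out how I now read the flow, here's a rough, compilable sketch of
the exit decision (my own stand-in struct and names, not the webrev's code
or the real ObjectMonitor layout; memory-ordering details deliberately
omitted):

  #include <cstdint>

  struct MonitorSketch {        // stand-in for the exit-relevant ObjectMonitor fields
    void*    owner;
    intptr_t recursions;
    void*    EntryList;
    void*    cxq;
    void*    succ;
  };

  // Stand-in for the SharedRuntime::complete_monitor_unlocking_C() ->
  // ObjectSynchronizer::slow_exit() path; today this is where the
  // recursion count actually gets decremented.
  static void slow_path_exit(MonitorSketch* m) {
    if (m->recursions != 0) {
      m->recursions--;          // recursive exit handled here, not inline
      return;
    }
    // ... the real slow path also retries the release and wakes a successor ...
  }

  static void exit_sketch(MonitorSketch* m) {
    if (m->recursions != 0) {
      // The inline SPARC code (3071-3073) only *detects* recursion and
      // branches out; a fast-path variant could simply decrement and
      // return here, which is what I had expected.
      slow_path_exit(m);
      return;
    }
    // 1-0 exit: drop ownership with a plain store, no CAS in the common case ...
    m->owner = nullptr;
    // ... and only fall into the runtime if someone may need waking.
    if ((m->EntryList != nullptr || m->cxq != nullptr) && m->succ == nullptr) {
      slow_path_exit(m);
    }
  }
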
Thanks,
David
>
>> But that aside the description seems reasonable.
>>
>>> sharedRuntime_sparc.cpp:
>>>
>>> - add JavaThread* param for
>>> SharedRuntime::complete_monitor_unlocking_C call
>>
>> Ok.
>>
>>> macroAssembler_x86.cpp: MacroAssembler::fast_unlock()
>>>
>>> - with all the new/modified comments and all the optional
>>> code path updates, it is hard to see that the only real
>>> changes to the default code path are 1) the _owner==Self
>>> check has been made optional and 2) the optional succ
>>> field check for the slow path is now on the default path
>>>
>>> - default (OOTB) EmitSync value == 0
>>> - remove optional (EmitSync & 8) code
>>> - refactor code in MacroAssembler::fast_unlock
>>> - remove old mfence() support
>>
>> So I had expected to see this change conditional on the runtime
>> processor type. But then I checked Dave's blog and saw this was
>> determined 5+ years ago (!!!) so those older processors are no longer
>> a consideration. I do hope the performance tests have been re-run
>> since then though ;-)
>
> We have been doing both targeted microbenchmarks and general
> perf runs for every bucket that we've processed so far. In
> general, the microbenchmarks show a performance gain and the
> general perf runs don't show more than a minimal improvement
> if there is any at all. This is in keeping with Dave Dice's
> prediction that we won't see the big gains until all the
> buckets are in play. We structured this project with the
> distinct buckets because we wanted to figure out which of
> the groups of changes caused the performance degradations
> that Karen saw back when she was working with this code. So
> far we haven't run into performance degradations, but we
> have three buckets (including this one) to go...
>
>
>>
>>> - an _owner field optimization is in the non-default path under
>>> a (EmitSync & 1024) check
>>
>> Doesn't the comment:
>>
>> 2163 if (EmitSync & 1024) {
>> 2164 // Don't bother to ratify that m_Owner==Self
>>
>> apply to else part?
>
> Yes, that comment does appear to be misplaced. The comment
> from the SPARC side is better:
>
> 3026 // Emit code to check that _owner == Self
> 3027 // We could fold the _owner test into subsequent code more efficiently
> 3028 // than using a stand-alone check, but since _owner checking is off by
> 3029 // default we don't bother.
>
> And my description for this change should have been:
>
> - an _owner field sanity/paranoia check is in the non-default path under
> a (EmitSync & 1024) check
>
>
>>
>>> - an _owner field optimization is in the non-default path under
>>> a (EmitSync & 16) check
>>> - a barrier optimization is in the non-default path under a
>>> (EmitSync & 32) check
>>> - remove old AMD optional work around
>>
>> 2229 // box is really RAX -- the following CMPXCHG depends on that binding
>> 2230 // cmpxchg R,[M] is equivalent to rax = CAS(M,rax,R)
>> 2231
>> 2232 movptr(boxReg, (int32_t)NULL_WORD); // box is really RAX
>>
>> You've repeated the comment about box.
>
> I'll clean that up.
>
>
>>
>>> sharedRuntime_x86_32.cpp:
>>> sharedRuntime_x86_64.cpp:
>>>
>>> - add thread param for SharedRuntime::complete_monitor_unlocking_C
>>> call
>>
>> Ok.
>>
>>> macro.[ch]pp: PhaseMacroExpand::make_slow_call()
>>>
>>> - add param2 and update callers
>>
>> Ok.
>>
>>> runtime.[ch]pp: OptoRuntime::complete_monitor_exit_Type()
>>>
>>> - add JavaThread* parameter
>>
>> Ok
>>
>>> sharedRuntime.[ch]pp: SharedRuntime::complete_monitor_unlocking_C()
>>>
>>> - add JavaThread* parameter
>>
>> Ok. So all that passing through of the Thread was just to avoid a call
>> to JavaThread::current()? I thought that was already optimized to just
>> be a register load (at least on some platforms) ?
>
> Yes. I don't have a good history from Dave/Karen on what
> motivated this change. I do remember from my Serviceability
> days that JavaThread::current() can be slow and if you
> already had the right JavaThread *, then you should use it.
>
> In this case, the callers of complete_monitor_unlocking_C()
> have the right JavaThread * so passing it down is good
> (and cheap!).
>
>
>>
>>> synchronizer.cpp: ObjectSynchronizer::omRelease()
>>>
>>> - cleanup in use list traversal
>>> - add assert that ObjectMonitor being released was
>>> actually found and extracted
>>>
>>> synchronizer.[ch]pp:
>>>
>>> - add some comments in various places
>>> - rename walk_monitor_list() -> deflate_monitor_list()
>>> and update callers
>>> - minor logic refactoring
>>
>> Generally okay.
>>
>> Minor nit: you changed inuse to "in use" but for the "inuse list" it
>> should really be "in-use list" (and not in-use-list as appears in a
>> couple of existing places)
>
> I'll clean that up a little bit more...
>
>
>>
>> 1507 // if deflate_monitor succeeded,/moribund
>>
>> Not sure what that comment is supposed to mean :)
>
> I'll have to ask Dave/Karen about that comment.
>
>
>>
>>> thread.hpp:
>>>
>>> - add _TSelf field for mfence() replacement
>>
>> Hmmmm, so we add a new pointer field to the Thread structure purely
>> for use on x86 if running with a particular experimental option set ?
>> That doesn't seem like it fits with our normal considerations for
>> space/time trade-offs. Maybe the xchg T,T->TSelf stuff should be under
>> a compile-flag instead?
>
> Claes' testing of this bucket showed a small but detectable
> footprint regression here so I'm going to back out this change
> from this bucket.
>
>
>>
>>>
>>> Thanks, in advance, for any comments, questions or suggestions.
>>
>> Cheers,
>> David
>>
>>> Dan
>
> Thanks for the thorough review (as usual)! Please let me know
> if I've addressed your comments satisfactorily...
>
> Again, sorry for the delay in getting back to this review.
>
> Dan
>
>