RFR(S) Contended Locking fast exit bucket (8073165)
Daniel D. Daugherty
daniel.daugherty at oracle.com
Wed Apr 8 18:30:31 UTC 2015
On 3/12/15 10:50 PM, David Holmes wrote:
> Hi Dan,
>
> Well that was fun ... :)
Sorry for the delay in getting back to this review. Been caught with
other work issues and took a trip over Spring Break (which is not
normal for my family).
Glad you had fun with this one! The "fast notify" and "adaptive spin"
buckets are chock full of even more fun!
As usual, replies are embedded below...
>
> On 10/03/2015 3:11 AM, Daniel D. Daugherty wrote:
>> Greetings,
>>
>> I have the Contended Locking fast exit bucket ready for review.
>>
>> The code changes in this bucket are primarily assembly language
>> tweaks that come into play when an inflated ObjectMonitor is in
>> use. There are tweaks that apply to the default code path and
>> there are tweaks that only apply when an experimental flag is
>> set. This code code review is focused on the default code path
>> or the "out of the box" configuration.
>>
>> This work is being tracked by the following bug ID:
>>
>> JDK-8073165 Contended Locking fast exit bucket
>> https://bugs.openjdk.java.net/browse/JDK-8073165
>>
>> Here is the webrev URL:
>>
>> http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/
>>
>> Here is the JEP link:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8046133
>>
>> Testing:
>>
>> - Aurora Adhoc RT/SVC baseline batch
>> - JPRT test jobs
>> - MonitorExitStresser micro-benchmark (in process)
>> - MonitorEnterExitStresser micro-benchmark (in process)
>> - CallTimerGrid stress testing (in process)
>> - Aurora performance testing:
>> - out of the box for the "promotion" and 32-bit server configs
>> - heavy weight monitors for the "promotion" and 32-bit server configs
>> (-XX:-UseBiasedLocking -XX:+UseHeavyMonitors)
>> (in process)
>>
>>
>> 8073165 summary of changes:
>>
>> macroAssembler_sparc.cpp: MacroAssembler::compiler_unlock_object()
>>
>> - default (OOTB) EmitSync value == 0
>> - an _owner field optimization is in the non-default path under
>> a (EmitSync & 1024) check
>>
>> - classic lock release without 1-0 support is in the non-default
>> path under a (EmitSync & 512) check
>>
>> - an EntryList, cxq and _succ field optimization is in the
>> non-default path under a (EmitSync & 16384) check
>>
>> - the default path is now the 1-0 form that avoids CAS and membar
>> in the common case; see the "1-0 exit" section in objectMonitor.cpp
>
> I don't pretend that I could follow all these changes (I don't read
> sparc very well) - the main thing of interest is the EmitSync==0 path
Yes, the "EmitSync == 0" path is the "out-of-the-box" (OOTB) config
and that's what we care about for this review. I expect that we'll
eventually revisit the alternate optimizations in more detail,
but that's not in the cards for this project.
> - everything else is just experimental (did you actually test to any
> great extent with all these different possible EmitSync values?).
I've done minimal testing of the experimental code paths mostly
just to satisfy my "what the heck does this do?" curiosity...
In the past, Karen has done more experiments with the alternate
code paths, but my focus is on the OOTB config. Of course, Dave
Dice and Doug Lea use these experimental flags quite a bit as
part of their research work on locking. Dice has often used them
in customer situations to diagnose a system's Java monitor usage
and behavior patterns.
> So looking at the default path ... I could not figure out what logic
> this (pre-existing) code is effecting:
>
> 3071 ld_ptr(Address(Rmark,
> OM_OFFSET_NO_MONITOR_VALUE_TAG(recursions)), Rbox);
> 3072 orcc(Rbox, G0, G0);
> 3073 brx(Assembler::notZero, false, Assembler::pn, done);
>
> It seems to be a check for:
>
> if (recursions != 0) return;
>
> which would be fine except where is recursions decremented ??
For the other readers of this thread, the above quoted
code is here:
http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/src/cpu/sparc/vm/macroAssembler_sparc.cpp.frames.html
in this function:
2964 void MacroAssembler::compiler_unlock_object(Register Roop, Register
Rmark,
2965 Register Rbox, Register
Rscratch,
2966 bool try_bias) {
The key to understanding the "slow path" fall back for this
optimization is to look at this code:
http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/src/cpu/sparc/vm/sharedRuntime_sparc.cpp.frames.html
in this function:
1934 nmethod* SharedRuntime::generate_native_wrapper(MacroAssembler* masm,
1935 methodHandle method,
1936 int compile_id,
1937 BasicType* in_sig_bt,
1938 VMRegPair* in_regs,
1939 BasicType ret_type) {
2638 // Unlock
2639 if (method->is_synchronized()) {
<snip>
2649 // Now unlock
2650 // (Roop, Rmark, Rbox, Rscratch)
2651 __ compiler_unlock_object(L4, L1, L3_box, L2);
2652 __ br(Assembler::equal, false, Assembler::pt, done);
2653 __ delayed()-> add(SP, lock_offset+STACK_BIAS, L3_box);
2654
2655 // save and restore any potential method result value around
the unlocking
2656 // operation. Will save in I0 (or stack for FP returns).
2657 save_native_result(masm, ret_type, stack_slots);
2658
2659 // Must clear pending-exception before re-entering the VM.
Since this is
2660 // a leaf call, pending-exception-oop can be safely kept in a
register.
2661 __ st_ptr(G0, G2_thread,
in_bytes(Thread::pending_exception_offset()));
2662
2663 // slow case of monitor enter. Inline a special case of
call_VM that
2664 // disallows any pending_exception.
2665 __ mov(L3_box, O1);
2666
2667 // Pass in current thread pointer
2668 __ mov(G2_thread, O2);
2669
2670 __ call(CAST_FROM_FN_PTR(address,
SharedRuntime::complete_monitor_unlocking_C), relocInfo::runtime_call_type);
2671 __ delayed()->mov(L4, O0); // Need oop in O0
2672
2673 __ restore_thread(L7_thread_cache); // restore G2_thread
so we call compiler_unlock_object() on line 2651. If it
was able to do the unlock operation by itself, then we
branch to "done". If compiler_unlock_object() couldn't
do the unlock operation, then we have to fall back to
the slow path which results in a call to
SharedRuntime::complete_monitor_unlocking_C() on line 2670.
In the case where we have a recursive monitor exit, that
qualifies as a "slow path" operation so the decrement of
the 'recursions' counter happens on the
complete_monitor_unlocking_C() code path. See
http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/src/share/vm/runtime/sharedRuntime.cpp.frames.html
1822 JRT_LEAF(void, SharedRuntime::complete_monitor_unlocking_C(oopDesc*
_obj, BasicLock* lock, JavaThread * THREAD))
and the resulting call to ObjectSynchronizer::slow_exit().
> But that aside the description seems reasonable.
>
>> sharedRuntime_sparc.cpp:
>>
>> - add JavaThread* param for
>> SharedRuntime::complete_monitor_unlocking_C call
>
> Ok.
>
>> macroAssembler_x86.cpp: MacroAssembler::fast_unlock()
>>
>> - with all the new/modified comments and all the optional
>> code path updates, it is hard to see that the only real
>> changes to the default code path are 1) the _owner==Self
>> check has been made optional and 2) the optional succ
>> field check for the slow path is now on the default path
>>
>> - default (OOTB) EmitSync value == 0
>> - remove optional (EmitSync & 8) code
>> - refactor code in MacroAssembler::fast_unlock
>> - remove old mfence() support
>
> So I had expected to see this change conditional on the runtime
> processor type. But then I checked Dave's blog and saw this was
> determined 5+ years ago (!!!) so those older processors are no longer
> a consideration. I do hope the performance tests have been re-run
> since then though ;-)
We have been doing both targeted microbenchmarks and general
perf runs for every bucket that we've processed so far. In
general, the microbenchmarks show a performance gain and the
general perf runs don't show more than a minimal improvement
if there is any at all. This is in keeping with Dave Dice's
prediction that we won't see the big gains until all the
buckets are in play. We structured this project with the
distinct buckets because we wanted to figure out which of
the groups of changes caused the performance degradations
that Karen saw back when she was working with this code. So
far we haven't run into performance degradations, but we
have three buckets (including this one) to go...
>
>> - an _owner field optimization is in the non-default path under
>> a (EmitSync & 1024) check
>
> Doesn't the comment:
>
> 2163 if (EmitSync & 1024) {
> 2164 // Don't bother to ratify that m_Owner==Self
>
> apply to else part?
Yes, that comment does appear to be misplaced. The comment
from the SPARC side is better:
3026 // Emit code to check that _owner == Self
3027 // We could fold the _owner test into subsequent code more
efficiently
3028 // than using a stand-alone check, but since _owner checking
is off by
3029 // default we don't bother.
And my description for this change should have been:
- an _owner field sanity/paranoia check is in the non-default path under
a (EmitSync & 1024) check
>
>> - an _owner field optimization is in the non-default path under
>> a (EmitSync & 16) check
>> - a barrier optimization is in the non-default path under a
>> (EmitSync & 32) check
>> - remove old AMD optional work around
>
> 2229 // box is really RAX -- the following CMPXCHG depends on
> that binding
> 2230 // cmpxchg R,[M] is equivalent to rax = CAS(M,rax,R)
> 2231
> 2232 movptr(boxReg, (int32_t)NULL_WORD); // box is really RAX
>
> You've repeated the comment about box.
I'll clean that up.
>
>> sharedRuntime_x86_32.cpp:
>> sharedRuntime_x86_64.cpp:
>>
>> - add thread param for SharedRuntime::complete_monitor_unlocking_C
>> call
>
> Ok.
>
>> macro.[ch]pp: PhaseMacroExpand::make_slow_call()
>>
>> - add param2 and update callers
>
> Ok.
>
>> runtime.[ch]pp: OptoRuntime::complete_monitor_exit_Type()
>>
>> - add JavaThread* parameter
>
> Ok
>
>> sharedRuntime.[ch]pp: SharedRuntime::complete_monitor_unlocking_C()
>>
>> - add JavaThread* parameter
>
> Ok. So all that passing through of the Thread was just to avoid a call
> to JavaThread::current()? I thought that was already optimized to just
> be a register load (at least on some platforms) ?
Yes. I don't have a good history from Dave/Karen on what
motivated this change. I do remember from my Serviceability
days that JavaThread::current() can be slow and if you
already had the right JavaThread *, then you should use it.
In this case, the callers of complete_monitor_unlocking_C()
have the right JavaThread * so passing it down is good
(and cheap!).
>
>> synchronizer.cpp: ObjectSynchronizer::omRelease()
>>
>> - cleanup in use list traversal
>> - add assert that ObjectMonitor being released was
>> actually found and extracted
>>
>> synchronizer.[ch]pp:
>>
>> - add some comments in various places
>> - rename walk_monitor_list() -> deflate_monitor_list()
>> and update callers
>> - minor logic refactoring
>
> Generally okay.
>
> Minor nit: you changed inuse to "in use" but for the "inuse list" it
> should really be "in-use list" (and not in-use-list as appears in a
> couple of existing places)
I'll clean that up a little bit more...
>
> 1507 // if deflate_monitor succeeded,/moribund
>
> Not sure what that comment is supposed to mean :)
I'll have to ask Dave/Karen about that comment.
>
>> thread.hpp:
>>
>> - add _TSelf field for mfence() replacement
>
> Hmmmm, so we add a new pointer field to the Thread structure purely
> for use on x86 if running with a particular experimental option set ?
> That doesn't seem like it fits with our normal considerations for
> space/time trade-offs. Maybe the xchg T,T->TSelf stuff should be under
> a compile-flag instead?
Claes' testing of this bucket showed a small but detectable
footprint regression here so I'm going to back out this change
from this bucket.
>
>>
>> Thanks, in advance, for any comments, questions or suggestions.
>
> Cheers,
> David
>
>> Dan
Thanks for the thorough review (as usual)! Please let me know
if I've addressed your comments satisfactorily...
Again, sorry for the delay in getting back to this review.
Dan
More information about the hotspot-runtime-dev
mailing list