RFR(S) Contended Locking fast exit bucket (8073165)

Wed Apr 8 21:26:22 UTC 2015

Sorry for the delay in getting back to this review. Been caught with
other work issues and took a trip over Spring Break (which is not
normal for my family).

As usual, replies are embedded below...

On 3/25/15 6:56 PM, Coleen Phillimore wrote:
>
> Dan,
>
> I've finally gotten around to seeing this.  What is this?
>
> +   if (EmitSync & 1024) {
>
> and
>
> +   if (EmitSync & 512) {

I guess I should have been more clear in the code review invite.

I wrote this in the invite:

 > There are tweaks that apply to the default code path and
 > there are tweaks that only apply when an experimental flag is
 > set. This code review is focused on the default code path
 > or the "out of the box" configuration.

and I wrote this in the "summary of changes" section:

- default (OOTB) EmitSync value == 0

for the files where the EmitSync value comes into play.

To me, it is obvious that (EmitSync == 0) means the default
or Out-Of-The-Box (OOTB) config and that something like
(EmitSync & 1024) means an optional code path.

However, I have been living with this code for a few years
so what's obvious to me is likely not obvious to anyone else.

> I thought this sort of unintelligible code was going to get cleaned up?

I suppose that depends on what you mean by cleaned up. I'm
not being intentionally snarky here. To some folks, all of
these experimental code paths should be removed which would
count as being cleaned up. However, we (product engineering)
have an agreement with Dave Dice and Doug Lea to leave in
some experimental code so that they can more easily do
experiments _and_ diagnose real world issues with Java
monitors in the field.

For this project, we're removing some of the experimental
code that has been proven to be not useful. I've called out
those removals in the summary. For this project, we're also
considering reordering some of the experimental code versus
the default code such that the default code is not obscured.
For this bucket, I don't see how to move any of the experimental
code related to "fast monitor exit" without breaking things.

Currently the EmitSync option is just a collection of flag
bits. As you know, HotSpot doesn't do a great job with
supporting options like this. We only recently got support
for setting options to 0xNNNN patterns.

We could do something like what JVM/TI does with the
TraceJVMTI option:

src/share/vm/prims/jvmtiTrace.cpp:

// Support for JVMTI tracing code
//
// ------------
// Usage:
//    -XX:TraceJVMTI=DESC,DESC,DESC
//
//    DESC is   DOMAIN ACTION KIND
//
//    DOMAIN is function name
//              event name
//              "all" (all functions and events)
//              "func" (all functions except boring)
//              "allfunc" (all functions)
//              "event" (all events)
//              "ec" (event controller)
//
//    ACTION is "+" (add)
//              "-" (remove)
//
//    KIND is
//     for func
//              "i" (input params)
//              "e" (error returns)
//              "o" (output)
//     for event
//              "t" (event triggered aka posted)
//              "s" (event sent)
//
// Example:
//            -XX:TraceJVMTI=ec+,GetCallerFrame+ie,Breakpoint+s

but I'm not sure that's what you really mean by
cleaning up this "unintelligible code".

>   Do these constants mean anything?  Other than being powers of two?

Yes. Primarily to Dave Dice and Doug Lea. I know something
about most of these flags just from experiments trying to
understand this code, but that knowledge fades quickly.
Karen may still remember what some of these magic values
do, but...

More seriously, each of the flag values enables a particular
code path or optimization or sanity check. I've never seen
a list in a comment that tries to list the different flag
values along with what they do. I suspect that only Dave
and/or Doug could write such a comment completely correctly.

> +     if (EmitSync & 16384) {
>
>
> This one isn't a power of two.

bc(1) disagrees with you:

2^14
16384

I don't defend these uses of literals and you know that
I can't stand literals like this in the code base. However,
the important part for this project is the default code
path where (EmitSync == 0). That's the Out-Of-The-Box (OOTB)
config and that's what I care about right now.

> In 
> http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/src/share/vm/runtime/synchronizer.cpp.udiff.html
>
> +  assert(inusetally == Self->omInUseCount, "in use count off");
>
>
> is this in_use_tally ?

Not sure what you're asking here. Do you want me to
rename 'inusetally' -> 'in_use_tally'? It would
improve readability...

> is curmidinuse
>
> cur_mid_in_use?

Not sure what you're asking here. Do you want me to
rename 'curmidinuse ' -> 'cur_mid_in_use'? It would
improve readability...

I did some parameter renaming in the cleanup bucket,
but didn't really touch the locals...

> I wonder what gOmInUseList is.
>
> global object monitor in use list?

Exactly. I have this block of "globals" and only a few
of them are commented:

  114 // gBlockList is really PaddedEnd<ObjectMonitor> *, but we don't
  115 // want to expose the PaddedEnd template more than necessary.
  116 ObjectMonitor * ObjectSynchronizer::gBlockList = NULL;
  117 ObjectMonitor * volatile ObjectSynchronizer::gFreeList  = NULL;
  118 ObjectMonitor * volatile ObjectSynchronizer::gOmInUseList  = NULL;
  119 int ObjectSynchronizer::gOmInUseCount = 0;
  120 static volatile intptr_t ListLock = 0;      // protects global 
monitor free-list cache
  121 static volatile int MonitorFreeCount  = 0;  // # on gFreeList
  122 static volatile int MonitorPopulation = 0;  // # Extant -- in 
circulation

Are you asking that I do so?

> http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/src/share/vm/runtime/synchronizer.hpp.udiff.html 
>
>
> deflate_monitor_list appears to be  more descriptive.

Interesting. I think that's one that I cleaned up... :-)

> Sorry I can only comment on surface things.

May I still list you as a reviewer?

> It's a shame this code has to be in macro assembler.

As opposed to where? It used to be in .ad files and got
moved the macro assembler by one of the C2 cleanup and
reorganization passes...

> It appears to be called from generate_native_wrapper so affects 
> performance of synchronized native methods?

In theory it makes synchronized native method support faster.
Both the lock and unlock code are also called from the
corresponding C2 nodes for regular synchronized Java object
stuff...

Thanks for taking a look at the code and I'm sorry that I
don't have answers (yet) that will make you cringe less
with this code. It is getting better and less arcane, but
we have a way to go...

Dan

>
>
> Thanks,
> Coleen
>
>
> On 3/13/15, 12:50 AM, David Holmes wrote:
>> Hi Dan,
>>
>> Well that was fun ... :)
>>
>> On 10/03/2015 3:11 AM, Daniel D. Daugherty wrote:
>>> Greetings,
>>>
>>> I have the Contended Locking fast exit bucket ready for review.
>>>
>>> The code changes in this bucket are primarily assembly language
>>> tweaks that come into play when an inflated ObjectMonitor is in
>>> use. There are tweaks that apply to the default code path and
>>> there are tweaks that only apply when an experimental flag is
>>> set. This code code review is focused on the default code path
>>> or the "out of the box" configuration.
>>>
>>> This work is being tracked by the following bug ID:
>>>
>>>      JDK-8073165 Contended Locking fast exit bucket
>>>      https://bugs.openjdk.java.net/browse/JDK-8073165
>>>
>>> Here is the webrev URL:
>>>
>>> http://cr.openjdk.java.net/~dcubed/8073165-webrev/0-jdk9-hs-rt/
>>>
>>> Here is the JEP link:
>>>
>>>      https://bugs.openjdk.java.net/browse/JDK-8046133
>>>
>>> Testing:
>>>
>>> - Aurora Adhoc RT/SVC baseline batch
>>> - JPRT test jobs
>>> - MonitorExitStresser micro-benchmark (in process)
>>> - MonitorEnterExitStresser micro-benchmark (in process)
>>> - CallTimerGrid stress testing (in process)
>>> - Aurora performance testing:
>>>    - out of the box for the "promotion" and 32-bit server configs
>>>    - heavy weight monitors for the "promotion" and 32-bit server 
>>> configs
>>>      (-XX:-UseBiasedLocking -XX:+UseHeavyMonitors)
>>>      (in process)
>>>
>>>
>>> 8073165 summary of changes:
>>>
>>> macroAssembler_sparc.cpp: MacroAssembler::compiler_unlock_object()
>>>
>>> - default (OOTB) EmitSync value == 0
>>> - an _owner field optimization is in the non-default path under
>>>    a (EmitSync & 1024) check
>>>
>>> - classic lock release without 1-0 support is in the non-default
>>>    path under a (EmitSync & 512) check
>>>
>>> - an EntryList, cxq and _succ field optimization is in the
>>>    non-default path under a (EmitSync & 16384) check
>>>
>>> - the default path is now the 1-0 form that avoids CAS and membar
>>>    in the common case; see the "1-0 exit" section in objectMonitor.cpp
>>
>> I don't pretend that I could follow all these changes (I don't read 
>> sparc very well) - the main thing of interest is the EmitSync==0 path 
>> - everything else is just experimental (did you actually test to any 
>> great extent with all these different possible EmitSync values?).
>>
>> So looking at the default path ... I could not figure out what logic 
>> this (pre-existing) code is effecting:
>>
>> 3071      ld_ptr(Address(Rmark,
>> OM_OFFSET_NO_MONITOR_VALUE_TAG(recursions)), Rbox);
>> 3072      orcc(Rbox, G0, G0);
>> 3073      brx(Assembler::notZero, false, Assembler::pn, done);
>>
>> It seems to be a check for:
>>
>> if (recursions != 0) return;
>>
>> which would be fine except where is recursions decremented ??
>>
>> But that aside the description seems reasonable.
>>
>>> sharedRuntime_sparc.cpp:
>>>
>>> - add JavaThread* param for
>>>    SharedRuntime::complete_monitor_unlocking_C call
>>
>> Ok.
>>
>>> macroAssembler_x86.cpp: MacroAssembler::fast_unlock()
>>>
>>> - with all the new/modified comments and all the optional
>>>    code path updates, it is hard to see that the only real
>>>    changes to the default code path are 1) the _owner==Self
>>>    check has been made optional and 2) the optional succ
>>>    field check for the slow path is now on the default path
>>>
>>> - default (OOTB) EmitSync value == 0
>>> - remove optional (EmitSync & 8) code
>>> - refactor code in MacroAssembler::fast_unlock
>>> - remove old mfence() support
>>
>> So I had expected to see this change conditional on the runtime 
>> processor type. But then I checked Dave's blog and saw this was 
>> determined 5+ years ago (!!!) so those older processors are no longer 
>> a consideration. I do hope the performance tests have been re-run 
>> since then though ;-)
>>
>>> - an _owner field optimization is in the non-default path under
>>>    a (EmitSync & 1024) check
>>
>> Doesn't the comment:
>>
>> 2163     if (EmitSync & 1024) {
>> 2164       // Don't bother to ratify that m_Owner==Self
>>
>> apply to else part?
>>
>>> - an _owner field optimization is in the non-default path under
>>>    a (EmitSync & 16) check
>>> - a barrier optimization is in the non-default path under a
>>>    (EmitSync & 32) check
>>> - remove old AMD optional work around
>>
>> 2229       // box is really RAX -- the following CMPXCHG depends on 
>> that binding
>> 2230       // cmpxchg R,[M] is equivalent to rax = CAS(M,rax,R)
>> 2231
>> 2232       movptr(boxReg, (int32_t)NULL_WORD);     // box is really RAX
>>
>> You've repeated the comment about box.
>>
>>> sharedRuntime_x86_32.cpp:
>>> sharedRuntime_x86_64.cpp:
>>>
>>> - add thread param for SharedRuntime::complete_monitor_unlocking_C
>>>    call
>>
>> Ok.
>>
>>> macro.[ch]pp: PhaseMacroExpand::make_slow_call()
>>>
>>> - add param2 and update callers
>>
>> Ok.
>>
>>> runtime.[ch]pp: OptoRuntime::complete_monitor_exit_Type()
>>>
>>> - add JavaThread* parameter
>>
>> Ok
>>
>>> sharedRuntime.[ch]pp: SharedRuntime::complete_monitor_unlocking_C()
>>>
>>> - add JavaThread* parameter
>>
>> Ok. So all that passing through of the Thread was just to avoid a 
>> call to JavaThread::current()? I thought that was already optimized 
>> to just be a register load (at least on some platforms) ?
>>
>>> synchronizer.cpp: ObjectSynchronizer::omRelease()
>>>
>>> - cleanup in use list traversal
>>> - add assert that ObjectMonitor being released was
>>>    actually found and extracted
>>>
>>> synchronizer.[ch]pp:
>>>
>>> - add some comments in various places
>>> - rename walk_monitor_list() -> deflate_monitor_list()
>>>    and update callers
>>> - minor logic refactoring
>>
>> Generally okay.
>>
>> Minor nit: you changed inuse to "in use" but for the "inuse list" it 
>> should really be "in-use list" (and not in-use-list as appears in a 
>> couple of existing places)
>>
>> 1507       // if deflate_monitor succeeded,/moribund
>>
>> Not sure what that comment is supposed to mean :)
>>
>>> thread.hpp:
>>>
>>> - add _TSelf field for mfence() replacement
>>
>> Hmmmm, so we add a new pointer field to the Thread structure purely 
>> for use on x86 if running with a particular experimental option set ? 
>> That doesn't seem like it fits with our normal considerations for 
>> space/time trade-offs. Maybe the xchg T,T->TSelf stuff should be 
>> under a compile-flag instead?
>>
>>>
>>> Thanks, in advance, for any comments, questions or suggestions.
>>
>> Cheers,
>> David
>>
>>> Dan
>