RFR (XL) 8031320: Use Intel RTM instructions for locks
Vladimir Kozlov
vladimir.kozlov at oracle.com
Thu Mar 20 16:23:38 UTC 2014
Thank you, David
I will remove that check.
Vladimir
On 3/20/14 5:48 AM, David Holmes wrote:
> <trimming>
>
> On 20/03/2014 3:26 PM, Vladimir Kozlov wrote:
>>> src/share/vm/runtime/task.cpp
>>> line: 108: _interval <= max_jint &&
>>> Not obvious why this changed.
>>
>> I hit this assert if I specify RTMLockingCalculationDelay=11000, i.e.
>> more than 10 sec (PeriodicTask::max_interval = 10000). I don't want to
>> increase it because it is used for _intervalHistogram[max_interval] in
>> the debug VM. Note that PeriodicTask correctly processes times larger
>> than max_interval:
>>
>> if (ms >= PeriodicTask::max_interval) ms =
>> PeriodicTask::max_interval - 1;
>
> This seems broken. You either need to limit your interval to < max_interval, or you have to increase max_interval, or
> max_interval is meaningless. Checking that an int is <= max_jint is vacuously true. I think max_interval is meaningless other
> than for the histogram - so I would remove that part of the assertion, as the histogram code already handles putting all
> larger intervals into the last bucket. max_interval should really be configurable.
>
> David
>
>>
>>>
>>> src/share/vm/runtime/thread.cpp
>>> line 2: #include "runtime/rtmLocking.hpp"
>>> Should this include be conditional (#ifdef'ed)?
>>
>> done
>>
>>>
>>> src/cpu/x86/vm/globals_x86.hpp
>>> Seems like the new options are also "available" in every X86
>>> build config, but probably not usable. Perhaps these should
>>> only be defined when INCLUDE_RTM_OPT is defined.
>>>
>>> Update: OK, now I see from code in vm_version_x86.cpp that you
>>> want at least some of the options available to get better
>>> diagnostics when someone tries to specify UseRTMLocking on a
>>> machine that doesn't support the feature.
>>>
>>> src/cpu/x86/vm/vm_version_x86.hpp
>>> line 450: if (_cpuid_info.sef_cpuid7_ebx.bits.rtm != 0)
>>> line 451: result |= CPU_RTM;
>>> Should the above lines be farther down in this block:
>>>
>>> line 463: // Intel features.
>>> line 464: if(is_intel()) {
>>
>> AMD uses different bits and different instructions for their ASF.
>> I don't think we need to put this under is_intel(). The rule is: if a
>> bit is set, the instructions should be fully supported.
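>>
>> For reference (a sketch outside HotSpot): the RTM feature bit is
>> reported by CPUID leaf 7, subleaf 0, in EBX bit 11, which is what
>> _cpuid_info.sef_cpuid7_ebx.bits.rtm corresponds to:
>>
>>   #include <cpuid.h>   // GCC/Clang __get_cpuid_count
>>
>>   bool cpu_has_rtm() {
>>     unsigned eax, ebx, ecx, edx;
>>     if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
>>       return false;                 // CPUID leaf 7 not available
>>     return (ebx >> 11) & 1;         // CPUID.(EAX=7,ECX=0):EBX.RTM[bit 11]
>>   }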
>>
>>>
>>> src/cpu/x86/vm/vm_version_x86.cpp
>>> line 601: if (!FLAG_IS_DEFAULT(UseRTMForStackLocks)) {
>>> line 602: warning("UseRTMForStackLocks flag should be off
>>> when UseRTMLocking flag is off");
>>> line 603: }
>>> So why does 'UseRTMForStackLocks' warrant a warning when used
>>> without 'UseRTM'? 'UseRTMDeopt' and
>>> 'PrintPreciseRTMLockingStatistics'
>>> simply get set to false.
>>
>> Because UseRTMForStackLocks is an additional optimization, whereas
>> UseRTMDeopt only controls additional code generation inside the
>> UseRTMLocking code.
>>
>>>
>>> line 975: UseBiasedLocking = false;
>>> You use the FLAG_SET_DEFAULT macro elsewhere when resetting
>>> flags so why not here? Have I missed some magic part of the
>>> FLAG_FOO support macros?
>>
>> I am not sure what to use here. FLAG_SET_ERGO? We are sloppy about
>> setting UseBiasedLocking in arguments.cpp too. So I will leave it for
>> future cleanup.
>>
>>>
>>> src/cpu/x86/vm/assembler_x86.hpp
>>> line 1454: void pause();
>>> This was a surprise; if it's the same as the pause() in my
>>> contended locking project code from Dice, then I'll defer to
>>> your work.
>>
>> It is a 'hint' (no-op) instruction that reduces power consumption in a
>> spin-loop:
>>
>> "The PAUSE instruction provides a hint to the processor that the code
>> sequence is a spin-wait loop. The processor uses this hint to avoid the
>> memory order violation in most situations, which greatly improves
>> processor performance. For this reason, it is recommended that a PAUSE
>> instruction be placed in all spin-wait loops."
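>>
>> For illustration only (not from the patch), a typical spin-wait loop
>> using the PAUSE hint via the _mm_pause() intrinsic looks like:
>>
>>   #include <immintrin.h>   // _mm_pause
>>   #include <atomic>
>>
>>   std::atomic<bool> ready{false};
>>
>>   void spin_wait() {
>>     // PAUSE tells the CPU this is a spin-wait loop, avoiding the
>>     // memory-order-violation penalty on loop exit and saving power.
>>     while (!ready.load(std::memory_order_acquire))
>>       _mm_pause();
>>   }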
>>
>>>
>>> src/cpu/x86/vm/assembler_x86.cpp
>>> I presume someone else is checking the HEX instruction encodings
>>> versus the data book.
>>>
>>> line 3014 }
>>> line 3015 else
>>> line 3016 {
>>> nit: Most of the 'else' block styles in this file are:
>>>
>>> } else {
>>
>> Fixed.
>>
>>>
>>> line 3017 abort.add_patch_at(code(), locator());
>>> line 3018 emit_int8((unsigned char)0xC7);
>>> line 3019 emit_int8((unsigned char)0xF8);
>>> line 3020 emit_int32(0);
>>> Trying to understand the above sequence. Is it basically
>>> an "Xbegin" over no bytes? Or a no-op of sorts?
>>
>> It is the code for a forward branch whose distance is not yet known.
>> The emit_int32(0) placeholder will be patched when bind(abort_label) is
>> executed later. This is the normal pattern for forward branches.
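>>
>> In outline (a generic sketch of the patching pattern, not the actual
>> Assembler/Label API):
>>
>>   #include <cstdint>
>>   #include <cstring>
>>   #include <vector>
>>
>>   // Emit XBEGIN with a rel32 placeholder; return where to patch it.
>>   size_t emit_xbegin(std::vector<uint8_t>& code) {
>>     code.push_back(0xC7);
>>     code.push_back(0xF8);            // XBEGIN rel32 opcode bytes
>>     size_t disp_pos = code.size();
>>     code.insert(code.end(), 4, 0);   // emit_int32(0) placeholder
>>     return disp_pos;
>>   }
>>
>>   // bind(): now that the target is known, patch the displacement,
>>   // which is relative to the end of the XBEGIN instruction.
>>   void bind_here(std::vector<uint8_t>& code, size_t disp_pos) {
>>     int32_t disp = (int32_t)(code.size() - (disp_pos + 4));
>>     std::memcpy(&code[disp_pos], &disp, sizeof(disp));
>>   }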
>>
>>>
>>> src/cpu/x86/vm/macroAssembler_x86.hpp
>>> line 30: #include "runtime/rtmLocking.hpp"
>>> Should this include be conditional (#ifdef'ed)?
>>
>> Can't do that, for the reason you point out below. I need the
>> RTMLockingCounters class defined in all x86 configs.
>>
>>>
>>> line 660: void fast_lock(Register obj, Register box, Register tmp,
>>> line 661: Register scr, Register cx1, Register cx2,
>>> line 662: BiasedLockingCounters* counters,
>>> line 663: RTMLockingCounters* rtmcounters,
>>> line 664: RTMLockingCounters* stackrtmcounters,
>>> line 665: Metadata* method_data,
>>> line 666: bool use_rtm, bool profile_rtm);
>>> With the unconditional inclusion of RTM stuff in the
>>> parameters, you have no choice but to include some of
>>> the RTM data types in all X86 configs.
>>>
>>> Did you consider having two different versions of the
>>> function declaration?
>>
>> Too much trouble and additional code for little benefit.
>>
>>>
>>> src/cpu/x86/vm/macroAssembler_x86.cpp
>>> line 304: void MacroAssembler::movptr(Register dst, AddressLiteral
>>> src, Register scratch) {
>>> Perhaps reiterate that you're intentionally ignoring the
>>> 'scratch' parameter for 32-bit.
>>
>> void MacroAssembler::movptr(Register dst, AddressLiteral src, Register
>> scratch) {
>> // scratch register is not used,
>> // it is defined to match parameters of 64-bit version of this method.
>>
>>>
>>> line 701: movq(dst, Address(scratch,0));
>>> nit: space before '0'.
>>
>> done.
>>
>>>
>>> line 1000: void MacroAssembler::atomic_incl()
>>> It's not clear (to me anyway) why the pushf() and popf()
>>> calls are no longer needed.
>>>
>>> Update: OK, I think I see it. The old code always called
>>> pushf()/popf() even when 'rscratch1' wasn't used. Now
>>> you pass in a scratch register when you need one.
>>
>> No, it is because we need to save flags in cond_inc32(), which is where
>> I moved pushf() and popf(). Before this change, atomic_incl() was
>> called only from cond_inc32().
>>
>>>
>>> line 1684: testptr(tmpReg, markOopDesc::monitor_value); // inflated
>>> vs stack-locked|neutral|biased
>>> line 1944: movptr (boxReg, tmpReg);
>>> These fixes are independent of RTM code.
>>> Are these something that should be fixed in older HSX versions?
>>
>> No, it is a simple cleanup to use symbolic names instead of hardcoded
>> digits. The generated code does not change.
>>
>>>
>>> line 1698: jcc(Assembler::equal, DONE_LABEL);
>>> line 1708: jmp(DONE_LABEL);
>>> line 2050: jcc (Assembler::zero, DONE_LABEL); // 0
>>> indicates recursive stack-lock
>>> line 2053: jcc (Assembler::zero, Stacked);
>>> I presume these changed because when the RTM code is there
>>> the DONE_LABEL is just too far away. Should we be doing:
>>>
>>> #if INCLUDE_RTM_OPT
>>> <new_code>
>>> #else
>>> <old_code>
>>> #endif
>>>
>>> or is the difference in code size/speed just not worth
>>> the #if noise?
>>
>> Not worth it. It is more important to have code we can understand
>> easily than to save 3 bytes.
>> Actually, in fast_unlock() (lines around 2050) the additional RTM code
>> is small, so I can use jccb for the Stacked branch. DONE_LABEL is
>> unfortunately too far away.
>>
>>>
>>> line 2050: jcc (Assembler::zero, DONE_LABEL); // 0
>>> indicates recursive stack-lock
>>> Changed from 'jccb' -> 'jcc' because of increased code
>>> size (I'm guessing). However, the instruction also moved
>>> from after this line to before it:
>>>
>>> line 2051: movptr(tmpReg, Address(objReg, 0)); // Examine the
>>> object's markword
>>>
>>> which makes me think that at least part of this change
>>> is applicable to older HSX.
>>
>> Again, this is not for correctness but to make the code easier to
>> understand.
>>
>>>
>>> line 2841: pushf(); // Preserve flags
>>> line 2842: atomic_incl(counter_addr);
>>> line 2843: popf();
>>> I thought the 32-bit increment was changed to no longer
>>> clobber the flags register (and the 64-bit increment was
>>> changed to take a scratch parameter).
>>
>> No, the increment instruction on x86 always changes flags. Only the
>> lea() instruction does not affect flags, but it is not used for this
>> +1 increment.
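>>
>> To illustrate the difference (a sketch using GCC inline asm, x86 only;
>> not HotSpot code):
>>
>>   int inc_then_lea(int v) {
>>     // INC updates EFLAGS (OF, SF, ZF, AF, PF), hence the "cc" clobber:
>>     asm volatile("incl %0" : "+r"(v) : : "cc");
>>     // LEA computes v + 1 as an address calculation and leaves EFLAGS
>>     // untouched, so no "cc" clobber is needed:
>>     asm volatile("leal 1(%0), %0" : "+r"(v));
>>     return v;
>>   }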
>>
>>>
>>> src/cpu/x86/vm/sharedRuntime_x86_32.cpp
>>> No comments.
>>>
>>> src/cpu/x86/vm/sharedRuntime_x86_64.cpp
>>> No comments.
>>>
>>>
>>> Pausing my review at this point and sending what I have since the
>>> remainder is C2 code and that'll take me a while to get through.
>>
>> Thank you, Dan, for the very detailed review.
>>
>> Vladimir
>>
>>>
>>> Dan
>>>
>>>
>>> src/cpu/x86/vm/x86_32.ad
>>> src/cpu/x86/vm/x86_64.ad
>>> src/share/vm/adlc/output_c.cpp
>>> src/share/vm/ci/ciEnv.cpp
>>> src/share/vm/ci/ciEnv.hpp
>>> src/share/vm/ci/ciMethodData.hpp
>>> src/share/vm/code/nmethod.cpp
>>> src/share/vm/code/nmethod.hpp
>>> src/share/vm/oops/method.cpp
>>> src/share/vm/oops/methodData.cpp
>>> src/share/vm/oops/methodData.hpp
>>> src/share/vm/opto/c2_globals.hpp
>>> src/share/vm/opto/classes.hpp
>>> src/share/vm/opto/compile.cpp
>>> src/share/vm/opto/compile.hpp
>>> src/share/vm/opto/connode.hpp
>>> src/share/vm/opto/graphKit.cpp
>>> src/share/vm/opto/locknode.cpp
>>> src/share/vm/opto/locknode.hpp
>>> src/share/vm/opto/loopTransform.cpp
>>> src/share/vm/opto/machnode.hpp
>>> src/share/vm/opto/macro.cpp
>>> src/share/vm/opto/macro.hpp
>>> src/share/vm/opto/parse.hpp
>>> src/share/vm/opto/parse1.cpp
>>> src/share/vm/opto/runtime.cpp
>>> src/share/vm/opto/runtime.hpp
>>>
>>>
>>>
>>>
>>>
>>>> The Intel architecture codenamed Haswell supports the RTM
>>>> (Restricted Transactional Memory) instructions xbegin, xabort, xend
>>>> and xtest as part of the Intel Transactional Synchronization
>>>> Extensions (TSX). The xbegin and xend instructions enclose a set of
>>>> instructions to be executed as a transaction. If no conflict is found
>>>> during execution of the transaction, the memory and register
>>>> modifications are committed together at xend. The xabort instruction
>>>> can be used to explicitly abort a transaction, and xtest to check
>>>> whether we are inside a transaction.
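>>>>
>>>> For illustration, here is the basic pattern using the matching
>>>> compiler intrinsics (a user-level sketch, not the HotSpot assembler
>>>> code; requires -mrtm with GCC/Clang):
>>>>
>>>>   #include <immintrin.h>   // _xbegin, _xend, _xabort, _xtest
>>>>
>>>>   int shared_counter;
>>>>
>>>>   bool try_transactional_increment() {
>>>>     unsigned status = _xbegin();    // xbegin: start transaction
>>>>     if (status == _XBEGIN_STARTED) {
>>>>       shared_counter++;             // speculative update
>>>>       _xend();                      // xend: commit atomically
>>>>       return true;
>>>>     }
>>>>     // Aborted (conflict, capacity, explicit xabort, ...); 'status'
>>>>     // carries the abort reason bits and all speculative state has
>>>>     // been rolled back.
>>>>     return false;
>>>>   }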
>>>>
>>>> RTM is useful for highly contended locks with low conflict in the
>>>> critical region. Such highly contended locks don't scale well
>>>> otherwise, but with RTM they show good scaling, which allows
>>>> applications to use coarse-grained locking. Also, for lightly
>>>> contended locks that are used by different threads, RTM can reduce
>>>> cache line ping-pong and thereby improve performance as well.
>>>>
>>>> Implementation:
>>>>
>>>> Generate RTM locking code for all inflated locks when the
>>>> "UseRTMLocking" option is on, with normal locking as the fallback
>>>> mechanism. On abort or lock busy, the lock will be retried a fixed
>>>> number of times, as specified by the "RTMRetryCount" option (see the
>>>> sketch below). Locks which abort too often can be auto-tuned or
>>>> manually tuned.
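>>>>
>>>> The retry-with-fallback shape looks roughly like this (an
>>>> illustrative user-level sketch with intrinsics, not the generated
>>>> assembly; kRetryCount stands in for RTMRetryCount):
>>>>
>>>>   #include <immintrin.h>
>>>>   #include <atomic>
>>>>
>>>>   std::atomic<bool> fallback_locked{false};
>>>>   const int kRetryCount = 5;
>>>>
>>>>   void lock_enter() {
>>>>     for (int i = 0; i < kRetryCount; i++) {
>>>>       if (_xbegin() == _XBEGIN_STARTED) {
>>>>         // Put the fallback lock word into the read-set: if another
>>>>         // thread takes it, this transaction aborts automatically.
>>>>         if (!fallback_locked.load(std::memory_order_relaxed))
>>>>           return;                   // running transactionally
>>>>         _xabort(0xff);              // lock busy: abort and retry
>>>>       }
>>>>     }
>>>>     // Retries exhausted: fall back to normal locking.
>>>>     while (fallback_locked.exchange(true, std::memory_order_acquire))
>>>>       _mm_pause();
>>>>   }
>>>>
>>>>   void lock_exit() {
>>>>     if (_xtest()) _xend();          // commit elided critical section
>>>>     else fallback_locked.store(false, std::memory_order_release);
>>>>   }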
>>>>
>>>> Auto-tuning can be done using the "UseRTMDeopt" flag, which adds
>>>> abort ratio calculation code for each lock. The abort ratio is
>>>> calculated after "RTMAbortThreshold" aborts are encountered.
>>>> With "UseRTMDeopt", if the abort ratio reaches "RTMAbortRatio", the
>>>> nmethod containing the lock will be deoptimized and recompiled with
>>>> all locks as normal (stack) locks. If the abort ratio remains low
>>>> after "RTMLockingThreshold" attempted locks, the method will be
>>>> deoptimized and recompiled with all locks as RTM locks and without
>>>> the abort ratio calculation code. The abort ratio calculation can be
>>>> delayed by specifying the -XX:RTMLockingCalculationDelay=<millisec>
>>>> flag. Deoptimization of the nmethod is done by adding an uncommon
>>>> trap at the beginning of the code; it checks the rtm state field in
>>>> the MDO, which is modified by the abort calculation code.
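>>>>
>>>> A minimal sketch of the abort-ratio decision (names and types are
>>>> illustrative only, not the actual implementation):
>>>>
>>>>   #include <cstdint>
>>>>
>>>>   // Fall back to normal locks once the observed abort percentage
>>>>   // reaches RTMAbortRatio; evaluated only after RTMAbortThreshold
>>>>   // aborts have been seen.
>>>>   bool abort_ratio_reached(uint64_t aborts, uint64_t total_attempts,
>>>>                            uint64_t rtm_abort_ratio /* percent */) {
>>>>     return aborts * 100 >= rtm_abort_ratio * total_attempts;
>>>>   }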
>>>>
>>>> For manual tuning, the abort statistics for each lock can be shown
>>>> to the user with the "PrintPreciseRTMLockingStatistics" diagnostic
>>>> flag. Based on these statistics, users can create a .hotspot_compiler
>>>> file or use the -XX:CompileCommand=<option> flag to specify the
>>>> methods for which RTM locking should be disabled, using <option>
>>>> "NoRTMLockEliding", or always enabled, using <option>
>>>> "UseRTMLockEliding". For example:
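>>>>
>>>>   -XX:CompileCommand=option,MyClass::run,NoRTMLockEliding
>>>>
>>>> (MyClass::run is a placeholder method name), or the equivalent line
>>>> "option,MyClass::run,NoRTMLockEliding" in a .hotspot_compiler file.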
>>>>
>>>> The abort calculation and statistics collection are done using
>>>> RTMLockingCounters, wrapped into RTMLockingNamedCounter counters
>>>> which are generated for each lock. To reduce the burden on the cache
>>>> line, the RTM lock total counter is updated randomly, at the rate
>>>> set by RTMTotalCountIncrRate:
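>>>>
>>>> Roughly (a sketch; next_random() stands in for a per-thread RNG and
>>>> the rate is assumed to be a power of two):
>>>>
>>>>   // Bump the shared total counter only about once per
>>>>   // RTMTotalCountIncrRate lock attempts, chosen pseudo-randomly, so
>>>>   // threads do not fight over the counter's cache line on every lock.
>>>>   if ((next_random(thread) & (RTMTotalCountIncrRate - 1)) == 0) {
>>>>     total_count++;
>>>>   }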
>>>>
>>>> Note that both auto- and manual tuning are done for the whole method.
>>>> There is no mechanism to tune an individual lock.
>>>>
>>>> RTM locking can be used for normal (stack) locks by specifying
>>>> "UseRTMForStackLocks" flag.
>>>>
>>>> RTM locking code requires that biased locking is switched off,
>>>> because the two conflict. RTM locking is most useful when there is
>>>> high lock contention and low data contention. With high lock
>>>> contention the lock is usually inflated, and biased locking is not
>>>> suitable for that case anyway.
>>>>
>>>> It was requested that this code not affect other platforms. For that
>>>> reason, most of the code is put under #if INCLUDE_RTM_OPT, which is
>>>> defined only for X86 with C2 and not for EMBEDDED builds.
>>>>
>>>> All new RTM flags are declared as experimental and require
>>>> specifying the "UnlockExperimentalVMOptions" flag, for example:
>>>>
>>>>
>>>> SQE did full testing on these changes. Additional tests were developed.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>