RFR (XL) 8031320: Use Intel RTM instructions for locks

Vladimir Kozlov vladimir.kozlov at oracle.com
Thu Mar 20 05:26:07 UTC 2014


Thank you, Dan

I updated webrev.01 with your suggestions and Igor's latest ones. See my
other mail for a description of the changes (mostly refactoring in
macroAssembler_x86.cpp).

http://cr.openjdk.java.net/~kvn/8031320_9/webrev.01/

On 3/19/14 8:00 PM, Daniel D. Daugherty wrote:
> On 3/17/14 1:11 PM, Vladimir Kozlov wrote:
>> https://bugs.openjdk.java.net/browse/JDK-8031320
>> http://cr.openjdk.java.net/~kvn/8031320_9/webrev/
>
> src/share/vm/runtime/rtmLocking.hpp
>      line 29: 29 // locks when "UseRTM" option ...
>          Is it "UseRTM" or "UseRTMLocking"?

Fixed.

>
>      line 56: // or use -XX:CompileCommnad=option,class::method,DoNotElide
>          typo: 'CompileCommnad' -> 'CompileCommand'

Fixed.

>
>      line 88: static uintx* rtm_calculation_flag() { return
> &_calculation_flag; }
>          Should this be: rtm_calculation_flag_addr() to match other
>          accessors where we return the address of a field.

Changed as you suggested.

>
>      line 107: bool nonzero() {  return (_abort_count+_total_count) > 0; }
>          nit: spaces around the '+' op

Added spaces.

>
> src/cpu/x86/vm/rtmLocking.cpp
>      No comments.
>
> src/share/vm/utilities/globalDefinitions.hpp
>      Since code above this is conditional, I wonder why lines 384-389
> aren't:
>
>   376 #if defined(X86) && defined(COMPILER2) && !defined(JAVASE_EMBEDDED)
> <snip>
>   380 #else
> <snip>
>   383 #endif
>   384 // States of Restricted Transactional Memory usage.
>   385 enum RTMState {
>   386   NoRTM      = 0x2, // Don't use RTM
>   387   UseRTM     = 0x1, // Use RTM
>   388   ProfileRTM = 0x0  // Use RTM with abort ratio calculation
>   389 };
>
>      Update: Some of the code that uses these values is not
>      conditional.

Yes, C2 uses it unconditionally to cache rtm_state. We are trying to keep
the number of #ifdefs in C2 low.

>
> src/share/vm/runtime/arguments.cpp
>      The movement of this UseOptoBiasInlining option management
>      code appears to be generally useful and not just RTM related.
>      Is this something that should be fixed in older HSX versions?

Yes, it should be fixed in 7u for embedded. These changes will be 
backported into 8u so it will be fixed there.

>
> src/share/vm/runtime/deoptimization.hpp
>      No comments.
>
> src/share/vm/runtime/deoptimization.cpp
>      line 1579: if ((reason != Reason_rtm_state_change) && (trap_mdo !=
> NULL) &&
>      line 1580:     UseRTMDeopt && (nm->rtm_state() != ProfileRTM)) {
>          nit: seems like this is over parenthesized.

No, we put parentheses in such cases: (==) && (!=).

>
> src/share/vm/runtime/java.cpp
>      line 268: if (PrintLockStatistics ||
> PrintPreciseBiasedLockingStatistics || PrintPreciseRTMLockingStatistics) {
>      line 390: if (PrintPreciseBiasedLockingStatistics ||
> PrintPreciseRTMLockingStatistics) {
>          Seems inconsistent not to use 'RTM_OPT_ONLY(...) here.

PrintPreciseBiasedLockingStatistics is declared in c2_globals.hpp so it 
does not require RTM_OPT_ONLY(). And it is harmless.


> src/share/vm/runtime/task.cpp
>      line: 108:          _interval <= max_jint &&
>          Not obvious why this changed.

I hit this assert when I specified RTMLockingCalculationDelay=11000, i.e.
more than 10 sec. PeriodicTask::max_interval = 10000. I don't want to
increase it because it is used for _intervalHistogram[max_interval] in the
debug VM. Note that PeriodicTask correctly processes times larger than
max_interval:

     if (ms >= PeriodicTask::max_interval) ms = PeriodicTask::max_interval - 1;


>
> src/share/vm/runtime/thread.cpp
>      line 2: #include "runtime/rtmLocking.hpp"
>          Should this include be conditional (#ifdef'ed)?

Done.

>
> src/cpu/x86/vm/globals_x86.hpp
>      Seems like the new options are also "available" in every X86
>      build config, but probably not usable. Perhaps these should
>      only be defined when INCLUDE_RTM_OPT is defined.
>
>      Update: OK, now I see from code in vm_version_x86.cpp that you
>      want at least some of the options available to get better
>      diagnostics when someone tries to specify UseRTMLocking on a
>      machine that doesn't support the feature.
>
> src/cpu/x86/vm/vm_version_x86.hpp
>      line 450:     if (_cpuid_info.sef_cpuid7_ebx.bits.rtm != 0)
>      line 451:       result |= CPU_RTM;
>          Should the above lines be farther down in this block:
>
>          line 463:     // Intel features.
>          line 464:     if(is_intel()) {

AMD uses different bits and different instructions for their ASF.
I don't think we need to put it under is_intel(). The rule is: if a bit
is set, the instructions should be fully supported.

>
> src/cpu/x86/vm/vm_version_x86.cpp
>      line 601:       if (!FLAG_IS_DEFAULT(UseRTMForStackLocks)) {
>      line 602:         warning("UseRTMForStackLocks flag should be off
> when UseRTMLocking flag is off");
>      line 603:       }
>          So why does 'UseRTMForStackLocks' warrant a warning when used
>          without 'UseRTM'? 'UseRTMDeopt' and
> 'PrintPreciseRTMLockingStatistics'
>          simply get set to false.

Because UseRTMForStackLocks is an additional optimization, whereas
UseRTMDeopt is for additional code generation in the UseRTM code.

>
>      line 975:       UseBiasedLocking = false;
>          You use the FLAG_SET_DEFAULT macro elsewhere when resetting
>          flags so why not here? Have I missed some magic part of the
>          FLAG_FOO support macros?

I am not sure what to use here. FLAG_SET_ERGO? We are sloppy about 
setting UseBiasedLocking in arguments.cpp too. So I will leave it for 
future cleanup.

>
> src/cpu/x86/vm/assembler_x86.hpp
>      line 1454: void pause();
>          This was a surprise; if it's the same as the pause() in my
>          contended locking project code from Dice, then I'll defer to
>          your work.

It is a 'hint' (no-op) instruction to reduce power consumption in spin-loops:

"The PAUSE instruction provides a hint to the processor that the code 
sequence is a spin-wait loop. The processor uses this hint to avoid the 
memory order violation in most situations, which greatly improves 
processor performance. For this reason, it is recommended that a PAUSE 
instruction be placed in all spin-wait loops."
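
For illustration, this is the typical shape of such a spin-wait loop,
written as MacroAssembler calls (a rough sketch only, not code from this
webrev; boxReg and the loop structure are made up):

     Label spin;
     bind(spin);
     pause();                                // hint to the core: this is a spin-wait loop
     cmpptr(Address(boxReg, 0), (int32_t)0); // illustrative check: lock word still non-zero?
     jcc(Assembler::notZero, spin);          // keep spinning while the lock is held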

>
> src/cpu/x86/vm/assembler_x86.cpp
>      I presume someone else is checking the HEX instruction encodings
>      versus the data book.
>
>      line 3014   }
>      line 3015   else
>      line 3016   {
>          nit: Most of the 'else' block styles in this file are:
>
>              } else {

Fixed.

>
>      line 3017     abort.add_patch_at(code(), locator());
>      line 3018     emit_int8((unsigned char)0xC7);
>      line 3019     emit_int8((unsigned char)0xF8);
>      line 3020     emit_int32(0);
>          Trying to understand the above sequence. Is it basically
>          an "Xbegin" over no bytes? Or a no-op of sorts?

It is the code for a forward branch whose distance is not yet known. The
emit_int32(0) placeholder will be patched when bind(abort_label) is
executed later. This is the normal pattern for forward branches.
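
Seen from the caller's side it is the usual Label pattern (illustrative
sketch; the exact xbegin() signature in the webrev may differ):

     Label abort;
     xbegin(abort);  // label not bound yet: a patch record is kept and a
                     // zero 32-bit displacement is emitted as a placeholder
     // ... transactional region ...
     xend();
     // ... fall through on success ...
     bind(abort);    // target known now: the recorded placeholder is patched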

>
> src/cpu/x86/vm/macroAssembler_x86.hpp
>      line 30: #include "runtime/rtmLocking.hpp"
>          Should this include be conditional (#ifdef'ed)?

Can't do that, for the reason you pointed out below: I need the
RTMLockingCounters class defined in all x86 configurations.

>
>      line 660: void fast_lock(Register obj, Register box, Register tmp,
>      line 661:                Register scr, Register cx1, Register cx2,
>      line 662:                BiasedLockingCounters* counters,
>      line 663:                RTMLockingCounters* rtmcounters,
>      line 664:                RTMLockingCounters* stackrtmcounters,
>      line 665:                Metadata* method_data,
>      line 666:                bool use_rtm, bool profile_rtm);
>          With the unconditional inclusion of RTM stuff in the
>          parameters, you have no choice but to include some of
>          the RTM data types in all X86 configs.
>
>          Did you consider having two different versions of the
>          function declaration?

Too much trouble and additional code, which would not help significantly.

>
> src/cpu/x86/vm/macroAssembler_x86.cpp
>      line 304: void MacroAssembler::movptr(Register dst, AddressLiteral
> src, Register scratch) {
>          Perhaps reiterate that you're intentionally ignoring the
>          'scratch' parameter for 32-bit.

void MacroAssembler::movptr(Register dst, AddressLiteral src, Register scratch) {
   // scratch register is not used,
   // it is defined to match parameters of 64-bit version of this method.

>
>      line 701:       movq(dst, Address(scratch,0));
>          nit: space before '0'.

Done.

>
>      line 1000: void MacroAssembler::atomic_incl()
>          It's not clear (to me anyway) why the pushf() and popf()
>          calls are no longer needed.
>
>          Update: OK, I think I see it. The old code always called
>          pushf()/popf() even when 'rscratch1' wasn't used. Now
>          you pass in a scratch register when you need one.

No, it is because we need to save the flags in cond_inc32(), which is where
I moved pushf() and popf(). Before this change atomic_incl() was called only
from cond_inc32().
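
Roughly like this (a simplified sketch of the idea, not the verbatim
webrev code):

     void MacroAssembler::cond_inc32(Condition cond, AddressLiteral counter_addr) {
       Condition neg_cond = negate_condition(cond);
       Label done;
       jcc(neg_cond, done);        // skip the increment if 'cond' does not hold
       pushf();                    // atomic_incl() clobbers the flags, preserve them
       atomic_incl(counter_addr);
       popf();
       bind(done);
     }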

>
>      line 1684: testptr(tmpReg, markOopDesc::monitor_value); // inflated
> vs stack-locked|neutral|biased
>      line 1944: movptr (boxReg, tmpReg);
>          These fixes are independent of RTM code.
>          Are these something that should be fixed in older HSX versions?

No, it is a simple cleanup to use symbolic names instead of hardcoded
digits. The generated code does not change.

>
>      line 1698: jcc(Assembler::equal, DONE_LABEL);
>      line 1708: jmp(DONE_LABEL);
>      line 2050: jcc   (Assembler::zero, DONE_LABEL);            // 0
> indicates recursive stack-lock
>      line 2053: jcc   (Assembler::zero, Stacked);
>          I presume these changed because when the RTM code is there
>          the DONE_LABEL is just too far away. Should we be doing:
>
>          #if INCLUDE_RTM_OPT
>            <new_code>
>          #else
>            <old_code>
>          #endif
>
>          or is the difference in code size/speed just not worth
>          the #if noise?

Not worth it. It is more important to have code which we can understand
easily than to save 3 bytes.
Actually, in fast_unlock() (lines around 2050) the additional RTM code is
small, so I can use jccb for the Stacked branch. DONE_LABEL is
unfortunately too far away.
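
For reference: jccb emits the 2-byte short form with an 8-bit
displacement (target within about +/-127 bytes), while jcc emits the
6-byte near form with a 32-bit displacement. So the shape is roughly
(illustration only, not the exact webrev code):

     jcc (Assembler::zero, DONE_LABEL); // near form: the added RTM code can push
                                        // DONE_LABEL out of rel8 range
     // ... small amount of RTM code ...
     jccb(Assembler::zero, Stacked);    // short form: Stacked stays within rel8 range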

>
>      line 2050: jcc   (Assembler::zero, DONE_LABEL);            // 0
> indicates recursive stack-lock
>          Changed from 'jccb' -> 'jcc' because of increased code
>          size (I'm guessing). However, the instruction also moved
>          from after this line to before it:
>
>          line 2051: movptr(tmpReg, Address(objReg, 0)); // Examine the
> object's markword
>
>          which makes me think that at least part of this change
>          is applicable to older HSX.

It is again not for correctness but to make the code easier to understand.

>
>      line 2841:   pushf(); // Preserve flags
>      line 2842:   atomic_incl(counter_addr);
>      line 2843:   popf();
>          I thought the 32-bit increment was changed to no longer
>          clobber the flags register (and the 64-bit increment was
>          changed to take a scratch parameter).

No, the increment instruction on x86 always changes the flags. Only the
lea() instruction does not affect flags, but it is not used for this +1
increment.
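
Illustration only (not webrev code; rax is just an example register):

     incrementl(rax, 1);          // INC/ADD: updates ZF, SF, OF, ... i.e. clobbers flags
     leal(rax, Address(rax, 1));  // LEA rax, [rax+1]: same arithmetic, flags untouched,
                                  // but LEA cannot take a LOCK prefix, so it cannot do
                                  // the atomic in-memory counter update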

>
> src/cpu/x86/vm/sharedRuntime_x86_32.cpp
>      No comments.
>
> src/cpu/x86/vm/sharedRuntime_x86_64.cpp
>      No comments.
>
>
> Pausing my review at this point and sending what I have since the
> remainder is C2 code and that'll take me a while to get through.

Thank you, Dan, for the very detailed review.

Vladimir

>
> Dan
>
>
> src/cpu/x86/vm/x86_32.ad
> src/cpu/x86/vm/x86_64.ad
> src/share/vm/adlc/output_c.cpp
> src/share/vm/ci/ciEnv.cpp
> src/share/vm/ci/ciEnv.hpp
> src/share/vm/ci/ciMethodData.hpp
> src/share/vm/code/nmethod.cpp
> src/share/vm/code/nmethod.hpp
> src/share/vm/oops/method.cpp
> src/share/vm/oops/methodData.cpp
> src/share/vm/oops/methodData.hpp
> src/share/vm/opto/c2_globals.hpp
> src/share/vm/opto/classes.hpp
> src/share/vm/opto/compile.cpp
> src/share/vm/opto/compile.hpp
> src/share/vm/opto/connode.hpp
> src/share/vm/opto/graphKit.cpp
> src/share/vm/opto/locknode.cpp
> src/share/vm/opto/locknode.hpp
> src/share/vm/opto/loopTransform.cpp
> src/share/vm/opto/machnode.hpp
> src/share/vm/opto/macro.cpp
> src/share/vm/opto/macro.hpp
> src/share/vm/opto/parse.hpp
> src/share/vm/opto/parse1.cpp
> src/share/vm/opto/runtime.cpp
> src/share/vm/opto/runtime.hpp
>
>
>
>
>
>> The Intel architecture codenamed Haswell has support for the RTM
>> (Restricted Transactional Memory) instructions xbegin, xabort, xend
>> and xtest as part of the Intel Transactional Synchronization Extensions
>> (TSX). The xbegin and xend instructions enclose a set of instructions
>> to be executed as a transaction. If no conflict is found during execution
>> of the transaction, the memory and register modifications are
>> committed together at xend. The xabort instruction can be used to
>> explicitly abort a transaction, and xtest checks whether we are in a
>> transaction.
>>
>> RTM is useful for highly contended locks with low conflict in the
>> critical region. Highly contended locks don't scale well otherwise, but
>> with RTM they show good scaling. RTM allows applications to use
>> coarse-grained locking. Also, for lightly contended locks which are
>> used by different threads, RTM can reduce cache line ping-pong and
>> thereby show a performance improvement too.
>>
>> Implementation:
>>
>> Generate RTM locking code for all inflated locks when the "UseRTMLocking"
>> option is on, with normal locking as the fallback mechanism. On abort or
>> lock busy the lock will be retried a fixed number of times, as specified
>> by the "RTMRetryCount" option. The locks which abort too often
>> can be auto-tuned or tuned manually.
>>
>> Auto-tuning can be done using the "UseRTMDeopt" flag, which adds
>> abort ratio calculation code for each lock. The abort ratio will be
>> calculated after "RTMAbortThreshold" aborts are encountered.
>> With "UseRTMDeopt", if the abort ratio reaches "RTMAbortRatio" the
>> nmethod containing the lock will be deoptimized and recompiled with
>> all locks as normal (stack) locks. If the abort ratio remains low
>> after "RTMLockingThreshold" attempted locks, then the
>> method will be deoptimized and recompiled with all locks as RTM locks
>> without abort ratio calculation code. The abort ratio calculation can
>> be delayed by specifying the -XX:RTMLockingCalculationDelay=<millisec> flag.
>> Deoptimization of the nmethod is done by adding an uncommon trap at the
>> beginning of the code which checks the rtm state field in the MDO; that
>> field is modified by the abort calculation code.
>>
>> For manual tuning, the abort statistics for each lock can be provided
>> to the user via the "PrintPreciseRTMLockingStatistics" diagnostic flag.
>> Based on the abort statistics, users can create a .hotspot_compiler
>> file or use the -XX:CompileCommand=<option> flag to specify the
>> methods for which RTM locking should be disabled (<option>
>> "NoRTMLockEliding") or always enabled (<option> "UseRTMLockEliding").
>>
>> The abort calculation and statistics collection are done using
>> RTMLockingCounters wrapped into RTMLockingNamedCounter counters, which
>> are generated for each lock. To reduce the burden on the cache line, the
>> RTM lock total counter is updated randomly at the RTMTotalCountIncrRate rate.
>>
>> Note, both auto and manual tuning are done for the whole method. There
>> is no mechanism to tune an individual lock.
>>
>> RTM locking can be used for normal (stack) locks by specifying the
>> "UseRTMForStackLocks" flag.
>>
>> RTM locking code requires that biased locking be switched off because
>> the two conflict. RTM locking is most useful when there is high
>> lock contention and low data contention. With high lock contention
>> the lock is usually inflated, and biased locking is not suitable for
>> that case anyway.
>>
>> It was requested that this code not affect other platforms. To that
>> end, most of the code is put under #if INCLUDE_RTM_OPT, which is
>> defined only for X86 with C2 and not for EMBEDDED.
>>
>> All new RTM flags are declared as experimental and require specifying
>> the "UnlockExperimentalVMOptions" flag.
>>
>>
>> SQE did full testing on these changes. Additional tests were developed.
>>
>> Thanks,
>> Vladimir
>

