RFR (XL) 8031320: Use Intel RTM instructions for locks
Igor Veresov
igor.veresov at oracle.com
Thu Mar 20 06:24:53 UTC 2014
Thanks! Looks good.
igor
On Mar 19, 2014, at 10:26 PM, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:
> Thank you, Dan
>
> I updated webrev.01 with your suggestions and Igor's latest ones. See my other mail describing the changes (mostly refactoring in macroAssembler_x86.cpp).
>
> http://cr.openjdk.java.net/~kvn/8031320_9/webrev.01/
>
> On 3/19/14 8:00 PM, Daniel D. Daugherty wrote:
>> On 3/17/14 1:11 PM, Vladimir Kozlov wrote:
>>> https://bugs.openjdk.java.net/browse/JDK-8031320
>>> http://cr.openjdk.java.net/~kvn/8031320_9/webrev/
>>
>> src/share/vm/runtime/rtmLocking.hpp
>> line 29: 29 // locks when "UseRTM" option ...
>> Is it "UseRTM" or "UseRTMLocking"?
>
> Fixed.
>
>>
>> line 56: // or use -XX:CompileCommnad=option,class::method,DoNotElide
>> typo: 'CompileCommnad' -> 'CompileCommand'
>
> Fixed.
>
>>
>> line 88: static uintx* rtm_calculation_flag() { return
>> &_calculation_flag; }
>> Should this be: rtm_calculation_flag_addr() to match other
>> accessors where we return the address of a field.
>
> Changed as you suggested.
>
>>
>> line 107: bool nonzero() { return (_abort_count+_total_count) > 0; }
>> nit: spaces around the '+' op
>
> Added spaces.
>
>>
>> src/cpu/x86/vm/rtmLocking.cpp
>> No comments.
>>
>> src/share/vm/utilities/globalDefinitions.hpp
>> Since code above this is conditional, I wonder why lines 384-389
>> aren't:
>>
>> 376 #if defined(X86) && defined(COMPILER2) && !defined(JAVASE_EMBEDDED)
>> <snip>
>> 380 #else
>> <snip>
>> 383 #endif
>> 384 // States of Restricted Transactional Memory usage.
>> 385 enum RTMState {
>> 386 NoRTM = 0x2, // Don't use RTM
>> 387 UseRTM = 0x1, // Use RTM
>> 388 ProfileRTM = 0x0 // Use RTM with abort ratio calculation
>> 389 };
>>
>> Update: Some of the code that uses these values is not
>> conditional.
>
> Yes, C2 uses it unconditionally to cache rtm_state. We are trying to keep the number of #ifdefs in C2 low.
>
>>
>> src/share/vm/runtime/arguments.cpp
>> The movement of this UseOptoBiasInlining option management
>> code appears to be generally useful and not just RTM related.
>> Is this something that should be fixed in older HSX versions?
>
> Yes, it should be fixed in 7u for embedded. These changes will be backported to 8u, so it will be fixed there.
>
>>
>> src/share/vm/runtime/deoptimization.hpp
>> No comments.
>>
>> src/share/vm/runtime/deoptimization.cpp
>> line 1579: if ((reason != Reason_rtm_state_change) && (trap_mdo !=
>> NULL) &&
>> line 1580: UseRTMDeopt && (nm->rtm_state() != ProfileRTM)) {
>> nit: seems like this is over parenthesized.
>
> No, we put parentheses in such cases: (==) && (!=).
>
>>
>> src/share/vm/runtime/java.cpp
>> line 268: if (PrintLockStatistics ||
>> PrintPreciseBiasedLockingStatistics || PrintPreciseRTMLockingStatistics) {
>> line 390: if (PrintPreciseBiasedLockingStatistics ||
>> PrintPreciseRTMLockingStatistics) {
>> Seems inconsistent not to use 'RTM_OPT_ONLY(...) here.
>
> PrintPreciseBiasedLockingStatistics is declared in c2_globals.hpp so it does not require RTM_OPT_ONLY(). And it is harmless.
>
>
>> src/share/vm/runtime/task.cpp
>> line: 108: _interval <= max_jint &&
>> Not obvious why this changed.
>
> I hit this assert if I specify RTMLockingCalculationDelay=11000, i.e. more than 10 sec. PeriodicTask::max_interval = 10000. I don't want to increase it because it is used for _intervalHistogram[max_interval] in the debug VM. Note, PeriodicTask correctly processes times larger than max_interval:
>
> if (ms >= PeriodicTask::max_interval) ms = PeriodicTask::max_interval - 1;
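The clamping behavior above can be sketched as follows (the constant mirrors PeriodicTask::max_interval; the helper name is illustrative, not HotSpot's):

```cpp
#include <cassert>

// Mirrors PeriodicTask::max_interval = 10000 ms (debug-VM histogram bound).
const int max_interval = 10000;

// Clamp an interval so it stays a valid _intervalHistogram index:
// any value at or above the maximum maps to max_interval - 1.
int clamp_interval(int ms) {
  if (ms >= max_interval) ms = max_interval - 1;
  return ms;
}
```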
>
>
>>
>> src/share/vm/runtime/thread.cpp
>> line 2: #include "runtime/rtmLocking.hpp"
>> Should this include be conditional (#ifdef'ed)?
>
> done
>
>>
>> src/cpu/x86/vm/globals_x86.hpp
>> Seems like the new options are also "available" in every X86
>> build config, but probably not usable. Perhaps these should
>> only be defined when INCLUDE_RTM_OPT is defined.
>>
>> Update: OK, now I see from code in vm_version_x86.cpp that you
>> want at least some of the options available to get better
>> diagnostics when someone tries to specify UseRTMLocking on a
>> machine that doesn't support the feature.
>>
>> src/cpu/x86/vm/vm_version_x86.hpp
>> line 450: if (_cpuid_info.sef_cpuid7_ebx.bits.rtm != 0)
>> line 451: result |= CPU_RTM;
>> Should the above lines be farther down in this block:
>>
>> line 463: // Intel features.
>> line 464: if(is_intel()) {
>
> AMD uses different bits for their ASF and different instructions.
> I don't think we need to put it under is_intel(). The rule is that if a bit is set, the instructions should be fully supported.
>
>>
>> src/cpu/x86/vm/vm_version_x86.cpp
>> line 601: if (!FLAG_IS_DEFAULT(UseRTMForStackLocks)) {
>> line 602: warning("UseRTMForStackLocks flag should be off
>> when UseRTMLocking flag is off");
>> line 603: }
>> So why does 'UseRTMForStackLocks' warrant a warning when used
>> without 'UseRTM'? 'UseRTMDeopt' and
>> 'PrintPreciseRTMLockingStatistics'
>> simply get set to false.
>
> Because UseRTMForStackLocks is an additional optimization, whereas UseRTMDeopt only controls additional code generation in the UseRTM code.
>
>>
>> line 975: UseBiasedLocking = false;
>> You use the FLAG_SET_DEFAULT macro elsewhere when resetting
>> flags so why not here? Have I missed some magic part of the
>> FLAG_FOO support macros?
>
> I am not sure what to use here. FLAG_SET_ERGO? We are sloppy about setting UseBiasedLocking in arguments.cpp too. So I will leave it for future cleanup.
>
>>
>> src/cpu/x86/vm/assembler_x86.hpp
>> line 1454: void pause();
>> This was a surprise; if it's the same as the pause() in my
>> contended locking project code from Dice, then I'll defer to
>> your work.
>
> It is a 'hint' (no-op) instruction to reduce power consumption in spin-loops:
>
> "The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops."
>
>>
>> src/cpu/x86/vm/assembler_x86.cpp
>> I presume someone else is checking the HEX instruction encodings
>> versus the data book.
>>
>> line 3014 }
>> line 3015 else
>> line 3016 {
>> nit: Most of the 'else' block styles in this file are:
>>
>> } else {
>
> Fixed.
>
>>
>> line 3017 abort.add_patch_at(code(), locator());
>> line 3018 emit_int8((unsigned char)0xC7);
>> line 3019 emit_int8((unsigned char)0xF8);
>> line 3020 emit_int32(0);
>> Trying to understand the above sequence. Is it basically
>> an "Xbegin" over no bytes? Or a no-op of sorts?
>
> It is code for a forward branch whose distance is unknown. The emit_int32(0) will be patched when bind(abort_label) is executed later. It is the normal pattern for forward branches.
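The patching scheme can be sketched like this (the buffer and method names are illustrative stand-ins for HotSpot's CodeBuffer/locator machinery):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch of forward-branch patching: emit a 32-bit displacement of 0,
// remember its position (the "locator"), and patch it once the label
// is bound and the target offset is known.
struct CodeBuf {
  std::vector<uint8_t> code;

  size_t emit_int8(uint8_t b) { code.push_back(b); return code.size() - 1; }

  // Returns the position of the placeholder so it can be patched later.
  size_t emit_int32(int32_t v) {
    size_t pos = code.size();
    for (int i = 0; i < 4; i++) code.push_back((v >> (8 * i)) & 0xff);
    return pos;
  }

  // Patch the placeholder with the displacement from the end of the
  // branch instruction to the bound label.
  void patch_branch(size_t disp_pos, size_t target) {
    int32_t disp = (int32_t)(target - (disp_pos + 4));
    std::memcpy(&code[disp_pos], &disp, 4);
  }
};
```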
>
>>
>> src/cpu/x86/vm/macroAssembler_x86.hpp
>> line 30: #include "runtime/rtmLocking.hpp"
>> Should this include be conditional (#ifdef'ed)?
>
> Can't do that because of the reason you pointed out below. I need the RTMLockingCounters class defined in all x86 cases.
>
>>
>> line 660: void fast_lock(Register obj, Register box, Register tmp,
>> line 661: Register scr, Register cx1, Register cx2,
>> line 662: BiasedLockingCounters* counters,
>> line 663: RTMLockingCounters* rtmcounters,
>> line 664: RTMLockingCounters* stackrtmcounters,
>> line 665: Metadata* method_data,
>> line 666: bool use_rtm, bool profile_rtm);
>> With the unconditional inclusion of RTM stuff in the
>> parameters, you have no choice but to include some of
>> the RTM data types in all X86 configs.
>>
>> Did you consider having two different version of the
>> function declaration?
>
> Too much trouble and additional code which would not help significantly.
>
>>
>> src/cpu/x86/vm/macroAssembler_x86.cpp
>> line 304: void MacroAssembler::movptr(Register dst, AddressLiteral
>> src, Register scratch) {
>> Perhaps reiterate that you're intentionally ignoring the
>> 'scratch' parameter for 32-bit.
>
> void MacroAssembler::movptr(Register dst, AddressLiteral src, Register scratch) {
> // scratch register is not used,
> // it is defined to match parameters of 64-bit version of this method.
>
>>
>> line 701: movq(dst, Address(scratch,0));
>> nit: space before '0'.
>
> done.
>
>>
>> line 1000: void MacroAssembler::atomic_incl()
>> It's not clear (to me anyway) why the pushf() and popf()
>> calls are no longer needed.
>>
>> Update: OK, I think I see it. The old code always called
>> pushf()/popf() even when 'rscratch1' wasn't used. Now
>> you pass in a scratch register when you need one.
>
> No, it is because we need to save flags in cond_inc32(), which is where I moved pushf() and popf(). Before, atomic_incl() was called only from cond_inc32().
>
>>
>> line 1684: testptr(tmpReg, markOopDesc::monitor_value); // inflated
>> vs stack-locked|neutral|biased
>> line 1944: movptr (boxReg, tmpReg);
>> These fixes are independent of RTM code.
>> Are these something that should be fixed in older HSX versions?
>
> No, it is a simple cleanup to use names instead of hardcoded values. The generated code does not change.
>
>>
>> line 1698: jcc(Assembler::equal, DONE_LABEL);
>> line 1708: jmp(DONE_LABEL);
>> line 2050: jcc (Assembler::zero, DONE_LABEL); // 0
>> indicates recursive stack-lock
>> line 2053: jcc (Assembler::zero, Stacked);
>> I presume these changed because when the RTM code is there
>> the DONE_LABEL is just too far away. Should we be doing:
>>
>> #if INCLUDE_RTM_OPT
>> <new_code>
>> #else
>> <old_code>
>> #endif
>>
>> or is the difference in code size/speed just not worth
>> the #if noise?
>
> Not worth it. It is more important to have code which we can understand easily than to save 3 bytes.
> Actually, in fast_unlock() (lines around 2050) the additional RTM code is small, so I can use jccb for the Stacked branch. DONE_LABEL is unfortunately far away.
>
>>
>> line 2050: jcc (Assembler::zero, DONE_LABEL); // 0
>> indicates recursive stack-lock
>> Changed from 'jccb' -> 'jcc' because of increased code
>> size (I'm guessing). However, the instruction also moved
>> from after this line to before it:
>>
>> line 2051: movptr(tmpReg, Address(objReg, 0)); // Examine the
>> object's markword
>>
>> which makes me think that at least part of this change
>> is applicable to older HSX.
>
> It is again not for correctness but to make the code easier to understand.
>
>>
>> line 2841: pushf(); // Preserve flags
>> line 2842: atomic_incl(counter_addr);
>> line 2843: popf();
>> I thought the 32-bit increment was changed to no longer
>> clobber the flags register (and the 64-bit increment was
>> changed to take a scratch parameter).
>
> No, the increment instruction on x86 always changes flags. Only the lea() instruction does not affect flags, but it is not used for this +1 increment.
>
>>
>> src/cpu/x86/vm/sharedRuntime_x86_32.cpp
>> No comments.
>>
>> src/cpu/x86/vm/sharedRuntime_x86_64.cpp
>> No comments.
>>
>>
>> Pausing my review at this point and sending what I have since the
>> remainder is C2 code and that'll take me a while to get through.
>
> Thank you, Dan, for the very detailed review.
>
> Vladimir
>
>>
>> Dan
>>
>>
>> src/cpu/x86/vm/x86_32.ad
>> src/cpu/x86/vm/x86_64.ad
>> src/share/vm/adlc/output_c.cpp
>> src/share/vm/ci/ciEnv.cpp
>> src/share/vm/ci/ciEnv.hpp
>> src/share/vm/ci/ciMethodData.hpp
>> src/share/vm/code/nmethod.cpp
>> src/share/vm/code/nmethod.hpp
>> src/share/vm/oops/method.cpp
>> src/share/vm/oops/methodData.cpp
>> src/share/vm/oops/methodData.hpp
>> src/share/vm/opto/c2_globals.hpp
>> src/share/vm/opto/classes.hpp
>> src/share/vm/opto/compile.cpp
>> src/share/vm/opto/compile.hpp
>> src/share/vm/opto/connode.hpp
>> src/share/vm/opto/graphKit.cpp
>> src/share/vm/opto/locknode.cpp
>> src/share/vm/opto/locknode.hpp
>> src/share/vm/opto/loopTransform.cpp
>> src/share/vm/opto/machnode.hpp
>> src/share/vm/opto/macro.cpp
>> src/share/vm/opto/macro.hpp
>> src/share/vm/opto/parse.hpp
>> src/share/vm/opto/parse1.cpp
>> src/share/vm/opto/runtime.cpp
>> src/share/vm/opto/runtime.hpp
>>
>>
>>
>>
>>
>>> The Intel architecture codenamed Haswell supports the RTM
>>> (Restricted Transactional Memory) instructions xbegin, xabort, xend,
>>> and xtest as part of the Intel Transactional Synchronization
>>> Extensions (TSX). The xbegin and xend instructions enclose a set of
>>> instructions to be executed as a transaction. If no conflict is
>>> found during execution of the transaction, the memory and register
>>> modifications are committed together at xend. The xabort instruction
>>> can be used to explicitly abort a transaction, and xtest checks
>>> whether we are currently in a transaction.
>>>
>>> RTM is useful for highly contended locks with low conflict in the
>>> critical region. Highly contended locks don't scale well otherwise,
>>> but with RTM they show good scaling, which allows applications to
>>> use coarse-grained locking. Also, for lightly contended locks that
>>> are used by different threads, RTM can reduce cache-line ping-pong
>>> and thereby improve performance too.
>>>
>>> Implementation:
>>>
>>> Generate RTM locking code for all inflated locks when the
>>> "UseRTMLocking" option is on, with normal locking as the fallback
>>> mechanism. On abort or when the lock is busy, locking is retried a
>>> fixed number of times, as specified by the "RTMRetryCount" option.
>>> Locks which abort too often can be auto-tuned or manually tuned.
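The retry-then-fallback policy can be sketched as below. The real code executes xbegin/xend; here try_rtm is a hypothetical stand-in that reports whether a transactional attempt committed, and the retry count mirrors the intent of -XX:RTMRetryCount:

```cpp
#include <cassert>
#include <functional>

const int RTMRetryCount = 5;  // illustrative; the flag controls retries

// Returns true if the lock was taken via an RTM transaction, false if we
// fell back to normal (stack/inflated) locking after exhausting retries.
bool lock_with_rtm(const std::function<bool()>& try_rtm,
                   const std::function<void()>& normal_lock) {
  for (int i = 0; i <= RTMRetryCount; i++) {
    if (try_rtm()) return true;   // transaction committed at xend
  }
  normal_lock();                  // fallback path
  return false;
}
```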
>>>
>>> Auto-tuning can be done using the "UseRTMDeopt" flag, which adds
>>> abort ratio calculation code for each lock. The abort ratio is
>>> calculated after "RTMAbortThreshold" aborts are encountered.
>>> With "UseRTMDeopt", if the abort ratio reaches "RTMAbortRatio" the
>>> nmethod containing the lock will be deoptimized and recompiled with
>>> all locks as normal (stack) locks. If the abort ratio instead
>>> remains low after "RTMLockingThreshold" attempted locks, the method
>>> will be deoptimized and recompiled with all locks as RTM locks
>>> without abort ratio calculation code. The abort ratio calculation
>>> can be delayed by specifying the
>>> -XX:RTMLockingCalculationDelay=<millisec> flag.
>>> Deoptimization of the nmethod is done by adding an uncommon trap at
>>> the beginning of the code which checks the RTM state field in the
>>> MDO, which is modified by the abort calculation code.
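The state-selection policy above can be sketched as follows. The enum values match the RTMState enum quoted earlier in this review; the threshold constants are illustrative placeholders for the corresponding flags, not their actual defaults:

```cpp
#include <cassert>
#include <cstdint>

enum RTMState { NoRTM = 0x2, UseRTM = 0x1, ProfileRTM = 0x0 };

const uint64_t RTMAbortRatio       = 50;     // percent, illustrative
const uint64_t RTMAbortThreshold   = 1000;   // aborts before calculating
const uint64_t RTMLockingThreshold = 10000;  // locks before committing

// Decide the next RTM state for a method from its profiled counters.
RTMState next_rtm_state(uint64_t aborts, uint64_t total_locks) {
  if (aborts >= RTMAbortThreshold &&
      aborts * 100 / total_locks >= RTMAbortRatio) {
    return NoRTM;       // deoptimize: recompile with normal locks
  }
  if (total_locks >= RTMLockingThreshold) {
    return UseRTM;      // recompile with RTM locks, no profiling code
  }
  return ProfileRTM;    // keep collecting abort statistics
}
```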
>>>
>>> For manual tuning, the abort statistics for each lock can be shown
>>> to the user with the "PrintPreciseRTMLockingStatistics" diagnostic
>>> flag. Based on the abort statistics, users can create a
>>> .hotspot_compiler file or use the -XX:CompileCommand=<option> flag
>>> to specify the methods for which RTM locking should be disabled,
>>> using <option> "NoRTMLockEliding", or always enabled, using
>>> <option> "UseRTMLockEliding".
>>>
>>> The abort calculation and statistics collection are done using
>>> RTMLockingCounters wrapped into RTMLockingNamedCounter counters,
>>> which are generated for each lock. To reduce pressure on the cache
>>> line, the RTM lock total counter is updated randomly, at a rate
>>> controlled by RTMTotalCountIncrRate.
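The sampled-counter idea can be sketched as below. The rate constant and the xorshift generator are illustrative, not the mechanism HotSpot uses; the point is that a shared counter touched once per ~N events causes far less cache-line traffic than one touched on every event:

```cpp
#include <cassert>
#include <cstdint>

const uint32_t RTMTotalCountIncrRate = 64;  // power of two, illustrative

struct SampledCounter {
  uint64_t total = 0;
  uint32_t rng = 0x9e3779b9;  // xorshift32 state, seed is arbitrary

  // Called on every lock event; bumps the shared total only about once
  // per RTMTotalCountIncrRate events, scaling to keep the estimate fair.
  void on_lock_event() {
    rng ^= rng << 13; rng ^= rng >> 17; rng ^= rng << 5;
    if ((rng & (RTMTotalCountIncrRate - 1)) == 0) {
      total += RTMTotalCountIncrRate;
    }
  }
};
```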
>>>
>>> Note, both auto and manual tuning are done for the whole method.
>>> There is no mechanism to tune an individual lock.
>>>
>>> RTM locking can also be used for normal (stack) locks by specifying
>>> the "UseRTMForStackLocks" flag.
>>>
>>> RTM locking code requires that biased locking be switched off
>>> because the two conflict. RTM locking is most useful when there is
>>> high lock contention and low data contention. With high lock
>>> contention the lock is usually inflated, and biased locking is not
>>> suitable for that case anyway.
>>>
>>> It was requested that this code not affect other platforms. For
>>> that reason, most of the code is put under #if INCLUDE_RTM_OPT,
>>> which is defined only for X86 with C2 and not for EMBEDDED.
>>>
>>> All new RTM flags are declared as experimental and require the
>>> "UnlockExperimentalVMOptions" flag.
>>>
>>>
>>> SQE did full testing on these changes. Additional tests were developed.
>>>
>>> Thanks,
>>> Vladimir