RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

David Holmes david.holmes at oracle.com
Fri Nov 27 07:26:25 UTC 2015


Hi Thomas,

On 26/11/2015 7:56 PM, Thomas Stüfe wrote:
>
> On Thu, Nov 26, 2015 at 12:51 AM, David Holmes <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>> wrote:
>
>     On 26/11/2015 9:08 AM, Daniel D. Daugherty wrote:
>
>         If I understand the problem... :-)
>
>         The Win-X64 ABI requires that some extra space be allocated on
>         the stack in the caller for "homing" four registers...
>
>         This code uses MacroAssembler::call() which is a lower level
>         call() function that assumes you know the ABI...
>
>         9765 void MacroAssembler::get_thread(Register thread) {
>         <snip>
>         9784   call(RuntimeAddress(CAST_FROM_FN_PTR(address,
>         Thread::current)));
>
>     So the failing here was nothing to do with register saving per-se,
>     nor the alignment code, but "simply" not also saving the extra space
>     prior to the call!
>
> Yes. These 32 byte are a scratch area for the callee, which may be used
> for parameter registers or anything else. We may just have been lucky so
> far that TlsGetValue() did not use this area. As Bertrand already said,
> now we call Thread::current(), which does magic tls accesses under the
> hood, so this must be tested again.

Right. All testing so far has been successful but ...

> However I think that a simple addition to your current fix could simply
> be subtracting 32 from rsp before the call:
>
> subq(rsp, 32)
> call(RuntimeAddress(CAST_FROM_FN_PTR(address, Thread::current)));
> addq(rsp, 32)

... I decided to do what was previously suggested and use:

   call_VM_leaf_base(CAST_FROM_FN_PTR(address, Thread::current), 0);

as it takes care of everything (stack alignment if needed and preserving 
space for the register args on win64).

> Also, we could probably skip saving/restoring rdi and rsi on Windows:

Maybe - but we don't seem to consider that anywhere else ie:

void MacroAssembler::push_callee_saved_registers() {
   push(rsi);
   push(rdi);
   push(rdx);
   push(rcx);
}

so I don't want to be the trail-blazer here. :)

Updated webrev:

  http://cr.openjdk.java.net/~dholmes/8132510/webrev.v10

Only change to macroAssembler_x86.cpp. Re-tested on Windows 32-bit and 
64-bit.

Thanks,
David

> https://msdn.microsoft.com/en-us/library/6t169e9c.aspx
>
> "The registers RAX, RCX, RDX, R8, R9, R10, R11 are considered volatile
> and must be considered destroyed on function calls (unless otherwise
> safety-provable by analysis such as whole program optimization). The
> registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are considered
> nonvolatile and must be saved and restored by a function that uses them."



> Best Regards, Thomas
>
>
>         If you switch to MacroAssembler::call_VM_leaf_base() it handles
>         more of the gory details for you than MacroAssembler::call():
>
>
>     Wow! Not exactly a clearly-defined name for this. AFAICS there's
>     nothing VM or leaf specific about this. Also the MacroAssembler is
>     supposed to be a platform-agnostic higher-level abstraction that is
>     easy to use - but in this case every use of the plain call on
>     Windows-x64 is potentially broken if the programmer was not aware of
>     this ABI constraint which only applies to Windows (and we're in
>     OS-neutral files!).
>
>     This seems fundamentally broken to me.
>
>     Thanks for the info.
>
>     David
>
>
>         src/cpu/x86/vm/macroAssembler_x86.cpp:
>
>              524  void MacroAssembler::call_VM_leaf_base(address
>         entry_point, int
>         num_args
>         ) {
>              525    Label L, E;
>              526
>              527  #ifdef _WIN64
>              528    // Windows always allocates space for it's register args
>              529    assert(num_args <= 4, "only register arguments
>         supported");
>              530    subq(rsp,  frame::arg_reg_save_area_bytes);
>              531  #endif
>              532
>              533    // Align stack if necessary
>              534    testl(rsp, 15);
>              535    jcc(Assembler::zero, L);
>              536
>              537    subq(rsp, 8);
>              538    {
>              539      call(RuntimeAddress(entry_point));
>              540    }
>
>         So call_VM_leaf_base() allocates the "homing" space for register
>         args for Win-X64 on L530 and it handles any stack alignment before
>         calling MacroAssembler::call().
>
>         So if you switch to call_VM_leaf_base() you can drop
>         the stack alignment code:
>
>         9777   // ensure stack alignment
>         9778   mov(r10, rsp);
>         9779   andq(rsp, -16);
>         9780   push(r10);
>
>         and just switch to:
>
>         9777   push(rsp);
>
>         And you won't be exposed to potential saved register
>         overwrite by the "homing" of any registers in the call
>         Thread::current().
>
>         I _think_ I have that all correct... :-)
>
>         Dan
>
>
>         On 11/25/15 3:52 PM, David Holmes wrote:
>
>             On 26/11/2015 2:53 AM, Thomas Stüfe wrote:
>
>                 I created a bug to track this:
>
>                 https://bugs.openjdk.java.net/browse/JDK-8144059
>
>
>             Thanks. I admit I still don't really grok the exact problem
>             here. It
>             is the alignment code that is the problem right?
>
>             79 mov(r10, rsp);
>             80 andq(rsp, -16);
>             81 push(r10);
>
>             ?
>
>             Thanks,
>             David
>
>                 Because Davids change is not submitted yet, the bug
>                 refers to the old
>                 coding, but stays relevant also after this fix.
>
>                 Kind Regards, Thomas
>
>                 On Wed, Nov 25, 2015 at 3:52 PM, Daniel D. Daugherty
>                 <daniel.daugherty at oracle.com
>                 <mailto:daniel.daugherty at oracle.com>
>                 <mailto:daniel.daugherty at oracle.com
>                 <mailto:daniel.daugherty at oracle.com>>>
>                 wrote:
>
>                      On 11/25/15 6:10 AM, David Holmes wrote:
>
>                          On 25/11/2015 8:46 PM, Thomas Stüfe wrote:
>
>
>                              On Wed, Nov 25, 2015 at 10:22 AM, Bertrand
>                 Delsart
>                              <bertrand.delsart at oracle.com
>                 <mailto:bertrand.delsart at oracle.com>
>                              <mailto:bertrand.delsart at oracle.com
>                 <mailto:bertrand.delsart at oracle.com>>
>                              <mailto:bertrand.delsart at oracle.com
>                 <mailto:bertrand.delsart at oracle.com>
>                              <mailto:bertrand.delsart at oracle.com
>                 <mailto:bertrand.delsart at oracle.com>>>> wrote:
>
>                                   Good finding Thomas,
>
>                                   On 25/11/2015 09:40, Thomas Stüfe wrote:
>
>                                       Hi David, Bertrand,
>
>                                       about
>                 http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>
>
>                                       I am not sure this is correct for
>                 Windows x64.
>
>
>                                   Unless I'm misreading it, the new code
>                 looks a lot like
>                              the old one
>                                   (including the alignement part). Hence
>                 this does not
>                              look like a
>                                   regression caused by David's change.
>
>
>                              Yes, this seems like an old bug.
>
>                              I also see that this is fixed in our
>                 codebase since 2010,
>                              exactly the
>                              way you propose (with MacroAssembler
>                 <http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot>::call_VM_leaf_base
>
>
>                 <http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot>).
>
>
>                              So, it is already used since five years. I
>                 would propose
>                              opening a
>                              separate bug for this one and we contribute
>                 our fix.
>
>
>                          That sounds good to me too. I also suspect
>                 additional code will
>                          need to be checked as this stack alignment is
>                 performed in a few
>                          places in ./cpu/x86/vm/macroAssembler_x86.cpp
>                 with no OS
>                          specific differences being considered.
>
>
>                      Looks like MacroAssembler::call_VM_leaf_base() does
>                 the "right
>                 thing"
>                      for stack alignment in the X64 case. Perhaps we
>                 need to audit both
>                      stack alignment _and_ uses of
>                 MacroAssembler::call() just to make
>                      sure...
>
>                      Did I mention that all the assembly code stuff
>                 gives me a
>                 headache? :-)
>
>                      Dan
>
>
>
>
>                          Thanks,
>                          David
>
>                              Regards, Thomas
>
>
>                 https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>
>                                       windows x64 Abi requires the
>                 register parameter
>                              area right above the
>                                       return address on the stack. It is not
>                 guaranteed to be
>                                       preserved during
>                                       the function call:
>
>                                       "Even if the called function has
>                 fewer than 4
>                              parameters, these
>                                       4 stack
>                                       locations are effectively owned by
>                 the called
>                              function, and may
>                                       be used
>                                       by the called function for other
>                 purposes besides
>                              saving parameter
>                                       register values. Thus the caller
>                 may not save
>                              information in
>                                       this region
>                                       of stack across a function call."
>
>                                       So, restoring parameters from it
>                 may not work if
>                                       Thread::current() uses
>                                       this area for any reason. In this
>                 case, r9, rsp,
>                              r10, r11 may be
>                                       overwritten by the callee.
>
>                                       I also think Windows x64 ABI is
>                 different enough
>                              from Linux that
>                                       this
>                                       function could live somewhere in
>                 <os>_<cpu>.
>
>
>                                   This is a valid point. However, note
>                 that this is
>                              similar to the
>                                   issue we have for instance in 'shared'
>                                   MacroAssembler::call_VM_leaf_base,
>                 which was not put in
>                              <os>_<cpu>
>                                   files.
>
>                                   In fact, the right fix may be to use
>                 call_VM_leaf
>                              instead of call :-)
>
>                                   (alternatively we could use "#ifdef
>                 _WIN64" or check
>                                   "frame::arg_reg_save_area_bytes != 0",
>                 which may be
>                              even cleaner and
>                                   more portable)
>
>                                   All these solutions are fine for me.
>
>
>                                   Since we are near to feature freeze,
>                 this should
>                              probably be done in
>                                   a separate CR (unless David has enough
>                 time to test
>                 it).
>
>                                   Regards,
>
>                                   Bertrand.
>
>
>                                       Kind Regards, Thomas
>
>
>
>
>                                       On Mon, Nov 23, 2015 at 10:42 PM,
>                 David Holmes
>                                       <david.holmes at oracle.com
>                 <mailto:david.holmes at oracle.com>
>                              <mailto:david.holmes at oracle.com
>                 <mailto:david.holmes at oracle.com>>
>                              <mailto:david.holmes at oracle.com
>                 <mailto:david.holmes at oracle.com>
>                              <mailto:david.holmes at oracle.com
>                 <mailto:david.holmes at oracle.com>>>
>                                       <mailto:david.holmes at oracle.com
>                 <mailto:david.holmes at oracle.com>
>                              <mailto:david.holmes at oracle.com
>                 <mailto:david.holmes at oracle.com>>
>                                       <mailto:david.holmes at oracle.com
>                 <mailto:david.holmes at oracle.com>
>                              <mailto:david.holmes at oracle.com
>                 <mailto:david.holmes at oracle.com>>>>> wrote:
>
>                                            On 23/11/2015 11:46 PM,
>                 Bertrand Delsart
>                 wrote:
>
>                                                On 23/11/2015 13:27,
>                 David Holmes wrote:
>
>                                                    Hi Bertrand,
>
>                                                    On 23/11/2015 8:14
>                 PM, Bertrand
>                              Delsart wrote:
>
>                                                        Hi David,
>
>                                                        Overall looks good.
>                                                        Thanks for the
>                 #ifndef
>                                       USE_LIBRARY_BASED_TLS_ONLY :-)
>
>                                                        One doubt, in
>                 case this has not
>                              been discussed
>                                       before.
>
>                                                        I'm still
>                 catching up on x86_64
>                              (after years of ARM
>                                                        assembly :-)) but it
>                                                        seems that there
>                 are some stack
>                              alignment
>                                       constraints on
>                                                        runtime calls,
>                                                        at least for some
>                 x86_64 ABIs.
>
>                                                        Some of the x86
>                              MacroAssembler::get_thread
>                                                        implementations
>                 had code to
>                                                        align the stack
>                 before calling
>                                       pthread_getspecific. See
>                                                        for instance
>                 http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>
>
>
>
>                                                    Sorry I'm not that
>                 familiar with the
>                              MacroAssembler
>                                       - is it
>                                                    this odd
>                                                    fragment:
>
>                                                    -   push(r10);
>                                                    -   // XXX
>                                                    -   mov(r10, rsp);
>                                                    -   andq(rsp, -16);
>                                                    -   push(r10);
>
>                                                    I'm not at all clear
>                 what that is
>                              doing - and if it
>                                       somehow
>                                                    changes the
>                                                    alignment wouldn't
>                 something need to
>                              be fixed up when
>                                                    popping the
>                                                    previous values ??
>
>                                                    To be honest I'm not
>                 even sure what an
>                              "unaligned
>                                       stack" means.
>
>
>                                                On some platforms, SP may
>                 have to be
>                              aligned on a 16 bytes
>                                                boundary when
>                                                calling another method.
>
>
>                                            Okay - that seems to be an
>                 x64 ABI requirement
>                              based on
>                                       other code
>                                            in MacroAssembler. The reason
>                 it is missing in
>                              the Solaris
>                                       code is
>                                            because I already removed it
>                 in the earlier
>                              changes that
>                                       were done
>                                            on Solaris:
>
>                 http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>
>                                            and nobody picked it up. :(
>
>                                            That said, the absence of
>                 this code has not
>                              caused any
>                                       problems so
>                                            presumably we are normally
>                 aligned.
>
>                                                After having pushed what
>                 needed to be
>                              saved, the code above
>                                                rounds SP
>                                                down and saves the old
>                 value. It will then
>                              also push
>                                       r11, which
>                                                results
>                                                in the expected alignment.
>
>                                                The conterpart, after the
>                 VM call, is the:
>                                                   pop(r11);
>                                                   pop(rsp);
>
>
>                                            I had assumed this extra
>                 manipulation was
>                              related to
>                                       pushing the arg
>                                            for the call.
>
>                                                        This alignment is
>                 no longer
>                              performed in the new
>                                                        (shared)
>                 implementation
>                 http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>
>
>
>
>
>                                                        Now, Solaris was
>                 not performing
>                              the alignment and
>                                                        Windows has a
>                 separate
>                                                        path for x86_64.
>                 Did we really
>                              need the
>                                       alignment for
>                                                        linux x86_64 and
>                                                        bsd_x86_64 ?
>                 Might it be needed
>                              for other ports ?
>
>                                                        IMHO, it might be
>                 safer to align
>                              the stack by
>                                       default,
>                                                        knowing it should
>                                                        not be expensive
>                 since we call
>                              get_thread
>                                       rarely for
>                                                        x86_64 (result is
>                                                        cached in r15).
>                 I'll let you see
>                              whether it is
>                                       worth
>                                                        adding an ifdef so
>                                                        as to explicitly
>                 deactivate the
>                              alignment on
>                                       some platforms
>                                                        (solaris_x86_64 ?)
>
>
>                                                    I don't have enough
>                 knowledge to even
>                              begin to
>                                       comment on
>                                                    this. I will
>                                                    have to rely on our
>                 x86_64
>                              macroassembler experts
>                                       to explain
>                                                    the old
>                                                    code and what the
>                 requirements are.
>
>
>                                                OK. This means that the
>                 alignment was not
>                              removed
>                                       purposefully.
>
>                                                In that case, you must
>                 either keep the per
>                              platform
>                                       code x86_64
>                                                (since
>                                                it was not identical for
>                 all platforms) or
>                              use the safest
>                                                version, with
>                                                the additional
>
>                                                     // XXX
>                                                     mov(r10, rsp);
>                                                     andq(rsp, -16);
>                                                     push(r10);
>
>                                                before the push(r11) and
>                 with the
>
>                                                     pop(rsp);
>
>                                                after the pop(r11). It
>                 should work on all
>                              x86_64 platforms.
>
>
>                                            Right, I will add this back
>                 to the code - and
>                              replace the
>                                            meaningless // XXX with an
>                 actual comment!
>
>                                            Many thanks,
>                                            David
>                                            -----
>
>
>                                                FYI, there is another way
>                 to do the
>                              alignment, based on
>                                       the fact
>                                                that
>                                                we are at least aligned
>                 on 8 bytes (see
>                 MacroAssembler::call_VM_leaf_base). I
>                              assume that this
>                                       second
>                                                version is
>                                                more efficient
>                 (particularly thanks to
>                              specilative
>                                       branches) but it
>                                                might be safer/simpler to
>                 continue using
>                              andq in
>                                       get_thread.
>
>                                                Bertrand.
>
>
>                                                    Thanks,
>                                                    David
>
>                                                        Regards,
>
>                                                        Bertrand.
>
>
>
>                                                        On 23/11/2015
>                 08:03, David Holmes
>                              wrote:
>
>                                                            After all the
>                 preliminary
>                              discussions here
>                                       are final
>                                                            proposed changes:
>
>                                                            bug:
>                 https://bugs.openjdk.java.net/browse/JDK-8132510
>
>                                                            Open webrev:
>                 http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>
>                                                            A simple (in
>                 principle) but
>                              wide-ranging change
>                                                            which should
>                 appeal to
>                                                            our Code
>                 Deletion Engineer's.
>                              We implement
>
>                 Thread::current() using a
>                 compiler/language-based
>                              thread-local
>                                       variable eg:
>
>                                                               static
>                 __thread Thread
>                              *_thr_current;
>
>                                                               inline Thread*
>                              Thread::current() {
>                                                                 return
>                 _thr_current;
>                                                               }
>
>                                                            with an
>                 appropriate setter of
>                              course. By
>                                       doing this
>                                                            we can completely
>                                                            remove the
>                 os_cpu-specific
>                              ThreadLocalStorage
>
>                 implementations, and the
>                                                            associated
>                              os::thread_local_storage* calls.
>
>                                                            As __thread
>                 is not
>                              async-signal-safe (either in
>                                                            theory or
>                 practice) we
>                                                            need a
>                 fallback library-based
>                              solution as used
>                                                            today. For
>                 this we use a
>                                                            very simple
>                 ThreadLocalStorage
>                              class and an
>
>                 implementation thereof for
>                                                            POSIX (covers
>                 AIX, BSD, Linux,
>                              OSX and
>                                       Solaris) using
>                 pthread_get/setspecific; and
>                              one for
>                                       Windows using
>                                                            its TLS library.
>                                                            While these
>                 library routines
>                              are not guaranteed
>
>                 async-signal-safe, they
>                                                            seem to be in
>                 practice and are
>                              what we have
>                                       been
>                                                            using all along.
>
>                                                            We also allow
>                 for use of
>                 only the
>                                       library-based TLS
>                                                            for platforms
>                 where
>
>                 compiler-based thread locals
>                              are not
>                                       supported (as
>                                                            will be needed in
>                                                            the
>                                                            Mobile
>                 project). This is
>                              enabled at build
>                                       time by
>                                                            defining
>                 USE_LIBRARY_BASED_TLS_ONLY.
>
>                                                            Thanks to
>                 Andrew Haley for
>                              providing the
>                                       Aarch64
>                                                            code; and for
>                 Thomas
>                                                            Stuefe for
>                 testing PPC and
>                 AIX.
>
>                                                            Testing:
>                                                               - JPRT
>                 (core platforms)
>                                                               - Jtreg
>                 tests (linux &
>                 solaris)
>                                                               -
>                 vm.runtime (core
>                 platforms)
>
>                                                            Performance:
>                                                               - still
>                 TBD - this is
>                              proving to be
>                                       extremely
>                                                            difficult. If
>                 anyone
>                                                            has
>                                                            any simple to run
>                              microbenchmarks to
>                                       suggest I'd
>                                                            give them a
>                 try as a
>                                                            sanity check.
>                 But I lack
>                              access to hardware for
>                                                            running serious
>                                                            benchmarking.
>
>                                                            Footprint:
>                                                            - varies by
>                 platform and the
>                              VM (server,
>                                       client,
>                                                            minimal)
>                                                            - worst-case:
>                 ~1% increase for
>                              server VM
>                                       and minimal VM
>                                                            - best-case:
>                 0.4% decrease
>                              for client VM
>
>                                                            Thanks,
>                                                            David
>
>
>
>
>
>
>
>
>                                   --
>                                   Bertrand Delsart,
>                   Grenoble
>                              Engineering Center
>                                   Oracle,         180 av. de l'Europe,
>                 ZIRST de
>                              Montbonnot
>                                   38330 Montbonnot Saint Martin, FRANCE
>                 bertrand.delsart at oracle.com
>                 <mailto:bertrand.delsart at oracle.com>
>                              <mailto:bertrand.delsart at oracle.com
>                 <mailto:bertrand.delsart at oracle.com>>
>                              <mailto:bertrand.delsart at oracle.com
>                 <mailto:bertrand.delsart at oracle.com>
>                              <mailto:bertrand.delsart at oracle.com
>                 <mailto:bertrand.delsart at oracle.com>>>
>                                             Phone : +33 4 76 18 81 23
>
>                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>                                   NOTICE: This email message is for the
>                 sole use of the
>                              intended
>                                   recipient(s) and may contain
>                 confidential and
>                 privileged
>                                   information. Any unauthorized review, use,
>                 disclosure or
>                                   distribution is prohibited. If you are
>                 not the intended
>                              recipient,
>                                   please contact the sender by reply
>                 email and destroy
>                              all copies of
>                                   the original message.
>                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>
>
>
>


More information about the hotspot-dev mailing list