RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

Wed Nov 25 23:08:17 UTC 2015

If I understand the problem... :-)

The Win-X64 ABI requires that some extra space be allocated on
the stack in the caller for "homing" four registers...

This code uses MacroAssembler::call() which is a lower level
call() function that assumes you know the ABI...

9765 void MacroAssembler::get_thread(Register thread) {
<snip>
9784   call(RuntimeAddress(CAST_FROM_FN_PTR(address, Thread::current)));

If you switch to MacroAssembler::call_VM_leaf_base() it handles
more of the gory details for you than MacroAssembler::call():

src/cpu/x86/vm/macroAssembler_x86.cpp:

    524  void MacroAssembler::call_VM_leaf_base(address entry_point, int 
num_args
) {
    525    Label L, E;
    526
    527  #ifdef _WIN64
    528    // Windows always allocates space for it's register args
    529    assert(num_args <= 4, "only register arguments supported");
    530    subq(rsp,  frame::arg_reg_save_area_bytes);
    531  #endif
    532
    533    // Align stack if necessary
    534    testl(rsp, 15);
    535    jcc(Assembler::zero, L);
    536
    537    subq(rsp, 8);
    538    {
    539      call(RuntimeAddress(entry_point));
    540    }

So call_VM_leaf_base() allocates the "homing" space for register
args for Win-X64 on L530 and it handles any stack alignment before
calling MacroAssembler::call().

So if you switch to call_VM_leaf_base() you can drop
the stack alignment code:

9777   // ensure stack alignment
9778   mov(r10, rsp);
9779   andq(rsp, -16);
9780   push(r10);

and just switch to:

9777   push(rsp);

And you won't be exposed to potential saved register
overwrite by the "homing" of any registers in the call
Thread::current().

I _think_ I have that all correct... :-)

Dan

On 11/25/15 3:52 PM, David Holmes wrote:
> On 26/11/2015 2:53 AM, Thomas Stüfe wrote:
>> I created a bug to track this:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8144059
>
> Thanks. I admit I still don't really grok the exact problem here. It 
> is the alignment code that is the problem right?
>
> 79 mov(r10, rsp);
> 80 andq(rsp, -16);
> 81 push(r10);
>
> ?
>
> Thanks,
> David
>
>> Because Davids change is not submitted yet, the bug refers to the old
>> coding, but stays relevant also after this fix.
>>
>> Kind Regards, Thomas
>>
>> On Wed, Nov 25, 2015 at 3:52 PM, Daniel D. Daugherty
>> <daniel.daugherty at oracle.com <mailto:daniel.daugherty at oracle.com>> 
>> wrote:
>>
>>     On 11/25/15 6:10 AM, David Holmes wrote:
>>
>>         On 25/11/2015 8:46 PM, Thomas Stüfe wrote:
>>
>>
>>             On Wed, Nov 25, 2015 at 10:22 AM, Bertrand Delsart
>>             <bertrand.delsart at oracle.com
>>             <mailto:bertrand.delsart at oracle.com>
>>             <mailto:bertrand.delsart at oracle.com
>>             <mailto:bertrand.delsart at oracle.com>>> wrote:
>>
>>                  Good finding Thomas,
>>
>>                  On 25/11/2015 09:40, Thomas Stüfe wrote:
>>
>>                      Hi David, Bertrand,
>>
>>                      about
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>>
>>                      I am not sure this is correct for Windows x64.
>>
>>
>>                  Unless I'm misreading it, the new code looks a lot like
>>             the old one
>>                  (including the alignement part). Hence this does not
>>             look like a
>>                  regression caused by David's change.
>>
>>
>>             Yes, this seems like an old bug.
>>
>>             I also see that this is fixed in our codebase since 2010,
>>             exactly the
>>             way you propose (with MacroAssembler
>> <http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot>::call_VM_leaf_base
>>
>> <http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot>).
>>
>>             So, it is already used since five years. I would propose
>>             opening a
>>             separate bug for this one and we contribute our fix.
>>
>>
>>         That sounds good to me too. I also suspect additional code will
>>         need to be checked as this stack alignment is performed in a few
>>         places in ./cpu/x86/vm/macroAssembler_x86.cpp with no OS
>>         specific differences being considered.
>>
>>
>>     Looks like MacroAssembler::call_VM_leaf_base() does the "right 
>> thing"
>>     for stack alignment in the X64 case. Perhaps we need to audit both
>>     stack alignment _and_ uses of MacroAssembler::call() just to make
>>     sure...
>>
>>     Did I mention that all the assembly code stuff gives me a 
>> headache? :-)
>>
>>     Dan
>>
>>
>>
>>
>>         Thanks,
>>         David
>>
>>             Regards, Thomas
>>
>>
>> https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>>
>>                      windows x64 Abi requires the register parameter
>>             area right above the
>>                      return address on the stack. It is not 
>> guaranteed to be
>>                      preserved during
>>                      the function call:
>>
>>                      "Even if the called function has fewer than 4
>>             parameters, these
>>                      4 stack
>>                      locations are effectively owned by the called
>>             function, and may
>>                      be used
>>                      by the called function for other purposes besides
>>             saving parameter
>>                      register values. Thus the caller may not save
>>             information in
>>                      this region
>>                      of stack across a function call."
>>
>>                      So, restoring parameters from it may not work if
>>                      Thread::current() uses
>>                      this area for any reason. In this case, r9, rsp,
>>             r10, r11 may be
>>                      overwritten by the callee.
>>
>>                      I also think Windows x64 ABI is different enough
>>             from Linux that
>>                      this
>>                      function could live somewhere in <os>_<cpu>.
>>
>>
>>                  This is a valid point. However, note that this is
>>             similar to the
>>                  issue we have for instance in 'shared'
>>                  MacroAssembler::call_VM_leaf_base, which was not put in
>>             <os>_<cpu>
>>                  files.
>>
>>                  In fact, the right fix may be to use call_VM_leaf
>>             instead of call :-)
>>
>>                  (alternatively we could use "#ifdef _WIN64" or check
>>                  "frame::arg_reg_save_area_bytes != 0", which may be
>>             even cleaner and
>>                  more portable)
>>
>>                  All these solutions are fine for me.
>>
>>
>>                  Since we are near to feature freeze, this should
>>             probably be done in
>>                  a separate CR (unless David has enough time to test 
>> it).
>>
>>                  Regards,
>>
>>                  Bertrand.
>>
>>
>>                      Kind Regards, Thomas
>>
>>
>>
>>
>>                      On Mon, Nov 23, 2015 at 10:42 PM, David Holmes
>>                      <david.holmes at oracle.com
>>             <mailto:david.holmes at oracle.com>
>>             <mailto:david.holmes at oracle.com
>>             <mailto:david.holmes at oracle.com>>
>>                      <mailto:david.holmes at oracle.com
>>             <mailto:david.holmes at oracle.com>
>>                      <mailto:david.holmes at oracle.com
>>             <mailto:david.holmes at oracle.com>>>> wrote:
>>
>>                           On 23/11/2015 11:46 PM, Bertrand Delsart 
>> wrote:
>>
>>                               On 23/11/2015 13:27, David Holmes wrote:
>>
>>                                   Hi Bertrand,
>>
>>                                   On 23/11/2015 8:14 PM, Bertrand
>>             Delsart wrote:
>>
>>                                       Hi David,
>>
>>                                       Overall looks good.
>>                                       Thanks for the #ifndef
>>                      USE_LIBRARY_BASED_TLS_ONLY :-)
>>
>>                                       One doubt, in case this has not
>>             been discussed
>>                      before.
>>
>>                                       I'm still catching up on x86_64
>>             (after years of ARM
>>                                       assembly :-)) but it
>>                                       seems that there are some stack
>>             alignment
>>                      constraints on
>>                                       runtime calls,
>>                                       at least for some x86_64 ABIs.
>>
>>                                       Some of the x86
>>             MacroAssembler::get_thread
>>                                       implementations had code to
>>                                       align the stack before calling
>>                      pthread_getspecific. See
>>                                       for instance
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>
>>
>>
>>                                   Sorry I'm not that familiar with the
>>             MacroAssembler
>>                      - is it
>>                                   this odd
>>                                   fragment:
>>
>>                                   -   push(r10);
>>                                   -   // XXX
>>                                   -   mov(r10, rsp);
>>                                   -   andq(rsp, -16);
>>                                   -   push(r10);
>>
>>                                   I'm not at all clear what that is
>>             doing - and if it
>>                      somehow
>>                                   changes the
>>                                   alignment wouldn't something need to
>>             be fixed up when
>>                                   popping the
>>                                   previous values ??
>>
>>                                   To be honest I'm not even sure what an
>>             "unaligned
>>                      stack" means.
>>
>>
>>                               On some platforms, SP may have to be
>>             aligned on a 16 bytes
>>                               boundary when
>>                               calling another method.
>>
>>
>>                           Okay - that seems to be an x64 ABI requirement
>>             based on
>>                      other code
>>                           in MacroAssembler. The reason it is missing in
>>             the Solaris
>>                      code is
>>                           because I already removed it in the earlier
>>             changes that
>>                      were done
>>                           on Solaris:
>>
>> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>>
>>                           and nobody picked it up. :(
>>
>>                           That said, the absence of this code has not
>>             caused any
>>                      problems so
>>                           presumably we are normally aligned.
>>
>>                               After having pushed what needed to be
>>             saved, the code above
>>                               rounds SP
>>                               down and saves the old value. It will then
>>             also push
>>                      r11, which
>>                               results
>>                               in the expected alignment.
>>
>>                               The conterpart, after the VM call, is the:
>>                                  pop(r11);
>>                                  pop(rsp);
>>
>>
>>                           I had assumed this extra manipulation was
>>             related to
>>                      pushing the arg
>>                           for the call.
>>
>>                                       This alignment is no longer
>>             performed in the new
>>                                       (shared) implementation
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>
>>
>>
>>
>>                                       Now, Solaris was not performing
>>             the alignment and
>>                                       Windows has a separate
>>                                       path for x86_64. Did we really
>>             need the
>>                      alignment for
>>                                       linux x86_64 and
>>                                       bsd_x86_64 ? Might it be needed
>>             for other ports ?
>>
>>                                       IMHO, it might be safer to align
>>             the stack by
>>                      default,
>>                                       knowing it should
>>                                       not be expensive since we call
>>             get_thread
>>                      rarely for
>>                                       x86_64 (result is
>>                                       cached in r15). I'll let you see
>>             whether it is
>>                      worth
>>                                       adding an ifdef so
>>                                       as to explicitly deactivate the
>>             alignment on
>>                      some platforms
>>                                       (solaris_x86_64 ?)
>>
>>
>>                                   I don't have enough knowledge to even
>>             begin to
>>                      comment on
>>                                   this. I will
>>                                   have to rely on our x86_64
>>             macroassembler experts
>>                      to explain
>>                                   the old
>>                                   code and what the requirements are.
>>
>>
>>                               OK. This means that the alignment was not
>>             removed
>>                      purposefully.
>>
>>                               In that case, you must either keep the per
>>             platform
>>                      code x86_64
>>                               (since
>>                               it was not identical for all platforms) or
>>             use the safest
>>                               version, with
>>                               the additional
>>
>>                                    // XXX
>>                                    mov(r10, rsp);
>>                                    andq(rsp, -16);
>>                                    push(r10);
>>
>>                               before the push(r11) and with the
>>
>>                                    pop(rsp);
>>
>>                               after the pop(r11). It should work on all
>>             x86_64 platforms.
>>
>>
>>                           Right, I will add this back to the code - and
>>             replace the
>>                           meaningless // XXX with an actual comment!
>>
>>                           Many thanks,
>>                           David
>>                           -----
>>
>>
>>                               FYI, there is another way to do the
>>             alignment, based on
>>                      the fact
>>                               that
>>                               we are at least aligned on 8 bytes (see
>> MacroAssembler::call_VM_leaf_base). I
>>             assume that this
>>                      second
>>                               version is
>>                               more efficient (particularly thanks to
>>             specilative
>>                      branches) but it
>>                               might be safer/simpler to continue using
>>             andq in
>>                      get_thread.
>>
>>                               Bertrand.
>>
>>
>>                                   Thanks,
>>                                   David
>>
>>                                       Regards,
>>
>>                                       Bertrand.
>>
>>
>>
>>                                       On 23/11/2015 08:03, David Holmes
>>             wrote:
>>
>>                                           After all the preliminary
>>             discussions here
>>                      are final
>>                                           proposed changes:
>>
>>                                           bug:
>>             https://bugs.openjdk.java.net/browse/JDK-8132510
>>
>>                                           Open webrev:
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>
>>                                           A simple (in principle) but
>>             wide-ranging change
>>                                           which should appeal to
>>                                           our Code Deletion Engineer's.
>>             We implement
>>                                           Thread::current() using a
>> compiler/language-based
>>             thread-local
>>                      variable eg:
>>
>>                                              static __thread Thread
>>             *_thr_current;
>>
>>                                              inline Thread*
>>             Thread::current() {
>>                                                return _thr_current;
>>                                              }
>>
>>                                           with an appropriate setter of
>>             course. By
>>                      doing this
>>                                           we can completely
>>                                           remove the os_cpu-specific
>>             ThreadLocalStorage
>>                                           implementations, and the
>>                                           associated
>>             os::thread_local_storage* calls.
>>
>>                                           As __thread is not
>>             async-signal-safe (either in
>>                                           theory or practice) we
>>                                           need a fallback library-based
>>             solution as used
>>                                           today. For this we use a
>>                                           very simple ThreadLocalStorage
>>             class and an
>>                                           implementation thereof for
>>                                           POSIX (covers AIX, BSD, Linux,
>>             OSX and
>>                      Solaris) using
>> pthread_get/setspecific; and
>>             one for
>>                      Windows using
>>                                           its TLS library.
>>                                           While these library routines
>>             are not guaranteed
>>                                           async-signal-safe, they
>>                                           seem to be in practice and are
>>             what we have
>>                      been
>>                                           using all along.
>>
>>                                           We also allow for use of 
>> only the
>>                      library-based TLS
>>                                           for platforms where
>>                                           compiler-based thread locals
>>             are not
>>                      supported (as
>>                                           will be needed in
>>                                           the
>>                                           Mobile project). This is
>>             enabled at build
>>                      time by
>>                                           defining
>> USE_LIBRARY_BASED_TLS_ONLY.
>>
>>                                           Thanks to Andrew Haley for
>>             providing the
>>                      Aarch64
>>                                           code; and for Thomas
>>                                           Stuefe for testing PPC and 
>> AIX.
>>
>>                                           Testing:
>>                                              - JPRT (core platforms)
>>                                              - Jtreg tests (linux & 
>> solaris)
>>                                              - vm.runtime (core 
>> platforms)
>>
>>                                           Performance:
>>                                              - still TBD - this is
>>             proving to be
>>                      extremely
>>                                           difficult. If anyone
>>                                           has
>>                                           any simple to run
>>             microbenchmarks to
>>                      suggest I'd
>>                                           give them a try as a
>>                                           sanity check. But I lack
>>             access to hardware for
>>                                           running serious
>>                                           benchmarking.
>>
>>                                           Footprint:
>>                                           - varies by platform and the
>>             VM (server,
>>                      client,
>>                                           minimal)
>>                                           - worst-case: ~1% increase for
>>             server VM
>>                      and minimal VM
>>                                           - best-case:  0.4% decrease
>>             for client VM
>>
>>                                           Thanks,
>>                                           David
>>
>>
>>
>>
>>
>>
>>
>>
>>                  --
>>                  Bertrand Delsart,                     Grenoble
>>             Engineering Center
>>                  Oracle,         180 av. de l'Europe, ZIRST de
>>             Montbonnot
>>                  38330 Montbonnot Saint Martin, FRANCE
>>             bertrand.delsart at oracle.com
>>             <mailto:bertrand.delsart at oracle.com>
>>             <mailto:bertrand.delsart at oracle.com
>>             <mailto:bertrand.delsart at oracle.com>>
>>                            Phone : +33 4 76 18 81 23
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>                  NOTICE: This email message is for the sole use of the
>>             intended
>>                  recipient(s) and may contain confidential and 
>> privileged
>>                  information. Any unauthorized review, use, 
>> disclosure or
>>                  distribution is prohibited. If you are not the intended
>>             recipient,
>>                  please contact the sender by reply email and destroy
>>             all copies of
>>                  the original message.
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>>
>>
>>