RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

Thomas Stüfe thomas.stuefe at gmail.com
Thu Nov 26 09:56:59 UTC 2015


Hi David,


On Thu, Nov 26, 2015 at 12:51 AM, David Holmes <david.holmes at oracle.com>
wrote:

> On 26/11/2015 9:08 AM, Daniel D. Daugherty wrote:
>
>> If I understand the problem... :-)
>>
>> The Win-X64 ABI requires that some extra space be allocated on
>> the stack in the caller for "homing" four registers...
>>
>> This code uses MacroAssembler::call() which is a lower level
>> call() function that assumes you know the ABI...
>>
>> 9765 void MacroAssembler::get_thread(Register thread) {
>> <snip>
>> 9784   call(RuntimeAddress(CAST_FROM_FN_PTR(address, Thread::current)));
>>
>
> So the failing here was nothing to do with register saving per-se, nor the
> alignment code, but "simply" not also saving the extra space prior to the
> call!
>
>
Yes. These 32 byte are a scratch area for the callee, which may be used for
parameter registers or anything else. We may just have been lucky so far
that TlsGetValue() did not use this area. As Bertrand already said, now we
call Thread::current(), which does magic tls accesses under the hood, so
this must be tested again.

However I think that a simple addition to your current fix could simply be
subtracting 32 from rsp before the call:

subq(rsp, 32)
call(RuntimeAddress(CAST_FROM_FN_PTR(address, Thread::current)));
addq(rsp, 32)

Also, we could probably skip saving/restoring rdi and rsi on Windows:

https://msdn.microsoft.com/en-us/library/6t169e9c.aspx

"The registers RAX, RCX, RDX, R8, R9, R10, R11 are considered volatile and
must be considered destroyed on function calls (unless otherwise
safety-provable by analysis such as whole program optimization). The
registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are considered
nonvolatile and must be saved and restored by a function that uses them."

Best Regards, Thomas


If you switch to MacroAssembler::call_VM_leaf_base() it handles
>> more of the gory details for you than MacroAssembler::call():
>>
>
> Wow! Not exactly a clearly-defined name for this. AFAICS there's nothing
> VM or leaf specific about this. Also the MacroAssembler is supposed to be a
> platform-agnostic higher-level abstraction that is easy to use - but in
> this case every use of the plain call on Windows-x64 is potentially broken
> if the programmer was not aware of this ABI constraint which only applies
> to Windows (and we're in OS-neutral files!).
>
> This seems fundamentally broken to me.
>
> Thanks for the info.
>
> David
>
>
> src/cpu/x86/vm/macroAssembler_x86.cpp:
>>
>>     524  void MacroAssembler::call_VM_leaf_base(address entry_point, int
>> num_args
>> ) {
>>     525    Label L, E;
>>     526
>>     527  #ifdef _WIN64
>>     528    // Windows always allocates space for it's register args
>>     529    assert(num_args <= 4, "only register arguments supported");
>>     530    subq(rsp,  frame::arg_reg_save_area_bytes);
>>     531  #endif
>>     532
>>     533    // Align stack if necessary
>>     534    testl(rsp, 15);
>>     535    jcc(Assembler::zero, L);
>>     536
>>     537    subq(rsp, 8);
>>     538    {
>>     539      call(RuntimeAddress(entry_point));
>>     540    }
>>
>> So call_VM_leaf_base() allocates the "homing" space for register
>> args for Win-X64 on L530 and it handles any stack alignment before
>> calling MacroAssembler::call().
>>
>> So if you switch to call_VM_leaf_base() you can drop
>> the stack alignment code:
>>
>> 9777   // ensure stack alignment
>> 9778   mov(r10, rsp);
>> 9779   andq(rsp, -16);
>> 9780   push(r10);
>>
>> and just switch to:
>>
>> 9777   push(rsp);
>>
>> And you won't be exposed to potential saved register
>> overwrite by the "homing" of any registers in the call
>> Thread::current().
>>
>> I _think_ I have that all correct... :-)
>>
>> Dan
>>
>>
>> On 11/25/15 3:52 PM, David Holmes wrote:
>>
>>> On 26/11/2015 2:53 AM, Thomas Stüfe wrote:
>>>
>>>> I created a bug to track this:
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8144059
>>>>
>>>
>>> Thanks. I admit I still don't really grok the exact problem here. It
>>> is the alignment code that is the problem right?
>>>
>>> 79 mov(r10, rsp);
>>> 80 andq(rsp, -16);
>>> 81 push(r10);
>>>
>>> ?
>>>
>>> Thanks,
>>> David
>>>
>>> Because Davids change is not submitted yet, the bug refers to the old
>>>> coding, but stays relevant also after this fix.
>>>>
>>>> Kind Regards, Thomas
>>>>
>>>> On Wed, Nov 25, 2015 at 3:52 PM, Daniel D. Daugherty
>>>> <daniel.daugherty at oracle.com <mailto:daniel.daugherty at oracle.com>>
>>>> wrote:
>>>>
>>>>     On 11/25/15 6:10 AM, David Holmes wrote:
>>>>
>>>>         On 25/11/2015 8:46 PM, Thomas Stüfe wrote:
>>>>
>>>>
>>>>             On Wed, Nov 25, 2015 at 10:22 AM, Bertrand Delsart
>>>>             <bertrand.delsart at oracle.com
>>>>             <mailto:bertrand.delsart at oracle.com>
>>>>             <mailto:bertrand.delsart at oracle.com
>>>>             <mailto:bertrand.delsart at oracle.com>>> wrote:
>>>>
>>>>                  Good finding Thomas,
>>>>
>>>>                  On 25/11/2015 09:40, Thomas Stüfe wrote:
>>>>
>>>>                      Hi David, Bertrand,
>>>>
>>>>                      about
>>>>
>>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>>>>
>>>>
>>>>                      I am not sure this is correct for Windows x64.
>>>>
>>>>
>>>>                  Unless I'm misreading it, the new code looks a lot like
>>>>             the old one
>>>>                  (including the alignement part). Hence this does not
>>>>             look like a
>>>>                  regression caused by David's change.
>>>>
>>>>
>>>>             Yes, this seems like an old bug.
>>>>
>>>>             I also see that this is fixed in our codebase since 2010,
>>>>             exactly the
>>>>             way you propose (with MacroAssembler
>>>> <http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot
>>>> >::call_VM_leaf_base
>>>>
>>>>
>>>> <http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot
>>>> >).
>>>>
>>>>
>>>>             So, it is already used since five years. I would propose
>>>>             opening a
>>>>             separate bug for this one and we contribute our fix.
>>>>
>>>>
>>>>         That sounds good to me too. I also suspect additional code will
>>>>         need to be checked as this stack alignment is performed in a few
>>>>         places in ./cpu/x86/vm/macroAssembler_x86.cpp with no OS
>>>>         specific differences being considered.
>>>>
>>>>
>>>>     Looks like MacroAssembler::call_VM_leaf_base() does the "right
>>>> thing"
>>>>     for stack alignment in the X64 case. Perhaps we need to audit both
>>>>     stack alignment _and_ uses of MacroAssembler::call() just to make
>>>>     sure...
>>>>
>>>>     Did I mention that all the assembly code stuff gives me a
>>>> headache? :-)
>>>>
>>>>     Dan
>>>>
>>>>
>>>>
>>>>
>>>>         Thanks,
>>>>         David
>>>>
>>>>             Regards, Thomas
>>>>
>>>>
>>>> https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>>>>
>>>>                      windows x64 Abi requires the register parameter
>>>>             area right above the
>>>>                      return address on the stack. It is not
>>>> guaranteed to be
>>>>                      preserved during
>>>>                      the function call:
>>>>
>>>>                      "Even if the called function has fewer than 4
>>>>             parameters, these
>>>>                      4 stack
>>>>                      locations are effectively owned by the called
>>>>             function, and may
>>>>                      be used
>>>>                      by the called function for other purposes besides
>>>>             saving parameter
>>>>                      register values. Thus the caller may not save
>>>>             information in
>>>>                      this region
>>>>                      of stack across a function call."
>>>>
>>>>                      So, restoring parameters from it may not work if
>>>>                      Thread::current() uses
>>>>                      this area for any reason. In this case, r9, rsp,
>>>>             r10, r11 may be
>>>>                      overwritten by the callee.
>>>>
>>>>                      I also think Windows x64 ABI is different enough
>>>>             from Linux that
>>>>                      this
>>>>                      function could live somewhere in <os>_<cpu>.
>>>>
>>>>
>>>>                  This is a valid point. However, note that this is
>>>>             similar to the
>>>>                  issue we have for instance in 'shared'
>>>>                  MacroAssembler::call_VM_leaf_base, which was not put in
>>>>             <os>_<cpu>
>>>>                  files.
>>>>
>>>>                  In fact, the right fix may be to use call_VM_leaf
>>>>             instead of call :-)
>>>>
>>>>                  (alternatively we could use "#ifdef _WIN64" or check
>>>>                  "frame::arg_reg_save_area_bytes != 0", which may be
>>>>             even cleaner and
>>>>                  more portable)
>>>>
>>>>                  All these solutions are fine for me.
>>>>
>>>>
>>>>                  Since we are near to feature freeze, this should
>>>>             probably be done in
>>>>                  a separate CR (unless David has enough time to test
>>>> it).
>>>>
>>>>                  Regards,
>>>>
>>>>                  Bertrand.
>>>>
>>>>
>>>>                      Kind Regards, Thomas
>>>>
>>>>
>>>>
>>>>
>>>>                      On Mon, Nov 23, 2015 at 10:42 PM, David Holmes
>>>>                      <david.holmes at oracle.com
>>>>             <mailto:david.holmes at oracle.com>
>>>>             <mailto:david.holmes at oracle.com
>>>>             <mailto:david.holmes at oracle.com>>
>>>>                      <mailto:david.holmes at oracle.com
>>>>             <mailto:david.holmes at oracle.com>
>>>>                      <mailto:david.holmes at oracle.com
>>>>             <mailto:david.holmes at oracle.com>>>> wrote:
>>>>
>>>>                           On 23/11/2015 11:46 PM, Bertrand Delsart
>>>> wrote:
>>>>
>>>>                               On 23/11/2015 13:27, David Holmes wrote:
>>>>
>>>>                                   Hi Bertrand,
>>>>
>>>>                                   On 23/11/2015 8:14 PM, Bertrand
>>>>             Delsart wrote:
>>>>
>>>>                                       Hi David,
>>>>
>>>>                                       Overall looks good.
>>>>                                       Thanks for the #ifndef
>>>>                      USE_LIBRARY_BASED_TLS_ONLY :-)
>>>>
>>>>                                       One doubt, in case this has not
>>>>             been discussed
>>>>                      before.
>>>>
>>>>                                       I'm still catching up on x86_64
>>>>             (after years of ARM
>>>>                                       assembly :-)) but it
>>>>                                       seems that there are some stack
>>>>             alignment
>>>>                      constraints on
>>>>                                       runtime calls,
>>>>                                       at least for some x86_64 ABIs.
>>>>
>>>>                                       Some of the x86
>>>>             MacroAssembler::get_thread
>>>>                                       implementations had code to
>>>>                                       align the stack before calling
>>>>                      pthread_getspecific. See
>>>>                                       for instance
>>>>
>>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>>>
>>>>
>>>>
>>>>
>>>>                                   Sorry I'm not that familiar with the
>>>>             MacroAssembler
>>>>                      - is it
>>>>                                   this odd
>>>>                                   fragment:
>>>>
>>>>                                   -   push(r10);
>>>>                                   -   // XXX
>>>>                                   -   mov(r10, rsp);
>>>>                                   -   andq(rsp, -16);
>>>>                                   -   push(r10);
>>>>
>>>>                                   I'm not at all clear what that is
>>>>             doing - and if it
>>>>                      somehow
>>>>                                   changes the
>>>>                                   alignment wouldn't something need to
>>>>             be fixed up when
>>>>                                   popping the
>>>>                                   previous values ??
>>>>
>>>>                                   To be honest I'm not even sure what an
>>>>             "unaligned
>>>>                      stack" means.
>>>>
>>>>
>>>>                               On some platforms, SP may have to be
>>>>             aligned on a 16 bytes
>>>>                               boundary when
>>>>                               calling another method.
>>>>
>>>>
>>>>                           Okay - that seems to be an x64 ABI requirement
>>>>             based on
>>>>                      other code
>>>>                           in MacroAssembler. The reason it is missing in
>>>>             the Solaris
>>>>                      code is
>>>>                           because I already removed it in the earlier
>>>>             changes that
>>>>                      were done
>>>>                           on Solaris:
>>>>
>>>> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>>>>
>>>>                           and nobody picked it up. :(
>>>>
>>>>                           That said, the absence of this code has not
>>>>             caused any
>>>>                      problems so
>>>>                           presumably we are normally aligned.
>>>>
>>>>                               After having pushed what needed to be
>>>>             saved, the code above
>>>>                               rounds SP
>>>>                               down and saves the old value. It will then
>>>>             also push
>>>>                      r11, which
>>>>                               results
>>>>                               in the expected alignment.
>>>>
>>>>                               The conterpart, after the VM call, is the:
>>>>                                  pop(r11);
>>>>                                  pop(rsp);
>>>>
>>>>
>>>>                           I had assumed this extra manipulation was
>>>>             related to
>>>>                      pushing the arg
>>>>                           for the call.
>>>>
>>>>                                       This alignment is no longer
>>>>             performed in the new
>>>>                                       (shared) implementation
>>>>
>>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>                                       Now, Solaris was not performing
>>>>             the alignment and
>>>>                                       Windows has a separate
>>>>                                       path for x86_64. Did we really
>>>>             need the
>>>>                      alignment for
>>>>                                       linux x86_64 and
>>>>                                       bsd_x86_64 ? Might it be needed
>>>>             for other ports ?
>>>>
>>>>                                       IMHO, it might be safer to align
>>>>             the stack by
>>>>                      default,
>>>>                                       knowing it should
>>>>                                       not be expensive since we call
>>>>             get_thread
>>>>                      rarely for
>>>>                                       x86_64 (result is
>>>>                                       cached in r15). I'll let you see
>>>>             whether it is
>>>>                      worth
>>>>                                       adding an ifdef so
>>>>                                       as to explicitly deactivate the
>>>>             alignment on
>>>>                      some platforms
>>>>                                       (solaris_x86_64 ?)
>>>>
>>>>
>>>>                                   I don't have enough knowledge to even
>>>>             begin to
>>>>                      comment on
>>>>                                   this. I will
>>>>                                   have to rely on our x86_64
>>>>             macroassembler experts
>>>>                      to explain
>>>>                                   the old
>>>>                                   code and what the requirements are.
>>>>
>>>>
>>>>                               OK. This means that the alignment was not
>>>>             removed
>>>>                      purposefully.
>>>>
>>>>                               In that case, you must either keep the per
>>>>             platform
>>>>                      code x86_64
>>>>                               (since
>>>>                               it was not identical for all platforms) or
>>>>             use the safest
>>>>                               version, with
>>>>                               the additional
>>>>
>>>>                                    // XXX
>>>>                                    mov(r10, rsp);
>>>>                                    andq(rsp, -16);
>>>>                                    push(r10);
>>>>
>>>>                               before the push(r11) and with the
>>>>
>>>>                                    pop(rsp);
>>>>
>>>>                               after the pop(r11). It should work on all
>>>>             x86_64 platforms.
>>>>
>>>>
>>>>                           Right, I will add this back to the code - and
>>>>             replace the
>>>>                           meaningless // XXX with an actual comment!
>>>>
>>>>                           Many thanks,
>>>>                           David
>>>>                           -----
>>>>
>>>>
>>>>                               FYI, there is another way to do the
>>>>             alignment, based on
>>>>                      the fact
>>>>                               that
>>>>                               we are at least aligned on 8 bytes (see
>>>> MacroAssembler::call_VM_leaf_base). I
>>>>             assume that this
>>>>                      second
>>>>                               version is
>>>>                               more efficient (particularly thanks to
>>>>             specilative
>>>>                      branches) but it
>>>>                               might be safer/simpler to continue using
>>>>             andq in
>>>>                      get_thread.
>>>>
>>>>                               Bertrand.
>>>>
>>>>
>>>>                                   Thanks,
>>>>                                   David
>>>>
>>>>                                       Regards,
>>>>
>>>>                                       Bertrand.
>>>>
>>>>
>>>>
>>>>                                       On 23/11/2015 08:03, David Holmes
>>>>             wrote:
>>>>
>>>>                                           After all the preliminary
>>>>             discussions here
>>>>                      are final
>>>>                                           proposed changes:
>>>>
>>>>                                           bug:
>>>>             https://bugs.openjdk.java.net/browse/JDK-8132510
>>>>
>>>>                                           Open webrev:
>>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>>>
>>>>                                           A simple (in principle) but
>>>>             wide-ranging change
>>>>                                           which should appeal to
>>>>                                           our Code Deletion Engineer's.
>>>>             We implement
>>>>                                           Thread::current() using a
>>>> compiler/language-based
>>>>             thread-local
>>>>                      variable eg:
>>>>
>>>>                                              static __thread Thread
>>>>             *_thr_current;
>>>>
>>>>                                              inline Thread*
>>>>             Thread::current() {
>>>>                                                return _thr_current;
>>>>                                              }
>>>>
>>>>                                           with an appropriate setter of
>>>>             course. By
>>>>                      doing this
>>>>                                           we can completely
>>>>                                           remove the os_cpu-specific
>>>>             ThreadLocalStorage
>>>>                                           implementations, and the
>>>>                                           associated
>>>>             os::thread_local_storage* calls.
>>>>
>>>>                                           As __thread is not
>>>>             async-signal-safe (either in
>>>>                                           theory or practice) we
>>>>                                           need a fallback library-based
>>>>             solution as used
>>>>                                           today. For this we use a
>>>>                                           very simple ThreadLocalStorage
>>>>             class and an
>>>>                                           implementation thereof for
>>>>                                           POSIX (covers AIX, BSD, Linux,
>>>>             OSX and
>>>>                      Solaris) using
>>>> pthread_get/setspecific; and
>>>>             one for
>>>>                      Windows using
>>>>                                           its TLS library.
>>>>                                           While these library routines
>>>>             are not guaranteed
>>>>                                           async-signal-safe, they
>>>>                                           seem to be in practice and are
>>>>             what we have
>>>>                      been
>>>>                                           using all along.
>>>>
>>>>                                           We also allow for use of
>>>> only the
>>>>                      library-based TLS
>>>>                                           for platforms where
>>>>                                           compiler-based thread locals
>>>>             are not
>>>>                      supported (as
>>>>                                           will be needed in
>>>>                                           the
>>>>                                           Mobile project). This is
>>>>             enabled at build
>>>>                      time by
>>>>                                           defining
>>>> USE_LIBRARY_BASED_TLS_ONLY.
>>>>
>>>>                                           Thanks to Andrew Haley for
>>>>             providing the
>>>>                      Aarch64
>>>>                                           code; and for Thomas
>>>>                                           Stuefe for testing PPC and
>>>> AIX.
>>>>
>>>>                                           Testing:
>>>>                                              - JPRT (core platforms)
>>>>                                              - Jtreg tests (linux &
>>>> solaris)
>>>>                                              - vm.runtime (core
>>>> platforms)
>>>>
>>>>                                           Performance:
>>>>                                              - still TBD - this is
>>>>             proving to be
>>>>                      extremely
>>>>                                           difficult. If anyone
>>>>                                           has
>>>>                                           any simple to run
>>>>             microbenchmarks to
>>>>                      suggest I'd
>>>>                                           give them a try as a
>>>>                                           sanity check. But I lack
>>>>             access to hardware for
>>>>                                           running serious
>>>>                                           benchmarking.
>>>>
>>>>                                           Footprint:
>>>>                                           - varies by platform and the
>>>>             VM (server,
>>>>                      client,
>>>>                                           minimal)
>>>>                                           - worst-case: ~1% increase for
>>>>             server VM
>>>>                      and minimal VM
>>>>                                           - best-case:  0.4% decrease
>>>>             for client VM
>>>>
>>>>                                           Thanks,
>>>>                                           David
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>                  --
>>>>                  Bertrand Delsart,                     Grenoble
>>>>             Engineering Center
>>>>                  Oracle,         180 av. de l'Europe, ZIRST de
>>>>             Montbonnot
>>>>                  38330 Montbonnot Saint Martin, FRANCE
>>>>             bertrand.delsart at oracle.com
>>>>             <mailto:bertrand.delsart at oracle.com>
>>>>             <mailto:bertrand.delsart at oracle.com
>>>>             <mailto:bertrand.delsart at oracle.com>>
>>>>                            Phone : +33 4 76 18 81 23
>>>>
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>                  NOTICE: This email message is for the sole use of the
>>>>             intended
>>>>                  recipient(s) and may contain confidential and
>>>> privileged
>>>>                  information. Any unauthorized review, use,
>>>> disclosure or
>>>>                  distribution is prohibited. If you are not the intended
>>>>             recipient,
>>>>                  please contact the sender by reply email and destroy
>>>>             all copies of
>>>>                  the original message.
>>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>
>>>>
>>>>
>>>>
>>>>
>>


More information about the hotspot-dev mailing list