RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

David Holmes david.holmes at oracle.com
Wed Nov 25 23:51:26 UTC 2015


On 26/11/2015 9:08 AM, Daniel D. Daugherty wrote:
> If I understand the problem... :-)
>
> The Win-X64 ABI requires that some extra space be allocated on
> the stack in the caller for "homing" four registers...
>
> This code uses MacroAssembler::call() which is a lower level
> call() function that assumes you know the ABI...
>
> 9765 void MacroAssembler::get_thread(Register thread) {
> <snip>
> 9784   call(RuntimeAddress(CAST_FROM_FN_PTR(address, Thread::current)));

So the failing here was nothing to do with register saving per-se, nor 
the alignment code, but "simply" not also saving the extra space prior 
to the call!

> If you switch to MacroAssembler::call_VM_leaf_base() it handles
> more of the gory details for you than MacroAssembler::call():

Wow! Not exactly a clearly-defined name for this. AFAICS there's nothing 
VM or leaf specific about this. Also the MacroAssembler is supposed to 
be a platform-agnostic higher-level abstraction that is easy to use - 
but in this case every use of the plain call on Windows-x64 is 
potentially broken if the programmer was not aware of this ABI 
constraint which only applies to Windows (and we're in OS-neutral files!).

This seems fundamentally broken to me.

Thanks for the info.

David

> src/cpu/x86/vm/macroAssembler_x86.cpp:
>
>     524  void MacroAssembler::call_VM_leaf_base(address entry_point, int
> num_args
> ) {
>     525    Label L, E;
>     526
>     527  #ifdef _WIN64
>     528    // Windows always allocates space for it's register args
>     529    assert(num_args <= 4, "only register arguments supported");
>     530    subq(rsp,  frame::arg_reg_save_area_bytes);
>     531  #endif
>     532
>     533    // Align stack if necessary
>     534    testl(rsp, 15);
>     535    jcc(Assembler::zero, L);
>     536
>     537    subq(rsp, 8);
>     538    {
>     539      call(RuntimeAddress(entry_point));
>     540    }
>
> So call_VM_leaf_base() allocates the "homing" space for register
> args for Win-X64 on L530 and it handles any stack alignment before
> calling MacroAssembler::call().
>
> So if you switch to call_VM_leaf_base() you can drop
> the stack alignment code:
>
> 9777   // ensure stack alignment
> 9778   mov(r10, rsp);
> 9779   andq(rsp, -16);
> 9780   push(r10);
>
> and just switch to:
>
> 9777   push(rsp);
>
> And you won't be exposed to potential saved register
> overwrite by the "homing" of any registers in the call
> Thread::current().
>
> I _think_ I have that all correct... :-)
>
> Dan
>
>
> On 11/25/15 3:52 PM, David Holmes wrote:
>> On 26/11/2015 2:53 AM, Thomas Stüfe wrote:
>>> I created a bug to track this:
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8144059
>>
>> Thanks. I admit I still don't really grok the exact problem here. It
>> is the alignment code that is the problem right?
>>
>> 79 mov(r10, rsp);
>> 80 andq(rsp, -16);
>> 81 push(r10);
>>
>> ?
>>
>> Thanks,
>> David
>>
>>> Because Davids change is not submitted yet, the bug refers to the old
>>> coding, but stays relevant also after this fix.
>>>
>>> Kind Regards, Thomas
>>>
>>> On Wed, Nov 25, 2015 at 3:52 PM, Daniel D. Daugherty
>>> <daniel.daugherty at oracle.com <mailto:daniel.daugherty at oracle.com>>
>>> wrote:
>>>
>>>     On 11/25/15 6:10 AM, David Holmes wrote:
>>>
>>>         On 25/11/2015 8:46 PM, Thomas Stüfe wrote:
>>>
>>>
>>>             On Wed, Nov 25, 2015 at 10:22 AM, Bertrand Delsart
>>>             <bertrand.delsart at oracle.com
>>>             <mailto:bertrand.delsart at oracle.com>
>>>             <mailto:bertrand.delsart at oracle.com
>>>             <mailto:bertrand.delsart at oracle.com>>> wrote:
>>>
>>>                  Good finding Thomas,
>>>
>>>                  On 25/11/2015 09:40, Thomas Stüfe wrote:
>>>
>>>                      Hi David, Bertrand,
>>>
>>>                      about
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>>>
>>>
>>>                      I am not sure this is correct for Windows x64.
>>>
>>>
>>>                  Unless I'm misreading it, the new code looks a lot like
>>>             the old one
>>>                  (including the alignement part). Hence this does not
>>>             look like a
>>>                  regression caused by David's change.
>>>
>>>
>>>             Yes, this seems like an old bug.
>>>
>>>             I also see that this is fixed in our codebase since 2010,
>>>             exactly the
>>>             way you propose (with MacroAssembler
>>> <http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot>::call_VM_leaf_base
>>>
>>>
>>> <http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot>).
>>>
>>>
>>>             So, it is already used since five years. I would propose
>>>             opening a
>>>             separate bug for this one and we contribute our fix.
>>>
>>>
>>>         That sounds good to me too. I also suspect additional code will
>>>         need to be checked as this stack alignment is performed in a few
>>>         places in ./cpu/x86/vm/macroAssembler_x86.cpp with no OS
>>>         specific differences being considered.
>>>
>>>
>>>     Looks like MacroAssembler::call_VM_leaf_base() does the "right
>>> thing"
>>>     for stack alignment in the X64 case. Perhaps we need to audit both
>>>     stack alignment _and_ uses of MacroAssembler::call() just to make
>>>     sure...
>>>
>>>     Did I mention that all the assembly code stuff gives me a
>>> headache? :-)
>>>
>>>     Dan
>>>
>>>
>>>
>>>
>>>         Thanks,
>>>         David
>>>
>>>             Regards, Thomas
>>>
>>>
>>> https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>>>
>>>                      windows x64 Abi requires the register parameter
>>>             area right above the
>>>                      return address on the stack. It is not
>>> guaranteed to be
>>>                      preserved during
>>>                      the function call:
>>>
>>>                      "Even if the called function has fewer than 4
>>>             parameters, these
>>>                      4 stack
>>>                      locations are effectively owned by the called
>>>             function, and may
>>>                      be used
>>>                      by the called function for other purposes besides
>>>             saving parameter
>>>                      register values. Thus the caller may not save
>>>             information in
>>>                      this region
>>>                      of stack across a function call."
>>>
>>>                      So, restoring parameters from it may not work if
>>>                      Thread::current() uses
>>>                      this area for any reason. In this case, r9, rsp,
>>>             r10, r11 may be
>>>                      overwritten by the callee.
>>>
>>>                      I also think Windows x64 ABI is different enough
>>>             from Linux that
>>>                      this
>>>                      function could live somewhere in <os>_<cpu>.
>>>
>>>
>>>                  This is a valid point. However, note that this is
>>>             similar to the
>>>                  issue we have for instance in 'shared'
>>>                  MacroAssembler::call_VM_leaf_base, which was not put in
>>>             <os>_<cpu>
>>>                  files.
>>>
>>>                  In fact, the right fix may be to use call_VM_leaf
>>>             instead of call :-)
>>>
>>>                  (alternatively we could use "#ifdef _WIN64" or check
>>>                  "frame::arg_reg_save_area_bytes != 0", which may be
>>>             even cleaner and
>>>                  more portable)
>>>
>>>                  All these solutions are fine for me.
>>>
>>>
>>>                  Since we are near to feature freeze, this should
>>>             probably be done in
>>>                  a separate CR (unless David has enough time to test
>>> it).
>>>
>>>                  Regards,
>>>
>>>                  Bertrand.
>>>
>>>
>>>                      Kind Regards, Thomas
>>>
>>>
>>>
>>>
>>>                      On Mon, Nov 23, 2015 at 10:42 PM, David Holmes
>>>                      <david.holmes at oracle.com
>>>             <mailto:david.holmes at oracle.com>
>>>             <mailto:david.holmes at oracle.com
>>>             <mailto:david.holmes at oracle.com>>
>>>                      <mailto:david.holmes at oracle.com
>>>             <mailto:david.holmes at oracle.com>
>>>                      <mailto:david.holmes at oracle.com
>>>             <mailto:david.holmes at oracle.com>>>> wrote:
>>>
>>>                           On 23/11/2015 11:46 PM, Bertrand Delsart
>>> wrote:
>>>
>>>                               On 23/11/2015 13:27, David Holmes wrote:
>>>
>>>                                   Hi Bertrand,
>>>
>>>                                   On 23/11/2015 8:14 PM, Bertrand
>>>             Delsart wrote:
>>>
>>>                                       Hi David,
>>>
>>>                                       Overall looks good.
>>>                                       Thanks for the #ifndef
>>>                      USE_LIBRARY_BASED_TLS_ONLY :-)
>>>
>>>                                       One doubt, in case this has not
>>>             been discussed
>>>                      before.
>>>
>>>                                       I'm still catching up on x86_64
>>>             (after years of ARM
>>>                                       assembly :-)) but it
>>>                                       seems that there are some stack
>>>             alignment
>>>                      constraints on
>>>                                       runtime calls,
>>>                                       at least for some x86_64 ABIs.
>>>
>>>                                       Some of the x86
>>>             MacroAssembler::get_thread
>>>                                       implementations had code to
>>>                                       align the stack before calling
>>>                      pthread_getspecific. See
>>>                                       for instance
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>>
>>>
>>>
>>>
>>>                                   Sorry I'm not that familiar with the
>>>             MacroAssembler
>>>                      - is it
>>>                                   this odd
>>>                                   fragment:
>>>
>>>                                   -   push(r10);
>>>                                   -   // XXX
>>>                                   -   mov(r10, rsp);
>>>                                   -   andq(rsp, -16);
>>>                                   -   push(r10);
>>>
>>>                                   I'm not at all clear what that is
>>>             doing - and if it
>>>                      somehow
>>>                                   changes the
>>>                                   alignment wouldn't something need to
>>>             be fixed up when
>>>                                   popping the
>>>                                   previous values ??
>>>
>>>                                   To be honest I'm not even sure what an
>>>             "unaligned
>>>                      stack" means.
>>>
>>>
>>>                               On some platforms, SP may have to be
>>>             aligned on a 16 bytes
>>>                               boundary when
>>>                               calling another method.
>>>
>>>
>>>                           Okay - that seems to be an x64 ABI requirement
>>>             based on
>>>                      other code
>>>                           in MacroAssembler. The reason it is missing in
>>>             the Solaris
>>>                      code is
>>>                           because I already removed it in the earlier
>>>             changes that
>>>                      were done
>>>                           on Solaris:
>>>
>>> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>>>
>>>                           and nobody picked it up. :(
>>>
>>>                           That said, the absence of this code has not
>>>             caused any
>>>                      problems so
>>>                           presumably we are normally aligned.
>>>
>>>                               After having pushed what needed to be
>>>             saved, the code above
>>>                               rounds SP
>>>                               down and saves the old value. It will then
>>>             also push
>>>                      r11, which
>>>                               results
>>>                               in the expected alignment.
>>>
>>>                               The conterpart, after the VM call, is the:
>>>                                  pop(r11);
>>>                                  pop(rsp);
>>>
>>>
>>>                           I had assumed this extra manipulation was
>>>             related to
>>>                      pushing the arg
>>>                           for the call.
>>>
>>>                                       This alignment is no longer
>>>             performed in the new
>>>                                       (shared) implementation
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>>
>>>
>>>
>>>
>>>
>>>                                       Now, Solaris was not performing
>>>             the alignment and
>>>                                       Windows has a separate
>>>                                       path for x86_64. Did we really
>>>             need the
>>>                      alignment for
>>>                                       linux x86_64 and
>>>                                       bsd_x86_64 ? Might it be needed
>>>             for other ports ?
>>>
>>>                                       IMHO, it might be safer to align
>>>             the stack by
>>>                      default,
>>>                                       knowing it should
>>>                                       not be expensive since we call
>>>             get_thread
>>>                      rarely for
>>>                                       x86_64 (result is
>>>                                       cached in r15). I'll let you see
>>>             whether it is
>>>                      worth
>>>                                       adding an ifdef so
>>>                                       as to explicitly deactivate the
>>>             alignment on
>>>                      some platforms
>>>                                       (solaris_x86_64 ?)
>>>
>>>
>>>                                   I don't have enough knowledge to even
>>>             begin to
>>>                      comment on
>>>                                   this. I will
>>>                                   have to rely on our x86_64
>>>             macroassembler experts
>>>                      to explain
>>>                                   the old
>>>                                   code and what the requirements are.
>>>
>>>
>>>                               OK. This means that the alignment was not
>>>             removed
>>>                      purposefully.
>>>
>>>                               In that case, you must either keep the per
>>>             platform
>>>                      code x86_64
>>>                               (since
>>>                               it was not identical for all platforms) or
>>>             use the safest
>>>                               version, with
>>>                               the additional
>>>
>>>                                    // XXX
>>>                                    mov(r10, rsp);
>>>                                    andq(rsp, -16);
>>>                                    push(r10);
>>>
>>>                               before the push(r11) and with the
>>>
>>>                                    pop(rsp);
>>>
>>>                               after the pop(r11). It should work on all
>>>             x86_64 platforms.
>>>
>>>
>>>                           Right, I will add this back to the code - and
>>>             replace the
>>>                           meaningless // XXX with an actual comment!
>>>
>>>                           Many thanks,
>>>                           David
>>>                           -----
>>>
>>>
>>>                               FYI, there is another way to do the
>>>             alignment, based on
>>>                      the fact
>>>                               that
>>>                               we are at least aligned on 8 bytes (see
>>> MacroAssembler::call_VM_leaf_base). I
>>>             assume that this
>>>                      second
>>>                               version is
>>>                               more efficient (particularly thanks to
>>>             specilative
>>>                      branches) but it
>>>                               might be safer/simpler to continue using
>>>             andq in
>>>                      get_thread.
>>>
>>>                               Bertrand.
>>>
>>>
>>>                                   Thanks,
>>>                                   David
>>>
>>>                                       Regards,
>>>
>>>                                       Bertrand.
>>>
>>>
>>>
>>>                                       On 23/11/2015 08:03, David Holmes
>>>             wrote:
>>>
>>>                                           After all the preliminary
>>>             discussions here
>>>                      are final
>>>                                           proposed changes:
>>>
>>>                                           bug:
>>>             https://bugs.openjdk.java.net/browse/JDK-8132510
>>>
>>>                                           Open webrev:
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>>
>>>                                           A simple (in principle) but
>>>             wide-ranging change
>>>                                           which should appeal to
>>>                                           our Code Deletion Engineer's.
>>>             We implement
>>>                                           Thread::current() using a
>>> compiler/language-based
>>>             thread-local
>>>                      variable eg:
>>>
>>>                                              static __thread Thread
>>>             *_thr_current;
>>>
>>>                                              inline Thread*
>>>             Thread::current() {
>>>                                                return _thr_current;
>>>                                              }
>>>
>>>                                           with an appropriate setter of
>>>             course. By
>>>                      doing this
>>>                                           we can completely
>>>                                           remove the os_cpu-specific
>>>             ThreadLocalStorage
>>>                                           implementations, and the
>>>                                           associated
>>>             os::thread_local_storage* calls.
>>>
>>>                                           As __thread is not
>>>             async-signal-safe (either in
>>>                                           theory or practice) we
>>>                                           need a fallback library-based
>>>             solution as used
>>>                                           today. For this we use a
>>>                                           very simple ThreadLocalStorage
>>>             class and an
>>>                                           implementation thereof for
>>>                                           POSIX (covers AIX, BSD, Linux,
>>>             OSX and
>>>                      Solaris) using
>>> pthread_get/setspecific; and
>>>             one for
>>>                      Windows using
>>>                                           its TLS library.
>>>                                           While these library routines
>>>             are not guaranteed
>>>                                           async-signal-safe, they
>>>                                           seem to be in practice and are
>>>             what we have
>>>                      been
>>>                                           using all along.
>>>
>>>                                           We also allow for use of
>>> only the
>>>                      library-based TLS
>>>                                           for platforms where
>>>                                           compiler-based thread locals
>>>             are not
>>>                      supported (as
>>>                                           will be needed in
>>>                                           the
>>>                                           Mobile project). This is
>>>             enabled at build
>>>                      time by
>>>                                           defining
>>> USE_LIBRARY_BASED_TLS_ONLY.
>>>
>>>                                           Thanks to Andrew Haley for
>>>             providing the
>>>                      Aarch64
>>>                                           code; and for Thomas
>>>                                           Stuefe for testing PPC and
>>> AIX.
>>>
>>>                                           Testing:
>>>                                              - JPRT (core platforms)
>>>                                              - Jtreg tests (linux &
>>> solaris)
>>>                                              - vm.runtime (core
>>> platforms)
>>>
>>>                                           Performance:
>>>                                              - still TBD - this is
>>>             proving to be
>>>                      extremely
>>>                                           difficult. If anyone
>>>                                           has
>>>                                           any simple to run
>>>             microbenchmarks to
>>>                      suggest I'd
>>>                                           give them a try as a
>>>                                           sanity check. But I lack
>>>             access to hardware for
>>>                                           running serious
>>>                                           benchmarking.
>>>
>>>                                           Footprint:
>>>                                           - varies by platform and the
>>>             VM (server,
>>>                      client,
>>>                                           minimal)
>>>                                           - worst-case: ~1% increase for
>>>             server VM
>>>                      and minimal VM
>>>                                           - best-case:  0.4% decrease
>>>             for client VM
>>>
>>>                                           Thanks,
>>>                                           David
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>                  --
>>>                  Bertrand Delsart,                     Grenoble
>>>             Engineering Center
>>>                  Oracle,         180 av. de l'Europe, ZIRST de
>>>             Montbonnot
>>>                  38330 Montbonnot Saint Martin, FRANCE
>>>             bertrand.delsart at oracle.com
>>>             <mailto:bertrand.delsart at oracle.com>
>>>             <mailto:bertrand.delsart at oracle.com
>>>             <mailto:bertrand.delsart at oracle.com>>
>>>                            Phone : +33 4 76 18 81 23
>>>
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>                  NOTICE: This email message is for the sole use of the
>>>             intended
>>>                  recipient(s) and may contain confidential and
>>> privileged
>>>                  information. Any unauthorized review, use,
>>> disclosure or
>>>                  distribution is prohibited. If you are not the intended
>>>             recipient,
>>>                  please contact the sender by reply email and destroy
>>>             all copies of
>>>                  the original message.
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>
>>>
>>>
>


More information about the hotspot-dev mailing list