RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

Thomas Stüfe thomas.stuefe at gmail.com
Wed Nov 25 16:53:56 UTC 2015


I created a bug to track this:

https://bugs.openjdk.java.net/browse/JDK-8144059

Because Davids change is not submitted yet, the bug refers to the old
coding, but stays relevant also after this fix.

Kind Regards, Thomas

On Wed, Nov 25, 2015 at 3:52 PM, Daniel D. Daugherty <
daniel.daugherty at oracle.com> wrote:

> On 11/25/15 6:10 AM, David Holmes wrote:
>
>> On 25/11/2015 8:46 PM, Thomas Stüfe wrote:
>>
>>>
>>> On Wed, Nov 25, 2015 at 10:22 AM, Bertrand Delsart
>>> <bertrand.delsart at oracle.com <mailto:bertrand.delsart at oracle.com>>
>>> wrote:
>>>
>>>     Good finding Thomas,
>>>
>>>     On 25/11/2015 09:40, Thomas Stüfe wrote:
>>>
>>>         Hi David, Bertrand,
>>>
>>>         about
>>>
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>>>
>>>         I am not sure this is correct for Windows x64.
>>>
>>>
>>>     Unless I'm misreading it, the new code looks a lot like the old one
>>>     (including the alignement part). Hence this does not look like a
>>>     regression caused by David's change.
>>>
>>>
>>> Yes, this seems like an old bug.
>>>
>>> I also see that this is fixed in our codebase since 2010, exactly the
>>> way you propose (with MacroAssembler
>>> <http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot>::call_VM_leaf_base
>>>
>>> <http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot>).
>>>
>>> So, it is already used since five years. I would propose opening a
>>> separate bug for this one and we contribute our fix.
>>>
>>
>> That sounds good to me too. I also suspect additional code will need to
>> be checked as this stack alignment is performed in a few places in
>> ./cpu/x86/vm/macroAssembler_x86.cpp with no OS specific differences being
>> considered.
>>
>
> Looks like MacroAssembler::call_VM_leaf_base() does the "right thing"
> for stack alignment in the X64 case. Perhaps we need to audit both
> stack alignment _and_ uses of MacroAssembler::call() just to make
> sure...
>
> Did I mention that all the assembly code stuff gives me a headache? :-)
>
> Dan
>
>
>
>
>> Thanks,
>> David
>>
>> Regards, Thomas
>>>
>>>
>>> https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>>>
>>>         windows x64 Abi requires the register parameter area right above
>>> the
>>>         return address on the stack. It is not guaranteed to be
>>>         preserved during
>>>         the function call:
>>>
>>>         "Even if the called function has fewer than 4 parameters, these
>>>         4 stack
>>>         locations are effectively owned by the called function, and may
>>>         be used
>>>         by the called function for other purposes besides saving
>>> parameter
>>>         register values. Thus the caller may not save information in
>>>         this region
>>>         of stack across a function call."
>>>
>>>         So, restoring parameters from it may not work if
>>>         Thread::current() uses
>>>         this area for any reason. In this case, r9, rsp, r10, r11 may be
>>>         overwritten by the callee.
>>>
>>>         I also think Windows x64 ABI is different enough from Linux that
>>>         this
>>>         function could live somewhere in <os>_<cpu>.
>>>
>>>
>>>     This is a valid point. However, note that this is similar to the
>>>     issue we have for instance in 'shared'
>>>     MacroAssembler::call_VM_leaf_base, which was not put in <os>_<cpu>
>>>     files.
>>>
>>>     In fact, the right fix may be to use call_VM_leaf instead of call :-)
>>>
>>>     (alternatively we could use "#ifdef _WIN64" or check
>>>     "frame::arg_reg_save_area_bytes != 0", which may be even cleaner and
>>>     more portable)
>>>
>>>     All these solutions are fine for me.
>>>
>>>
>>>     Since we are near to feature freeze, this should probably be done in
>>>     a separate CR (unless David has enough time to test it).
>>>
>>>     Regards,
>>>
>>>     Bertrand.
>>>
>>>
>>>         Kind Regards, Thomas
>>>
>>>
>>>
>>>
>>>         On Mon, Nov 23, 2015 at 10:42 PM, David Holmes
>>>         <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
>>>         <mailto:david.holmes at oracle.com
>>>         <mailto:david.holmes at oracle.com>>> wrote:
>>>
>>>              On 23/11/2015 11:46 PM, Bertrand Delsart wrote:
>>>
>>>                  On 23/11/2015 13:27, David Holmes wrote:
>>>
>>>                      Hi Bertrand,
>>>
>>>                      On 23/11/2015 8:14 PM, Bertrand Delsart wrote:
>>>
>>>                          Hi David,
>>>
>>>                          Overall looks good.
>>>                          Thanks for the #ifndef
>>>         USE_LIBRARY_BASED_TLS_ONLY :-)
>>>
>>>                          One doubt, in case this has not been discussed
>>>         before.
>>>
>>>                          I'm still catching up on x86_64 (after years of
>>> ARM
>>>                          assembly :-)) but it
>>>                          seems that there are some stack alignment
>>>         constraints on
>>>                          runtime calls,
>>>                          at least for some x86_64 ABIs.
>>>
>>>                          Some of the x86 MacroAssembler::get_thread
>>>                          implementations had code to
>>>                          align the stack before calling
>>>         pthread_getspecific. See
>>>                          for instance
>>>
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>>
>>>
>>>
>>>                      Sorry I'm not that familiar with the MacroAssembler
>>>         - is it
>>>                      this odd
>>>                      fragment:
>>>
>>>                      -   push(r10);
>>>                      -   // XXX
>>>                      -   mov(r10, rsp);
>>>                      -   andq(rsp, -16);
>>>                      -   push(r10);
>>>
>>>                      I'm not at all clear what that is doing - and if it
>>>         somehow
>>>                      changes the
>>>                      alignment wouldn't something need to be fixed up
>>> when
>>>                      popping the
>>>                      previous values ??
>>>
>>>                      To be honest I'm not even sure what an "unaligned
>>>         stack" means.
>>>
>>>
>>>                  On some platforms, SP may have to be aligned on a 16
>>> bytes
>>>                  boundary when
>>>                  calling another method.
>>>
>>>
>>>              Okay - that seems to be an x64 ABI requirement based on
>>>         other code
>>>              in MacroAssembler. The reason it is missing in the Solaris
>>>         code is
>>>              because I already removed it in the earlier changes that
>>>         were done
>>>              on Solaris:
>>>
>>> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>>>
>>>              and nobody picked it up. :(
>>>
>>>              That said, the absence of this code has not caused any
>>>         problems so
>>>              presumably we are normally aligned.
>>>
>>>                  After having pushed what needed to be saved, the code
>>> above
>>>                  rounds SP
>>>                  down and saves the old value. It will then also push
>>>         r11, which
>>>                  results
>>>                  in the expected alignment.
>>>
>>>                  The conterpart, after the VM call, is the:
>>>                     pop(r11);
>>>                     pop(rsp);
>>>
>>>
>>>              I had assumed this extra manipulation was related to
>>>         pushing the arg
>>>              for the call.
>>>
>>>                          This alignment is no longer performed in the new
>>>                          (shared) implementation
>>>
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>>
>>>
>>>
>>>
>>>                          Now, Solaris was not performing the alignment
>>> and
>>>                          Windows has a separate
>>>                          path for x86_64. Did we really need the
>>>         alignment for
>>>                          linux x86_64 and
>>>                          bsd_x86_64 ? Might it be needed for other ports
>>> ?
>>>
>>>                          IMHO, it might be safer to align the stack by
>>>         default,
>>>                          knowing it should
>>>                          not be expensive since we call get_thread
>>>         rarely for
>>>                          x86_64 (result is
>>>                          cached in r15). I'll let you see whether it is
>>>         worth
>>>                          adding an ifdef so
>>>                          as to explicitly deactivate the alignment on
>>>         some platforms
>>>                          (solaris_x86_64 ?)
>>>
>>>
>>>                      I don't have enough knowledge to even begin to
>>>         comment on
>>>                      this. I will
>>>                      have to rely on our x86_64 macroassembler experts
>>>         to explain
>>>                      the old
>>>                      code and what the requirements are.
>>>
>>>
>>>                  OK. This means that the alignment was not removed
>>>         purposefully.
>>>
>>>                  In that case, you must either keep the per platform
>>>         code x86_64
>>>                  (since
>>>                  it was not identical for all platforms) or use the
>>> safest
>>>                  version, with
>>>                  the additional
>>>
>>>                       // XXX
>>>                       mov(r10, rsp);
>>>                       andq(rsp, -16);
>>>                       push(r10);
>>>
>>>                  before the push(r11) and with the
>>>
>>>                       pop(rsp);
>>>
>>>                  after the pop(r11). It should work on all x86_64
>>> platforms.
>>>
>>>
>>>              Right, I will add this back to the code - and replace the
>>>              meaningless // XXX with an actual comment!
>>>
>>>              Many thanks,
>>>              David
>>>              -----
>>>
>>>
>>>                  FYI, there is another way to do the alignment, based on
>>>         the fact
>>>                  that
>>>                  we are at least aligned on 8 bytes (see
>>>                  MacroAssembler::call_VM_leaf_base). I assume that this
>>>         second
>>>                  version is
>>>                  more efficient (particularly thanks to specilative
>>>         branches) but it
>>>                  might be safer/simpler to continue using andq in
>>>         get_thread.
>>>
>>>                  Bertrand.
>>>
>>>
>>>                      Thanks,
>>>                      David
>>>
>>>                          Regards,
>>>
>>>                          Bertrand.
>>>
>>>
>>>
>>>                          On 23/11/2015 08:03, David Holmes wrote:
>>>
>>>                              After all the preliminary discussions here
>>>         are final
>>>                              proposed changes:
>>>
>>>                              bug:
>>>         https://bugs.openjdk.java.net/browse/JDK-8132510
>>>
>>>                              Open webrev:
>>>         http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>>
>>>                              A simple (in principle) but wide-ranging
>>> change
>>>                              which should appeal to
>>>                              our Code Deletion Engineer's. We implement
>>>                              Thread::current() using a
>>>                              compiler/language-based thread-local
>>>         variable eg:
>>>
>>>                                 static __thread Thread *_thr_current;
>>>
>>>                                 inline Thread* Thread::current() {
>>>                                   return _thr_current;
>>>                                 }
>>>
>>>                              with an appropriate setter of course. By
>>>         doing this
>>>                              we can completely
>>>                              remove the os_cpu-specific
>>> ThreadLocalStorage
>>>                              implementations, and the
>>>                              associated os::thread_local_storage* calls.
>>>
>>>                              As __thread is not async-signal-safe
>>> (either in
>>>                              theory or practice) we
>>>                              need a fallback library-based solution as
>>> used
>>>                              today. For this we use a
>>>                              very simple ThreadLocalStorage class and an
>>>                              implementation thereof for
>>>                              POSIX (covers AIX, BSD, Linux, OSX and
>>>         Solaris) using
>>>                              pthread_get/setspecific; and one for
>>>         Windows using
>>>                              its TLS library.
>>>                              While these library routines are not
>>> guaranteed
>>>                              async-signal-safe, they
>>>                              seem to be in practice and are what we have
>>>         been
>>>                              using all along.
>>>
>>>                              We also allow for use of only the
>>>         library-based TLS
>>>                              for platforms where
>>>                              compiler-based thread locals are not
>>>         supported (as
>>>                              will be needed in
>>>                              the
>>>                              Mobile project). This is enabled at build
>>>         time by
>>>                              defining
>>>                              USE_LIBRARY_BASED_TLS_ONLY.
>>>
>>>                              Thanks to Andrew Haley for providing the
>>>         Aarch64
>>>                              code; and for Thomas
>>>                              Stuefe for testing PPC and AIX.
>>>
>>>                              Testing:
>>>                                 - JPRT (core platforms)
>>>                                 - Jtreg tests (linux & solaris)
>>>                                 - vm.runtime (core platforms)
>>>
>>>                              Performance:
>>>                                 - still TBD - this is proving to be
>>>         extremely
>>>                              difficult. If anyone
>>>                              has
>>>                              any simple to run microbenchmarks to
>>>         suggest I'd
>>>                              give them a try as a
>>>                              sanity check. But I lack access to hardware
>>> for
>>>                              running serious
>>>                              benchmarking.
>>>
>>>                              Footprint:
>>>                              - varies by platform and the VM (server,
>>>         client,
>>>                              minimal)
>>>                              - worst-case: ~1% increase for server VM
>>>         and minimal VM
>>>                              - best-case:  0.4% decrease for client VM
>>>
>>>                              Thanks,
>>>                              David
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>     --
>>>     Bertrand Delsart,                     Grenoble Engineering Center
>>>     Oracle,         180 av. de l'Europe,          ZIRST de Montbonnot
>>>     38330 Montbonnot Saint Martin, FRANCE
>>>     bertrand.delsart at oracle.com <mailto:bertrand.delsart at oracle.com>
>>>               Phone : +33 4 76 18 81 23
>>>
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>     NOTICE: This email message is for the sole use of the intended
>>>     recipient(s) and may contain confidential and privileged
>>>     information. Any unauthorized review, use, disclosure or
>>>     distribution is prohibited. If you are not the intended recipient,
>>>     please contact the sender by reply email and destroy all copies of
>>>     the original message.
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>
>>>
>


More information about the hotspot-dev mailing list