RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

Wed Nov 25 14:52:41 UTC 2015

On 11/25/15 6:10 AM, David Holmes wrote:
> On 25/11/2015 8:46 PM, Thomas Stüfe wrote:
>>
>> On Wed, Nov 25, 2015 at 10:22 AM, Bertrand Delsart
>> <bertrand.delsart at oracle.com <mailto:bertrand.delsart at oracle.com>> 
>> wrote:
>>
>>     Good finding Thomas,
>>
>>     On 25/11/2015 09:40, Thomas Stüfe wrote:
>>
>>         Hi David, Bertrand,
>>
>>         about
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>>
>>         I am not sure this is correct for Windows x64.
>>
>>
>>     Unless I'm misreading it, the new code looks a lot like the old one
>>     (including the alignement part). Hence this does not look like a
>>     regression caused by David's change.
>>
>>
>> Yes, this seems like an old bug.
>>
>> I also see that this is fixed in our codebase since 2010, exactly the
>> way you propose (with MacroAssembler
>> <http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot>::call_VM_leaf_base 
>>
>> <http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot>). 
>>
>> So, it is already used since five years. I would propose opening a
>> separate bug for this one and we contribute our fix.
>
> That sounds good to me too. I also suspect additional code will need 
> to be checked as this stack alignment is performed in a few places in 
> ./cpu/x86/vm/macroAssembler_x86.cpp with no OS specific differences 
> being considered.

Looks like MacroAssembler::call_VM_leaf_base() does the "right thing"
for stack alignment in the X64 case. Perhaps we need to audit both
stack alignment _and_ uses of MacroAssembler::call() just to make
sure...

Did I mention that all the assembly code stuff gives me a headache? :-)

Dan

>
> Thanks,
> David
>
>> Regards, Thomas
>>
>>
>> https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>>
>>         windows x64 Abi requires the register parameter area right 
>> above the
>>         return address on the stack. It is not guaranteed to be
>>         preserved during
>>         the function call:
>>
>>         "Even if the called function has fewer than 4 parameters, these
>>         4 stack
>>         locations are effectively owned by the called function, and may
>>         be used
>>         by the called function for other purposes besides saving 
>> parameter
>>         register values. Thus the caller may not save information in
>>         this region
>>         of stack across a function call."
>>
>>         So, restoring parameters from it may not work if
>>         Thread::current() uses
>>         this area for any reason. In this case, r9, rsp, r10, r11 may be
>>         overwritten by the callee.
>>
>>         I also think Windows x64 ABI is different enough from Linux that
>>         this
>>         function could live somewhere in <os>_<cpu>.
>>
>>
>>     This is a valid point. However, note that this is similar to the
>>     issue we have for instance in 'shared'
>>     MacroAssembler::call_VM_leaf_base, which was not put in <os>_<cpu>
>>     files.
>>
>>     In fact, the right fix may be to use call_VM_leaf instead of call 
>> :-)
>>
>>     (alternatively we could use "#ifdef _WIN64" or check
>>     "frame::arg_reg_save_area_bytes != 0", which may be even cleaner and
>>     more portable)
>>
>>     All these solutions are fine for me.
>>
>>
>>     Since we are near to feature freeze, this should probably be done in
>>     a separate CR (unless David has enough time to test it).
>>
>>     Regards,
>>
>>     Bertrand.
>>
>>
>>         Kind Regards, Thomas
>>
>>
>>
>>
>>         On Mon, Nov 23, 2015 at 10:42 PM, David Holmes
>>         <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
>>         <mailto:david.holmes at oracle.com
>>         <mailto:david.holmes at oracle.com>>> wrote:
>>
>>              On 23/11/2015 11:46 PM, Bertrand Delsart wrote:
>>
>>                  On 23/11/2015 13:27, David Holmes wrote:
>>
>>                      Hi Bertrand,
>>
>>                      On 23/11/2015 8:14 PM, Bertrand Delsart wrote:
>>
>>                          Hi David,
>>
>>                          Overall looks good.
>>                          Thanks for the #ifndef
>>         USE_LIBRARY_BASED_TLS_ONLY :-)
>>
>>                          One doubt, in case this has not been discussed
>>         before.
>>
>>                          I'm still catching up on x86_64 (after years 
>> of ARM
>>                          assembly :-)) but it
>>                          seems that there are some stack alignment
>>         constraints on
>>                          runtime calls,
>>                          at least for some x86_64 ABIs.
>>
>>                          Some of the x86 MacroAssembler::get_thread
>>                          implementations had code to
>>                          align the stack before calling
>>         pthread_getspecific. See
>>                          for instance
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>
>>
>>
>>                      Sorry I'm not that familiar with the MacroAssembler
>>         - is it
>>                      this odd
>>                      fragment:
>>
>>                      -   push(r10);
>>                      -   // XXX
>>                      -   mov(r10, rsp);
>>                      -   andq(rsp, -16);
>>                      -   push(r10);
>>
>>                      I'm not at all clear what that is doing - and if it
>>         somehow
>>                      changes the
>>                      alignment wouldn't something need to be fixed up 
>> when
>>                      popping the
>>                      previous values ??
>>
>>                      To be honest I'm not even sure what an "unaligned
>>         stack" means.
>>
>>
>>                  On some platforms, SP may have to be aligned on a 16 
>> bytes
>>                  boundary when
>>                  calling another method.
>>
>>
>>              Okay - that seems to be an x64 ABI requirement based on
>>         other code
>>              in MacroAssembler. The reason it is missing in the Solaris
>>         code is
>>              because I already removed it in the earlier changes that
>>         were done
>>              on Solaris:
>>
>> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>>
>>              and nobody picked it up. :(
>>
>>              That said, the absence of this code has not caused any
>>         problems so
>>              presumably we are normally aligned.
>>
>>                  After having pushed what needed to be saved, the 
>> code above
>>                  rounds SP
>>                  down and saves the old value. It will then also push
>>         r11, which
>>                  results
>>                  in the expected alignment.
>>
>>                  The conterpart, after the VM call, is the:
>>                     pop(r11);
>>                     pop(rsp);
>>
>>
>>              I had assumed this extra manipulation was related to
>>         pushing the arg
>>              for the call.
>>
>>                          This alignment is no longer performed in the 
>> new
>>                          (shared) implementation
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>
>>
>>
>>
>>                          Now, Solaris was not performing the 
>> alignment and
>>                          Windows has a separate
>>                          path for x86_64. Did we really need the
>>         alignment for
>>                          linux x86_64 and
>>                          bsd_x86_64 ? Might it be needed for other 
>> ports ?
>>
>>                          IMHO, it might be safer to align the stack by
>>         default,
>>                          knowing it should
>>                          not be expensive since we call get_thread
>>         rarely for
>>                          x86_64 (result is
>>                          cached in r15). I'll let you see whether it is
>>         worth
>>                          adding an ifdef so
>>                          as to explicitly deactivate the alignment on
>>         some platforms
>>                          (solaris_x86_64 ?)
>>
>>
>>                      I don't have enough knowledge to even begin to
>>         comment on
>>                      this. I will
>>                      have to rely on our x86_64 macroassembler experts
>>         to explain
>>                      the old
>>                      code and what the requirements are.
>>
>>
>>                  OK. This means that the alignment was not removed
>>         purposefully.
>>
>>                  In that case, you must either keep the per platform
>>         code x86_64
>>                  (since
>>                  it was not identical for all platforms) or use the 
>> safest
>>                  version, with
>>                  the additional
>>
>>                       // XXX
>>                       mov(r10, rsp);
>>                       andq(rsp, -16);
>>                       push(r10);
>>
>>                  before the push(r11) and with the
>>
>>                       pop(rsp);
>>
>>                  after the pop(r11). It should work on all x86_64 
>> platforms.
>>
>>
>>              Right, I will add this back to the code - and replace the
>>              meaningless // XXX with an actual comment!
>>
>>              Many thanks,
>>              David
>>              -----
>>
>>
>>                  FYI, there is another way to do the alignment, based on
>>         the fact
>>                  that
>>                  we are at least aligned on 8 bytes (see
>>                  MacroAssembler::call_VM_leaf_base). I assume that this
>>         second
>>                  version is
>>                  more efficient (particularly thanks to specilative
>>         branches) but it
>>                  might be safer/simpler to continue using andq in
>>         get_thread.
>>
>>                  Bertrand.
>>
>>
>>                      Thanks,
>>                      David
>>
>>                          Regards,
>>
>>                          Bertrand.
>>
>>
>>
>>                          On 23/11/2015 08:03, David Holmes wrote:
>>
>>                              After all the preliminary discussions here
>>         are final
>>                              proposed changes:
>>
>>                              bug:
>>         https://bugs.openjdk.java.net/browse/JDK-8132510
>>
>>                              Open webrev:
>>         http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>
>>                              A simple (in principle) but wide-ranging 
>> change
>>                              which should appeal to
>>                              our Code Deletion Engineer's. We implement
>>                              Thread::current() using a
>>                              compiler/language-based thread-local
>>         variable eg:
>>
>>                                 static __thread Thread *_thr_current;
>>
>>                                 inline Thread* Thread::current() {
>>                                   return _thr_current;
>>                                 }
>>
>>                              with an appropriate setter of course. By
>>         doing this
>>                              we can completely
>>                              remove the os_cpu-specific 
>> ThreadLocalStorage
>>                              implementations, and the
>>                              associated os::thread_local_storage* calls.
>>
>>                              As __thread is not async-signal-safe 
>> (either in
>>                              theory or practice) we
>>                              need a fallback library-based solution 
>> as used
>>                              today. For this we use a
>>                              very simple ThreadLocalStorage class and an
>>                              implementation thereof for
>>                              POSIX (covers AIX, BSD, Linux, OSX and
>>         Solaris) using
>>                              pthread_get/setspecific; and one for
>>         Windows using
>>                              its TLS library.
>>                              While these library routines are not 
>> guaranteed
>>                              async-signal-safe, they
>>                              seem to be in practice and are what we have
>>         been
>>                              using all along.
>>
>>                              We also allow for use of only the
>>         library-based TLS
>>                              for platforms where
>>                              compiler-based thread locals are not
>>         supported (as
>>                              will be needed in
>>                              the
>>                              Mobile project). This is enabled at build
>>         time by
>>                              defining
>>                              USE_LIBRARY_BASED_TLS_ONLY.
>>
>>                              Thanks to Andrew Haley for providing the
>>         Aarch64
>>                              code; and for Thomas
>>                              Stuefe for testing PPC and AIX.
>>
>>                              Testing:
>>                                 - JPRT (core platforms)
>>                                 - Jtreg tests (linux & solaris)
>>                                 - vm.runtime (core platforms)
>>
>>                              Performance:
>>                                 - still TBD - this is proving to be
>>         extremely
>>                              difficult. If anyone
>>                              has
>>                              any simple to run microbenchmarks to
>>         suggest I'd
>>                              give them a try as a
>>                              sanity check. But I lack access to 
>> hardware for
>>                              running serious
>>                              benchmarking.
>>
>>                              Footprint:
>>                              - varies by platform and the VM (server,
>>         client,
>>                              minimal)
>>                              - worst-case: ~1% increase for server VM
>>         and minimal VM
>>                              - best-case:  0.4% decrease for client VM
>>
>>                              Thanks,
>>                              David
>>
>>
>>
>>
>>
>>
>>
>>
>>     --
>>     Bertrand Delsart,                     Grenoble Engineering Center
>>     Oracle,         180 av. de l'Europe,          ZIRST de Montbonnot
>>     38330 Montbonnot Saint Martin, FRANCE
>>     bertrand.delsart at oracle.com <mailto:bertrand.delsart at oracle.com>
>>               Phone : +33 4 76 18 81 23
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>     NOTICE: This email message is for the sole use of the intended
>>     recipient(s) and may contain confidential and privileged
>>     information. Any unauthorized review, use, disclosure or
>>     distribution is prohibited. If you are not the intended recipient,
>>     please contact the sender by reply email and destroy all copies of
>>     the original message.
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>>