(L) Prelim RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

David Holmes david.holmes at oracle.com
Tue Nov 10 11:26:10 UTC 2015


Sorry the formatting of the replies is getting totally screwed up now :(

On 10/11/2015 8:20 PM, Thomas Stüfe wrote:
> Hi David,
>
> On Fri, Nov 6, 2015 at 11:20 PM, David Holmes
> <david.holmes at oracle.com> wrote:
>
>     On 6/11/2015 9:52 PM, Thomas Stüfe wrote:
>
>         Hi David,
>
>         On Fri, Nov 6, 2015 at 7:26 AM, David Holmes
>         <david.holmes at oracle.com> wrote:
>
>              Hi Jeremy,
>
>              Okay I have read:
>
>              https://sourceware.org/glibc/wiki/TLSandSignals
>
>              and the tree of mail postings referenced therefrom -
>              great reading! :)
>
>              So basic problem: access to __thread variables is not
>              async-signal-safe.
>
>              Exacerbator to problem: the first attempt to even read
>              a __thread variable can lead to allocation, which is
>              the real problem in relation to async-signal-safety.
>
>              I mention the exacerbator because pthread_getspecific
>              and pthread_setspecific are also not async-signal-safe
>              but we already use them. However, pthread_getspecific
>              is in fact (per the email threads linked above)
>              effectively async-signal-safe, and further a call to
>              pthread_getspecific never results in a call to
>              pthread_setspecific or an allocation. Hence the pthread
>              functions are almost, if not completely, safe in
>              practice with reasonable uses (ie only read from a
>              signal handler). Which explains this code in the
>              existing Thread::current():
>
>              #ifdef PARANOID
>                 // Signal handler should call
>         ThreadLocalStorage::get_thread_slow()
>                 Thread* t = ThreadLocalStorage::get_thread_slow();
>                 assert(t != NULL && !t->is_inside_signal_handler(),
>                        "Don't use Thread::current() inside signal handler");
>              #endif
>
>              So the problem scenario is: use of a __thread variable
>              (that belongs to the shared library) in a signal
>              handler.
>
>              Solution 0: don't do that. Seriously - like any other
>              async-signal-unsafe stuff we should not be using it in
>              real signal handlers. The crash handler is a different
>              matter - we try all sorts there because it might work
>              and you can't die twice.
>
>              Otherwise: narrow the window of exposure.
>
>              1. We ensure we initialize thread_current (even with a
>              dummy value) as early as possible in the thread that
>              loads libjvm. As we have no signal handlers installed
>              at that point that might use the same variable, we
>              cannot hit the problem scenario.
>
>              2. We ensure we initialize thread_current in a new
>              thread with all signals blocked (see the sketch below).
>              This again avoids the problem scenario.
>
>              3. We initialize thread_current in an attaching thread
>              as soon as possible and we again first block all
>              signals.
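>
>              A minimal sketch of the blocking in steps 2 and 3
>              (shape assumed, not the actual patch; needs <signal.h>
>              and <pthread.h>):
>
>                sigset_t all, saved;
>                sigfillset(&all);
>                pthread_sigmask(SIG_BLOCK, &all, &saved);
>                _thr_current = t;  // first TLS access - may allocate;
>                                   // t is the new/attaching Thread
>                pthread_sigmask(SIG_SETMASK, &saved, NULL);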
>
>              That still leaves the problem of an unattached native
>              thread taking a signal whilst in async-signal-unsafe
>              code, and executing a signal handler which in turn
>              tries to access thread_current for the first time. This
>              signal handler need not be an actual JVM handler, but
>              one attached by other native code, eg an agent. I'm not
>              clear in the latter case how reasonable it is for an
>              agent's handler to try and do things from an unattached
>              thread - and we don't claim any JNI interfaces can, or
>              should, be called from a signal handler - but it is
>              something you can probably get away with today.
>
>              Let me also point out that we effectively have this
>              code on Solaris already (at the ThreadLocalStorage
>              class level). So if there is something here that will
>              prevent the current proposal we already have a problem
>              on Solaris. :(
>
>              Thoughts/comments/suggestions?
>
>
>         The first problem: a thread initializes the TLS variable,
>         gets interrupted, and accesses the half-initialized variable
>         from within the signal handler. This could happen today too,
>         couldn't it? But I think we have never seen this.
>
>
>     That depends on the state of signal masks at the time of the
>     initialization. For threads created in the VM and for threads
>     attached to the VM it is likely not an issue. Unattached threads
>     could in theory try to access a TLS variable from a signal handler,
>     but they will never be initializing that variable. Of course the
>     unattached thread could be initializing a completely distinct TLS
>     variable, but reading a different TLS variable from the signal
>     handler does not seem to be an issue (in theory it may be but this
>     is an extreme corner case).
>
>         In theory, it could be mitigated by some careful testing
>         before using the Thread::current() value in the signal
>         handler. Like, put an eyecatcher at the beginning of the
>         Thread structure and check that using SafeFetch.
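>
>         A rough sketch of that idea (the eyecatcher field and magic
>         value are hypothetical, not existing hotspot code):
>
>             const int kEyecatcher = 0x4A564D54;  // hypothetical magic
>             Thread* t = Thread::current();
>             if (t != NULL && SafeFetch32((int*)t, 0) == kEyecatcher) {
>               // t at least points at readable memory that looks
>               // like an initialized Thread
>             }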
>
>
>     There is no way to access the Thread structure before calling
>     Thread::current(). And the potential problem is with unattached
>     threads which have no Thread structure. For threads attached to the
>     VM, or attaching, my three steps will deal with any potential problems.
>
>         As for the second problem - recursive malloc() deadlocks - I
>         am at a loss. I do not fully understand though why
>         pthread_getspecific is different - does it not have to
>         allocate space for the TLS variable too?
>
>
>     No, pthread_getspecific does not have to allocate. Presumably it is
>     written in a way that attempting to index a TLS variable that has
>     not been allocated just returns an error (EINVAL?).
>
>
> It would return NULL.
>
>     The problem with __thread is that even a read will attempt to do the
>     allocation - arguably (as the Google folk did argue) this is wrong,
>     or else should be done in an async-safe way.
>
>
> I looked up the implementation of pthread_getspecific and
> pthread_setspecific in glibc and now understand better why pthread
> TLS is considered safe here.
>
> glibc allows for 1024 TLS slots, which are organized as a 32x32
> sparse array whose second-level arrays are only allocated when the
> first slot in them is used by pthread_setspecific - so only when
> writing a slot. It also means that the number of allocation calls is
> smaller than with __thread: instead of (presumably) calling malloc()
> for every instance of __thread, it calls it at most 32 times. And
> the first 32 slots are already allocated within the pthread
> structure, so they cost nothing. This means that even if one were to
> write to a TLS slot in the signal handler, the chances of it calling
> malloc() are quite small.
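>
> For reference, a minimal sketch of the pthread TLS pattern being
> discussed (illustrative only, not the actual hotspot sources):
>
>     #include <pthread.h>
>
>     static pthread_key_t _thread_key;
>
>     void tls_init() {            // once, at VM startup
>       pthread_key_create(&_thread_key, NULL);
>     }
>     void set_thread(Thread* t) { // write: may allocate a slot block
>       pthread_setspecific(_thread_key, t);
>     }
>     Thread* get_thread() {       // read: no allocation, NULL if unset
>       return (Thread*) pthread_getspecific(_thread_key);
>     }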

__thread will only need to malloc (in the context we are discussing) 
when a pre-existing thread first references a TLS variable from a 
lazily loaded DSO. Otherwise the space for the DSO's TLS variables is 
allocated "statically" when the thread control structures are created.

>
>     This does leave me wondering exactly what effect the:
>
>     static __thread Thread* _thr_current = NULL;
>
>     has in terms of any per-thread allocation. ??
>
>     Anyway to reiterate the problem scenario:
>     - VM has been loaded in a process and signal handlers have been
>     installed (maybe VM, maybe agent)
>     - unattached thread is doing a malloc when it takes a signal
>     - signal handler tries to read __thread variable and we get a malloc
>     deadlock
>
>     As I said I need to determine what signal handlers in the VM might
>     ever run on an unattached thread, and what they might do.
>
>
> I don't understand - our signal handler is globally active, no? So any
> unattached thread may execute our signal handler at any time, and the
> first thing our signal handler does is Thread::current(). If there was a
> third-party signal handler, it is getting called as a chained
> handler, but
> only after our signal handler ran.

The current code uses ThreadLocalStorage::get_thread_slow() in the 
signal handler, which uses pthread_getspecific, which is "safe". The new 
code would access the __thread variable and have to malloc - which is 
unsafe.

Using the JDK launchers there are no unattached threads created before 
libjvm is loaded - so the problem would never arise.

I have been looking hard at our signal handlers though, and found that 
they don't seem to match how they are described in the various parts 
of the code that set the signal masks. My main concern is with 
process-directed signals (SIGQUIT, SIGTERM) that trigger specific 
actions (thread dump, orderly shutdown). Synchronous signals are not 
an issue - if an unattached thread triggers a SEGV while in malloc 
then the process is doomed anyway, and a deadlock is less problematic 
(not great, but hardly in the same league as deadlocking an active, 
fully working VM). But I'm still having trouble joining all the dots 
in this code and figuring out how an unattached thread might react 
today. I'll continue untangling this tomorrow.


> Thanks, Thomas
>
> (My current feeling is that I'd prefer to keep the pthread TLS solution
> but I like your simplifications to the code and would like to keep that
> too...)

It was all the complex, inconsistent caching mechanisms employed over 
the top of the library-based TLS that motivated the cleanup - 
especially as the cache on Solaris was shown to be broken. If it were 
just a simple library-based TLS layer, there would be less motivation 
to try the __thread approach - but __thread had the appeal of removing 
a lot of duplicated code. A simple library-based scheme might be an 
alternative if it is performant enough - but I'm not sure I have the 
time to go back to square one on this.

Thanks,
David

>     For a "third-party" signal handler there's really nothing I can do -
>     they should not be accessing the VM's __thread variables though (and
>     they can always introduce their own independent deadlocks by
>     performing non-async-safe actions).
>
>     Thanks,
>     David
>
>         Regards, Thomas
>
>
>              Thanks,
>              David
>
>
>              On 6/11/2015 1:09 PM, David Holmes wrote:
>
>                  Hi Jeremy,
>
>                  I was going to ask you to elaborate :)
>
>                  On 6/11/2015 12:24 PM, Jeremy Manson wrote:
>
>                      I should probably elaborate on this.  With
>                      glibc + ELF, the first time a thread accesses a
>                      variable declared __thread, if that variable is
>                      in a shared library (as opposed to the main
>                      executable), the system calls malloc() to
>                      allocate the space for it.  If that happens in
>                      a signal that is being delivered during a call
>                      to malloc(), then you usually get a crash.
>
>
>                  My understanding of the ELF ABI for thread-locals -
>                  which I read about in the Solaris 11.1 Linkers and
>                  Libraries Guide - does require use of the dynamic
>                  TLS model for any dynamically loaded shared object
>                  which defines a thread-local, but that is what we
>                  use as I understand it. The docs state:
>
>                  "A shared object containing only dynamic TLS can be
>                  loaded following process startup without
>                  limitations. The runtime linker extends the list of
>                  initialization records to include the
>                  initialization template of the new object. The new
>                  object is given an index of m = M + 1. The counter
>                  M is incremented by 1. However, the allocation of
>                  new TLS blocks is deferred until the blocks are
>                  actually referenced."
>
>                  Now I guess "extends the list" might be implemented
>                  using malloc ... but this will only occur in the
>                  main thread (the one started by the launcher to
>                  load the JVM and become the main thread), at the
>                  time libjvm is loaded - which will all be over
>                  before any agent etc can run and do anything. But
>                  "allocation ... is deferred" suggests we may have a
>                  problem until either the first call to
>                  Thread::current or the call to
>                  Thread::initialize_thread_current. If it is the
>                  former then that should occur well before any agent
>                  etc can be loaded. And I can easily inject an
>                  initial dummy call to
>                  initialize_thread_current(null) to force the TLS
>                  allocation.
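>
>                  A sketch of that dummy call (placement and
>                  signature assumed, not final code):
>
>                    // very early in VM startup, before any signal
>                    // handlers are installed:
>                    Thread::initialize_thread_current(NULL);
>
>                  which would force the thread's dynamic TLS block to
>                  be allocated up front.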
>
>                      This may bite you if AsyncGetCallTrace uses
>                      Thread::current(), and you use system timers to
>                      do profiling.  If a thread is doing a malloc()
>                      prior to the first time it accesses
>                      Thread::current(), and it gets delivered a
>                      signal, it might die.  This is especially
>                      likely for pure native threads started by
>                      native code.
>
>                      I believe that this is a use case you support,
>                      so you might want to make sure it is okay.
>
>
>                  For a VM embedded in a process which already
>                  contains native threads that will later attach to
>                  the VM, this may indeed be a problem. One would
>                  have hoped however that the implementation of TLS
>                  would be completely robust, at least for something
>                  as simple as getting a signal whilst in the
>                  allocator.
>
>                  I'm unclear how to test for or check for this kind
>                  of problem. Arguably there could be many things
>                  that are async-unsafe in this way.
>
>                  Need to think more about this and do some more
>                  research. Would also appreciate any insight from
>                  any glibc and/or ELF gurus.
>
>                  Thanks.
>                  David
>
>                      Jeremy
>
>                      On Thu, Nov 5, 2015 at 5:58 PM, Jeremy Manson
>                      <jeremymanson at google.com> wrote:
>
>                           Something that's bitten me with __thread:
>                           it isn't async-safe when called from a
>                           shared object on Linux.  Have you vetted
>                           to make sure this doesn't make HS less
>                           async-safe?
>
>                           Jeremy
>
>                           On Sun, Nov 1, 2015 at 10:40 PM, David
>                           Holmes <david.holmes at oracle.com> wrote:
>
>                               bug:
>                               https://bugs.openjdk.java.net/browse/JDK-8132510
>
>                               Open webrev:
>                               http://cr.openjdk.java.net/~dholmes/8132510/webrev.v2/
>
>                               A simple (in principle) but
>                               wide-ranging change which should
>                               appeal to our Code Deletion Engineers.
>                               We implement Thread::current() using a
>                               compiler/language-based thread-local
>                               variable, eg:
>
>
>                                 static __thread Thread *_thr_current;
>
>                                 inline Thread* Thread::current() {
>                                   return _thr_current;
>                                 }
>
>                               with an appropriate setter of course
>                               (see the sketch below). By doing this
>                               we can completely remove the
>                               platform-specific ThreadLocalStorage
>                               implementations, and the associated
>                               os::thread_local_storage* calls, plus
>                               all the uses of
>                               ThreadLocalStorage::thread() and
>                               ThreadLocalStorage::get_thread_slow().
>                               This extends the previous work done on
>                               Solaris to implement
>                               ThreadLocalStorage::thread() using
>                               compiler-based thread-locals.
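>
>                               A possible shape for that setter
>                               (sketch only, not the final patch):
>
>                                 static void
>                                 initialize_thread_current(Thread* t) {
>                                   _thr_current = t;
>                                 }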
>
>                               We can also consolidate nearly all the
>                               os_cpu versions of
>                               MacroAssembler::get_thread on x86 into
>                               one cpu-specific one (a special
>                               variant is still needed for 32-bit
>                               Windows).
>
>                               As a result of this change we have
>                               further potential cleanups:
>                               - all the
>                                 src/os/<os>/vm/thread_<os>.inline.hpp
>                                 files are now completely empty and
>                                 could also be removed
>                               - the MINIMIZE_RAM_USAGE define (which
>                                 avoids use of the Linux sp-map
>                                 "cache" on 32-bit) now has no effect
>                                 and so could be completely removed
>                                 from the build system
>
>                               I plan to do the MINIMIZE_RAM_USAGE
>                               removal as a follow-up CR, but could
>                               add the removal of the "inline" files
>                               to this CR if people think it worth
>                               removing them.
>
>                               I have one missing piece on AArch64 -
>                               I need to change
>                               MacroAssembler::get_thread to simply
>                               call Thread::current() as on other
>                               platforms, but I don't know how to
>                               write that. I would appreciate it if
>                               someone could give me the right code
>                               for that.
>
>                               I would also appreciate
>                               comments/testing by the AIX and PPC64
>                               folk as well.
>
>                               A concern about memory leaks had
>                               previously been raised, but
>                               experiments using simple C code on
>                               Linux x86 and Solaris showed no
>                               issues. Also note that AArch64 already
>                               uses this kind of thread-local.
>
>                               Thanks,
>                               David
>
>
>
>
>

