RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

Wed Nov 25 05:07:22 UTC 2015

Hi Dan,

Many thanks for the meticulous review!

On 25/11/2015 11:34 AM, Daniel D. Daugherty wrote:
> On 11/23/15 7:19 PM, David Holmes wrote:
>> Sorry about that:
>>
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v7/
>
> Please double check the copyright updates before you push.

Did you notice something missing? I use:

hg status -man | xargs grep Copyright | grep -v 2015

to find any copyright lines without 2015 present (it turns up the 
non-Oracle copyrights in a few files).

Trimming to files with comments ...

> src/os/windows/vm/os_windows.cpp
>      L5998:       os::os_exception_wrapper(
> (java_call_t)call_wrapper_dummy,
>          Wrong indent; should only be two spaces.
>          Please delete space after first '('.
>
>      L5999:                                 NULL, NULL, NULL, NULL);
>          Indent will have to change to match any fixes on previous line.

Fixed.

> src/os/windows/vm/os_windows.hpp
>      L127:   static inline void set_thread_ptr_offset( int offset ) {
>          Please delete space after '(' and before ')'.

Fixed (was a copy of code from now deleted file)

> src/share/vm/gc/parallel/gcTaskThread.cpp
>      No comments.
>
>      Note: blank first line in this file (not your change)

Fixed.

> src/share/vm/prims/jni.cpp
>      L4312:   // If the thread has already been detached the operations
> is a no-op
>          Typo: 'the operations is a' -> 'the operation is a'

Fixed

> src/share/vm/prims/jvmtiExport.cpp
>      L377:       JavaThread* current_thread = JavaThread::current();
>          Wrong indent; two less spaces

Fixed.

> src/share/vm/runtime/threadLocalStorage.hpp
>      L37: // Platforms without compiler-based TLS (ie __thread
> storage-class modifier)
>          Typo: 'ie' -> 'i.e.'

Fixed.

> src/share/vm/runtime/vm_operations.cpp
>
>      (old) L396:   Thread * thr_cur =
> ThreadLocalStorage::get_thread_slow();
>      (new) L396:   Thread * thr_cur = Thread::current();
>
>          I think this should be 'current_or_null()' like L374.

Actually L374 should just be Thread::current() as NULL is neither 
possible nor expected. These VM_Exit routines can only be called by an 
attached thread. E.g. set_vm_exited is only called from 
Threads::destroy_vm(), while wait_for_threads_in_native_to_block is only 
called on the VMThread.

> src/os/posix/vm/threadLocalStorage_posix.cpp
>      L38:     ThreadLocalStorage::set_thread((Thread*) p);
>          Wrong indent; should be two spaces.

Fixed.

> src/os/windows/vm/threadLocalStorage_windows.cpp
>      L54:   assert(current != 0 || GetLastError() == ERROR_SUCCESS, '
>          'current != 0' -> 'current != NULL'
>          jcheck will complain about the blank at the end of this line
>
>      L62:   assert(res, "TlsSetValue failed with error code: %lu",
> GetLastError());
>          jcheck will complain about the blank at the end of this line

Fixed.

Generated new webrev:

http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/

which has been rebased to current hs-rt content.

Thanks,
David

>
> Dan
>
>
>>
>> David
>>
>> On 24/11/2015 11:17 AM, David Holmes wrote:
>>> Updated webrev:
>>>
>>> cr.openjdk.java.net/~dholmes/8132510/webrev.v7/
>>>
>>> Only change is in macroAssembler_x86.cpp
>>>
>>> Thanks,
>>> David
>>>
>>> On 24/11/2015 7:42 AM, David Holmes wrote:
>>>> On 23/11/2015 11:46 PM, Bertrand Delsart wrote:
>>>>> On 23/11/2015 13:27, David Holmes wrote:
>>>>>> Hi Bertrand,
>>>>>>
>>>>>> On 23/11/2015 8:14 PM, Bertrand Delsart wrote:
>>>>>>> Hi David,
>>>>>>>
>>>>>>> Overall looks good.
>>>>>>> Thanks for the #ifndef USE_LIBRARY_BASED_TLS_ONLY :-)
>>>>>>>
>>>>>>> One doubt, in case this has not been discussed before.
>>>>>>>
>>>>>>> I'm still catching up on x86_64 (after years of ARM assembly :-))
>>>>>>> but it
>>>>>>> seems that there are some stack alignment constraints on runtime
>>>>>>> calls,
>>>>>>> at least for some x86_64 ABIs.
>>>>>>>
>>>>>>> Some of the x86 MacroAssembler::get_thread implementations had
>>>>>>> code to
>>>>>>> align the stack before calling pthread_getspecific. See for instance
>>>>>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Sorry I'm not that familiar with the MacroAssembler - is it this odd
>>>>>> fragment:
>>>>>>
>>>>>> -   push(r10);
>>>>>> -   // XXX
>>>>>> -   mov(r10, rsp);
>>>>>> -   andq(rsp, -16);
>>>>>> -   push(r10);
>>>>>>
>>>>>> I'm not at all clear what that is doing - and if it somehow
>>>>>> changes the
>>>>>> alignment wouldn't something need to be fixed up when popping the
>>>>>> previous values ??
>>>>>>
>>>>>> To be honest I'm not even sure what an "unaligned stack" means.
>>>>>
>>>>> On some platforms, SP may have to be aligned on a 16 bytes boundary
>>>>> when
>>>>> calling another method.
>>>>
>>>> Okay - that seems to be an x64 ABI requirement based on other code in
>>>> MacroAssembler. The reason it is missing in the Solaris code is because
>>>> I already removed it in the earlier changes that were done on Solaris:
>>>>
>>>> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>>>>
>>>> and nobody picked it up. :(
>>>>
>>>> That said, the absence of this code has not caused any problems so
>>>> presumably we are normally aligned.
>>>>
>>>>> After having pushed what needed to be saved, the code above rounds SP
>>>>> down and saves the old value. It will then also push r11, which
>>>>> results
>>>>> in the expected alignment.
>>>>>
>>>>> The conterpart, after the VM call, is the:
>>>>>   pop(r11);
>>>>>   pop(rsp);
>>>>
>>>> I had assumed this extra manipulation was related to pushing the arg
>>>> for
>>>> the call.
>>>>
>>>>>>> This alignment is no longer performed in the new (shared)
>>>>>>> implementation
>>>>>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Now, Solaris was not performing the alignment and Windows has a
>>>>>>> separate
>>>>>>> path for x86_64. Did we really need the alignment for linux
>>>>>>> x86_64 and
>>>>>>> bsd_x86_64 ? Might it be needed for other ports ?
>>>>>>>
>>>>>>> IMHO, it might be safer to align the stack by default, knowing it
>>>>>>> should
>>>>>>> not be expensive since we call get_thread rarely for x86_64
>>>>>>> (result is
>>>>>>> cached in r15). I'll let you see whether it is worth adding an
>>>>>>> ifdef so
>>>>>>> as to explicitly deactivate the alignment on some platforms
>>>>>>> (solaris_x86_64 ?)
>>>>>>
>>>>>> I don't have enough knowledge to even begin to comment on this. I
>>>>>> will
>>>>>> have to rely on our x86_64 macroassembler experts to explain the old
>>>>>> code and what the requirements are.
>>>>>
>>>>> OK. This means that the alignment was not removed purposefully.
>>>>>
>>>>> In that case, you must either keep the per platform code x86_64 (since
>>>>> it was not identical for all platforms) or use the safest version,
>>>>> with
>>>>> the additional
>>>>>
>>>>>     // XXX
>>>>>     mov(r10, rsp);
>>>>>     andq(rsp, -16);
>>>>>     push(r10);
>>>>>
>>>>> before the push(r11) and with the
>>>>>
>>>>>     pop(rsp);
>>>>>
>>>>> after the pop(r11). It should work on all x86_64 platforms.
>>>>
>>>> Right, I will add this back to the code - and replace the
>>>> meaningless //
>>>> XXX with an actual comment!
>>>>
>>>> Many thanks,
>>>> David
>>>> -----
>>>>
>>>>> FYI, there is another way to do the alignment, based on the fact that
>>>>> we are at least aligned on 8 bytes (see
>>>>> MacroAssembler::call_VM_leaf_base). I assume that this second
>>>>> version is
>>>>> more efficient (particularly thanks to specilative branches) but it
>>>>> might be safer/simpler to continue using andq in get_thread.
>>>>>
>>>>> Bertrand.
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Bertrand.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 23/11/2015 08:03, David Holmes wrote:
>>>>>>>> After all the preliminary discussions here are final proposed
>>>>>>>> changes:
>>>>>>>>
>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8132510
>>>>>>>>
>>>>>>>> Open webrev: http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>>>>>>>
>>>>>>>> A simple (in principle) but wide-ranging change which should
>>>>>>>> appeal to
>>>>>>>> our Code Deletion Engineer's. We implement Thread::current()
>>>>>>>> using a
>>>>>>>> compiler/language-based thread-local variable eg:
>>>>>>>>
>>>>>>>>   static __thread Thread *_thr_current;
>>>>>>>>
>>>>>>>>   inline Thread* Thread::current() {
>>>>>>>>     return _thr_current;
>>>>>>>>   }
>>>>>>>>
>>>>>>>> with an appropriate setter of course. By doing this we can
>>>>>>>> completely
>>>>>>>> remove the os_cpu-specific ThreadLocalStorage implementations, and
>>>>>>>> the
>>>>>>>> associated os::thread_local_storage* calls.
>>>>>>>>
>>>>>>>> As __thread is not async-signal-safe (either in theory or
>>>>>>>> practice) we
>>>>>>>> need a fallback library-based solution as used today. For this we
>>>>>>>> use a
>>>>>>>> very simple ThreadLocalStorage class and an implementation thereof
>>>>>>>> for
>>>>>>>> POSIX (covers AIX, BSD, Linux, OSX and Solaris) using
>>>>>>>> pthread_get/setspecific; and one for Windows using its TLS library.
>>>>>>>> While these library routines are not guaranteed async-signal-safe,
>>>>>>>> they
>>>>>>>> seem to be in practice and are what we have been using all along.
>>>>>>>>
>>>>>>>> We also allow for use of only the library-based TLS for platforms
>>>>>>>> where
>>>>>>>> compiler-based thread locals are not supported (as will be
>>>>>>>> needed in
>>>>>>>> the
>>>>>>>> Mobile project). This is enabled at build time by defining
>>>>>>>> USE_LIBRARY_BASED_TLS_ONLY.
>>>>>>>>
>>>>>>>> Thanks to Andrew Haley for providing the Aarch64 code; and for
>>>>>>>> Thomas
>>>>>>>> Stuefe for testing PPC and AIX.
>>>>>>>>
>>>>>>>> Testing:
>>>>>>>>   - JPRT (core platforms)
>>>>>>>>   - Jtreg tests (linux & solaris)
>>>>>>>>   - vm.runtime (core platforms)
>>>>>>>>
>>>>>>>> Performance:
>>>>>>>>   - still TBD - this is proving to be extremely difficult. If
>>>>>>>> anyone
>>>>>>>> has
>>>>>>>> any simple to run microbenchmarks to suggest I'd give them a try
>>>>>>>> as a
>>>>>>>> sanity check. But I lack access to hardware for running serious
>>>>>>>> benchmarking.
>>>>>>>>
>>>>>>>> Footprint:
>>>>>>>> - varies by platform and the VM (server, client, minimal)
>>>>>>>> - worst-case: ~1% increase for server VM and minimal VM
>>>>>>>> - best-case:  0.4% decrease for client VM
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>