RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

Mon Nov 23 21:42:58 UTC 2015

On 23/11/2015 11:46 PM, Bertrand Delsart wrote:
> On 23/11/2015 13:27, David Holmes wrote:
>> Hi Bertrand,
>>
>> On 23/11/2015 8:14 PM, Bertrand Delsart wrote:
>>> Hi David,
>>>
>>> Overall looks good.
>>> Thanks for the #ifndef USE_LIBRARY_BASED_TLS_ONLY :-)
>>>
>>> One doubt, in case this has not been discussed before.
>>>
>>> I'm still catching up on x86_64 (after years of ARM assembly :-)) but it
>>> seems that there are some stack alignment constraints on runtime calls,
>>> at least for some x86_64 ABIs.
>>>
>>> Some of the x86 MacroAssembler::get_thread implementations had code to
>>> align the stack before calling pthread_getspecific. See for instance
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>>
>>>
>>
>> Sorry I'm not that familiar with the MacroAssembler - is it this odd
>> fragment:
>>
>> -   push(r10);
>> -   // XXX
>> -   mov(r10, rsp);
>> -   andq(rsp, -16);
>> -   push(r10);
>>
>> I'm not at all clear what that is doing - and if it somehow changes the
>> alignment wouldn't something need to be fixed up when popping the
>> previous values ??
>>
>> To be honest I'm not even sure what an "unaligned stack" means.
>
> On some platforms, SP may have to be aligned on a 16 bytes boundary when
> calling another method.

Okay - that seems to be an x64 ABI requirement based on other code in 
MacroAssembler. The reason it is missing in the Solaris code is because 
I already removed it in the earlier changes that were done on Solaris:

http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10

and nobody picked it up. :(

That said, the absence of this code has not caused any problems so 
presumably we are normally aligned.

> After having pushed what needed to be saved, the code above rounds SP
> down and saves the old value. It will then also push r11, which results
> in the expected alignment.
>
> The conterpart, after the VM call, is the:
>   pop(r11);
>   pop(rsp);

I had assumed this extra manipulation was related to pushing the arg for 
the call.

>>> This alignment is no longer performed in the new (shared) implementation
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>>
>>>
>>>
>>>
>>> Now, Solaris was not performing the alignment and Windows has a separate
>>> path for x86_64. Did we really need the alignment for linux x86_64 and
>>> bsd_x86_64 ? Might it be needed for other ports ?
>>>
>>> IMHO, it might be safer to align the stack by default, knowing it should
>>> not be expensive since we call get_thread rarely for x86_64 (result is
>>> cached in r15). I'll let you see whether it is worth adding an ifdef so
>>> as to explicitly deactivate the alignment on some platforms
>>> (solaris_x86_64 ?)
>>
>> I don't have enough knowledge to even begin to comment on this. I will
>> have to rely on our x86_64 macroassembler experts to explain the old
>> code and what the requirements are.
>
> OK. This means that the alignment was not removed purposefully.
>
> In that case, you must either keep the per platform code x86_64 (since
> it was not identical for all platforms) or use the safest version, with
> the additional
>
>     // XXX
>     mov(r10, rsp);
>     andq(rsp, -16);
>     push(r10);
>
> before the push(r11) and with the
>
>     pop(rsp);
>
> after the pop(r11). It should work on all x86_64 platforms.

Right, I will add this back to the code - and replace the meaningless // 
XXX with an actual comment!

Many thanks,
David
-----

> FYI, there is another way to do the alignment, based on the fact that
> we are at least aligned on 8 bytes (see
> MacroAssembler::call_VM_leaf_base). I assume that this second version is
> more efficient (particularly thanks to specilative branches) but it
> might be safer/simpler to continue using andq in get_thread.
>
> Bertrand.
>
>>
>> Thanks,
>> David
>>
>>> Regards,
>>>
>>> Bertrand.
>>>
>>>
>>>
>>> On 23/11/2015 08:03, David Holmes wrote:
>>>> After all the preliminary discussions here are final proposed changes:
>>>>
>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8132510
>>>>
>>>> Open webrev: http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>>>
>>>> A simple (in principle) but wide-ranging change which should appeal to
>>>> our Code Deletion Engineer's. We implement Thread::current() using a
>>>> compiler/language-based thread-local variable eg:
>>>>
>>>>   static __thread Thread *_thr_current;
>>>>
>>>>   inline Thread* Thread::current() {
>>>>     return _thr_current;
>>>>   }
>>>>
>>>> with an appropriate setter of course. By doing this we can completely
>>>> remove the os_cpu-specific ThreadLocalStorage implementations, and the
>>>> associated os::thread_local_storage* calls.
>>>>
>>>> As __thread is not async-signal-safe (either in theory or practice) we
>>>> need a fallback library-based solution as used today. For this we use a
>>>> very simple ThreadLocalStorage class and an implementation thereof for
>>>> POSIX (covers AIX, BSD, Linux, OSX and Solaris) using
>>>> pthread_get/setspecific; and one for Windows using its TLS library.
>>>> While these library routines are not guaranteed async-signal-safe, they
>>>> seem to be in practice and are what we have been using all along.
>>>>
>>>> We also allow for use of only the library-based TLS for platforms where
>>>> compiler-based thread locals are not supported (as will be needed in
>>>> the
>>>> Mobile project). This is enabled at build time by defining
>>>> USE_LIBRARY_BASED_TLS_ONLY.
>>>>
>>>> Thanks to Andrew Haley for providing the Aarch64 code; and for Thomas
>>>> Stuefe for testing PPC and AIX.
>>>>
>>>> Testing:
>>>>   - JPRT (core platforms)
>>>>   - Jtreg tests (linux & solaris)
>>>>   - vm.runtime (core platforms)
>>>>
>>>> Performance:
>>>>   - still TBD - this is proving to be extremely difficult. If anyone
>>>> has
>>>> any simple to run microbenchmarks to suggest I'd give them a try as a
>>>> sanity check. But I lack access to hardware for running serious
>>>> benchmarking.
>>>>
>>>> Footprint:
>>>> - varies by platform and the VM (server, client, minimal)
>>>> - worst-case: ~1% increase for server VM and minimal VM
>>>> - best-case:  0.4% decrease for client VM
>>>>
>>>> Thanks,
>>>> David
>>>
>>>
>
>