RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

Tue Nov 24 01:17:02 UTC 2015

Updated webrev:

cr.openjdk.java.net/~dholmes/8132510/webrev.v7/

Only change is in macroAssembler_x86.cpp

Thanks,
David

On 24/11/2015 7:42 AM, David Holmes wrote:
> On 23/11/2015 11:46 PM, Bertrand Delsart wrote:
>> On 23/11/2015 13:27, David Holmes wrote:
>>> Hi Bertrand,
>>>
>>> On 23/11/2015 8:14 PM, Bertrand Delsart wrote:
>>>> Hi David,
>>>>
>>>> Overall looks good.
>>>> Thanks for the #ifndef USE_LIBRARY_BASED_TLS_ONLY :-)
>>>>
>>>> One doubt, in case this has not been discussed before.
>>>>
>>>> I'm still catching up on x86_64 (after years of ARM assembly :-))
>>>> but it
>>>> seems that there are some stack alignment constraints on runtime calls,
>>>> at least for some x86_64 ABIs.
>>>>
>>>> Some of the x86 MacroAssembler::get_thread implementations had code to
>>>> align the stack before calling pthread_getspecific. See for instance
>>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>>>
>>>>
>>>>
>>>
>>> Sorry I'm not that familiar with the MacroAssembler - is it this odd
>>> fragment:
>>>
>>> -   push(r10);
>>> -   // XXX
>>> -   mov(r10, rsp);
>>> -   andq(rsp, -16);
>>> -   push(r10);
>>>
>>> I'm not at all clear what that is doing - and if it somehow changes the
>>> alignment wouldn't something need to be fixed up when popping the
>>> previous values ??
>>>
>>> To be honest I'm not even sure what an "unaligned stack" means.
>>
>> On some platforms, SP may have to be aligned on a 16 bytes boundary when
>> calling another method.
>
> Okay - that seems to be an x64 ABI requirement based on other code in
> MacroAssembler. The reason it is missing in the Solaris code is because
> I already removed it in the earlier changes that were done on Solaris:
>
> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>
> and nobody picked it up. :(
>
> That said, the absence of this code has not caused any problems so
> presumably we are normally aligned.
>
>> After having pushed what needed to be saved, the code above rounds SP
>> down and saves the old value. It will then also push r11, which results
>> in the expected alignment.
>>
>> The conterpart, after the VM call, is the:
>>   pop(r11);
>>   pop(rsp);
>
> I had assumed this extra manipulation was related to pushing the arg for
> the call.
>
>>>> This alignment is no longer performed in the new (shared)
>>>> implementation
>>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Now, Solaris was not performing the alignment and Windows has a
>>>> separate
>>>> path for x86_64. Did we really need the alignment for linux x86_64 and
>>>> bsd_x86_64 ? Might it be needed for other ports ?
>>>>
>>>> IMHO, it might be safer to align the stack by default, knowing it
>>>> should
>>>> not be expensive since we call get_thread rarely for x86_64 (result is
>>>> cached in r15). I'll let you see whether it is worth adding an ifdef so
>>>> as to explicitly deactivate the alignment on some platforms
>>>> (solaris_x86_64 ?)
>>>
>>> I don't have enough knowledge to even begin to comment on this. I will
>>> have to rely on our x86_64 macroassembler experts to explain the old
>>> code and what the requirements are.
>>
>> OK. This means that the alignment was not removed purposefully.
>>
>> In that case, you must either keep the per platform code x86_64 (since
>> it was not identical for all platforms) or use the safest version, with
>> the additional
>>
>>     // XXX
>>     mov(r10, rsp);
>>     andq(rsp, -16);
>>     push(r10);
>>
>> before the push(r11) and with the
>>
>>     pop(rsp);
>>
>> after the pop(r11). It should work on all x86_64 platforms.
>
> Right, I will add this back to the code - and replace the meaningless //
> XXX with an actual comment!
>
> Many thanks,
> David
> -----
>
>> FYI, there is another way to do the alignment, based on the fact that
>> we are at least aligned on 8 bytes (see
>> MacroAssembler::call_VM_leaf_base). I assume that this second version is
>> more efficient (particularly thanks to specilative branches) but it
>> might be safer/simpler to continue using andq in get_thread.
>>
>> Bertrand.
>>
>>>
>>> Thanks,
>>> David
>>>
>>>> Regards,
>>>>
>>>> Bertrand.
>>>>
>>>>
>>>>
>>>> On 23/11/2015 08:03, David Holmes wrote:
>>>>> After all the preliminary discussions here are final proposed changes:
>>>>>
>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8132510
>>>>>
>>>>> Open webrev: http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>>>>
>>>>> A simple (in principle) but wide-ranging change which should appeal to
>>>>> our Code Deletion Engineer's. We implement Thread::current() using a
>>>>> compiler/language-based thread-local variable eg:
>>>>>
>>>>>   static __thread Thread *_thr_current;
>>>>>
>>>>>   inline Thread* Thread::current() {
>>>>>     return _thr_current;
>>>>>   }
>>>>>
>>>>> with an appropriate setter of course. By doing this we can completely
>>>>> remove the os_cpu-specific ThreadLocalStorage implementations, and the
>>>>> associated os::thread_local_storage* calls.
>>>>>
>>>>> As __thread is not async-signal-safe (either in theory or practice) we
>>>>> need a fallback library-based solution as used today. For this we
>>>>> use a
>>>>> very simple ThreadLocalStorage class and an implementation thereof for
>>>>> POSIX (covers AIX, BSD, Linux, OSX and Solaris) using
>>>>> pthread_get/setspecific; and one for Windows using its TLS library.
>>>>> While these library routines are not guaranteed async-signal-safe,
>>>>> they
>>>>> seem to be in practice and are what we have been using all along.
>>>>>
>>>>> We also allow for use of only the library-based TLS for platforms
>>>>> where
>>>>> compiler-based thread locals are not supported (as will be needed in
>>>>> the
>>>>> Mobile project). This is enabled at build time by defining
>>>>> USE_LIBRARY_BASED_TLS_ONLY.
>>>>>
>>>>> Thanks to Andrew Haley for providing the Aarch64 code; and for Thomas
>>>>> Stuefe for testing PPC and AIX.
>>>>>
>>>>> Testing:
>>>>>   - JPRT (core platforms)
>>>>>   - Jtreg tests (linux & solaris)
>>>>>   - vm.runtime (core platforms)
>>>>>
>>>>> Performance:
>>>>>   - still TBD - this is proving to be extremely difficult. If anyone
>>>>> has
>>>>> any simple to run microbenchmarks to suggest I'd give them a try as a
>>>>> sanity check. But I lack access to hardware for running serious
>>>>> benchmarking.
>>>>>
>>>>> Footprint:
>>>>> - varies by platform and the VM (server, client, minimal)
>>>>> - worst-case: ~1% increase for server VM and minimal VM
>>>>> - best-case:  0.4% decrease for client VM
>>>>>
>>>>> Thanks,
>>>>> David
>>>>
>>>>
>>
>>