RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables

David Holmes david.holmes at oracle.com
Tue Nov 24 02:19:28 UTC 2015


Sorry about that:

http://cr.openjdk.java.net/~dholmes/8132510/webrev.v7/

David

On 24/11/2015 11:17 AM, David Holmes wrote:
> Updated webrev:
>
> cr.openjdk.java.net/~dholmes/8132510/webrev.v7/
>
> Only change is in macroAssembler_x86.cpp
>
> Thanks,
> David
>
> On 24/11/2015 7:42 AM, David Holmes wrote:
>> On 23/11/2015 11:46 PM, Bertrand Delsart wrote:
>>> On 23/11/2015 13:27, David Holmes wrote:
>>>> Hi Bertrand,
>>>>
>>>> On 23/11/2015 8:14 PM, Bertrand Delsart wrote:
>>>>> Hi David,
>>>>>
>>>>> Overall looks good.
>>>>> Thanks for the #ifndef USE_LIBRARY_BASED_TLS_ONLY :-)
>>>>>
>>>>> One doubt, in case this has not been discussed before.
>>>>>
>>>>> I'm still catching up on x86_64 (after years of ARM assembly :-))
>>>>> but it
>>>>> seems that there are some stack alignment constraints on runtime
>>>>> calls,
>>>>> at least for some x86_64 ABIs.
>>>>>
>>>>> Some of the x86 MacroAssembler::get_thread implementations had code to
>>>>> align the stack before calling pthread_getspecific. See for instance
>>>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> Sorry I'm not that familiar with the MacroAssembler - is it this odd
>>>> fragment:
>>>>
>>>> -   push(r10);
>>>> -   // XXX
>>>> -   mov(r10, rsp);
>>>> -   andq(rsp, -16);
>>>> -   push(r10);
>>>>
>>>> I'm not at all clear what that is doing - and if it somehow changes the
>>>> alignment wouldn't something need to be fixed up when popping the
>>>> previous values ??
>>>>
>>>> To be honest I'm not even sure what an "unaligned stack" means.
>>>
>>> On some platforms, SP may have to be aligned on a 16 bytes boundary when
>>> calling another method.
>>
>> Okay - that seems to be an x64 ABI requirement based on other code in
>> MacroAssembler. The reason it is missing in the Solaris code is because
>> I already removed it in the earlier changes that were done on Solaris:
>>
>> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>>
>> and nobody picked it up. :(
>>
>> That said, the absence of this code has not caused any problems so
>> presumably we are normally aligned.
>>
>>> After having pushed what needed to be saved, the code above rounds SP
>>> down and saves the old value. It will then also push r11, which results
>>> in the expected alignment.
>>>
>>> The conterpart, after the VM call, is the:
>>>   pop(r11);
>>>   pop(rsp);
>>
>> I had assumed this extra manipulation was related to pushing the arg for
>> the call.
>>
>>>>> This alignment is no longer performed in the new (shared)
>>>>> implementation
>>>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Now, Solaris was not performing the alignment and Windows has a
>>>>> separate
>>>>> path for x86_64. Did we really need the alignment for linux x86_64 and
>>>>> bsd_x86_64 ? Might it be needed for other ports ?
>>>>>
>>>>> IMHO, it might be safer to align the stack by default, knowing it
>>>>> should
>>>>> not be expensive since we call get_thread rarely for x86_64 (result is
>>>>> cached in r15). I'll let you see whether it is worth adding an
>>>>> ifdef so
>>>>> as to explicitly deactivate the alignment on some platforms
>>>>> (solaris_x86_64 ?)
>>>>
>>>> I don't have enough knowledge to even begin to comment on this. I will
>>>> have to rely on our x86_64 macroassembler experts to explain the old
>>>> code and what the requirements are.
>>>
>>> OK. This means that the alignment was not removed purposefully.
>>>
>>> In that case, you must either keep the per platform code x86_64 (since
>>> it was not identical for all platforms) or use the safest version, with
>>> the additional
>>>
>>>     // XXX
>>>     mov(r10, rsp);
>>>     andq(rsp, -16);
>>>     push(r10);
>>>
>>> before the push(r11) and with the
>>>
>>>     pop(rsp);
>>>
>>> after the pop(r11). It should work on all x86_64 platforms.
>>
>> Right, I will add this back to the code - and replace the meaningless //
>> XXX with an actual comment!
>>
>> Many thanks,
>> David
>> -----
>>
>>> FYI, there is another way to do the alignment, based on the fact that
>>> we are at least aligned on 8 bytes (see
>>> MacroAssembler::call_VM_leaf_base). I assume that this second version is
>>> more efficient (particularly thanks to specilative branches) but it
>>> might be safer/simpler to continue using andq in get_thread.
>>>
>>> Bertrand.
>>>
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>> Regards,
>>>>>
>>>>> Bertrand.
>>>>>
>>>>>
>>>>>
>>>>> On 23/11/2015 08:03, David Holmes wrote:
>>>>>> After all the preliminary discussions here are final proposed
>>>>>> changes:
>>>>>>
>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8132510
>>>>>>
>>>>>> Open webrev: http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>>>>>
>>>>>> A simple (in principle) but wide-ranging change which should
>>>>>> appeal to
>>>>>> our Code Deletion Engineer's. We implement Thread::current() using a
>>>>>> compiler/language-based thread-local variable eg:
>>>>>>
>>>>>>   static __thread Thread *_thr_current;
>>>>>>
>>>>>>   inline Thread* Thread::current() {
>>>>>>     return _thr_current;
>>>>>>   }
>>>>>>
>>>>>> with an appropriate setter of course. By doing this we can completely
>>>>>> remove the os_cpu-specific ThreadLocalStorage implementations, and
>>>>>> the
>>>>>> associated os::thread_local_storage* calls.
>>>>>>
>>>>>> As __thread is not async-signal-safe (either in theory or
>>>>>> practice) we
>>>>>> need a fallback library-based solution as used today. For this we
>>>>>> use a
>>>>>> very simple ThreadLocalStorage class and an implementation thereof
>>>>>> for
>>>>>> POSIX (covers AIX, BSD, Linux, OSX and Solaris) using
>>>>>> pthread_get/setspecific; and one for Windows using its TLS library.
>>>>>> While these library routines are not guaranteed async-signal-safe,
>>>>>> they
>>>>>> seem to be in practice and are what we have been using all along.
>>>>>>
>>>>>> We also allow for use of only the library-based TLS for platforms
>>>>>> where
>>>>>> compiler-based thread locals are not supported (as will be needed in
>>>>>> the
>>>>>> Mobile project). This is enabled at build time by defining
>>>>>> USE_LIBRARY_BASED_TLS_ONLY.
>>>>>>
>>>>>> Thanks to Andrew Haley for providing the Aarch64 code; and for Thomas
>>>>>> Stuefe for testing PPC and AIX.
>>>>>>
>>>>>> Testing:
>>>>>>   - JPRT (core platforms)
>>>>>>   - Jtreg tests (linux & solaris)
>>>>>>   - vm.runtime (core platforms)
>>>>>>
>>>>>> Performance:
>>>>>>   - still TBD - this is proving to be extremely difficult. If anyone
>>>>>> has
>>>>>> any simple to run microbenchmarks to suggest I'd give them a try as a
>>>>>> sanity check. But I lack access to hardware for running serious
>>>>>> benchmarking.
>>>>>>
>>>>>> Footprint:
>>>>>> - varies by platform and the VM (server, client, minimal)
>>>>>> - worst-case: ~1% increase for server VM and minimal VM
>>>>>> - best-case:  0.4% decrease for client VM
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>
>>>>>
>>>
>>>


More information about the hotspot-dev mailing list