RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables
Thomas Stüfe
thomas.stuefe at gmail.com
Wed Nov 25 10:46:24 UTC 2015
On Wed, Nov 25, 2015 at 10:22 AM, Bertrand Delsart <
bertrand.delsart at oracle.com> wrote:
> Good finding Thomas,
>
> On 25/11/2015 09:40, Thomas Stüfe wrote:
>
>> Hi David, Bertrand,
>>
>> about
>>
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>>
>> I am not sure this is correct for Windows x64.
>>
>
> Unless I'm misreading it, the new code looks a lot like the old one
> (including the alignement part). Hence this does not look like a regression
> caused by David's change.
>
>
Yes, this seems like an old bug.
I also see that this is fixed in our codebase since 2010, exactly the way
you propose (with MacroAssembler
<http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot>::
call_VM_leaf_base
<http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot>).
So, it is already used since five years. I would propose opening a separate
bug for this one and we contribute our fix.
Regards, Thomas
>
>> https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>>
>> windows x64 Abi requires the register parameter area right above the
>> return address on the stack. It is not guaranteed to be preserved during
>> the function call:
>>
>> "Even if the called function has fewer than 4 parameters, these 4 stack
>> locations are effectively owned by the called function, and may be used
>> by the called function for other purposes besides saving parameter
>> register values. Thus the caller may not save information in this region
>> of stack across a function call."
>>
>> So, restoring parameters from it may not work if Thread::current() uses
>> this area for any reason. In this case, r9, rsp, r10, r11 may be
>> overwritten by the callee.
>>
>> I also think Windows x64 ABI is different enough from Linux that this
>> function could live somewhere in <os>_<cpu>.
>>
>
> This is a valid point. However, note that this is similar to the issue we
> have for instance in 'shared' MacroAssembler::call_VM_leaf_base, which was
> not put in <os>_<cpu> files.
>
> In fact, the right fix may be to use call_VM_leaf instead of call :-)
>
> (alternatively we could use "#ifdef _WIN64" or check
> "frame::arg_reg_save_area_bytes != 0", which may be even cleaner and more
> portable)
>
> All these solutions are fine for me.
>
>
> Since we are near to feature freeze, this should probably be done in a
> separate CR (unless David has enough time to test it).
>
> Regards,
>
> Bertrand.
>
>
>> Kind Regards, Thomas
>>
>>
>>
>>
>> On Mon, Nov 23, 2015 at 10:42 PM, David Holmes <david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>> wrote:
>>
>> On 23/11/2015 11:46 PM, Bertrand Delsart wrote:
>>
>> On 23/11/2015 13:27, David Holmes wrote:
>>
>> Hi Bertrand,
>>
>> On 23/11/2015 8:14 PM, Bertrand Delsart wrote:
>>
>> Hi David,
>>
>> Overall looks good.
>> Thanks for the #ifndef USE_LIBRARY_BASED_TLS_ONLY :-)
>>
>> One doubt, in case this has not been discussed before.
>>
>> I'm still catching up on x86_64 (after years of ARM
>> assembly :-)) but it
>> seems that there are some stack alignment constraints on
>> runtime calls,
>> at least for some x86_64 ABIs.
>>
>> Some of the x86 MacroAssembler::get_thread
>> implementations had code to
>> align the stack before calling pthread_getspecific. See
>> for instance
>>
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>
>>
>>
>> Sorry I'm not that familiar with the MacroAssembler - is it
>> this odd
>> fragment:
>>
>> - push(r10);
>> - // XXX
>> - mov(r10, rsp);
>> - andq(rsp, -16);
>> - push(r10);
>>
>> I'm not at all clear what that is doing - and if it somehow
>> changes the
>> alignment wouldn't something need to be fixed up when
>> popping the
>> previous values ??
>>
>> To be honest I'm not even sure what an "unaligned stack"
>> means.
>>
>>
>> On some platforms, SP may have to be aligned on a 16 bytes
>> boundary when
>> calling another method.
>>
>>
>> Okay - that seems to be an x64 ABI requirement based on other code
>> in MacroAssembler. The reason it is missing in the Solaris code is
>> because I already removed it in the earlier changes that were done
>> on Solaris:
>>
>> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>>
>> and nobody picked it up. :(
>>
>> That said, the absence of this code has not caused any problems so
>> presumably we are normally aligned.
>>
>> After having pushed what needed to be saved, the code above
>> rounds SP
>> down and saves the old value. It will then also push r11, which
>> results
>> in the expected alignment.
>>
>> The conterpart, after the VM call, is the:
>> pop(r11);
>> pop(rsp);
>>
>>
>> I had assumed this extra manipulation was related to pushing the arg
>> for the call.
>>
>> This alignment is no longer performed in the new
>> (shared) implementation
>>
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>
>>
>>
>>
>> Now, Solaris was not performing the alignment and
>> Windows has a separate
>> path for x86_64. Did we really need the alignment for
>> linux x86_64 and
>> bsd_x86_64 ? Might it be needed for other ports ?
>>
>> IMHO, it might be safer to align the stack by default,
>> knowing it should
>> not be expensive since we call get_thread rarely for
>> x86_64 (result is
>> cached in r15). I'll let you see whether it is worth
>> adding an ifdef so
>> as to explicitly deactivate the alignment on some
>> platforms
>> (solaris_x86_64 ?)
>>
>>
>> I don't have enough knowledge to even begin to comment on
>> this. I will
>> have to rely on our x86_64 macroassembler experts to explain
>> the old
>> code and what the requirements are.
>>
>>
>> OK. This means that the alignment was not removed purposefully.
>>
>> In that case, you must either keep the per platform code x86_64
>> (since
>> it was not identical for all platforms) or use the safest
>> version, with
>> the additional
>>
>> // XXX
>> mov(r10, rsp);
>> andq(rsp, -16);
>> push(r10);
>>
>> before the push(r11) and with the
>>
>> pop(rsp);
>>
>> after the pop(r11). It should work on all x86_64 platforms.
>>
>>
>> Right, I will add this back to the code - and replace the
>> meaningless // XXX with an actual comment!
>>
>> Many thanks,
>> David
>> -----
>>
>>
>> FYI, there is another way to do the alignment, based on the fact
>> that
>> we are at least aligned on 8 bytes (see
>> MacroAssembler::call_VM_leaf_base). I assume that this second
>> version is
>> more efficient (particularly thanks to specilative branches) but
>> it
>> might be safer/simpler to continue using andq in get_thread.
>>
>> Bertrand.
>>
>>
>> Thanks,
>> David
>>
>> Regards,
>>
>> Bertrand.
>>
>>
>>
>> On 23/11/2015 08:03, David Holmes wrote:
>>
>> After all the preliminary discussions here are final
>> proposed changes:
>>
>> bug: https://bugs.openjdk.java.net/browse/JDK-8132510
>>
>> Open webrev:
>>
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>
>> A simple (in principle) but wide-ranging change
>> which should appeal to
>> our Code Deletion Engineer's. We implement
>> Thread::current() using a
>> compiler/language-based thread-local variable eg:
>>
>> static __thread Thread *_thr_current;
>>
>> inline Thread* Thread::current() {
>> return _thr_current;
>> }
>>
>> with an appropriate setter of course. By doing this
>> we can completely
>> remove the os_cpu-specific ThreadLocalStorage
>> implementations, and the
>> associated os::thread_local_storage* calls.
>>
>> As __thread is not async-signal-safe (either in
>> theory or practice) we
>> need a fallback library-based solution as used
>> today. For this we use a
>> very simple ThreadLocalStorage class and an
>> implementation thereof for
>> POSIX (covers AIX, BSD, Linux, OSX and Solaris) using
>> pthread_get/setspecific; and one for Windows using
>> its TLS library.
>> While these library routines are not guaranteed
>> async-signal-safe, they
>> seem to be in practice and are what we have been
>> using all along.
>>
>> We also allow for use of only the library-based TLS
>> for platforms where
>> compiler-based thread locals are not supported (as
>> will be needed in
>> the
>> Mobile project). This is enabled at build time by
>> defining
>> USE_LIBRARY_BASED_TLS_ONLY.
>>
>> Thanks to Andrew Haley for providing the Aarch64
>> code; and for Thomas
>> Stuefe for testing PPC and AIX.
>>
>> Testing:
>> - JPRT (core platforms)
>> - Jtreg tests (linux & solaris)
>> - vm.runtime (core platforms)
>>
>> Performance:
>> - still TBD - this is proving to be extremely
>> difficult. If anyone
>> has
>> any simple to run microbenchmarks to suggest I'd
>> give them a try as a
>> sanity check. But I lack access to hardware for
>> running serious
>> benchmarking.
>>
>> Footprint:
>> - varies by platform and the VM (server, client,
>> minimal)
>> - worst-case: ~1% increase for server VM and minimal
>> VM
>> - best-case: 0.4% decrease for client VM
>>
>> Thanks,
>> David
>>
>>
>>
>>
>>
>>
>>
>
> --
> Bertrand Delsart, Grenoble Engineering Center
> Oracle, 180 av. de l'Europe, ZIRST de Montbonnot
> 38330 Montbonnot Saint Martin, FRANCE
> bertrand.delsart at oracle.com Phone : +33 4 76 18 81 23
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> NOTICE: This email message is for the sole use of the intended
> recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or
> distribution is prohibited. If you are not the intended recipient,
> please contact the sender by reply email and destroy all copies of
> the original message.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
More information about the hotspot-dev
mailing list