RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables
Daniel D. Daugherty
daniel.daugherty at oracle.com
Wed Nov 25 23:08:17 UTC 2015
If I understand the problem... :-)
The Win-X64 ABI requires that some extra space be allocated on
the stack in the caller for "homing" four registers...
This code uses MacroAssembler::call() which is a lower level
call() function that assumes you know the ABI...
9765 void MacroAssembler::get_thread(Register thread) {
<snip>
9784 call(RuntimeAddress(CAST_FROM_FN_PTR(address, Thread::current)));
If you switch to MacroAssembler::call_VM_leaf_base() it handles
more of the gory details for you than MacroAssembler::call():
src/cpu/x86/vm/macroAssembler_x86.cpp:
524 void MacroAssembler::call_VM_leaf_base(address entry_point, int
num_args
) {
525 Label L, E;
526
527 #ifdef _WIN64
528 // Windows always allocates space for it's register args
529 assert(num_args <= 4, "only register arguments supported");
530 subq(rsp, frame::arg_reg_save_area_bytes);
531 #endif
532
533 // Align stack if necessary
534 testl(rsp, 15);
535 jcc(Assembler::zero, L);
536
537 subq(rsp, 8);
538 {
539 call(RuntimeAddress(entry_point));
540 }
So call_VM_leaf_base() allocates the "homing" space for register
args for Win-X64 on L530 and it handles any stack alignment before
calling MacroAssembler::call().
So if you switch to call_VM_leaf_base() you can drop
the stack alignment code:
9777 // ensure stack alignment
9778 mov(r10, rsp);
9779 andq(rsp, -16);
9780 push(r10);
and just switch to:
9777 push(rsp);
And you won't be exposed to potential saved register
overwrite by the "homing" of any registers in the call
Thread::current().
I _think_ I have that all correct... :-)
Dan
On 11/25/15 3:52 PM, David Holmes wrote:
> On 26/11/2015 2:53 AM, Thomas Stüfe wrote:
>> I created a bug to track this:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8144059
>
> Thanks. I admit I still don't really grok the exact problem here. It
> is the alignment code that is the problem right?
>
> 79 mov(r10, rsp);
> 80 andq(rsp, -16);
> 81 push(r10);
>
> ?
>
> Thanks,
> David
>
>> Because Davids change is not submitted yet, the bug refers to the old
>> coding, but stays relevant also after this fix.
>>
>> Kind Regards, Thomas
>>
>> On Wed, Nov 25, 2015 at 3:52 PM, Daniel D. Daugherty
>> <daniel.daugherty at oracle.com <mailto:daniel.daugherty at oracle.com>>
>> wrote:
>>
>> On 11/25/15 6:10 AM, David Holmes wrote:
>>
>> On 25/11/2015 8:46 PM, Thomas Stüfe wrote:
>>
>>
>> On Wed, Nov 25, 2015 at 10:22 AM, Bertrand Delsart
>> <bertrand.delsart at oracle.com
>> <mailto:bertrand.delsart at oracle.com>
>> <mailto:bertrand.delsart at oracle.com
>> <mailto:bertrand.delsart at oracle.com>>> wrote:
>>
>> Good finding Thomas,
>>
>> On 25/11/2015 09:40, Thomas Stüfe wrote:
>>
>> Hi David, Bertrand,
>>
>> about
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>>
>> I am not sure this is correct for Windows x64.
>>
>>
>> Unless I'm misreading it, the new code looks a lot like
>> the old one
>> (including the alignement part). Hence this does not
>> look like a
>> regression caused by David's change.
>>
>>
>> Yes, this seems like an old bug.
>>
>> I also see that this is fixed in our codebase since 2010,
>> exactly the
>> way you propose (with MacroAssembler
>> <http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot>::call_VM_leaf_base
>>
>> <http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot>).
>>
>> So, it is already used since five years. I would propose
>> opening a
>> separate bug for this one and we contribute our fix.
>>
>>
>> That sounds good to me too. I also suspect additional code will
>> need to be checked as this stack alignment is performed in a few
>> places in ./cpu/x86/vm/macroAssembler_x86.cpp with no OS
>> specific differences being considered.
>>
>>
>> Looks like MacroAssembler::call_VM_leaf_base() does the "right
>> thing"
>> for stack alignment in the X64 case. Perhaps we need to audit both
>> stack alignment _and_ uses of MacroAssembler::call() just to make
>> sure...
>>
>> Did I mention that all the assembly code stuff gives me a
>> headache? :-)
>>
>> Dan
>>
>>
>>
>>
>> Thanks,
>> David
>>
>> Regards, Thomas
>>
>>
>> https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>>
>> windows x64 Abi requires the register parameter
>> area right above the
>> return address on the stack. It is not
>> guaranteed to be
>> preserved during
>> the function call:
>>
>> "Even if the called function has fewer than 4
>> parameters, these
>> 4 stack
>> locations are effectively owned by the called
>> function, and may
>> be used
>> by the called function for other purposes besides
>> saving parameter
>> register values. Thus the caller may not save
>> information in
>> this region
>> of stack across a function call."
>>
>> So, restoring parameters from it may not work if
>> Thread::current() uses
>> this area for any reason. In this case, r9, rsp,
>> r10, r11 may be
>> overwritten by the callee.
>>
>> I also think Windows x64 ABI is different enough
>> from Linux that
>> this
>> function could live somewhere in <os>_<cpu>.
>>
>>
>> This is a valid point. However, note that this is
>> similar to the
>> issue we have for instance in 'shared'
>> MacroAssembler::call_VM_leaf_base, which was not put in
>> <os>_<cpu>
>> files.
>>
>> In fact, the right fix may be to use call_VM_leaf
>> instead of call :-)
>>
>> (alternatively we could use "#ifdef _WIN64" or check
>> "frame::arg_reg_save_area_bytes != 0", which may be
>> even cleaner and
>> more portable)
>>
>> All these solutions are fine for me.
>>
>>
>> Since we are near to feature freeze, this should
>> probably be done in
>> a separate CR (unless David has enough time to test
>> it).
>>
>> Regards,
>>
>> Bertrand.
>>
>>
>> Kind Regards, Thomas
>>
>>
>>
>>
>> On Mon, Nov 23, 2015 at 10:42 PM, David Holmes
>> <david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>
>> <mailto:david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>>
>> <mailto:david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>
>> <mailto:david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>>>> wrote:
>>
>> On 23/11/2015 11:46 PM, Bertrand Delsart
>> wrote:
>>
>> On 23/11/2015 13:27, David Holmes wrote:
>>
>> Hi Bertrand,
>>
>> On 23/11/2015 8:14 PM, Bertrand
>> Delsart wrote:
>>
>> Hi David,
>>
>> Overall looks good.
>> Thanks for the #ifndef
>> USE_LIBRARY_BASED_TLS_ONLY :-)
>>
>> One doubt, in case this has not
>> been discussed
>> before.
>>
>> I'm still catching up on x86_64
>> (after years of ARM
>> assembly :-)) but it
>> seems that there are some stack
>> alignment
>> constraints on
>> runtime calls,
>> at least for some x86_64 ABIs.
>>
>> Some of the x86
>> MacroAssembler::get_thread
>> implementations had code to
>> align the stack before calling
>> pthread_getspecific. See
>> for instance
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>
>>
>>
>> Sorry I'm not that familiar with the
>> MacroAssembler
>> - is it
>> this odd
>> fragment:
>>
>> - push(r10);
>> - // XXX
>> - mov(r10, rsp);
>> - andq(rsp, -16);
>> - push(r10);
>>
>> I'm not at all clear what that is
>> doing - and if it
>> somehow
>> changes the
>> alignment wouldn't something need to
>> be fixed up when
>> popping the
>> previous values ??
>>
>> To be honest I'm not even sure what an
>> "unaligned
>> stack" means.
>>
>>
>> On some platforms, SP may have to be
>> aligned on a 16 bytes
>> boundary when
>> calling another method.
>>
>>
>> Okay - that seems to be an x64 ABI requirement
>> based on
>> other code
>> in MacroAssembler. The reason it is missing in
>> the Solaris
>> code is
>> because I already removed it in the earlier
>> changes that
>> were done
>> on Solaris:
>>
>> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>>
>> and nobody picked it up. :(
>>
>> That said, the absence of this code has not
>> caused any
>> problems so
>> presumably we are normally aligned.
>>
>> After having pushed what needed to be
>> saved, the code above
>> rounds SP
>> down and saves the old value. It will then
>> also push
>> r11, which
>> results
>> in the expected alignment.
>>
>> The conterpart, after the VM call, is the:
>> pop(r11);
>> pop(rsp);
>>
>>
>> I had assumed this extra manipulation was
>> related to
>> pushing the arg
>> for the call.
>>
>> This alignment is no longer
>> performed in the new
>> (shared) implementation
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>
>>
>>
>>
>> Now, Solaris was not performing
>> the alignment and
>> Windows has a separate
>> path for x86_64. Did we really
>> need the
>> alignment for
>> linux x86_64 and
>> bsd_x86_64 ? Might it be needed
>> for other ports ?
>>
>> IMHO, it might be safer to align
>> the stack by
>> default,
>> knowing it should
>> not be expensive since we call
>> get_thread
>> rarely for
>> x86_64 (result is
>> cached in r15). I'll let you see
>> whether it is
>> worth
>> adding an ifdef so
>> as to explicitly deactivate the
>> alignment on
>> some platforms
>> (solaris_x86_64 ?)
>>
>>
>> I don't have enough knowledge to even
>> begin to
>> comment on
>> this. I will
>> have to rely on our x86_64
>> macroassembler experts
>> to explain
>> the old
>> code and what the requirements are.
>>
>>
>> OK. This means that the alignment was not
>> removed
>> purposefully.
>>
>> In that case, you must either keep the per
>> platform
>> code x86_64
>> (since
>> it was not identical for all platforms) or
>> use the safest
>> version, with
>> the additional
>>
>> // XXX
>> mov(r10, rsp);
>> andq(rsp, -16);
>> push(r10);
>>
>> before the push(r11) and with the
>>
>> pop(rsp);
>>
>> after the pop(r11). It should work on all
>> x86_64 platforms.
>>
>>
>> Right, I will add this back to the code - and
>> replace the
>> meaningless // XXX with an actual comment!
>>
>> Many thanks,
>> David
>> -----
>>
>>
>> FYI, there is another way to do the
>> alignment, based on
>> the fact
>> that
>> we are at least aligned on 8 bytes (see
>> MacroAssembler::call_VM_leaf_base). I
>> assume that this
>> second
>> version is
>> more efficient (particularly thanks to
>> specilative
>> branches) but it
>> might be safer/simpler to continue using
>> andq in
>> get_thread.
>>
>> Bertrand.
>>
>>
>> Thanks,
>> David
>>
>> Regards,
>>
>> Bertrand.
>>
>>
>>
>> On 23/11/2015 08:03, David Holmes
>> wrote:
>>
>> After all the preliminary
>> discussions here
>> are final
>> proposed changes:
>>
>> bug:
>> https://bugs.openjdk.java.net/browse/JDK-8132510
>>
>> Open webrev:
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>
>> A simple (in principle) but
>> wide-ranging change
>> which should appeal to
>> our Code Deletion Engineer's.
>> We implement
>> Thread::current() using a
>> compiler/language-based
>> thread-local
>> variable eg:
>>
>> static __thread Thread
>> *_thr_current;
>>
>> inline Thread*
>> Thread::current() {
>> return _thr_current;
>> }
>>
>> with an appropriate setter of
>> course. By
>> doing this
>> we can completely
>> remove the os_cpu-specific
>> ThreadLocalStorage
>> implementations, and the
>> associated
>> os::thread_local_storage* calls.
>>
>> As __thread is not
>> async-signal-safe (either in
>> theory or practice) we
>> need a fallback library-based
>> solution as used
>> today. For this we use a
>> very simple ThreadLocalStorage
>> class and an
>> implementation thereof for
>> POSIX (covers AIX, BSD, Linux,
>> OSX and
>> Solaris) using
>> pthread_get/setspecific; and
>> one for
>> Windows using
>> its TLS library.
>> While these library routines
>> are not guaranteed
>> async-signal-safe, they
>> seem to be in practice and are
>> what we have
>> been
>> using all along.
>>
>> We also allow for use of
>> only the
>> library-based TLS
>> for platforms where
>> compiler-based thread locals
>> are not
>> supported (as
>> will be needed in
>> the
>> Mobile project). This is
>> enabled at build
>> time by
>> defining
>> USE_LIBRARY_BASED_TLS_ONLY.
>>
>> Thanks to Andrew Haley for
>> providing the
>> Aarch64
>> code; and for Thomas
>> Stuefe for testing PPC and
>> AIX.
>>
>> Testing:
>> - JPRT (core platforms)
>> - Jtreg tests (linux &
>> solaris)
>> - vm.runtime (core
>> platforms)
>>
>> Performance:
>> - still TBD - this is
>> proving to be
>> extremely
>> difficult. If anyone
>> has
>> any simple to run
>> microbenchmarks to
>> suggest I'd
>> give them a try as a
>> sanity check. But I lack
>> access to hardware for
>> running serious
>> benchmarking.
>>
>> Footprint:
>> - varies by platform and the
>> VM (server,
>> client,
>> minimal)
>> - worst-case: ~1% increase for
>> server VM
>> and minimal VM
>> - best-case: 0.4% decrease
>> for client VM
>>
>> Thanks,
>> David
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Bertrand Delsart, Grenoble
>> Engineering Center
>> Oracle, 180 av. de l'Europe, ZIRST de
>> Montbonnot
>> 38330 Montbonnot Saint Martin, FRANCE
>> bertrand.delsart at oracle.com
>> <mailto:bertrand.delsart at oracle.com>
>> <mailto:bertrand.delsart at oracle.com
>> <mailto:bertrand.delsart at oracle.com>>
>> Phone : +33 4 76 18 81 23
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> NOTICE: This email message is for the sole use of the
>> intended
>> recipient(s) and may contain confidential and
>> privileged
>> information. Any unauthorized review, use,
>> disclosure or
>> distribution is prohibited. If you are not the intended
>> recipient,
>> please contact the sender by reply email and destroy
>> all copies of
>> the original message.
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>>
>>
>>
More information about the hotspot-dev
mailing list