RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables
Daniel D. Daugherty
daniel.daugherty at oracle.com
Tue Dec 1 14:28:36 UTC 2015
> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v10
Rereviewed by comparing this webrev's patch with the patch from
webrev.v8.
Thumbs up.
Dan
On 11/27/15 12:26 AM, David Holmes wrote:
> Hi Thomas,
>
> On 26/11/2015 7:56 PM, Thomas Stüfe wrote:
>>
>> On Thu, Nov 26, 2015 at 12:51 AM, David Holmes <david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>> wrote:
>>
>> On 26/11/2015 9:08 AM, Daniel D. Daugherty wrote:
>>
>> If I understand the problem... :-)
>>
>> The Win-X64 ABI requires that some extra space be allocated on
>> the stack in the caller for "homing" four registers...
>>
>> This code uses MacroAssembler::call() which is a lower level
>> call() function that assumes you know the ABI...
>>
>> 9765 void MacroAssembler::get_thread(Register thread) {
>> <snip>
>> 9784 call(RuntimeAddress(CAST_FROM_FN_PTR(address,
>> Thread::current)));
>>
>> So the failing here was nothing to do with register saving per-se,
>> nor the alignment code, but "simply" not also saving the extra space
>> prior to the call!
>>
>> Yes. These 32 byte are a scratch area for the callee, which may be used
>> for parameter registers or anything else. We may just have been lucky so
>> far that TlsGetValue() did not use this area. As Bertrand already said,
>> now we call Thread::current(), which does magic tls accesses under the
>> hood, so this must be tested again.
>
> Right. All testing so far has been successful but ...
>
>> However I think that a simple addition to your current fix could simply
>> be subtracting 32 from rsp before the call:
>>
>> subq(rsp, 32)
>> call(RuntimeAddress(CAST_FROM_FN_PTR(address, Thread::current)));
>> addq(rsp, 32)
>
> ... I decided to do what was previously suggested and use:
>
> call_VM_leaf_base(CAST_FROM_FN_PTR(address, Thread::current), 0);
>
> as it takes care of everything (stack alignment if needed and
> preserving space for the register args on win64).
>
>> Also, we could probably skip saving/restoring rdi and rsi on Windows:
>
> Maybe - but we don't seem to consider that anywhere else ie:
>
> void MacroAssembler::push_callee_saved_registers() {
> push(rsi);
> push(rdi);
> push(rdx);
> push(rcx);
> }
>
> so I don't want to be the trail-blazer here. :)
>
> Updated webrev:
>
> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v10
>
> Only change to macroAssembler_x86.cpp. Re-tested on Windows 32-bit and
> 64-bit.
>
> Thanks,
> David
>
>> https://msdn.microsoft.com/en-us/library/6t169e9c.aspx
>>
>> "The registers RAX, RCX, RDX, R8, R9, R10, R11 are considered volatile
>> and must be considered destroyed on function calls (unless otherwise
>> safety-provable by analysis such as whole program optimization). The
>> registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are considered
>> nonvolatile and must be saved and restored by a function that uses
>> them."
>
>
>
>> Best Regards, Thomas
>>
>>
>> If you switch to MacroAssembler::call_VM_leaf_base() it handles
>> more of the gory details for you than MacroAssembler::call():
>>
>>
>> Wow! Not exactly a clearly-defined name for this. AFAICS there's
>> nothing VM or leaf specific about this. Also the MacroAssembler is
>> supposed to be a platform-agnostic higher-level abstraction that is
>> easy to use - but in this case every use of the plain call on
>> Windows-x64 is potentially broken if the programmer was not aware of
>> this ABI constraint which only applies to Windows (and we're in
>> OS-neutral files!).
>>
>> This seems fundamentally broken to me.
>>
>> Thanks for the info.
>>
>> David
>>
>>
>> src/cpu/x86/vm/macroAssembler_x86.cpp:
>>
>> 524 void MacroAssembler::call_VM_leaf_base(address
>> entry_point, int
>> num_args
>> ) {
>> 525 Label L, E;
>> 526
>> 527 #ifdef _WIN64
>> 528 // Windows always allocates space for it's
>> register args
>> 529 assert(num_args <= 4, "only register arguments
>> supported");
>> 530 subq(rsp, frame::arg_reg_save_area_bytes);
>> 531 #endif
>> 532
>> 533 // Align stack if necessary
>> 534 testl(rsp, 15);
>> 535 jcc(Assembler::zero, L);
>> 536
>> 537 subq(rsp, 8);
>> 538 {
>> 539 call(RuntimeAddress(entry_point));
>> 540 }
>>
>> So call_VM_leaf_base() allocates the "homing" space for register
>> args for Win-X64 on L530 and it handles any stack alignment
>> before
>> calling MacroAssembler::call().
>>
>> So if you switch to call_VM_leaf_base() you can drop
>> the stack alignment code:
>>
>> 9777 // ensure stack alignment
>> 9778 mov(r10, rsp);
>> 9779 andq(rsp, -16);
>> 9780 push(r10);
>>
>> and just switch to:
>>
>> 9777 push(rsp);
>>
>> And you won't be exposed to potential saved register
>> overwrite by the "homing" of any registers in the call
>> Thread::current().
>>
>> I _think_ I have that all correct... :-)
>>
>> Dan
>>
>>
>> On 11/25/15 3:52 PM, David Holmes wrote:
>>
>> On 26/11/2015 2:53 AM, Thomas Stüfe wrote:
>>
>> I created a bug to track this:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8144059
>>
>>
>> Thanks. I admit I still don't really grok the exact problem
>> here. It
>> is the alignment code that is the problem right?
>>
>> 79 mov(r10, rsp);
>> 80 andq(rsp, -16);
>> 81 push(r10);
>>
>> ?
>>
>> Thanks,
>> David
>>
>> Because Davids change is not submitted yet, the bug
>> refers to the old
>> coding, but stays relevant also after this fix.
>>
>> Kind Regards, Thomas
>>
>> On Wed, Nov 25, 2015 at 3:52 PM, Daniel D. Daugherty
>> <daniel.daugherty at oracle.com
>> <mailto:daniel.daugherty at oracle.com>
>> <mailto:daniel.daugherty at oracle.com
>> <mailto:daniel.daugherty at oracle.com>>>
>> wrote:
>>
>> On 11/25/15 6:10 AM, David Holmes wrote:
>>
>> On 25/11/2015 8:46 PM, Thomas Stüfe wrote:
>>
>>
>> On Wed, Nov 25, 2015 at 10:22 AM, Bertrand
>> Delsart
>> <bertrand.delsart at oracle.com
>> <mailto:bertrand.delsart at oracle.com>
>> <mailto:bertrand.delsart at oracle.com
>> <mailto:bertrand.delsart at oracle.com>>
>> <mailto:bertrand.delsart at oracle.com
>> <mailto:bertrand.delsart at oracle.com>
>> <mailto:bertrand.delsart at oracle.com
>> <mailto:bertrand.delsart at oracle.com>>>> wrote:
>>
>> Good finding Thomas,
>>
>> On 25/11/2015 09:40, Thomas Stüfe
>> wrote:
>>
>> Hi David, Bertrand,
>>
>> about
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>>
>>
>> I am not sure this is correct for
>> Windows x64.
>>
>>
>> Unless I'm misreading it, the new code
>> looks a lot like
>> the old one
>> (including the alignement part). Hence
>> this does not
>> look like a
>> regression caused by David's change.
>>
>>
>> Yes, this seems like an old bug.
>>
>> I also see that this is fixed in our
>> codebase since 2010,
>> exactly the
>> way you propose (with MacroAssembler
>> <http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot>::call_VM_leaf_base
>>
>>
>> <http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot>).
>>
>>
>> So, it is already used since five years. I
>> would propose
>> opening a
>> separate bug for this one and we contribute
>> our fix.
>>
>>
>> That sounds good to me too. I also suspect
>> additional code will
>> need to be checked as this stack alignment is
>> performed in a few
>> places in ./cpu/x86/vm/macroAssembler_x86.cpp
>> with no OS
>> specific differences being considered.
>>
>>
>> Looks like MacroAssembler::call_VM_leaf_base() does
>> the "right
>> thing"
>> for stack alignment in the X64 case. Perhaps we
>> need to audit both
>> stack alignment _and_ uses of
>> MacroAssembler::call() just to make
>> sure...
>>
>> Did I mention that all the assembly code stuff
>> gives me a
>> headache? :-)
>>
>> Dan
>>
>>
>>
>>
>> Thanks,
>> David
>>
>> Regards, Thomas
>>
>>
>> https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>>
>> windows x64 Abi requires the
>> register parameter
>> area right above the
>> return address on the stack. It
>> is not
>> guaranteed to be
>> preserved during
>> the function call:
>>
>> "Even if the called function has
>> fewer than 4
>> parameters, these
>> 4 stack
>> locations are effectively owned by
>> the called
>> function, and may
>> be used
>> by the called function for other
>> purposes besides
>> saving parameter
>> register values. Thus the caller
>> may not save
>> information in
>> this region
>> of stack across a function call."
>>
>> So, restoring parameters from it
>> may not work if
>> Thread::current() uses
>> this area for any reason. In this
>> case, r9, rsp,
>> r10, r11 may be
>> overwritten by the callee.
>>
>> I also think Windows x64 ABI is
>> different enough
>> from Linux that
>> this
>> function could live somewhere in
>> <os>_<cpu>.
>>
>>
>> This is a valid point. However, note
>> that this is
>> similar to the
>> issue we have for instance in 'shared'
>> MacroAssembler::call_VM_leaf_base,
>> which was not put in
>> <os>_<cpu>
>> files.
>>
>> In fact, the right fix may be to use
>> call_VM_leaf
>> instead of call :-)
>>
>> (alternatively we could use "#ifdef
>> _WIN64" or check
>> "frame::arg_reg_save_area_bytes != 0",
>> which may be
>> even cleaner and
>> more portable)
>>
>> All these solutions are fine for me.
>>
>>
>> Since we are near to feature freeze,
>> this should
>> probably be done in
>> a separate CR (unless David has enough
>> time to test
>> it).
>>
>> Regards,
>>
>> Bertrand.
>>
>>
>> Kind Regards, Thomas
>>
>>
>>
>>
>> On Mon, Nov 23, 2015 at 10:42 PM,
>> David Holmes
>> <david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>
>> <mailto:david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>>
>> <mailto:david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>
>> <mailto:david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>>>
>> <mailto:david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>
>> <mailto:david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>>
>> <mailto:david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>
>> <mailto:david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>>>>> wrote:
>>
>> On 23/11/2015 11:46 PM,
>> Bertrand Delsart
>> wrote:
>>
>> On 23/11/2015 13:27,
>> David Holmes wrote:
>>
>> Hi Bertrand,
>>
>> On 23/11/2015 8:14
>> PM, Bertrand
>> Delsart wrote:
>>
>> Hi David,
>>
>> Overall looks
>> good.
>> Thanks for the
>> #ifndef
>> USE_LIBRARY_BASED_TLS_ONLY :-)
>>
>> One doubt, in
>> case this has not
>> been discussed
>> before.
>>
>> I'm still
>> catching up on x86_64
>> (after years of ARM
>> assembly :-))
>> but it
>> seems that there
>> are some stack
>> alignment
>> constraints on
>> runtime calls,
>> at least for some
>> x86_64 ABIs.
>>
>> Some of the x86
>> MacroAssembler::get_thread
>> implementations
>> had code to
>> align the stack
>> before calling
>> pthread_getspecific. See
>> for instance
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>
>>
>>
>>
>> Sorry I'm not that
>> familiar with the
>> MacroAssembler
>> - is it
>> this odd
>> fragment:
>>
>> - push(r10);
>> - // XXX
>> - mov(r10, rsp);
>> - andq(rsp, -16);
>> - push(r10);
>>
>> I'm not at all clear
>> what that is
>> doing - and if it
>> somehow
>> changes the
>> alignment wouldn't
>> something need to
>> be fixed up when
>> popping the
>> previous values ??
>>
>> To be honest I'm not
>> even sure what an
>> "unaligned
>> stack" means.
>>
>>
>> On some platforms, SP may
>> have to be
>> aligned on a 16 bytes
>> boundary when
>> calling another method.
>>
>>
>> Okay - that seems to be an
>> x64 ABI requirement
>> based on
>> other code
>> in MacroAssembler. The reason
>> it is missing in
>> the Solaris
>> code is
>> because I already removed it
>> in the earlier
>> changes that
>> were done
>> on Solaris:
>>
>> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>>
>> and nobody picked it up. :(
>>
>> That said, the absence of
>> this code has not
>> caused any
>> problems so
>> presumably we are normally
>> aligned.
>>
>> After having pushed what
>> needed to be
>> saved, the code above
>> rounds SP
>> down and saves the old
>> value. It will then
>> also push
>> r11, which
>> results
>> in the expected
>> alignment.
>>
>> The conterpart, after the
>> VM call, is the:
>> pop(r11);
>> pop(rsp);
>>
>>
>> I had assumed this extra
>> manipulation was
>> related to
>> pushing the arg
>> for the call.
>>
>> This alignment is
>> no longer
>> performed in the new
>> (shared)
>> implementation
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>
>>
>>
>>
>>
>> Now, Solaris was
>> not performing
>> the alignment and
>> Windows has a
>> separate
>> path for x86_64.
>> Did we really
>> need the
>> alignment for
>> linux x86_64 and
>> bsd_x86_64 ?
>> Might it be needed
>> for other ports ?
>>
>> IMHO, it might be
>> safer to align
>> the stack by
>> default,
>> knowing it should
>> not be expensive
>> since we call
>> get_thread
>> rarely for
>> x86_64 (result is
>> cached in r15).
>> I'll let you see
>> whether it is
>> worth
>> adding an
>> ifdef so
>> as to explicitly
>> deactivate the
>> alignment on
>> some platforms
>> (solaris_x86_64 ?)
>>
>>
>> I don't have enough
>> knowledge to even
>> begin to
>> comment on
>> this. I will
>> have to rely on our
>> x86_64
>> macroassembler experts
>> to explain
>> the old
>> code and what the
>> requirements are.
>>
>>
>> OK. This means that the
>> alignment was not
>> removed
>> purposefully.
>>
>> In that case, you must
>> either keep the per
>> platform
>> code x86_64
>> (since
>> it was not identical for
>> all platforms) or
>> use the safest
>> version, with
>> the additional
>>
>> // XXX
>> mov(r10, rsp);
>> andq(rsp, -16);
>> push(r10);
>>
>> before the push(r11) and
>> with the
>>
>> pop(rsp);
>>
>> after the pop(r11). It
>> should work on all
>> x86_64 platforms.
>>
>>
>> Right, I will add this back
>> to the code - and
>> replace the
>> meaningless // XXX with an
>> actual comment!
>>
>> Many thanks,
>> David
>> -----
>>
>>
>> FYI, there is another way
>> to do the
>> alignment, based on
>> the fact
>> that
>> we are at least aligned
>> on 8 bytes (see
>> MacroAssembler::call_VM_leaf_base). I
>> assume that this
>> second
>> version is
>> more efficient
>> (particularly thanks to
>> specilative
>> branches) but it
>> might be safer/simpler to
>> continue using
>> andq in
>> get_thread.
>>
>> Bertrand.
>>
>>
>> Thanks,
>> David
>>
>> Regards,
>>
>> Bertrand.
>>
>>
>>
>> On 23/11/2015
>> 08:03, David Holmes
>> wrote:
>>
>> After all the
>> preliminary
>> discussions here
>> are final
>> proposed changes:
>>
>> bug:
>> https://bugs.openjdk.java.net/browse/JDK-8132510
>>
>> Open webrev:
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>
>> A simple (in
>> principle) but
>> wide-ranging change
>> which should
>> appeal to
>> our Code
>> Deletion Engineer's.
>> We implement
>>
>> Thread::current() using a
>> compiler/language-based
>> thread-local
>> variable eg:
>>
>> static
>> __thread Thread
>> *_thr_current;
>>
>> inline Thread*
>> Thread::current() {
>> return
>> _thr_current;
>> }
>>
>> with an
>> appropriate setter of
>> course. By
>> doing this
>> we can
>> completely
>> remove the
>> os_cpu-specific
>> ThreadLocalStorage
>>
>> implementations, and the
>> associated
>> os::thread_local_storage* calls.
>>
>> As __thread
>> is not
>> async-signal-safe (either in
>> theory or
>> practice) we
>> need a
>> fallback library-based
>> solution as used
>> today. For
>> this we use a
>> very simple
>> ThreadLocalStorage
>> class and an
>>
>> implementation thereof for
>> POSIX (covers
>> AIX, BSD, Linux,
>> OSX and
>> Solaris) using
>> pthread_get/setspecific; and
>> one for
>> Windows using
>> its TLS
>> library.
>> While these
>> library routines
>> are not guaranteed
>>
>> async-signal-safe, they
>> seem to be in
>> practice and are
>> what we have
>> been
>> using all
>> along.
>>
>> We also allow
>> for use of
>> only the
>> library-based TLS
>> for platforms
>> where
>>
>> compiler-based thread locals
>> are not
>> supported (as
>> will be
>> needed in
>> the
>> Mobile
>> project). This is
>> enabled at build
>> time by
>> defining
>> USE_LIBRARY_BASED_TLS_ONLY.
>>
>> Thanks to
>> Andrew Haley for
>> providing the
>> Aarch64
>> code; and for
>> Thomas
>> Stuefe for
>> testing PPC and
>> AIX.
>>
>> Testing:
>> - JPRT
>> (core platforms)
>> - Jtreg
>> tests (linux &
>> solaris)
>> -
>> vm.runtime (core
>> platforms)
>>
>> Performance:
>> - still
>> TBD - this is
>> proving to be
>> extremely
>> difficult. If
>> anyone
>> has
>> any simple
>> to run
>> microbenchmarks to
>> suggest I'd
>> give them a
>> try as a
>> sanity check.
>> But I lack
>> access to hardware for
>> running serious
>> benchmarking.
>>
>> Footprint:
>> - varies by
>> platform and the
>> VM (server,
>> client,
>> minimal)
>> - worst-case:
>> ~1% increase for
>> server VM
>> and minimal VM
>> - best-case:
>> 0.4% decrease
>> for client VM
>>
>> Thanks,
>> David
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> Bertrand Delsart,
>> Grenoble
>> Engineering Center
>> Oracle, 180 av. de l'Europe,
>> ZIRST de
>> Montbonnot
>> 38330 Montbonnot Saint Martin, FRANCE
>> bertrand.delsart at oracle.com
>> <mailto:bertrand.delsart at oracle.com>
>> <mailto:bertrand.delsart at oracle.com
>> <mailto:bertrand.delsart at oracle.com>>
>> <mailto:bertrand.delsart at oracle.com
>> <mailto:bertrand.delsart at oracle.com>
>> <mailto:bertrand.delsart at oracle.com
>> <mailto:bertrand.delsart at oracle.com>>>
>> Phone : +33 4 76 18 81 23
>>
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> NOTICE: This email message is for the
>> sole use of the
>> intended
>> recipient(s) and may contain
>> confidential and
>> privileged
>> information. Any unauthorized
>> review, use,
>> disclosure or
>> distribution is prohibited. If you are
>> not the intended
>> recipient,
>> please contact the sender by reply
>> email and destroy
>> all copies of
>> the original message.
>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>
>>
>>
>>
>>
>>
More information about the hotspot-dev
mailing list