RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables
David Holmes
david.holmes at oracle.com
Wed Nov 25 22:52:16 UTC 2015
On 26/11/2015 2:53 AM, Thomas Stüfe wrote:
> I created a bug to track this:
>
> https://bugs.openjdk.java.net/browse/JDK-8144059
Thanks. I admit I still don't really grok the exact problem here. It is
the alignment code that is the problem right?
79 mov(r10, rsp);
80 andq(rsp, -16);
81 push(r10);
?
Thanks,
David
> Because Davids change is not submitted yet, the bug refers to the old
> coding, but stays relevant also after this fix.
>
> Kind Regards, Thomas
>
> On Wed, Nov 25, 2015 at 3:52 PM, Daniel D. Daugherty
> <daniel.daugherty at oracle.com <mailto:daniel.daugherty at oracle.com>> wrote:
>
> On 11/25/15 6:10 AM, David Holmes wrote:
>
> On 25/11/2015 8:46 PM, Thomas Stüfe wrote:
>
>
> On Wed, Nov 25, 2015 at 10:22 AM, Bertrand Delsart
> <bertrand.delsart at oracle.com
> <mailto:bertrand.delsart at oracle.com>
> <mailto:bertrand.delsart at oracle.com
> <mailto:bertrand.delsart at oracle.com>>> wrote:
>
> Good finding Thomas,
>
> On 25/11/2015 09:40, Thomas Stüfe wrote:
>
> Hi David, Bertrand,
>
> about
> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>
> I am not sure this is correct for Windows x64.
>
>
> Unless I'm misreading it, the new code looks a lot like
> the old one
> (including the alignement part). Hence this does not
> look like a
> regression caused by David's change.
>
>
> Yes, this seems like an old bug.
>
> I also see that this is fixed in our codebase since 2010,
> exactly the
> way you propose (with MacroAssembler
> <http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot>::call_VM_leaf_base
>
> <http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot>).
>
> So, it is already used since five years. I would propose
> opening a
> separate bug for this one and we contribute our fix.
>
>
> That sounds good to me too. I also suspect additional code will
> need to be checked as this stack alignment is performed in a few
> places in ./cpu/x86/vm/macroAssembler_x86.cpp with no OS
> specific differences being considered.
>
>
> Looks like MacroAssembler::call_VM_leaf_base() does the "right thing"
> for stack alignment in the X64 case. Perhaps we need to audit both
> stack alignment _and_ uses of MacroAssembler::call() just to make
> sure...
>
> Did I mention that all the assembly code stuff gives me a headache? :-)
>
> Dan
>
>
>
>
> Thanks,
> David
>
> Regards, Thomas
>
>
> https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>
> windows x64 Abi requires the register parameter
> area right above the
> return address on the stack. It is not guaranteed to be
> preserved during
> the function call:
>
> "Even if the called function has fewer than 4
> parameters, these
> 4 stack
> locations are effectively owned by the called
> function, and may
> be used
> by the called function for other purposes besides
> saving parameter
> register values. Thus the caller may not save
> information in
> this region
> of stack across a function call."
>
> So, restoring parameters from it may not work if
> Thread::current() uses
> this area for any reason. In this case, r9, rsp,
> r10, r11 may be
> overwritten by the callee.
>
> I also think Windows x64 ABI is different enough
> from Linux that
> this
> function could live somewhere in <os>_<cpu>.
>
>
> This is a valid point. However, note that this is
> similar to the
> issue we have for instance in 'shared'
> MacroAssembler::call_VM_leaf_base, which was not put in
> <os>_<cpu>
> files.
>
> In fact, the right fix may be to use call_VM_leaf
> instead of call :-)
>
> (alternatively we could use "#ifdef _WIN64" or check
> "frame::arg_reg_save_area_bytes != 0", which may be
> even cleaner and
> more portable)
>
> All these solutions are fine for me.
>
>
> Since we are near to feature freeze, this should
> probably be done in
> a separate CR (unless David has enough time to test it).
>
> Regards,
>
> Bertrand.
>
>
> Kind Regards, Thomas
>
>
>
>
> On Mon, Nov 23, 2015 at 10:42 PM, David Holmes
> <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>
> <mailto:david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>>
> <mailto:david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>
> <mailto:david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>>>> wrote:
>
> On 23/11/2015 11:46 PM, Bertrand Delsart wrote:
>
> On 23/11/2015 13:27, David Holmes wrote:
>
> Hi Bertrand,
>
> On 23/11/2015 8:14 PM, Bertrand
> Delsart wrote:
>
> Hi David,
>
> Overall looks good.
> Thanks for the #ifndef
> USE_LIBRARY_BASED_TLS_ONLY :-)
>
> One doubt, in case this has not
> been discussed
> before.
>
> I'm still catching up on x86_64
> (after years of ARM
> assembly :-)) but it
> seems that there are some stack
> alignment
> constraints on
> runtime calls,
> at least for some x86_64 ABIs.
>
> Some of the x86
> MacroAssembler::get_thread
> implementations had code to
> align the stack before calling
> pthread_getspecific. See
> for instance
> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>
>
>
> Sorry I'm not that familiar with the
> MacroAssembler
> - is it
> this odd
> fragment:
>
> - push(r10);
> - // XXX
> - mov(r10, rsp);
> - andq(rsp, -16);
> - push(r10);
>
> I'm not at all clear what that is
> doing - and if it
> somehow
> changes the
> alignment wouldn't something need to
> be fixed up when
> popping the
> previous values ??
>
> To be honest I'm not even sure what an
> "unaligned
> stack" means.
>
>
> On some platforms, SP may have to be
> aligned on a 16 bytes
> boundary when
> calling another method.
>
>
> Okay - that seems to be an x64 ABI requirement
> based on
> other code
> in MacroAssembler. The reason it is missing in
> the Solaris
> code is
> because I already removed it in the earlier
> changes that
> were done
> on Solaris:
>
> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>
> and nobody picked it up. :(
>
> That said, the absence of this code has not
> caused any
> problems so
> presumably we are normally aligned.
>
> After having pushed what needed to be
> saved, the code above
> rounds SP
> down and saves the old value. It will then
> also push
> r11, which
> results
> in the expected alignment.
>
> The conterpart, after the VM call, is the:
> pop(r11);
> pop(rsp);
>
>
> I had assumed this extra manipulation was
> related to
> pushing the arg
> for the call.
>
> This alignment is no longer
> performed in the new
> (shared) implementation
> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>
>
>
>
> Now, Solaris was not performing
> the alignment and
> Windows has a separate
> path for x86_64. Did we really
> need the
> alignment for
> linux x86_64 and
> bsd_x86_64 ? Might it be needed
> for other ports ?
>
> IMHO, it might be safer to align
> the stack by
> default,
> knowing it should
> not be expensive since we call
> get_thread
> rarely for
> x86_64 (result is
> cached in r15). I'll let you see
> whether it is
> worth
> adding an ifdef so
> as to explicitly deactivate the
> alignment on
> some platforms
> (solaris_x86_64 ?)
>
>
> I don't have enough knowledge to even
> begin to
> comment on
> this. I will
> have to rely on our x86_64
> macroassembler experts
> to explain
> the old
> code and what the requirements are.
>
>
> OK. This means that the alignment was not
> removed
> purposefully.
>
> In that case, you must either keep the per
> platform
> code x86_64
> (since
> it was not identical for all platforms) or
> use the safest
> version, with
> the additional
>
> // XXX
> mov(r10, rsp);
> andq(rsp, -16);
> push(r10);
>
> before the push(r11) and with the
>
> pop(rsp);
>
> after the pop(r11). It should work on all
> x86_64 platforms.
>
>
> Right, I will add this back to the code - and
> replace the
> meaningless // XXX with an actual comment!
>
> Many thanks,
> David
> -----
>
>
> FYI, there is another way to do the
> alignment, based on
> the fact
> that
> we are at least aligned on 8 bytes (see
> MacroAssembler::call_VM_leaf_base). I
> assume that this
> second
> version is
> more efficient (particularly thanks to
> specilative
> branches) but it
> might be safer/simpler to continue using
> andq in
> get_thread.
>
> Bertrand.
>
>
> Thanks,
> David
>
> Regards,
>
> Bertrand.
>
>
>
> On 23/11/2015 08:03, David Holmes
> wrote:
>
> After all the preliminary
> discussions here
> are final
> proposed changes:
>
> bug:
> https://bugs.openjdk.java.net/browse/JDK-8132510
>
> Open webrev:
> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>
> A simple (in principle) but
> wide-ranging change
> which should appeal to
> our Code Deletion Engineer's.
> We implement
> Thread::current() using a
> compiler/language-based
> thread-local
> variable eg:
>
> static __thread Thread
> *_thr_current;
>
> inline Thread*
> Thread::current() {
> return _thr_current;
> }
>
> with an appropriate setter of
> course. By
> doing this
> we can completely
> remove the os_cpu-specific
> ThreadLocalStorage
> implementations, and the
> associated
> os::thread_local_storage* calls.
>
> As __thread is not
> async-signal-safe (either in
> theory or practice) we
> need a fallback library-based
> solution as used
> today. For this we use a
> very simple ThreadLocalStorage
> class and an
> implementation thereof for
> POSIX (covers AIX, BSD, Linux,
> OSX and
> Solaris) using
> pthread_get/setspecific; and
> one for
> Windows using
> its TLS library.
> While these library routines
> are not guaranteed
> async-signal-safe, they
> seem to be in practice and are
> what we have
> been
> using all along.
>
> We also allow for use of only the
> library-based TLS
> for platforms where
> compiler-based thread locals
> are not
> supported (as
> will be needed in
> the
> Mobile project). This is
> enabled at build
> time by
> defining
> USE_LIBRARY_BASED_TLS_ONLY.
>
> Thanks to Andrew Haley for
> providing the
> Aarch64
> code; and for Thomas
> Stuefe for testing PPC and AIX.
>
> Testing:
> - JPRT (core platforms)
> - Jtreg tests (linux & solaris)
> - vm.runtime (core platforms)
>
> Performance:
> - still TBD - this is
> proving to be
> extremely
> difficult. If anyone
> has
> any simple to run
> microbenchmarks to
> suggest I'd
> give them a try as a
> sanity check. But I lack
> access to hardware for
> running serious
> benchmarking.
>
> Footprint:
> - varies by platform and the
> VM (server,
> client,
> minimal)
> - worst-case: ~1% increase for
> server VM
> and minimal VM
> - best-case: 0.4% decrease
> for client VM
>
> Thanks,
> David
>
>
>
>
>
>
>
>
> --
> Bertrand Delsart, Grenoble
> Engineering Center
> Oracle, 180 av. de l'Europe, ZIRST de
> Montbonnot
> 38330 Montbonnot Saint Martin, FRANCE
> bertrand.delsart at oracle.com
> <mailto:bertrand.delsart at oracle.com>
> <mailto:bertrand.delsart at oracle.com
> <mailto:bertrand.delsart at oracle.com>>
> Phone : +33 4 76 18 81 23
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> NOTICE: This email message is for the sole use of the
> intended
> recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or
> distribution is prohibited. If you are not the intended
> recipient,
> please contact the sender by reply email and destroy
> all copies of
> the original message.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>
>
More information about the hotspot-dev
mailing list