RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables
David Holmes
david.holmes at oracle.com
Wed Nov 25 13:10:33 UTC 2015
On 25/11/2015 8:46 PM, Thomas Stüfe wrote:
>
> On Wed, Nov 25, 2015 at 10:22 AM, Bertrand Delsart
> <bertrand.delsart at oracle.com <mailto:bertrand.delsart at oracle.com>> wrote:
>
> Good finding Thomas,
>
> On 25/11/2015 09:40, Thomas Stüfe wrote:
>
> Hi David, Bertrand,
>
> about
> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>
> I am not sure this is correct for Windows x64.
>
>
> Unless I'm misreading it, the new code looks a lot like the old one
> (including the alignement part). Hence this does not look like a
> regression caused by David's change.
>
>
> Yes, this seems like an old bug.
>
> I also see that this is fixed in our codebase since 2010, exactly the
> way you propose (with MacroAssembler
> <http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot>::call_VM_leaf_base
> <http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot>).
> So, it is already used since five years. I would propose opening a
> separate bug for this one and we contribute our fix.
That sounds good to me too. I also suspect additional code will need to
be checked as this stack alignment is performed in a few places in
./cpu/x86/vm/macroAssembler_x86.cpp with no OS specific differences
being considered.
Thanks,
David
> Regards, Thomas
>
>
> https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>
> windows x64 Abi requires the register parameter area right above the
> return address on the stack. It is not guaranteed to be
> preserved during
> the function call:
>
> "Even if the called function has fewer than 4 parameters, these
> 4 stack
> locations are effectively owned by the called function, and may
> be used
> by the called function for other purposes besides saving parameter
> register values. Thus the caller may not save information in
> this region
> of stack across a function call."
>
> So, restoring parameters from it may not work if
> Thread::current() uses
> this area for any reason. In this case, r9, rsp, r10, r11 may be
> overwritten by the callee.
>
> I also think Windows x64 ABI is different enough from Linux that
> this
> function could live somewhere in <os>_<cpu>.
>
>
> This is a valid point. However, note that this is similar to the
> issue we have for instance in 'shared'
> MacroAssembler::call_VM_leaf_base, which was not put in <os>_<cpu>
> files.
>
> In fact, the right fix may be to use call_VM_leaf instead of call :-)
>
> (alternatively we could use "#ifdef _WIN64" or check
> "frame::arg_reg_save_area_bytes != 0", which may be even cleaner and
> more portable)
>
> All these solutions are fine for me.
>
>
> Since we are near to feature freeze, this should probably be done in
> a separate CR (unless David has enough time to test it).
>
> Regards,
>
> Bertrand.
>
>
> Kind Regards, Thomas
>
>
>
>
> On Mon, Nov 23, 2015 at 10:42 PM, David Holmes
> <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
> <mailto:david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>>> wrote:
>
> On 23/11/2015 11:46 PM, Bertrand Delsart wrote:
>
> On 23/11/2015 13:27, David Holmes wrote:
>
> Hi Bertrand,
>
> On 23/11/2015 8:14 PM, Bertrand Delsart wrote:
>
> Hi David,
>
> Overall looks good.
> Thanks for the #ifndef
> USE_LIBRARY_BASED_TLS_ONLY :-)
>
> One doubt, in case this has not been discussed
> before.
>
> I'm still catching up on x86_64 (after years of ARM
> assembly :-)) but it
> seems that there are some stack alignment
> constraints on
> runtime calls,
> at least for some x86_64 ABIs.
>
> Some of the x86 MacroAssembler::get_thread
> implementations had code to
> align the stack before calling
> pthread_getspecific. See
> for instance
> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>
>
>
> Sorry I'm not that familiar with the MacroAssembler
> - is it
> this odd
> fragment:
>
> - push(r10);
> - // XXX
> - mov(r10, rsp);
> - andq(rsp, -16);
> - push(r10);
>
> I'm not at all clear what that is doing - and if it
> somehow
> changes the
> alignment wouldn't something need to be fixed up when
> popping the
> previous values ??
>
> To be honest I'm not even sure what an "unaligned
> stack" means.
>
>
> On some platforms, SP may have to be aligned on a 16 bytes
> boundary when
> calling another method.
>
>
> Okay - that seems to be an x64 ABI requirement based on
> other code
> in MacroAssembler. The reason it is missing in the Solaris
> code is
> because I already removed it in the earlier changes that
> were done
> on Solaris:
>
> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>
> and nobody picked it up. :(
>
> That said, the absence of this code has not caused any
> problems so
> presumably we are normally aligned.
>
> After having pushed what needed to be saved, the code above
> rounds SP
> down and saves the old value. It will then also push
> r11, which
> results
> in the expected alignment.
>
> The conterpart, after the VM call, is the:
> pop(r11);
> pop(rsp);
>
>
> I had assumed this extra manipulation was related to
> pushing the arg
> for the call.
>
> This alignment is no longer performed in the new
> (shared) implementation
> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>
>
>
>
> Now, Solaris was not performing the alignment and
> Windows has a separate
> path for x86_64. Did we really need the
> alignment for
> linux x86_64 and
> bsd_x86_64 ? Might it be needed for other ports ?
>
> IMHO, it might be safer to align the stack by
> default,
> knowing it should
> not be expensive since we call get_thread
> rarely for
> x86_64 (result is
> cached in r15). I'll let you see whether it is
> worth
> adding an ifdef so
> as to explicitly deactivate the alignment on
> some platforms
> (solaris_x86_64 ?)
>
>
> I don't have enough knowledge to even begin to
> comment on
> this. I will
> have to rely on our x86_64 macroassembler experts
> to explain
> the old
> code and what the requirements are.
>
>
> OK. This means that the alignment was not removed
> purposefully.
>
> In that case, you must either keep the per platform
> code x86_64
> (since
> it was not identical for all platforms) or use the safest
> version, with
> the additional
>
> // XXX
> mov(r10, rsp);
> andq(rsp, -16);
> push(r10);
>
> before the push(r11) and with the
>
> pop(rsp);
>
> after the pop(r11). It should work on all x86_64 platforms.
>
>
> Right, I will add this back to the code - and replace the
> meaningless // XXX with an actual comment!
>
> Many thanks,
> David
> -----
>
>
> FYI, there is another way to do the alignment, based on
> the fact
> that
> we are at least aligned on 8 bytes (see
> MacroAssembler::call_VM_leaf_base). I assume that this
> second
> version is
> more efficient (particularly thanks to specilative
> branches) but it
> might be safer/simpler to continue using andq in
> get_thread.
>
> Bertrand.
>
>
> Thanks,
> David
>
> Regards,
>
> Bertrand.
>
>
>
> On 23/11/2015 08:03, David Holmes wrote:
>
> After all the preliminary discussions here
> are final
> proposed changes:
>
> bug:
> https://bugs.openjdk.java.net/browse/JDK-8132510
>
> Open webrev:
> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>
> A simple (in principle) but wide-ranging change
> which should appeal to
> our Code Deletion Engineer's. We implement
> Thread::current() using a
> compiler/language-based thread-local
> variable eg:
>
> static __thread Thread *_thr_current;
>
> inline Thread* Thread::current() {
> return _thr_current;
> }
>
> with an appropriate setter of course. By
> doing this
> we can completely
> remove the os_cpu-specific ThreadLocalStorage
> implementations, and the
> associated os::thread_local_storage* calls.
>
> As __thread is not async-signal-safe (either in
> theory or practice) we
> need a fallback library-based solution as used
> today. For this we use a
> very simple ThreadLocalStorage class and an
> implementation thereof for
> POSIX (covers AIX, BSD, Linux, OSX and
> Solaris) using
> pthread_get/setspecific; and one for
> Windows using
> its TLS library.
> While these library routines are not guaranteed
> async-signal-safe, they
> seem to be in practice and are what we have
> been
> using all along.
>
> We also allow for use of only the
> library-based TLS
> for platforms where
> compiler-based thread locals are not
> supported (as
> will be needed in
> the
> Mobile project). This is enabled at build
> time by
> defining
> USE_LIBRARY_BASED_TLS_ONLY.
>
> Thanks to Andrew Haley for providing the
> Aarch64
> code; and for Thomas
> Stuefe for testing PPC and AIX.
>
> Testing:
> - JPRT (core platforms)
> - Jtreg tests (linux & solaris)
> - vm.runtime (core platforms)
>
> Performance:
> - still TBD - this is proving to be
> extremely
> difficult. If anyone
> has
> any simple to run microbenchmarks to
> suggest I'd
> give them a try as a
> sanity check. But I lack access to hardware for
> running serious
> benchmarking.
>
> Footprint:
> - varies by platform and the VM (server,
> client,
> minimal)
> - worst-case: ~1% increase for server VM
> and minimal VM
> - best-case: 0.4% decrease for client VM
>
> Thanks,
> David
>
>
>
>
>
>
>
>
> --
> Bertrand Delsart, Grenoble Engineering Center
> Oracle, 180 av. de l'Europe, ZIRST de Montbonnot
> 38330 Montbonnot Saint Martin, FRANCE
> bertrand.delsart at oracle.com <mailto:bertrand.delsart at oracle.com>
> Phone : +33 4 76 18 81 23
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> NOTICE: This email message is for the sole use of the intended
> recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or
> distribution is prohibited. If you are not the intended recipient,
> please contact the sender by reply email and destroy all copies of
> the original message.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
More information about the hotspot-dev
mailing list