RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables
David Holmes
david.holmes at oracle.com
Wed Nov 25 23:51:26 UTC 2015
On 26/11/2015 9:08 AM, Daniel D. Daugherty wrote:
> If I understand the problem... :-)
>
> The Win-X64 ABI requires that some extra space be allocated on
> the stack in the caller for "homing" four registers...
>
> This code uses MacroAssembler::call() which is a lower level
> call() function that assumes you know the ABI...
>
> 9765 void MacroAssembler::get_thread(Register thread) {
> <snip>
> 9784 call(RuntimeAddress(CAST_FROM_FN_PTR(address, Thread::current)));
So the failing here was nothing to do with register saving per-se, nor
the alignment code, but "simply" not also saving the extra space prior
to the call!
> If you switch to MacroAssembler::call_VM_leaf_base() it handles
> more of the gory details for you than MacroAssembler::call():
Wow! Not exactly a clearly-defined name for this. AFAICS there's nothing
VM or leaf specific about this. Also the MacroAssembler is supposed to
be a platform-agnostic higher-level abstraction that is easy to use -
but in this case every use of the plain call on Windows-x64 is
potentially broken if the programmer was not aware of this ABI
constraint which only applies to Windows (and we're in OS-neutral files!).
This seems fundamentally broken to me.
Thanks for the info.
David
> src/cpu/x86/vm/macroAssembler_x86.cpp:
>
> 524 void MacroAssembler::call_VM_leaf_base(address entry_point, int
> num_args
> ) {
> 525 Label L, E;
> 526
> 527 #ifdef _WIN64
> 528 // Windows always allocates space for it's register args
> 529 assert(num_args <= 4, "only register arguments supported");
> 530 subq(rsp, frame::arg_reg_save_area_bytes);
> 531 #endif
> 532
> 533 // Align stack if necessary
> 534 testl(rsp, 15);
> 535 jcc(Assembler::zero, L);
> 536
> 537 subq(rsp, 8);
> 538 {
> 539 call(RuntimeAddress(entry_point));
> 540 }
>
> So call_VM_leaf_base() allocates the "homing" space for register
> args for Win-X64 on L530 and it handles any stack alignment before
> calling MacroAssembler::call().
>
> So if you switch to call_VM_leaf_base() you can drop
> the stack alignment code:
>
> 9777 // ensure stack alignment
> 9778 mov(r10, rsp);
> 9779 andq(rsp, -16);
> 9780 push(r10);
>
> and just switch to:
>
> 9777 push(rsp);
>
> And you won't be exposed to potential saved register
> overwrite by the "homing" of any registers in the call
> Thread::current().
>
> I _think_ I have that all correct... :-)
>
> Dan
>
>
> On 11/25/15 3:52 PM, David Holmes wrote:
>> On 26/11/2015 2:53 AM, Thomas Stüfe wrote:
>>> I created a bug to track this:
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8144059
>>
>> Thanks. I admit I still don't really grok the exact problem here. It
>> is the alignment code that is the problem right?
>>
>> 79 mov(r10, rsp);
>> 80 andq(rsp, -16);
>> 81 push(r10);
>>
>> ?
>>
>> Thanks,
>> David
>>
>>> Because Davids change is not submitted yet, the bug refers to the old
>>> coding, but stays relevant also after this fix.
>>>
>>> Kind Regards, Thomas
>>>
>>> On Wed, Nov 25, 2015 at 3:52 PM, Daniel D. Daugherty
>>> <daniel.daugherty at oracle.com <mailto:daniel.daugherty at oracle.com>>
>>> wrote:
>>>
>>> On 11/25/15 6:10 AM, David Holmes wrote:
>>>
>>> On 25/11/2015 8:46 PM, Thomas Stüfe wrote:
>>>
>>>
>>> On Wed, Nov 25, 2015 at 10:22 AM, Bertrand Delsart
>>> <bertrand.delsart at oracle.com
>>> <mailto:bertrand.delsart at oracle.com>
>>> <mailto:bertrand.delsart at oracle.com
>>> <mailto:bertrand.delsart at oracle.com>>> wrote:
>>>
>>> Good finding Thomas,
>>>
>>> On 25/11/2015 09:40, Thomas Stüfe wrote:
>>>
>>> Hi David, Bertrand,
>>>
>>> about
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>>>
>>>
>>> I am not sure this is correct for Windows x64.
>>>
>>>
>>> Unless I'm misreading it, the new code looks a lot like
>>> the old one
>>> (including the alignement part). Hence this does not
>>> look like a
>>> regression caused by David's change.
>>>
>>>
>>> Yes, this seems like an old bug.
>>>
>>> I also see that this is fixed in our codebase since 2010,
>>> exactly the
>>> way you propose (with MacroAssembler
>>> <http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot>::call_VM_leaf_base
>>>
>>>
>>> <http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot>).
>>>
>>>
>>> So, it is already used since five years. I would propose
>>> opening a
>>> separate bug for this one and we contribute our fix.
>>>
>>>
>>> That sounds good to me too. I also suspect additional code will
>>> need to be checked as this stack alignment is performed in a few
>>> places in ./cpu/x86/vm/macroAssembler_x86.cpp with no OS
>>> specific differences being considered.
>>>
>>>
>>> Looks like MacroAssembler::call_VM_leaf_base() does the "right
>>> thing"
>>> for stack alignment in the X64 case. Perhaps we need to audit both
>>> stack alignment _and_ uses of MacroAssembler::call() just to make
>>> sure...
>>>
>>> Did I mention that all the assembly code stuff gives me a
>>> headache? :-)
>>>
>>> Dan
>>>
>>>
>>>
>>>
>>> Thanks,
>>> David
>>>
>>> Regards, Thomas
>>>
>>>
>>> https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>>>
>>> windows x64 Abi requires the register parameter
>>> area right above the
>>> return address on the stack. It is not
>>> guaranteed to be
>>> preserved during
>>> the function call:
>>>
>>> "Even if the called function has fewer than 4
>>> parameters, these
>>> 4 stack
>>> locations are effectively owned by the called
>>> function, and may
>>> be used
>>> by the called function for other purposes besides
>>> saving parameter
>>> register values. Thus the caller may not save
>>> information in
>>> this region
>>> of stack across a function call."
>>>
>>> So, restoring parameters from it may not work if
>>> Thread::current() uses
>>> this area for any reason. In this case, r9, rsp,
>>> r10, r11 may be
>>> overwritten by the callee.
>>>
>>> I also think Windows x64 ABI is different enough
>>> from Linux that
>>> this
>>> function could live somewhere in <os>_<cpu>.
>>>
>>>
>>> This is a valid point. However, note that this is
>>> similar to the
>>> issue we have for instance in 'shared'
>>> MacroAssembler::call_VM_leaf_base, which was not put in
>>> <os>_<cpu>
>>> files.
>>>
>>> In fact, the right fix may be to use call_VM_leaf
>>> instead of call :-)
>>>
>>> (alternatively we could use "#ifdef _WIN64" or check
>>> "frame::arg_reg_save_area_bytes != 0", which may be
>>> even cleaner and
>>> more portable)
>>>
>>> All these solutions are fine for me.
>>>
>>>
>>> Since we are near to feature freeze, this should
>>> probably be done in
>>> a separate CR (unless David has enough time to test
>>> it).
>>>
>>> Regards,
>>>
>>> Bertrand.
>>>
>>>
>>> Kind Regards, Thomas
>>>
>>>
>>>
>>>
>>> On Mon, Nov 23, 2015 at 10:42 PM, David Holmes
>>> <david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>
>>> <mailto:david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>>
>>> <mailto:david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>
>>> <mailto:david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>>>> wrote:
>>>
>>> On 23/11/2015 11:46 PM, Bertrand Delsart
>>> wrote:
>>>
>>> On 23/11/2015 13:27, David Holmes wrote:
>>>
>>> Hi Bertrand,
>>>
>>> On 23/11/2015 8:14 PM, Bertrand
>>> Delsart wrote:
>>>
>>> Hi David,
>>>
>>> Overall looks good.
>>> Thanks for the #ifndef
>>> USE_LIBRARY_BASED_TLS_ONLY :-)
>>>
>>> One doubt, in case this has not
>>> been discussed
>>> before.
>>>
>>> I'm still catching up on x86_64
>>> (after years of ARM
>>> assembly :-)) but it
>>> seems that there are some stack
>>> alignment
>>> constraints on
>>> runtime calls,
>>> at least for some x86_64 ABIs.
>>>
>>> Some of the x86
>>> MacroAssembler::get_thread
>>> implementations had code to
>>> align the stack before calling
>>> pthread_getspecific. See
>>> for instance
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>>
>>>
>>>
>>>
>>> Sorry I'm not that familiar with the
>>> MacroAssembler
>>> - is it
>>> this odd
>>> fragment:
>>>
>>> - push(r10);
>>> - // XXX
>>> - mov(r10, rsp);
>>> - andq(rsp, -16);
>>> - push(r10);
>>>
>>> I'm not at all clear what that is
>>> doing - and if it
>>> somehow
>>> changes the
>>> alignment wouldn't something need to
>>> be fixed up when
>>> popping the
>>> previous values ??
>>>
>>> To be honest I'm not even sure what an
>>> "unaligned
>>> stack" means.
>>>
>>>
>>> On some platforms, SP may have to be
>>> aligned on a 16 bytes
>>> boundary when
>>> calling another method.
>>>
>>>
>>> Okay - that seems to be an x64 ABI requirement
>>> based on
>>> other code
>>> in MacroAssembler. The reason it is missing in
>>> the Solaris
>>> code is
>>> because I already removed it in the earlier
>>> changes that
>>> were done
>>> on Solaris:
>>>
>>> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>>>
>>> and nobody picked it up. :(
>>>
>>> That said, the absence of this code has not
>>> caused any
>>> problems so
>>> presumably we are normally aligned.
>>>
>>> After having pushed what needed to be
>>> saved, the code above
>>> rounds SP
>>> down and saves the old value. It will then
>>> also push
>>> r11, which
>>> results
>>> in the expected alignment.
>>>
>>> The conterpart, after the VM call, is the:
>>> pop(r11);
>>> pop(rsp);
>>>
>>>
>>> I had assumed this extra manipulation was
>>> related to
>>> pushing the arg
>>> for the call.
>>>
>>> This alignment is no longer
>>> performed in the new
>>> (shared) implementation
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>>
>>>
>>>
>>>
>>>
>>> Now, Solaris was not performing
>>> the alignment and
>>> Windows has a separate
>>> path for x86_64. Did we really
>>> need the
>>> alignment for
>>> linux x86_64 and
>>> bsd_x86_64 ? Might it be needed
>>> for other ports ?
>>>
>>> IMHO, it might be safer to align
>>> the stack by
>>> default,
>>> knowing it should
>>> not be expensive since we call
>>> get_thread
>>> rarely for
>>> x86_64 (result is
>>> cached in r15). I'll let you see
>>> whether it is
>>> worth
>>> adding an ifdef so
>>> as to explicitly deactivate the
>>> alignment on
>>> some platforms
>>> (solaris_x86_64 ?)
>>>
>>>
>>> I don't have enough knowledge to even
>>> begin to
>>> comment on
>>> this. I will
>>> have to rely on our x86_64
>>> macroassembler experts
>>> to explain
>>> the old
>>> code and what the requirements are.
>>>
>>>
>>> OK. This means that the alignment was not
>>> removed
>>> purposefully.
>>>
>>> In that case, you must either keep the per
>>> platform
>>> code x86_64
>>> (since
>>> it was not identical for all platforms) or
>>> use the safest
>>> version, with
>>> the additional
>>>
>>> // XXX
>>> mov(r10, rsp);
>>> andq(rsp, -16);
>>> push(r10);
>>>
>>> before the push(r11) and with the
>>>
>>> pop(rsp);
>>>
>>> after the pop(r11). It should work on all
>>> x86_64 platforms.
>>>
>>>
>>> Right, I will add this back to the code - and
>>> replace the
>>> meaningless // XXX with an actual comment!
>>>
>>> Many thanks,
>>> David
>>> -----
>>>
>>>
>>> FYI, there is another way to do the
>>> alignment, based on
>>> the fact
>>> that
>>> we are at least aligned on 8 bytes (see
>>> MacroAssembler::call_VM_leaf_base). I
>>> assume that this
>>> second
>>> version is
>>> more efficient (particularly thanks to
>>> specilative
>>> branches) but it
>>> might be safer/simpler to continue using
>>> andq in
>>> get_thread.
>>>
>>> Bertrand.
>>>
>>>
>>> Thanks,
>>> David
>>>
>>> Regards,
>>>
>>> Bertrand.
>>>
>>>
>>>
>>> On 23/11/2015 08:03, David Holmes
>>> wrote:
>>>
>>> After all the preliminary
>>> discussions here
>>> are final
>>> proposed changes:
>>>
>>> bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8132510
>>>
>>> Open webrev:
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>>
>>> A simple (in principle) but
>>> wide-ranging change
>>> which should appeal to
>>> our Code Deletion Engineer's.
>>> We implement
>>> Thread::current() using a
>>> compiler/language-based
>>> thread-local
>>> variable eg:
>>>
>>> static __thread Thread
>>> *_thr_current;
>>>
>>> inline Thread*
>>> Thread::current() {
>>> return _thr_current;
>>> }
>>>
>>> with an appropriate setter of
>>> course. By
>>> doing this
>>> we can completely
>>> remove the os_cpu-specific
>>> ThreadLocalStorage
>>> implementations, and the
>>> associated
>>> os::thread_local_storage* calls.
>>>
>>> As __thread is not
>>> async-signal-safe (either in
>>> theory or practice) we
>>> need a fallback library-based
>>> solution as used
>>> today. For this we use a
>>> very simple ThreadLocalStorage
>>> class and an
>>> implementation thereof for
>>> POSIX (covers AIX, BSD, Linux,
>>> OSX and
>>> Solaris) using
>>> pthread_get/setspecific; and
>>> one for
>>> Windows using
>>> its TLS library.
>>> While these library routines
>>> are not guaranteed
>>> async-signal-safe, they
>>> seem to be in practice and are
>>> what we have
>>> been
>>> using all along.
>>>
>>> We also allow for use of
>>> only the
>>> library-based TLS
>>> for platforms where
>>> compiler-based thread locals
>>> are not
>>> supported (as
>>> will be needed in
>>> the
>>> Mobile project). This is
>>> enabled at build
>>> time by
>>> defining
>>> USE_LIBRARY_BASED_TLS_ONLY.
>>>
>>> Thanks to Andrew Haley for
>>> providing the
>>> Aarch64
>>> code; and for Thomas
>>> Stuefe for testing PPC and
>>> AIX.
>>>
>>> Testing:
>>> - JPRT (core platforms)
>>> - Jtreg tests (linux &
>>> solaris)
>>> - vm.runtime (core
>>> platforms)
>>>
>>> Performance:
>>> - still TBD - this is
>>> proving to be
>>> extremely
>>> difficult. If anyone
>>> has
>>> any simple to run
>>> microbenchmarks to
>>> suggest I'd
>>> give them a try as a
>>> sanity check. But I lack
>>> access to hardware for
>>> running serious
>>> benchmarking.
>>>
>>> Footprint:
>>> - varies by platform and the
>>> VM (server,
>>> client,
>>> minimal)
>>> - worst-case: ~1% increase for
>>> server VM
>>> and minimal VM
>>> - best-case: 0.4% decrease
>>> for client VM
>>>
>>> Thanks,
>>> David
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Bertrand Delsart, Grenoble
>>> Engineering Center
>>> Oracle, 180 av. de l'Europe, ZIRST de
>>> Montbonnot
>>> 38330 Montbonnot Saint Martin, FRANCE
>>> bertrand.delsart at oracle.com
>>> <mailto:bertrand.delsart at oracle.com>
>>> <mailto:bertrand.delsart at oracle.com
>>> <mailto:bertrand.delsart at oracle.com>>
>>> Phone : +33 4 76 18 81 23
>>>
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> NOTICE: This email message is for the sole use of the
>>> intended
>>> recipient(s) and may contain confidential and
>>> privileged
>>> information. Any unauthorized review, use,
>>> disclosure or
>>> distribution is prohibited. If you are not the intended
>>> recipient,
>>> please contact the sender by reply email and destroy
>>> all copies of
>>> the original message.
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>
>>>
>>>
>
More information about the hotspot-dev
mailing list