RFR: 8132510: Replace ThreadLocalStorage with compiler/language-based thread-local variables
David Holmes
david.holmes at oracle.com
Tue Dec 1 21:55:09 UTC 2015
Thanks Dan!
David
On 2/12/2015 12:28 AM, Daniel D. Daugherty wrote:
> > http://cr.openjdk.java.net/~dholmes/8132510/webrev.v10
>
> Rereviewed by comparing this webrev's patch with the patch from
> webrev.v8.
>
> Thumbs up.
>
> Dan
>
>
> On 11/27/15 12:26 AM, David Holmes wrote:
>> Hi Thomas,
>>
>> On 26/11/2015 7:56 PM, Thomas Stüfe wrote:
>>>
>>> On Thu, Nov 26, 2015 at 12:51 AM, David Holmes <david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>> wrote:
>>>
>>> On 26/11/2015 9:08 AM, Daniel D. Daugherty wrote:
>>>
>>> If I understand the problem... :-)
>>>
>>> The Win-X64 ABI requires that some extra space be allocated on
>>> the stack in the caller for "homing" four registers...
>>>
>>> This code uses MacroAssembler::call() which is a lower level
>>> call() function that assumes you know the ABI...
>>>
>>> 9765 void MacroAssembler::get_thread(Register thread) {
>>> <snip>
>>> 9784 call(RuntimeAddress(CAST_FROM_FN_PTR(address,
>>> Thread::current)));
>>>
>>> So the failing here was nothing to do with register saving per-se,
>>> nor the alignment code, but "simply" not also saving the extra space
>>> prior to the call!
>>>
>>> Yes. These 32 byte are a scratch area for the callee, which may be used
>>> for parameter registers or anything else. We may just have been lucky so
>>> far that TlsGetValue() did not use this area. As Bertrand already said,
>>> now we call Thread::current(), which does magic tls accesses under the
>>> hood, so this must be tested again.
>>
>> Right. All testing so far has been successful but ...
>>
>>> However I think that a simple addition to your current fix could simply
>>> be subtracting 32 from rsp before the call:
>>>
>>> subq(rsp, 32)
>>> call(RuntimeAddress(CAST_FROM_FN_PTR(address, Thread::current)));
>>> addq(rsp, 32)
>>
>> ... I decided to do what was previously suggested and use:
>>
>> call_VM_leaf_base(CAST_FROM_FN_PTR(address, Thread::current), 0);
>>
>> as it takes care of everything (stack alignment if needed and
>> preserving space for the register args on win64).
>>
>>> Also, we could probably skip saving/restoring rdi and rsi on Windows:
>>
>> Maybe - but we don't seem to consider that anywhere else ie:
>>
>> void MacroAssembler::push_callee_saved_registers() {
>> push(rsi);
>> push(rdi);
>> push(rdx);
>> push(rcx);
>> }
>>
>> so I don't want to be the trail-blazer here. :)
>>
>> Updated webrev:
>>
>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v10
>>
>> Only change to macroAssembler_x86.cpp. Re-tested on Windows 32-bit and
>> 64-bit.
>>
>> Thanks,
>> David
>>
>>> https://msdn.microsoft.com/en-us/library/6t169e9c.aspx
>>>
>>> "The registers RAX, RCX, RDX, R8, R9, R10, R11 are considered volatile
>>> and must be considered destroyed on function calls (unless otherwise
>>> safety-provable by analysis such as whole program optimization). The
>>> registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, and R15 are considered
>>> nonvolatile and must be saved and restored by a function that uses
>>> them."
>>
>>
>>
>>> Best Regards, Thomas
>>>
>>>
>>> If you switch to MacroAssembler::call_VM_leaf_base() it handles
>>> more of the gory details for you than MacroAssembler::call():
>>>
>>>
>>> Wow! Not exactly a clearly-defined name for this. AFAICS there's
>>> nothing VM or leaf specific about this. Also the MacroAssembler is
>>> supposed to be a platform-agnostic higher-level abstraction that is
>>> easy to use - but in this case every use of the plain call on
>>> Windows-x64 is potentially broken if the programmer was not aware of
>>> this ABI constraint which only applies to Windows (and we're in
>>> OS-neutral files!).
>>>
>>> This seems fundamentally broken to me.
>>>
>>> Thanks for the info.
>>>
>>> David
>>>
>>>
>>> src/cpu/x86/vm/macroAssembler_x86.cpp:
>>>
>>> 524 void MacroAssembler::call_VM_leaf_base(address
>>> entry_point, int
>>> num_args
>>> ) {
>>> 525 Label L, E;
>>> 526
>>> 527 #ifdef _WIN64
>>> 528 // Windows always allocates space for it's
>>> register args
>>> 529 assert(num_args <= 4, "only register arguments
>>> supported");
>>> 530 subq(rsp, frame::arg_reg_save_area_bytes);
>>> 531 #endif
>>> 532
>>> 533 // Align stack if necessary
>>> 534 testl(rsp, 15);
>>> 535 jcc(Assembler::zero, L);
>>> 536
>>> 537 subq(rsp, 8);
>>> 538 {
>>> 539 call(RuntimeAddress(entry_point));
>>> 540 }
>>>
>>> So call_VM_leaf_base() allocates the "homing" space for register
>>> args for Win-X64 on L530 and it handles any stack alignment
>>> before
>>> calling MacroAssembler::call().
>>>
>>> So if you switch to call_VM_leaf_base() you can drop
>>> the stack alignment code:
>>>
>>> 9777 // ensure stack alignment
>>> 9778 mov(r10, rsp);
>>> 9779 andq(rsp, -16);
>>> 9780 push(r10);
>>>
>>> and just switch to:
>>>
>>> 9777 push(rsp);
>>>
>>> And you won't be exposed to potential saved register
>>> overwrite by the "homing" of any registers in the call
>>> Thread::current().
>>>
>>> I _think_ I have that all correct... :-)
>>>
>>> Dan
>>>
>>>
>>> On 11/25/15 3:52 PM, David Holmes wrote:
>>>
>>> On 26/11/2015 2:53 AM, Thomas Stüfe wrote:
>>>
>>> I created a bug to track this:
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8144059
>>>
>>>
>>> Thanks. I admit I still don't really grok the exact problem
>>> here. It
>>> is the alignment code that is the problem right?
>>>
>>> 79 mov(r10, rsp);
>>> 80 andq(rsp, -16);
>>> 81 push(r10);
>>>
>>> ?
>>>
>>> Thanks,
>>> David
>>>
>>> Because Davids change is not submitted yet, the bug
>>> refers to the old
>>> coding, but stays relevant also after this fix.
>>>
>>> Kind Regards, Thomas
>>>
>>> On Wed, Nov 25, 2015 at 3:52 PM, Daniel D. Daugherty
>>> <daniel.daugherty at oracle.com
>>> <mailto:daniel.daugherty at oracle.com>
>>> <mailto:daniel.daugherty at oracle.com
>>> <mailto:daniel.daugherty at oracle.com>>>
>>> wrote:
>>>
>>> On 11/25/15 6:10 AM, David Holmes wrote:
>>>
>>> On 25/11/2015 8:46 PM, Thomas Stüfe wrote:
>>>
>>>
>>> On Wed, Nov 25, 2015 at 10:22 AM, Bertrand
>>> Delsart
>>> <bertrand.delsart at oracle.com
>>> <mailto:bertrand.delsart at oracle.com>
>>> <mailto:bertrand.delsart at oracle.com
>>> <mailto:bertrand.delsart at oracle.com>>
>>> <mailto:bertrand.delsart at oracle.com
>>> <mailto:bertrand.delsart at oracle.com>
>>> <mailto:bertrand.delsart at oracle.com
>>> <mailto:bertrand.delsart at oracle.com>>>> wrote:
>>>
>>> Good finding Thomas,
>>>
>>> On 25/11/2015 09:40, Thomas Stüfe
>>> wrote:
>>>
>>> Hi David, Bertrand,
>>>
>>> about
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v8/src/cpu/x86/vm/macroAssembler_x86.cpp.frames.html
>>>
>>>
>>>
>>> I am not sure this is correct for
>>> Windows x64.
>>>
>>>
>>> Unless I'm misreading it, the new code
>>> looks a lot like
>>> the old one
>>> (including the alignement part). Hence
>>> this does not
>>> look like a
>>> regression caused by David's change.
>>>
>>>
>>> Yes, this seems like an old bug.
>>>
>>> I also see that this is fixed in our
>>> codebase since 2010,
>>> exactly the
>>> way you propose (with MacroAssembler
>>> <http://ld8443:8080/source/s?defs=MacroAssembler&project=dev-hotspot>::call_VM_leaf_base
>>>
>>>
>>>
>>> <http://ld8443:8080/source/s?defs=call_VM_leaf_base&project=dev-hotspot>).
>>>
>>>
>>>
>>> So, it is already used since five years. I
>>> would propose
>>> opening a
>>> separate bug for this one and we contribute
>>> our fix.
>>>
>>>
>>> That sounds good to me too. I also suspect
>>> additional code will
>>> need to be checked as this stack alignment is
>>> performed in a few
>>> places in ./cpu/x86/vm/macroAssembler_x86.cpp
>>> with no OS
>>> specific differences being considered.
>>>
>>>
>>> Looks like MacroAssembler::call_VM_leaf_base() does
>>> the "right
>>> thing"
>>> for stack alignment in the X64 case. Perhaps we
>>> need to audit both
>>> stack alignment _and_ uses of
>>> MacroAssembler::call() just to make
>>> sure...
>>>
>>> Did I mention that all the assembly code stuff
>>> gives me a
>>> headache? :-)
>>>
>>> Dan
>>>
>>>
>>>
>>>
>>> Thanks,
>>> David
>>>
>>> Regards, Thomas
>>>
>>>
>>> https://msdn.microsoft.com/en-us/library/ew5tede7(v=vs.90).aspx
>>>
>>> windows x64 Abi requires the
>>> register parameter
>>> area right above the
>>> return address on the stack. It
>>> is not
>>> guaranteed to be
>>> preserved during
>>> the function call:
>>>
>>> "Even if the called function has
>>> fewer than 4
>>> parameters, these
>>> 4 stack
>>> locations are effectively owned by
>>> the called
>>> function, and may
>>> be used
>>> by the called function for other
>>> purposes besides
>>> saving parameter
>>> register values. Thus the caller
>>> may not save
>>> information in
>>> this region
>>> of stack across a function call."
>>>
>>> So, restoring parameters from it
>>> may not work if
>>> Thread::current() uses
>>> this area for any reason. In this
>>> case, r9, rsp,
>>> r10, r11 may be
>>> overwritten by the callee.
>>>
>>> I also think Windows x64 ABI is
>>> different enough
>>> from Linux that
>>> this
>>> function could live somewhere in
>>> <os>_<cpu>.
>>>
>>>
>>> This is a valid point. However, note
>>> that this is
>>> similar to the
>>> issue we have for instance in 'shared'
>>> MacroAssembler::call_VM_leaf_base,
>>> which was not put in
>>> <os>_<cpu>
>>> files.
>>>
>>> In fact, the right fix may be to use
>>> call_VM_leaf
>>> instead of call :-)
>>>
>>> (alternatively we could use "#ifdef
>>> _WIN64" or check
>>> "frame::arg_reg_save_area_bytes != 0",
>>> which may be
>>> even cleaner and
>>> more portable)
>>>
>>> All these solutions are fine for me.
>>>
>>>
>>> Since we are near to feature freeze,
>>> this should
>>> probably be done in
>>> a separate CR (unless David has enough
>>> time to test
>>> it).
>>>
>>> Regards,
>>>
>>> Bertrand.
>>>
>>>
>>> Kind Regards, Thomas
>>>
>>>
>>>
>>>
>>> On Mon, Nov 23, 2015 at 10:42 PM,
>>> David Holmes
>>> <david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>
>>> <mailto:david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>>
>>> <mailto:david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>
>>> <mailto:david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>>>
>>> <mailto:david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>
>>> <mailto:david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>>
>>> <mailto:david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>
>>> <mailto:david.holmes at oracle.com
>>> <mailto:david.holmes at oracle.com>>>>> wrote:
>>>
>>> On 23/11/2015 11:46 PM,
>>> Bertrand Delsart
>>> wrote:
>>>
>>> On 23/11/2015 13:27,
>>> David Holmes wrote:
>>>
>>> Hi Bertrand,
>>>
>>> On 23/11/2015 8:14
>>> PM, Bertrand
>>> Delsart wrote:
>>>
>>> Hi David,
>>>
>>> Overall looks
>>> good.
>>> Thanks for the
>>> #ifndef
>>> USE_LIBRARY_BASED_TLS_ONLY :-)
>>>
>>> One doubt, in
>>> case this has not
>>> been discussed
>>> before.
>>>
>>> I'm still
>>> catching up on x86_64
>>> (after years of ARM
>>> assembly :-))
>>> but it
>>> seems that there
>>> are some stack
>>> alignment
>>> constraints on
>>> runtime calls,
>>> at least for some
>>> x86_64 ABIs.
>>>
>>> Some of the x86
>>> MacroAssembler::get_thread
>>> implementations
>>> had code to
>>> align the stack
>>> before calling
>>> pthread_getspecific. See
>>> for instance
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/os_cpu/linux_x86/vm/assembler_linux_x86.cpp.udiff.html
>>>
>>>
>>>
>>>
>>>
>>> Sorry I'm not that
>>> familiar with the
>>> MacroAssembler
>>> - is it
>>> this odd
>>> fragment:
>>>
>>> - push(r10);
>>> - // XXX
>>> - mov(r10, rsp);
>>> - andq(rsp, -16);
>>> - push(r10);
>>>
>>> I'm not at all clear
>>> what that is
>>> doing - and if it
>>> somehow
>>> changes the
>>> alignment wouldn't
>>> something need to
>>> be fixed up when
>>> popping the
>>> previous values ??
>>>
>>> To be honest I'm not
>>> even sure what an
>>> "unaligned
>>> stack" means.
>>>
>>>
>>> On some platforms, SP may
>>> have to be
>>> aligned on a 16 bytes
>>> boundary when
>>> calling another method.
>>>
>>>
>>> Okay - that seems to be an
>>> x64 ABI requirement
>>> based on
>>> other code
>>> in MacroAssembler. The reason
>>> it is missing in
>>> the Solaris
>>> code is
>>> because I already removed it
>>> in the earlier
>>> changes that
>>> were done
>>> on Solaris:
>>>
>>> http://hg.openjdk.java.net/jdk9/hs-rt/hotspot/rev/d5b328043c10
>>>
>>> and nobody picked it up. :(
>>>
>>> That said, the absence of
>>> this code has not
>>> caused any
>>> problems so
>>> presumably we are normally
>>> aligned.
>>>
>>> After having pushed what
>>> needed to be
>>> saved, the code above
>>> rounds SP
>>> down and saves the old
>>> value. It will then
>>> also push
>>> r11, which
>>> results
>>> in the expected
>>> alignment.
>>>
>>> The conterpart, after the
>>> VM call, is the:
>>> pop(r11);
>>> pop(rsp);
>>>
>>>
>>> I had assumed this extra
>>> manipulation was
>>> related to
>>> pushing the arg
>>> for the call.
>>>
>>> This alignment is
>>> no longer
>>> performed in the new
>>> (shared)
>>> implementation
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/src/cpu/x86/vm/macroAssembler_x86.cpp.udiff.html
>>>
>>>
>>>
>>>
>>>
>>>
>>> Now, Solaris was
>>> not performing
>>> the alignment and
>>> Windows has a
>>> separate
>>> path for x86_64.
>>> Did we really
>>> need the
>>> alignment for
>>> linux x86_64 and
>>> bsd_x86_64 ?
>>> Might it be needed
>>> for other ports ?
>>>
>>> IMHO, it might be
>>> safer to align
>>> the stack by
>>> default,
>>> knowing it should
>>> not be expensive
>>> since we call
>>> get_thread
>>> rarely for
>>> x86_64 (result is
>>> cached in r15).
>>> I'll let you see
>>> whether it is
>>> worth
>>> adding an
>>> ifdef so
>>> as to explicitly
>>> deactivate the
>>> alignment on
>>> some platforms
>>> (solaris_x86_64 ?)
>>>
>>>
>>> I don't have enough
>>> knowledge to even
>>> begin to
>>> comment on
>>> this. I will
>>> have to rely on our
>>> x86_64
>>> macroassembler experts
>>> to explain
>>> the old
>>> code and what the
>>> requirements are.
>>>
>>>
>>> OK. This means that the
>>> alignment was not
>>> removed
>>> purposefully.
>>>
>>> In that case, you must
>>> either keep the per
>>> platform
>>> code x86_64
>>> (since
>>> it was not identical for
>>> all platforms) or
>>> use the safest
>>> version, with
>>> the additional
>>>
>>> // XXX
>>> mov(r10, rsp);
>>> andq(rsp, -16);
>>> push(r10);
>>>
>>> before the push(r11) and
>>> with the
>>>
>>> pop(rsp);
>>>
>>> after the pop(r11). It
>>> should work on all
>>> x86_64 platforms.
>>>
>>>
>>> Right, I will add this back
>>> to the code - and
>>> replace the
>>> meaningless // XXX with an
>>> actual comment!
>>>
>>> Many thanks,
>>> David
>>> -----
>>>
>>>
>>> FYI, there is another way
>>> to do the
>>> alignment, based on
>>> the fact
>>> that
>>> we are at least aligned
>>> on 8 bytes (see
>>> MacroAssembler::call_VM_leaf_base). I
>>> assume that this
>>> second
>>> version is
>>> more efficient
>>> (particularly thanks to
>>> specilative
>>> branches) but it
>>> might be safer/simpler to
>>> continue using
>>> andq in
>>> get_thread.
>>>
>>> Bertrand.
>>>
>>>
>>> Thanks,
>>> David
>>>
>>> Regards,
>>>
>>> Bertrand.
>>>
>>>
>>>
>>> On 23/11/2015
>>> 08:03, David Holmes
>>> wrote:
>>>
>>> After all the
>>> preliminary
>>> discussions here
>>> are final
>>> proposed changes:
>>>
>>> bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8132510
>>>
>>> Open webrev:
>>> http://cr.openjdk.java.net/~dholmes/8132510/webrev.v6/
>>>
>>> A simple (in
>>> principle) but
>>> wide-ranging change
>>> which should
>>> appeal to
>>> our Code
>>> Deletion Engineer's.
>>> We implement
>>>
>>> Thread::current() using a
>>> compiler/language-based
>>> thread-local
>>> variable eg:
>>>
>>> static
>>> __thread Thread
>>> *_thr_current;
>>>
>>> inline Thread*
>>> Thread::current() {
>>> return
>>> _thr_current;
>>> }
>>>
>>> with an
>>> appropriate setter of
>>> course. By
>>> doing this
>>> we can
>>> completely
>>> remove the
>>> os_cpu-specific
>>> ThreadLocalStorage
>>>
>>> implementations, and the
>>> associated
>>> os::thread_local_storage* calls.
>>>
>>> As __thread
>>> is not
>>> async-signal-safe (either in
>>> theory or
>>> practice) we
>>> need a
>>> fallback library-based
>>> solution as used
>>> today. For
>>> this we use a
>>> very simple
>>> ThreadLocalStorage
>>> class and an
>>>
>>> implementation thereof for
>>> POSIX (covers
>>> AIX, BSD, Linux,
>>> OSX and
>>> Solaris) using
>>> pthread_get/setspecific; and
>>> one for
>>> Windows using
>>> its TLS
>>> library.
>>> While these
>>> library routines
>>> are not guaranteed
>>>
>>> async-signal-safe, they
>>> seem to be in
>>> practice and are
>>> what we have
>>> been
>>> using all
>>> along.
>>>
>>> We also allow
>>> for use of
>>> only the
>>> library-based TLS
>>> for platforms
>>> where
>>>
>>> compiler-based thread locals
>>> are not
>>> supported (as
>>> will be
>>> needed in
>>> the
>>> Mobile
>>> project). This is
>>> enabled at build
>>> time by
>>> defining
>>> USE_LIBRARY_BASED_TLS_ONLY.
>>>
>>> Thanks to
>>> Andrew Haley for
>>> providing the
>>> Aarch64
>>> code; and for
>>> Thomas
>>> Stuefe for
>>> testing PPC and
>>> AIX.
>>>
>>> Testing:
>>> - JPRT
>>> (core platforms)
>>> - Jtreg
>>> tests (linux &
>>> solaris)
>>> -
>>> vm.runtime (core
>>> platforms)
>>>
>>> Performance:
>>> - still
>>> TBD - this is
>>> proving to be
>>> extremely
>>> difficult. If
>>> anyone
>>> has
>>> any simple
>>> to run
>>> microbenchmarks to
>>> suggest I'd
>>> give them a
>>> try as a
>>> sanity check.
>>> But I lack
>>> access to hardware for
>>> running serious
>>> benchmarking.
>>>
>>> Footprint:
>>> - varies by
>>> platform and the
>>> VM (server,
>>> client,
>>> minimal)
>>> - worst-case:
>>> ~1% increase for
>>> server VM
>>> and minimal VM
>>> - best-case:
>>> 0.4% decrease
>>> for client VM
>>>
>>> Thanks,
>>> David
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Bertrand Delsart,
>>> Grenoble
>>> Engineering Center
>>> Oracle, 180 av. de l'Europe,
>>> ZIRST de
>>> Montbonnot
>>> 38330 Montbonnot Saint Martin, FRANCE
>>> bertrand.delsart at oracle.com
>>> <mailto:bertrand.delsart at oracle.com>
>>> <mailto:bertrand.delsart at oracle.com
>>> <mailto:bertrand.delsart at oracle.com>>
>>> <mailto:bertrand.delsart at oracle.com
>>> <mailto:bertrand.delsart at oracle.com>
>>> <mailto:bertrand.delsart at oracle.com
>>> <mailto:bertrand.delsart at oracle.com>>>
>>> Phone : +33 4 76 18 81 23
>>>
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> NOTICE: This email message is for the
>>> sole use of the
>>> intended
>>> recipient(s) and may contain
>>> confidential and
>>> privileged
>>> information. Any unauthorized
>>> review, use,
>>> disclosure or
>>> distribution is prohibited. If you are
>>> not the intended
>>> recipient,
>>> please contact the sender by reply
>>> email and destroy
>>> all copies of
>>> the original message.
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>
>>>
>>>
>>>
>>>
>
More information about the hotspot-dev
mailing list