RFR(xs): 8184339: Thread::current_or_null() should not assert if Posix TLS is not yet initialized

Tue Aug 8 12:23:17 UTC 2017

Hi David,

On Mon, Aug 7, 2017 at 5:26 AM, David Holmes <david.holmes at oracle.com>
wrote:

> Hi Thomas,
>
> On 1/08/2017 11:40 PM, Thomas Stüfe wrote:
>
>> Hi David!
>>
>> On Tue, Aug 1, 2017 at 9:33 AM, David Holmes <david.holmes at oracle.com
>> <mailto:david.holmes at oracle.com>> wrote:
>>
>>     Hi Thomas,
>>
>>     Back from vacation ...
>>
>>     I appreciate the effort to fix this problem, however ... :)
>>
>>         Thomas Stüfe thomas.stuefe at gmail.com <http://gmail.com>
>>
>>         Mon Jul 17 07:00:01 UTC 2017
>>         Hi all,
>>
>>         may I please have reviews + a sponsor for the following fix:
>>
>>         bug: https://bugs.openjdk.java.net/browse/JDK-8184339
>>         <https://bugs.openjdk.java.net/browse/JDK-8184339>
>>         webrev:
>>         http://cr.openjdk.java.net/~stuefe/webrevs/8184339-Thread-cu
>> rrent-or-null-shall-not-assert/webrev.00/webrev/
>>         <http://cr.openjdk.java.net/~stuefe/webrevs/8184339-Thread-c
>> urrent-or-null-shall-not-assert/webrev.00/webrev/>
>>
>>         The problem is caused by the fact that Posix TLS cannot be used
>>         before it
>>         is initialized. It is initialized in os::init(). If we use Posix
>>         TLS (eg
>>         via Thread::current()) before, we assert. It is used now (since
>>         JDK-8183925
>>         <https://bugs.openjdk.java.net/browse/JDK-8183925
>>         <https://bugs.openjdk.java.net/browse/JDK-8183925>>) before
>>         os::init() (see
>>         callstack in bug description).
>>
>>
>>     ... those kind of initialization order changes are exactly what the
>>     assertion was supposed to detect! It can be dangerous to use various
>>     VM facilities "too soon" in the initialization sequence. The change
>>     in JDK-8183925 needs a close examination IMHO before relaxing this
>>     guard. :(
>>
>>
>> I agree with you that using a VM facility "too soon" is the core of this
>> problem. But I am not sure that just forbidding that is a practical
>> solution, because these pre-init-uses have the tendency to creep up on you:
>>
>> There are many functions in the VM which very basic and which are used
>> without regard for initialization sequence. os::malloc() is one example,
>> Thread::current_or_null() is another. It is difficult to find pre-init uses
>> with tests, especially if we have code paths which are only executed on one
>> platform, like in this case.
>>
>> In this case, on AIX, we had two problems, Thread::current_or_null() was
>> asserting, then it was again used in error handling, asserting again, which
>> lead to an annoying recursion difficult to entangle (on AIX with its
>> less-than-optimal debugger at least).
>>
>> To prevent this from happening, one would have to guard every
>> "Thread::current_or_null()" use with something like "if
>> ThreadLocalStorage.is_initialized()". But I think a more practical
>> solution would be to keep these basic functions safe to be called anytime.
>> And maybe make this a clear part of the interface, and test this too.
>> Similar to the Posix list of functions safe to be called in signal handlers.
>>
>
> I don't agree :) I would prefer if nothing were executed prior to VM
> initialization - certainly "on load" actions and C++ static initialization
> actions are very fragile because there is nothing in any of that code to
> indicate that it is being executed in such an environment. So it is far too
> easy for people to make changes to code that is called from that
> environment in ways that are incompatible with that environment.

No disagreement from me. In fact, where I have the choice I very much
prefer strict rules and asserting them to being tolerant and hiding errors
behind a feel-good API.

> Hence using asserts to try and catch when someone makes such a change.
>
> The fact such assertion failures can trigger further problems with error
> reporting just reinforces the basic problem. Error reporting makes lots of
> assumptions about the state of things and should not be called "very early"
> in the VM init sequence.

Here I disagree. Error reporting is a perfect example for code that should
always work, regardless the phase. Maybe the difference is in how we (SAP)
do analyze customer issues compared to how Oracle does it.

For us, robust and rich hs-err files are very important. We hardened and
expanded error reporting a lot, so much in fact that a large percentage of
issues at customer sites can be analyzed with our error reports alone; we
often do not need core files. We did this because getting core files from
customer systems is complicated and fragile, and tool support for analyzing
cores is terrible on some OSes (like AIX or HPUX). So, robust error
reporting is one of our most important tool.

And quite a lot of problems happen before initialization or also at
shutdown, after cleanup. Or maybe in threads which only happen to share the
process space with us, without being attached to the VM. Without useful
error reporting in those corners support would be much more difficult.

That is why I'd like to have a small subset of ubiquitous VM functions to
always work, regardless of life phase. Top priority for us is error
reporting.

>
> Thread::current_or_null() could return NULL before initialization,
>> os::malloc() could refrain from using NMT before its initialization ran,
>> and NMT in turn could tolerate "free/realloc-without-malloc" cases...
>>
>> (Note that this discussion feels similar to the "CanUseSafeFetch()" issue
>> we had two years ago, and we did not agree then either :)
>>
>
> :)
>
> Cheers,
> David
>
>
>
Kind Regards, Thomas

>
>>
>>         There are two functions, Thread::current() and
>>         Thread::current_or_null().
>>         The latter should not assert if Posix TLS is not yet available
>>         but return
>>         NULL instead. This is what callers expect: this function is safe
>>         to call,
>>         but it might not return a valid Thread*.
>>
>>
>>     Yes but because the current thread has not yet been set, not because
>>     we're executing the code well before any VM initialization has been
>>     performed! :(
>>
>>
>> I think anyone using Thread::current_or_null() is aware that the thread
>> pointer returned may be NULL and is prepared to handle it, but does not
>> care for the reason it is NULL. Like, in error handling.
>>
>>         Note that this issue currently only comes up at AIX, because AIX
>>         is the
>>         only platform using Posix TLS for Thread::current() - all other
>>         platforms
>>         use Compiler-based TLS (__thread variables).
>>
>>
>>     It's also used in the mobile-dev project IIRC, but they don't have
>>     JDK-8183925 applied.
>>
>>         However, we want to keep the Posix TLS code path alive, so this
>>         may also
>>         affect other platforms. There have been compiler bugs in the
>>         past (e.g. gcc
>>         + glibc) leading to errors when using compiler-based TLS, so it
>>         is good to
>>         have a reliable fallback.
>>
>>
>>     Definitely!
>>
>>
>> Good to be in agreement here!
>> Kind Regards, Thomas
>>
>>     Cheers,
>>     David
>>
>>         Thanks,
>>
>>         Thomas
>>
>>
>>