RFR (S) 8235962: os::current_thread_id() is not signal safe on macOS
gerard ziemski
gerard.ziemski at oracle.com
Wed Jan 22 17:42:46 UTC 2020
Thank you David for taking the time to follow the explanation and the
review!
I still need a second review please.
On 1/16/20 4:07 PM, David Holmes wrote:
> Hi Gerard,
>
> Thank you for the very detailed discussion on mach ports! Please add
> all of that to the bug report (plus any references/links to where you
> found the info).
>
> I'm good with the current fix to "do the right thing" and keep the
> reference count in sync.
>
> Thanks,
> David
>
> On 17/01/2020 5:14 am, gerard ziemski wrote:
>> Hi David,
>>
>> Thank you very much for the review.
>>
>>
>> On 1/13/20 8:49 PM, David Holmes wrote:
>>> Hi Gerard,
>>>
>>> Thanks for tackling this. All extremely confusing as usual :) far
>>> too many different notions of "thread identity".
>>>
>>> On 14/01/2020 4:43 am, gerard ziemski wrote:
>>>> hi all,
>>>>
>>>> Before JDK-8022808 we used to call "mach_thread_self()" to obtain a
>>>> thread id, but back then we forgot to release the returned mach
>>>> port, which would make a long running java process (like
>>>> Kitchensink) eventually run out of resources and hang it.
>>>>
>>>> To solve that, in JDK-8022808, we switched to using
>>>> "pthread_mach_thread_np(pthread_self())", which returns a static
>>>> mach port, which does not need to be released.
>>>> However, the "pthread_mach_thread_np()" API, which we started using
>>>> is specifically knownn as not async-signal-safe API and that might
>>>> cause an issue, because we use it to generate thread ids in
>>>> hs_err_pid.log crash log files in our signal handler, which
>>>> requires that only async-signal-safe API be used.
>>>>
>>>> Our crash handler is best effort approach, and to this extend I'd
>>>> like to propose a fix, which reverts back to using
>>>> "mach_thread_self()", which I must admit is unknown whether it
>>>> itself is async-signal-safe, though it's not part of pthread API,
>>>> like "pthread_mach_thread_np()" which itself is not part of
>>>> async-signal-safe APIs (IEEE Std 1003.1, 2016). Additionally, the
>>>> reporter of the issue suggest that it solves the actual issue with
>>>> using "pthread_mach_thread_np()" for them in their project, so
>>>> let's benefit from their experience.
>>>
>>> Okay that seems reasonable.
>>>
>>>> In this fix we revert back to using "mach_thread_self()" API, but
>>>> make sure to release the mach port, and use MACH_PORT_VALID() macro
>>>> to verify the validity of mach port.
>>>
>>> I'm unclear on this aspect. Now that we are releasing it doesn't
>>> that mean that the port value could be reused and applied to a
>>> different thread? It isn't at all clear to me what a "mach port"
>>> actually means and what releasing it (or failing to) means.
>>
>> I've done more research and testing and here is how I understand our
>> usage of mach ports:
>>
>> Mach ports are a resource a code needs in order to talk to mach
>> thread (or task), though we simply use its value as thread id. They
>> (i.e. the mach kernel) use reference counting to keep track of
>> whether they need to exist, and get automatically destroyed when the
>> ref count reaches 0.
>>
>> They can be created via APIs like mach_port_allocate(), and deleted
>> directly via mach_port_destroy(), or indirectly via
>> mach_port_deallocate(). The difference between these two (inaptly
>> named imho) APIs is that while mach_port_destroy() will outright
>> delete the port, mach_port_deallocate() will only decrease the ref
>> count and will call mach_port_destroy() implicitly only if it reaches 0.
>>
>> If the user obtains mach port via mach_thread_self(), the ref count
>> of that thread will increase, which needs to be matched by a call to
>> mach_port_deallocate(). However, one can not delete such special mach
>> port via mach_port_destroy() (only user own created ports can de
>> deleted).
>>
>> Back in 2013 a call to mach_thread_self() would create a brand new
>> mach port, so if you did not call mach_port_deallocate() after being
>> done with it, the resource would be leaked. I believe that was
>> actually a bug in early OS X, or possibly a bad API design, which
>> affected many other users, not just us (i.e. Google's Chrome app for
>> Mac). The confusion was primary due to the fact that mach_task_self()
>> would not create a new port, so developers expected the same from
>> mach_thread_self(). There are only about 300,000 ports available to a
>> process (verified via a native sample that keeps creating new ports
>> via mach_port_allocate() API), so eventually long lived apps, like
>> our kitchensink, would run out of them.
>>
>> We could have fixed our issue by matching mach_thread_self() with
>> mach_port_deallocate() even back then, but instead we chose to use
>> pthread_mach_thread_np() API, which is documented to return thread's
>> static mach port, which does not need to be returned via
>> mach_port_deallocate(). I think that solution just bypassed the
>> mystery of mach ports and when and if we need to delete them, but
>> also was more efficient as it did not create a new port.
>>
>> Today, mach_thread_self() on OS X behaves differently, to how it did
>> back then - it no longer creates new mach ports, and so in most cases
>> the matching call to mach_port_deallocate() could be skipped, as the
>> counter the mach kernel uses for ref count is a very big number.
>> Though it would eventually overflow, and the kernel then would
>> probably delete the mach port with tragic results. I tried to see if
>> I could make that happen, but no luck in 24 hours of a test case
>> continuously asking for mach_thread_self() every 10 micro seconds.
>>
>> So today we could have simply reverted to how the code used to be,
>> and not even bother to call mach_port_deallocate() after
>> mach_thread_self(), though theoretically that could overflow the ref
>> count eventually.
>>
>> The proposed fix, reverts back to using mach_thread_self() (which
>> increases the ref count) because it's probably signal-async-safe,
>> unlike pthread_mach_thread_np(), AND calls mach_port_deallocate()
>> (which decreases the refcount)
>>
>> Someone please correct me if I got anything wrong here.
>>
>>
>>>
>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8235962
>>>> webrev: http://cr.openjdk.java.net/~gziemski/8235962_rev1/index.html
>>>
>>> I found the refactoring around _unique_thread_id a bit hard to
>>> follow. Your new set_unique_thread_id needs to use "#ifdef
>>> __APPLE__" otherwise it won't compile on other BSD systems.
>>
>> I will fix it, test it and post a new webrev, assuming everyone is
>> satisfied with my answers above.
I chose the pattern of the original code and decided not to #ifdef the
"_unique_thread_id" field (or the API), so that I don't have to touch
the file "vmStructs_bsd_x86.hpp" where it's exported, and only #ifdef
the implementation of "set_unique_thread_id()" and the client calls to
it, similar to the original code:
bug: https://bugs.openjdk.java.net/browse/JDK-8235962
webrev: http://cr.openjdk.java.net/~gziemski/8235962_rev2/index.htm
Tested with Mach5 tier1,2,3,4,5 (and verified by watching the number of
process ports while executing Kitchensink)
cheers
More information about the hotspot-runtime-dev
mailing list