RFR (S) 8235962: os::current_thread_id() is not signal safe on macOS
Chris Plummer
chris.plummer at oracle.com
Fri Jan 24 22:41:19 UTC 2020
Hi Gerard,
Changes look good to me. Thanks for taking care of this.
Chris
On 1/22/20 9:42 AM, gerard ziemski wrote:
> Thank you David for taking the time to follow the explanation and the
> review!
>
> I still need a second review please.
>
>
>
> On 1/16/20 4:07 PM, David Holmes wrote:
>> Hi Gerard,
>>
>> Thank you for the very detailed discussion on mach ports! Please add
>> all of that to the bug report (plus any references/links to where you
>> found the info).
>>
>> I'm good with the current fix to "do the right thing" and keep the
>> reference count in sync.
>>
>> Thanks,
>> David
>>
>> On 17/01/2020 5:14 am, gerard ziemski wrote:
>>> Hi David,
>>>
>>> Thank you very much for the review.
>>>
>>>
>>> On 1/13/20 8:49 PM, David Holmes wrote:
>>>> Hi Gerard,
>>>>
>>>> Thanks for tackling this. All extremely confusing as usual :) far
>>>> too many different notions of "thread identity".
>>>>
>>>> On 14/01/2020 4:43 am, gerard ziemski wrote:
>>>>> hi all,
>>>>>
>>>>> Before JDK-8022808 we used to call "mach_thread_self()" to obtain
>>>>> a thread id, but back then we forgot to release the returned mach
>>>>> port, which would make a long running java process (like
>>>>> Kitchensink) eventually run out of resources and hang it.
>>>>>
>>>>> To solve that, in JDK-8022808, we switched to using
>>>>> "pthread_mach_thread_np(pthread_self())", which returns a static
>>>>> mach port, which does not need to be released.
>>>>> However, the "pthread_mach_thread_np()" API, which we started
>>>>> using is specifically knownn as not async-signal-safe API and that
>>>>> might cause an issue, because we use it to generate thread ids in
>>>>> hs_err_pid.log crash log files in our signal handler, which
>>>>> requires that only async-signal-safe API be used.
>>>>>
>>>>> Our crash handler is best effort approach, and to this extend I'd
>>>>> like to propose a fix, which reverts back to using
>>>>> "mach_thread_self()", which I must admit is unknown whether it
>>>>> itself is async-signal-safe, though it's not part of pthread API,
>>>>> like "pthread_mach_thread_np()" which itself is not part of
>>>>> async-signal-safe APIs (IEEE Std 1003.1, 2016). Additionally, the
>>>>> reporter of the issue suggest that it solves the actual issue with
>>>>> using "pthread_mach_thread_np()" for them in their project, so
>>>>> let's benefit from their experience.
>>>>
>>>> Okay that seems reasonable.
>>>>
>>>>> In this fix we revert back to using "mach_thread_self()" API, but
>>>>> make sure to release the mach port, and use MACH_PORT_VALID()
>>>>> macro to verify the validity of mach port.
>>>>
>>>> I'm unclear on this aspect. Now that we are releasing it doesn't
>>>> that mean that the port value could be reused and applied to a
>>>> different thread? It isn't at all clear to me what a "mach port"
>>>> actually means and what releasing it (or failing to) means.
>>>
>>> I've done more research and testing and here is how I understand our
>>> usage of mach ports:
>>>
>>> Mach ports are a resource a code needs in order to talk to mach
>>> thread (or task), though we simply use its value as thread id. They
>>> (i.e. the mach kernel) use reference counting to keep track of
>>> whether they need to exist, and get automatically destroyed when the
>>> ref count reaches 0.
>>>
>>> They can be created via APIs like mach_port_allocate(), and deleted
>>> directly via mach_port_destroy(), or indirectly via
>>> mach_port_deallocate(). The difference between these two (inaptly
>>> named imho) APIs is that while mach_port_destroy() will outright
>>> delete the port, mach_port_deallocate() will only decrease the ref
>>> count and will call mach_port_destroy() implicitly only if it
>>> reaches 0.
>>>
>>> If the user obtains mach port via mach_thread_self(), the ref count
>>> of that thread will increase, which needs to be matched by a call to
>>> mach_port_deallocate(). However, one can not delete such special
>>> mach port via mach_port_destroy() (only user own created ports can
>>> de deleted).
>>>
>>> Back in 2013 a call to mach_thread_self() would create a brand new
>>> mach port, so if you did not call mach_port_deallocate() after being
>>> done with it, the resource would be leaked. I believe that was
>>> actually a bug in early OS X, or possibly a bad API design, which
>>> affected many other users, not just us (i.e. Google's Chrome app for
>>> Mac). The confusion was primary due to the fact that
>>> mach_task_self() would not create a new port, so developers expected
>>> the same from mach_thread_self(). There are only about 300,000 ports
>>> available to a process (verified via a native sample that keeps
>>> creating new ports via mach_port_allocate() API), so eventually long
>>> lived apps, like our kitchensink, would run out of them.
>>>
>>> We could have fixed our issue by matching mach_thread_self() with
>>> mach_port_deallocate() even back then, but instead we chose to use
>>> pthread_mach_thread_np() API, which is documented to return thread's
>>> static mach port, which does not need to be returned via
>>> mach_port_deallocate(). I think that solution just bypassed the
>>> mystery of mach ports and when and if we need to delete them, but
>>> also was more efficient as it did not create a new port.
>>>
>>> Today, mach_thread_self() on OS X behaves differently, to how it did
>>> back then - it no longer creates new mach ports, and so in most
>>> cases the matching call to mach_port_deallocate() could be skipped,
>>> as the counter the mach kernel uses for ref count is a very big
>>> number. Though it would eventually overflow, and the kernel then
>>> would probably delete the mach port with tragic results. I tried to
>>> see if I could make that happen, but no luck in 24 hours of a test
>>> case continuously asking for mach_thread_self() every 10 micro seconds.
>>>
>>> So today we could have simply reverted to how the code used to be,
>>> and not even bother to call mach_port_deallocate() after
>>> mach_thread_self(), though theoretically that could overflow the ref
>>> count eventually.
>>>
>>> The proposed fix, reverts back to using mach_thread_self() (which
>>> increases the ref count) because it's probably signal-async-safe,
>>> unlike pthread_mach_thread_np(), AND calls mach_port_deallocate()
>>> (which decreases the refcount)
>>>
>>> Someone please correct me if I got anything wrong here.
>>>
>>>
>>>>
>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8235962
>>>>> webrev: http://cr.openjdk.java.net/~gziemski/8235962_rev1/index.html
>>>>
>>>> I found the refactoring around _unique_thread_id a bit hard to
>>>> follow. Your new set_unique_thread_id needs to use "#ifdef
>>>> __APPLE__" otherwise it won't compile on other BSD systems.
>>>
>>> I will fix it, test it and post a new webrev, assuming everyone is
>>> satisfied with my answers above.
>
> I chose the pattern of the original code and decided not to #ifdef the
> "_unique_thread_id" field (or the API), so that I don't have to touch
> the file "vmStructs_bsd_x86.hpp" where it's exported, and only #ifdef
> the implementation of "set_unique_thread_id()" and the client calls to
> it, similar to the original code:
>
> bug: https://bugs.openjdk.java.net/browse/JDK-8235962
> webrev: http://cr.openjdk.java.net/~gziemski/8235962_rev2/index.htm
> Tested with Mach5 tier1,2,3,4,5 (and verified by watching the number
> of process ports while executing Kitchensink)
>
>
> cheers
More information about the hotspot-runtime-dev
mailing list