RFR (S) 8235962: os::current_thread_id() is not signal safe on macOS

Chris Plummer chris.plummer at oracle.com
Fri Jan 24 22:41:19 UTC 2020


Hi Gerard,

Changes look good to me. Thanks for taking care of this.

Chris

On 1/22/20 9:42 AM, gerard ziemski wrote:
> Thank you David for taking the time to follow the explanation and the 
> review!
>
> I still need a second review please.
>
>
>
> On 1/16/20 4:07 PM, David Holmes wrote:
>> Hi Gerard,
>>
>> Thank you for the very detailed discussion on mach ports! Please add 
>> all of that to the bug report (plus any references/links to where you 
>> found the info).
>>
>> I'm good with the current fix to "do the right thing" and keep the 
>> reference count in sync.
>>
>> Thanks,
>> David
>>
>> On 17/01/2020 5:14 am, gerard ziemski wrote:
>>> Hi David,
>>>
>>> Thank you very much for the review.
>>>
>>>
>>> On 1/13/20 8:49 PM, David Holmes wrote:
>>>> Hi Gerard,
>>>>
>>>> Thanks for tackling this. All extremely confusing as usual :) far 
>>>> too many different notions of "thread identity".
>>>>
>>>> On 14/01/2020 4:43 am, gerard ziemski wrote:
>>>>> hi all,
>>>>>
>>>>> Before JDK-8022808 we used to call "mach_thread_self()" to obtain 
>>>>> a thread id, but back then we forgot to release the returned mach 
>>>>> port, which would make a long running java process (like 
>>>>> Kitchensink) eventually run out of resources and hang it.
>>>>>
>>>>> To solve that, in JDK-8022808, we switched to using 
>>>>> "pthread_mach_thread_np(pthread_self())", which returns a static 
>>>>> mach port, which does not need to be released.
>>>>> However, the "pthread_mach_thread_np()" API, which we started 
>>>>> using is specifically knownn as not async-signal-safe API and that 
>>>>> might cause an issue, because we use it to generate thread ids in 
>>>>> hs_err_pid.log crash log files in our signal handler, which 
>>>>> requires that only async-signal-safe API be used.
>>>>>
>>>>> Our crash handler is best effort approach, and to this extend I'd 
>>>>> like to propose a fix, which reverts back to using 
>>>>> "mach_thread_self()", which I must admit is unknown whether it 
>>>>> itself is async-signal-safe, though it's not part of pthread API, 
>>>>> like "pthread_mach_thread_np()" which itself is not part of 
>>>>> async-signal-safe APIs (IEEE Std 1003.1, 2016). Additionally, the 
>>>>> reporter of the issue suggest that it solves the actual issue with 
>>>>> using "pthread_mach_thread_np()" for them in their project, so 
>>>>> let's benefit from their experience.
>>>>
>>>> Okay that seems reasonable.
>>>>
>>>>> In this fix we revert back to using "mach_thread_self()" API, but 
>>>>> make sure to release the mach port, and use MACH_PORT_VALID() 
>>>>> macro to verify the validity of mach port.
>>>>
>>>> I'm unclear on this aspect. Now that we are releasing it doesn't 
>>>> that mean that the port value could be reused and applied to a 
>>>> different thread? It isn't at all clear to me what a "mach port" 
>>>> actually means and what releasing it (or failing to) means.
>>>
>>> I've done more research and testing and here is how I understand our 
>>> usage of mach ports:
>>>
>>> Mach ports are a resource a code needs in order to talk to mach 
>>> thread (or task), though we simply use its value as thread id. They 
>>> (i.e. the mach kernel) use reference counting to keep track of 
>>> whether they need to exist, and get automatically destroyed when the 
>>> ref count reaches 0.
>>>
>>> They can be created via APIs like mach_port_allocate(), and deleted 
>>> directly via mach_port_destroy(), or indirectly via 
>>> mach_port_deallocate(). The difference between these two (inaptly 
>>> named imho) APIs is that while mach_port_destroy() will outright 
>>> delete the port, mach_port_deallocate() will only decrease the ref 
>>> count and will call mach_port_destroy() implicitly only if it 
>>> reaches 0.
>>>
>>> If the user obtains mach port via mach_thread_self(), the ref count 
>>> of that thread will increase, which needs to be matched by a call to 
>>> mach_port_deallocate(). However, one can not delete such special 
>>> mach port via mach_port_destroy() (only user own created ports can 
>>> de deleted).
>>>
>>> Back in 2013 a call to mach_thread_self() would create a brand new 
>>> mach port, so if you did not call mach_port_deallocate() after being 
>>> done with it, the resource would be leaked. I believe that was 
>>> actually a bug in early OS X, or possibly a bad API design, which 
>>> affected many other users, not just us (i.e. Google's Chrome app for 
>>> Mac). The confusion was primary due to the fact that 
>>> mach_task_self() would not create a new port, so developers expected 
>>> the same from mach_thread_self(). There are only about 300,000 ports 
>>> available to a process (verified via a native sample that keeps 
>>> creating new ports via mach_port_allocate() API), so eventually long 
>>> lived apps, like our kitchensink, would run out of them.
>>>
>>> We could have fixed our issue by matching mach_thread_self() with 
>>> mach_port_deallocate() even back then, but instead we chose to use 
>>> pthread_mach_thread_np() API, which is documented to return thread's 
>>> static mach port, which does not need to be returned via 
>>> mach_port_deallocate(). I think that solution just bypassed the 
>>> mystery of mach ports and when and if we need to delete them, but 
>>> also was more efficient as it did not create a new port.
>>>
>>> Today, mach_thread_self() on OS X behaves differently, to how it did 
>>> back then - it no longer creates new mach ports, and so in most 
>>> cases the matching call to mach_port_deallocate() could be skipped, 
>>> as the counter the mach kernel uses for ref count is a very big 
>>> number. Though it would eventually overflow, and the kernel then 
>>> would probably delete the mach port with tragic results. I tried to 
>>> see if I could make that happen, but no luck in 24 hours of a test 
>>> case continuously asking for mach_thread_self() every 10 micro seconds.
>>>
>>> So today we could have simply reverted to how the code used to be, 
>>> and not even bother to call mach_port_deallocate() after 
>>> mach_thread_self(), though theoretically that could overflow the ref 
>>> count eventually.
>>>
>>> The proposed fix, reverts back to using mach_thread_self() (which 
>>> increases the ref count) because it's probably signal-async-safe, 
>>> unlike pthread_mach_thread_np(), AND calls mach_port_deallocate() 
>>> (which decreases the refcount)
>>>
>>> Someone please correct me if I got anything wrong here.
>>>
>>>
>>>>
>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8235962
>>>>> webrev: http://cr.openjdk.java.net/~gziemski/8235962_rev1/index.html
>>>>
>>>> I found the refactoring around _unique_thread_id a bit hard to 
>>>> follow. Your new set_unique_thread_id needs to use "#ifdef 
>>>> __APPLE__" otherwise it won't compile on other BSD systems.
>>>
>>> I will fix it, test it and post a new webrev, assuming everyone is 
>>> satisfied with my answers above.
>
> I chose the pattern of the original code and decided not to #ifdef the 
> "_unique_thread_id" field (or the API), so that I don't have to touch 
> the file "vmStructs_bsd_x86.hpp" where it's exported, and only #ifdef 
> the implementation of "set_unique_thread_id()" and the client calls to 
> it, similar to the original code:
>
> bug: https://bugs.openjdk.java.net/browse/JDK-8235962
> webrev: http://cr.openjdk.java.net/~gziemski/8235962_rev2/index.htm
> Tested with Mach5 tier1,2,3,4,5 (and verified by watching the number 
> of process ports while executing Kitchensink)
>
>
> cheers



More information about the hotspot-runtime-dev mailing list