RFR (S) 8235962: os::current_thread_id() is not signal safe on macOS

Thu Jan 16 22:07:15 UTC 2020

Hi Gerard,

Thank you for the very detailed discussion on mach ports! Please add all 
of that to the bug report (plus any references/links to where you found 
the info).

I'm good with the current fix to "do the right thing" and keep the 
reference count in sync.

Thanks,
David

On 17/01/2020 5:14 am, gerard ziemski wrote:
> Hi David,
> 
> Thank you very much for the review.
> 
> 
> On 1/13/20 8:49 PM, David Holmes wrote:
>> Hi Gerard,
>>
>> Thanks for tackling this. All extremely confusing as usual :) far too 
>> many different notions of "thread identity".
>>
>> On 14/01/2020 4:43 am, gerard ziemski wrote:
>>> hi all,
>>>
>>> Before JDK-8022808 we used to call "mach_thread_self()" to obtain a 
>>> thread id, but back then we forgot to release the returned mach port, 
>>> which would make a long running java process (like Kitchensink) 
>>> eventually run out of resources and hang it.
>>>
>>> To solve that, in JDK-8022808, we switched to using 
>>> "pthread_mach_thread_np(pthread_self())", which returns a static mach 
>>> port, which does not need to be released.
>>> However, the "pthread_mach_thread_np()" API, which we started using 
>>> is specifically knownn as not async-signal-safe API and that might 
>>> cause an issue, because we use it to generate thread ids in 
>>> hs_err_pid.log crash log files in our signal handler, which requires 
>>> that only async-signal-safe API be used.
>>>
>>> Our crash handler is best effort approach, and to this extend I'd 
>>> like to propose a fix, which reverts back to using 
>>> "mach_thread_self()", which I must admit is unknown whether it itself 
>>> is async-signal-safe, though it's not part of pthread API, like 
>>> "pthread_mach_thread_np()" which itself is not part of 
>>> async-signal-safe APIs (IEEE Std 1003.1, 2016). Additionally, the 
>>> reporter of the issue suggest that it solves the actual issue with 
>>> using "pthread_mach_thread_np()" for them in their project, so let's 
>>> benefit from their experience.
>>
>> Okay that seems reasonable.
>>
>>> In this fix we revert back to using "mach_thread_self()" API, but 
>>> make sure to release the mach port, and use MACH_PORT_VALID() macro 
>>> to verify the validity of mach port.
>>
>> I'm unclear on this aspect. Now that we are releasing it doesn't that 
>> mean that the port value could be reused and applied to a different 
>> thread? It isn't at all clear to me what a "mach port" actually means 
>> and what releasing it (or failing to) means.
> 
> I've done more research and testing and here is how I understand our 
> usage of mach ports:
> 
> Mach ports are a resource a code needs in order to talk to mach thread 
> (or task), though we simply use its value as thread id. They (i.e. the 
> mach kernel) use reference counting to keep track of whether they need 
> to exist, and get automatically destroyed when the ref count reaches 0.
> 
> They can be created via APIs like mach_port_allocate(), and deleted 
> directly via mach_port_destroy(), or indirectly via 
> mach_port_deallocate(). The difference between these two (inaptly named 
> imho) APIs is that while mach_port_destroy() will outright delete the 
> port, mach_port_deallocate() will only decrease the ref count and will 
> call mach_port_destroy() implicitly only if it reaches 0.
> 
> If the user obtains mach port via mach_thread_self(), the ref count of 
> that thread will increase, which needs to be matched by a call to 
> mach_port_deallocate(). However, one can not delete such special mach 
> port via mach_port_destroy() (only user own created ports can de deleted).
> 
> Back in 2013 a call to mach_thread_self() would create a brand new mach 
> port, so if you did not call mach_port_deallocate() after being done 
> with it, the resource would be leaked. I believe that was actually a bug 
> in early OS X, or possibly a bad API design, which affected many other 
> users, not just us (i.e. Google's Chrome app for Mac). The confusion was 
> primary due to the fact that mach_task_self() would not create a new 
> port, so developers expected the same from mach_thread_self(). There are 
> only about 300,000 ports available to a process (verified via a native 
> sample that keeps creating new ports via mach_port_allocate() API), so 
> eventually long lived apps, like our kitchensink, would run out of them.
> 
> We could have fixed our issue by matching mach_thread_self() with 
> mach_port_deallocate() even back then, but instead we chose to use 
> pthread_mach_thread_np() API, which is documented to return thread's 
> static mach port, which does not need to be returned via 
> mach_port_deallocate(). I think that solution just bypassed the mystery 
> of mach ports and when and if we need to delete them, but also was more 
> efficient as it did not create a new port.
> 
> Today, mach_thread_self() on OS X behaves differently, to how it did 
> back then - it no longer creates new mach ports, and so in most cases 
> the matching call to mach_port_deallocate() could be skipped, as the 
> counter the mach kernel uses for ref count is a very big number. Though 
> it would eventually overflow, and the kernel then would probably delete 
> the mach port with tragic results. I tried to see if I could make that 
> happen, but no luck in 24 hours of a test case continuously asking for 
> mach_thread_self() every 10 micro seconds.
> 
> So today we could have simply reverted to how the code used to be, and 
> not even bother to call mach_port_deallocate() after mach_thread_self(), 
> though theoretically that could overflow the ref count eventually.
> 
> The proposed fix, reverts back to using mach_thread_self() (which 
> increases the ref count) because it's probably signal-async-safe, unlike 
> pthread_mach_thread_np(), AND calls mach_port_deallocate() (which 
> decreases the refcount)
> 
> Someone please correct me if I got anything wrong here.
> 
> 
>>
>>> bug: https://bugs.openjdk.java.net/browse/JDK-8235962
>>> webrev: http://cr.openjdk.java.net/~gziemski/8235962_rev1/index.html
>>
>> I found the refactoring around _unique_thread_id a bit hard to follow. 
>> Your new set_unique_thread_id needs to use "#ifdef __APPLE__" 
>> otherwise it won't compile on other BSD systems.
> 
> I will fix it, test it and post a new webrev, assuming everyone is 
> satisfied with my answers above.
> 
> 
>>
>> Thanks,
>> David
>> -----
>>
>>> Tested with Mach5 tier1,2,3,4,5 and Kitchensink24 and 
>>> Kitchensink24Stress locally on Mac laptop over 2 days.
>>>
>>>
>>> cheers
>