RFR (S) 8235962: os::current_thread_id() is not signal safe on macOS

Thu Jan 16 19:14:19 UTC 2020

Hi David,

Thank you very much for the review.

On 1/13/20 8:49 PM, David Holmes wrote:
> Hi Gerard,
>
> Thanks for tackling this. All extremely confusing as usual :) far too 
> many different notions of "thread identity".
>
> On 14/01/2020 4:43 am, gerard ziemski wrote:
>> hi all,
>>
>> Before JDK-8022808 we used to call "mach_thread_self()" to obtain a 
>> thread id, but back then we forgot to release the returned mach port, 
>> which would make a long running java process (like Kitchensink) 
>> eventually run out of resources and hang it.
>>
>> To solve that, in JDK-8022808, we switched to using 
>> "pthread_mach_thread_np(pthread_self())", which returns a static mach 
>> port, which does not need to be released.
>> However, the "pthread_mach_thread_np()" API, which we started using 
>> is specifically knownn as not async-signal-safe API and that might 
>> cause an issue, because we use it to generate thread ids in 
>> hs_err_pid.log crash log files in our signal handler, which requires 
>> that only async-signal-safe API be used.
>>
>> Our crash handler is best effort approach, and to this extend I'd 
>> like to propose a fix, which reverts back to using 
>> "mach_thread_self()", which I must admit is unknown whether it itself 
>> is async-signal-safe, though it's not part of pthread API, like 
>> "pthread_mach_thread_np()" which itself is not part of 
>> async-signal-safe APIs (IEEE Std 1003.1, 2016). Additionally, the 
>> reporter of the issue suggest that it solves the actual issue with 
>> using "pthread_mach_thread_np()" for them in their project, so let's 
>> benefit from their experience.
>
> Okay that seems reasonable.
>
>> In this fix we revert back to using "mach_thread_self()" API, but 
>> make sure to release the mach port, and use MACH_PORT_VALID() macro 
>> to verify the validity of mach port.
>
> I'm unclear on this aspect. Now that we are releasing it doesn't that 
> mean that the port value could be reused and applied to a different 
> thread? It isn't at all clear to me what a "mach port" actually means 
> and what releasing it (or failing to) means.

I've done more research and testing and here is how I understand our 
usage of mach ports:

Mach ports are a resource a code needs in order to talk to mach thread 
(or task), though we simply use its value as thread id. They (i.e. the 
mach kernel) use reference counting to keep track of whether they need 
to exist, and get automatically destroyed when the ref count reaches 0.

They can be created via APIs like mach_port_allocate(), and deleted 
directly via mach_port_destroy(), or indirectly via 
mach_port_deallocate(). The difference between these two (inaptly named 
imho) APIs is that while mach_port_destroy() will outright delete the 
port, mach_port_deallocate() will only decrease the ref count and will 
call mach_port_destroy() implicitly only if it reaches 0.

If the user obtains mach port via mach_thread_self(), the ref count of 
that thread will increase, which needs to be matched by a call to 
mach_port_deallocate(). However, one can not delete such special mach 
port via mach_port_destroy() (only user own created ports can de deleted).

Back in 2013 a call to mach_thread_self() would create a brand new mach 
port, so if you did not call mach_port_deallocate() after being done 
with it, the resource would be leaked. I believe that was actually a bug 
in early OS X, or possibly a bad API design, which affected many other 
users, not just us (i.e. Google's Chrome app for Mac). The confusion was 
primary due to the fact that mach_task_self() would not create a new 
port, so developers expected the same from mach_thread_self(). There are 
only about 300,000 ports available to a process (verified via a native 
sample that keeps creating new ports via mach_port_allocate() API), so 
eventually long lived apps, like our kitchensink, would run out of them.

We could have fixed our issue by matching mach_thread_self() with 
mach_port_deallocate() even back then, but instead we chose to use 
pthread_mach_thread_np() API, which is documented to return thread's 
static mach port, which does not need to be returned via 
mach_port_deallocate(). I think that solution just bypassed the mystery 
of mach ports and when and if we need to delete them, but also was more 
efficient as it did not create a new port.

Today, mach_thread_self() on OS X behaves differently, to how it did 
back then - it no longer creates new mach ports, and so in most cases 
the matching call to mach_port_deallocate() could be skipped, as the 
counter the mach kernel uses for ref count is a very big number. Though 
it would eventually overflow, and the kernel then would probably delete 
the mach port with tragic results. I tried to see if I could make that 
happen, but no luck in 24 hours of a test case continuously asking for 
mach_thread_self() every 10 micro seconds.

So today we could have simply reverted to how the code used to be, and 
not even bother to call mach_port_deallocate() after mach_thread_self(), 
though theoretically that could overflow the ref count eventually.

The proposed fix, reverts back to using mach_thread_self() (which 
increases the ref count) because it's probably signal-async-safe, unlike 
pthread_mach_thread_np(), AND calls mach_port_deallocate() (which 
decreases the refcount)

Someone please correct me if I got anything wrong here.

>
>> bug: https://bugs.openjdk.java.net/browse/JDK-8235962
>> webrev: http://cr.openjdk.java.net/~gziemski/8235962_rev1/index.html
>
> I found the refactoring around _unique_thread_id a bit hard to follow. 
> Your new set_unique_thread_id needs to use "#ifdef __APPLE__" 
> otherwise it won't compile on other BSD systems.

I will fix it, test it and post a new webrev, assuming everyone is 
satisfied with my answers above.

>
> Thanks,
> David
> -----
>
>> Tested with Mach5 tier1,2,3,4,5 and Kitchensink24 and 
>> Kitchensink24Stress locally on Mac laptop over 2 days.
>>
>>
>> cheers