RFR(s): 8204166: TLH: Semaphore may not be destroy until signal have returned.
Daniel D. Daugherty
daniel.daugherty at oracle.com
Tue Jun 19 13:24:29 UTC 2018
On 6/19/18 4:46 AM, Robbin Ehn wrote:
> Hi David,
>
> On 06/19/2018 07:10 AM, David Holmes wrote:
>> Hi Robbin,
>>
>> Overall changes seem okay. I gave a lot of thought as to whether an
>> "old" thread still returning from sem_wait could potentially
>> interfere with the next use of the sempahore, but it seems okay.
>> Interesting (read "scary") glibc bug!
>
> Good, yes!
>
>>
>> Minor comments:
>>
>> handshake.cpp:
>>
>> 311 if (thread->is_terminated()) {
>> 312 // If thread is not on threads list but armed, cancel.
>> 313 thread->cancel_handshake();
>> 314 return;
>> 315 }
>>
>> did you actually encounter late handshakes in the thread lifecycle
>> causing problems, or is this just being cautious?
>
> Just cautious.
I have a vague memory of Mikael G. running into this problem during
the early days of TLH... but I could be mis-remembering...
Dan
>
>>
>> 377 if(vmthread_can_process_handshake(target)) {
>>
>> Space needed after "if"
>
> Fixed!
>
> Thanks David, Robbin
>
>>
>> Thanks,
>> David
>> -----
>>
>> On 19/06/2018 12:05 AM, Robbin Ehn wrote:
>>> On 06/18/2018 03:07 PM, Robbin Ehn wrote:
>>>> Hi all,
>>>>
>>>> After some internal discussions I changed the patch to:
>>>> http://rehn-ws.se.oracle.com/cr_mirror/8204166/v2/
>>>
>>> Correct external url:
>>> http://cr.openjdk.java.net/~rehn/8204166/v2/
>>>
>>> /Robbin
>>>
>>>>
>>>> Which handles thread off javathreads list better.
>>>>
>>>> Passes handshake testing and ZGC testing seems okay.
>>>>
>>>> Thanks, Robbin
>>>>
>>>> On 06/14/2018 12:11 PM, Robbin Ehn wrote:
>>>>> Hi all, please review.
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8204166
>>>>> Webrev: http://cr.openjdk.java.net/~rehn/8204166/v1/webrev/
>>>>>
>>>>> The root cause of this failure is a bug in the posix semaphores:
>>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=12674
>>>>>
>>>>> Thread a:
>>>>> sem_post(my_sem);
>>>>>
>>>>> Thread b:
>>>>> sem_wait(my_sem);
>>>>> sem_destroy(my_sem);
>>>>>
>>>>> Thread b is waiting on my_sem (count 0), Thread a posts (count 0->1).
>>>>> If Thread b start executing directly after the increment in post
>>>>> but before
>>>>> Thread a leaves the call to post and manage to destroy the
>>>>> semaphore. Thread a
>>>>> _can_ get EINVAL from sem_post! This is fixed in newer glibc(2.21).
>>>>>
>>>>> Note that mutexes have had same issue on some platforms:
>>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=13690
>>>>> Fixed in 2.23.
>>>>>
>>>>> Since we only have one handshake operation running at anytime
>>>>> (safepoints and handshakes are also mutual exclusive, both run on
>>>>> VM Thread) we can actually always use the same semaphore. This
>>>>> patch changes the _done semaphore to be static instead, thus
>>>>> avoiding the post<->destroy race.
>>>>>
>>>>> Patch also contains some small changes which remove of dead code,
>>>>> remove unneeded state, handling of cases which we can't easily say
>>>>> will never happen and some additional error checks.
>>>>>
>>>>> Handshakes test passes, but they don't trigger the original issue,
>>>>> so more interesting is that this issue do not happen when running
>>>>> ZGC which utilize handshakes with the static semaphore.
>>>>>
>>>>> Thanks, Robbin
>
More information about the hotspot-dev
mailing list