RFR(s): 8204166: TLH: Semaphore may not be destroy until signal have returned.

Tue Jun 19 08:46:14 UTC 2018

Hi David,

On 06/19/2018 07:10 AM, David Holmes wrote:
> Hi Robbin,
> 
> Overall changes seem okay. I gave a lot of thought as to whether an "old" thread 
> still returning from sem_wait could potentially interfere with the next use of 
> the sempahore, but it seems okay. Interesting (read "scary") glibc bug!

Good, yes!

> 
> Minor comments:
> 
> handshake.cpp:
> 
>   311   if (thread->is_terminated()) {
>   312     // If thread is not on threads list but armed, cancel.
>   313     thread->cancel_handshake();
>   314     return;
>   315   }
> 
> did you actually encounter late handshakes in the thread lifecycle causing 
> problems, or is this just being cautious?

Just cautious.

> 
> 377   if(vmthread_can_process_handshake(target)) {
> 
> Space needed after "if"

Fixed!

Thanks David, Robbin

> 
> Thanks,
> David
> -----
> 
> On 19/06/2018 12:05 AM, Robbin Ehn wrote:
>> On 06/18/2018 03:07 PM, Robbin Ehn wrote:
>>> Hi all,
>>>
>>> After some internal discussions I changed the patch to:
>>> http://rehn-ws.se.oracle.com/cr_mirror/8204166/v2/
>>
>> Correct external url:
>> http://cr.openjdk.java.net/~rehn/8204166/v2/
>>
>> /Robbin
>>
>>>
>>> Which handles thread off javathreads list better.
>>>
>>> Passes handshake testing and ZGC testing seems okay.
>>>
>>> Thanks, Robbin
>>>
>>> On 06/14/2018 12:11 PM, Robbin Ehn wrote:
>>>> Hi all, please review.
>>>>
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8204166
>>>> Webrev: http://cr.openjdk.java.net/~rehn/8204166/v1/webrev/
>>>>
>>>> The root cause of this failure is a bug in the posix semaphores: 
>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=12674
>>>>
>>>> Thread a:
>>>> sem_post(my_sem);
>>>>
>>>> Thread b:
>>>> sem_wait(my_sem);
>>>> sem_destroy(my_sem);
>>>>
>>>> Thread b is waiting on my_sem (count 0), Thread a posts (count 0->1).
>>>> If Thread b start executing directly after the increment in post but before
>>>> Thread a leaves the call to post and manage to destroy the semaphore. Thread a
>>>> _can_ get EINVAL from sem_post! This is fixed in newer glibc(2.21).
>>>>
>>>> Note that mutexes have had same issue on some platforms:
>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=13690
>>>> Fixed in 2.23.
>>>>
>>>> Since we only have one handshake operation running at anytime (safepoints 
>>>> and handshakes are also mutual exclusive, both run on VM Thread) we can 
>>>> actually always use the same semaphore. This patch changes the _done 
>>>> semaphore to be static instead, thus avoiding the post<->destroy race.
>>>>
>>>> Patch also contains some small changes which remove of dead code, remove 
>>>> unneeded state, handling of cases which we can't easily say will never 
>>>> happen and some additional error checks.
>>>>
>>>> Handshakes test passes, but they don't trigger the original issue, so more 
>>>> interesting is that this issue do not happen when running ZGC which utilize 
>>>> handshakes with the static semaphore.
>>>>
>>>> Thanks, Robbin