RFR(s): 8204166: TLH: Semaphore may not be destroy until signal have returned.

Tue Jun 19 05:10:41 UTC 2018

Hi Robbin,

Overall changes seem okay. I gave a lot of thought as to whether an 
"old" thread still returning from sem_wait could potentially interfere 
with the next use of the sempahore, but it seems okay. Interesting (read 
"scary") glibc bug!

Minor comments:

handshake.cpp:

  311   if (thread->is_terminated()) {
  312     // If thread is not on threads list but armed, cancel.
  313     thread->cancel_handshake();
  314     return;
  315   }

did you actually encounter late handshakes in the thread lifecycle 
causing problems, or is this just being cautious?

377   if(vmthread_can_process_handshake(target)) {

Space needed after "if"

Thanks,
David
-----	

On 19/06/2018 12:05 AM, Robbin Ehn wrote:
> On 06/18/2018 03:07 PM, Robbin Ehn wrote:
>> Hi all,
>>
>> After some internal discussions I changed the patch to:
>> http://rehn-ws.se.oracle.com/cr_mirror/8204166/v2/
> 
> Correct external url:
> http://cr.openjdk.java.net/~rehn/8204166/v2/
> 
> /Robbin
> 
>>
>> Which handles thread off javathreads list better.
>>
>> Passes handshake testing and ZGC testing seems okay.
>>
>> Thanks, Robbin
>>
>> On 06/14/2018 12:11 PM, Robbin Ehn wrote:
>>> Hi all, please review.
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8204166
>>> Webrev: http://cr.openjdk.java.net/~rehn/8204166/v1/webrev/
>>>
>>> The root cause of this failure is a bug in the posix semaphores: 
>>> https://sourceware.org/bugzilla/show_bug.cgi?id=12674
>>>
>>> Thread a:
>>> sem_post(my_sem);
>>>
>>> Thread b:
>>> sem_wait(my_sem);
>>> sem_destroy(my_sem);
>>>
>>> Thread b is waiting on my_sem (count 0), Thread a posts (count 0->1).
>>> If Thread b start executing directly after the increment in post but 
>>> before
>>> Thread a leaves the call to post and manage to destroy the semaphore. 
>>> Thread a
>>> _can_ get EINVAL from sem_post! This is fixed in newer glibc(2.21).
>>>
>>> Note that mutexes have had same issue on some platforms:
>>> https://sourceware.org/bugzilla/show_bug.cgi?id=13690
>>> Fixed in 2.23.
>>>
>>> Since we only have one handshake operation running at anytime 
>>> (safepoints and handshakes are also mutual exclusive, both run on VM 
>>> Thread) we can actually always use the same semaphore. This patch 
>>> changes the _done semaphore to be static instead, thus avoiding the 
>>> post<->destroy race.
>>>
>>> Patch also contains some small changes which remove of dead code, 
>>> remove unneeded state, handling of cases which we can't easily say 
>>> will never happen and some additional error checks.
>>>
>>> Handshakes test passes, but they don't trigger the original issue, so 
>>> more interesting is that this issue do not happen when running ZGC 
>>> which utilize handshakes with the static semaphore.
>>>
>>> Thanks, Robbin