RFR(m): 8214271: Fast primitive to wake many threads

Wed Jan 9 03:40:42 UTC 2019

Hi Robbin,

No further significant comments, lets just see how this plays out.

Some minor nits:

src/hotspot/share/utilities/waitBarrier.hpp

! // A primary goal of the WaitBarrier implementation is to disarm all 
waiting

s/disarm/wake/

That was one place a global replace shouldn't have been applied. :)

! //    - Calling disarm() guarantees any thread calling or called 
wait(tag) will

"or called" is not grammatically correct. Perhaps:

//    - Calling disarm() guarantees any thread now calling or that has 
called wait(tag) will

     // Guarantees any thread called wait() will be awake when it returns.

s/called/that called/

---

src/hotspot/share/utilities/waitBarrier_generic.cpp

!   // disarm store must not float below.

s/float/sink/

74   // API specifies wake() must provides a trailing fence.

s/wake/disarm/
s/provides/provide/

  81     // API specifies wait() must provides a trailing fence.

s/provides/provide/

Thanks,
David

On 8/01/2019 8:42 pm, Robbin Ehn wrote:
> Hi David,
> 
> On 1/2/19 12:35 AM, David Holmes wrote:
>>>> Further this sounds like a race that could lead to bugs if not used 
>>>> very carefully ie. you can't assume between disarm() and wake() that 
>>>> all threads are blocked.
>>>
>>> I didn't realize how subtle this is. I think your original comment that
>>> disarm/wake should be one operation was spot on.
>>> Investigating... thinking... testing... yes I think this will work, 
>>> fixed!
>>> Sorry for not looking more into this before.
>>
>> I'm now curious how this will actually work in the context of the 
>> safepoint changes?
> 
> Since code already handle this 'invariant' with threads not being block 
> between disarm() and wake(), doing it one operation just very slightly 
> increases the chance that a thread will be blocked when we actually can 
> handle it to be running, but reduces the chance to hit a false positive 
> TLH poll.
> (with TLH we have a two-step un-synchronizing out of safepoints where we 
> must change global safepoint state before changing the thread polling 
> state)
> 
> (I have some thoughts on simplifying TLH/safepoint states)
> 
>> Nit: I would have kept disarm() rather than wake() as I like the 
>> arm/disarm duality.
> 
> Yes, me too. Not sure why I did the opposite, fixed!
> 
>>
>>    void GenericWaitBarrier::wait(int barrier_tag) {
>>      assert(barrier_tag != 0, "Trying to wait on disarmed value");
>> +   if (barrier_tag == 0 && barrier_tag != _barrier_tag) {
>> +     OrderAccess::fence();
>> +     return;
>> +   }
>>
>> I don't understand what the above is doing. A barrier_tag of 0 is a 
>> programming error caught during testing in debug builds. You don't 
>> need to account for it being 0 in product because this isn't something 
>> that can come in from an external source - we have full code control 
>> here. And even if you want to be this paranoid why would you need the 
>> fence?
> 
> Fixed, but kept the fence, since we say we are providing a trailing fence.
> Otherwise I would like to add that exception to the description of wait().
> 
> Including Dan's comments:
> Full: http://cr.openjdk.java.net/~rehn/8214271/6/full/webrev/
> Inc : http://cr.openjdk.java.net/~rehn/8214271/6/inc/webrev/
> 
> 
> Thanks, Robbin
> 
>>
>> Thanks,
>> David
>> -----
>>
>>> Full:
>>> http://cr.openjdk.java.net/~rehn/8214271/5/full/webrev/
>>>
>>> gtest passes thousands of loops locally and hundreds in mach5.
>>>
>>> Thanks, Robbin
>>>
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>>>
>>>>>> s/Implementation/Implementations/
>>>>>
>>>>> Fixed
>>>>>
>>>>>>
>>>>>> The fourth line is no longer needed.
>>>>>
>>>>> Above is the reason I would like to keep the fourth line, since 
>>>>> only if you call
>>>>> both disarm() and wake() you have that guarantee that waiter 
>>>>> threads will
>>>>> return.
>>>>>
>>>>> Thanks, Robbin
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>>
>>>>>>> Inc:
>>>>>>> http://cr.openjdk.java.net/~rehn/8214271/4/inc/webrev/
>>>>>>>
>>>>>>> Full:
>>>>>>> http://cr.openjdk.java.net/~rehn/8214271/4/full/webrev/
>>>>>>>
>>>>>>> /Robbin
>>>>>>>
>>>>>>>>
>>>>>>>> Otherwise this all looks good!
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David
>>>>>>>> -----
>>>>>>>>
>>>>>>>>
>>>>>>>>> Full:
>>>>>>>>> http://cr.openjdk.java.net/~rehn/8214271/3/full/webrev/
>>>>>>>>>
>>>>>>>>> Thanks, Robbin
>>>>>>>>>
>>>>>>>>> On 11/23/18 5:55 PM, Robbin Ehn wrote:
>>>>>>>>>> Forgot RFR in subject.
>>>>>>>>>>
>>>>>>>>>> /Robbin
>>>>>>>>>>
>>>>>>>>>> On 2018-11-23 17:51, Robbin Ehn wrote:
>>>>>>>>>>> Hi all, please review.
>>>>>>>>>>>
>>>>>>>>>>> When a safepoint is ended we need a way to get back to 100% 
>>>>>>>>>>> utilization as fast
>>>>>>>>>>> as possible. 100% utilization means no idle cpu in the system 
>>>>>>>>>>> if there is a
>>>>>>>>>>> JavaThread that could be executed. The traditional ways to 
>>>>>>>>>>> wake many, e.g.
>>>>>>>>>>> semaphore, pthread_cond, is not implemented with a single 
>>>>>>>>>>> syscall instead they
>>>>>>>>>>> typical do one syscall per thread to wake.
>>>>>>>>>>>
>>>>>>>>>>> This change-set contains that primitive, the WaitBarrier, and 
>>>>>>>>>>> a gtest for it.
>>>>>>>>>>> No actual users, which is in coming patches.
>>>>>>>>>>>
>>>>>>>>>>> The WaitBarrier solves by doing a cooperative semaphore 
>>>>>>>>>>> posting, threads woken
>>>>>>>>>>> will also post. On Linux we can instead directly use a futex 
>>>>>>>>>>> and with one
>>>>>>>>>>> syscall wake all. Depending on how many threads and cpus the 
>>>>>>>>>>> performance vary,
>>>>>>>>>>> but a good utilization of the machine, just on the edge of 
>>>>>>>>>>> saturated, the time to reach 100% utilization is around 3 
>>>>>>>>>>> times faster with the WaitBarrier (where futex is faster than 
>>>>>>>>>>> semaphore).
>>>>>>>>>>>
>>>>>>>>>>> Webrev:
>>>>>>>>>>> http://cr.openjdk.java.net/~rehn/8214271/webrev/
>>>>>>>>>>>
>>>>>>>>>>> CR:
>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8214271
>>>>>>>>>>>
>>>>>>>>>>> Passes 100 iterations of gtest on our platforms, both 
>>>>>>>>>>> fastdebug and release.
>>>>>>>>>>> And have been stable when used in safepoints (t1-8) (coming 
>>>>>>>>>>> patches).
>>>>>>>>>>>
>>>>>>>>>>> Thanks, Robbin