RFR(m): 8214271: Fast primitive to wake many threads

Robbin Ehn robbin.ehn at oracle.com
Fri Jan 11 09:20:38 UTC 2019


Thanks Dan!

/Robbin

On 1/10/19 4:57 PM, Daniel D. Daugherty wrote:
>  > Including Dan's comments:
>  > Full: http://cr.openjdk.java.net/~rehn/8214271/6/full/webrev/
>  > Inc : http://cr.openjdk.java.net/~rehn/8214271/6/inc/webrev/
> 
> src/hotspot/os/linux/waitBarrier_linux.cpp
>      No comments.
> 
> src/hotspot/os/linux/waitBarrier_linux.hpp
>      No comments.
> 
> src/hotspot/share/utilities/waitBarrier.hpp
>      No comments beyond David's.
> 
> src/hotspot/share/utilities/waitBarrier_generic.cpp
>      No comments beyond David's.
> 
> src/hotspot/share/utilities/waitBarrier_generic.hpp
>      No comments.
> 
> test/hotspot/gtest/utilities/test_waitBarrier.cpp
>      No comments.
> 
> Thumbs up!
> 
> Dan
> 
> 
> 
> On 1/9/19 7:57 AM, Robbin Ehn wrote:
>> Hi David,
>>
>> On 1/9/19 4:40 AM, David Holmes wrote:
>>> Hi Robbin,
>>>
>>> No further significant comments, lets just see how this plays out.
>>
>> Yes, thanks!
>>
>> Fixed nits.
>>
>> If Dan have something more non-trivial I'll publish a v7.
>>
>> /Robbin
>>
>>>
>>> Some minor nits:
>>>
>>> src/hotspot/share/utilities/waitBarrier.hpp
>>>
>>> ! // A primary goal of the WaitBarrier implementation is to disarm all waiting
>>>
>>> s/disarm/wake/
>>>
>>> That was one place a global replace shouldn't have been applied. :)
>>
>> Fixed!
>>
>>>
>>> ! //    - Calling disarm() guarantees any thread calling or called wait(tag) 
>>> will
>>>
>>> "or called" is not grammatically correct. Perhaps:
>>>
>>> //    - Calling disarm() guarantees any thread now calling or that has called 
>>> wait(tag) will
>>>
>>
>> Fixed!
>>
>>>
>>>      // Guarantees any thread called wait() will be awake when it returns.
>>>
>>> s/called/that called/
>>>
>>
>> Fixed!
>>
>>> ---
>>>
>>> src/hotspot/share/utilities/waitBarrier_generic.cpp
>>>
>>> !   // disarm store must not float below.
>>>
>>> s/float/sink/
>>>
>>
>> Fixed!
>>
>>> 74   // API specifies wake() must provides a trailing fence.
>>>
>>> s/wake/disarm/
>>> s/provides/provide/
>>
>> Fixed
>>
>>>
>>>   81     // API specifies wait() must provides a trailing fence.
>>>
>>> s/provides/provide/
>>
>> Fixed
>>
>>>
>>> Thanks,
>>> David
>>>
>>> On 8/01/2019 8:42 pm, Robbin Ehn wrote:
>>>> Hi David,
>>>>
>>>> On 1/2/19 12:35 AM, David Holmes wrote:
>>>>>>> Further this sounds like a race that could lead to bugs if not used very 
>>>>>>> carefully ie. you can't assume between disarm() and wake() that all 
>>>>>>> threads are blocked.
>>>>>>
>>>>>> I didn't realize how subtle this is. I think your original comment that
>>>>>> disarm/wake should be one operation was spot on.
>>>>>> Investigating... thinking... testing... yes I think this will work, fixed!
>>>>>> Sorry for not looking more into this before.
>>>>>
>>>>> I'm now curious how this will actually work in the context of the safepoint 
>>>>> changes?
>>>>
>>>> Since code already handle this 'invariant' with threads not being block 
>>>> between disarm() and wake(), doing it one operation just very slightly 
>>>> increases the chance that a thread will be blocked when we actually can 
>>>> handle it to be running, but reduces the chance to hit a false positive TLH 
>>>> poll.
>>>> (with TLH we have a two-step un-synchronizing out of safepoints where we 
>>>> must change global safepoint state before changing the thread polling state)
>>>>
>>>> (I have some thoughts on simplifying TLH/safepoint states)
>>>>
>>>>> Nit: I would have kept disarm() rather than wake() as I like the arm/disarm 
>>>>> duality.
>>>>
>>>> Yes, me too. Not sure why I did the opposite, fixed!
>>>>
>>>>>
>>>>>    void GenericWaitBarrier::wait(int barrier_tag) {
>>>>>      assert(barrier_tag != 0, "Trying to wait on disarmed value");
>>>>> +   if (barrier_tag == 0 && barrier_tag != _barrier_tag) {
>>>>> +     OrderAccess::fence();
>>>>> +     return;
>>>>> +   }
>>>>>
>>>>> I don't understand what the above is doing. A barrier_tag of 0 is a 
>>>>> programming error caught during testing in debug builds. You don't need to 
>>>>> account for it being 0 in product because this isn't something that can 
>>>>> come in from an external source - we have full code control here. And even 
>>>>> if you want to be this paranoid why would you need the fence?
>>>>
>>>> Fixed, but kept the fence, since we say we are providing a trailing fence.
>>>> Otherwise I would like to add that exception to the description of wait().
>>>>
>>>> Including Dan's comments:
>>>> Full: http://cr.openjdk.java.net/~rehn/8214271/6/full/webrev/
>>>> Inc : http://cr.openjdk.java.net/~rehn/8214271/6/inc/webrev/
>>>>
>>>>
>>>> Thanks, Robbin
>>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>> -----
>>>>>
>>>>>> Full:
>>>>>> http://cr.openjdk.java.net/~rehn/8214271/5/full/webrev/
>>>>>>
>>>>>> gtest passes thousands of loops locally and hundreds in mach5.
>>>>>>
>>>>>> Thanks, Robbin
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>>>
>>>>>>>>> s/Implementation/Implementations/
>>>>>>>>
>>>>>>>> Fixed
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The fourth line is no longer needed.
>>>>>>>>
>>>>>>>> Above is the reason I would like to keep the fourth line, since only if 
>>>>>>>> you call
>>>>>>>> both disarm() and wake() you have that guarantee that waiter threads will
>>>>>>>> return.
>>>>>>>>
>>>>>>>> Thanks, Robbin
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Inc:
>>>>>>>>>> http://cr.openjdk.java.net/~rehn/8214271/4/inc/webrev/
>>>>>>>>>>
>>>>>>>>>> Full:
>>>>>>>>>> http://cr.openjdk.java.net/~rehn/8214271/4/full/webrev/
>>>>>>>>>>
>>>>>>>>>> /Robbin
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Otherwise this all looks good!
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> David
>>>>>>>>>>> -----
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Full:
>>>>>>>>>>>> http://cr.openjdk.java.net/~rehn/8214271/3/full/webrev/
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Robbin
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/23/18 5:55 PM, Robbin Ehn wrote:
>>>>>>>>>>>>> Forgot RFR in subject.
>>>>>>>>>>>>>
>>>>>>>>>>>>> /Robbin
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2018-11-23 17:51, Robbin Ehn wrote:
>>>>>>>>>>>>>> Hi all, please review.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When a safepoint is ended we need a way to get back to 100% 
>>>>>>>>>>>>>> utilization as fast
>>>>>>>>>>>>>> as possible. 100% utilization means no idle cpu in the system if 
>>>>>>>>>>>>>> there is a
>>>>>>>>>>>>>> JavaThread that could be executed. The traditional ways to wake 
>>>>>>>>>>>>>> many, e.g.
>>>>>>>>>>>>>> semaphore, pthread_cond, is not implemented with a single syscall 
>>>>>>>>>>>>>> instead they
>>>>>>>>>>>>>> typical do one syscall per thread to wake.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This change-set contains that primitive, the WaitBarrier, and a 
>>>>>>>>>>>>>> gtest for it.
>>>>>>>>>>>>>> No actual users, which is in coming patches.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The WaitBarrier solves by doing a cooperative semaphore posting, 
>>>>>>>>>>>>>> threads woken
>>>>>>>>>>>>>> will also post. On Linux we can instead directly use a futex and 
>>>>>>>>>>>>>> with one
>>>>>>>>>>>>>> syscall wake all. Depending on how many threads and cpus the 
>>>>>>>>>>>>>> performance vary,
>>>>>>>>>>>>>> but a good utilization of the machine, just on the edge of 
>>>>>>>>>>>>>> saturated, the time to reach 100% utilization is around 3 times 
>>>>>>>>>>>>>> faster with the WaitBarrier (where futex is faster than semaphore).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Webrev:
>>>>>>>>>>>>>> http://cr.openjdk.java.net/~rehn/8214271/webrev/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> CR:
>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8214271
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Passes 100 iterations of gtest on our platforms, both fastdebug 
>>>>>>>>>>>>>> and release.
>>>>>>>>>>>>>> And have been stable when used in safepoints (t1-8) (coming patches).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, Robbin
> 


More information about the hotspot-dev mailing list