RFR 8230594: Allow direct handshakes without VMThread intervention

Wed Jan 22 13:53:54 UTC 2020

Hi Robbin,

On 1/22/20 5:58 AM, Robbin Ehn wrote:
>> Here is v5, which contains also style changes mentioned by Dan in his 
>> last review:
>>
>> Full: http://cr.openjdk.java.net/~pchilanomate/8230594/v05/webrev/
>> Inc: http://cr.openjdk.java.net/~pchilanomate/8230594/v05/inc/webrev/
>
> Ship it, thanks!
Great, thanks for reviewing this!

Patricio
> /Robbin
>
>>
>> Thanks!
>>
>> Patricio
>>> Thanks for fixing, Robbin
>>>
>>>>
>>>> Thanks!
>>>>
>>>> Patricio
>>>>> Dan
>>>>>
>>>>>
>>>>> On 1/15/20 10:56 PM, Patricio Chilano wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> On 1/15/20 8:29 PM, David Holmes wrote:
>>>>>>> Hi Patricio,
>>>>>>>
>>>>>>> On 16/01/2020 6:06 am, Patricio Chilano wrote:
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>> Here is v2:
>>>>>>>>
>>>>>>>> Full: http://cr.openjdk.java.net/~pchilanomate/8230594/v02/webrev/
>>>>>>>> Inc: 
>>>>>>>> http://cr.openjdk.java.net/~pchilanomate/8230594/v02/inc/webrev/
>>>>>>>
>>>>>>> src/hotspot/share/runtime/handshake.cpp
>>>>>>>
>>>>>>> why is it that execute_direct doesn't check for 
>>>>>>> uses_thread_local_poll() ?
>>>>>> Sorry, I thought for a moment that thread-local handshakes for 
>>>>>> arm32 was already implemented and all platforms where using it. I 
>>>>>> added that check back, and by testing that path I had to also add 
>>>>>> back the VMThread in the assert in walk_stack_and_revoke().
>>>>>>
>>>>>>> src/hotspot/share/runtime/handshake.hpp
>>>>>>>
>>>>>>> typo: wether -> whether
>>>>>> Fixed.
>>>>>>
>>>>>> Here is v3:
>>>>>> Full: http://cr.openjdk.java.net/~pchilanomate/8230594/v03/webrev/
>>>>>> Inc: 
>>>>>> http://cr.openjdk.java.net/~pchilanomate/8230594/v03/inc/webrev/
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Patricio
>>>>>>> Otherwise updates look fine.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>> Passed one round of t1-6, and running more rounds now.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Patricio
>>>>>>>>
>>>>>>>> On 1/14/20 7:57 PM, David Holmes wrote:
>>>>>>>>> Hi Patricio,
>>>>>>>>>
>>>>>>>>> On 15/01/2020 5:17 am, Patricio Chilano wrote:
>>>>>>>>>> Hi David,
>>>>>>>>>>
>>>>>>>>>> On 1/14/20 4:13 AM, David Holmes wrote:
>>>>>>>>>>> Hi Patricio,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for tackling this. I've taken an initial look through 
>>>>>>>>>>> trying to understand the general approach, but I don't claim 
>>>>>>>>>>> to understand all the subtle details, so there are some 
>>>>>>>>>>> queries below to help with my overall understanding.
>>>>>>>>>>>
>>>>>>>>>>> On 14/01/2020 2:25 am, Patricio Chilano wrote:
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> The following patch adds the ability to execute direct 
>>>>>>>>>>>> handshakes between JavaThreads without the VMThread 
>>>>>>>>>>>> intervention, and enables this functionality for biased 
>>>>>>>>>>>> locking revocations.
>>>>>>>>>>>> The current handshake mechanism that uses the VMThread, 
>>>>>>>>>>>> either to handshake one JavaThread or all of them, is still 
>>>>>>>>>>>> the default unless you specify otherwise when calling 
>>>>>>>>>>>> Handshake::execute().
>>>>>>>>>>>
>>>>>>>>>>> Can I suggest that rather than using an overloaded:
>>>>>>>>>>>
>>>>>>>>>>>  static bool execute(HandshakeClosure* hs_cl, JavaThread* 
>>>>>>>>>>> target, bool is_direct_handshake = false);
>>>>>>>>>>>
>>>>>>>>>>> you instead add an execute_direct method for clarity?
>>>>>>>>>> Ok, added execute_direct() and removed overloading.
>>>>>>>>>>
>>>>>>>>>>>> In order to avoid adding additional overhead to this path 
>>>>>>>>>>>> that uses the VMThread (especially the one that handshakes 
>>>>>>>>>>>> all JavaThreads) I added a new HandshakeOperation pointer 
>>>>>>>>>>>> in the HandshakeState class, _operation_direct, to be used 
>>>>>>>>>>>> for the direct handshake cases only and whose access is 
>>>>>>>>>>>> serialized between JavaThreads by using a semaphore. Thus, 
>>>>>>>>>>>> one direct handshake will be allowed at any given time, and 
>>>>>>>>>>>> upon completion the semaphore will be signaled to allow the 
>>>>>>>>>>>> next handshaker if any to proceed. In this way the old 
>>>>>>>>>>>> _operation can still be used only by the VMThread without 
>>>>>>>>>>>> the need for synchronization to access it. The handshakee 
>>>>>>>>>>>> will now check if any of _operation or _operation_direct is 
>>>>>>>>>>>> set when checking for a pending handshake and will try to 
>>>>>>>>>>>> execute both in HandshakeState::process_self_inner(). The 
>>>>>>>>>>>> execution of the handshake’s ThreadClosure, either direct 
>>>>>>>>>>>> handshake or not, is still protected by a semaphore, which 
>>>>>>>>>>>> I renamed to _processing_sem.
>>>>>>>>>>>> I converted the semaphore _done in HandshakeOperation to be 
>>>>>>>>>>>> just an atomic counter because of bug 
>>>>>>>>>>>> https://sourceware.org/bugzilla/show_bug.cgi?id=12674 
>>>>>>>>>>>> (which I actually hit once!). Since the semaphore could not 
>>>>>>>>>>>> be static anymore due to possibly having more than one 
>>>>>>>>>>>> HandshakeOperation at a time, the handshakee could try to 
>>>>>>>>>>>> access the nwaiters field of an already destroyed semaphore 
>>>>>>>>>>>> when signaling it. In any case nobody was waiting on that 
>>>>>>>>>>>> semaphore (we were not using kernel functionality), so just 
>>>>>>>>>>>> using an atomic counter seems more appropriate.
>>>>>>>>>>>> In order to avoid issues due to disarming a JavaThread that 
>>>>>>>>>>>> should still be armed for a handshake or safepoint, each 
>>>>>>>>>>>> JavaThread will now always disarm its own polling page.
>>>>>>>>>>>> I also added a new test, HandshakeDirectTest.java, which 
>>>>>>>>>>>> tries to stress the use of direct handshakes with revocations.
>>>>>>>>>>>> In terms of performance, I measured no difference in the 
>>>>>>>>>>>> execution time of one individual handshake. The difference 
>>>>>>>>>>>> can be seen when several handshakes at a time are executed 
>>>>>>>>>>>> as expected. So for example on Linux running on an Intel 
>>>>>>>>>>>> Xeon 8167M cpu, test HandshakeDirectTest.java (which 
>>>>>>>>>>>> executes 50000 handshakes between 32 threads) executes in 
>>>>>>>>>>>> around 340ms using direct handshakes and in around 5.6 
>>>>>>>>>>>> seconds without it. For a modified version of that test 
>>>>>>>>>>>> that only executes 128 handshakes between the 32 threads 
>>>>>>>>>>>> and avoids any suspend-resume, the test takes around 12ms 
>>>>>>>>>>>> with direct handshakes and 19ms without it.
>>>>>>>>>>>> Tested with mach5, tiers1-6.
>>>>>>>>>>>>
>>>>>>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8230594
>>>>>>>>>>>> Webrev: 
>>>>>>>>>>>> http://cr.openjdk.java.net/~pchilanomate/8230594/v01/webrev/
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/handshake.hpp
>>>>>>>>>>>
>>>>>>>>>>>  36 // A handshake closure is a callback that is executed 
>>>>>>>>>>> for each JavaThread
>>>>>>>>>>>   37 // while that thread is in a safepoint safe state. The 
>>>>>>>>>>> callback is executed
>>>>>>>>>>>   38 // either by the thread itself or by the VM thread 
>>>>>>>>>>> while keeping the thread
>>>>>>>>>>>   39 // in a blocked state. A handshake can be performed 
>>>>>>>>>>> with a single
>>>>>>>>>>>   40 // JavaThread as well.
>>>>>>>>>>>
>>>>>>>>>>> Does this comment block need updating for the direct case?
>>>>>>>>>>>
>>>>>>>>>>>  61 // the operation is only done by either VM thread on 
>>>>>>>>>>> behalf of the JavaThread
>>>>>>>>>>>  62 // or the JavaThread itself.
>>>>>>>>>>>
>>>>>>>>>>> Again does this need an update?
>>>>>>>>>> Yes, I forgot to update those comments in handshake.hpp. Fixed.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/handshake.hpp
>>>>>>>>>>>
>>>>>>>>>>>  41 class HandshakeOperation: public StackObj {
>>>>>>>>>>>
>>>>>>>>>>> Would it be clearer to have HandShakeOperation subclassed by 
>>>>>>>>>>> DirectHandShakeOperation, rather than it being a property of 
>>>>>>>>>>> the current op instance?
>>>>>>>>>> But I would still need to access the is_direct field for 
>>>>>>>>>> non-direct handshakes in HandshakeState::try_process() anyways.
>>>>>>>>>
>>>>>>>>> Okay.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>  41 class HandshakeOperation: public StackObj {
>>>>>>>>>>>  42   HandshakeClosure* _handshake_cl;
>>>>>>>>>>>  43   int64_t _pending_threads;
>>>>>>>>>>>  ...
>>>>>>>>>>>  53   void add_target_count(int count) { 
>>>>>>>>>>> Atomic::add(&_pending_threads, count); }
>>>>>>>>>>>
>>>>>>>>>>> Can you clarify the lifecycle of this StackObj please. You 
>>>>>>>>>>> obviously expose it to other threads - hence the atomic 
>>>>>>>>>>> update - but that means it needs to be guaranteed to be live 
>>>>>>>>>>> when that update occurs.
>>>>>>>>>> The HandshakeOperation object is created in 
>>>>>>>>>> Handshake::execute() (or Handshake::execute_direct() now) by 
>>>>>>>>>> the Handshaker. The only other threads that could access this 
>>>>>>>>>> object are the VMThread(for non-direct case only) and the 
>>>>>>>>>> Handshakee(in both cases). Seeing that it's safe for the 
>>>>>>>>>> VMThread to decrement the counter is straightforward since 
>>>>>>>>>> the Handshaker will be waiting on the VMOperationRequest_lock 
>>>>>>>>>> waiting for the operation to finish (this is just as always 
>>>>>>>>>> has been).
>>>>>>>>>> As for the Handshakee, we can see that the HandshakeOperation 
>>>>>>>>>> object will be alive as long as the VMThread(for the 
>>>>>>>>>> non-direct case) or Handshaker(for the direct case) keep 
>>>>>>>>>> seeing that the operation is not done by calling 
>>>>>>>>>> HandshakeOperation::is_completed(). Once that returns true, 
>>>>>>>>>> for the non-direct case the VMThread will finish the 
>>>>>>>>>> operation and wake up the Handshaker who will return from 
>>>>>>>>>> Handshake::execute() and the object will be destroyed, and 
>>>>>>>>>> for the direct case the Handshaker will just return from 
>>>>>>>>>> execute_direct() and the object will be destroyed. Since 
>>>>>>>>>> is_completed() will only return true when _pending_threads 
>>>>>>>>>> reaches zero, that means the decrement operation had to be 
>>>>>>>>>> safe. Just as an observation, for the HandshakeAllThreads 
>>>>>>>>>> operation, _pending_threads could go to zero and even 
>>>>>>>>>> negative before executing add_target_count(), but that's okay 
>>>>>>>>>> because the VMThread doesn't call is_completed() before that 
>>>>>>>>>> and the load of _pending_threads cannot float above 
>>>>>>>>>> add_target_count() since the atomic add operation provides a 
>>>>>>>>>> memory fence.
>>>>>>>>>> And finally the Handshakee cannot try to execute an operation 
>>>>>>>>>> that has already being processed by the VMThread/Handshaker, 
>>>>>>>>>> because they clear it first and then signal the 
>>>>>>>>>> _processing_sem semaphore in HandshakeState::try_process(). 
>>>>>>>>>> That worked the same before this change.
>>>>>>>>>
>>>>>>>>> Thanks for the detailed explanation!
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>>
>>>>>>>>>>>  349     HandshakeOperation* op = 
>>>>>>>>>>> Atomic::load_acquire(&_operation);
>>>>>>>>>>>  350     if (op == NULL) {
>>>>>>>>>>>  351       op = Atomic::load_acquire(&_operation_direct);
>>>>>>>>>>>
>>>>>>>>>>> This gives preference to the non-direct op. Is it possible 
>>>>>>>>>>> for the direct-op to be indefinitely delayed if there is a 
>>>>>>>>>>> series of non-direct ops?
>>>>>>>>>> Yes, it's possible, although I think that's an unlikely 
>>>>>>>>>> scenario. That would mean the Handshakee would be stuck in 
>>>>>>>>>> that while loop with the VMThread constantly setting new 
>>>>>>>>>> operations in _operation. But I can modify the loop and make 
>>>>>>>>>> it try to execute both before going into the next interation, 
>>>>>>>>>> something like (more on load_acquire next):
>>>>>>>>>>
>>>>>>>>>>      if (has_operation()) {
>>>>>>>>>>        HandleMark hm(_thread);
>>>>>>>>>>        CautiouslyPreserveExceptionMark pem(_thread);
>>>>>>>>>>        HandshakeOperation * op = _operation;
>>>>>>>>>>        if (op != NULL) {
>>>>>>>>>>          // Disarm before execute the operation
>>>>>>>>>>          clear_handshake(false);
>>>>>>>>>>          op->do_handshake(_thread);
>>>>>>>>>>        }
>>>>>>>>>>        op = _operation_direct;
>>>>>>>>>>        if (op != NULL) {
>>>>>>>>>>          // Disarm before execute the operation
>>>>>>>>>>          clear_handshake(true);
>>>>>>>>>>          op->do_handshake(_thread);
>>>>>>>>>>        }
>>>>>>>>>>      }
>>>>>>>>>>
>>>>>>>>>> what do you think?
>>>>>>>>>
>>>>>>>>> That seems better to me.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Also I don't see the release_stores that these load_acquire 
>>>>>>>>>>> would pair with. I'm assuming it should be here:
>>>>>>>>>>>
>>>>>>>>>>>  319     _operation = op;
>>>>>>>>>>>  323     _operation_direct = op;
>>>>>>>>>>>
>>>>>>>>>>> But I would also have expected all necessary memory 
>>>>>>>>>>> synchronization to already be present via the semaphore 
>>>>>>>>>>> operations and/or the rest of the handshake mechanism. ??
>>>>>>>>>> Yes, the release is in 
>>>>>>>>>> SafepointMechanism::arm_local_poll_release() after setting 
>>>>>>>>>> those fields. But it's true that those load_acquire are not 
>>>>>>>>>> needed since the semaphore already has acquire semantics. I 
>>>>>>>>>> actually wanted to do a normal load but ended up just copying 
>>>>>>>>>> how we were loading _operation. I''ll change them and retest.
>>>>>>>>>
>>>>>>>>> Okay.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> David
>>>>>>>>> -------
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 396 if ((!is_direct && _operation != NULL) || (is_direct && 
>>>>>>>>>>> _operation_direct != NULL)){
>>>>>>>>>>> 406   if ((!is_direct && _operation == NULL) || (is_direct 
>>>>>>>>>>> && _operation_direct == NULL)){
>>>>>>>>>>>
>>>>>>>>>>> Can this not be abstracted back into a "has_operation" 
>>>>>>>>>>> method that takes a "direct" parameter? e.g
>>>>>>>>>>>
>>>>>>>>>>>   bool has_operation() const { return _operation != NULL || 
>>>>>>>>>>> _operation_direct != NULL; }
>>>>>>>>>>> + bool has_specific_operation(bool direct) {
>>>>>>>>>>> +   return direct ? _operation_direct != NULL : _operation 
>>>>>>>>>>> != NULL;
>>>>>>>>>>> + }
>>>>>>>>>> Done.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> test/hotspot/jtreg/runtime/handshake/HandshakeDirectTest.java
>>>>>>>>>>>
>>>>>>>>>>> Please add:
>>>>>>>>>>>
>>>>>>>>>>> import java.util.concurrent.Semaphore;
>>>>>>>>>>>
>>>>>>>>>>> so you can just refer to Sempahore.
>>>>>>>>>> Done.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Style nit: _working_threads
>>>>>>>>>>>
>>>>>>>>>>> Java style is to not use leading underscore and to use 
>>>>>>>>>>> camelCase for variables ie workingThreads.
>>>>>>>>>> Right, I changed them to camelCase.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 44 static Thread _suspendresume_thread = new Thread();
>>>>>>>>>>>
>>>>>>>>>>> The above is dead code.
>>>>>>>>>> Removed.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 53 if (_is_biased[me] == false) {
>>>>>>>>>>>
>>>>>>>>>>> Style nit: use "if (!_is_biased[me]) {"
>>>>>>>>>> Changed.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 80             } catch(InterruptedException ie) {
>>>>>>>>>>>   81             }
>>>>>>>>>>>
>>>>>>>>>>> I suggest inserting "throw new Error("Unexpected 
>>>>>>>>>>> interrupt");" for good measure.
>>>>>>>>>>>
>>>>>>>>>>>  111 _working_threads[i].suspend();
>>>>>>>>>> Done.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thread suspension has been deprecated-for-removal as of JDK 
>>>>>>>>>>> 14 so could be gone in JDK 15 (but more likely 16). If the 
>>>>>>>>>>> suspend/resume is important for this test then you will need 
>>>>>>>>>>> to switch to using JVM TI suspend/resume; or else perhaps 
>>>>>>>>>>> introduce a WhiteBox method to do whatever you need in the VM.
>>>>>>>>>> It's not really needed for this test. I just wanted to mixed 
>>>>>>>>>> suspend-resume with handshakes because I know there could be 
>>>>>>>>>> some buggy interactions between them. But since it has been 
>>>>>>>>>> deprecated then maybe there is no point in stressing 
>>>>>>>>>> handshakes with them and I can remove that part.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>  123 // Wait until the desired number of direct handshakes 
>>>>>>>>>>> is reached
>>>>>>>>>>>  124         while (_handshake_count.get() < 
>>>>>>>>>>> DIRECT_HANDSHAKES_MARK) {
>>>>>>>>>>>  125             Thread.sleep(10); // sleep for 10ms
>>>>>>>>>>>  126         }
>>>>>>>>>>>
>>>>>>>>>>> You could just do a join() on one of the worker threads.
>>>>>>>>>> Right, didn't thought about that. Changed.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks for looking at this David! I'll retest with the above 
>>>>>>>>>> changes in the meantime.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Patricio
>>>>>>>>>>> Thanks,
>>>>>>>>>>> David
>>>>>>>>>>> -----
>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> Patricio
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>