RFR(s): 8220774: Add HandshakeALot diag option

Daniel D. Daugherty daniel.daugherty at oracle.com
Mon Mar 25 19:55:05 UTC 2019


> Please see v6:
> Inc:  http://cr.openjdk.java.net/~rehn/8220774/v6/inc/ 

test/hotspot/jtreg/runtime/Thread/SuspendAtExit.java
     No comments.

test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
     No comments (on the current changes).

Regarding this earlier reply:

>      L91:             for (Thread thr : _suspend_threads) {
>      L92:                 thr.resume();
>      L93:             }
>          So why is there an extra resume per _suspend_threads element?
>          You have matched suspend() and resume() calls so what am I
>          missing?
>
> If the two threads never suspended exit first, the other threads can 
> cause a
> circular suspend and they hanging forever. One of the issues in 8214174. 

You might want to add a comment above:

L87:             for (Thread thr : _suspend_threads) {

                  // Resume any worker threads that might have suspended
                  // each other at exactly the same time so they can see
                  // _exit_now and check in via the semaphore.

Thumbs up!

Dan



On 3/25/19 5:09 AM, Robbin Ehn wrote:
> Hi David,
>
> On 3/25/19 2:36 AM, David Holmes wrote:
>> Hi Robbin,
>>
>> On 23/03/2019 1:29 am, Robbin Ehn wrote:
>>> Hi David,
>>>
>>> On 3/22/19 8:05 AM, David Holmes wrote:
>>>> Hi Robbin,
>>>>
>>>> This was a little more complex than I had imagined. :) A couple of 
>>>> comments:
>>>>
>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>
>>>>    508           // Have to unlock VMOperationQueue_lock just in case
>>>> no_op_safepoint()
>>>>    509           // has to do a handshake.
>>>>    510           MutexUnlockerEx mul(VMOperationQueue_lock,
>>>> Mutex::_no_safepoint_check_flag);
>>>>    511           if (timedout && (_cur_vm_operation =
>>>> VMThread::no_op_safepoint()) != NULL) {
>>>>
>>>> wouldn't it be better to check timedout first and only then use the 
>>>> unlocker
>>>> then check _cur_vm_operation?
>>>
>>> Fixed.
>>>
>>>>
>>>> ---
>>>>
>>>> src/hotspot/share/runtime/thread.cpp
>>>>
>>>> Can has_last_Java_frame() and java_call_counter() change values 
>>>> between their
>>>> use in the assert and their use in the assert message?
>>>
>>> No, the JavaThread is safepoint_safe here, thus must have a stable 
>>> java stack.
>>
>> Ok.
>>
>>>>
>>>> ---
>>>>
>>>> Don't you want to add some tests that exercise this? Or update 
>>>> existing tests
>>>> that use SafepointALot to also use HandshakeALot?
>>>
>>> Fixed!
>>
>> I'm unclear how this fix and the renamed test relate to the existing 
>> bug JDK-8214174 that was referenced in the ProblemList entry ??
>
> Since we are not using withebox to walk thread stack, instead using
> HandshakeALot I removed Walk from name. The original could cause a 
> circular
> suspend issue.
>
>>
>> test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
>>
>> If you need the loop index use the full form of the for-loop. This:
>>
>>    43             int i = 0;
>>    44             for (Thread thr : _suspend_threads) {
>>    45                 if (i++ > _suspend_threads.length -2) {
>>    46                     // Leave last 2 threads running.
>>    47                     break;
>>    48                 }
>>    49                 if (Thread.currentThread() != thr) {
>>
>> can reduce to simply:
>>
>>      // Leave last 2 threads running.
>>      for (int i = 0; i < _suspend_threads.length - 2; i++) {
>>          if (Thread.currentThread() != _suspend_threads[i]) {
>>
>
> Fixed.
>
>>
>>   67         // Wait for all suspend thread starting to loop.
>>
>> ->  // Wait for all suspend-threads to start looping.
>>
>>   75             exit_threads[i] = new Thread(new Runnable() { public 
>> void run() {} });
>>
>> The "new Runnable()..." is unnecessary - "new Thread();" suffices.
>>
>>
>>    79         // Try to suspend them.
>>    80         for (Thread thr : exit_threads) {
>>    81             thr.suspend();
>>    82         }
>>    83         for (Thread thr : exit_threads) {
>>    84             thr.resume();
>>    85         }
>>
>> there's really no guarantee of getting the suspend during "exit". The 
>> SuspendAtExit test seems to use logging to make it more likely to 
>> encounter the suspend when desired. Is there a reason not to add a 
>> second @run to that test to use HandShakes?
>
> The original issue I had was so unlikely that I needed tenths of 
> thousands of
> thread and looping that for several hours. (that issue is fixed)
> So the test just try to quickly with a small chance hit interesting 
> paths.
>
> We can do that also.
>
>>
>>    90         do {
>>    91             for (Thread thr : _suspend_threads) {
>>    92                 thr.resume();
>>    93             }
>>    94             while (_sem.tryAcquire()) {
>>    95                 --waiting;
>>    96             }
>>    97         } while (waiting > 0);
>>
>> why do a busy-wait and resume threads you never suspended? Why not 
>> just do a blocking acquire() on the semaphore? The main 
>> suspend/resume logic in the _suspend_threads must leave the target 
>> thread resumed.
>
> The _suspend_threads may create a circular suspend, so we resume them 
> just in case.
>
>>
>>> All handshakes test passed 100 iteration on each platform.
>>> But the "8221207: Redo JDK-8218446 - SuspendAtExit hangs" should go 
>>> in first.
>>
>> Okay. Just waiting for Dan's stress test results.
>
> Please see v6:
> Inc:  http://cr.openjdk.java.net/~rehn/8220774/v6/inc/
> Full: http://cr.openjdk.java.net/~rehn/8220774/v6/full/
>
> Thanks, Robbin
>
>
>>
>> Thanks,
>> David
>> -----
>>
>>
>>> v5 full:
>>> http://cr.openjdk.java.net/~rehn/8220774/v5/full/webrev/
>>> v5 inc:
>>> http://cr.openjdk.java.net/~rehn/8220774/v5/inc/webrev/
>>>
>>> Thanks, Robbin
>>>
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>> On 22/03/2019 2:04 am, Robbin Ehn wrote:
>>>>> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>>>>>> Check-out v3, just full:
>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>
>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>       No comments.
>>>>>>
>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>       No comments.
>>>>>>
>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>       L2954:   assert((!has_last_Java_frame() && 
>>>>>> java_call_counter() == 0) ||
>>>>>>       L2955:          (has_last_Java_frame() && 
>>>>>> java_call_counter() > 0),
>>>>>> "wrong java_sp info!");
>>>>>>          Perhaps change to this:
>>>>>>
>>>>>>                assert((!has_last_Java_frame() && 
>>>>>> java_call_counter() == 0) ||
>>>>>>                       (has_last_Java_frame() && 
>>>>>> java_call_counter() > 0),
>>>>>>                       "unexpected frame info: has_last_frame=%d,
>>>>>> java_call_counter=%d",
>>>>>>                       has_last_Java_frame(), java_call_counter());
>>>>>>
>>>>>>           and consider making the same change to the nmethod 
>>>>>> version.
>>>>>>
>>>>>
>>>>> Fixed both!
>>>>>
>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>       No comments.
>>>>>>
>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>       L449:   // Must check for handshakes first, since ops returns.
>>>>>>           Perhaps:
>>>>>>               // Check for handshakes first since we may need to 
>>>>>> return a VMop.
>>>>>>
>>>>>>       L454:   // Check for a true cleanup first, trying to keep 
>>>>>> stats correct.
>>>>>>           Perhaps:
>>>>>>               // Check for a cleanup before SafepointALot to keep 
>>>>>> stats correct.
>>>>>>
>>>>>>       L635:     // We want to make sure that we get to a 
>>>>>> safepoint regularly.
>>>>>>           Perhaps an addition:
>>>>>>                 // We want to make sure that we get to a 
>>>>>> safepoint regularly
>>>>>>                 // even when executing VMops that don't require 
>>>>>> safepoints.
>>>>>>
>>>>>
>>>>> Fixed above!
>>>>>
>>>>> Publishing v4, since I need a second review:
>>>>> Full:
>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
>>>>> Inc:
>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
>>>>>
>>>>> Thanks, Robbin
>>>>>
>>>>>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>>>>>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>>>>>> call at the bottom of the "while(true)" in VMThread::loop().
>>>>>>
>>>>>> If you decide to tweak the semantics for SafepointALot down the 
>>>>>> road,
>>>>>> it would be best to do that in its own bug rather than as part of
>>>>>> another bug.
>>>>>>
>>>>>> Thumbs up! Your call on whether to tweak the assert or change the 
>>>>>> comments.
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>>>>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>>>>>> Hi Dan,
>>>>>>>>
>>>>>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>>>>>> And here is v2 for you to consider:
>>>>>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>>>>>>>
>>>>>>>>> It's too difficult to craft my comments relative to the 
>>>>>>>>> incremental
>>>>>>>>> webrev so I'm working with the full webrev still.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>>>       No comments.
>>>>>>>>>
>>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>>>       No comments.
>>>>>>>>>
>>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>>       L2953:   // This checks that the thread have a correct 
>>>>>>>>> frame state
>>>>>>>>> during a handshake
>>>>>>>>>           typo: s/have/has/
>>>>>>>>>           Also please add a '.' to end of the sentence.
>>>>>>>>
>>>>>>>> Fixed.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>       L2954:   assert((!has_last_Java_frame() && 
>>>>>>>>> java_call_counter() == 0) ||
>>>>>>>>>       L2955:          (has_last_Java_frame() && 
>>>>>>>>> java_call_counter() > 0),
>>>>>>>>> "wrong java_sp info!");
>>>>>>>>>           Trying to make sure I understand what you are 
>>>>>>>>> asserting:
>>>>>>>>>
>>>>>>>>>           1) if the thread does not have a last Java frame 
>>>>>>>>> then it must have
>>>>>>>>>              never called Java
>>>>>>>>>           2) if the thread has a last Java frame, then it must 
>>>>>>>>> have called
>>>>>>>>>              Java at least once
>>>>>>>>>
>>>>>>>>>           I'm good with the second expression. I have some 
>>>>>>>>> vague doubts about
>>>>>>>>>           first expression. When a JavaThread is done 
>>>>>>>>> executing Java code and
>>>>>>>>>           it is on its way toward a thread exit, is there ever 
>>>>>>>>> a time when it
>>>>>>>>>           no longer has a last Java frame? I'm thinking a 
>>>>>>>>> handshake late in
>>>>>>>>>           the JavaThread's life...
>>>>>>>>
>>>>>>>> The assert is a copy paste from
>>>>>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>>>>>
>>>>>>>> I have seen this assert trigger twice on windows, either it's 
>>>>>>>> wrong and should
>>>>>>>> be removed from nmethods_do, or it is correct and this will 
>>>>>>>> help us find when
>>>>>>>> this happens.
>>>>>>>
>>>>>>> Discussed this with Robbin via chat. I'm good with leaving
>>>>>>> the assert in place. Perhaps tweak both like this:
>>>>>>>
>>>>>>>      assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>>>             (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>             "unexpected frame info: has_last_frame=%d, 
>>>>>>> java_call_counter=%d",
>>>>>>>             has_last_Java_frame(), java_call_counter());
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>>>       No comments.
>>>>>>>>>
>>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>>       L503:           MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>>>>           Please add something like this above L503:
>>>>>>>>>
>>>>>>>>>                       // Have to unlock VMOperationQueue_lock 
>>>>>>>>> just in case
>>>>>>>>>                       // no_op_safepoint() has to do a handshake.
>>>>>>>>>
>>>>>>>>>       old L619:     // We want to make sure that we get to a 
>>>>>>>>> safepoint
>>>>>>>>> regularly.
>>>>>>>>>       old L620:     //
>>>>>>>>>       old L621:     if ((_cur_vm_operation =
>>>>>>>>> VMThread::no_op_safepoint(false)) != NULL) {
>>>>>>>>>           This call to no_op_safepoint() is at the bottom of 
>>>>>>>>> the "while(true)"
>>>>>>>>>           loop in VMThread::loop(). Before the fix for 
>>>>>>>>> 8219436, this line
>>>>>>>>> used to be:
>>>>>>>>>
>>>>>>>>>               if (VMThread::no_op_safepoint_needed(true)) {
>>>>>>>>>
>>>>>>>>>           and it was the only call to no_op_safepoint_needed() 
>>>>>>>>> that passed
>>>>>>>>>           'true'. It's the 'true' parameter that made the 
>>>>>>>>> comment on L619
>>>>>>>>>           correct. Why do I say that? Well, the only way we 
>>>>>>>>> get to the
>>>>>>>>>           bottom of the "while(true)" loop is if we had a 
>>>>>>>>> vm_operation to
>>>>>>>>>           perform. Let's say we had a bunch of no-safepoint 
>>>>>>>>> vm_operations
>>>>>>>>>           to perform and we just kept executing them, one 
>>>>>>>>> after another.
>>>>>>>>>           While doing this work, if our time between 
>>>>>>>>> safepoints exceeds
>>>>>>>>>           GuaranteedSafepointInterval, then this 
>>>>>>>>> no_op_safepoint_needed(true)
>>>>>>>>>           call is what would detect that we've gone too long 
>>>>>>>>> between
>>>>>>>>> safepoints
>>>>>>>>>           and would force one.
>>>>>>>>>
>>>>>>>>>           With the fix for 8219436, that call was changed to
>>>>>>>>> no_op_safepoint(false)
>>>>>>>>>           and with the v2 version of this fix that call is now 
>>>>>>>>> gone. So we no
>>>>>>>>>           longer have the ability to have 
>>>>>>>>> GuaranteedSafepointInterval work
>>>>>>>>>           right when we are doing lots of non-safepoint VM 
>>>>>>>>> operations.
>>>>>>>>>
>>>>>>>>>           This no_op_safepoint_needed() probe point, either 
>>>>>>>>> with 'true' or
>>>>>>>>>           'false', also gave us the opportunity to force a 
>>>>>>>>> safepoint when
>>>>>>>>>           SafepointALot is true or when
>>>>>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>>>>>           returns true. In the v2 version of this fix, we lose 
>>>>>>>>> the ability
>>>>>>>>>           to SafepointALot after each VM operation. We also 
>>>>>>>>> lose the ability
>>>>>>>>>           to do cleanup() after each VM operation.
>>>>>>>>
>>>>>>>> And I don't think that is an issue, ICache and Monitor, which 
>>>>>>>> is the
>>>>>>>> cleanup we
>>>>>>>> do, just have just basic heuristic. If we cleanup every second 
>>>>>>>> or every other
>>>>>>>> second makes little difference.
>>>>>>>
>>>>>>> My point is that you are changing the semantics of SafepointALot 
>>>>>>> without
>>>>>>> stating clearly that is what you are doing. The point of 
>>>>>>> SafepointALot is
>>>>>>> to inject a safepoint after every VMop even those VMops that 
>>>>>>> have to be
>>>>>>> executed at a safepoint. SafepointALot is considered a stress 
>>>>>>> option for
>>>>>>> a very good reason.
>>>>>>>
>>>>>>>
>>>>>>>> SafepointALot is just diag, which causes safepoint directly 
>>>>>>>> after a safepoint.
>>>>>>>> I'd rather have it doing one extra safepoint per 
>>>>>>>> GuaranteedSafepointInterval,
>>>>>>>> that way you know how many safepoint you have per second, which 
>>>>>>>> is really
>>>>>>>> useful
>>>>>>>> for benchmarking. In current form it can't be used for benchmarks.
>>>>>>>
>>>>>>> You would never use SafepointALot in a benchmark run. It is is
>>>>>>> stress option.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>           I think you need to restore the "bool check_time" 
>>>>>>>>> parameter to
>>>>>>>>>           no_op_safepoint() and you need to restore this call 
>>>>>>>>> to as:
>>>>>>>>>           VMThread::no_op_safepoint_needed(true) along with 
>>>>>>>>> the rest
>>>>>>>>>           of old L618-L626.
>>>>>>>>
>>>>>>>> I don't need the parameter, we can always check the time.
>>>>>>>> When we timeout on queue wait we will always pass the time check.
>>>>>>>
>>>>>>> As I pointed out, if you are doing non-safepoint VM ops, you can 
>>>>>>> miss
>>>>>>> a time check.
>>>>>>>
>>>>>>>
>>>>>>>> So the difference is that I think the previous behavior was 
>>>>>>>> bad/buggy and
>>>>>>>> you are trying to restore it :)
>>>>>>>>
>>>>>>>> Check-out v3, just full:
>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>
>>>>>>> I'll post a second message with my re-review for this webrev.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> - wait on queue GuaranteedSafepointInterval
>>>>>>>> - if timeout
>>>>>>>>      - do handshake a lot
>>>>>>>>      - we are guaranteed to pass the time check, so cleanup if 
>>>>>>>> needed
>>>>>>>>      - else safepoint alot
>>>>>>>> - if op
>>>>>>>>      - after op
>>>>>>>>          - do handshake a lot
>>>>>>>>          - check time, if expired, cleanup if needed
>>>>>>>>          - else safepoint alot
>>>>>>>>
>>>>>>>> I think that is what you want :)
>>>>>>>> And that's what in v3.
>>>>>>>>
>>>>>>>> I would rather have:
>>>>>>>> - wait-time = GuaranteedSafepointInterval -
>>>>>>>> SafepointTracing::time_since_last_safepoint_ms()
>>>>>>>> - if wait-time > 0
>>>>>>>>      - wait on queue
>>>>>>>> - else or timeout
>>>>>>>>      - do handshake a lot
>>>>>>>>      - cleanup if needed
>>>>>>>>      - else safepoint alot
>>>>>>>>
>>>>>>>> That way we do check for cleanup after non-safepoint in time 
>>>>>>>> but we do not
>>>>>>>> causes extra safepoints after safepoints with SafepointALot. (I 
>>>>>>>> know you
>>>>>>>> think of this as feature :) )
>>>>>>>
>>>>>>> Not so much a feature as a stress option. I love my stress options.
>>>>>>>
>>>>>>> Dan
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, Robbin
>>>>>>>
>>>>>>



More information about the hotspot-runtime-dev mailing list