RFR(s): 8220774: Add HandshakeALot diag option

Tue Mar 26 07:12:08 UTC 2019

On 26/03/2019 5:00 pm, Robbin Ehn wrote:
> Hi David,
> 
>> This really makes me wonder whether the test truly stresses 
>> suspend/resume or whether it quickly grinds to a halt till the main 
>> thread tells everyone to exit?
> 
> That is the reason for skipping two 'last' threads.

Ah!

> In worse case there is always two threads suspending and resuming the 
> others.
> We seem to manage something like 800 handshakes during the test on my 
> machine.
> So the chances are not to bad to hit a few of the interesting code paths.
> 
> Adding Dan's last comment suggestion, are you fine with v6 + that comment?

Yep all good!

Thanks,
David

> Thanks, Robbin
> 
> 
>>
>> Cheers,
>> David
>>
>>> Thumbs up!
>>>
>>> Dan
>>>
>>>
>>>
>>> On 3/25/19 5:09 AM, Robbin Ehn wrote:
>>>> Hi David,
>>>>
>>>> On 3/25/19 2:36 AM, David Holmes wrote:
>>>>> Hi Robbin,
>>>>>
>>>>> On 23/03/2019 1:29 am, Robbin Ehn wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> On 3/22/19 8:05 AM, David Holmes wrote:
>>>>>>> Hi Robbin,
>>>>>>>
>>>>>>> This was a little more complex than I had imagined. :) A couple 
>>>>>>> of comments:
>>>>>>>
>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>
>>>>>>>    508           // Have to unlock VMOperationQueue_lock just in 
>>>>>>> case
>>>>>>> no_op_safepoint()
>>>>>>>    509           // has to do a handshake.
>>>>>>>    510           MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>>    511           if (timedout && (_cur_vm_operation =
>>>>>>> VMThread::no_op_safepoint()) != NULL) {
>>>>>>>
>>>>>>> wouldn't it be better to check timedout first and only then use 
>>>>>>> the unlocker
>>>>>>> then check _cur_vm_operation?
>>>>>>
>>>>>> Fixed.
>>>>>>
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>
>>>>>>> Can has_last_Java_frame() and java_call_counter() change values 
>>>>>>> between their
>>>>>>> use in the assert and their use in the assert message?
>>>>>>
>>>>>> No, the JavaThread is safepoint_safe here, thus must have a stable 
>>>>>> java stack.
>>>>>
>>>>> Ok.
>>>>>
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> Don't you want to add some tests that exercise this? Or update 
>>>>>>> existing tests
>>>>>>> that use SafepointALot to also use HandshakeALot?
>>>>>>
>>>>>> Fixed!
>>>>>
>>>>> I'm unclear how this fix and the renamed test relate to the 
>>>>> existing bug JDK-8214174 that was referenced in the ProblemList 
>>>>> entry ??
>>>>
>>>> Since we are not using withebox to walk thread stack, instead using
>>>> HandshakeALot I removed Walk from name. The original could cause a 
>>>> circular
>>>> suspend issue.
>>>>
>>>>>
>>>>> test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
>>>>>
>>>>> If you need the loop index use the full form of the for-loop. This:
>>>>>
>>>>>    43             int i = 0;
>>>>>    44             for (Thread thr : _suspend_threads) {
>>>>>    45                 if (i++ > _suspend_threads.length -2) {
>>>>>    46                     // Leave last 2 threads running.
>>>>>    47                     break;
>>>>>    48                 }
>>>>>    49                 if (Thread.currentThread() != thr) {
>>>>>
>>>>> can reduce to simply:
>>>>>
>>>>>      // Leave last 2 threads running.
>>>>>      for (int i = 0; i < _suspend_threads.length - 2; i++) {
>>>>>          if (Thread.currentThread() != _suspend_threads[i]) {
>>>>>
>>>>
>>>> Fixed.
>>>>
>>>>>
>>>>>   67         // Wait for all suspend thread starting to loop.
>>>>>
>>>>> ->  // Wait for all suspend-threads to start looping.
>>>>>
>>>>>   75             exit_threads[i] = new Thread(new Runnable() { 
>>>>> public void run() {} });
>>>>>
>>>>> The "new Runnable()..." is unnecessary - "new Thread();" suffices.
>>>>>
>>>>>
>>>>>    79         // Try to suspend them.
>>>>>    80         for (Thread thr : exit_threads) {
>>>>>    81             thr.suspend();
>>>>>    82         }
>>>>>    83         for (Thread thr : exit_threads) {
>>>>>    84             thr.resume();
>>>>>    85         }
>>>>>
>>>>> there's really no guarantee of getting the suspend during "exit". 
>>>>> The SuspendAtExit test seems to use logging to make it more likely 
>>>>> to encounter the suspend when desired. Is there a reason not to add 
>>>>> a second @run to that test to use HandShakes?
>>>>
>>>> The original issue I had was so unlikely that I needed tenths of 
>>>> thousands of
>>>> thread and looping that for several hours. (that issue is fixed)
>>>> So the test just try to quickly with a small chance hit interesting 
>>>> paths.
>>>>
>>>> We can do that also.
>>>>
>>>>>
>>>>>    90         do {
>>>>>    91             for (Thread thr : _suspend_threads) {
>>>>>    92                 thr.resume();
>>>>>    93             }
>>>>>    94             while (_sem.tryAcquire()) {
>>>>>    95                 --waiting;
>>>>>    96             }
>>>>>    97         } while (waiting > 0);
>>>>>
>>>>> why do a busy-wait and resume threads you never suspended? Why not 
>>>>> just do a blocking acquire() on the semaphore? The main 
>>>>> suspend/resume logic in the _suspend_threads must leave the target 
>>>>> thread resumed.
>>>>
>>>> The _suspend_threads may create a circular suspend, so we resume 
>>>> them just in case.
>>>>
>>>>>
>>>>>> All handshakes test passed 100 iteration on each platform.
>>>>>> But the "8221207: Redo JDK-8218446 - SuspendAtExit hangs" should 
>>>>>> go in first.
>>>>>
>>>>> Okay. Just waiting for Dan's stress test results.
>>>>
>>>> Please see v6:
>>>> Inc:  http://cr.openjdk.java.net/~rehn/8220774/v6/inc/
>>>> Full: http://cr.openjdk.java.net/~rehn/8220774/v6/full/
>>>>
>>>> Thanks, Robbin
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>> -----
>>>>>
>>>>>
>>>>>> v5 full:
>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v5/full/webrev/
>>>>>> v5 inc:
>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v5/inc/webrev/
>>>>>>
>>>>>> Thanks, Robbin
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>> On 22/03/2019 2:04 am, Robbin Ehn wrote:
>>>>>>>> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>>>>>>>>> Check-out v3, just full:
>>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>>>
>>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>>>       No comments.
>>>>>>>>>
>>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>>>       No comments.
>>>>>>>>>
>>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>>       L2954:   assert((!has_last_Java_frame() && 
>>>>>>>>> java_call_counter() == 0) ||
>>>>>>>>>       L2955:          (has_last_Java_frame() && 
>>>>>>>>> java_call_counter() > 0),
>>>>>>>>> "wrong java_sp info!");
>>>>>>>>>          Perhaps change to this:
>>>>>>>>>
>>>>>>>>>                assert((!has_last_Java_frame() && 
>>>>>>>>> java_call_counter() == 0) ||
>>>>>>>>>                       (has_last_Java_frame() && 
>>>>>>>>> java_call_counter() > 0),
>>>>>>>>>                       "unexpected frame info: has_last_frame=%d,
>>>>>>>>> java_call_counter=%d",
>>>>>>>>>                       has_last_Java_frame(), java_call_counter());
>>>>>>>>>
>>>>>>>>>           and consider making the same change to the nmethod 
>>>>>>>>> version.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Fixed both!
>>>>>>>>
>>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>>>       No comments.
>>>>>>>>>
>>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>>       L449:   // Must check for handshakes first, since ops 
>>>>>>>>> returns.
>>>>>>>>>           Perhaps:
>>>>>>>>>               // Check for handshakes first since we may need 
>>>>>>>>> to return a VMop.
>>>>>>>>>
>>>>>>>>>       L454:   // Check for a true cleanup first, trying to keep 
>>>>>>>>> stats correct.
>>>>>>>>>           Perhaps:
>>>>>>>>>               // Check for a cleanup before SafepointALot to 
>>>>>>>>> keep stats correct.
>>>>>>>>>
>>>>>>>>>       L635:     // We want to make sure that we get to a 
>>>>>>>>> safepoint regularly.
>>>>>>>>>           Perhaps an addition:
>>>>>>>>>                 // We want to make sure that we get to a 
>>>>>>>>> safepoint regularly
>>>>>>>>>                 // even when executing VMops that don't require 
>>>>>>>>> safepoints.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Fixed above!
>>>>>>>>
>>>>>>>> Publishing v4, since I need a second review:
>>>>>>>> Full:
>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
>>>>>>>> Inc:
>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
>>>>>>>>
>>>>>>>> Thanks, Robbin
>>>>>>>>
>>>>>>>>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>>>>>>>>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>>>>>>>>> call at the bottom of the "while(true)" in VMThread::loop().
>>>>>>>>>
>>>>>>>>> If you decide to tweak the semantics for SafepointALot down the 
>>>>>>>>> road,
>>>>>>>>> it would be best to do that in its own bug rather than as part of
>>>>>>>>> another bug.
>>>>>>>>>
>>>>>>>>> Thumbs up! Your call on whether to tweak the assert or change 
>>>>>>>>> the comments.
>>>>>>>>>
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>>>>>>>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>
>>>>>>>>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>>>>>>>>> And here is v2 for you to consider:
>>>>>>>>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>>>>>>>>>>
>>>>>>>>>>>> It's too difficult to craft my comments relative to the 
>>>>>>>>>>>> incremental
>>>>>>>>>>>> webrev so I'm working with the full webrev still.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>>>>>>       No comments.
>>>>>>>>>>>>
>>>>>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>>>>>>       No comments.
>>>>>>>>>>>>
>>>>>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>>>>>       L2953:   // This checks that the thread have a correct 
>>>>>>>>>>>> frame state
>>>>>>>>>>>> during a handshake
>>>>>>>>>>>>           typo: s/have/has/
>>>>>>>>>>>>           Also please add a '.' to end of the sentence.
>>>>>>>>>>>
>>>>>>>>>>> Fixed.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>       L2954:   assert((!has_last_Java_frame() && 
>>>>>>>>>>>> java_call_counter() == 0) ||
>>>>>>>>>>>>       L2955:          (has_last_Java_frame() && 
>>>>>>>>>>>> java_call_counter() > 0),
>>>>>>>>>>>> "wrong java_sp info!");
>>>>>>>>>>>>           Trying to make sure I understand what you are 
>>>>>>>>>>>> asserting:
>>>>>>>>>>>>
>>>>>>>>>>>>           1) if the thread does not have a last Java frame 
>>>>>>>>>>>> then it must have
>>>>>>>>>>>>              never called Java
>>>>>>>>>>>>           2) if the thread has a last Java frame, then it 
>>>>>>>>>>>> must have called
>>>>>>>>>>>>              Java at least once
>>>>>>>>>>>>
>>>>>>>>>>>>           I'm good with the second expression. I have some 
>>>>>>>>>>>> vague doubts about
>>>>>>>>>>>>           first expression. When a JavaThread is done 
>>>>>>>>>>>> executing Java code and
>>>>>>>>>>>>           it is on its way toward a thread exit, is there 
>>>>>>>>>>>> ever a time when it
>>>>>>>>>>>>           no longer has a last Java frame? I'm thinking a 
>>>>>>>>>>>> handshake late in
>>>>>>>>>>>>           the JavaThread's life...
>>>>>>>>>>>
>>>>>>>>>>> The assert is a copy paste from
>>>>>>>>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>>>>>>>>
>>>>>>>>>>> I have seen this assert trigger twice on windows, either it's 
>>>>>>>>>>> wrong and should
>>>>>>>>>>> be removed from nmethods_do, or it is correct and this will 
>>>>>>>>>>> help us find when
>>>>>>>>>>> this happens.
>>>>>>>>>>
>>>>>>>>>> Discussed this with Robbin via chat. I'm good with leaving
>>>>>>>>>> the assert in place. Perhaps tweak both like this:
>>>>>>>>>>
>>>>>>>>>>      assert((!has_last_Java_frame() && java_call_counter() == 
>>>>>>>>>> 0) ||
>>>>>>>>>>             (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>>>>             "unexpected frame info: has_last_frame=%d, 
>>>>>>>>>> java_call_counter=%d",
>>>>>>>>>>             has_last_Java_frame(), java_call_counter());
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>>>>>>       No comments.
>>>>>>>>>>>>
>>>>>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>>>>>       L503:           MutexUnlockerEx 
>>>>>>>>>>>> mul(VMOperationQueue_lock,
>>>>>>>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>>>>>>>           Please add something like this above L503:
>>>>>>>>>>>>
>>>>>>>>>>>>                       // Have to unlock 
>>>>>>>>>>>> VMOperationQueue_lock just in case
>>>>>>>>>>>>                       // no_op_safepoint() has to do a 
>>>>>>>>>>>> handshake.
>>>>>>>>>>>>
>>>>>>>>>>>>       old L619:     // We want to make sure that we get to a 
>>>>>>>>>>>> safepoint
>>>>>>>>>>>> regularly.
>>>>>>>>>>>>       old L620:     //
>>>>>>>>>>>>       old L621:     if ((_cur_vm_operation =
>>>>>>>>>>>> VMThread::no_op_safepoint(false)) != NULL) {
>>>>>>>>>>>>           This call to no_op_safepoint() is at the bottom of 
>>>>>>>>>>>> the "while(true)"
>>>>>>>>>>>>           loop in VMThread::loop(). Before the fix for 
>>>>>>>>>>>> 8219436, this line
>>>>>>>>>>>> used to be:
>>>>>>>>>>>>
>>>>>>>>>>>>               if (VMThread::no_op_safepoint_needed(true)) {
>>>>>>>>>>>>
>>>>>>>>>>>>           and it was the only call to 
>>>>>>>>>>>> no_op_safepoint_needed() that passed
>>>>>>>>>>>>           'true'. It's the 'true' parameter that made the 
>>>>>>>>>>>> comment on L619
>>>>>>>>>>>>           correct. Why do I say that? Well, the only way we 
>>>>>>>>>>>> get to the
>>>>>>>>>>>>           bottom of the "while(true)" loop is if we had a 
>>>>>>>>>>>> vm_operation to
>>>>>>>>>>>>           perform. Let's say we had a bunch of no-safepoint 
>>>>>>>>>>>> vm_operations
>>>>>>>>>>>>           to perform and we just kept executing them, one 
>>>>>>>>>>>> after another.
>>>>>>>>>>>>           While doing this work, if our time between 
>>>>>>>>>>>> safepoints exceeds
>>>>>>>>>>>>           GuaranteedSafepointInterval, then this 
>>>>>>>>>>>> no_op_safepoint_needed(true)
>>>>>>>>>>>>           call is what would detect that we've gone too long 
>>>>>>>>>>>> between
>>>>>>>>>>>> safepoints
>>>>>>>>>>>>           and would force one.
>>>>>>>>>>>>
>>>>>>>>>>>>           With the fix for 8219436, that call was changed to
>>>>>>>>>>>> no_op_safepoint(false)
>>>>>>>>>>>>           and with the v2 version of this fix that call is 
>>>>>>>>>>>> now gone. So we no
>>>>>>>>>>>>           longer have the ability to have 
>>>>>>>>>>>> GuaranteedSafepointInterval work
>>>>>>>>>>>>           right when we are doing lots of non-safepoint VM 
>>>>>>>>>>>> operations.
>>>>>>>>>>>>
>>>>>>>>>>>>           This no_op_safepoint_needed() probe point, either 
>>>>>>>>>>>> with 'true' or
>>>>>>>>>>>>           'false', also gave us the opportunity to force a 
>>>>>>>>>>>> safepoint when
>>>>>>>>>>>>           SafepointALot is true or when
>>>>>>>>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>>>>>>>>           returns true. In the v2 version of this fix, we 
>>>>>>>>>>>> lose the ability
>>>>>>>>>>>>           to SafepointALot after each VM operation. We also 
>>>>>>>>>>>> lose the ability
>>>>>>>>>>>>           to do cleanup() after each VM operation.
>>>>>>>>>>>
>>>>>>>>>>> And I don't think that is an issue, ICache and Monitor, which 
>>>>>>>>>>> is the
>>>>>>>>>>> cleanup we
>>>>>>>>>>> do, just have just basic heuristic. If we cleanup every 
>>>>>>>>>>> second or every other
>>>>>>>>>>> second makes little difference.
>>>>>>>>>>
>>>>>>>>>> My point is that you are changing the semantics of 
>>>>>>>>>> SafepointALot without
>>>>>>>>>> stating clearly that is what you are doing. The point of 
>>>>>>>>>> SafepointALot is
>>>>>>>>>> to inject a safepoint after every VMop even those VMops that 
>>>>>>>>>> have to be
>>>>>>>>>> executed at a safepoint. SafepointALot is considered a stress 
>>>>>>>>>> option for
>>>>>>>>>> a very good reason.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> SafepointALot is just diag, which causes safepoint directly 
>>>>>>>>>>> after a safepoint.
>>>>>>>>>>> I'd rather have it doing one extra safepoint per 
>>>>>>>>>>> GuaranteedSafepointInterval,
>>>>>>>>>>> that way you know how many safepoint you have per second, 
>>>>>>>>>>> which is really
>>>>>>>>>>> useful
>>>>>>>>>>> for benchmarking. In current form it can't be used for 
>>>>>>>>>>> benchmarks.
>>>>>>>>>>
>>>>>>>>>> You would never use SafepointALot in a benchmark run. It is is
>>>>>>>>>> stress option.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>           I think you need to restore the "bool check_time" 
>>>>>>>>>>>> parameter to
>>>>>>>>>>>>           no_op_safepoint() and you need to restore this 
>>>>>>>>>>>> call to as:
>>>>>>>>>>>>           VMThread::no_op_safepoint_needed(true) along with 
>>>>>>>>>>>> the rest
>>>>>>>>>>>>           of old L618-L626.
>>>>>>>>>>>
>>>>>>>>>>> I don't need the parameter, we can always check the time.
>>>>>>>>>>> When we timeout on queue wait we will always pass the time 
>>>>>>>>>>> check.
>>>>>>>>>>
>>>>>>>>>> As I pointed out, if you are doing non-safepoint VM ops, you 
>>>>>>>>>> can miss
>>>>>>>>>> a time check.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> So the difference is that I think the previous behavior was 
>>>>>>>>>>> bad/buggy and
>>>>>>>>>>> you are trying to restore it :)
>>>>>>>>>>>
>>>>>>>>>>> Check-out v3, just full:
>>>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>>>>
>>>>>>>>>> I'll post a second message with my re-review for this webrev.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> - wait on queue GuaranteedSafepointInterval
>>>>>>>>>>> - if timeout
>>>>>>>>>>>      - do handshake a lot
>>>>>>>>>>>      - we are guaranteed to pass the time check, so cleanup 
>>>>>>>>>>> if needed
>>>>>>>>>>>      - else safepoint alot
>>>>>>>>>>> - if op
>>>>>>>>>>>      - after op
>>>>>>>>>>>          - do handshake a lot
>>>>>>>>>>>          - check time, if expired, cleanup if needed
>>>>>>>>>>>          - else safepoint alot
>>>>>>>>>>>
>>>>>>>>>>> I think that is what you want :)
>>>>>>>>>>> And that's what in v3.
>>>>>>>>>>>
>>>>>>>>>>> I would rather have:
>>>>>>>>>>> - wait-time = GuaranteedSafepointInterval -
>>>>>>>>>>> SafepointTracing::time_since_last_safepoint_ms()
>>>>>>>>>>> - if wait-time > 0
>>>>>>>>>>>      - wait on queue
>>>>>>>>>>> - else or timeout
>>>>>>>>>>>      - do handshake a lot
>>>>>>>>>>>      - cleanup if needed
>>>>>>>>>>>      - else safepoint alot
>>>>>>>>>>>
>>>>>>>>>>> That way we do check for cleanup after non-safepoint in time 
>>>>>>>>>>> but we do not
>>>>>>>>>>> causes extra safepoints after safepoints with SafepointALot. 
>>>>>>>>>>> (I know you
>>>>>>>>>>> think of this as feature :) )
>>>>>>>>>>
>>>>>>>>>> Not so much a feature as a stress option. I love my stress 
>>>>>>>>>> options.
>>>>>>>>>>
>>>>>>>>>> Dan
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks, Robbin
>>>>>>>>>>
>>>>>>>>>
>>>