RFR(s): 8220774: Add HandshakeALot diag option

Robbin Ehn robbin.ehn at oracle.com
Mon Mar 25 09:09:21 UTC 2019


Hi David,

On 3/25/19 2:36 AM, David Holmes wrote:
> Hi Robbin,
> 
> On 23/03/2019 1:29 am, Robbin Ehn wrote:
>> Hi David,
>>
>> On 3/22/19 8:05 AM, David Holmes wrote:
>>> Hi Robbin,
>>>
>>> This was a little more complex than I had imagined. :) A couple of comments:
>>>
>>> src/hotspot/share/runtime/vmThread.cpp
>>>
>>>    508           // Have to unlock VMOperationQueue_lock just in case
>>> no_op_safepoint()
>>>    509           // has to do a handshake.
>>>    510           MutexUnlockerEx mul(VMOperationQueue_lock,
>>> Mutex::_no_safepoint_check_flag);
>>>    511           if (timedout && (_cur_vm_operation =
>>> VMThread::no_op_safepoint()) != NULL) {
>>>
>>> wouldn't it be better to check timedout first and only then use the unlocker
>>> then check _cur_vm_operation?
>>
>> Fixed.
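As an illustration of the reordering David suggested — only pay for the unlock/relock when we actually timed out — here is a minimal Java analog, with `ReentrantLock` standing in for `VMOperationQueue_lock` and all names (`pollQueue`, `noOpSafepoint`, `currentOp`) hypothetical:

```java
import java.util.concurrent.locks.ReentrantLock;

public class TimeoutUnlockSketch {
    static final ReentrantLock queueLock = new ReentrantLock();
    static Runnable currentOp;

    // Hypothetical stand-in for VMThread::no_op_safepoint().
    static Runnable noOpSafepoint() {
        return () -> { /* pretend to run a forced safepoint */ };
    }

    static void pollQueue(boolean timedOut) {
        queueLock.lock();
        try {
            // Check timedOut first; only then drop the lock, since
            // noOpSafepoint() may need to perform a handshake.
            if (timedOut) {
                queueLock.unlock();
                try {
                    currentOp = noOpSafepoint();
                } finally {
                    queueLock.lock();
                }
            }
        } finally {
            queueLock.unlock();
        }
    }

    public static void main(String[] args) {
        pollQueue(true);
        System.out.println(currentOp != null);
    }
}
```

The point of the reordering is simply that the non-timeout (common) path never releases and re-acquires the lock.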
>>
>>>
>>> ---
>>>
>>> src/hotspot/share/runtime/thread.cpp
>>>
>>> Can has_last_Java_frame() and java_call_counter() change values between their
>>> use in the assert and their use in the assert message?
>>
>> No, the JavaThread is safepoint_safe here, thus must have a stable java stack.
> 
> Ok.
> 
>>>
>>> ---
>>>
>>> Don't you want to add some tests that exercise this? Or update existing tests
>>> that use SafepointALot to also use HandshakeALot?
>>
>> Fixed!
> 
> I'm unclear how this fix and the renamed test relate to the existing bug 
> JDK-8214174 that was referenced in the ProblemList entry ??

Since we are no longer using whitebox to walk the thread stack, and are using
HandshakeALot instead, I removed "Walk" from the name. The original approach
could cause a circular suspend issue.

> 
> test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
> 
> If you need the loop index use the full form of the for-loop. This:
> 
>    43             int i = 0;
>    44             for (Thread thr : _suspend_threads) {
>    45                 if (i++ > _suspend_threads.length -2) {
>    46                     // Leave last 2 threads running.
>    47                     break;
>    48                 }
>    49                 if (Thread.currentThread() != thr) {
> 
> can reduce to simply:
> 
>      // Leave last 2 threads running.
>      for (int i = 0; i < _suspend_threads.length - 2; i++) {
>          if (Thread.currentThread() != _suspend_threads[i]) {
> 

Fixed.
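For reference, the simplified loop as a self-contained sketch (the array size is illustrative, and the counter stands in for the `suspend()` call the real test makes):

```java
public class SuspendLoopSketch {
    // Count the threads the loop would act on, leaving the last two alone.
    static int countTargets(Thread[] suspendThreads) {
        int touched = 0;
        // Leave the last 2 threads running.
        for (int i = 0; i < suspendThreads.length - 2; i++) {
            if (Thread.currentThread() != suspendThreads[i]) {
                touched++;  // the real test calls thr.suspend() here
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread();
        }
        // Main thread is not in the array, so all but the last 2 count.
        System.out.println(countTargets(threads));  // 2
    }
}
```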

> 
>   67         // Wait for all suspend thread starting to loop.
> 
> ->  // Wait for all suspend-threads to start looping.
> 
>   75             exit_threads[i] = new Thread(new Runnable() { public void run() 
> {} });
> 
> The "new Runnable()..." is unnecessary - "new Thread();" suffices.
> 
> 
>    79         // Try to suspend them.
>    80         for (Thread thr : exit_threads) {
>    81             thr.suspend();
>    82         }
>    83         for (Thread thr : exit_threads) {
>    84             thr.resume();
>    85         }
> 
> there's really no guarantee of getting the suspend during "exit". The 
> SuspendAtExit test seems to use logging to make it more likely to encounter the 
> suspend when desired. Is there a reason not to add a second @run to that test to 
> use HandShakes?

The original issue I had was so unlikely that I needed tens of thousands of
threads, looping for several hours (that issue is fixed).
So the test just tries, quickly and with a small chance, to hit the
interesting paths.

We can do that also.

> 
>    90         do {
>    91             for (Thread thr : _suspend_threads) {
>    92                 thr.resume();
>    93             }
>    94             while (_sem.tryAcquire()) {
>    95                 --waiting;
>    96             }
>    97         } while (waiting > 0);
> 
> why do a busy-wait and resume threads you never suspended? Why not just do a 
> blocking acquire() on the semaphore? The main suspend/resume logic in the 
> _suspend_threads must leave the target thread resumed.

The _suspend_threads may create a circular suspend, so we resume them just in case.
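For contrast, the blocking pattern David describes — each worker releases the semaphore once, and the waiter does one plain acquire per worker instead of a `tryAcquire()` busy-loop — can be sketched like this (thread count and method name are illustrative; this only works when, unlike in the test, no worker can be left suspended):

```java
import java.util.concurrent.Semaphore;

public class SemaphoreWaitSketch {
    // Start n workers and block until every one has signalled completion.
    static int awaitWorkers(int n) throws InterruptedException {
        Semaphore done = new Semaphore(0);
        for (int i = 0; i < n; i++) {
            new Thread(() -> {
                // ... real work goes here; must end in a resumed state ...
                done.release();
            }).start();
        }
        // One blocking acquire per worker: no busy-loop over tryAcquire(),
        // and no speculative resume() calls on threads we never suspended.
        done.acquire(n);
        return n;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(awaitWorkers(4) + " workers finished");
    }
}
```

In the actual test the precondition does not hold: the workers can suspend each other in a cycle, so the waiter has to keep resuming them or the releases may never happen.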

> 
>> All handshake tests passed 100 iterations on each platform.
>> But the "8221207: Redo JDK-8218446 - SuspendAtExit hangs" should go in first.
> 
> Okay. Just waiting for Dan's stress test results.

Please see v6:
Inc:  http://cr.openjdk.java.net/~rehn/8220774/v6/inc/
Full: http://cr.openjdk.java.net/~rehn/8220774/v6/full/

Thanks, Robbin


> 
> Thanks,
> David
> -----
> 
> 
>> v5 full:
>> http://cr.openjdk.java.net/~rehn/8220774/v5/full/webrev/
>> v5 inc:
>> http://cr.openjdk.java.net/~rehn/8220774/v5/inc/webrev/
>>
>> Thanks, Robbin
>>
>>>
>>> Thanks,
>>> David
>>>
>>> On 22/03/2019 2:04 am, Robbin Ehn wrote:
>>>> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>>>>> Check-out v3, just full:
>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>
>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>       No comments.
>>>>>
>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>       No comments.
>>>>>
>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>       L2954:   assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>       L2955:          (has_last_Java_frame() && java_call_counter() > 0),
>>>>> "wrong java_sp info!");
>>>>>          Perhaps change to this:
>>>>>
>>>>>                assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>                       (has_last_Java_frame() && java_call_counter() > 0),
>>>>>                       "unexpected frame info: has_last_frame=%d,
>>>>> java_call_counter=%d",
>>>>>                       has_last_Java_frame(), java_call_counter());
>>>>>
>>>>>           and consider making the same change to the nmethod version.
>>>>>
>>>>
>>>> Fixed both!
>>>>
>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>       No comments.
>>>>>
>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>       L449:   // Must check for handshakes first, since ops returns.
>>>>>           Perhaps:
>>>>>               // Check for handshakes first since we may need to return a 
>>>>> VMop.
>>>>>
>>>>>       L454:   // Check for a true cleanup first, trying to keep stats correct.
>>>>>           Perhaps:
>>>>>               // Check for a cleanup before SafepointALot to keep stats 
>>>>> correct.
>>>>>
>>>>>       L635:     // We want to make sure that we get to a safepoint regularly.
>>>>>           Perhaps an addition:
>>>>>                 // We want to make sure that we get to a safepoint regularly
>>>>>                 // even when executing VMops that don't require safepoints.
>>>>>
>>>>
>>>> Fixed above!
>>>>
>>>> Publishing v4, since I need a second review:
>>>> Full:
>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
>>>> Inc:
>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
>>>>
>>>> Thanks, Robbin
>>>>
>>>>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>>>>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>>>>> call at the bottom of the "while(true)" in VMThread::loop().
>>>>>
>>>>> If you decide to tweak the semantics for SafepointALot down the road,
>>>>> it would be best to do that in its own bug rather than as part of
>>>>> another bug.
>>>>>
>>>>> Thumbs up! Your call on whether to tweak the assert or change the comments.
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>>
>>>>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>>>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>>>>> Hi Dan,
>>>>>>>
>>>>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>>>>> And here is v2 for you to consider:
>>>>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>>>>>>
>>>>>>>> It's too difficult to craft my comments relative to the incremental
>>>>>>>> webrev so I'm working with the full webrev still.
>>>>>>>>
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>>       No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>>       No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>       L2953:   // This checks that the thread have a correct frame state
>>>>>>>> during a handshake
>>>>>>>>           typo: s/have/has/
>>>>>>>>           Also please add a '.' to end of the sentence.
>>>>>>>
>>>>>>> Fixed.
>>>>>>>
>>>>>>>>
>>>>>>>>       L2954:   assert((!has_last_Java_frame() && java_call_counter() == 
>>>>>>>> 0) ||
>>>>>>>>       L2955:          (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>> "wrong java_sp info!");
>>>>>>>>           Trying to make sure I understand what you are asserting:
>>>>>>>>
>>>>>>>>           1) if the thread does not have a last Java frame then it must 
>>>>>>>> have
>>>>>>>>              never called Java
>>>>>>>>           2) if the thread has a last Java frame, then it must have called
>>>>>>>>              Java at least once
>>>>>>>>
>>>>>>>>           I'm good with the second expression. I have some vague doubts 
>>>>>>>> about
>>>>>>>>           first expression. When a JavaThread is done executing Java 
>>>>>>>> code and
>>>>>>>>           it is on its way toward a thread exit, is there ever a time 
>>>>>>>> when it
>>>>>>>>           no longer has a last Java frame? I'm thinking a handshake late in
>>>>>>>>           the JavaThread's life...
>>>>>>>
>>>>>>> The assert is a copy paste from
>>>>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>>>>
>>>>>>> I have seen this assert trigger twice on windows, either it's wrong and 
>>>>>>> should
>>>>>>> be removed from nmethods_do, or it is correct and this will help us find 
>>>>>>> when
>>>>>>> this happens.
>>>>>>
>>>>>> Discussed this with Robbin via chat. I'm good with leaving
>>>>>> the assert in place. Perhaps tweak both like this:
>>>>>>
>>>>>>      assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>>             (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>             "unexpected frame info: has_last_frame=%d, java_call_counter=%d",
>>>>>>             has_last_Java_frame(), java_call_counter());
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>>       No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>       L503:           MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>>>           Please add something like this above L503:
>>>>>>>>
>>>>>>>>                       // Have to unlock VMOperationQueue_lock just in case
>>>>>>>>                       // no_op_safepoint() has to do a handshake.
>>>>>>>>
>>>>>>>>       old L619:     // We want to make sure that we get to a safepoint
>>>>>>>> regularly.
>>>>>>>>       old L620:     //
>>>>>>>>       old L621:     if ((_cur_vm_operation =
>>>>>>>> VMThread::no_op_safepoint(false)) != NULL) {
>>>>>>>>           This call to no_op_safepoint() is at the bottom of the 
>>>>>>>> "while(true)"
>>>>>>>>           loop in VMThread::loop(). Before the fix for 8219436, this line
>>>>>>>> used to be:
>>>>>>>>
>>>>>>>>               if (VMThread::no_op_safepoint_needed(true)) {
>>>>>>>>
>>>>>>>>           and it was the only call to no_op_safepoint_needed() that passed
>>>>>>>>           'true'. It's the 'true' parameter that made the comment on L619
>>>>>>>>           correct. Why do I say that? Well, the only way we get to the
>>>>>>>>           bottom of the "while(true)" loop is if we had a vm_operation to
>>>>>>>>           perform. Let's say we had a bunch of no-safepoint vm_operations
>>>>>>>>           to perform and we just kept executing them, one after another.
>>>>>>>>           While doing this work, if our time between safepoints exceeds
>>>>>>>>           GuaranteedSafepointInterval, then this 
>>>>>>>> no_op_safepoint_needed(true)
>>>>>>>>           call is what would detect that we've gone too long between
>>>>>>>> safepoints
>>>>>>>>           and would force one.
>>>>>>>>
>>>>>>>>           With the fix for 8219436, that call was changed to
>>>>>>>> no_op_safepoint(false)
>>>>>>>>           and with the v2 version of this fix that call is now gone. So 
>>>>>>>> we no
>>>>>>>>           longer have the ability to have GuaranteedSafepointInterval work
>>>>>>>>           right when we are doing lots of non-safepoint VM operations.
>>>>>>>>
>>>>>>>>           This no_op_safepoint_needed() probe point, either with 'true' or
>>>>>>>>           'false', also gave us the opportunity to force a safepoint when
>>>>>>>>           SafepointALot is true or when
>>>>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>>>>           returns true. In the v2 version of this fix, we lose the ability
>>>>>>>>           to SafepointALot after each VM operation. We also lose the 
>>>>>>>> ability
>>>>>>>>           to do cleanup() after each VM operation.
>>>>>>>
>>>>>>> And I don't think that is an issue. ICache and Monitor, which is the
>>>>>>> cleanup we do, just have basic heuristics. Whether we clean up every
>>>>>>> second or every other second makes little difference.
>>>>>>
>>>>>> My point is that you are changing the semantics of SafepointALot without
>>>>>> stating clearly that is what you are doing. The point of SafepointALot is
>>>>>> to inject a safepoint after every VMop even those VMops that have to be
>>>>>> executed at a safepoint. SafepointALot is considered a stress option for
>>>>>> a very good reason.
>>>>>>
>>>>>>
>>>>>>> SafepointALot is just a diagnostic option, which causes a safepoint
>>>>>>> directly after a safepoint. I'd rather have it do one extra safepoint
>>>>>>> per GuaranteedSafepointInterval; that way you know how many safepoints
>>>>>>> you have per second, which is really useful for benchmarking. In its
>>>>>>> current form it can't be used for benchmarks.
>>>>>>
>>>>>> You would never use SafepointALot in a benchmark run. It is a
>>>>>> stress option.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>           I think you need to restore the "bool check_time" parameter to
>>>>>>>>           no_op_safepoint() and you need to restore this call to as:
>>>>>>>>           VMThread::no_op_safepoint_needed(true) along with the rest
>>>>>>>>           of old L618-L626.
>>>>>>>
>>>>>>> I don't need the parameter; we can always check the time.
>>>>>>> When we time out on the queue wait we will always pass the time check.
>>>>>>
>>>>>> As I pointed out, if you are doing non-safepoint VM ops, you can miss
>>>>>> a time check.
>>>>>>
>>>>>>
>>>>>>> So the difference is that I think the previous behavior was bad/buggy and
>>>>>>> you are trying to restore it :)
>>>>>>>
>>>>>>> Check-out v3, just full:
>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>
>>>>>> I'll post a second message with my re-review for this webrev.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> - wait on queue GuaranteedSafepointInterval
>>>>>>> - if timeout
>>>>>>>      - do handshake a lot
>>>>>>>      - we are guaranteed to pass the time check, so cleanup if needed
>>>>>>>      - else safepoint alot
>>>>>>> - if op
>>>>>>>      - after op
>>>>>>>          - do handshake a lot
>>>>>>>          - check time, if expired, cleanup if needed
>>>>>>>          - else safepoint alot
>>>>>>>
>>>>>>> I think that is what you want :)
>>>>>>> And that's what in v3.
>>>>>>>
>>>>>>> I would rather have:
>>>>>>> - wait-time = GuaranteedSafepointInterval -
>>>>>>> SafepointTracing::time_since_last_safepoint_ms()
>>>>>>> - if wait-time > 0
>>>>>>>      - wait on queue
>>>>>>> - else or timeout
>>>>>>>      - do handshake a lot
>>>>>>>      - cleanup if needed
>>>>>>>      - else safepoint alot
>>>>>>>
>>>>>>> That way we do check for cleanup in time after non-safepoint ops, but
>>>>>>> we do not cause extra safepoints after safepoints with SafepointALot.
>>>>>>> (I know you think of this as a feature :) )
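[The wait-time computation in the scheme above, as a toy Java sketch; the constant and helper mirror the C++ identifiers (`GuaranteedSafepointInterval`, `SafepointTracing::time_since_last_safepoint_ms()`) and are hypothetical here:]

```java
public class GuaranteedIntervalSketch {
    // Hypothetical mirror of the GuaranteedSafepointInterval flag, in ms.
    static final long GUARANTEED_SAFEPOINT_INTERVAL_MS = 1000;

    // How long to wait on the queue before forcing cleanup/handshake:
    // the interval minus the time already elapsed since the last safepoint.
    static long waitTimeMs(long sinceLastSafepointMs) {
        return GUARANTEED_SAFEPOINT_INTERVAL_MS - sinceLastSafepointMs;
    }

    public static void main(String[] args) {
        System.out.println(waitTimeMs(300));   // 700: keep waiting
        System.out.println(waitTimeMs(1500));  // -500: overdue, act now
    }
}
```

A non-positive wait time means a safepoint just happened would not be followed immediately by a forced one, which is the behavior difference being argued over.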
>>>>>>
>>>>>> Not so much a feature as a stress option. I love my stress options.
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>>
>>>>>>> Thanks, Robbin
>>>>>>
>>>>>

