RFR(s): 8220774: Add HandshakeALot diag option

Mon Mar 25 08:22:11 UTC 2019

Hi Dan,

> test/hotspot/jtreg/ProblemList.txt
>      JDK-8214174 is still open. Are you going to list it in this changeset
>      or close it as a dup of this bug?

I was going to close it.

> 
> test/hotspot/jtreg/runtime/handshake/HandshakeTransitionTest.java
>      Please update the copyright year before you push.
> 
> test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
>      L45:                 if (i++ > _suspend_threads.length -2) {
>          s/-2/- 2/
> 
>      L60:         // Fire-up suspend thread.
>          s/thread/threads/
> 
>      L67:         // Wait for all suspend thread starting to loop.
>          s/thread/threads/

Fixed.

> 
>      L91:             for (Thread thr : _suspend_threads) {
>      L92:                 thr.resume();
>      L93:             }
>          So why is there an extra resume per _suspend_threads element?
>          You have matched suspend() and resume() calls so what am I
>          missing?

If the two threads never suspended exit first, the other threads can cause a
circular suspend and they hanging forever. One of the issues in 8214174.

> 
> test/hotspot/jtreg/runtime/handshake/HandshakeWalkSuspendExitTest.java
>      So you're replacing this test with HandshakeSuspendExitTest.java.

Since we are not using withebox to walk thread stack, instead using
HandshakeALot I removed Walk from name.

> 
> Thumbs up!

Thanks, Robbin

> 
> Dan
> 
>>
>> Thanks, Robbin
>>
>>> Thanks,
>>> David
>>>
>>> On 22/03/2019 2:04 am, Robbin Ehn wrote:
>>>> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>>>>> Check-out v3, just full:
>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>       No comments.
>>>>>
>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>       No comments.
>>>>>
>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>       L2954:   assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>       L2955:          (has_last_Java_frame() && java_call_counter() > 0),
>>>>> "wrong java_sp info!");
>>>>>          Perhaps change to this:
>>>>>
>>>>>                assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>                       (has_last_Java_frame() && java_call_counter() > 0),
>>>>>                       "unexpected frame info: has_last_frame=%d,
>>>>> java_call_counter=%d",
>>>>>                       has_last_Java_frame(), java_call_counter());
>>>>>
>>>>>           and consider making the same change to the nmethod version.
>>>>>
>>>> Fixed both!
>>>>
>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>       No comments.
>>>>>
>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>       L449:   // Must check for handshakes first, since ops returns.
>>>>>           Perhaps:
>>>>>               // Check for handshakes first since we may need to return a 
>>>>> VMop.
>>>>>
>>>>>       L454:   // Check for a true cleanup first, trying to keep stats correct.
>>>>>           Perhaps:
>>>>>               // Check for a cleanup before SafepointALot to keep stats 
>>>>> correct.
>>>>>
>>>>>       L635:     // We want to make sure that we get to a safepoint regularly.
>>>>>           Perhaps an addition:
>>>>>                 // We want to make sure that we get to a safepoint regularly
>>>>>                 // even when executing VMops that don't require safepoints.
>>>>>
>>>> Fixed above!
>>>>
>>>> Publishing v4, since I need a second review:
>>>> Full:
>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
>>>> Inc:
>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
>>>>
>>>> Thanks, Robbin
>>>>
>>>>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>>>>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>>>>> call at the bottom of the "while(true)" in VMThread::loop().
>>>>>
>>>>> If you decide to tweak the semantics for SafepointALot down the road,
>>>>> it would be best to do that in its own bug rather than as part of
>>>>> another bug.
>>>>>
>>>>> Thumbs up! Your call on whether to tweak the assert or change the comments.
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>>
>>>>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>>>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>>>>> Hi Dan,
>>>>>>>
>>>>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>>>>> And here is v2 for you to consider:
>>>>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>>>>>> It's too difficult to craft my comments relative to the incremental
>>>>>>>> webrev so I'm working with the full webrev still.
>>>>>>>>
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>>       No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>>       No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>       L2953:   // This checks that the thread have a correct frame state
>>>>>>>> during a handshake
>>>>>>>>           typo: s/have/has/
>>>>>>>>           Also please add a '.' to end of the sentence.
>>>>>>> Fixed.
>>>>>>>
>>>>>>>>       L2954:   assert((!has_last_Java_frame() && java_call_counter() == 
>>>>>>>> 0) ||
>>>>>>>>       L2955:          (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>> "wrong java_sp info!");
>>>>>>>>           Trying to make sure I understand what you are asserting:
>>>>>>>>
>>>>>>>>           1) if the thread does not have a last Java frame then it must 
>>>>>>>> have
>>>>>>>>              never called Java
>>>>>>>>           2) if the thread has a last Java frame, then it must have called
>>>>>>>>              Java at least once
>>>>>>>>
>>>>>>>>           I'm good with the second expression. I have some vague doubts 
>>>>>>>> about
>>>>>>>>           first expression. When a JavaThread is done executing Java 
>>>>>>>> code and
>>>>>>>>           it is on its way toward a thread exit, is there ever a time 
>>>>>>>> when it
>>>>>>>>           no longer has a last Java frame? I'm thinking a handshake late in
>>>>>>>>           the JavaThread's life...
>>>>>>> The assert is a copy paste from
>>>>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>>>>
>>>>>>> I have seen this assert trigger twice on windows, either it's wrong and 
>>>>>>> should
>>>>>>> be removed from nmethods_do, or it is correct and this will help us find 
>>>>>>> when
>>>>>>> this happens.
>>>>>> Discussed this with Robbin via chat. I'm good with leaving
>>>>>> the assert in place. Perhaps tweak both like this:
>>>>>>
>>>>>>      assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>>             (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>             "unexpected frame info: has_last_frame=%d, java_call_counter=%d",
>>>>>>             has_last_Java_frame(), java_call_counter());
>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>>       No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>       L503:           MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>>>           Please add something like this above L503:
>>>>>>>>
>>>>>>>>                       // Have to unlock VMOperationQueue_lock just in case
>>>>>>>>                       // no_op_safepoint() has to do a handshake.
>>>>>>>>
>>>>>>>>       old L619:     // We want to make sure that we get to a safepoint
>>>>>>>> regularly.
>>>>>>>>       old L620:     //
>>>>>>>>       old L621:     if ((_cur_vm_operation =
>>>>>>>> VMThread::no_op_safepoint(false)) != NULL) {
>>>>>>>>           This call to no_op_safepoint() is at the bottom of the 
>>>>>>>> "while(true)"
>>>>>>>>           loop in VMThread::loop(). Before the fix for 8219436, this line
>>>>>>>> used to be:
>>>>>>>>
>>>>>>>>               if (VMThread::no_op_safepoint_needed(true)) {
>>>>>>>>
>>>>>>>>           and it was the only call to no_op_safepoint_needed() that passed
>>>>>>>>           'true'. It's the 'true' parameter that made the comment on L619
>>>>>>>>           correct. Why do I say that? Well, the only way we get to the
>>>>>>>>           bottom of the "while(true)" loop is if we had a vm_operation to
>>>>>>>>           perform. Let's say we had a bunch of no-safepoint vm_operations
>>>>>>>>           to perform and we just kept executing them, one after another.
>>>>>>>>           While doing this work, if our time between safepoints exceeds
>>>>>>>>           GuaranteedSafepointInterval, then this 
>>>>>>>> no_op_safepoint_needed(true)
>>>>>>>>           call is what would detect that we've gone too long between
>>>>>>>> safepoints
>>>>>>>>           and would force one.
>>>>>>>>
>>>>>>>>           With the fix for 8219436, that call was changed to
>>>>>>>> no_op_safepoint(false)
>>>>>>>>           and with the v2 version of this fix that call is now gone. So 
>>>>>>>> we no
>>>>>>>>           longer have the ability to have GuaranteedSafepointInterval work
>>>>>>>>           right when we are doing lots of non-safepoint VM operations.
>>>>>>>>
>>>>>>>>           This no_op_safepoint_needed() probe point, either with 'true' or
>>>>>>>>           'false', also gave us the opportunity to force a safepoint when
>>>>>>>>           SafepointALot is true or when
>>>>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>>>>           returns true. In the v2 version of this fix, we lose the ability
>>>>>>>>           to SafepointALot after each VM operation. We also lose the 
>>>>>>>> ability
>>>>>>>>           to do cleanup() after each VM operation.
>>>>>>> And I don't think that is an issue, ICache and Monitor, which is the
>>>>>>> cleanup we
>>>>>>> do, just have just basic heuristic. If we cleanup every second or every 
>>>>>>> other
>>>>>>> second makes little difference.
>>>>>> My point is that you are changing the semantics of SafepointALot without
>>>>>> stating clearly that is what you are doing. The point of SafepointALot is
>>>>>> to inject a safepoint after every VMop even those VMops that have to be
>>>>>> executed at a safepoint. SafepointALot is considered a stress option for
>>>>>> a very good reason.
>>>>>>
>>>>>>
>>>>>>> SafepointALot is just diag, which causes safepoint directly after a 
>>>>>>> safepoint.
>>>>>>> I'd rather have it doing one extra safepoint per 
>>>>>>> GuaranteedSafepointInterval,
>>>>>>> that way you know how many safepoint you have per second, which is really
>>>>>>> useful
>>>>>>> for benchmarking. In current form it can't be used for benchmarks.
>>>>>> You would never use SafepointALot in a benchmark run. It is is
>>>>>> stress option.
>>>>>>
>>>>>>>>           I think you need to restore the "bool check_time" parameter to
>>>>>>>>           no_op_safepoint() and you need to restore this call to as:
>>>>>>>>           VMThread::no_op_safepoint_needed(true) along with the rest
>>>>>>>>           of old L618-L626.
>>>>>>> I don't need the parameter, we can always check the time.
>>>>>>> When we timeout on queue wait we will always pass the time check.
>>>>>> As I pointed out, if you are doing non-safepoint VM ops, you can miss
>>>>>> a time check.
>>>>>>
>>>>>>
>>>>>>> So the difference is that I think the previous behavior was bad/buggy and
>>>>>>> you are trying to restore it :)
>>>>>>>
>>>>>>> Check-out v3, just full:
>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>> I'll post a second message with my re-review for this webrev.
>>>>>>
>>>>>>
>>>>>>> - wait on queue GuaranteedSafepointInterval
>>>>>>> - if timeout
>>>>>>>      - do handshake a lot
>>>>>>>      - we are guaranteed to pass the time check, so cleanup if needed
>>>>>>>      - else safepoint alot
>>>>>>> - if op
>>>>>>>      - after op
>>>>>>>          - do handshake a lot
>>>>>>>          - check time, if expired, cleanup if needed
>>>>>>>          - else safepoint alot
>>>>>>>
>>>>>>> I think that is what you want :)
>>>>>>> And that's what in v3.
>>>>>>>
>>>>>>> I would rather have:
>>>>>>> - wait-time = GuaranteedSafepointInterval -
>>>>>>> SafepointTracing::time_since_last_safepoint_ms()
>>>>>>> - if wait-time > 0
>>>>>>>      - wait on queue
>>>>>>> - else or timeout
>>>>>>>      - do handshake a lot
>>>>>>>      - cleanup if needed
>>>>>>>      - else safepoint alot
>>>>>>>
>>>>>>> That way we do check for cleanup after non-safepoint in time but we do not
>>>>>>> causes extra safepoints after safepoints with SafepointALot. (I know you
>>>>>>> think of this as feature :) )
>>>>>> Not so much a feature as a stress option. I love my stress options.
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>> Thanks, Robbin
>