RFR(s): 8220774: Add HandshakeALot diag option
David Holmes
david.holmes at oracle.com
Mon Mar 25 22:28:28 UTC 2019
On 26/03/2019 5:55 am, Daniel D. Daugherty wrote:
>> Please see v6:
>> Inc: http://cr.openjdk.java.net/~rehn/8220774/v6/inc/
>
> test/hotspot/jtreg/runtime/Thread/SuspendAtExit.java
> No comments.
>
> test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
> No comments (on the current changes).
>
> Regarding this earlier reply:
>
>> L91: for (Thread thr : _suspend_threads) {
>> L92: thr.resume();
>> L93: }
>> So why is there an extra resume per _suspend_threads element?
>> You have matched suspend() and resume() calls so what am I
>> missing?
>>
>> If the two threads never suspended exit first, the other threads can
>> cause a
>> circular suspend and they hanging forever. One of the issues in 8214174.
>
> You might want to add a comment above:
>
> L87: for (Thread thr : _suspend_threads) {
>
> // Resume any worker threads that might have suspended
> // each other at exactly the same time so they can see
> // _exit_now and check in via the semaphore.
I was going to say that is impossible but realized it isn't. :( Mutual
suspends could all get processed at the same safepoint.
This really makes me wonder whether the test truly stresses
suspend/resume or whether it quickly grinds to a halt till the main
thread tells everyone to exit?
Cheers,
David
> Thumbs up!
>
> Dan
>
>
>
> On 3/25/19 5:09 AM, Robbin Ehn wrote:
>> Hi David,
>>
>> On 3/25/19 2:36 AM, David Holmes wrote:
>>> Hi Robbin,
>>>
>>> On 23/03/2019 1:29 am, Robbin Ehn wrote:
>>>> Hi David,
>>>>
>>>> On 3/22/19 8:05 AM, David Holmes wrote:
>>>>> Hi Robbin,
>>>>>
>>>>> This was a little more complex than I had imagined. :) A couple of
>>>>> comments:
>>>>>
>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>
>>>>> 508 // Have to unlock VMOperationQueue_lock just in case
>>>>> no_op_safepoint()
>>>>> 509 // has to do a handshake.
>>>>> 510 MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>> Mutex::_no_safepoint_check_flag);
>>>>> 511 if (timedout && (_cur_vm_operation =
>>>>> VMThread::no_op_safepoint()) != NULL) {
>>>>>
>>>>> wouldn't it be better to check timedout first and only then use the
>>>>> unlocker
>>>>> then check _cur_vm_operation?
>>>>
>>>> Fixed.
>>>>
>>>>>
>>>>> ---
>>>>>
>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>
>>>>> Can has_last_Java_frame() and java_call_counter() change values
>>>>> between their
>>>>> use in the assert and their use in the assert message?
>>>>
>>>> No, the JavaThread is safepoint_safe here, thus must have a stable
>>>> java stack.
>>>
>>> Ok.
>>>
>>>>>
>>>>> ---
>>>>>
>>>>> Don't you want to add some tests that exercise this? Or update
>>>>> existing tests
>>>>> that use SafepointALot to also use HandshakeALot?
>>>>
>>>> Fixed!
>>>
>>> I'm unclear how this fix and the renamed test relate to the existing
>>> bug JDK-8214174 that was referenced in the ProblemList entry ??
>>
>> Since we are not using withebox to walk thread stack, instead using
>> HandshakeALot I removed Walk from name. The original could cause a
>> circular
>> suspend issue.
>>
>>>
>>> test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
>>>
>>> If you need the loop index use the full form of the for-loop. This:
>>>
>>> 43 int i = 0;
>>> 44 for (Thread thr : _suspend_threads) {
>>> 45 if (i++ > _suspend_threads.length -2) {
>>> 46 // Leave last 2 threads running.
>>> 47 break;
>>> 48 }
>>> 49 if (Thread.currentThread() != thr) {
>>>
>>> can reduce to simply:
>>>
>>> // Leave last 2 threads running.
>>> for (int i = 0; i < _suspend_threads.length - 2; i++) {
>>> if (Thread.currentThread() != _suspend_threads[i]) {
>>>
>>
>> Fixed.
>>
>>>
>>> 67 // Wait for all suspend thread starting to loop.
>>>
>>> -> // Wait for all suspend-threads to start looping.
>>>
>>> 75 exit_threads[i] = new Thread(new Runnable() { public
>>> void run() {} });
>>>
>>> The "new Runnable()..." is unnecessary - "new Thread();" suffices.
>>>
>>>
>>> 79 // Try to suspend them.
>>> 80 for (Thread thr : exit_threads) {
>>> 81 thr.suspend();
>>> 82 }
>>> 83 for (Thread thr : exit_threads) {
>>> 84 thr.resume();
>>> 85 }
>>>
>>> there's really no guarantee of getting the suspend during "exit". The
>>> SuspendAtExit test seems to use logging to make it more likely to
>>> encounter the suspend when desired. Is there a reason not to add a
>>> second @run to that test to use HandShakes?
>>
>> The original issue I had was so unlikely that I needed tenths of
>> thousands of
>> thread and looping that for several hours. (that issue is fixed)
>> So the test just try to quickly with a small chance hit interesting
>> paths.
>>
>> We can do that also.
>>
>>>
>>> 90 do {
>>> 91 for (Thread thr : _suspend_threads) {
>>> 92 thr.resume();
>>> 93 }
>>> 94 while (_sem.tryAcquire()) {
>>> 95 --waiting;
>>> 96 }
>>> 97 } while (waiting > 0);
>>>
>>> why do a busy-wait and resume threads you never suspended? Why not
>>> just do a blocking acquire() on the semaphore? The main
>>> suspend/resume logic in the _suspend_threads must leave the target
>>> thread resumed.
>>
>> The _suspend_threads may create a circular suspend, so we resume them
>> just in case.
>>
>>>
>>>> All handshakes test passed 100 iteration on each platform.
>>>> But the "8221207: Redo JDK-8218446 - SuspendAtExit hangs" should go
>>>> in first.
>>>
>>> Okay. Just waiting for Dan's stress test results.
>>
>> Please see v6:
>> Inc: http://cr.openjdk.java.net/~rehn/8220774/v6/inc/
>> Full: http://cr.openjdk.java.net/~rehn/8220774/v6/full/
>>
>> Thanks, Robbin
>>
>>
>>>
>>> Thanks,
>>> David
>>> -----
>>>
>>>
>>>> v5 full:
>>>> http://cr.openjdk.java.net/~rehn/8220774/v5/full/webrev/
>>>> v5 inc:
>>>> http://cr.openjdk.java.net/~rehn/8220774/v5/inc/webrev/
>>>>
>>>> Thanks, Robbin
>>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On 22/03/2019 2:04 am, Robbin Ehn wrote:
>>>>>> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>>>>>>> Check-out v3, just full:
>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>
>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>> No comments.
>>>>>>>
>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>> No comments.
>>>>>>>
>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>> L2954: assert((!has_last_Java_frame() &&
>>>>>>> java_call_counter() == 0) ||
>>>>>>> L2955: (has_last_Java_frame() &&
>>>>>>> java_call_counter() > 0),
>>>>>>> "wrong java_sp info!");
>>>>>>> Perhaps change to this:
>>>>>>>
>>>>>>> assert((!has_last_Java_frame() &&
>>>>>>> java_call_counter() == 0) ||
>>>>>>> (has_last_Java_frame() &&
>>>>>>> java_call_counter() > 0),
>>>>>>> "unexpected frame info: has_last_frame=%d,
>>>>>>> java_call_counter=%d",
>>>>>>> has_last_Java_frame(), java_call_counter());
>>>>>>>
>>>>>>> and consider making the same change to the nmethod
>>>>>>> version.
>>>>>>>
>>>>>>
>>>>>> Fixed both!
>>>>>>
>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>> No comments.
>>>>>>>
>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>> L449: // Must check for handshakes first, since ops returns.
>>>>>>> Perhaps:
>>>>>>> // Check for handshakes first since we may need to
>>>>>>> return a VMop.
>>>>>>>
>>>>>>> L454: // Check for a true cleanup first, trying to keep
>>>>>>> stats correct.
>>>>>>> Perhaps:
>>>>>>> // Check for a cleanup before SafepointALot to keep
>>>>>>> stats correct.
>>>>>>>
>>>>>>> L635: // We want to make sure that we get to a
>>>>>>> safepoint regularly.
>>>>>>> Perhaps an addition:
>>>>>>> // We want to make sure that we get to a
>>>>>>> safepoint regularly
>>>>>>> // even when executing VMops that don't require
>>>>>>> safepoints.
>>>>>>>
>>>>>>
>>>>>> Fixed above!
>>>>>>
>>>>>> Publishing v4, since I need a second review:
>>>>>> Full:
>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
>>>>>> Inc:
>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
>>>>>>
>>>>>> Thanks, Robbin
>>>>>>
>>>>>>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>>>>>>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>>>>>>> call at the bottom of the "while(true)" in VMThread::loop().
>>>>>>>
>>>>>>> If you decide to tweak the semantics for SafepointALot down the
>>>>>>> road,
>>>>>>> it would be best to do that in its own bug rather than as part of
>>>>>>> another bug.
>>>>>>>
>>>>>>> Thumbs up! Your call on whether to tweak the assert or change the
>>>>>>> comments.
>>>>>>>
>>>>>>> Dan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>>>>>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>>>>>>> Hi Dan,
>>>>>>>>>
>>>>>>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>>>>>>> And here is v2 for you to consider:
>>>>>>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>>>>>>>>
>>>>>>>>>> It's too difficult to craft my comments relative to the
>>>>>>>>>> incremental
>>>>>>>>>> webrev so I'm working with the full webrev still.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>>>> No comments.
>>>>>>>>>>
>>>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>>>> No comments.
>>>>>>>>>>
>>>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>>> L2953: // This checks that the thread have a correct
>>>>>>>>>> frame state
>>>>>>>>>> during a handshake
>>>>>>>>>> typo: s/have/has/
>>>>>>>>>> Also please add a '.' to end of the sentence.
>>>>>>>>>
>>>>>>>>> Fixed.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> L2954: assert((!has_last_Java_frame() &&
>>>>>>>>>> java_call_counter() == 0) ||
>>>>>>>>>> L2955: (has_last_Java_frame() &&
>>>>>>>>>> java_call_counter() > 0),
>>>>>>>>>> "wrong java_sp info!");
>>>>>>>>>> Trying to make sure I understand what you are
>>>>>>>>>> asserting:
>>>>>>>>>>
>>>>>>>>>> 1) if the thread does not have a last Java frame
>>>>>>>>>> then it must have
>>>>>>>>>> never called Java
>>>>>>>>>> 2) if the thread has a last Java frame, then it must
>>>>>>>>>> have called
>>>>>>>>>> Java at least once
>>>>>>>>>>
>>>>>>>>>> I'm good with the second expression. I have some
>>>>>>>>>> vague doubts about
>>>>>>>>>> first expression. When a JavaThread is done
>>>>>>>>>> executing Java code and
>>>>>>>>>> it is on its way toward a thread exit, is there ever
>>>>>>>>>> a time when it
>>>>>>>>>> no longer has a last Java frame? I'm thinking a
>>>>>>>>>> handshake late in
>>>>>>>>>> the JavaThread's life...
>>>>>>>>>
>>>>>>>>> The assert is a copy paste from
>>>>>>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>>>>>>
>>>>>>>>> I have seen this assert trigger twice on windows, either it's
>>>>>>>>> wrong and should
>>>>>>>>> be removed from nmethods_do, or it is correct and this will
>>>>>>>>> help us find when
>>>>>>>>> this happens.
>>>>>>>>
>>>>>>>> Discussed this with Robbin via chat. I'm good with leaving
>>>>>>>> the assert in place. Perhaps tweak both like this:
>>>>>>>>
>>>>>>>> assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>>>> (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>> "unexpected frame info: has_last_frame=%d,
>>>>>>>> java_call_counter=%d",
>>>>>>>> has_last_Java_frame(), java_call_counter());
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>>>> No comments.
>>>>>>>>>>
>>>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>>> L503: MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>>>>> Please add something like this above L503:
>>>>>>>>>>
>>>>>>>>>> // Have to unlock VMOperationQueue_lock
>>>>>>>>>> just in case
>>>>>>>>>> // no_op_safepoint() has to do a handshake.
>>>>>>>>>>
>>>>>>>>>> old L619: // We want to make sure that we get to a
>>>>>>>>>> safepoint
>>>>>>>>>> regularly.
>>>>>>>>>> old L620: //
>>>>>>>>>> old L621: if ((_cur_vm_operation =
>>>>>>>>>> VMThread::no_op_safepoint(false)) != NULL) {
>>>>>>>>>> This call to no_op_safepoint() is at the bottom of
>>>>>>>>>> the "while(true)"
>>>>>>>>>> loop in VMThread::loop(). Before the fix for
>>>>>>>>>> 8219436, this line
>>>>>>>>>> used to be:
>>>>>>>>>>
>>>>>>>>>> if (VMThread::no_op_safepoint_needed(true)) {
>>>>>>>>>>
>>>>>>>>>> and it was the only call to no_op_safepoint_needed()
>>>>>>>>>> that passed
>>>>>>>>>> 'true'. It's the 'true' parameter that made the
>>>>>>>>>> comment on L619
>>>>>>>>>> correct. Why do I say that? Well, the only way we
>>>>>>>>>> get to the
>>>>>>>>>> bottom of the "while(true)" loop is if we had a
>>>>>>>>>> vm_operation to
>>>>>>>>>> perform. Let's say we had a bunch of no-safepoint
>>>>>>>>>> vm_operations
>>>>>>>>>> to perform and we just kept executing them, one
>>>>>>>>>> after another.
>>>>>>>>>> While doing this work, if our time between
>>>>>>>>>> safepoints exceeds
>>>>>>>>>> GuaranteedSafepointInterval, then this
>>>>>>>>>> no_op_safepoint_needed(true)
>>>>>>>>>> call is what would detect that we've gone too long
>>>>>>>>>> between
>>>>>>>>>> safepoints
>>>>>>>>>> and would force one.
>>>>>>>>>>
>>>>>>>>>> With the fix for 8219436, that call was changed to
>>>>>>>>>> no_op_safepoint(false)
>>>>>>>>>> and with the v2 version of this fix that call is now
>>>>>>>>>> gone. So we no
>>>>>>>>>> longer have the ability to have
>>>>>>>>>> GuaranteedSafepointInterval work
>>>>>>>>>> right when we are doing lots of non-safepoint VM
>>>>>>>>>> operations.
>>>>>>>>>>
>>>>>>>>>> This no_op_safepoint_needed() probe point, either
>>>>>>>>>> with 'true' or
>>>>>>>>>> 'false', also gave us the opportunity to force a
>>>>>>>>>> safepoint when
>>>>>>>>>> SafepointALot is true or when
>>>>>>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>>>>>> returns true. In the v2 version of this fix, we lose
>>>>>>>>>> the ability
>>>>>>>>>> to SafepointALot after each VM operation. We also
>>>>>>>>>> lose the ability
>>>>>>>>>> to do cleanup() after each VM operation.
>>>>>>>>>
>>>>>>>>> And I don't think that is an issue, ICache and Monitor, which
>>>>>>>>> is the
>>>>>>>>> cleanup we
>>>>>>>>> do, just have just basic heuristic. If we cleanup every second
>>>>>>>>> or every other
>>>>>>>>> second makes little difference.
>>>>>>>>
>>>>>>>> My point is that you are changing the semantics of SafepointALot
>>>>>>>> without
>>>>>>>> stating clearly that is what you are doing. The point of
>>>>>>>> SafepointALot is
>>>>>>>> to inject a safepoint after every VMop even those VMops that
>>>>>>>> have to be
>>>>>>>> executed at a safepoint. SafepointALot is considered a stress
>>>>>>>> option for
>>>>>>>> a very good reason.
>>>>>>>>
>>>>>>>>
>>>>>>>>> SafepointALot is just diag, which causes safepoint directly
>>>>>>>>> after a safepoint.
>>>>>>>>> I'd rather have it doing one extra safepoint per
>>>>>>>>> GuaranteedSafepointInterval,
>>>>>>>>> that way you know how many safepoint you have per second, which
>>>>>>>>> is really
>>>>>>>>> useful
>>>>>>>>> for benchmarking. In current form it can't be used for benchmarks.
>>>>>>>>
>>>>>>>> You would never use SafepointALot in a benchmark run. It is is
>>>>>>>> stress option.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think you need to restore the "bool check_time"
>>>>>>>>>> parameter to
>>>>>>>>>> no_op_safepoint() and you need to restore this call
>>>>>>>>>> to as:
>>>>>>>>>> VMThread::no_op_safepoint_needed(true) along with
>>>>>>>>>> the rest
>>>>>>>>>> of old L618-L626.
>>>>>>>>>
>>>>>>>>> I don't need the parameter, we can always check the time.
>>>>>>>>> When we timeout on queue wait we will always pass the time check.
>>>>>>>>
>>>>>>>> As I pointed out, if you are doing non-safepoint VM ops, you can
>>>>>>>> miss
>>>>>>>> a time check.
>>>>>>>>
>>>>>>>>
>>>>>>>>> So the difference is that I think the previous behavior was
>>>>>>>>> bad/buggy and
>>>>>>>>> you are trying to restore it :)
>>>>>>>>>
>>>>>>>>> Check-out v3, just full:
>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>>
>>>>>>>> I'll post a second message with my re-review for this webrev.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> - wait on queue GuaranteedSafepointInterval
>>>>>>>>> - if timeout
>>>>>>>>> - do handshake a lot
>>>>>>>>> - we are guaranteed to pass the time check, so cleanup if
>>>>>>>>> needed
>>>>>>>>> - else safepoint alot
>>>>>>>>> - if op
>>>>>>>>> - after op
>>>>>>>>> - do handshake a lot
>>>>>>>>> - check time, if expired, cleanup if needed
>>>>>>>>> - else safepoint alot
>>>>>>>>>
>>>>>>>>> I think that is what you want :)
>>>>>>>>> And that's what in v3.
>>>>>>>>>
>>>>>>>>> I would rather have:
>>>>>>>>> - wait-time = GuaranteedSafepointInterval -
>>>>>>>>> SafepointTracing::time_since_last_safepoint_ms()
>>>>>>>>> - if wait-time > 0
>>>>>>>>> - wait on queue
>>>>>>>>> - else or timeout
>>>>>>>>> - do handshake a lot
>>>>>>>>> - cleanup if needed
>>>>>>>>> - else safepoint alot
>>>>>>>>>
>>>>>>>>> That way we do check for cleanup after non-safepoint in time
>>>>>>>>> but we do not
>>>>>>>>> causes extra safepoints after safepoints with SafepointALot. (I
>>>>>>>>> know you
>>>>>>>>> think of this as feature :) )
>>>>>>>>
>>>>>>>> Not so much a feature as a stress option. I love my stress options.
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks, Robbin
>>>>>>>>
>>>>>>>
>
More information about the hotspot-runtime-dev
mailing list