RFR(s): 8220774: Add HandshakeALot diag option
Robbin Ehn
robbin.ehn at oracle.com
Mon Mar 25 09:09:21 UTC 2019
Hi David,
On 3/25/19 2:36 AM, David Holmes wrote:
> Hi Robbin,
>
> On 23/03/2019 1:29 am, Robbin Ehn wrote:
>> Hi David,
>>
>> On 3/22/19 8:05 AM, David Holmes wrote:
>>> Hi Robbin,
>>>
>>> This was a little more complex than I had imagined. :) A couple of comments:
>>>
>>> src/hotspot/share/runtime/vmThread.cpp
>>>
>>> 508 // Have to unlock VMOperationQueue_lock just in case
>>> no_op_safepoint()
>>> 509 // has to do a handshake.
>>> 510 MutexUnlockerEx mul(VMOperationQueue_lock,
>>> Mutex::_no_safepoint_check_flag);
>>> 511 if (timedout && (_cur_vm_operation =
>>> VMThread::no_op_safepoint()) != NULL) {
>>>
>>> wouldn't it be better to check timedout first and only then use the unlocker
>>> then check _cur_vm_operation?
>>
>> Fixed.
>>
>>>
>>> ---
>>>
>>> src/hotspot/share/runtime/thread.cpp
>>>
>>> Can has_last_Java_frame() and java_call_counter() change values between their
>>> use in the assert and their use in the assert message?
>>
>> No, the JavaThread is safepoint_safe here, thus must have a stable java stack.
>
> Ok.
>
>>>
>>> ---
>>>
>>> Don't you want to add some tests that exercise this? Or update existing tests
>>> that use SafepointALot to also use HandshakeALot?
>>
>> Fixed!
>
> I'm unclear how this fix and the renamed test relate to the existing bug
> JDK-8214174 that was referenced in the ProblemList entry ??
Since we are not using withebox to walk thread stack, instead using
HandshakeALot I removed Walk from name. The original could cause a circular
suspend issue.
>
> test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
>
> If you need the loop index use the full form of the for-loop. This:
>
> 43 int i = 0;
> 44 for (Thread thr : _suspend_threads) {
> 45 if (i++ > _suspend_threads.length -2) {
> 46 // Leave last 2 threads running.
> 47 break;
> 48 }
> 49 if (Thread.currentThread() != thr) {
>
> can reduce to simply:
>
> // Leave last 2 threads running.
> for (int i = 0; i < _suspend_threads.length - 2; i++) {
> if (Thread.currentThread() != _suspend_threads[i]) {
>
Fixed.
>
> 67 // Wait for all suspend thread starting to loop.
>
> -> // Wait for all suspend-threads to start looping.
>
> 75 exit_threads[i] = new Thread(new Runnable() { public void run()
> {} });
>
> The "new Runnable()..." is unnecessary - "new Thread();" suffices.
>
>
> 79 // Try to suspend them.
> 80 for (Thread thr : exit_threads) {
> 81 thr.suspend();
> 82 }
> 83 for (Thread thr : exit_threads) {
> 84 thr.resume();
> 85 }
>
> there's really no guarantee of getting the suspend during "exit". The
> SuspendAtExit test seems to use logging to make it more likely to encounter the
> suspend when desired. Is there a reason not to add a second @run to that test to
> use HandShakes?
The original issue I had was so unlikely that I needed tenths of thousands of
thread and looping that for several hours. (that issue is fixed)
So the test just try to quickly with a small chance hit interesting paths.
We can do that also.
>
> 90 do {
> 91 for (Thread thr : _suspend_threads) {
> 92 thr.resume();
> 93 }
> 94 while (_sem.tryAcquire()) {
> 95 --waiting;
> 96 }
> 97 } while (waiting > 0);
>
> why do a busy-wait and resume threads you never suspended? Why not just do a
> blocking acquire() on the semaphore? The main suspend/resume logic in the
> _suspend_threads must leave the target thread resumed.
The _suspend_threads may create a circular suspend, so we resume them just in case.
>
>> All handshakes test passed 100 iteration on each platform.
>> But the "8221207: Redo JDK-8218446 - SuspendAtExit hangs" should go in first.
>
> Okay. Just waiting for Dan's stress test results.
Please see v6:
Inc: http://cr.openjdk.java.net/~rehn/8220774/v6/inc/
Full: http://cr.openjdk.java.net/~rehn/8220774/v6/full/
Thanks, Robbin
>
> Thanks,
> David
> -----
>
>
>> v5 full:
>> http://cr.openjdk.java.net/~rehn/8220774/v5/full/webrev/
>> v5 inc:
>> http://cr.openjdk.java.net/~rehn/8220774/v5/inc/webrev/
>>
>> Thanks, Robbin
>>
>>>
>>> Thanks,
>>> David
>>>
>>> On 22/03/2019 2:04 am, Robbin Ehn wrote:
>>>> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>>>>> Check-out v3, just full:
>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>
>>>>> src/hotspot/share/runtime/globals.hpp
>>>>> No comments.
>>>>>
>>>>> src/hotspot/share/runtime/thread.hpp
>>>>> No comments.
>>>>>
>>>>> src/hotspot/share/runtime/thread.cpp
>>>>> L2954: assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>> L2955: (has_last_Java_frame() && java_call_counter() > 0),
>>>>> "wrong java_sp info!");
>>>>> Perhaps change to this:
>>>>>
>>>>> assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>> (has_last_Java_frame() && java_call_counter() > 0),
>>>>> "unexpected frame info: has_last_frame=%d,
>>>>> java_call_counter=%d",
>>>>> has_last_Java_frame(), java_call_counter());
>>>>>
>>>>> and consider making the same change to the nmethod version.
>>>>>
>>>>
>>>> Fixed both!
>>>>
>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>> No comments.
>>>>>
>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>> L449: // Must check for handshakes first, since ops returns.
>>>>> Perhaps:
>>>>> // Check for handshakes first since we may need to return a
>>>>> VMop.
>>>>>
>>>>> L454: // Check for a true cleanup first, trying to keep stats correct.
>>>>> Perhaps:
>>>>> // Check for a cleanup before SafepointALot to keep stats
>>>>> correct.
>>>>>
>>>>> L635: // We want to make sure that we get to a safepoint regularly.
>>>>> Perhaps an addition:
>>>>> // We want to make sure that we get to a safepoint regularly
>>>>> // even when executing VMops that don't require safepoints.
>>>>>
>>>>
>>>> Fixed above!
>>>>
>>>> Publishing v4, since I need a second review:
>>>> Full:
>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
>>>> Inc:
>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
>>>>
>>>> Thanks, Robbin
>>>>
>>>>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>>>>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>>>>> call at the bottom of the "while(true)" in VMThread::loop().
>>>>>
>>>>> If you decide to tweak the semantics for SafepointALot down the road,
>>>>> it would be best to do that in its own bug rather than as part of
>>>>> another bug.
>>>>>
>>>>> Thumbs up! Your call on whether to tweak the assert or change the comments.
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>>
>>>>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>>>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>>>>> Hi Dan,
>>>>>>>
>>>>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>>>>> And here is v2 for you to consider:
>>>>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>>>>>>
>>>>>>>> It's too difficult to craft my comments relative to the incremental
>>>>>>>> webrev so I'm working with the full webrev still.
>>>>>>>>
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>> No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>> No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>> L2953: // This checks that the thread have a correct frame state
>>>>>>>> during a handshake
>>>>>>>> typo: s/have/has/
>>>>>>>> Also please add a '.' to end of the sentence.
>>>>>>>
>>>>>>> Fixed.
>>>>>>>
>>>>>>>>
>>>>>>>> L2954: assert((!has_last_Java_frame() && java_call_counter() ==
>>>>>>>> 0) ||
>>>>>>>> L2955: (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>> "wrong java_sp info!");
>>>>>>>> Trying to make sure I understand what you are asserting:
>>>>>>>>
>>>>>>>> 1) if the thread does not have a last Java frame then it must
>>>>>>>> have
>>>>>>>> never called Java
>>>>>>>> 2) if the thread has a last Java frame, then it must have called
>>>>>>>> Java at least once
>>>>>>>>
>>>>>>>> I'm good with the second expression. I have some vague doubts
>>>>>>>> about
>>>>>>>> first expression. When a JavaThread is done executing Java
>>>>>>>> code and
>>>>>>>> it is on its way toward a thread exit, is there ever a time
>>>>>>>> when it
>>>>>>>> no longer has a last Java frame? I'm thinking a handshake late in
>>>>>>>> the JavaThread's life...
>>>>>>>
>>>>>>> The assert is a copy paste from
>>>>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>>>>
>>>>>>> I have seen this assert trigger twice on windows, either it's wrong and
>>>>>>> should
>>>>>>> be removed from nmethods_do, or it is correct and this will help us find
>>>>>>> when
>>>>>>> this happens.
>>>>>>
>>>>>> Discussed this with Robbin via chat. I'm good with leaving
>>>>>> the assert in place. Perhaps tweak both like this:
>>>>>>
>>>>>> assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>> (has_last_Java_frame() && java_call_counter() > 0),
>>>>>> "unexpected frame info: has_last_frame=%d, java_call_counter=%d",
>>>>>> has_last_Java_frame(), java_call_counter());
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>> No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>> L503: MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>>> Please add something like this above L503:
>>>>>>>>
>>>>>>>> // Have to unlock VMOperationQueue_lock just in case
>>>>>>>> // no_op_safepoint() has to do a handshake.
>>>>>>>>
>>>>>>>> old L619: // We want to make sure that we get to a safepoint
>>>>>>>> regularly.
>>>>>>>> old L620: //
>>>>>>>> old L621: if ((_cur_vm_operation =
>>>>>>>> VMThread::no_op_safepoint(false)) != NULL) {
>>>>>>>> This call to no_op_safepoint() is at the bottom of the
>>>>>>>> "while(true)"
>>>>>>>> loop in VMThread::loop(). Before the fix for 8219436, this line
>>>>>>>> used to be:
>>>>>>>>
>>>>>>>> if (VMThread::no_op_safepoint_needed(true)) {
>>>>>>>>
>>>>>>>> and it was the only call to no_op_safepoint_needed() that passed
>>>>>>>> 'true'. It's the 'true' parameter that made the comment on L619
>>>>>>>> correct. Why do I say that? Well, the only way we get to the
>>>>>>>> bottom of the "while(true)" loop is if we had a vm_operation to
>>>>>>>> perform. Let's say we had a bunch of no-safepoint vm_operations
>>>>>>>> to perform and we just kept executing them, one after another.
>>>>>>>> While doing this work, if our time between safepoints exceeds
>>>>>>>> GuaranteedSafepointInterval, then this
>>>>>>>> no_op_safepoint_needed(true)
>>>>>>>> call is what would detect that we've gone too long between
>>>>>>>> safepoints
>>>>>>>> and would force one.
>>>>>>>>
>>>>>>>> With the fix for 8219436, that call was changed to
>>>>>>>> no_op_safepoint(false)
>>>>>>>> and with the v2 version of this fix that call is now gone. So
>>>>>>>> we no
>>>>>>>> longer have the ability to have GuaranteedSafepointInterval work
>>>>>>>> right when we are doing lots of non-safepoint VM operations.
>>>>>>>>
>>>>>>>> This no_op_safepoint_needed() probe point, either with 'true' or
>>>>>>>> 'false', also gave us the opportunity to force a safepoint when
>>>>>>>> SafepointALot is true or when
>>>>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>>>> returns true. In the v2 version of this fix, we lose the ability
>>>>>>>> to SafepointALot after each VM operation. We also lose the
>>>>>>>> ability
>>>>>>>> to do cleanup() after each VM operation.
>>>>>>>
>>>>>>> And I don't think that is an issue, ICache and Monitor, which is the
>>>>>>> cleanup we
>>>>>>> do, just have just basic heuristic. If we cleanup every second or every
>>>>>>> other
>>>>>>> second makes little difference.
>>>>>>
>>>>>> My point is that you are changing the semantics of SafepointALot without
>>>>>> stating clearly that is what you are doing. The point of SafepointALot is
>>>>>> to inject a safepoint after every VMop even those VMops that have to be
>>>>>> executed at a safepoint. SafepointALot is considered a stress option for
>>>>>> a very good reason.
>>>>>>
>>>>>>
>>>>>>> SafepointALot is just diag, which causes safepoint directly after a
>>>>>>> safepoint.
>>>>>>> I'd rather have it doing one extra safepoint per
>>>>>>> GuaranteedSafepointInterval,
>>>>>>> that way you know how many safepoint you have per second, which is really
>>>>>>> useful
>>>>>>> for benchmarking. In current form it can't be used for benchmarks.
>>>>>>
>>>>>> You would never use SafepointALot in a benchmark run. It is is
>>>>>> stress option.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I think you need to restore the "bool check_time" parameter to
>>>>>>>> no_op_safepoint() and you need to restore this call to as:
>>>>>>>> VMThread::no_op_safepoint_needed(true) along with the rest
>>>>>>>> of old L618-L626.
>>>>>>>
>>>>>>> I don't need the parameter, we can always check the time.
>>>>>>> When we timeout on queue wait we will always pass the time check.
>>>>>>
>>>>>> As I pointed out, if you are doing non-safepoint VM ops, you can miss
>>>>>> a time check.
>>>>>>
>>>>>>
>>>>>>> So the difference is that I think the previous behavior was bad/buggy and
>>>>>>> you are trying to restore it :)
>>>>>>>
>>>>>>> Check-out v3, just full:
>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>
>>>>>> I'll post a second message with my re-review for this webrev.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> - wait on queue GuaranteedSafepointInterval
>>>>>>> - if timeout
>>>>>>> - do handshake a lot
>>>>>>> - we are guaranteed to pass the time check, so cleanup if needed
>>>>>>> - else safepoint alot
>>>>>>> - if op
>>>>>>> - after op
>>>>>>> - do handshake a lot
>>>>>>> - check time, if expired, cleanup if needed
>>>>>>> - else safepoint alot
>>>>>>>
>>>>>>> I think that is what you want :)
>>>>>>> And that's what in v3.
>>>>>>>
>>>>>>> I would rather have:
>>>>>>> - wait-time = GuaranteedSafepointInterval -
>>>>>>> SafepointTracing::time_since_last_safepoint_ms()
>>>>>>> - if wait-time > 0
>>>>>>> - wait on queue
>>>>>>> - else or timeout
>>>>>>> - do handshake a lot
>>>>>>> - cleanup if needed
>>>>>>> - else safepoint alot
>>>>>>>
>>>>>>> That way we do check for cleanup after non-safepoint in time but we do not
>>>>>>> causes extra safepoints after safepoints with SafepointALot. (I know you
>>>>>>> think of this as feature :) )
>>>>>>
>>>>>> Not so much a feature as a stress option. I love my stress options.
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>>
>>>>>>> Thanks, Robbin
>>>>>>
>>>>>
More information about the hotspot-runtime-dev
mailing list