RFR: JDK-8032050: TEST_BUG: java/rmi/activation/Activatable/shutdownGracefully/ShutdownGracefully.java fails intermittently
Tristan Yan
tristan.yan at oracle.com
Thu Feb 13 08:25:28 UTC 2014
Thank you Stuart
I have fixed comment in JavaVM.java. Dealing with different cases in
ShutdownGracefully.java, two variables were added. One is a flag
indicate test passed or not. Other variable keeps the error message when
test failed. I put TestLibrary.bomb in the bottom of the main method
which only shows test fail message.
Could you review it again
http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.04/
Tristan
On 02/13/2014 05:29 AM, Stuart Marks wrote:
> Hi Tristan,
>
> JavaVM.waitFor looks mostly fine. The indentation of the start of the
> waitFor(timeout) javadoc comment is a bit off, though; please fix.
>
> There are still some adjustments that need to be made in
> ShutdownGracefully.java. Both have to do with the case where the last
> registration succeeds unexpectedly -- this is the one that we expect
> to fail.
>
> First, if this registration has succeeded unexpectedly, that means
> rmid is still running. If that occurs, the call to
> rmid.waitFor(timeout) will inevitably time out. It may be worth
> calling rmid.destroy() directly at this point.
>
> Second, still in the succeeded-unexpectedly case, at line 154
> TestLibrary.bomb() is called. This throws an exception, but it's
> caught by the catch-block at lines 157-158, which calls
> TestLibrary.bomb() again, saying "unexpected exception". Except that
> this is kind of expected, since it was thrown from an earlier call to
> TestLibrary.bomb(). This is quite confusing.
>
> There are several cases that need to be handled.
>
> 1. Normal case. Registration fails as expected, rmid has terminated
> gracefully. Test passes.
>
> 2. Rmid is still running and has processed the registration request
> successfully. Need to kill rmid and fail the test.
>
> 3. Rmid is still running but is in some bad state where the
> registration request failed. Need to kill rmid and fail the test.
>
> 4. Some other unexpected failure. This is what the catch and finally
> blocks at lines 157-161 are for.
>
> These four cases need to be handled independently. Ideally they should
> be separated from the cleanup code. As noted above, you don't want to
> throw an exception from the try-block, because it will get caught by
> your own catch block. Similarly, it's tempting to return from the
> midst of the try-block in the success case, but this still runs the
> finally-block. This can be quite confusing.
>
> A typical technique for dealing with this kind of issue is to record
> results of operations from within the try block, and then analyze the
> results outside, throwing a test failure (TestLibrary.bomb) at that
> point, where it won't be caught by the test's own catch block.
>
> Editoral:
> - line 153, there should be a space between 'if' and the opening paren
> - line 156, typo, "gracefuuly"
>
> Finally, it would be helpful if you could get webrev to generate the
> actual changeset instead of the plain patch, per my other review email.
>
> Thanks,
>
> s'marks
>
>
> On 2/11/14 9:39 PM, Tristan Yan wrote:
>> Thank you for your thorough mail. This is very educational. I took
>> you advice
>> and generated a new webrev for this.
>>
>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.03/
>> I appreciate you can review this again.
>> Regards
>> Tristan
>>
>>
>> On Feb 11, 2014, at 8:32 AM, Stuart Marks <stuart.marks at oracle.com
>> <mailto:stuart.marks at oracle.com>> wrote:
>>
>>> Hi Tristan,
>>>
>>> Sorry about my recurring delays.
>>>
>>> Several comments on these changes.
>>>
>>> JavaVM.java --
>>>
>>> The waitFor(timeout) method is mostly ok. The new thread started at
>>> line 208
>>> and following seems unnecessary, though. This code is reached when a
>>> timeout
>>> occurs, so throwing TimeoutException is the only thing necessary in
>>> this case.
>>> Should the code to start the new thread be removed?
>>>
>>> There should be a similar check for vm == null as in the waitFor()
>>> [no args]
>>> method.
>>>
>>> ShutdownGracefully.java --
>>>
>>> The condition that's checked after calling
>>> rmid.waitFor(SHUTDOWN_TIMEOUT) is
>>> incorrect. It's testing the exit status against zero. Offhand, when
>>> and if
>>> rmid exits, it might exit with a nonzero exit status. If rmid has
>>> exited at
>>> this point, then the test should succeed.
>>>
>>> Instead of testing against zero, the code should catch
>>> TimeoutException, which
>>> means that rmid is still running. It's probably reasonable to catch
>>> TimeoutException, print a message, and then let the finally-block
>>> destroy the
>>> rmid. Calling TestLibrary.bomb() from within the try-block is
>>> confusing, since
>>> that method throws an exception, which is then caught by the
>>> catch-block, when
>>> then calls TestLibrary.bomb() again.
>>>
>>> We should also make sure to test the success case properly. If
>>> rmid.waitFor()
>>> returns in a timely fashion without throwing TimeoutException, it
>>> doesn't
>>> matter what the exit status is. (It might be worth printing it out.)
>>> At that
>>> point we know that rmid *has* exited gracefully, so we need to set
>>> rmid to
>>> null so that the finally-block doesn't attempt to destroy rmid
>>> redundantly.
>>> Some additional messages about rmid having exited and the test
>>> passing are
>>> also warranted for this case.
>>>
>>> Some additional cleanup can be done here as well, over and above the
>>> changes
>>> you've proposed. (This stuff is left over from earlier RMI test
>>> messes.) In
>>> order to shut down an active object, the code here spawns a new
>>> thread that
>>> sleeps for a while and then deactivates this object. This isn't
>>> necessary. (It
>>> might have been necessary in the past.) It's sufficient simply to
>>> unexport
>>> this object and then deactivate it, directly within the shutdown()
>>> method. See
>>>
>>> test/java/rmi/activation/ActivationSystem/unregisterGroup/UnregisterGroup.java
>>>
>>>
>>> for an example of this. In addition, the run() method can be
>>> removed, and the
>>> "implements Runnable" declaration can also be removed from the
>>> ShutdownGracefully test class.
>>>
>>> Finally, revisiting some code farther up in the test, the try-block
>>> at lines
>>> 135-140 issues a registration request that the test expects to fail.
>>> If it
>>> succeeds, the message at line 139 isn't very clear; it should say
>>> that the
>>> registration request succeeded unexpectedly. This should cause the
>>> test to
>>> fail. We still probably want to go through the waitFor(timeout) path
>>> and
>>> eventual rmid cleanup, but a flag should be set here to ensure that
>>> the test
>>> indeed fails if the registration succeeds unexpectedly, and the
>>> messages
>>> should clearly indicate that this is going on.
>>>
>>> A good way to test this last case is to change rmid's security
>>> manager to the
>>> normal security manager java.lang.SecurityManager instead of
>>> TestSecurityManager.
>>>
>>> Thanks,
>>>
>>> s'marks
>>>
>>>
>>>
>>>
>>> On 2/10/14 2:59 AM, Tristan Yan wrote:
>>>> Hi Stuart
>>>> Could you help to review this.
>>>> Thank you
>>>> Tristan
>>>>
>>>> On Jan 31, 2014, at 4:36 PM, Tristan Yan <tristan.yan at oracle.com
>>>> <mailto:tristan.yan at oracle.com>
>>>> <mailto:tristan.yan at oracle.com>> wrote:
>>>>
>>>>> Thank you for fixing JDK-8023541. Then I leave
>>>>> ActivationLibrary.java for now.
>>>>> I still did some clean up following your suggestion.
>>>>> 1. I changed waitFor(long timeout) method, this method is going to
>>>>> use code
>>>>> like Process.waitFor(timeout, unit). This can be backported to
>>>>> JDK7. Also
>>>>> exitValue is kept as a return value. For making sure there is no
>>>>> Pipe leak, a
>>>>> cleanup thread will start when timeout happens.
>>>>> 2. Change in ShutdownGracefully is a little tricky. I think we
>>>>> should just
>>>>> destroy JVM once exception is thrown. So I move the wait logic
>>>>> into try block
>>>>> instead keep them in finally block.
>>>>> Can you receive it again.
>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.02/
>>>>> Thank you
>>>>> Tristan
>>>>>
>>>>> On 01/29/2014 03:16 PM, Stuart Marks wrote:
>>>>>> Hi Tristan,
>>>>>>
>>>>>> I don't want to put the workaround into
>>>>>> ActivationLibrary.rmidRunning() for a
>>>>>> null return from the lookup, because this is only a workaround
>>>>>> for an actual
>>>>>> bug in rmid initialization. See the review I just posted for
>>>>>> JDK-8023541.
>>>>>>
>>>>>> Adding JavaVM.waitFor(timeout) is something that would be useful
>>>>>> in general,
>>>>>> but it needs to be handled carefully. It uses the new
>>>>>> Process.waitFor(timeout, unit) which is new in Java SE 8; this makes
>>>>>> backporting to 7 more complicated. Not clear whether we'll do so,
>>>>>> but I don't
>>>>>> want to forclose the opportunity without discussion. It's also
>>>>>> not clear how
>>>>>> one can get the vm's exit status after JavaVM.waitFor() has
>>>>>> returned true.
>>>>>> With the Process API it's possible simply to call waitFor() or
>>>>>> exitValue().
>>>>>> With JavaVM, a new API needs to be created, or the rule has to be
>>>>>> established
>>>>>> that one must call JavaVM.waitFor() to collect the exit status as
>>>>>> well as to
>>>>>> close the pipes from the subprocess. If JavaVM.waitFor(timeout,
>>>>>> unit) is
>>>>>> called without subsequently calling JavaVM.waitFor(), the pipes
>>>>>> are leaked.
>>>>>>
>>>>>> In ShutdownGracefully.java, the finally-block needs to check to
>>>>>> see if rmid
>>>>>> is still running, and if it is, to shut it down. Simply calling
>>>>>> waitFor(timeout, unit) isn't sufficient, because if the rmid
>>>>>> process is still
>>>>>> running, it will be left running.
>>>>>>
>>>>>> The straightforward approach would be to call
>>>>>> ActivationLibrary.rmidRunning()
>>>>>> to test if it's still running. Unfortunately this isn't quite
>>>>>> right, because
>>>>>> rmidRunning() has a timeout loop in it -- which should probably
>>>>>> be removed.
>>>>>> (I think there's a bug for this.) Another approach would be
>>>>>> simply to call
>>>>>> rmid.destroy(). This calls rmid's shutdown() method first, which is
>>>>>> reasonable, but I'm not sure it kills the process if that fails.
>>>>>> In any case,
>>>>>> this already has a timeout loop waiting for the process to die, so
>>>>>> ShutdownGracefully.java needn't use a new waitFor(timeout, unit)
>>>>>> call.
>>>>>>
>>>>>> Removing the commented-out code that starts with "no longer
>>>>>> needed" is good,
>>>>>> and removing the ShutdownDetectThread is also good, since that's
>>>>>> unnecessary.
>>>>>>
>>>>>> There are some more cleanups I have in mind here but I'd like to
>>>>>> see a
>>>>>> revised webrev before proceeding.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> s'marks
>>>>>>
>>>>>> On 1/25/14 8:57 PM, Tristan Yan wrote:
>>>>>>> Hi Stuart
>>>>>>> Thank you for your review and suggestion.
>>>>>>> Yes, since this failure mode is very hard to be reproduced. I
>>>>>>> guess it's
>>>>>>> make sense to do some hack. And I also noticed in
>>>>>>> ActivationLibrary.rmidRunning. It does try to look up
>>>>>>> ActivationSystem but
>>>>>>> doesn't check if it's null. So I add the logic to make sure we
>>>>>>> will look up
>>>>>>> the non-null ActivationSystem. Also I did some cleanup if you
>>>>>>> don't mind.
>>>>>>> Add a waitFor(long timeout, TimeUnit unit) for JavaVM. Which we
>>>>>>> can have a
>>>>>>> better waitFor control.
>>>>>>> I appreciate you can review the code again.
>>>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.01/
>>>>>>> Thank you
>>>>>>> Tristan
>>>>>>>
>>>>>>>
>>>>>>> On 01/25/2014 10:20 AM, Stuart Marks wrote:
>>>>>>>> On 1/23/14 10:34 PM, Tristan Yan wrote:
>>>>>>>>> Hi All
>>>>>>>>> Could you review the bug fix for JDK-8032050.
>>>>>>>>>
>>>>>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.00/
>>>>>>>>>
>>>>>>>>> Description:
>>>>>>>>> This rare happened failure caused because when RMID starts. It
>>>>>>>>> don't
>>>>>>>>> guarantee
>>>>>>>>> sun.rmi.server.Activation.startActivation finishes.
>>>>>>>>> Fix by adding a iterative getSystem with a 5 seconds timeout.
>>>>>>>>
>>>>>>>> Hi Tristan,
>>>>>>>>
>>>>>>>> Adding a timing/retry loop into this test isn't the correct
>>>>>>>> approach for
>>>>>>>> fixing this test.
>>>>>>>>
>>>>>>>> The timing loop isn't necessary because there is already a
>>>>>>>> timing loop in
>>>>>>>> RMID.start() in the RMI test library. (There's another timing
>>>>>>>> loop in
>>>>>>>> ActivationLibrary.rmidRunning() which should probably be
>>>>>>>> removed.) So the
>>>>>>>> intent of this library call is that it start rmid and wait for
>>>>>>>> it to become
>>>>>>>> ready. That logic doesn't need to be added to the test.
>>>>>>>>
>>>>>>>> In the bug report JDK-8032050 you had mentioned that the
>>>>>>>> NullPointerException was suspicious. You're right! I took a
>>>>>>>> look and it
>>>>>>>> seemed like it was related to JDK-8023541, and I added a note
>>>>>>>> to this
>>>>>>>> effect to the bug report. The problem here is that rmid can
>>>>>>>> come up and
>>>>>>>> transiently return null instead of the stub of the activation
>>>>>>>> system.
>>>>>>>> That's what JDK-8023541 covers. I think that rmid itself needs
>>>>>>>> to be fixed,
>>>>>>>> though modifying the timing loop in the RMI test library to
>>>>>>>> wait for rmid
>>>>>>>> to come up *and* for the lookup to return non-null is an easy
>>>>>>>> way to fix
>>>>>>>> the problem. (Or at least cover it up.)
>>>>>>>>
>>>>>>>> The next step in the analysis is to determine, given that
>>>>>>>> ActivationLibrary.getSystem can sometimes return null, whether
>>>>>>>> this has
>>>>>>>> actually caused this test failure. This is pretty easy to
>>>>>>>> determine; just
>>>>>>>> hack in a line "system = null" in the right place and run the
>>>>>>>> test. I've
>>>>>>>> done this, and the test times out and the output log is pretty
>>>>>>>> much
>>>>>>>> identical to the one in the bug report. (I recommend you try this
>>>>>>>> yourself.) So I think it's fairly safe to say that the problem in
>>>>>>>> JDK-8023541 has caused the failure listed in JDK-8032050.
>>>>>>>>
>>>>>>>> I can see a couple ways to proceed here. One way is just to
>>>>>>>> close this out
>>>>>>>> as a duplicate of JDK-8023541 since that bug caused this failure.
>>>>>>>>
>>>>>>>> Another is that this test could use some cleaning up. This bug
>>>>>>>> certainly
>>>>>>>> covers a failure, but the messages emitted are confusing and in
>>>>>>>> some cases
>>>>>>>> completely wrong. For example, the "rmid has shutdown" message
>>>>>>>> at line 180
>>>>>>>> is incorrect, because in this case rmid is still running and
>>>>>>>> the wait()
>>>>>>>> call has timed out. Most of the code here can be replaced with
>>>>>>>> calls to
>>>>>>>> various bits of the RMI test library. There are a bunch of
>>>>>>>> other things in
>>>>>>>> this test that could be cleaned up as well.
>>>>>>>>
>>>>>>>> It's up to you how you'd like to proceed.
>>>>>>>>
>>>>>>>> s'marks
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
More information about the core-libs-dev
mailing list