RFR: JDK-8032050: TEST_BUG: java/rmi/activation/Activatable/shutdownGracefully/ShutdownGracefully.java fails intermittently

Stuart Marks stuart.marks at oracle.com
Thu Feb 20 00:44:47 UTC 2014


Hi Tristan,

Thanks for the updates. Pushed:

http://hg.openjdk.java.net/jdk9/dev/jdk/rev/d24297425fd0

It's good to see all that unnecessary thread creation stuff removed.

s'marks


On 2/14/14 6:53 PM, Tristan Yan wrote:
> Thank you Stuart
> This is nice. I thought two variables were weird but I didn’t figure out the
> right solution.  Also I wasn’t so sure why we print out so many messages now
> it’s clear to me.
> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.05/
> I’m sorry I have to ask you review this again.
> regards
> Tristan
>
> On Feb 15, 2014, at 9:47 AM, Stuart Marks <stuart.marks at oracle.com
> <mailto:stuart.marks at oracle.com>> wrote:
>
>> Hi Tristan,
>>
>> OK, we're getting close. Just a couple things about ShutdownGracefully.java.
>>
>> It's a bit clumsy to have both a boolean and a message string to keep track of
>> the state of the test, but by itself this isn't such a big deal.
>>
>> A bigger deal is the lack of exception chaining. If we catch an unexpected
>> exception at line 162, its class and message are printed out, but its stack
>> trace isn't. If this were to occur it might be fiendishly difficult to debug.
>>
>> The TimeoutException isn't chained up either, but this isn't so bad, since
>> there's only one place the timeout can occur. Still, it's good to chain
>> exceptions where possible.
>>
>> There's another failure case that doesn't have an exception at all, which is
>> when the registration request we're expecting to fail actually succeeds. This
>> doesn't have an exception, but we should consider creating one.
>>
>> Here's an approach to getting rid of the boolean+string and also chaining up
>> exceptions. The key insight is that an exception can be created in a different
>> place from where it's thrown.
>>
>> Instead of the boolean+stream, have a variable of type Exception that's
>> initialized to null. Anyplace where we catch an exception that indicates
>> failure, fill in this variable. The idea is that if this exception variable is
>> non-null, a failure has occurred. The exception being non-null replaces the
>> boolean, and the exception message replaces the failure string. In addition,
>> the exception has a stack trace, which is essential for debugging.
>>
>> For the case where we don't get the expected exception (i.e., registration
>> succeeds unexpectedly), simply set the exception variable to a newly created
>> exception, but don't throw it yet. For example,
>>
>>    exception = new RuntimeException(
>>        "The registration request succeeded unexpectedly");
>>
>> At the place where we catch an unexpected exception, wrap the caught exception
>> in a new RuntimeException with a message like "caught unexpected exception".
>> The reason to do this is so we can add an additional message. The stack trace
>> will also be a bit easier to follow.
>>
>> For the timeout, just assign the TimeoutException to the exception variable.
>>
>> Also, at each location where the exception variable is assigned to, print out
>> a message at that point. It will at least show the state of the test to be a
>> failure. The reason is that, if rmid.destroy() were to throw an exception from
>> the finally-block, our carefully recorded exception state will be thrown away.
>> (An alternative would be to put this into its own try-block, and add any
>> exceptions caught from it to the suppressed exception list of the exception
>> variable, but it's not clear to me that this is worth it.)
>>
>> At the end of the test, if the exception variable is non-null, call
>> TestLibrary.bomb() with it to fail the test.
>>
>> Finally, one typo: "prcoess".
>>
>> Thanks for updating this webrev again.
>>
>> s'marks
>>
>> On 2/13/14 12:25 AM, Tristan Yan wrote:
>>> Thank you Stuart
>>> I have fixed comment in JavaVM.java. Dealing with different cases in
>>> ShutdownGracefully.java, two variables were added. One is a flag indicate test
>>> passed or not. Other variable keeps the error message when test failed. I put
>>> TestLibrary.bomb in the bottom of the main method which only shows test fail
>>> message.
>>> Could you review it again
>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.04/
>>> Tristan
>>>
>>> On 02/13/2014 05:29 AM, Stuart Marks wrote:
>>>> Hi Tristan,
>>>>
>>>> JavaVM.waitFor looks mostly fine. The indentation of the start of the
>>>> waitFor(timeout) javadoc comment is a bit off, though; please fix.
>>>>
>>>> There are still some adjustments that need to be made in
>>>> ShutdownGracefully.java. Both have to do with the case where the last
>>>> registration succeeds unexpectedly -- this is the one that we expect to fail.
>>>>
>>>> First, if this registration has succeeded unexpectedly, that means rmid is
>>>> still running. If that occurs, the call to rmid.waitFor(timeout) will
>>>> inevitably time out. It may be worth calling rmid.destroy() directly at this
>>>> point.
>>>>
>>>> Second, still in the succeeded-unexpectedly case, at line 154
>>>> TestLibrary.bomb() is called. This throws an exception, but it's caught by the
>>>> catch-block at lines 157-158, which calls TestLibrary.bomb() again, saying
>>>> "unexpected exception". Except that this is kind of expected, since it was
>>>> thrown from an earlier call to TestLibrary.bomb(). This is quite confusing.
>>>>
>>>> There are several cases that need to be handled.
>>>>
>>>> 1. Normal case. Registration fails as expected, rmid has terminated
>>>> gracefully. Test passes.
>>>>
>>>> 2. Rmid is still running and has processed the registration request
>>>> successfully. Need to kill rmid and fail the test.
>>>>
>>>> 3. Rmid is still running but is in some bad state where the registration
>>>> request failed. Need to kill rmid and fail the test.
>>>>
>>>> 4. Some other unexpected failure. This is what the catch and finally blocks at
>>>> lines 157-161 are for.
>>>>
>>>> These four cases need to be handled independently. Ideally they should be
>>>> separated from the cleanup code. As noted above, you don't want to throw an
>>>> exception from the try-block, because it will get caught by your own catch
>>>> block. Similarly, it's tempting to return from the midst of the try-block in
>>>> the success case, but this still runs the finally-block. This can be quite
>>>> confusing.
>>>>
>>>> A typical technique for dealing with this kind of issue is to record results
>>>> of operations from within the try block, and then analyze the results outside,
>>>> throwing a test failure (TestLibrary.bomb) at that point, where it won't be
>>>> caught by the test's own catch block.
>>>>
>>>> Editoral:
>>>> - line 153, there should be a space between 'if' and the opening paren
>>>> - line 156, typo, "gracefuuly"
>>>>
>>>> Finally, it would be helpful if you could get webrev to generate the actual
>>>> changeset instead of the plain patch, per my other review email.
>>>>
>>>> Thanks,
>>>>
>>>> s'marks
>>>>
>>>>
>>>> On 2/11/14 9:39 PM, Tristan Yan wrote:
>>>>> Thank you for your thorough mail. This is very educational. I took you advice
>>>>> and generated a new webrev for this.
>>>>>
>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.03/
>>>>> I appreciate you can review this again.
>>>>> Regards
>>>>> Tristan
>>>>>
>>>>>
>>>>> On Feb 11, 2014, at 8:32 AM, Stuart Marks <stuart.marks at oracle.com
>>>>> <mailto:stuart.marks at oracle.com>> wrote:
>>>>>
>>>>>> Hi Tristan,
>>>>>>
>>>>>> Sorry about my recurring delays.
>>>>>>
>>>>>> Several comments on these changes.
>>>>>>
>>>>>> JavaVM.java --
>>>>>>
>>>>>> The waitFor(timeout) method is mostly ok. The new thread started at line 208
>>>>>> and following seems unnecessary, though. This code is reached when a timeout
>>>>>> occurs, so throwing TimeoutException is the only thing necessary in this case.
>>>>>> Should the code to start the new thread be removed?
>>>>>>
>>>>>> There should be a similar check for vm == null as in the waitFor() [no args]
>>>>>> method.
>>>>>>
>>>>>> ShutdownGracefully.java --
>>>>>>
>>>>>> The condition that's checked after calling rmid.waitFor(SHUTDOWN_TIMEOUT) is
>>>>>> incorrect. It's testing the exit status against zero. Offhand, when and if
>>>>>> rmid exits, it might exit with a nonzero exit status. If rmid has exited at
>>>>>> this point, then the test should succeed.
>>>>>>
>>>>>> Instead of testing against zero, the code should catch TimeoutException, which
>>>>>> means that rmid is still running. It's probably reasonable to catch
>>>>>> TimeoutException, print a message, and then let the finally-block destroy the
>>>>>> rmid. Calling TestLibrary.bomb() from within the try-block is confusing, since
>>>>>> that method throws an exception, which is then caught by the catch-block, when
>>>>>> then calls TestLibrary.bomb() again.
>>>>>>
>>>>>> We should also make sure to test the success case properly. If rmid.waitFor()
>>>>>> returns in a timely fashion without throwing TimeoutException, it doesn't
>>>>>> matter what the exit status is. (It might be worth printing it out.) At that
>>>>>> point we know that rmid *has* exited gracefully, so we need to set rmid to
>>>>>> null so that the finally-block doesn't attempt to destroy rmid redundantly.
>>>>>> Some additional messages about rmid having exited and the test passing are
>>>>>> also warranted for this case.
>>>>>>
>>>>>> Some additional cleanup can be done here as well, over and above the changes
>>>>>> you've proposed. (This stuff is left over from earlier RMI test messes.) In
>>>>>> order to shut down an active object, the code here spawns a new thread that
>>>>>> sleeps for a while and then deactivates this object. This isn't necessary. (It
>>>>>> might have been necessary in the past.) It's sufficient simply to unexport
>>>>>> this object and then deactivate it, directly within the shutdown() method. See
>>>>>>
>>>>>> test/java/rmi/activation/ActivationSystem/unregisterGroup/UnregisterGroup.java
>>>>>>
>>>>>> for an example of this. In addition, the run() method can be removed, and the
>>>>>> "implements Runnable" declaration can also be removed from the
>>>>>> ShutdownGracefully test class.
>>>>>>
>>>>>> Finally, revisiting some code farther up in the test, the try-block at lines
>>>>>> 135-140 issues a registration request that the test expects to fail. If it
>>>>>> succeeds, the message at line 139 isn't very clear; it should say that the
>>>>>> registration request succeeded unexpectedly. This should cause the test to
>>>>>> fail. We still probably want to go through the waitFor(timeout) path and
>>>>>> eventual rmid cleanup, but a flag should be set here to ensure that the test
>>>>>> indeed fails if the registration succeeds unexpectedly, and the messages
>>>>>> should clearly indicate that this is going on.
>>>>>>
>>>>>> A good way to test this last case is to change rmid's security manager to the
>>>>>> normal security manager java.lang.SecurityManager instead of
>>>>>> TestSecurityManager.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> s'marks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2/10/14 2:59 AM, Tristan Yan wrote:
>>>>>>> Hi Stuart
>>>>>>> Could you help to review this.
>>>>>>> Thank you
>>>>>>> Tristan
>>>>>>>
>>>>>>> On Jan 31, 2014, at 4:36 PM, Tristan Yan <tristan.yan at oracle.com
>>>>>>> <mailto:tristan.yan at oracle.com>
>>>>>>> <mailto:tristan.yan at oracle.com>
>>>>>>> <mailto:tristan.yan at oracle.com>> wrote:
>>>>>>>
>>>>>>>> Thank you for fixing JDK-8023541. Then I leave ActivationLibrary.java for
>>>>>>>> now.
>>>>>>>> I still did some clean up following your suggestion.
>>>>>>>> 1. I changed waitFor(long timeout) method, this method is going to use code
>>>>>>>> like Process.waitFor(timeout, unit). This can be backported to JDK7. Also
>>>>>>>> exitValue is kept as a return value. For making sure there is no Pipe
>>>>>>>> leak, a
>>>>>>>> cleanup thread will start when timeout happens.
>>>>>>>> 2. Change in ShutdownGracefully is a little tricky. I think we should just
>>>>>>>> destroy JVM once exception is thrown. So I move the wait logic into try
>>>>>>>> block
>>>>>>>> instead keep them in finally block.
>>>>>>>> Can you receive it again.
>>>>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.02/
>>>>>>>> Thank you
>>>>>>>> Tristan
>>>>>>>>
>>>>>>>> On 01/29/2014 03:16 PM, Stuart Marks wrote:
>>>>>>>>> Hi Tristan,
>>>>>>>>>
>>>>>>>>> I don't want to put the workaround into ActivationLibrary.rmidRunning()
>>>>>>>>> for a
>>>>>>>>> null return from the lookup, because this is only a workaround for an
>>>>>>>>> actual
>>>>>>>>> bug in rmid initialization. See the review I just posted for JDK-8023541.
>>>>>>>>>
>>>>>>>>> Adding JavaVM.waitFor(timeout) is something that would be useful in
>>>>>>>>> general,
>>>>>>>>> but it needs to be handled carefully. It uses the new
>>>>>>>>> Process.waitFor(timeout, unit) which is new in Java SE 8; this makes
>>>>>>>>> backporting to 7 more complicated. Not clear whether we'll do so, but I
>>>>>>>>> don't
>>>>>>>>> want to forclose the opportunity without discussion. It's also not
>>>>>>>>> clear how
>>>>>>>>> one can get the vm's exit status after JavaVM.waitFor() has returned true.
>>>>>>>>> With the Process API it's possible simply to call waitFor() or exitValue().
>>>>>>>>> With JavaVM, a new API needs to be created, or the rule has to be
>>>>>>>>> established
>>>>>>>>> that one must call JavaVM.waitFor() to collect the exit status as well
>>>>>>>>> as to
>>>>>>>>> close the pipes from the subprocess. If JavaVM.waitFor(timeout, unit) is
>>>>>>>>> called without subsequently calling JavaVM.waitFor(), the pipes are leaked.
>>>>>>>>>
>>>>>>>>> In ShutdownGracefully.java, the finally-block needs to check to see if rmid
>>>>>>>>> is still running, and if it is, to shut it down. Simply calling
>>>>>>>>> waitFor(timeout, unit) isn't sufficient, because if the rmid process is
>>>>>>>>> still
>>>>>>>>> running, it will be left running.
>>>>>>>>>
>>>>>>>>> The straightforward approach would be to call
>>>>>>>>> ActivationLibrary.rmidRunning()
>>>>>>>>> to test if it's still running. Unfortunately this isn't quite right,
>>>>>>>>> because
>>>>>>>>> rmidRunning() has a timeout loop in it -- which should probably be removed.
>>>>>>>>> (I think there's a bug for this.) Another approach would be simply to call
>>>>>>>>> rmid.destroy(). This calls rmid's shutdown() method first, which is
>>>>>>>>> reasonable, but I'm not sure it kills the process if that fails. In any
>>>>>>>>> case,
>>>>>>>>> this already has a timeout loop waiting for the process to die, so
>>>>>>>>> ShutdownGracefully.java needn't use a new waitFor(timeout, unit) call.
>>>>>>>>>
>>>>>>>>> Removing the commented-out code that starts with "no longer needed" is
>>>>>>>>> good,
>>>>>>>>> and removing the ShutdownDetectThread is also good, since that's
>>>>>>>>> unnecessary.
>>>>>>>>>
>>>>>>>>> There are some more cleanups I have in mind here but I'd like to see a
>>>>>>>>> revised webrev before proceeding.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> s'marks
>>>>>>>>>
>>>>>>>>> On 1/25/14 8:57 PM, Tristan Yan wrote:
>>>>>>>>>> Hi Stuart
>>>>>>>>>> Thank you for your review and suggestion.
>>>>>>>>>> Yes, since this failure mode is very hard to be reproduced. I guess it's
>>>>>>>>>> make sense  to do some hack. And I also noticed in
>>>>>>>>>> ActivationLibrary.rmidRunning. It does try to look up ActivationSystem but
>>>>>>>>>> doesn't check if it's null. So I add the logic to make sure we will
>>>>>>>>>> look up
>>>>>>>>>> the non-null ActivationSystem. Also I did some cleanup if you don't mind.
>>>>>>>>>> Add a waitFor(long timeout, TimeUnit unit) for JavaVM. Which we can have a
>>>>>>>>>> better waitFor control.
>>>>>>>>>> I appreciate you can review the code again.
>>>>>>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.01/
>>>>>>>>>> Thank you
>>>>>>>>>> Tristan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 01/25/2014 10:20 AM, Stuart Marks wrote:
>>>>>>>>>>> On 1/23/14 10:34 PM, Tristan Yan wrote:
>>>>>>>>>>>> Hi All
>>>>>>>>>>>> Could you review the bug fix for JDK-8032050.
>>>>>>>>>>>>
>>>>>>>>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.00/
>>>>>>>>>>>>
>>>>>>>>>>>> Description:
>>>>>>>>>>>> This rare happened failure caused because when RMID starts. It don't
>>>>>>>>>>>> guarantee
>>>>>>>>>>>> sun.rmi.server.Activation.startActivation finishes.
>>>>>>>>>>>> Fix by adding a iterative getSystem with a 5 seconds timeout.
>>>>>>>>>>>
>>>>>>>>>>> Hi Tristan,
>>>>>>>>>>>
>>>>>>>>>>> Adding a timing/retry loop into this test isn't the correct approach for
>>>>>>>>>>> fixing this test.
>>>>>>>>>>>
>>>>>>>>>>> The timing loop isn't necessary because there is already a timing loop in
>>>>>>>>>>> RMID.start() in the RMI test library. (There's another timing loop in
>>>>>>>>>>> ActivationLibrary.rmidRunning() which should probably be removed.) So the
>>>>>>>>>>> intent of this library call is that it start rmid and wait for it to
>>>>>>>>>>> become
>>>>>>>>>>> ready. That logic doesn't need to be added to the test.
>>>>>>>>>>>
>>>>>>>>>>> In the bug report JDK-8032050 you had mentioned that the
>>>>>>>>>>> NullPointerException was suspicious. You're right! I took a look and it
>>>>>>>>>>> seemed like it was related to JDK-8023541, and I added a note to this
>>>>>>>>>>> effect to the bug report. The problem here is that rmid can come up and
>>>>>>>>>>> transiently return null instead of the stub of the activation system.
>>>>>>>>>>> That's what JDK-8023541 covers. I think that rmid itself needs to be
>>>>>>>>>>> fixed,
>>>>>>>>>>> though modifying the timing loop in the RMI test library to wait for rmid
>>>>>>>>>>> to come up *and* for the lookup to return non-null is an easy way to fix
>>>>>>>>>>> the problem. (Or at least cover it up.)
>>>>>>>>>>>
>>>>>>>>>>> The next step in the analysis is to determine, given that
>>>>>>>>>>> ActivationLibrary.getSystem can sometimes return null, whether this has
>>>>>>>>>>> actually caused this test failure. This is pretty easy to determine; just
>>>>>>>>>>> hack in a line "system = null" in the right place and run the test. I've
>>>>>>>>>>> done this, and the test times out and the output log is pretty much
>>>>>>>>>>> identical to the one in the bug report. (I recommend you try this
>>>>>>>>>>> yourself.) So I think it's fairly safe to say that the problem in
>>>>>>>>>>> JDK-8023541 has caused the failure listed in JDK-8032050.
>>>>>>>>>>>
>>>>>>>>>>> I can see a couple ways to proceed here. One way is just to close
>>>>>>>>>>> this out
>>>>>>>>>>> as a duplicate of JDK-8023541 since that bug caused this failure.
>>>>>>>>>>>
>>>>>>>>>>> Another is that this test could use some cleaning up. This bug certainly
>>>>>>>>>>> covers a failure, but the messages emitted are confusing and in some
>>>>>>>>>>> cases
>>>>>>>>>>> completely wrong. For example, the "rmid has shutdown" message at
>>>>>>>>>>> line 180
>>>>>>>>>>> is incorrect, because in this case rmid is still running and the wait()
>>>>>>>>>>> call has timed out. Most of the code here can be replaced with calls to
>>>>>>>>>>> various bits of the RMI test library. There are a bunch of other
>>>>>>>>>>> things in
>>>>>>>>>>> this test that could be cleaned up as well.
>>>>>>>>>>>
>>>>>>>>>>> It's up to you how you'd like to proceed.
>>>>>>>>>>>
>>>>>>>>>>> s'marks
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>



More information about the core-libs-dev mailing list