RFR: JDK-8032050: TEST_BUG: java/rmi/activation/Activatable/shutdownGracefully/ShutdownGracefully.java fails intermittently

Thu Feb 13 08:25:28 UTC 2014

Thank you Stuart
I have fixed comment in JavaVM.java. Dealing with different cases in 
ShutdownGracefully.java, two variables were added. One is a flag 
indicate test passed or not. Other variable keeps the error message when 
test failed. I put TestLibrary.bomb in the bottom of the main method 
which only shows test fail message.
Could you review it again
http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.04/
Tristan

On 02/13/2014 05:29 AM, Stuart Marks wrote:
> Hi Tristan,
>
> JavaVM.waitFor looks mostly fine. The indentation of the start of the 
> waitFor(timeout) javadoc comment is a bit off, though; please fix.
>
> There are still some adjustments that need to be made in 
> ShutdownGracefully.java. Both have to do with the case where the last 
> registration succeeds unexpectedly -- this is the one that we expect 
> to fail.
>
> First, if this registration has succeeded unexpectedly, that means 
> rmid is still running. If that occurs, the call to 
> rmid.waitFor(timeout) will inevitably time out. It may be worth 
> calling rmid.destroy() directly at this point.
>
> Second, still in the succeeded-unexpectedly case, at line 154 
> TestLibrary.bomb() is called. This throws an exception, but it's 
> caught by the catch-block at lines 157-158, which calls 
> TestLibrary.bomb() again, saying "unexpected exception". Except that 
> this is kind of expected, since it was thrown from an earlier call to 
> TestLibrary.bomb(). This is quite confusing.
>
> There are several cases that need to be handled.
>
> 1. Normal case. Registration fails as expected, rmid has terminated 
> gracefully. Test passes.
>
> 2. Rmid is still running and has processed the registration request 
> successfully. Need to kill rmid and fail the test.
>
> 3. Rmid is still running but is in some bad state where the 
> registration request failed. Need to kill rmid and fail the test.
>
> 4. Some other unexpected failure. This is what the catch and finally 
> blocks at lines 157-161 are for.
>
> These four cases need to be handled independently. Ideally they should 
> be separated from the cleanup code. As noted above, you don't want to 
> throw an exception from the try-block, because it will get caught by 
> your own catch block. Similarly, it's tempting to return from the 
> midst of the try-block in the success case, but this still runs the 
> finally-block. This can be quite confusing.
>
> A typical technique for dealing with this kind of issue is to record 
> results of operations from within the try block, and then analyze the 
> results outside, throwing a test failure (TestLibrary.bomb) at that 
> point, where it won't be caught by the test's own catch block.
>
> Editoral:
>  - line 153, there should be a space between 'if' and the opening paren
>  - line 156, typo, "gracefuuly"
>
> Finally, it would be helpful if you could get webrev to generate the 
> actual changeset instead of the plain patch, per my other review email.
>
> Thanks,
>
> s'marks
>
>
> On 2/11/14 9:39 PM, Tristan Yan wrote:
>> Thank you for your thorough mail. This is very educational. I took 
>> you advice
>> and generated a new webrev for this.
>>
>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.03/
>> I appreciate you can review this again.
>> Regards
>> Tristan
>>
>>
>> On Feb 11, 2014, at 8:32 AM, Stuart Marks <stuart.marks at oracle.com
>> <mailto:stuart.marks at oracle.com>> wrote:
>>
>>> Hi Tristan,
>>>
>>> Sorry about my recurring delays.
>>>
>>> Several comments on these changes.
>>>
>>> JavaVM.java --
>>>
>>> The waitFor(timeout) method is mostly ok. The new thread started at 
>>> line 208
>>> and following seems unnecessary, though. This code is reached when a 
>>> timeout
>>> occurs, so throwing TimeoutException is the only thing necessary in 
>>> this case.
>>> Should the code to start the new thread be removed?
>>>
>>> There should be a similar check for vm == null as in the waitFor() 
>>> [no args]
>>> method.
>>>
>>> ShutdownGracefully.java --
>>>
>>> The condition that's checked after calling 
>>> rmid.waitFor(SHUTDOWN_TIMEOUT) is
>>> incorrect. It's testing the exit status against zero. Offhand, when 
>>> and if
>>> rmid exits, it might exit with a nonzero exit status. If rmid has 
>>> exited at
>>> this point, then the test should succeed.
>>>
>>> Instead of testing against zero, the code should catch 
>>> TimeoutException, which
>>> means that rmid is still running. It's probably reasonable to catch
>>> TimeoutException, print a message, and then let the finally-block 
>>> destroy the
>>> rmid. Calling TestLibrary.bomb() from within the try-block is 
>>> confusing, since
>>> that method throws an exception, which is then caught by the 
>>> catch-block, when
>>> then calls TestLibrary.bomb() again.
>>>
>>> We should also make sure to test the success case properly. If 
>>> rmid.waitFor()
>>> returns in a timely fashion without throwing TimeoutException, it 
>>> doesn't
>>> matter what the exit status is. (It might be worth printing it out.) 
>>> At that
>>> point we know that rmid *has* exited gracefully, so we need to set 
>>> rmid to
>>> null so that the finally-block doesn't attempt to destroy rmid 
>>> redundantly.
>>> Some additional messages about rmid having exited and the test 
>>> passing are
>>> also warranted for this case.
>>>
>>> Some additional cleanup can be done here as well, over and above the 
>>> changes
>>> you've proposed. (This stuff is left over from earlier RMI test 
>>> messes.) In
>>> order to shut down an active object, the code here spawns a new 
>>> thread that
>>> sleeps for a while and then deactivates this object. This isn't 
>>> necessary. (It
>>> might have been necessary in the past.) It's sufficient simply to 
>>> unexport
>>> this object and then deactivate it, directly within the shutdown() 
>>> method. See
>>>
>>> test/java/rmi/activation/ActivationSystem/unregisterGroup/UnregisterGroup.java 
>>>
>>>
>>> for an example of this. In addition, the run() method can be 
>>> removed, and the
>>> "implements Runnable" declaration can also be removed from the
>>> ShutdownGracefully test class.
>>>
>>> Finally, revisiting some code farther up in the test, the try-block 
>>> at lines
>>> 135-140 issues a registration request that the test expects to fail. 
>>> If it
>>> succeeds, the message at line 139 isn't very clear; it should say 
>>> that the
>>> registration request succeeded unexpectedly. This should cause the 
>>> test to
>>> fail. We still probably want to go through the waitFor(timeout) path 
>>> and
>>> eventual rmid cleanup, but a flag should be set here to ensure that 
>>> the test
>>> indeed fails if the registration succeeds unexpectedly, and the 
>>> messages
>>> should clearly indicate that this is going on.
>>>
>>> A good way to test this last case is to change rmid's security 
>>> manager to the
>>> normal security manager java.lang.SecurityManager instead of 
>>> TestSecurityManager.
>>>
>>> Thanks,
>>>
>>> s'marks
>>>
>>>
>>>
>>>
>>> On 2/10/14 2:59 AM, Tristan Yan wrote:
>>>> Hi Stuart
>>>> Could you help to review this.
>>>> Thank you
>>>> Tristan
>>>>
>>>> On Jan 31, 2014, at 4:36 PM, Tristan Yan <tristan.yan at oracle.com
>>>> <mailto:tristan.yan at oracle.com>
>>>> <mailto:tristan.yan at oracle.com>> wrote:
>>>>
>>>>> Thank you for fixing JDK-8023541. Then I leave 
>>>>> ActivationLibrary.java for now.
>>>>> I still did some clean up following your suggestion.
>>>>> 1. I changed waitFor(long timeout) method, this method is going to 
>>>>> use code
>>>>> like Process.waitFor(timeout, unit). This can be backported to 
>>>>> JDK7. Also
>>>>> exitValue is kept as a return value. For making sure there is no 
>>>>> Pipe leak, a
>>>>> cleanup thread will start when timeout happens.
>>>>> 2. Change in ShutdownGracefully is a little tricky. I think we 
>>>>> should just
>>>>> destroy JVM once exception is thrown. So I move the wait logic 
>>>>> into try block
>>>>> instead keep them in finally block.
>>>>> Can you receive it again.
>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.02/
>>>>> Thank you
>>>>> Tristan
>>>>>
>>>>> On 01/29/2014 03:16 PM, Stuart Marks wrote:
>>>>>> Hi Tristan,
>>>>>>
>>>>>> I don't want to put the workaround into 
>>>>>> ActivationLibrary.rmidRunning() for a
>>>>>> null return from the lookup, because this is only a workaround 
>>>>>> for an actual
>>>>>> bug in rmid initialization. See the review I just posted for 
>>>>>> JDK-8023541.
>>>>>>
>>>>>> Adding JavaVM.waitFor(timeout) is something that would be useful 
>>>>>> in general,
>>>>>> but it needs to be handled carefully. It uses the new
>>>>>> Process.waitFor(timeout, unit) which is new in Java SE 8; this makes
>>>>>> backporting to 7 more complicated. Not clear whether we'll do so, 
>>>>>> but I don't
>>>>>> want to forclose the opportunity without discussion. It's also 
>>>>>> not clear how
>>>>>> one can get the vm's exit status after JavaVM.waitFor() has 
>>>>>> returned true.
>>>>>> With the Process API it's possible simply to call waitFor() or 
>>>>>> exitValue().
>>>>>> With JavaVM, a new API needs to be created, or the rule has to be 
>>>>>> established
>>>>>> that one must call JavaVM.waitFor() to collect the exit status as 
>>>>>> well as to
>>>>>> close the pipes from the subprocess. If JavaVM.waitFor(timeout, 
>>>>>> unit) is
>>>>>> called without subsequently calling JavaVM.waitFor(), the pipes 
>>>>>> are leaked.
>>>>>>
>>>>>> In ShutdownGracefully.java, the finally-block needs to check to 
>>>>>> see if rmid
>>>>>> is still running, and if it is, to shut it down. Simply calling
>>>>>> waitFor(timeout, unit) isn't sufficient, because if the rmid 
>>>>>> process is still
>>>>>> running, it will be left running.
>>>>>>
>>>>>> The straightforward approach would be to call 
>>>>>> ActivationLibrary.rmidRunning()
>>>>>> to test if it's still running. Unfortunately this isn't quite 
>>>>>> right, because
>>>>>> rmidRunning() has a timeout loop in it -- which should probably 
>>>>>> be removed.
>>>>>> (I think there's a bug for this.) Another approach would be 
>>>>>> simply to call
>>>>>> rmid.destroy(). This calls rmid's shutdown() method first, which is
>>>>>> reasonable, but I'm not sure it kills the process if that fails. 
>>>>>> In any case,
>>>>>> this already has a timeout loop waiting for the process to die, so
>>>>>> ShutdownGracefully.java needn't use a new waitFor(timeout, unit) 
>>>>>> call.
>>>>>>
>>>>>> Removing the commented-out code that starts with "no longer 
>>>>>> needed" is good,
>>>>>> and removing the ShutdownDetectThread is also good, since that's 
>>>>>> unnecessary.
>>>>>>
>>>>>> There are some more cleanups I have in mind here but I'd like to 
>>>>>> see a
>>>>>> revised webrev before proceeding.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> s'marks
>>>>>>
>>>>>> On 1/25/14 8:57 PM, Tristan Yan wrote:
>>>>>>> Hi Stuart
>>>>>>> Thank you for your review and suggestion.
>>>>>>> Yes, since this failure mode is very hard to be reproduced. I 
>>>>>>> guess it's
>>>>>>> make sense  to do some hack. And I also noticed in
>>>>>>> ActivationLibrary.rmidRunning. It does try to look up 
>>>>>>> ActivationSystem but
>>>>>>> doesn't check if it's null. So I add the logic to make sure we 
>>>>>>> will look up
>>>>>>> the non-null ActivationSystem. Also I did some cleanup if you 
>>>>>>> don't mind.
>>>>>>> Add a waitFor(long timeout, TimeUnit unit) for JavaVM. Which we 
>>>>>>> can have a
>>>>>>> better waitFor control.
>>>>>>> I appreciate you can review the code again.
>>>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.01/
>>>>>>> Thank you
>>>>>>> Tristan
>>>>>>>
>>>>>>>
>>>>>>> On 01/25/2014 10:20 AM, Stuart Marks wrote:
>>>>>>>> On 1/23/14 10:34 PM, Tristan Yan wrote:
>>>>>>>>> Hi All
>>>>>>>>> Could you review the bug fix for JDK-8032050.
>>>>>>>>>
>>>>>>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.00/
>>>>>>>>>
>>>>>>>>> Description:
>>>>>>>>> This rare happened failure caused because when RMID starts. It 
>>>>>>>>> don't
>>>>>>>>> guarantee
>>>>>>>>> sun.rmi.server.Activation.startActivation finishes.
>>>>>>>>> Fix by adding a iterative getSystem with a 5 seconds timeout.
>>>>>>>>
>>>>>>>> Hi Tristan,
>>>>>>>>
>>>>>>>> Adding a timing/retry loop into this test isn't the correct 
>>>>>>>> approach for
>>>>>>>> fixing this test.
>>>>>>>>
>>>>>>>> The timing loop isn't necessary because there is already a 
>>>>>>>> timing loop in
>>>>>>>> RMID.start() in the RMI test library. (There's another timing 
>>>>>>>> loop in
>>>>>>>> ActivationLibrary.rmidRunning() which should probably be 
>>>>>>>> removed.) So the
>>>>>>>> intent of this library call is that it start rmid and wait for 
>>>>>>>> it to become
>>>>>>>> ready. That logic doesn't need to be added to the test.
>>>>>>>>
>>>>>>>> In the bug report JDK-8032050 you had mentioned that the
>>>>>>>> NullPointerException was suspicious. You're right! I took a 
>>>>>>>> look and it
>>>>>>>> seemed like it was related to JDK-8023541, and I added a note 
>>>>>>>> to this
>>>>>>>> effect to the bug report. The problem here is that rmid can 
>>>>>>>> come up and
>>>>>>>> transiently return null instead of the stub of the activation 
>>>>>>>> system.
>>>>>>>> That's what JDK-8023541 covers. I think that rmid itself needs 
>>>>>>>> to be fixed,
>>>>>>>> though modifying the timing loop in the RMI test library to 
>>>>>>>> wait for rmid
>>>>>>>> to come up *and* for the lookup to return non-null is an easy 
>>>>>>>> way to fix
>>>>>>>> the problem. (Or at least cover it up.)
>>>>>>>>
>>>>>>>> The next step in the analysis is to determine, given that
>>>>>>>> ActivationLibrary.getSystem can sometimes return null, whether 
>>>>>>>> this has
>>>>>>>> actually caused this test failure. This is pretty easy to 
>>>>>>>> determine; just
>>>>>>>> hack in a line "system = null" in the right place and run the 
>>>>>>>> test. I've
>>>>>>>> done this, and the test times out and the output log is pretty 
>>>>>>>> much
>>>>>>>> identical to the one in the bug report. (I recommend you try this
>>>>>>>> yourself.) So I think it's fairly safe to say that the problem in
>>>>>>>> JDK-8023541 has caused the failure listed in JDK-8032050.
>>>>>>>>
>>>>>>>> I can see a couple ways to proceed here. One way is just to 
>>>>>>>> close this out
>>>>>>>> as a duplicate of JDK-8023541 since that bug caused this failure.
>>>>>>>>
>>>>>>>> Another is that this test could use some cleaning up. This bug 
>>>>>>>> certainly
>>>>>>>> covers a failure, but the messages emitted are confusing and in 
>>>>>>>> some cases
>>>>>>>> completely wrong. For example, the "rmid has shutdown" message 
>>>>>>>> at line 180
>>>>>>>> is incorrect, because in this case rmid is still running and 
>>>>>>>> the wait()
>>>>>>>> call has timed out. Most of the code here can be replaced with 
>>>>>>>> calls to
>>>>>>>> various bits of the RMI test library. There are a bunch of 
>>>>>>>> other things in
>>>>>>>> this test that could be cleaned up as well.
>>>>>>>>
>>>>>>>> It's up to you how you'd like to proceed.
>>>>>>>>
>>>>>>>> s'marks
>>>>>>>
>>>>>>
>>>>>
>>>>
>>