RFR: JDK-8032050: TEST_BUG: java/rmi/activation/Activatable/shutdownGracefully/ShutdownGracefully.java fails intermittently
Stuart Marks
stuart.marks at oracle.com
Wed Jan 29 07:16:46 UTC 2014
Hi Tristan,
I don't want to put the workaround into ActivationLibrary.rmidRunning() for a
null return from the lookup, because this is only a workaround for an actual bug
in rmid initialization. See the review I just posted for JDK-8023541.
Adding JavaVM.waitFor(timeout) is something that would be useful in general, but
it needs to be handled carefully. It uses the new Process.waitFor(timeout, unit)
which is new in Java SE 8; this makes backporting to 7 more complicated. Not
clear whether we'll do so, but I don't want to forclose the opportunity without
discussion. It's also not clear how one can get the vm's exit status after
JavaVM.waitFor() has returned true. With the Process API it's possible simply to
call waitFor() or exitValue(). With JavaVM, a new API needs to be created, or
the rule has to be established that one must call JavaVM.waitFor() to collect
the exit status as well as to close the pipes from the subprocess. If
JavaVM.waitFor(timeout, unit) is called without subsequently calling
JavaVM.waitFor(), the pipes are leaked.
In ShutdownGracefully.java, the finally-block needs to check to see if rmid is
still running, and if it is, to shut it down. Simply calling waitFor(timeout,
unit) isn't sufficient, because if the rmid process is still running, it will be
left running.
The straightforward approach would be to call ActivationLibrary.rmidRunning() to
test if it's still running. Unfortunately this isn't quite right, because
rmidRunning() has a timeout loop in it -- which should probably be removed. (I
think there's a bug for this.) Another approach would be simply to call
rmid.destroy(). This calls rmid's shutdown() method first, which is reasonable,
but I'm not sure it kills the process if that fails. In any case, this already
has a timeout loop waiting for the process to die, so ShutdownGracefully.java
needn't use a new waitFor(timeout, unit) call.
Removing the commented-out code that starts with "no longer needed" is good, and
removing the ShutdownDetectThread is also good, since that's unnecessary.
There are some more cleanups I have in mind here but I'd like to see a revised
webrev before proceeding.
Thanks,
s'marks
On 1/25/14 8:57 PM, Tristan Yan wrote:
> Hi Stuart
> Thank you for your review and suggestion.
> Yes, since this failure mode is very hard to be reproduced. I guess it's make
> sense to do some hack. And I also noticed in ActivationLibrary.rmidRunning.
> It does try to look up ActivationSystem but doesn't check if it's null. So I
> add the logic to make sure we will look up the non-null ActivationSystem. Also
> I did some cleanup if you don't mind. Add a waitFor(long timeout, TimeUnit
> unit) for JavaVM. Which we can have a better waitFor control.
> I appreciate you can review the code again.
> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.01/
> Thank you
> Tristan
>
>
> On 01/25/2014 10:20 AM, Stuart Marks wrote:
>> On 1/23/14 10:34 PM, Tristan Yan wrote:
>>> Hi All
>>> Could you review the bug fix for JDK-8032050.
>>>
>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.00/
>>>
>>> Description:
>>> This rare happened failure caused because when RMID starts. It don't guarantee
>>> sun.rmi.server.Activation.startActivation finishes.
>>> Fix by adding a iterative getSystem with a 5 seconds timeout.
>>
>> Hi Tristan,
>>
>> Adding a timing/retry loop into this test isn't the correct approach for
>> fixing this test.
>>
>> The timing loop isn't necessary because there is already a timing loop in
>> RMID.start() in the RMI test library. (There's another timing loop in
>> ActivationLibrary.rmidRunning() which should probably be removed.) So the
>> intent of this library call is that it start rmid and wait for it to become
>> ready. That logic doesn't need to be added to the test.
>>
>> In the bug report JDK-8032050 you had mentioned that the NullPointerException
>> was suspicious. You're right! I took a look and it seemed like it was related
>> to JDK-8023541, and I added a note to this effect to the bug report. The
>> problem here is that rmid can come up and transiently return null instead of
>> the stub of the activation system. That's what JDK-8023541 covers. I think
>> that rmid itself needs to be fixed, though modifying the timing loop in the
>> RMI test library to wait for rmid to come up *and* for the lookup to return
>> non-null is an easy way to fix the problem. (Or at least cover it up.)
>>
>> The next step in the analysis is to determine, given that
>> ActivationLibrary.getSystem can sometimes return null, whether this has
>> actually caused this test failure. This is pretty easy to determine; just
>> hack in a line "system = null" in the right place and run the test. I've done
>> this, and the test times out and the output log is pretty much identical to
>> the one in the bug report. (I recommend you try this yourself.) So I think
>> it's fairly safe to say that the problem in JDK-8023541 has caused the
>> failure listed in JDK-8032050.
>>
>> I can see a couple ways to proceed here. One way is just to close this out as
>> a duplicate of JDK-8023541 since that bug caused this failure.
>>
>> Another is that this test could use some cleaning up. This bug certainly
>> covers a failure, but the messages emitted are confusing and in some cases
>> completely wrong. For example, the "rmid has shutdown" message at line 180 is
>> incorrect, because in this case rmid is still running and the wait() call has
>> timed out. Most of the code here can be replaced with calls to various bits
>> of the RMI test library. There are a bunch of other things in this test that
>> could be cleaned up as well.
>>
>> It's up to you how you'd like to proceed.
>>
>> s'marks
>
More information about the core-libs-dev
mailing list