RFR: JDK-8032050: TEST_BUG: java/rmi/activation/Activatable/shutdownGracefully/ShutdownGracefully.java fails intermittently
Tristan Yan
tristan.yan at oracle.com
Fri Jan 31 08:36:56 UTC 2014
Thank you for fixing JDK-8023541. Then I leave ActivationLibrary.java
for now.
I still did some clean up following your suggestion.
1. I changed waitFor(long timeout) method, this method is going to use
code like Process.waitFor(timeout, unit). This can be backported to
JDK7. Also exitValue is kept as a return value. For making sure there is
no Pipe leak, a cleanup thread will start when timeout happens.
2. Change in ShutdownGracefully is a little tricky. I think we should
just destroy JVM once exception is thrown. So I move the wait logic into
try block instead keep them in finally block.
Can you receive it again.
http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.02/
Thank you
Tristan
On 01/29/2014 03:16 PM, Stuart Marks wrote:
> Hi Tristan,
>
> I don't want to put the workaround into
> ActivationLibrary.rmidRunning() for a null return from the lookup,
> because this is only a workaround for an actual bug in rmid
> initialization. See the review I just posted for JDK-8023541.
>
> Adding JavaVM.waitFor(timeout) is something that would be useful in
> general, but it needs to be handled carefully. It uses the new
> Process.waitFor(timeout, unit) which is new in Java SE 8; this makes
> backporting to 7 more complicated. Not clear whether we'll do so, but
> I don't want to forclose the opportunity without discussion. It's also
> not clear how one can get the vm's exit status after JavaVM.waitFor()
> has returned true. With the Process API it's possible simply to call
> waitFor() or exitValue(). With JavaVM, a new API needs to be created,
> or the rule has to be established that one must call JavaVM.waitFor()
> to collect the exit status as well as to close the pipes from the
> subprocess. If JavaVM.waitFor(timeout, unit) is called without
> subsequently calling JavaVM.waitFor(), the pipes are leaked.
>
> In ShutdownGracefully.java, the finally-block needs to check to see if
> rmid is still running, and if it is, to shut it down. Simply calling
> waitFor(timeout, unit) isn't sufficient, because if the rmid process
> is still running, it will be left running.
>
> The straightforward approach would be to call
> ActivationLibrary.rmidRunning() to test if it's still running.
> Unfortunately this isn't quite right, because rmidRunning() has a
> timeout loop in it -- which should probably be removed. (I think
> there's a bug for this.) Another approach would be simply to call
> rmid.destroy(). This calls rmid's shutdown() method first, which is
> reasonable, but I'm not sure it kills the process if that fails. In
> any case, this already has a timeout loop waiting for the process to
> die, so ShutdownGracefully.java needn't use a new waitFor(timeout,
> unit) call.
>
> Removing the commented-out code that starts with "no longer needed" is
> good, and removing the ShutdownDetectThread is also good, since that's
> unnecessary.
>
> There are some more cleanups I have in mind here but I'd like to see a
> revised webrev before proceeding.
>
> Thanks,
>
> s'marks
>
> On 1/25/14 8:57 PM, Tristan Yan wrote:
>> Hi Stuart
>> Thank you for your review and suggestion.
>> Yes, since this failure mode is very hard to be reproduced. I guess
>> it's make sense to do some hack. And I also noticed in
>> ActivationLibrary.rmidRunning. It does try to look up
>> ActivationSystem but doesn't check if it's null. So I add the logic
>> to make sure we will look up the non-null ActivationSystem. Also I
>> did some cleanup if you don't mind. Add a waitFor(long timeout,
>> TimeUnit unit) for JavaVM. Which we can have a better waitFor control.
>> I appreciate you can review the code again.
>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.01/
>> Thank you
>> Tristan
>>
>>
>> On 01/25/2014 10:20 AM, Stuart Marks wrote:
>>> On 1/23/14 10:34 PM, Tristan Yan wrote:
>>>> Hi All
>>>> Could you review the bug fix for JDK-8032050.
>>>>
>>>> http://cr.openjdk.java.net/~tyan/JDK-8032050/webrev.00/
>>>>
>>>> Description:
>>>> This rare happened failure caused because when RMID starts. It
>>>> don't guarantee
>>>> sun.rmi.server.Activation.startActivation finishes.
>>>> Fix by adding a iterative getSystem with a 5 seconds timeout.
>>>
>>> Hi Tristan,
>>>
>>> Adding a timing/retry loop into this test isn't the correct approach
>>> for fixing this test.
>>>
>>> The timing loop isn't necessary because there is already a timing
>>> loop in RMID.start() in the RMI test library. (There's another
>>> timing loop in ActivationLibrary.rmidRunning() which should probably
>>> be removed.) So the intent of this library call is that it start
>>> rmid and wait for it to become ready. That logic doesn't need to be
>>> added to the test.
>>>
>>> In the bug report JDK-8032050 you had mentioned that the
>>> NullPointerException was suspicious. You're right! I took a look and
>>> it seemed like it was related to JDK-8023541, and I added a note to
>>> this effect to the bug report. The problem here is that rmid can
>>> come up and transiently return null instead of the stub of the
>>> activation system. That's what JDK-8023541 covers. I think that rmid
>>> itself needs to be fixed, though modifying the timing loop in the
>>> RMI test library to wait for rmid to come up *and* for the lookup to
>>> return non-null is an easy way to fix the problem. (Or at least
>>> cover it up.)
>>>
>>> The next step in the analysis is to determine, given that
>>> ActivationLibrary.getSystem can sometimes return null, whether this
>>> has actually caused this test failure. This is pretty easy to
>>> determine; just hack in a line "system = null" in the right place
>>> and run the test. I've done this, and the test times out and the
>>> output log is pretty much identical to the one in the bug report. (I
>>> recommend you try this yourself.) So I think it's fairly safe to say
>>> that the problem in JDK-8023541 has caused the failure listed in
>>> JDK-8032050.
>>>
>>> I can see a couple ways to proceed here. One way is just to close
>>> this out as a duplicate of JDK-8023541 since that bug caused this
>>> failure.
>>>
>>> Another is that this test could use some cleaning up. This bug
>>> certainly covers a failure, but the messages emitted are confusing
>>> and in some cases completely wrong. For example, the "rmid has
>>> shutdown" message at line 180 is incorrect, because in this case
>>> rmid is still running and the wait() call has timed out. Most of the
>>> code here can be replaced with calls to various bits of the RMI test
>>> library. There are a bunch of other things in this test that could
>>> be cleaned up as well.
>>>
>>> It's up to you how you'd like to proceed.
>>>
>>> s'marks
>>
>
More information about the core-libs-dev
mailing list