From jaroslav.bachorik at oracle.com  Wed Sep  3 13:43:25 2014
From: jaroslav.bachorik at oracle.com (Jaroslav Bachorik)
Date: Wed, 03 Sep 2014 15:43:25 +0200
Subject: jmx-dev RFR 8057150: Add more diagnostics to JMXStartStopTest
Message-ID: <54071AFD.7030103@oracle.com>

Please review this trivial test change.

Issue : https://bugs.openjdk.java.net/browse/JDK-8057150
Webrev: http://cr.openjdk.java.net/~jbachorik/8057150/webrev.00

This change should give us more information about why the test fails as
described in https://bugs.openjdk.java.net/browse/JDK-8057149

It seems that the string matching fails for some reason, but the data
currently available in the logs provides no clue. Hopefully the additional
logging will make it possible to identify the problem.
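
As an illustration only (this is not the content of the webrev), one way to
get more out of a failing string match is to record every line that was
examined, with non-printable and non-ASCII characters made visible. The class
and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MatchDiagnosticsSketch {
    /**
     * Returns true if any line contains the expected text.
     * Adds one diagnostic entry per examined line to the given log.
     */
    static boolean containsExpected(List<String> lines, String expected, List<String> log) {
        boolean found = false;
        for (String line : lines) {
            boolean hit = line.contains(expected);
            log.add((hit ? "[match] " : "[no match] ") + visible(line));
            found |= hit;
        }
        return found;
    }

    /** Replaces control and non-ASCII characters with a visible <U+XXXX> form. */
    static String visible(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x20 || c > 0x7e) {
                sb.append(String.format("<U+%04X>", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        // Made-up sample output; the second line ends with a stray carriage return.
        boolean found = containsExpected(
                Arrays.asList("starting", "agent ok\r"), "agent ok", log);
        System.out.println("found=" + found);
        for (String entry : log) {
            System.out.println(entry);
        }
    }
}
```

Logging the examined lines in this form would also make a stray control
character or an unexpected translation of the message easy to spot.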


Thanks,

-JB-

From staffan.larsen at oracle.com  Wed Sep  3 13:48:32 2014
From: staffan.larsen at oracle.com (Staffan Larsen)
Date: Wed, 3 Sep 2014 15:48:32 +0200
Subject: jmx-dev RFR 8057150: Add more diagnostics to JMXStartStopTest
In-Reply-To: <54071AFD.7030103@oracle.com>
References: <54071AFD.7030103@oracle.com>
Message-ID: <68312515-0E34-4C1F-8764-4C37EEAD00A3@oracle.com>

Looks good!

Thanks,
/Staffan

On 3 sep 2014, at 15:43, Jaroslav Bachorik <jaroslav.bachorik at oracle.com> wrote:

> Please, review this trivial test change
> 
> Issue : https://bugs.openjdk.java.net/browse/JDK-8057150
> Webrev: http://cr.openjdk.java.net/~jbachorik/8057150/webrev.00
> 
> This change is to provide us with more info about why the test fails as described in https://bugs.openjdk.java.net/browse/JDK-8057149
> 
> It seems that the string matching fails for some reason but currently available data in logs don't provide any clue. Hopefully, by adding more logging it would be possible to identify the problem.
> 
> 
> Thanks,
> 
> -JB-


From daniel.fuchs at oracle.com  Wed Sep  3 13:57:13 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Wed, 03 Sep 2014 15:57:13 +0200
Subject: jmx-dev RFR 8057150: Add more diagnostics to JMXStartStopTest
In-Reply-To: <54071AFD.7030103@oracle.com>
References: <54071AFD.7030103@oracle.com>
Message-ID: <54071E39.3040108@oracle.com>

Hi Jaroslav,

Looks good.

I wonder however - are these messages internationalized?
If so, could it be a locale/env issue (e.g. matching English
against some other language)?

best regards,

-- daniel

On 9/3/14 3:43 PM, Jaroslav Bachorik wrote:
> Please, review this trivial test change
>
> Issue : https://bugs.openjdk.java.net/browse/JDK-8057150
> Webrev: http://cr.openjdk.java.net/~jbachorik/8057150/webrev.00
>
> This change is to provide us with more info about why the test fails as
> described in https://bugs.openjdk.java.net/browse/JDK-8057149
>
> It seems that the string matching fails for some reason but currently
> available data in logs don't provide any clue. Hopefully, by adding more
> logging it would be possible to identify the problem.
>
>
> Thanks,
>
> -JB-


From jaroslav.bachorik at oracle.com  Wed Sep  3 14:02:16 2014
From: jaroslav.bachorik at oracle.com (Jaroslav Bachorik)
Date: Wed, 03 Sep 2014 16:02:16 +0200
Subject: jmx-dev RFR 8057134:
 sun/management/jmxremote/startstop/JMXStartStopTest.java failing
 intermittently
Message-ID: <54071F68.5050307@oracle.com>

Please review this test change.

Issue : https://bugs.openjdk.java.net/browse/JDK-8057134
Webrev: http://cr.openjdk.java.net/~jbachorik/8057134/webrev.02

Currently the test expects one of two exception types when trying to
connect to a port not being served by the management agent:
NoSuchObjectException or ConnectException. However, under certain
circumstances RMI might throw ConnectIOException or any other suitable
subtype of RemoteException.

The solution is to check that the thrown exception is a subtype of
RemoteException instead.

Thanks,

-JB-

From jaroslav.bachorik at oracle.com  Wed Sep  3 14:03:16 2014
From: jaroslav.bachorik at oracle.com (Jaroslav Bachorik)
Date: Wed, 03 Sep 2014 16:03:16 +0200
Subject: jmx-dev RFR 8057150: Add more diagnostics to JMXStartStopTest
In-Reply-To: <54071E39.3040108@oracle.com>
References: <54071AFD.7030103@oracle.com> <54071E39.3040108@oracle.com>
Message-ID: <54071FA4.3050100@oracle.com>

On 09/03/2014 03:57 PM, Daniel Fuchs wrote:
> Hi Jaroslav,
>
> Looks good.
>
> I wonder however - are these messages internationalized?
> If so could it be a locale/env issue (e.g. matching english
> against some other language)?

I am not 100% certain. The strange thing is that you can see the error
message correctly in the log, but the test behaves as if it weren't there.

-JB-

>
> best regards,
>
> -- daniel
>
> On 9/3/14 3:43 PM, Jaroslav Bachorik wrote:
>> Please, review this trivial test change
>>
>> Issue : https://bugs.openjdk.java.net/browse/JDK-8057150
>> Webrev: http://cr.openjdk.java.net/~jbachorik/8057150/webrev.00
>>
>> This change is to provide us with more info about why the test fails as
>> described in https://bugs.openjdk.java.net/browse/JDK-8057149
>>
>> It seems that the string matching fails for some reason but currently
>> available data in logs don't provide any clue. Hopefully, by adding more
>> logging it would be possible to identify the problem.
>>
>>
>> Thanks,
>>
>> -JB-
>


From daniel.fuchs at oracle.com  Wed Sep  3 14:30:14 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Wed, 03 Sep 2014 16:30:14 +0200
Subject: jmx-dev RFR 8057134:
 sun/management/jmxremote/startstop/JMXStartStopTest.java failing
 intermittently
In-Reply-To: <54071F68.5050307@oracle.com>
References: <54071F68.5050307@oracle.com>
Message-ID: <540725F6.9020101@oracle.com>

Hi Jaroslav,

- import java.net.ConnectException;

java.net.ConnectException is not a RemoteException, but
java.rmi.ConnectException is.

I wonder whether the connect code may throw java.net.ConnectException,
or whether the test was wrong in the first place.

If the java.net.ConnectException may be thrown then replacing
it with RemoteException will probably cause new failures.

best regards,

-- daniel


On 9/3/14 4:02 PM, Jaroslav Bachorik wrote:
> Please, review this test change
>
> Issue : https://bugs.openjdk.java.net/browse/JDK-8057134
> Webrev: http://cr.openjdk.java.net/~jbachorik/8057134/webrev.02
>
> Currently the test expects one of the following exception types when
> trying to connect to a port not being server by the management agent -
> NoSuchObjectException and ConnectException. However, under certain
> circumstances RMI might throw ConnectIOException or any other suiting
> subtype of RemoteException.
>
> The solution is to check for the thrown exception being a subtype of
> RemoteException instead.
>
> Thanks,
>
> -JB-


From staffan.larsen at oracle.com  Wed Sep  3 14:30:12 2014
From: staffan.larsen at oracle.com (Staffan Larsen)
Date: Wed, 3 Sep 2014 16:30:12 +0200
Subject: jmx-dev RFR 8057134:
	sun/management/jmxremote/startstop/JMXStartStopTest.java
	failing intermittently
In-Reply-To: <54071F68.5050307@oracle.com>
References: <54071F68.5050307@oracle.com>
Message-ID: <36378782-8980-47C9-89B8-BF37277821D4@oracle.com>

Looks good!

Thanks,
/Staffan

On 3 sep 2014, at 16:02, Jaroslav Bachorik <jaroslav.bachorik at oracle.com> wrote:

> Please, review this test change
> 
> Issue : https://bugs.openjdk.java.net/browse/JDK-8057134
> Webrev: http://cr.openjdk.java.net/~jbachorik/8057134/webrev.02
> 
> Currently the test expects one of the following exception types when trying to connect to a port not being server by the management agent - NoSuchObjectException and ConnectException. However, under certain circumstances RMI might throw ConnectIOException or any other suiting subtype of RemoteException.
> 
> The solution is to check for the thrown exception being a subtype of RemoteException instead.
> 
> Thanks,
> 
> -JB-


From jaroslav.bachorik at oracle.com  Wed Sep  3 14:47:36 2014
From: jaroslav.bachorik at oracle.com (Jaroslav Bachorik)
Date: Wed, 03 Sep 2014 16:47:36 +0200
Subject: jmx-dev RFR 8057134:
 sun/management/jmxremote/startstop/JMXStartStopTest.java failing
 intermittently
In-Reply-To: <540725F6.9020101@oracle.com>
References: <54071F68.5050307@oracle.com> <540725F6.9020101@oracle.com>
Message-ID: <54072A08.7030403@oracle.com>

On 09/03/2014 04:30 PM, Daniel Fuchs wrote:
> Hi Jaroslav,
>
> - import java.net.ConnectException;
>
> java.net.ConnectException is not a RemoteException, but
> java.rmi.ConnectException is.
>
> I wonder whether the connect code may throw java.net.ConnectException,
> or whether the test was wrong in the first place.
>
> If the java.net.ConnectException may be thrown then replacing
> it with RemoteException will probably cause new failures.

Good catch! Removing java.net.ConnectException doesn't seem to cause any 
apparent regressions but I'd better keep it there.

http://cr.openjdk.java.net/~jbachorik/8057134/webrev.03
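
The distinction raised in the review can be sketched as follows. This is a
minimal illustration and not the code from the webrev: the helper name
isExpectedConnectFailure is hypothetical. It relies only on the standard class
hierarchy, in which java.rmi.ConnectException, java.rmi.ConnectIOException and
java.rmi.NoSuchObjectException are subtypes of RemoteException, while
java.net.ConnectException is not.

```java
import java.rmi.ConnectIOException;
import java.rmi.NoSuchObjectException;
import java.rmi.RemoteException;

class ExpectedFailureCheck {
    /**
     * Hypothetical helper: accepts any RemoteException subtype and,
     * to stay on the safe side, java.net.ConnectException as well.
     */
    static boolean isExpectedConnectFailure(Throwable t) {
        return t instanceof RemoteException
            || t instanceof java.net.ConnectException;
    }

    public static void main(String[] args) {
        // All RMI-level failures are covered by the RemoteException check.
        System.out.println(isExpectedConnectFailure(new java.rmi.ConnectException("refused")));
        System.out.println(isExpectedConnectFailure(new ConnectIOException("io")));
        System.out.println(isExpectedConnectFailure(new NoSuchObjectException("gone")));
        // The socket-level exception is NOT a RemoteException, so it needs its own check.
        System.out.println(RemoteException.class.isAssignableFrom(java.net.ConnectException.class));
        System.out.println(isExpectedConnectFailure(new java.net.ConnectException("refused")));
    }
}
```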

-JB-


>
> best regards,
>
> -- daniel
>
>
> On 9/3/14 4:02 PM, Jaroslav Bachorik wrote:
>> Please, review this test change
>>
>> Issue : https://bugs.openjdk.java.net/browse/JDK-8057134
>> Webrev: http://cr.openjdk.java.net/~jbachorik/8057134/webrev.02
>>
>> Currently the test expects one of the following exception types when
>> trying to connect to a port not being server by the management agent -
>> NoSuchObjectException and ConnectException. However, under certain
>> circumstances RMI might throw ConnectIOException or any other suiting
>> subtype of RemoteException.
>>
>> The solution is to check for the thrown exception being a subtype of
>> RemoteException instead.
>>
>> Thanks,
>>
>> -JB-
>


From daniel.fuchs at oracle.com  Wed Sep  3 15:06:56 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Wed, 03 Sep 2014 17:06:56 +0200
Subject: jmx-dev RFR 8057134:
 sun/management/jmxremote/startstop/JMXStartStopTest.java failing
 intermittently
In-Reply-To: <54072A08.7030403@oracle.com>
References: <54071F68.5050307@oracle.com> <540725F6.9020101@oracle.com>
	<54072A08.7030403@oracle.com>
Message-ID: <54072E90.4040201@oracle.com>

Looks good!

On 9/3/14 4:47 PM, Jaroslav Bachorik wrote:
> On 09/03/2014 04:30 PM, Daniel Fuchs wrote:
>> Hi Jaroslav,
>>
>> - import java.net.ConnectException;
>>
>> java.net.ConnectException is not a RemoteException, but
>> java.rmi.ConnectException is.
>>
>> I wonder whether the connect code may throw java.net.ConnectException,
>> or whether the test was wrong in the first place.
>>
>> If the java.net.ConnectException may be thrown then replacing
>> it with RemoteException will probably cause new failures.
>
> Good catch! Removing java.net.ConnectException doesn't seem to cause any
> apparent regressions but I'd better keep it there.
>
> http://cr.openjdk.java.net/~jbachorik/8057134/webrev.03
>
> -JB-
>
>
>>
>> best regards,
>>
>> -- daniel
>>
>>
>> On 9/3/14 4:02 PM, Jaroslav Bachorik wrote:
>>> Please, review this test change
>>>
>>> Issue : https://bugs.openjdk.java.net/browse/JDK-8057134
>>> Webrev: http://cr.openjdk.java.net/~jbachorik/8057134/webrev.02
>>>
>>> Currently the test expects one of the following exception types when
>>> trying to connect to a port not being server by the management agent -
>>> NoSuchObjectException and ConnectException. However, under certain
>>> circumstances RMI might throw ConnectIOException or any other suiting
>>> subtype of RemoteException.
>>>
>>> The solution is to check for the thrown exception being a subtype of
>>> RemoteException instead.
>>>
>>> Thanks,
>>>
>>> -JB-
>>
>


From shanliang.jiang at oracle.com  Mon Sep  8 09:27:46 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Mon, 08 Sep 2014 11:27:46 +0200
Subject: jmx-dev Review request: 8049303: Transient network problems
 cause JMX thread to fail silenty
In-Reply-To: <54004ACA.4040802@oracle.com>
References: <53FBDB1A.2070501@oracle.com>
	<53FF5177.7000700@oracle.com>	<5400471F.3050802@oracle.com>
	<54004ACA.4040802@oracle.com>
Message-ID: <540D7692.4000009@oracle.com>

Jaroslav,

Your fix closes the connection whenever the IOException is not related to
a serialization problem, without testing whether the connection has come back.

This might change the current RMIConnector behavior, because the method
    RMIClientCommunicatorAdmin.gotIOException
is called not only by the notification fetching thread but by all remote
requests of an RMI client. Look at:
    RMIConnector.createMBean
            try {
                return connection.createMBean(className,
                        name,
                        loaderName,
                        delegationSubject);

            } catch (IOException ioe) {
                communicatorAdmin.gotIOException(ioe);

                return connection.createMBean(className,
                        name,
                        loaderName,
                        delegationSubject);

            } finally {
                popDefaultClassLoader(old);
            }

With the suggested fix, the second call to connection.createMBean would
no longer happen (yes, we need more tests to cover these cases).

So it would be better to put the fix in RMIConnector.RMINotifClient.fetchNotifs.
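
The retry pattern in the createMBean excerpt above can be sketched in
isolation. This is a simplified model, not JDK code: the interfaces RemoteCall
and ConnectionCheck are hypothetical stand-ins for the remote request and for
communicatorAdmin.gotIOException. It shows that the second call only happens
when the check returns normally.

```java
import java.io.IOException;

class RetryAfterCheckSketch {
    /** Hypothetical stand-in for a remote request such as createMBean. */
    interface RemoteCall<T> { T call() throws IOException; }

    /** Hypothetical stand-in for communicatorAdmin.gotIOException. */
    interface ConnectionCheck { void gotIOException(IOException ioe) throws IOException; }

    static <T> T invoke(RemoteCall<T> call, ConnectionCheck check) throws IOException {
        try {
            return call.call();
        } catch (IOException ioe) {
            // If the check throws, the retry below is never reached.
            check.gotIOException(ioe);
            return call.call();
        }
    }

    /** Runs a call that fails on its first attempt only and reports what happened. */
    static String demo(boolean checkThrows) {
        final int[] attempts = {0};
        RemoteCall<String> call = () -> {
            if (attempts[0]++ == 0) throw new IOException("first attempt failed");
            return "created";
        };
        ConnectionCheck check = ioe -> { if (checkThrows) throw ioe; };
        try {
            return invoke(call, check) + ", attempts=" + attempts[0];
        } catch (IOException e) {
            return "failed, attempts=" + attempts[0];
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(false));
        System.out.println(demo(true));
    }
}
```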

Thanks,
Shanliang

Jaroslav Bachorik wrote:
> On 08/29/2014 11:25 AM, Daniel Fuchs wrote:
>> Hi Jaroslav,
>>
>> I am not sure to understand how this solves the problem.
>> The old code first checked the connection, and if that failed,
>> sent the FAILED notification, closed the connector, and rethrew
>> the exception.
>
> This problem seems to have something to do with the way RMI works - 
> the customer had problems with one set of ties/stubs while the other 
> set of ties/stubs worked just fine. Seems like in cases of transient 
> network failures the connection check was not reliable.
>
>>
>> The new code directly throws the exception without
>> checking the connection, and therefore without closing
>> the connection and sending the FAILED notification.
>
> It only does so for the cases where the connection itself is not the 
> culprit - error while executing the method on the server, marshalling 
> problems etc.
>
>>
>> So is the fix a change of behavior by which the RMIConnector
>> will - in some cases - not try to autoclose the connection but
>> instead simply wait for the caller to explicitely call close()?
>
> Not really - the change is in relying on the RMI providing the 
> information whether the connection is still usable or not. The code 
> didn't autoclose the connection when 
> "connection.getDefaultDomain(null)" didn't throw IOException either.
>
>>
>> I'd be interested to hear what Shanliang has to say...
>
> Yep. The code does a lot of things at once and without any spec for 
> handling failures and recovery we can only rely on the tests.
>
> -JB-
>
>>
>> best regards,
>>
>> -- daniel
>>
>>
>> On 8/28/14 5:57 PM, Jaroslav Bachorik wrote:
>>> I have taken over this issue from Poonam since she will be unavailable
>>> for the next month or so.
>>>
>>> Could I have reviews for this change:
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8049303
>>> Webrev: http://cr.openjdk.java.net/~jbachorik/8049303/webrev.00
>>>
>>> Problem and fix:
>>> By default the JMX client side notification fetch timeout
>>> (jmx.remote.x.notification.fetch.timeout) is 1 minute and the default
>>> server connection timeout (jmx.remote.x.server.connection.timeout) is 2
>>> minutes.
>>>
>>> If the client side connector thread makes a notification fetch request
>>> to the server, but a transient network problem prevents the server
>>> response from reaching the client, the client side connector will wait
>>> for a response until the timeout period (1 minute) has expired before
>>> throwing an IOException.
>>>
>>> The client side RMIConnector implementation handles the IOException, by
>>> re-checking the connection status to understand whether or not it is
>>> broken. If the connection is not available at that moment, the 
>>> connector
>>> fails by re-throwing the initial IOException. The problem is that this
>>> re-check of the connection passes because the server side of the
>>> connection doesn't time out until 2 minutes has passed (by default), so
>>> the NotifFetcher thread
>>> dies without posting a failed notification, and the client application
>>> does not get a chance to recover.
>>>
>>> The fix is to forward the non connection-related exceptions on the JMX
>>> client side instead of checking the connection status. The
>>> connection-related exceptions will cause closing the session as an
>>> unsuccessful connection check would have done.
>>>
>>> Testing:
>>> All the jdk_jmx and jdk_management regression tests passed.
>>> All the related JCK tests passed.
>>>
>>> The fix applies cleanly to 8u and 7u repos.
>>>
>>>
>>> Thanks,
>>> -JB-
>>>
>>>
>>
>


From shanliang.jiang at oracle.com  Wed Sep 10 15:42:45 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Wed, 10 Sep 2014 17:42:45 +0200
Subject: jmx-dev Review request: 8049303: Transient network problems cause
 JMX	thread to fail silenty
Message-ID: <54107175.7030200@oracle.com>

Hi,

The issue could happen like this:
 1) RMIConnector.RMINotifClient.fetchNotifs got an IOException.
 2) communicatorAdmin.gotIOException(ioe) was called to check the
    connection; it did not close the connection because the connection
    was OK by then.
 3) RMIConnector.RMINotifClient.fetchNotifs analyzed the original
    exception, found that it was not a deserialization exception, and
    re-threw the original IOException.
 4) The caller, ClientNotifForwarder, did not know how to treat this
    exception and decided to end silently.

The fix is to modify RMIConnector.RMINotifClient.fetchNotifs:

If the fetchNotifs request gets an IOException, we examine the chain of
exceptions to determine whether this is a deserialization issue. If so,
we propagate the appropriate exception to the caller, who will then
proceed with fetching notifications one by one. Otherwise we call
communicatorAdmin.gotIOException(ioe), which has two possible outcomes:
    1) The call returns normally, which means the connection has been
       re-established; we call fetchNotifs again.
    2) The call throws an IOException; we check the connection status:
        2-1) "terminated": the connection is closed. We re-throw the
             original IOException and the caller will end silently.
        2-2) not "terminated": we use a new flag, "retried", for this
             situation. If the flag is false, we set it to true and
             redo the fetchNotifs request, which helps with a transient
             network problem. Otherwise we close the connection and
             re-throw the original IOException. This is where the bug
             is fixed.

We do not modify communicatorAdmin.gotIOException(ioe), because it is also
called by all other remote requests.
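
The control flow described above can be sketched as a small self-contained
model. This is an illustration under stated assumptions, not the code in the
webrev: Fetcher and Admin are hypothetical interfaces standing in for the
remote fetch call and for communicatorAdmin plus the connection state, and
isDeserializationProblem is a placeholder for the real examination of the
exception chain.

```java
import java.io.IOException;
import java.rmi.UnmarshalException;

class FetchRetrySketch {
    /** Hypothetical stand-in for the remote fetchNotifications request. */
    interface Fetcher { String fetch() throws IOException; }

    /** Hypothetical stand-in for communicatorAdmin and the connection state. */
    interface Admin {
        void gotIOException(IOException ioe) throws IOException;
        boolean isTerminated();
        void close();
    }

    /** Placeholder test (an assumption); the real code inspects the exception chain. */
    static boolean isDeserializationProblem(IOException ioe) {
        return ioe instanceof UnmarshalException;
    }

    static String fetchNotifs(Fetcher fetcher, Admin admin) throws IOException {
        boolean retried = false;
        while (true) {
            try {
                return fetcher.fetch();
            } catch (IOException ioe) {
                // Deserialization issues go straight to the caller.
                if (isDeserializationProblem(ioe)) throw ioe;
                try {
                    admin.gotIOException(ioe);
                    // Case 1: connection re-established, loop and fetch again.
                } catch (IOException checkFailed) {
                    // Case 2-1: connection already closed.
                    if (admin.isTerminated()) throw ioe;
                    // Case 2-2, second failure: close so nobody fails silently.
                    if (retried) {
                        admin.close();
                        throw ioe;
                    }
                    // Case 2-2, first failure: possibly transient, try once more.
                    retried = true;
                }
            }
        }
    }

    /** Simulates a fetch that fails the given number of times before succeeding. */
    static String simulate(int failures) {
        final int[] attempts = {0};
        final boolean[] closed = {false};
        Fetcher fetcher = () -> {
            if (attempts[0]++ < failures) throw new IOException("transient");
            return "batch";
        };
        Admin admin = new Admin() {
            public void gotIOException(IOException ioe) throws IOException { throw ioe; }
            public boolean isTerminated() { return false; }
            public void close() { closed[0] = true; }
        };
        String outcome;
        try {
            outcome = fetchNotifs(fetcher, admin);
        } catch (IOException e) {
            outcome = "failed";
        }
        return outcome + ", attempts=" + attempts[0] + ", closed=" + closed[0];
    }

    public static void main(String[] args) {
        System.out.println(simulate(1));
        System.out.println(simulate(2));
    }
}
```

In this model a single transient failure is absorbed by the retry, while a
second consecutive failure closes the connection before the exception reaches
the caller.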

It is not easy to write a test that reproduces the bug.

Bug: https://bugs.openjdk.java.net/browse/JDK-8049303
webrev: http://cr.openjdk.java.net/~sjiang/JDK-8049303/00/ 

Thanks,
Shanliang

From daniel.fuchs at oracle.com  Wed Sep 10 16:54:55 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Wed, 10 Sep 2014 18:54:55 +0200
Subject: jmx-dev Review request: 8049303: Transient network problems
 cause JMX thread to fail silenty
In-Reply-To: <54107175.7030200@oracle.com>
References: <54107175.7030200@oracle.com>
Message-ID: <5410825F.80504@oracle.com>

Hi,

Thanks for the detailed explanations.

The fact that the server doesn't store any client state
and can arbitrarily close the connection, leaving it to
the client to reestablish the connection, makes all this
code quite tricky and hard to follow.

I believe what you propose - making sure that the
notification thread doesn't stop without closing the
connection, is the right thing to do.

I wonder however if the code that closes the connection
should better be moved to ClientNotifForwarder::fetchNotifs?

ClientNotifForwarder::fetchNotifs has the following statement:

        if (!shouldStop()) {
            logger.error("NotifFetcher-run",
                         "Failed to fetch notification, " +
                         "stopping thread. Error is: " + ioe, ioe);
            logger.debug("NotifFetcher-run",ioe);
        }

Then it proceeds to return null, which causes the thread to die.

It looks as if that's the place where the connection should ideally
be closed if it is not already closed, because it would ensure
that the thread never dies silently.

Otherwise I'd suggest improving the comment below:

                            // JDK-8049303
                            // possible again transient or a
                            // non-deserialization exception, not know how
                            // to treat, close the connection

May I suggest something like:

// JDK-8049303
// We received an IOException - but the communicatorAdmin
// did not close the connection - possibly because
// the original exception was raised by a transient network
// problem?
// We already know that this exception is not due to a deserialization
// issue as we already took care of that before involving the
// communicatorAdmin. Moreover - we already made one retry attempt
// at fetching the next batch of notifications - and the
// problem persisted.
// Since trying again doesn't seem to solve the issue, we will now
// close the connection. Doing otherwise might cause the
// NotifFetcher thread to die silently.

best regards,

-- daniel

On 9/10/14 5:42 PM, shanliang wrote:
> Hi,
>
> The issue could happen like this:
> 1) RMIConnector.RMINotifClient.fetchNotifs got an IOException
> 2) communicatorAdmin.gotIOException(ioe) was called to check the
> connection, it did not close the connection because the connection was
> now OK.
> 3) RMIConnector.RMINotifClient.fetchNotifs analyzed the original
> exception and found it was not a dersialization exception, it re-threw
> the original IOException
> 4) the caller ClientNotifForwarder did not know how to treat this
> exception, decided to end silently.
>
> The fix is to modify RMIConnector.RMINotifClient.fetchNotifs:
>
> if the fetchNotifs request gets an IOException, we examine the chain of
> exceptions to determine whether this is a deserialization issue. If so -
> we propagate the appropriate exception to the caller, who will then
> proceed with fetching notifications one by one, otherwise we call
> communicatorAdmin.gotIOException(ioe), there are 2 kinds of response:
>     1) the call returns OK, means the connection is re-established, we
> re-call the fetchNotifs;
>     2) the call throws IOException, we check the connection status:
>         2-1) "terminated", that means the connection is closed, we
> re-throw the original IOException, the caller will end silently.
>         2-2) not "terminated", we add a flag "retried" for this
> situation, if the flag is false, we set the flag to true and re-do the
> fetchNotifs request, this is useful for a transient network problem,
> otherwise we close the connection and re-throw the original IOException,
> it is here we fix the bug.
>
> We do not modify communicatorAdmin.gotIOException(ioe), it is called too
> by all other remote requests.
>
> It is not easy to have a test reproducing the bug.
>
> Bug: https://bugs.openjdk.java.net/browse/JDK-8049303
> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8049303/00/
> <http://cr.openjdk.java.net/%7Esjiang/JDK-8049303/00/>
>
> Thanks,
> Shanliang


From shanliang.jiang at oracle.com  Wed Sep 10 17:18:09 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Wed, 10 Sep 2014 19:18:09 +0200
Subject: jmx-dev Review request: 8049303: Transient network problems
 cause JMX thread to fail silenty
In-Reply-To: <5410825F.80504@oracle.com>
References: <54107175.7030200@oracle.com> <5410825F.80504@oracle.com>
Message-ID: <541087D1.9080109@oracle.com>

Daniel Fuchs wrote:
> Hi,
>
> Thanks for the detailed explanations.
>
> The fact that the server doesn't store any client state
> and can arbitrarily close the connection, leaving it to
> the client to reestablish the connection, makes all this
> code quite tricky and hard to follow.
Yes, it is complicated. We allow a server to close a client connection after
a specific timeout because in some cases a client was dead but the server
needed a very long time to find out, which could cause memory problems.
>
> I believe what you propose - making sure that the
> notification thread doesn't stop without closing the
> connection, is the right thing to do.
>
> I wonder however if the code that closes the connection
> should better be moved to ClientNotifForwarder::fetchNotifs?
>
> ClientNotifForwarder::fetchNotifs has the following statement:
>
> 601         if (!shouldStop()) {
> 602             logger.error("NotifFetcher-run",
> 603                          "Failed to fetch notification, " +
> 604                          "stopping thread. Error is: " + ioe, ioe);
> 605             logger.debug("NotifFetcher-run",ioe);
> 607         }
>
> Then it proceeds to return null, which causes the thread to die.
ClientNotifForwarder is an abstract superclass and does not know how to
close a connection. It is extended by RMIConnector.RMINotifClient and by
JMXMP; if we modified the class to hold a connection reference, that might
cause problems for JMXMP.

>
> It looks as if that's the place where the connection should ideally
> be closed if it is not already closed, because it would ensure
> that the thread never dies silently.
>
> Otherwise I'd suggest improving the comment below:
>
> 1369                             // JDK-8049303
> 1370                             // possible again transient or a
> 1371                             // non-deserialization exception, not 
> know how
> 1372                             // to treat, close the connection
>
> May I suggest something like:
>
> // JDK-8049303
> // We received an IOException - but the communicatorAdmin
> // did not close the connection - possibly because the
> // the original exception was raised by a transient network
> // problem?
> // We already know that this exception is not due to a deserialization
> // issue as we already took care of that before involving the
> // communicatorAdmin. Moreover - we already made one retry attempt
> // at fetching the next batch of notifications - and the
> // problem persisted.
> // Since trying again doesn't seem to solve the issue, we will now
> // close the connection. Doing otherwise might cause the
> // NotifFetcher thread to die silently.
Yes, that is clearer. Here is the new webrev:

http://cr.openjdk.java.net/~sjiang/JDK-8049303/01/

Thanks Daniel!
Shanliang
>
> best regards,
>
> -- daniel
>
> On 9/10/14 5:42 PM, shanliang wrote:
>> Hi,
>>
>> The issue could happen like this:
>> 1) RMIConnector.RMINotifClient.fetchNotifs got an IOException
>> 2) communicatorAdmin.gotIOException(ioe) was called to check the
>> connection, it did not close the connection because the connection was
>> now OK.
>> 3) RMIConnector.RMINotifClient.fetchNotifs analyzed the original
>> exception and found it was not a dersialization exception, it re-threw
>> the original IOException
>> 4) the caller ClientNotifForwarder did not know how to treat this
>> exception, decided to end silently.
>>
>> The fix is to modify RMIConnector.RMINotifClient.fetchNotifs:
>>
>> if the fetchNotifs request gets an IOException, we examine the chain of
>> exceptions to determine whether this is a deserialization issue. If so -
>> we propagate the appropriate exception to the caller, who will then
>> proceed with fetching notifications one by one, otherwise we call
>> communicatorAdmin.gotIOException(ioe), there are 2 kinds of response:
>>     1) the call returns OK, means the connection is re-established, we
>> re-call the fetchNotifs;
>>     2) the call throws IOException, we check the connection status:
>>         2-1) "terminated", that means the connection is closed, we
>> re-throw the original IOException, the caller will end silently.
>>         2-2) not "terminated", we add a flag "retried" for this
>> situation, if the flag is false, we set the flag to true and re-do the
>> fetchNotifs request, this is useful for a transient network problem,
>> otherwise we close the connection and re-throw the original IOException,
>> it is here we fix the bug.
>>
>> We do not modify communicatorAdmin.gotIOException(ioe), it is called too
>> by all other remote requests.
>>
>> It is not easy to have a test reproducing the bug.
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8049303
>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8049303/00/
>> <http://cr.openjdk.java.net/%7Esjiang/JDK-8049303/00/>
>>
>> Thanks,
>> Shanliang
>


From shanliang.jiang at oracle.com  Wed Sep 10 19:45:00 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Wed, 10 Sep 2014 21:45:00 +0200
Subject: jmx-dev Review request: 8049303: Transient network problems
 cause JMX thread to fail silenty
In-Reply-To: <541087D1.9080109@oracle.com>
References: <54107175.7030200@oracle.com> <5410825F.80504@oracle.com>
	<541087D1.9080109@oracle.com>
Message-ID: <5410AA3C.70009@oracle.com>

shanliang wrote:
> Daniel Fuchs wrote:
>> Hi,
>>
>> Thanks for the detailed explanations.
>>
>> The fact that the server doesn't store any client state
>> and can arbitrarily close the connection, leaving it to
>> the client to reestablish the connection, makes all this
>> code quite tricky and hard to follow.
> Yes it is complicated, we allowed a server to close a client after a 
> specific timeout because in some case a client was dead but the server 
> needed long long time to be informed, that could make memory problem.
>>
>> I believe what you propose - making sure that the
>> notification thread doesn't stop without closing the
>> connection, is the right thing to do.
>>
>> I wonder however if the code that closes the connection
>> should better be moved to ClientNotifForwarder::fetchNotifs?
>>
>> ClientNotifForwarder::fetchNotifs has the following statement:
>>
>> 601         if (!shouldStop()) {
>> 602             logger.error("NotifFetcher-run",
>> 603                          "Failed to fetch notification, " +
>> 604                          "stopping thread. Error is: " + ioe, ioe);
>> 605             logger.debug("NotifFetcher-run",ioe);
>> 607         }
>>
>> Then it proceeds to return null, which causes the thread to die.
> ClientNotifForwarder is an abstract super class and does not know how 
> to close a connection, this class is extended by 
> RMIConnector.RMINotifClient and JMXMP, if we modify the class to have 
> connection reference, that might make problem for JMXMP.
>
>>
>> It looks as if that's the place where the connection should ideally
>> be closed if it is not already closed, because it would ensure
>> that the thread never dies silently.
>>
>> Otherwise I'd suggest improving the comment below:
>>
>> 1369                             // JDK-8049303
>> 1370                             // possible again transient or a
>> 1371                             // non-deserialization exception, 
>> not know how
>> 1372                             // to treat, close the connection
>>
>> May I suggest something like:
>>
>> // JDK-8049303
>> // We received an IOException - but the communicatorAdmin
>> // did not close the connection - possibly because the
>> // the original exception was raised by a transient network
>> // problem?
>> // We already know that this exception is not due to a deserialization
>> // issue as we already took care of that before involving the
>> // communicatorAdmin. Moreover - we already made one retry attempt
>> // at fetching the next batch of notifications - and the
>> // problem persisted.
>> // Since trying again doesn't seem to solve the issue, we will now
>> // close the connection. Doing otherwise might cause the
>> // NotifFetcher thread to die silently.
> Yes more clear, here is the new webrev:
>
> http://cr.openjdk.java.net/~sjiang/JDK-8049303/01/
Oh, one correction: the retry attempt does not fetch the next batch of
notifications, but the *SAME* batch of notifications.

http://cr.openjdk.java.net/~sjiang/JDK-8049303/02/ 

Shanliang
>
> Thanks Daniel!
> Shanliang
>>
>> best regards,
>>
>> -- daniel
>>
>> On 9/10/14 5:42 PM, shanliang wrote:
>>> Hi,
>>>
>>> The issue could happen like this:
>>> 1) RMIConnector.RMINotifClient.fetchNotifs got an IOException
>>> 2) communicatorAdmin.gotIOException(ioe) was called to check the
>>> connection, it did not close the connection because the connection was
>>> now OK.
>>> 3) RMIConnector.RMINotifClient.fetchNotifs analyzed the original
>>> exception and found it was not a deserialization exception, it re-threw
>>> the original IOException
>>> 4) the caller ClientNotifForwarder did not know how to treat this
>>> exception, decided to end silently.
>>>
>>> The fix is to modify RMIConnector.RMINotifClient.fetchNotifs:
>>>
>>> if the fetchNotifs request gets an IOException, we examine the chain of
>>> exceptions to determine whether this is a deserialization issue. If so -
>>> we propagate the appropriate exception to the caller, who will then
>>> proceed with fetching notifications one by one, otherwise we call
>>> communicatorAdmin.gotIOException(ioe), there are 2 kinds of response:
>>>     1) the call returns OK, means the connection is re-established, we
>>> re-call the fetchNotifs;
>>>     2) the call throws IOException, we check the connection status:
>>>         2-1) "terminated", that means the connection is closed, we
>>> re-throw the original IOException, the caller will end silently.
>>>         2-2) not "terminated", we add a flag "retried" for this
>>> situation, if the flag is false, we set the flag to true and re-do the
>>> fetchNotifs request, this is useful for a transient network problem,
>>> otherwise we close the connection and re-throw the original IOException,
>>> it is here we fix the bug.
>>>
>>> We do not modify communicatorAdmin.gotIOException(ioe), because it is
>>> also called by all other remote requests.
>>>
>>> It is not easy to have a test reproducing the bug.
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8049303
>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8049303/00/
>>>
>>> Thanks,
>>> Shanliang
>>
>
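
A minimal Java sketch of the retry logic described in the original
message above. It is not the code from the webrev. The Fetcher and Admin
interfaces, isTerminated(), close(), fetchWithRetry() and demo() are
hypothetical stand-ins, and rethrowDeserializationException() is
simplified to look only for NotSerializableException in the cause chain.

```java
import java.io.IOException;
import java.io.NotSerializableException;

final class FetchRetrySketch {
    /** Hypothetical stand-in for the remote fetchNotifs call. */
    interface Fetcher { String fetch() throws IOException; }

    /** Hypothetical stand-in for communicatorAdmin plus connection state. */
    interface Admin {
        void gotIOException(IOException ioe) throws IOException;
        boolean isTerminated();
        void close();
    }

    /** Simplified: rethrow a deserialization problem found in the cause chain. */
    static void rethrowDeserializationException(IOException ioe)
            throws NotSerializableException {
        for (Throwable t = ioe; t != null; t = t.getCause()) {
            if (t instanceof NotSerializableException) {
                throw (NotSerializableException) t;
            }
        }
    }

    static String fetchWithRetry(Fetcher fetcher, Admin admin) throws IOException {
        boolean retried = false;
        while (true) {
            try {
                return fetcher.fetch();
            } catch (IOException ioe) {
                // Deserialization issues go to the caller unchanged.
                rethrowDeserializationException(ioe);
                try {
                    admin.gotIOException(ioe);
                    // Case 1: reconnection OK, loop and fetch again.
                } catch (IOException fromAdmin) {
                    if (admin.isTerminated()) {
                        throw ioe;              // case 2-1: already closed
                    }
                    if (retried) {
                        admin.close();          // case 2-2, second failure
                        throw ioe;
                    }
                    retried = true;             // case 2-2, retry the same batch once
                }
            }
        }
    }

    /** Runs the sketch against a fetcher that fails the first 'failures' calls. */
    static String demo(int failures) {
        int[] calls = {0};
        boolean[] closed = {false};
        Fetcher fetcher = () -> {
            if (calls[0]++ < failures) throw new IOException("transient");
            return "ok";
        };
        Admin admin = new Admin() {
            public void gotIOException(IOException ioe) throws IOException { throw ioe; }
            public boolean isTerminated() { return false; }
            public void close() { closed[0] = true; }
        };
        try {
            return fetchWithRetry(fetcher, admin);
        } catch (IOException e) {
            return "failed after " + calls[0] + " calls, closed=" + closed[0];
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(1));
        System.out.println(demo(2));
    }
}
```

With one transient failure the single retry succeeds. With two failures
in a row the sketch closes the connection before rethrowing, so the
caller does not end silently while the connection still looks open.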


From jaroslav.bachorik at oracle.com  Thu Sep 11 09:54:58 2014
From: jaroslav.bachorik at oracle.com (Jaroslav Bachorik)
Date: Thu, 11 Sep 2014 11:54:58 +0200
Subject: jmx-dev RFR 8058089:
 api/javax_management/loading/MLetArgsSupport.html\#MLetArgsSupportTest0001
 fails because of java.lang.IllegalArgumentException (argument type
 mismatch)
Message-ID: <54117172.5040700@oracle.com>

Please, review the following regression fix

Issue  : https://bugs.openjdk.java.net/browse/JDK-8058089
Webrev : http://cr.openjdk.java.net/~jbachorik/8058089/webrev.00

The regression was introduced by an en masse update of
`new Integer(param)` to `Integer.valueOf(param)`. For some reason in the
MLet code only `param` was used instead of `Integer.valueOf(param)`.
I've fixed this problem and also took the liberty of converting new Float()
and new Double() to the .valueOf(...) form. I also added a regression test
asserting the correctness of param conversions performed by the MLet class.


Thanks,

-JB-
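
A minimal Java sketch of why passing the raw `param` string fails, under
stated assumptions: the Target class and construct() helper are
hypothetical and only imitate a reflective constructor call with an
Integer parameter; this is not the MLet code itself.

```java
import java.lang.reflect.Constructor;

final class ParamConversionSketch {
    /** Hypothetical class whose constructor expects an Integer. */
    public static final class Target {
        final Integer value;
        public Target(Integer value) { this.value = value; }
    }

    /** Invokes Target(Integer) reflectively with the given argument. */
    static String construct(Object arg) throws ReflectiveOperationException {
        Constructor<Target> ctor = Target.class.getConstructor(Integer.class);
        try {
            return "ok:" + ctor.newInstance(arg).value;
        } catch (IllegalArgumentException e) {
            // Raised by reflection when the argument type does not match.
            return "IllegalArgumentException";
        }
    }

    public static void main(String[] args) throws Exception {
        String param = "42";
        System.out.println(construct(param));                  // unconverted String
        System.out.println(construct(Integer.valueOf(param))); // converted value
    }
}
```

The unconverted String is rejected by the reflective call, while the
value produced by Integer.valueOf(param) is accepted.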

From daniel.fuchs at oracle.com  Thu Sep 11 10:31:52 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Thu, 11 Sep 2014 12:31:52 +0200
Subject: jmx-dev Review request: 8049303: Transient network problems
 cause JMX thread to fail silenty
In-Reply-To: <5410AA3C.70009@oracle.com>
References: <54107175.7030200@oracle.com> <5410825F.80504@oracle.com>
	<541087D1.9080109@oracle.com> <5410AA3C.70009@oracle.com>
Message-ID: <54117A18.4040206@oracle.com>

On 9/10/14 9:45 PM, shanliang wrote:
> Oh, not one retry attempt fetching the next batch of notifications, but
> the *SAME* batch of notifications.
>
> http://cr.openjdk.java.net/~sjiang/JDK-8049303/02/
>
> Shanliang
>>

This looks good Shanliang!

Make sure to rerun all the JCK/JDK tests before pushing.
This was really a tricky problem!

best regards,

-- daniel

From daniel.fuchs at oracle.com  Thu Sep 11 10:38:02 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Thu, 11 Sep 2014 12:38:02 +0200
Subject: jmx-dev RFR 8058089:
 api/javax_management/loading/MLetArgsSupport.html\#MLetArgsSupportTest0001
 fails because of java.lang.IllegalArgumentException (argument type
 mismatch)
In-Reply-To: <54117172.5040700@oracle.com>
References: <54117172.5040700@oracle.com>
Message-ID: <54117B8A.7070101@oracle.com>

Hi Jaroslav,

This looks good.

best regards,

-- daniel

On 9/11/14 11:54 AM, Jaroslav Bachorik wrote:
> Please, review the following regression fix
>
> Issue  : https://bugs.openjdk.java.net/browse/JDK-8058089
> Webrev : http://cr.openjdk.java.net/~jbachorik/8058089/webrev.00
>
> The regression was introduced by an en masse update of
> `new Integer(param)` to `Integer.valueOf(param)`. For some reason in the
> MLet code only `param` was used instead of `Integer.valueOf(param)`.
> I've fixed this problem and also took the liberty of converting new Float()
> and new Double() to the .valueOf(...) form. I also added a regression test
> asserting the correctness of param conversions performed by the MLet class.
>
>
> Thanks,
>
> -JB-


From jaroslav.bachorik at oracle.com  Thu Sep 11 10:49:18 2014
From: jaroslav.bachorik at oracle.com (Jaroslav Bachorik)
Date: Thu, 11 Sep 2014 12:49:18 +0200
Subject: jmx-dev Review request: 8049303: Transient network problems
 cause JMX thread to fail silenty
In-Reply-To: <54117A18.4040206@oracle.com>
References: <54107175.7030200@oracle.com>
	<5410825F.80504@oracle.com>	<541087D1.9080109@oracle.com>
	<5410AA3C.70009@oracle.com> <54117A18.4040206@oracle.com>
Message-ID: <54117E2E.4060005@oracle.com>

Hi,

On 09/11/2014 12:31 PM, Daniel Fuchs wrote:
> On 9/10/14 9:45 PM, shanliang wrote:
>> Oh, not one retry attempt fetching the next batch of notifications, but
>> the *SAME* batch of notifications.
>>
>> http://cr.openjdk.java.net/~sjiang/JDK-8049303/02/
>>
>> Shanliang
>>>
>
> This looks good Shanliang!
>

I have just one nit - rename "throwsDeserializationException()" to 
"rethrowDeserializationException()" - it makes its purpose clear.

Otherwise - Thumbs Up!

Cheers,

-JB-
> Make sure to rerun all the JCK/JDK tests before pushing.
> This was really a tricky problem!
>
> best regards,
>
> -- daniel


From shanliang.jiang at oracle.com  Thu Sep 11 13:56:11 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Thu, 11 Sep 2014 15:56:11 +0200
Subject: jmx-dev Review request: 8049303: Transient network problems
 cause JMX thread to fail silenty
In-Reply-To: <54117A18.4040206@oracle.com>
References: <54107175.7030200@oracle.com> <5410825F.80504@oracle.com>
	<541087D1.9080109@oracle.com> <5410AA3C.70009@oracle.com>
	<54117A18.4040206@oracle.com>
Message-ID: <5411A9FB.8020904@oracle.com>

Daniel Fuchs wrote:
> On 9/10/14 9:45 PM, shanliang wrote:
>> Oh, not one retry attempt fetching the next batch of notifications, but
>> the *SAME* batch of notifications.
>>
>> http://cr.openjdk.java.net/~sjiang/JDK-8049303/02/
>>
>> Shanliang
>>>
>
> This looks good Shanliang!
>
> Make sure to rerun all the JCK/JDK tests before pushing.
Yes already ran, but will do once again to make sure.
> This was really a tricky problem!
Yes, it is!

Thanks for your help!
Shanliang
>
> best regards,
>
> -- daniel


From shanliang.jiang at oracle.com  Thu Sep 11 13:58:03 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Thu, 11 Sep 2014 15:58:03 +0200
Subject: jmx-dev Review request: 8049303: Transient network problems
 cause JMX thread to fail silenty
In-Reply-To: <54117E2E.4060005@oracle.com>
References: <54107175.7030200@oracle.com>	<5410825F.80504@oracle.com>	<541087D1.9080109@oracle.com>	<5410AA3C.70009@oracle.com>
	<54117A18.4040206@oracle.com> <54117E2E.4060005@oracle.com>
Message-ID: <5411AA6B.1050404@oracle.com>

Jaroslav Bachorik wrote:
> Hi,
>
> On 09/11/2014 12:31 PM, Daniel Fuchs wrote:
>> On 9/10/14 9:45 PM, shanliang wrote:
>>> Oh, not one retry attempt fetching the next batch of notifications, but
>>> the *SAME* batch of notifications.
>>>
>>> http://cr.openjdk.java.net/~sjiang/JDK-8049303/02/
>>>
>>> Shanliang
>>>>
>>
>> This looks good Shanliang!
>>
>
> I have just one nit - rename "throwsDeserializationException()" to 
> "rethrowDeserializationException()" - it makes its purpose clear.
I have already added internal comments to explain the call, but why not.
>
> Otherwise - Thumbs Up!
Thanks for review!

Shanliang
>
> Cheers,
>
> -JB-
>> Make sure to rerun all the JCK/JDK tests before pushing.
>> This was really a tricky problem!
>>
>> best regards,
>>
>> -- daniel
>


From daniel.fuchs at oracle.com  Thu Sep 11 14:04:30 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Thu, 11 Sep 2014 16:04:30 +0200
Subject: jmx-dev Review request: 8049303: Transient network problems
 cause JMX thread to fail silenty
In-Reply-To: <5411AA6B.1050404@oracle.com>
References: <54107175.7030200@oracle.com>	<5410825F.80504@oracle.com>	<541087D1.9080109@oracle.com>	<5410AA3C.70009@oracle.com>	<54117A18.4040206@oracle.com>
	<54117E2E.4060005@oracle.com> <5411AA6B.1050404@oracle.com>
Message-ID: <5411ABEE.7060607@oracle.com>

Hi Shanliang,

If there's still time - you might want to add the following
comment too:

1353                     try {
                              // The server may have closed the
                              // connection, expecting us to reconnect.
                              // The communicatorAdmin will handle
                              // this case.
1354                         communicatorAdmin.gotIOException(ioe);
1355                         // reconnection OK, back to "while" to do again

No need to generate a new webrev!

best,

-- daniel

On 9/11/14 3:58 PM, shanliang wrote:
> Jaroslav Bachorik wrote:
>> Hi,
>>
>> On 09/11/2014 12:31 PM, Daniel Fuchs wrote:
>>> On 9/10/14 9:45 PM, shanliang wrote:
>>>> Oh, not one retry attempt fetching the next batch of notifications, but
>>>> the *SAME* batch of notifications.
>>>>
>>>> http://cr.openjdk.java.net/~sjiang/JDK-8049303/02/
>>>>
>>>> Shanliang
>>>>>
>>>
>>> This looks good Shanliang!
>>>
>>
>> I have just one nit - rename "throwsDeserializationException()" to
>> "rethrowDeserializationException()" - it makes its purpose clear.
> I have already added internal comments to explain the call, but why not.
>>
>> Otherwise - Thumbs Up!
> Thanks for review!
>
> Shanliang
>>
>> Cheers,
>>
>> -JB-
>>> Make sure to rerun all the JCK/JDK tests before pushing.
>>> This was really a tricky problem!
>>>
>>> best regards,
>>>
>>> -- daniel
>>
>


From shanliang.jiang at oracle.com  Mon Sep 15 13:05:32 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Mon, 15 Sep 2014 15:05:32 +0200
Subject: jmx-dev Review request: 8042205: javax/management/monitor/*: some
 test didn't get all the notifications
In-Reply-To: <54107175.7030200@oracle.com>
References: <54107175.7030200@oracle.com>
Message-ID: <5416E41C.4080605@oracle.com>

Hi,

Please review the following fix, I changed the way to check received 
notifications.

Bug: https://bugs.openjdk.java.net/browse/JDK-8042205
Webrev: http://cr.openjdk.java.net/~sjiang/JDK-8042205/00/

Thanks, shanliang


From daniel.fuchs at oracle.com  Mon Sep 15 13:37:39 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Mon, 15 Sep 2014 15:37:39 +0200
Subject: jmx-dev Review request: 8042205: javax/management/monitor/*:
 some test didn't get all the notifications
In-Reply-To: <5416E41C.4080605@oracle.com>
References: <54107175.7030200@oracle.com> <5416E41C.4080605@oracle.com>
Message-ID: <5416EBA3.6060904@oracle.com>

Looks good Shanliang.

The synchronization is a bit strange, with the flag being
volatile and sometimes modified within synchronized blocks and
sometimes modified outside of any s-block, but I believe
it is working (AFAIU the synchronized is mostly needed because
you call notifyAll() and wait() and the fact that the flag is
also modified within the block is just coincidence ;-) ).
I'm OK with this.

-- daniel
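
A minimal Java sketch of the pattern discussed above, not the code from
the webrev: a volatile flag that is set inside a synchronized block only
because notifyAll() needs the monitor, and reset outside of any block.
All names here (NotificationFlagSketch, onNotification, reset, await,
demo) are hypothetical.

```java
final class NotificationFlagSketch {
    private final Object lock = new Object();
    // volatile, so writes made outside the synchronized block are still visible
    private volatile boolean received;

    /** Called by the notification listener. */
    void onNotification() {
        synchronized (lock) {     // needed for notifyAll(), not for the flag itself
            received = true;
            lock.notifyAll();
        }
    }

    /** Resets the flag outside of any synchronized block. */
    void reset() {
        received = false;
    }

    /** Waits until the flag is set or the timeout elapses. */
    boolean await(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (lock) {     // needed for wait()
            while (!received) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    return false;
                }
                lock.wait(remainingMillis);
            }
            return true;
        }
    }

    static boolean demo(boolean sendNotification) {
        NotificationFlagSketch flag = new NotificationFlagSketch();
        if (sendNotification) {
            Thread t = new Thread(flag::onNotification);
            t.setDaemon(true);
            t.start();
        }
        try {
            return flag.await(sendNotification ? 5000 : 50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(true));
        System.out.println(demo(false));
    }
}
```

Checking the flag in a loop inside the synchronized block guards against
spurious wakeups and against a notification that arrives before the
waiter starts waiting.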

On 9/15/14 3:05 PM, shanliang wrote:
> Hi,
>
> Please review the following fix, I changed the way to check received
> notifications.
>
> Bug: https://bugs.openjdk.java.net/browse/JDK-8042205
> Webrev: http://cr.openjdk.java.net/~sjiang/JDK-8042205/00/
>
> Thanks, shanliang
>


From shanliang.jiang at oracle.com  Mon Sep 15 14:59:18 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Mon, 15 Sep 2014 16:59:18 +0200
Subject: jmx-dev Review request: 8042205: javax/management/monitor/*:
 some test didn't get all the notifications
In-Reply-To: <5416EBA3.6060904@oracle.com>
References: <54107175.7030200@oracle.com> <5416E41C.4080605@oracle.com>
	<5416EBA3.6060904@oracle.com>
Message-ID: <5416FEC6.20100@oracle.com>

Daniel Fuchs wrote:
> Looks good Shanliang.
>
> The synchronization is a bit strange, with the flag being
> volatile and sometime modified within synchronized blocks and
> sometime being modified outside of any s-block, but I believe
> it is working (AFAIU the synchronized is mostly needed because
> you call notifyAll() and wait() and the fact that the flag is
> also modified within the block is just coincidence ;-) ).
> I'm OK with this.
Indeed, it is not necessary to modify the flag within the
synchronized block, but it is harmless and a "coincidence".

Here is the new version with the modification for test/ProblemList.txt, 
simply removing the tests modified.

http://cr.openjdk.java.net/~sjiang/JDK-8042205/01/

Thanks,
Daniel
>
> -- daniel
>
> On 9/15/14 3:05 PM, shanliang wrote:
>> Hi,
>>
>> Please review the following fix, I changed the way to check received
>> notifications.
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8042205
>> Webrev: http://cr.openjdk.java.net/~sjiang/JDK-8042205/00/
>>
>> Thanks, shanliang
>>
>


From daniel.fuchs at oracle.com  Mon Sep 15 15:25:04 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Mon, 15 Sep 2014 17:25:04 +0200
Subject: jmx-dev Review request: 8042205: javax/management/monitor/*:
 some test didn't get all the notifications
In-Reply-To: <5416FEC6.20100@oracle.com>
References: <54107175.7030200@oracle.com> <5416E41C.4080605@oracle.com>
	<5416EBA3.6060904@oracle.com> <5416FEC6.20100@oracle.com>
Message-ID: <541704D0.9050601@oracle.com>

Looks good!

-- daniel

On 9/15/14 4:59 PM, shanliang wrote:
> Daniel Fuchs wrote:
>> Looks good Shanliang.
>>
>> The synchronization is a bit strange, with the flag being
>> volatile and sometime modified within synchronized blocks and
>> sometime being modified outside of any s-block, but I believe
>> it is working (AFAIU the synchronized is mostly needed because
>> you call notifyAll() and wait() and the fact that the flag is
>> also modified within the block is just coincidence ;-) ).
>> I'm OK with this.
> Indeed it is not necessary to modify the flag within the
> synchronization, but is harmless and "coincidence".
>
> Here is the new version with the modification for test/ProblemList.txt,
> simply removing the tests modified.
>
> http://cr.openjdk.java.net/~sjiang/JDK-8042205/01/
>
> Thanks,
> Daniel
>>
>> -- daniel
>>
>> On 9/15/14 3:05 PM, shanliang wrote:
>>> Hi,
>>>
>>> Please review the following fix, I changed the way to check received
>>> notifications.
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8042205
>>> Webrev: http://cr.openjdk.java.net/~sjiang/JDK-8042205/00/
>>>
>>> Thanks, shanliang
>>>
>>
>


From shanliang.jiang at oracle.com  Tue Sep 16 09:12:35 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Tue, 16 Sep 2014 11:12:35 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
Message-ID: <5417FF03.5090104@oracle.com>

Hi,

Please review the following fix:

bug: https://bugs.openjdk.java.net/browse/JDK-8050115
webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/

Thanks,
Shanliang

From daniel.fuchs at oracle.com  Tue Sep 16 09:42:30 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Tue, 16 Sep 2014 11:42:30 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <5417FF03.5090104@oracle.com>
References: <5417FF03.5090104@oracle.com>
Message-ID: <54180606.2050300@oracle.com>

Hi Shanliang,

line 116 - you could use a CountDownLatch instead of an
AtomicInteger. It would avoid having to use the busy loop at
lines 134-136.

I also wonder whether you could increase the sleep timeout
at line 107 - to make that loop a bit less busy.
Unless that would alter the test too much and make the
deadlock less probable?

Otherwise the changes look reasonable.

Have you tried reproducing the failure (before fixing) on your
own machine using the same config as what was reported to
fail? (fastdebug build with all the -server + -XX etc options?)

best regards,

-- daniel
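
A minimal Java sketch of the CountDownLatch suggestion above, under
stated assumptions: the class, method and thread names are hypothetical
and the workers do nothing but count down; this is not the test's code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LatchSketch {
    /**
     * Starts the given number of worker threads and blocks until all of
     * them have counted down, or until the timeout elapses. No busy loop
     * polling a counter is needed.
     */
    static boolean awaitWorkers(int workers, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(workers);
        for (int i = 0; i < workers; i++) {
            // Each worker signals completion exactly once.
            Thread t = new Thread(done::countDown, "worker-" + i);
            t.setDaemon(true);
            t.start();
        }
        try {
            // Returns true if the count reached zero before the timeout.
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitWorkers(3, 5000));
    }
}
```

Compared with polling an AtomicInteger and sleeping between checks, the
waiting thread is parked until the last countDown() call or the timeout.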


On 9/16/14 11:12 AM, shanliang wrote:
> Hi,
>
> Please review the following fix:
>
> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>
> Thanks,
> Shanliang


From shanliang.jiang at oracle.com  Tue Sep 16 11:24:54 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Tue, 16 Sep 2014 13:24:54 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <54180606.2050300@oracle.com>
References: <5417FF03.5090104@oracle.com> <54180606.2050300@oracle.com>
Message-ID: <54181E06.3090606@oracle.com>

Daniel Fuchs wrote:
> Hi Shanliang,
>
> line 116 - you could use a CountDownLatch instead of an
> AtomicInteger. It would avoid having to use the busy loop at
> lines 134-136.
Yes, CountDownLatch is really a good idea, but I tried to modify the code
as little as possible and prefer to keep the old code this time. Another
reason is that we would still need to do the same thing at line 103.
>
> I also wonder whether you could increase the sleep timeout
> at line 107 - to make that loop a bit less busy.
> Unless that would alter the test too much and make the
> deadlock less probable?
Line 95: monitorProxy.setGranularityPeriod(10L); // 10 ms
So waiting 10ms seems reasonable.
>
> Otherwise the changes look reasonable.
>
> Have you tried reproducing the failure (before fixing) on your
> own machine using the same config than what was reported to
> fail? (fastdebug build with all the -server + -XX etc options?)
I reproduced the bug only by reducing the waiting time.

Thanks Daniel!
Shanliang
>
> best regards,
>
> -- daniel
>
>
> On 9/16/14 11:12 AM, shanliang wrote:
>> Hi,
>>
>> Please review the following fix:
>>
>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>
>> Thanks,
>> Shanliang
>


From david.holmes at oracle.com  Tue Sep 16 11:58:19 2014
From: david.holmes at oracle.com (David Holmes)
Date: Tue, 16 Sep 2014 21:58:19 +1000
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <5417FF03.5090104@oracle.com>
References: <5417FF03.5090104@oracle.com>
Message-ID: <541825DB.3000300@oracle.com>

Hi Shanliang,

On 16/09/2014 7:12 PM, shanliang wrote:
> Hi,
>
> Please review the following fix:

I don't see any functional change. You seem to have replaced a built-in 
timeout with the externally applied test harness timeout.

Style nit: add a space after 'while' -> while (cond) {

David
-----

> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>
> Thanks,
> Shanliang

From daniel.fuchs at oracle.com  Tue Sep 16 12:28:26 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Tue, 16 Sep 2014 14:28:26 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <54181E06.3090606@oracle.com>
References: <5417FF03.5090104@oracle.com> <54180606.2050300@oracle.com>
	<54181E06.3090606@oracle.com>
Message-ID: <54182CEA.2060404@oracle.com>

On 9/16/14 1:24 PM, shanliang wrote:
> Daniel Fuchs wrote:
>> Hi Shanliang,
>>
>> line 116 - you could use a CountDownLatch instead of an
>> AtomicInteger. It would avoid having to use the busy loop at
>> lines 134-136.
> Yes CountDownLatch is really a good idea, I tried to modify the code as
> less as possible, I prefer to keep the old code this time, another
> reason is that we still need to do same thing at line 103.
>>
>> I also wonder whether you could increase the sleep timeout
>> at line 107 - to make that loop a bit less busy.
>> Unless that would alter the test too much and make the
>> deadlock less probable?
> Line 95: monitorProxy.setGranularityPeriod(10L); // 10 ms
> So waiting 10ms seems reasonable.
>>
>> Otherwise the changes look reasonable.
>>
>> Have you tried reproducing the failure (before fixing) on your
>> own machine using the same config than what was reported to
>> fail? (fastdebug build with all the -server + -XX etc options?)
> I reproduced the bug only by reducing the waiting time.

Well - if your changes fix it - then push it :-)

-- daniel

>
> Thanks Daniel!
> Shanliang
>>
>> best regards,
>>
>> -- daniel
>>
>>
>> On 9/16/14 11:12 AM, shanliang wrote:
>>> Hi,
>>>
>>> Please review the following fix:
>>>
>>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>>
>>> Thanks,
>>> Shanliang
>>
>


From shanliang.jiang at oracle.com  Tue Sep 16 21:01:40 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Tue, 16 Sep 2014 23:01:40 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <541825DB.3000300@oracle.com>
References: <5417FF03.5090104@oracle.com> <541825DB.3000300@oracle.com>
Message-ID: <5418A534.6050204@oracle.com>

David Holmes wrote:
> Hi Shanliang,
>
> On 16/09/2014 7:12 PM, shanliang wrote:
>> Hi,
>>
>> Please review the following fix:
>
> I don't see any functional change. You seem to have replaced a 
> built-in timeout with the externally applied test harness timeout.
Yes, no functional change here. We thought that the test needed more time
to wait for a change if the testing VM or machine was really slow; the test
harness timeout is the maximum time we can give the test.
>
> Style nit: add a space after 'while' -> while (cond) {
OK, I will do it before pushing.

Thanks,
Shanliang
>
> David
> -----
>
>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>
>> Thanks,
>> Shanliang


From david.holmes at oracle.com  Wed Sep 17 00:22:18 2014
From: david.holmes at oracle.com (David Holmes)
Date: Wed, 17 Sep 2014 10:22:18 +1000
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <5418A534.6050204@oracle.com>
References: <5417FF03.5090104@oracle.com> <541825DB.3000300@oracle.com>
	<5418A534.6050204@oracle.com>
Message-ID: <5418D43A.50206@oracle.com>

On 17/09/2014 7:01 AM, shanliang wrote:
> David Holmes wrote:
>> Hi Shanliang,
>>
>> On 16/09/2014 7:12 PM, shanliang wrote:
>>> Hi,
>>>
>>> Please review the following fix:
>>
>> I don't see any functional change. You seem to have replaced a
>> built-in timeout with the externally applied test harness timeout.
> Yes no functional change here, we thought that the test needed more time
> to wait a change if a testing VM or machine was really slow, the test
> harness timeout was the maximum time we could give the test.

Do we have confidence that the harness timeout is sufficient to handle 
the intermittent failures?

Thanks,
David


>>
>> Style nit: add a space after 'while' -> while (cond) {
> OK, I will do it before pushing.
>
> Thanks,
> Shanliang
>>
>> David
>> -----
>>
>>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>>
>>> Thanks,
>>> Shanliang
>

From shanliang.jiang at oracle.com  Wed Sep 17 08:55:44 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Wed, 17 Sep 2014 10:55:44 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <5418D43A.50206@oracle.com>
References: <5417FF03.5090104@oracle.com> <541825DB.3000300@oracle.com>
	<5418A534.6050204@oracle.com> <5418D43A.50206@oracle.com>
Message-ID: <54194C90.2080703@oracle.com>

David Holmes wrote:
> On 17/09/2014 7:01 AM, shanliang wrote:
>> David Holmes wrote:
>>> Hi Shanliang,
>>>
>>> On 16/09/2014 7:12 PM, shanliang wrote:
>>>> Hi,
>>>>
>>>> Please review the following fix:
>>>
>>> I don't see any functional change. You seem to have replaced a
>>> built-in timeout with the externally applied test harness timeout.
>> Yes no functional change here, we thought that the test needed more time
>> to wait a change if a testing VM or machine was really slow, the test
>> harness timeout was the maximum time we could give the test.
>
> Do we have confidence that the harness timeout is sufficient to handle 
> the intermittent failures?
Really a good question :)

Here is new version:
    http://cr.openjdk.java.net/~sjiang/JDK-8050115/01/

I added a deadlock check every second, hoping to get more info in
case of failure.

I also changed the sleep time to 100ms; 10ms seems too short, as Daniel
pointed out.

Thanks,
Shanliang
>
> Thanks,
> David
>
>
>>>
>>> Style nit: add a space after 'while' -> while (cond) {
>> OK, I will do it before pushing.
>>
>> Thanks,
>> Shanliang
>>>
>>> David
>>> -----
>>>
>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>>>
>>>> Thanks,
>>>> Shanliang
>>


From daniel.fuchs at oracle.com  Wed Sep 17 09:19:04 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Wed, 17 Sep 2014 11:19:04 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <54194C90.2080703@oracle.com>
References: <5417FF03.5090104@oracle.com>
	<541825DB.3000300@oracle.com>	<5418A534.6050204@oracle.com>
	<5418D43A.50206@oracle.com> <54194C90.2080703@oracle.com>
Message-ID: <54195208.4030309@oracle.com>

On 9/17/14 10:55 AM, shanliang wrote:
> David Holmes wrote:
>> On 17/09/2014 7:01 AM, shanliang wrote:
>>> David Holmes wrote:
>>>> Hi Shanliang,
>>>>
>>>> On 16/09/2014 7:12 PM, shanliang wrote:
>>>>> Hi,
>>>>>
>>>>> Please review the following fix:
>>>>
>>>> I don't see any functional change. You seem to have replaced a
>>>> built-in timeout with the externally applied test harness timeout.
>>> Yes no functional change here, we thought that the test needed more time
>>> to wait a change if a testing VM or machine was really slow, the test
>>> harness timeout was the maximum time we could give the test.
>>
>> Do we have confidence that the harness timeout is sufficient to handle
>> the intermittent failures?
> Really a good question :)
>
> Here is new version:
>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/01/
>
> I added a deadlocked check in every 1 second, hope to get more info in
> case of failure.

The following comment seems to imply that this check is not
very useful:

  112             // This won't show up as a deadlock in CTRL-\ or in
  113             // ThreadMXBean.findDeadlockedThreads(), because they don't
  114             // see that thread A is waiting for thread B (B.join()), and
  115             // thread B is waiting for a lock held by thread A

best regards,

-- daniel
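
A minimal Java sketch of the situation that the quoted comment
describes, under stated assumptions: the class and method names are
hypothetical and the threads only imitate the A/B pattern (A holds a
lock and joins B, B waits for that lock). It asks ThreadMXBean whether
it reports a deadlock and then interrupts A to break the cycle.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class JoinDeadlockSketch {
    /** Returns true if the JVM reports a deadlock for the join-based cycle. */
    static boolean jvmReportsJoinDeadlock() {
        final Object lock = new Object();
        final Thread b = new Thread(() -> {
            synchronized (lock) {
                // Entered only once A releases the lock.
            }
        }, "B");
        final Thread a = new Thread(() -> {
            synchronized (lock) {
                b.start();
                try {
                    b.join();   // A waits for B while still holding the lock
                } catch (InterruptedException e) {
                    // Interrupted by the caller to break the cycle.
                }
            }
        }, "A");
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        try {
            // Wait (at most about 5 seconds) until both threads are stuck.
            long deadline = System.nanoTime() + 5_000_000_000L;
            while ((a.getState() != Thread.State.WAITING
                    || b.getState() != Thread.State.BLOCKED)
                    && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            long[] ids = mx.isSynchronizerUsageSupported()
                    ? mx.findDeadlockedThreads()
                    : mx.findMonitorDeadlockedThreads();
            return ids != null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            a.interrupt();      // let A leave join() so that B can finish
        }
    }

    public static void main(String[] args) {
        System.out.println(jvmReportsJoinDeadlock());
    }
}
```

Thread A is parked in join(), which is a wait and not a blocked monitor
entry, so the deadlock detection has no cycle of lock owners to report.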

>
> I changed also the sleep time to 100ms, 10ms seems too short as Daniel
> pointed out.
>
> Thanks,
> Shanliang
>>
>> Thanks,
>> David
>>
>>
>>>>
>>>> Style nit: add a space after 'while' -> while (cond) {
>>> OK, I will do it before pushing.
>>>
>>> Thanks,
>>> Shanliang
>>>>
>>>> David
>>>> -----
>>>>
>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>>>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>>>>
>>>>> Thanks,
>>>>> Shanliang
>>>
>


From shanliang.jiang at oracle.com  Wed Sep 17 12:19:48 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Wed, 17 Sep 2014 14:19:48 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <54195208.4030309@oracle.com>
References: <5417FF03.5090104@oracle.com>
	<541825DB.3000300@oracle.com>	<5418A534.6050204@oracle.com>
	<5418D43A.50206@oracle.com> <54194C90.2080703@oracle.com>
	<54195208.4030309@oracle.com>
Message-ID: <54197C64.5050605@oracle.com>

Daniel,

The test performs two verification steps. The new check is useful for the
first step, and the trace in the bug shows that the test failed on the
first step.

Yes, the check might not work for the second step, so I added new code
for the second step that checks the state of the tested thread, hoping to
get useful info if the test fails on the second step.

Here is the new version:
    http://cr.openjdk.java.net/~sjiang/JDK-8050115/02/

Thanks,
Shanliang


Daniel Fuchs wrote:
> On 9/17/14 10:55 AM, shanliang wrote:
>> David Holmes wrote:
>>> On 17/09/2014 7:01 AM, shanliang wrote:
>>>> David Holmes wrote:
>>>>> Hi Shanliang,
>>>>>
>>>>> On 16/09/2014 7:12 PM, shanliang wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Please review the following fix:
>>>>>
>>>>> I don't see any functional change. You seem to have replaced a
>>>>> built-in timeout with the externally applied test harness timeout.
>>>> Yes no functional change here, we thought that the test needed more 
>>>> time
>>>> to wait a change if a testing VM or machine was really slow, the test
>>>> harness timeout was the maximum time we could give the test.
>>>
>>> Do we have confidence that the harness timeout is sufficient to handle
>>> the intermittent failures?
>> Really a good question :)
>>
>> Here is new version:
>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/01/
>>
>> I added a deadlocked check in every 1 second, hope to get more info in
>> case of failure.
>
> The following comment seems to imply that this check is not
> very useful:
>
>  112             // This won't show up as a deadlock in CTRL-\ or in
>  113             // ThreadMXBean.findDeadlockedThreads(), because they 
> don't
>  114             // see that thread A is waiting for thread B 
> (B.join()), and
>  115             // thread B is waiting for a lock held by thread A
>
> best regards,
>
> -- daniel
>
>>
>> I changed also the sleep time to 100ms, 10ms seems too short as Daniel
>> pointed out.
>>
>> Thanks,
>> Shanliang
>>>
>>> Thanks,
>>> David
>>>
>>>
>>>>>
>>>>> Style nit: add a space after 'while' -> while (cond) {
>>>> OK, I will do it before pushing.
>>>>
>>>> Thanks,
>>>> Shanliang
>>>>>
>>>>> David
>>>>> -----
>>>>>
>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>>>>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>>>>>
>>>>>> Thanks,
>>>>>> Shanliang
>>>>
>>
>


From daniel.fuchs at oracle.com  Wed Sep 17 13:29:44 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Wed, 17 Sep 2014 15:29:44 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <54197C64.5050605@oracle.com>
References: <5417FF03.5090104@oracle.com>
	<541825DB.3000300@oracle.com>	<5418A534.6050204@oracle.com>
	<5418D43A.50206@oracle.com> <54194C90.2080703@oracle.com>
	<54195208.4030309@oracle.com> <54197C64.5050605@oracle.com>
Message-ID: <54198CC8.40104@oracle.com>

Hi Shanliang,

On 9/17/14 2:19 PM, shanliang wrote:
> Daniel,
>
> The test does 2 steps of verifications, the new check is useful for the
> first step, and the trace in the bug showed that the test failed on the
> first step.
>
> Yes the check might not work for the second step, I added the new code
> for the second step to check the tested thread state and hope to have
> useful info if the test failed on the second step.
>
> Here is the new version:
>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/02/
>
> Thanks,
> Shanliang

If I understand the issue correctly - the test times out
mostly on very slow machines/configurations (fastdebug with some
combinations of options).

I worry that printing a thread dump every second (1000 ms) is going
to make things worse: the test will spend its time printing thread
dumps instead of doing what it is supposed to do - and will have
even fewer CPU cycles to execute its 'real' code.

I would have advised printing the thread dumps only at the end,
when it is detected that there might be a deadlock - except that
now we can't do that since the timeout is managed completely
by the harness (so we don't get the upper hand at the end in
case of timeout).

I think depending on the harness to set the appropriate timeout
rather than depending on an arbitrary timeout set in the test itself
is the right thing to do. It's been a pattern in many tests that
timed out intermittently on some slow machines/configurations.

In any case - 1s seems really too frequent.
I suppose you could inspect the system properties set by the harness
(timeout + timeout factor) to devise an acceptable frequency for
your checks - if you really want to print this info.

In the log I see that the timeout factor passed to the harness
for the slow configuration that failed is
-Dtest.timeout.factor=8.0
There's no explicit timeout given - and jtreg -onlineHelp reveals
that in this case the default timeout is two minutes.

This means that the harness has allocated 2*8=16mins for the test to
execute.
I don't think you want to take the risk of printing a thread dump
every second for 16 minutes ;-)

Of course I'm oversimplifying here. Before your changes - the test
was deciding after 46.893 seconds that there must be a deadlock.
47s is obviously way too short for a possibly slow machine running
the test in fastdebug mode.

Something like the following might be more reasonable:

// default timeout factor is 1.0
double factor = Double.parseDouble(
        System.getProperty("test.timeout.factor", "1.0"));
// default timeout is 2mins = 120s.
double timeout = Double.parseDouble(
        System.getProperty("test.timeout", "120"));

// total time is timeout * timeout factor * 1000 ms;
// cast the whole product, since (long) binds tighter than *
long total = (long) (factor * timeout * 1000);

// Don't print thread dumps too often.
// every 5s for a total timeout of 120s seems reasonable.
// 120s/5s = 24; we will lengthen the delay if the total
// timeout is greater than 120s, so we're taking the max between
// 5s and total/24
long delayBetweenThreadDumps = Math.max(5000, total / 24);

Of course 5s and total/24 are just arbitrary...
But 24 full thread dumps in a log for a single test is enough data
to analyze I think ;-)
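
For the 8.0 factor seen in the failing log, the arithmetic works out to
one dump every 40s. A self-contained sketch of the same calculation
follows; the class and method names are hypothetical, and whether the
harness actually sets a "test.timeout" property is an assumption carried
over from the snippet above:

```java
// Sketch only: ThreadDumpDelay and its methods are illustrative names,
// not part of the test or of jtreg.
final class ThreadDumpDelay {
    static final long MIN_DELAY_MS = 5000; // never dump more often than every 5s
    static final int MAX_DUMPS = 24;       // aim for at most ~24 dumps per run

    // Delay between thread dumps for a given timeout factor and timeout (seconds).
    static long delayMillis(double factor, double timeoutSeconds) {
        // Cast the whole product; "(long) factor * ..." would cast only factor.
        long total = (long) (factor * timeoutSeconds * 1000);
        return Math.max(MIN_DELAY_MS, total / MAX_DUMPS);
    }

    // Reads the properties named in this thread, with the same defaults.
    // Assumption: "test.timeout" may not be set by the harness at all.
    static long delayFromSystemProperties() {
        double factor = Double.parseDouble(
                System.getProperty("test.timeout.factor", "1.0"));
        double timeout = Double.parseDouble(
                System.getProperty("test.timeout", "120"));
        return delayMillis(factor, timeout);
    }

    public static void main(String[] args) {
        System.out.println(delayMillis(1.0, 120.0)); // 120000 / 24 = 5000
        System.out.println(delayMillis(8.0, 120.0)); // 960000 / 24 = 40000
    }
}
```

Keeping the arithmetic in a method that takes plain numbers makes it easy
to check without setting any system properties.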

best regards,

-- daniel

>
>
> Daniel Fuchs wrote:
>> On 9/17/14 10:55 AM, shanliang wrote:
>>> David Holmes wrote:
>>>> On 17/09/2014 7:01 AM, shanliang wrote:
>>>>> David Holmes wrote:
>>>>>> Hi Shanliang,
>>>>>>
>>>>>> On 16/09/2014 7:12 PM, shanliang wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Please review the following fix:
>>>>>>
>>>>>> I don't see any functional change. You seem to have replaced a
>>>>>> built-in timeout with the externally applied test harness timeout.
>>>>> Yes no functional change here, we thought that the test needed more
>>>>> time
>>>>> to wait a change if a testing VM or machine was really slow, the test
>>>>> harness timeout was the maximum time we could give the test.
>>>>
>>>> Do we have confidence that the harness timeout is sufficient to handle
>>>> the intermittent failures?
>>> Really a good question :)
>>>
>>> Here is new version:
>>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/01/
>>>
>>> I added a deadlocked check in every 1 second, hope to get more info in
>>> case of failure.
>>
>> The following comment seems to imply that this check is not
>> very useful:
>>
>>  112             // This won't show up as a deadlock in CTRL-\ or in
>>  113             // ThreadMXBean.findDeadlockedThreads(), because they
>> don't
>>  114             // see that thread A is waiting for thread B
>> (B.join()), and
>>  115             // thread B is waiting for a lock held by thread A
>>
>> best regards,
>>
>> -- daniel
>>
>>>
>>> I changed also the sleep time to 100ms, 10ms seems too short as Daniel
>>> pointed out.
>>>
>>> Thanks,
>>> Shanliang
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>
>>>>>>
>>>>>> Style nit: add a space after 'while' -> while (cond) {
>>>>> OK, I will do it before pushing.
>>>>>
>>>>> Thanks,
>>>>> Shanliang
>>>>>>
>>>>>> David
>>>>>> -----
>>>>>>
>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>>>>>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shanliang
>>>>>
>>>
>>
>


From shanliang.jiang at oracle.com  Wed Sep 17 14:43:06 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Wed, 17 Sep 2014 16:43:06 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <54198CC8.40104@oracle.com>
References: <5417FF03.5090104@oracle.com>
	<541825DB.3000300@oracle.com>	<5418A534.6050204@oracle.com>
	<5418D43A.50206@oracle.com> <54194C90.2080703@oracle.com>
	<54195208.4030309@oracle.com> <54197C64.5050605@oracle.com>
	<54198CC8.40104@oracle.com>
Message-ID: <54199DFA.2070105@oracle.com>

Daniel,

We could not be sure that the test failed because of a timeout; that's
why I tried to add more checks.

The check for Step 1: all thread traces are printed only if a
deadlock is found, and the test then fails immediately.
The check for Step 2:
    1) all thread traces are printed only if the tested thread is
blocked; the test does not fail in that case because we cannot be sure
that a deadlock happened, but the info might be helpful;
    2) otherwise only the trace of the tested thread is printed.

If the test gets interrupted again by the test harness, I hope
we can get some useful info from these 2 checks.
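
The two checks described above can be sketched roughly as follows. This
is an illustration only, not the code in the webrev: the class and method
names are hypothetical, and it uses only the standard ThreadMXBean and
Thread APIs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch only: DeadlockChecks and its methods are illustrative names.
final class DeadlockChecks {
    private static final ThreadMXBean THREADS =
            ManagementFactory.getThreadMXBean();

    // Step 1: dump every thread only if the JVM itself reports a deadlock.
    // Returns true when a deadlock was found (the caller would then fail).
    static boolean checkDeadlock() {
        long[] ids = THREADS.findDeadlockedThreads(); // null if none found
        if (ids == null) {
            return false;
        }
        dumpAllThreads();
        return true;
    }

    // Step 2: a BLOCKED thread is only a hint, not proof of a deadlock,
    // so this prints information and leaves the verdict to the caller.
    // Returns true when the tested thread was found BLOCKED.
    static boolean checkThread(Thread tested) {
        if (tested.getState() == Thread.State.BLOCKED) {
            dumpAllThreads();
            return true;
        }
        // Otherwise print only the trace of the tested thread.
        for (StackTraceElement frame : tested.getStackTrace()) {
            System.out.println("\tat " + frame);
        }
        return false;
    }

    private static void dumpAllThreads() {
        for (ThreadInfo info : THREADS.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }
}
```

As noted earlier in the thread, findDeadlockedThreads() does not see a
cycle that goes through Thread.join(), which is why Step 2 looks at the
thread state instead.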

The checks should not be very heavy, but they could still impact the
test. Your suggestion to use test.timeout.factor is a good idea; I added
code to calculate the checking time based on it:
    http://cr.openjdk.java.net/~sjiang/JDK-8050115/03/

Thanks,
Shanliang

Daniel Fuchs wrote:
> Hi Shanliang,
>
> On 9/17/14 2:19 PM, shanliang wrote:
>> Daniel,
>>
>> The test does 2 steps of verifications, the new check is useful for the
>> first step, and the trace in the bug showed that the test failed on the
>> first step.
>>
>> Yes the check might not work for the second step, I added the new code
>> for the second step to check the tested thread state and hope to have
>> useful info if the test failed on the second step.
>>
>> Here is the new version:
>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/02/
>>
>> Thanks,
>> Shanliang
>
> If I understand the issue correctly - the test fails in timeout
> mostly on very slow machines/configurations (fastdebug with some
> combinations of options).
>
> I worry that printing a thread dump every seconds (1000ms) is going
> to make things worse: the test will spend its time printing thread
> dumps instead of doing what it is supposed to do - and will have
> even less CPU cycles to execute its 'real' code.
>
> I would have advised printing the thread dumps only at the end,
> when it is detected that there might be a deadlock - except that
> now we can't do that since the timeout is managed completely
> by the harness (so we don't get the upper hand at the end in
> case of timeout).
>
> I think depending on the harness to set the appropriate timeout
> rather than depending on an arbitrary timeout set in the test itself
> is the right thing to do. It's been a pattern in many tests that
> failed in timeout intermittently on some slow machines/configuration.
>
> In any case - 1s seems really too frequent.
> I suppose you could inspect the system properties set by the harness
> (timeout + timeout factor) to devise an acceptable frequency for
> your checks - if you really want to print this info.
>
> From the log I see that the timeout factor passed to the harness
> for the slow configuration that failed is
> -Dtest.timeout.factor=8.0
> There's no explicit timeout given - and jtreg -onlineHelp reveals
> that in this case the default timeout is two minutes.
>
> This means that the harness has allocated 2*8=16mins for the test to
> execute.
> I don't think you want to take the risk of printing a thread dump
> every seconds during 16 minutes ;-)
>
> Of course I'm over simplifying here. Before your changes - the test
> was deciding after 46.893 seconds that there must be a deadlock.
> 47s is obviously way too short for a possibly slow machine running
> the test in fastdebug mode.
>
> Something like the following might be more reasonable:
>
> // default timeout factor is 1.0
> double factor =
>     Double.parseDouble(
>        System.getProperty("test.timeout.factor", "1.0"));
> // default timeout is 2mins = 120s.
> double timeout = Double.parseDouble(
>        System.getProperty("test.timeout", "120"));
>
> // total time is timeout * timeout factor * 1000 ms
> long total = (long) (factor * timeout * 1000);
>
> // Don't print thread dumps too often.
> // every 5s for a total timeout of 120s seems reasonable.
> // 120s/5s = 24; we will lengthen the delay if the total
> // timeout is greater than 120s, so we're taking the max between
> // 5s and total/24
> long delayBetweenThreadDumps = Math.max(5000, total/24);
>
> Of course 5s and total/24 are just arbitrary...
> But 24 full thread dumps in a log for a single test is enough data
> to analyze I think ;-)
>
> best regards,
>
> -- daniel
>
>>
>>
>> Daniel Fuchs wrote:
>>> On 9/17/14 10:55 AM, shanliang wrote:
>>>> David Holmes wrote:
>>>>> On 17/09/2014 7:01 AM, shanliang wrote:
>>>>>> David Holmes wrote:
>>>>>>> Hi Shanliang,
>>>>>>>
>>>>>>> On 16/09/2014 7:12 PM, shanliang wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Please review the following fix:
>>>>>>>
>>>>>>> I don't see any functional change. You seem to have replaced a
>>>>>>> built-in timeout with the externally applied test harness timeout.
>>>>>> Yes no functional change here, we thought that the test needed more
>>>>>> time
>>>>>> to wait a change if a testing VM or machine was really slow, the 
>>>>>> test
>>>>>> harness timeout was the maximum time we could give the test.
>>>>>
>>>>> Do we have confidence that the harness timeout is sufficient to 
>>>>> handle
>>>>> the intermittent failures?
>>>> Really a good question :)
>>>>
>>>> Here is new version:
>>>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/01/
>>>>
>>>> I added a deadlocked check in every 1 second, hope to get more info in
>>>> case of failure.
>>>
>>> The following comment seems to imply that this check is not
>>> very useful:
>>>
>>>  112             // This won't show up as a deadlock in CTRL-\ or in
>>>  113             // ThreadMXBean.findDeadlockedThreads(), because they
>>> don't
>>>  114             // see that thread A is waiting for thread B
>>> (B.join()), and
>>>  115             // thread B is waiting for a lock held by thread A
>>>
>>> best regards,
>>>
>>> -- daniel
>>>
>>>>
>>>> I changed also the sleep time to 100ms, 10ms seems too short as Daniel
>>>> pointed out.
>>>>
>>>> Thanks,
>>>> Shanliang
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>
>>>>>>>
>>>>>>> Style nit: add a space after 'while' -> while (cond) {
>>>>>> OK, I will do it before pushing.
>>>>>>
>>>>>> Thanks,
>>>>>> Shanliang
>>>>>>>
>>>>>>> David
>>>>>>> -----
>>>>>>>
>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>>>>>>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shanliang
>>>>>>
>>>>
>>>
>>
>


From daniel.fuchs at oracle.com  Wed Sep 17 14:55:38 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Wed, 17 Sep 2014 16:55:38 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <54199DFA.2070105@oracle.com>
References: <5417FF03.5090104@oracle.com>
	<541825DB.3000300@oracle.com>	<5418A534.6050204@oracle.com>
	<5418D43A.50206@oracle.com> <54194C90.2080703@oracle.com>
	<54195208.4030309@oracle.com> <54197C64.5050605@oracle.com>
	<54198CC8.40104@oracle.com> <54199DFA.2070105@oracle.com>
Message-ID: <5419A0EA.8060903@oracle.com>

On 9/17/14 4:43 PM, shanliang wrote:
> Daniel,
>
> We could not be sure that the test failed of timeout, that's why I tried
> to add more checks.
>
> The check for Step 1: all thread traces were printed out only if
> deadlock was found, and the test failed immediately.
> The check for Step 2:
>     1) all thread traces were printed out only if the tested thread was
> blocked, but the test did not fail because we were not sure if deadlock
> happened, but the info might be helpful;
>     2) otherwise only the trace of the tested thread was printed out.
>
> In case that the test gets interrupted again by the test harness, hope
> we can have some useful info from these 2 checks.
>
> It must not be so heavy but still could impact the test, your suggestion
> to use test.timeout.factor is a good idea, I added the code to calculate
> the checking time based on it:
>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/03/

Hi Shanliang,

This looks much better, thanks.
May I suggest taking the current time again at lines 125
and 179:

    checkedTime = System.currentTimeMillis();

That would make it possible to discount the time spent in checking.
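
A rough sketch of the idea, assuming a polling loop of the kind discussed
in this thread (the class, method and parameter names are hypothetical;
only checkedTime and the 100ms sleep come from the messages above):

```java
import java.util.function.BooleanSupplier;

// Sketch only: PeriodicCheck.waitFor is an illustrative helper,
// not the loop in the webrev.
final class PeriodicCheck {
    // Polls until done is true, running check about every checkingTime ms.
    // Returns the number of checks that were run.
    static int waitFor(BooleanSupplier done, long checkingTime, Runnable check) {
        int checks = 0;
        long checkedTime = System.currentTimeMillis();
        while (!done.getAsBoolean()) {
            if (System.currentTimeMillis() - checkedTime > checkingTime) {
                check.run(); // may be slow, e.g. a full thread dump
                checks++;
                // Take the time again after the check, so the time spent
                // checking does not count towards the next interval.
                checkedTime = System.currentTimeMillis();
            }
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt status
                return checks;
            }
        }
        return checks;
    }
}
```

Without the second currentTimeMillis() call, a check that takes longer
than checkingTime would be followed immediately by another check.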

best regards,

-- daniel

>
> Thanks,
> Shanliang
>
> Daniel Fuchs wrote:
>> Hi Shanliang,
>>
>> On 9/17/14 2:19 PM, shanliang wrote:
>>> Daniel,
>>>
>>> The test does 2 steps of verifications, the new check is useful for the
>>> first step, and the trace in the bug showed that the test failed on the
>>> first step.
>>>
>>> Yes the check might not work for the second step, I added the new code
>>> for the second step to check the tested thread state and hope to have
>>> useful info if the test failed on the second step.
>>>
>>> Here is the new version:
>>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/02/
>>>
>>> Thanks,
>>> Shanliang
>>
>> If I understand the issue correctly - the test fails in timeout
>> mostly on very slow machines/configurations (fastdebug with some
>> combinations of options).
>>
>> I worry that printing a thread dump every seconds (1000ms) is going
>> to make things worse: the test will spend its time printing thread
>> dumps instead of doing what it is supposed to do - and will have
>> even less CPU cycles to execute its 'real' code.
>>
>> I would have advised printing the thread dumps only at the end,
>> when it is detected that there might be a deadlock - except that
>> now we can't do that since the timeout is managed completely
>> by the harness (so we don't get the upper hand at the end in
>> case of timeout).
>>
>> I think depending on the harness to set the appropriate timeout
>> rather than depending on an arbitrary timeout set in the test itself
>> is the right thing to do. It's been a pattern in many tests that
>> failed in timeout intermittently on some slow machines/configuration.
>>
>> In any case - 1s seems really too frequent.
>> I suppose you could inspect the system properties set by the harness
>> (timeout + timeout factor) to devise an acceptable frequency for
>> your checks - if you really want to print this info.
>>
>> From the log I see that the timeout factor passed to the harness
>> for the slow configuration that failed is
>> -Dtest.timeout.factor=8.0
>> There's no explicit timeout given - and jtreg -onlineHelp reveals
>> that in this case the default timeout is two minutes.
>>
>> This means that the harness has allocated 2*8=16mins for the test to
>> execute.
>> I don't think you want to take the risk of printing a thread dump
>> every seconds during 16 minutes ;-)
>>
>> Of course I'm over simplifying here. Before your changes - the test
>> was deciding after 46.893 seconds that there must be a deadlock.
>> 47s is obviously way too short for a possibly slow machine running
>> the test in fastdebug mode.
>>
>> Something like the following might be more reasonable:
>>
>> // default timeout factor is 1.0
>> double factor =
>>     Double.parseDouble(
>>        System.getProperty("test.timeout.factor", "1.0"));
>> // default timeout is 2mins = 120s.
>> double timeout = Double.parseDouble(
>>        System.getProperty("test.timeout", "120"));
>>
>> // total time is timeout * timeout factor * 1000 ms
>> long total = (long) (factor * timeout * 1000);
>>
>> // Don't print thread dumps too often.
>> // every 5s for a total timeout of 120s seems reasonable.
>> // 120s/5s = 24; we will lengthen the delay if the total
>> // timeout is greater than 120s, so we're taking the max between
>> // 5s and total/24
>> long delayBetweenThreadDumps = Math.max(5000, total/24);
>>
>> Of course 5s and total/24 are just arbitrary...
>> But 24 full thread dumps in a log for a single test is enough data
>> to analyze I think ;-)
>>
>> best regards,
>>
>> -- daniel
>>
>>>
>>>
>>> Daniel Fuchs wrote:
>>>> On 9/17/14 10:55 AM, shanliang wrote:
>>>>> David Holmes wrote:
>>>>>> On 17/09/2014 7:01 AM, shanliang wrote:
>>>>>>> David Holmes wrote:
>>>>>>>> Hi Shanliang,
>>>>>>>>
>>>>>>>> On 16/09/2014 7:12 PM, shanliang wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Please review the following fix:
>>>>>>>>
>>>>>>>> I don't see any functional change. You seem to have replaced a
>>>>>>>> built-in timeout with the externally applied test harness timeout.
>>>>>>> Yes no functional change here, we thought that the test needed more
>>>>>>> time
>>>>>>> to wait a change if a testing VM or machine was really slow, the
>>>>>>> test
>>>>>>> harness timeout was the maximum time we could give the test.
>>>>>>
>>>>>> Do we have confidence that the harness timeout is sufficient to
>>>>>> handle
>>>>>> the intermittent failures?
>>>>> Really a good question :)
>>>>>
>>>>> Here is new version:
>>>>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/01/
>>>>>
>>>>> I added a deadlocked check in every 1 second, hope to get more info in
>>>>> case of failure.
>>>>
>>>> The following comment seems to imply that this check is not
>>>> very useful:
>>>>
>>>>  112             // This won't show up as a deadlock in CTRL-\ or in
>>>>  113             // ThreadMXBean.findDeadlockedThreads(), because they
>>>> don't
>>>>  114             // see that thread A is waiting for thread B
>>>> (B.join()), and
>>>>  115             // thread B is waiting for a lock held by thread A
>>>>
>>>> best regards,
>>>>
>>>> -- daniel
>>>>
>>>>>
>>>>> I changed also the sleep time to 100ms, 10ms seems too short as Daniel
>>>>> pointed out.
>>>>>
>>>>> Thanks,
>>>>> Shanliang
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> Style nit: add a space after 'while' -> while (cond) {
>>>>>>> OK, I will do it before pushing.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shanliang
>>>>>>>>
>>>>>>>> David
>>>>>>>> -----
>>>>>>>>
>>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>>>>>>>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shanliang
>>>>>>>
>>>>>
>>>>
>>>
>>
>


From shanliang.jiang at oracle.com  Wed Sep 17 15:09:18 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Wed, 17 Sep 2014 17:09:18 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <5419A0EA.8060903@oracle.com>
References: <5417FF03.5090104@oracle.com>
	<541825DB.3000300@oracle.com>	<5418A534.6050204@oracle.com>
	<5418D43A.50206@oracle.com> <54194C90.2080703@oracle.com>
	<54195208.4030309@oracle.com> <54197C64.5050605@oracle.com>
	<54198CC8.40104@oracle.com> <54199DFA.2070105@oracle.com>
	<5419A0EA.8060903@oracle.com>
Message-ID: <5419A41E.70801@oracle.com>

Daniel Fuchs wrote:
> On 9/17/14 4:43 PM, shanliang wrote:
>> Daniel,
>>
>> We could not be sure that the test failed of timeout, that's why I tried
>> to add more checks.
>>
>> The check for Step 1: all thread traces were printed out only if
>> deadlock was found, and the test failed immediately.
>> The check for Step 2:
>>     1) all thread traces were printed out only if the tested thread was
>> blocked, but the test did not fail because we were not sure if deadlock
>> happened, but the info might be helpful;
>>     2) otherwise only the trace of the tested thread was printed out.
>>
>> In case that the test gets interrupted again by the test harness, hope
>> we can have some useful info from these 2 checks.
>>
>> It must not be so heavy but still could impact the test, your suggestion
>> to use test.timeout.factor is a good idea, I added the code to calculate
>> the checking time based on it:
>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/03/
>
> Hi Shanliang,
>
> This looks much better, thanks.
> May I suggest taking the current time again at lines 125
> and 179:
>
>    checkedTime = System.currentTimeMillis();
>
> That would make it possible to discount the time spent in checking.
Here is the new version with your suggestion to re-calculate checkedTime.

    http://cr.openjdk.java.net/~sjiang/JDK-8050115/04/

Thanks a lot for your time!
Shanliang
>
> best regards,
>
> -- daniel
>
>>
>> Thanks,
>> Shanliang
>>
>> Daniel Fuchs wrote:
>>> Hi Shanliang,
>>>
>>> On 9/17/14 2:19 PM, shanliang wrote:
>>>> Daniel,
>>>>
>>>> The test does 2 steps of verifications, the new check is useful for 
>>>> the
>>>> first step, and the trace in the bug showed that the test failed on 
>>>> the
>>>> first step.
>>>>
>>>> Yes the check might not work for the second step, I added the new code
>>>> for the second step to check the tested thread state and hope to have
>>>> useful info if the test failed on the second step.
>>>>
>>>> Here is the new version:
>>>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/02/
>>>>
>>>> Thanks,
>>>> Shanliang
>>>
>>> If I understand the issue correctly - the test fails in timeout
>>> mostly on very slow machines/configurations (fastdebug with some
>>> combinations of options).
>>>
>>> I worry that printing a thread dump every seconds (1000ms) is going
>>> to make things worse: the test will spend its time printing thread
>>> dumps instead of doing what it is supposed to do - and will have
>>> even less CPU cycles to execute its 'real' code.
>>>
>>> I would have advised printing the thread dumps only at the end,
>>> when it is detected that there might be a deadlock - except that
>>> now we can't do that since the timeout is managed completely
>>> by the harness (so we don't get the upper hand at the end in
>>> case of timeout).
>>>
>>> I think depending on the harness to set the appropriate timeout
>>> rather than depending on an arbitrary timeout set in the test itself
>>> is the right thing to do. It's been a pattern in many tests that
>>> failed in timeout intermittently on some slow machines/configuration.
>>>
>>> In any case - 1s seems really too frequent.
>>> I suppose you could inspect the system properties set by the harness
>>> (timeout + timeout factor) to devise an acceptable frequency for
>>> your checks - if you really want to print this info.
>>>
>>> From the log I see that the timeout factor passed to the harness
>>> for the slow configuration that failed is
>>> -Dtest.timeout.factor=8.0
>>> There's no explicit timeout given - and jtreg -onlineHelp reveals
>>> that in this case the default timeout is two minutes.
>>>
>>> This means that the harness has allocated 2*8=16mins for the test to
>>> execute.
>>> I don't think you want to take the risk of printing a thread dump
>>> every seconds during 16 minutes ;-)
>>>
>>> Of course I'm over simplifying here. Before your changes - the test
>>> was deciding after 46.893 seconds that there must be a deadlock.
>>> 47s is obviously way too short for a possibly slow machine running
>>> the test in fastdebug mode.
>>>
>>> Something like the following might be more reasonable:
>>>
>>> // default timeout factor is 1.0
>>> double factor =
>>>     Double.parseDouble(
>>>        System.getProperty("test.timeout.factor", "1.0"));
>>> // default timeout is 2mins = 120s.
>>> double timeout = Double.parseDouble(
>>>        System.getProperty("test.timeout", "120"));
>>>
>>> // total time is timeout * timeout factor * 1000 ms
>>> long total = (long) (factor * timeout * 1000);
>>>
>>> // Don't print thread dumps too often.
>>> // every 5s for a total timeout of 120s seems reasonable.
>>> // 120s/5s = 24; we will lengthen the delay if the total
>>> // timeout is greater than 120s, so we're taking the max between
>>> // 5s and total/24
>>> long delayBetweenThreadDumps = Math.max(5000, total/24);
>>>
>>> Of course 5s and total/24 are just arbitrary...
>>> But 24 full thread dumps in a log for a single test is enough data
>>> to analyze I think ;-)
>>>
>>> best regards,
>>>
>>> -- daniel
>>>
>>>>
>>>>
>>>> Daniel Fuchs wrote:
>>>>> On 9/17/14 10:55 AM, shanliang wrote:
>>>>>> David Holmes wrote:
>>>>>>> On 17/09/2014 7:01 AM, shanliang wrote:
>>>>>>>> David Holmes wrote:
>>>>>>>>> Hi Shanliang,
>>>>>>>>>
>>>>>>>>> On 16/09/2014 7:12 PM, shanliang wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Please review the following fix:
>>>>>>>>>
>>>>>>>>> I don't see any functional change. You seem to have replaced a
>>>>>>>>> built-in timeout with the externally applied test harness 
>>>>>>>>> timeout.
>>>>>>>> Yes no functional change here, we thought that the test needed 
>>>>>>>> more
>>>>>>>> time
>>>>>>>> to wait a change if a testing VM or machine was really slow, the
>>>>>>>> test
>>>>>>>> harness timeout was the maximum time we could give the test.
>>>>>>>
>>>>>>> Do we have confidence that the harness timeout is sufficient to
>>>>>>> handle
>>>>>>> the intermittent failures?
>>>>>> Really a good question :)
>>>>>>
>>>>>> Here is new version:
>>>>>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/01/
>>>>>>
>>>>>> I added a deadlocked check in every 1 second, hope to get more 
>>>>>> info in
>>>>>> case of failure.
>>>>>
>>>>> The following comment seems to imply that this check is not
>>>>> very useful:
>>>>>
>>>>>  112             // This won't show up as a deadlock in CTRL-\ or in
>>>>>  113             // ThreadMXBean.findDeadlockedThreads(), because 
>>>>> they
>>>>> don't
>>>>>  114             // see that thread A is waiting for thread B
>>>>> (B.join()), and
>>>>>  115             // thread B is waiting for a lock held by thread A
>>>>>
>>>>> best regards,
>>>>>
>>>>> -- daniel
>>>>>
>>>>>>
>>>>>> I changed also the sleep time to 100ms, 10ms seems too short as 
>>>>>> Daniel
>>>>>> pointed out.
>>>>>>
>>>>>> Thanks,
>>>>>> Shanliang
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> Style nit: add a space after 'while' -> while (cond) {
>>>>>>>> OK, I will do it before pushing.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shanliang
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>> -----
>>>>>>>>>
>>>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>>>>>>>>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Shanliang
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


From daniel.fuchs at oracle.com  Wed Sep 17 15:22:21 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Wed, 17 Sep 2014 17:22:21 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <5419A41E.70801@oracle.com>
References: <5417FF03.5090104@oracle.com>
	<541825DB.3000300@oracle.com>	<5418A534.6050204@oracle.com>
	<5418D43A.50206@oracle.com> <54194C90.2080703@oracle.com>
	<54195208.4030309@oracle.com> <54197C64.5050605@oracle.com>
	<54198CC8.40104@oracle.com> <54199DFA.2070105@oracle.com>
	<5419A0EA.8060903@oracle.com> <5419A41E.70801@oracle.com>
Message-ID: <5419A72D.1030607@oracle.com>

On 9/17/14 5:09 PM, shanliang wrote:
>> This looks much better, thanks.
>> May I suggest taking the current time again at lines 125
>> and 179:
>>
>>    checkedTime = System.currentTimeMillis();
>>
>> That would make it possible to discount the time spent in checking.
> Here is the new version with your suggestion to re-calculate checkedTime.
>
> http://cr.openjdk.java.net/~sjiang/JDK-8050115/04/
>
> Thanks a lot for your time!
> Shanliang

Looks good Shanliang!

Thanks a lot for your patience ;-)

-- daniel

From david.holmes at oracle.com  Thu Sep 18 02:32:37 2014
From: david.holmes at oracle.com (David Holmes)
Date: Thu, 18 Sep 2014 12:32:37 +1000
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <5419A41E.70801@oracle.com>
References: <5417FF03.5090104@oracle.com>
	<541825DB.3000300@oracle.com>	<5418A534.6050204@oracle.com>
	<5418D43A.50206@oracle.com> <54194C90.2080703@oracle.com>
	<54195208.4030309@oracle.com> <54197C64.5050605@oracle.com>
	<54198CC8.40104@oracle.com> <54199DFA.2070105@oracle.com>
	<5419A0EA.8060903@oracle.com> <5419A41E.70801@oracle.com>
Message-ID: <541A4445.7000601@oracle.com>

Still not 100% sure about the deadlock detection logic here, but as long 
as it does no harm - ok.

Minor style nit:

   59         System.out.println("=== checkingTime = "+checkingTime+"ms");

spaces needed around the + operator.

Thanks,
David

On 18/09/2014 1:09 AM, shanliang wrote:
> Daniel Fuchs wrote:
>> On 9/17/14 4:43 PM, shanliang wrote:
>>> Daniel,
>>>
>>> We could not be sure that the test failed of timeout, that's why I tried
>>> to add more checks.
>>>
>>> The check for Step 1: all thread traces were printed out only if
>>> deadlock was found, and the test failed immediately.
>>> The check for Step 2:
>>>     1) all thread traces were printed out only if the tested thread was
>>> blocked, but the test did not fail because we were not sure if deadlock
>>> happened, but the info might be helpful;
>>>     2) otherwise only the trace of the tested thread was printed out.
>>>
>>> In case that the test gets interrupted again by the test harness, hope
>>> we can have some useful info from these 2 checks.
>>>
>>> It must not be so heavy but still could impact the test, your suggestion
>>> to use test.timeout.factor is a good idea, I added the code to calculate
>>> the checking time based on it:
>>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/03/
>>
>> Hi Shanliang,
>>
>> This looks much better, thanks.
>> May I suggest taking the current time again at lines 125
>> and 179:
>>
>>    checkedTime = System.currentTimeMillis();
>>
>> That would make it possible to discount the time spent in checking.
> Here is the new version with your suggestion to re-calculate checkedTime.
>
>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/04/
>
> Thanks a lot for your time!
> Shanliang
>>
>> best regards,
>>
>> -- daniel
>>
>>>
>>> Thanks,
>>> Shanliang
>>>
>>> Daniel Fuchs wrote:
>>>> Hi Shanliang,
>>>>
>>>> On 9/17/14 2:19 PM, shanliang wrote:
>>>>> Daniel,
>>>>>
>>>>> The test does 2 verification steps. The new check is useful for the
>>>>> first step, and the trace in the bug showed that the test failed on the
>>>>> first step.
>>>>>
>>>>> Yes, the check might not work for the second step. I added new code
>>>>> for the second step to check the tested thread state, hoping to get
>>>>> useful info if the test fails on the second step.
>>>>>
>>>>> Here is the new version:
>>>>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/02/
>>>>>
>>>>> Thanks,
>>>>> Shanliang
>>>>
>>>> If I understand the issue correctly - the test fails in timeout
>>>> mostly on very slow machines/configurations (fastdebug with some
>>>> combinations of options).
>>>>
>>>> I worry that printing a thread dump every second (1000ms) is going
>>>> to make things worse: the test will spend its time printing thread
>>>> dumps instead of doing what it is supposed to do - and will have
>>>> even fewer CPU cycles to execute its 'real' code.
>>>>
>>>> I would have advised printing the thread dumps only at the end,
>>>> when it is detected that there might be a deadlock - except that
>>>> now we can't do that since the timeout is managed completely
>>>> by the harness (so we don't get the upper hand at the end in
>>>> case of timeout).
>>>>
>>>> I think depending on the harness to set the appropriate timeout
>>>> rather than depending on an arbitrary timeout set in the test itself
>>>> is the right thing to do. It's been a pattern in many tests that
>>>> failed in timeout intermittently on some slow machines/configuration.
>>>>
>>>> In any case - 1s seems really too frequent.
>>>> I suppose you could inspect the system properties set by the harness
>>>> (timeout + timeout factor) to devise an acceptable frequency for
>>>> your checks - if you really want to print this info.
>>>>
>>>> From the log I see that the timeout factor passed to the harness
>>>> for the slow configuration that failed is
>>>> -Dtest.timeout.factor=8.0
>>>> There's no explicit timeout given - and jtreg -onlineHelp reveals
>>>> that in this case the default timeout is two minutes.
>>>>
>>>> This means that the harness has allocated 2*8=16mins for the test to
>>>> execute.
>>>> I don't think you want to take the risk of printing a thread dump
>>>> every second for 16 minutes ;-)
>>>>
>>>> Of course I'm over simplifying here. Before your changes - the test
>>>> was deciding after 46.893 seconds that there must be a deadlock.
>>>> 47s is obviously way too short for a possibly slow machine running
>>>> the test in fastdebug mode.
>>>>
>>>> Something like the following might be more reasonable:
>>>>
>>>> // default timeout factor is 1.0
>>>> double factor =
>>>>     Double.parseDouble(
>>>>        System.getProperty("test.timeout.factor", "1.0"));
>>>> // default timeout is 2mins = 120s.
>>>> double timeout = Double.parseDouble(
>>>>        System.getProperty("test.timeout", "120"));
>>>>
>>>> // total time is timeout * timeout factor * 1000 ms
>>>> long total = (long) (factor * timeout * 1000);
>>>>
>>>> // Don't print thread dumps too often.
>>>> // every 5s for a total timeout of 120s seems reasonable.
>>>> // 120s/5s = 24; we will lengthen the delay if the total
>>>> // timeout is greater than 120s, so we're taking the max between
>>>> // 5s and total/24
>>>> long delayBetweenThreadDumps = Math.max(5000, total/24);
>>>>
>>>> Of course 5s and total/24 are just arbitrary...
>>>> But 24 full thread dumps in a log for a single test is enough data
>>>> to analyze I think ;-)
>>>>
>>>> best regards,
>>>>
>>>> -- daniel
>>>>
>>>>>
>>>>>
>>>>> Daniel Fuchs wrote:
>>>>>> On 9/17/14 10:55 AM, shanliang wrote:
>>>>>>> David Holmes wrote:
>>>>>>>> On 17/09/2014 7:01 AM, shanliang wrote:
>>>>>>>>> David Holmes wrote:
>>>>>>>>>> Hi Shanliang,
>>>>>>>>>>
>>>>>>>>>> On 16/09/2014 7:12 PM, shanliang wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Please review the following fix:
>>>>>>>>>>
>>>>>>>>>> I don't see any functional change. You seem to have replaced a
>>>>>>>>>> built-in timeout with the externally applied test harness
>>>>>>>>>> timeout.
>>>>>>>>> Yes, no functional change here. We thought that the test needed more
>>>>>>>>> time to wait for a change if a testing VM or machine was really slow;
>>>>>>>>> the test harness timeout was the maximum time we could give the test.
>>>>>>>>
>>>>>>>> Do we have confidence that the harness timeout is sufficient to handle
>>>>>>>> the intermittent failures?
>>>>>>> Really a good question :)
>>>>>>>
>>>>>>> Here is new version:
>>>>>>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/01/
>>>>>>>
>>>>>>> I added a deadlock check every 1 second, hoping to get more info in
>>>>>>> case of failure.
>>>>>>
>>>>>> The following comment seems to imply that this check is not
>>>>>> very useful:
>>>>>>
>>>>>>  112             // This won't show up as a deadlock in CTRL-\ or in
>>>>>>  113             // ThreadMXBean.findDeadlockedThreads(), because they don't
>>>>>>  114             // see that thread A is waiting for thread B (B.join()), and
>>>>>>  115             // thread B is waiting for a lock held by thread A
>>>>>>
>>>>>> best regards,
>>>>>>
>>>>>> -- daniel
>>>>>>
>>>>>>>
>>>>>>> I also changed the sleep time to 100ms; 10ms seems too short, as Daniel
>>>>>>> pointed out.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shanliang
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Style nit: add a space after 'while' -> while (cond) {
>>>>>>>>> OK, I will do it before pushing.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shanliang
>>>>>>>>>>
>>>>>>>>>> David
>>>>>>>>>> -----
>>>>>>>>>>
>>>>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>>>>>>>>>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Shanliang
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
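
The interval calculation discussed in the quoted thread can be sketched as
follows. This is a minimal sketch, not the code from the webrev: the class
and method names are hypothetical, and the "test.timeout" property is taken
from the suggestion in the thread rather than confirmed as something the
harness sets. Note that the whole product is cast to long; casting only
'factor' leaves a double on the right-hand side and does not compile.

```java
// Hypothetical helper illustrating the suggestion in the thread;
// it is not the code that was pushed.
final class ThreadDumpInterval {
    // Never check more often than every 5 seconds (arbitrary, as in the thread).
    static final long MIN_DELAY_MS = 5000;
    // Aim for at most 24 thread dumps over the whole allotted time.
    static final int MAX_DUMPS = 24;

    /** Delay in ms between checks for a given timeout factor and timeout (seconds). */
    static long delayMillis(double factor, double timeoutSeconds) {
        // Cast the whole product, not just 'factor'.
        long total = (long) (factor * timeoutSeconds * 1000);
        return Math.max(MIN_DELAY_MS, total / MAX_DUMPS);
    }

    /** Same calculation, reading the values from system properties. */
    static long delayMillisFromProperties() {
        // Default timeout factor is 1.0.
        double factor = Double.parseDouble(
                System.getProperty("test.timeout.factor", "1.0"));
        // Assumption from the thread: default timeout is 2 minutes = 120s.
        double timeout = Double.parseDouble(
                System.getProperty("test.timeout", "120"));
        return delayMillis(factor, timeout);
    }
}
```

With a factor of 8.0 and the 120s default, this gives one check every 40
seconds instead of every second.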

From shanliang.jiang at oracle.com  Thu Sep 18 14:27:30 2014
From: shanliang.jiang at oracle.com (shanliang)
Date: Thu, 18 Sep 2014 16:27:30 +0200
Subject: jmx-dev Codereview request: 8050115
 javax/management/monitor/GaugeMonitorDeadlockTest.java fails intermittently
In-Reply-To: <541A4445.7000601@oracle.com>
References: <5417FF03.5090104@oracle.com>
	<541825DB.3000300@oracle.com>	<5418A534.6050204@oracle.com>
	<5418D43A.50206@oracle.com> <54194C90.2080703@oracle.com>
	<54195208.4030309@oracle.com> <54197C64.5050605@oracle.com>
	<54198CC8.40104@oracle.com> <54199DFA.2070105@oracle.com>
	<5419A0EA.8060903@oracle.com> <5419A41E.70801@oracle.com>
	<541A4445.7000601@oracle.com>
Message-ID: <541AEBD2.7090307@oracle.com>

David Holmes wrote:
> Still not 100% sure about the deadlock detection logic here, but as 
> long as it does no harm - ok.
>
> Minor style nit:
>
>   59         System.out.println("=== checkingTime = "+checkingTime+"ms");
>
> spaces needed around the + operator.
OK, I have added spaces.

Thanks David for the review.
Shanliang
>
> Thanks,
> David
>
> On 18/09/2014 1:09 AM, shanliang wrote:
>> Daniel Fuchs wrote:
>>> On 9/17/14 4:43 PM, shanliang wrote:
>>>> Daniel,
>>>>
>>>> We could not be sure that the test failed because of a timeout, which is
>>>> why I tried to add more checks.
>>>>
>>>> The check for Step 1: all thread traces were printed out only if
>>>> deadlock was found, and the test failed immediately.
>>>> The check for Step 2:
>>>>     1) all thread traces were printed out only if the tested thread was
>>>> blocked, but the test did not fail because we were not sure if deadlock
>>>> happened, but the info might be helpful;
>>>>     2) otherwise only the trace of the tested thread was printed out.
>>>>
>>>> In case that the test gets interrupted again by the test harness, hope
>>>> we can have some useful info from these 2 checks.
>>>>
>>>> The checks should not be very heavy but could still impact the test. Your
>>>> suggestion to use test.timeout.factor is a good idea; I added code to
>>>> calculate the checking time based on it:
>>>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/03/
>>>
>>> Hi Shanliang,
>>>
>>> This looks much better, thanks.
>>> May I suggest taking the current time again at lines 125
>>> and 179:
>>>
>>>    checkedTime = System.currentTimeMillis();
>>>
>>> That would discount the time spent in checking.
>> Here is the new version with your suggestion to re-calculate checkedTime.
>>
>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/04/
>>
>> Thanks a lot for your time!
>> Shanliang
>>>
>>> best regards,
>>>
>>> -- daniel
>>>
>>>>
>>>> Thanks,
>>>> Shanliang
>>>>
>>>> Daniel Fuchs wrote:
>>>>> Hi Shanliang,
>>>>>
>>>>> On 9/17/14 2:19 PM, shanliang wrote:
>>>>>> Daniel,
>>>>>>
>>>>>> The test does 2 verification steps. The new check is useful for the
>>>>>> first step, and the trace in the bug showed that the test failed on the
>>>>>> first step.
>>>>>>
>>>>>> Yes, the check might not work for the second step. I added new code
>>>>>> for the second step to check the tested thread state, hoping to get
>>>>>> useful info if the test fails on the second step.
>>>>>>
>>>>>> Here is the new version:
>>>>>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/02/
>>>>>>
>>>>>> Thanks,
>>>>>> Shanliang
>>>>>
>>>>> If I understand the issue correctly - the test fails in timeout
>>>>> mostly on very slow machines/configurations (fastdebug with some
>>>>> combinations of options).
>>>>>
>>>>> I worry that printing a thread dump every second (1000ms) is going
>>>>> to make things worse: the test will spend its time printing thread
>>>>> dumps instead of doing what it is supposed to do - and will have
>>>>> even fewer CPU cycles to execute its 'real' code.
>>>>>
>>>>> I would have advised printing the thread dumps only at the end,
>>>>> when it is detected that there might be a deadlock - except that
>>>>> now we can't do that since the timeout is managed completely
>>>>> by the harness (so we don't get the upper hand at the end in
>>>>> case of timeout).
>>>>>
>>>>> I think depending on the harness to set the appropriate timeout
>>>>> rather than depending on an arbitrary timeout set in the test itself
>>>>> is the right thing to do. It's been a pattern in many tests that
>>>>> failed in timeout intermittently on some slow machines/configuration.
>>>>>
>>>>> In any case - 1s seems really too frequent.
>>>>> I suppose you could inspect the system properties set by the harness
>>>>> (timeout + timeout factor) to devise an acceptable frequency for
>>>>> your checks - if you really want to print this info.
>>>>>
>>>>> From the log I see that the timeout factor passed to the harness
>>>>> for the slow configuration that failed is
>>>>> -Dtest.timeout.factor=8.0
>>>>> There's no explicit timeout given - and jtreg -onlineHelp reveals
>>>>> that in this case the default timeout is two minutes.
>>>>>
>>>>> This means that the harness has allocated 2*8=16mins for the test to
>>>>> execute.
>>>>> I don't think you want to take the risk of printing a thread dump
>>>>> every second for 16 minutes ;-)
>>>>>
>>>>> Of course I'm over simplifying here. Before your changes - the test
>>>>> was deciding after 46.893 seconds that there must be a deadlock.
>>>>> 47s is obviously way too short for a possibly slow machine running
>>>>> the test in fastdebug mode.
>>>>>
>>>>> Something like the following might be more reasonable:
>>>>>
>>>>> // default timeout factor is 1.0
>>>>> double factor =
>>>>>     Double.parseDouble(
>>>>>        System.getProperty("test.timeout.factor", "1.0"));
>>>>> // default timeout is 2mins = 120s.
>>>>> double timeout = Double.parseDouble(
>>>>>        System.getProperty("test.timeout", "120"));
>>>>>
>>>>> // total time is timeout * timeout factor * 1000 ms
>>>>> long total = (long) (factor * timeout * 1000);
>>>>>
>>>>> // Don't print thread dumps too often.
>>>>> // every 5s for a total timeout of 120s seems reasonable.
>>>>> // 120s/5s = 24; we will lengthen the delay if the total
>>>>> // timeout is greater than 120s, so we're taking the max between
>>>>> // 5s and total/24
>>>>> long delayBetweenThreadDumps = Math.max(5000, total/24);
>>>>>
>>>>> Of course 5s and total/24 are just arbitrary...
>>>>> But 24 full thread dumps in a log for a single test is enough data
>>>>> to analyze I think ;-)
>>>>>
>>>>> best regards,
>>>>>
>>>>> -- daniel
>>>>>
>>>>>>
>>>>>>
>>>>>> Daniel Fuchs wrote:
>>>>>>> On 9/17/14 10:55 AM, shanliang wrote:
>>>>>>>> David Holmes wrote:
>>>>>>>>> On 17/09/2014 7:01 AM, shanliang wrote:
>>>>>>>>>> David Holmes wrote:
>>>>>>>>>>> Hi Shanliang,
>>>>>>>>>>>
>>>>>>>>>>> On 16/09/2014 7:12 PM, shanliang wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Please review the following fix:
>>>>>>>>>>>
>>>>>>>>>>> I don't see any functional change. You seem to have replaced a
>>>>>>>>>>> built-in timeout with the externally applied test harness
>>>>>>>>>>> timeout.
>>>>>>>>>> Yes, no functional change here. We thought that the test needed more
>>>>>>>>>> time to wait for a change if a testing VM or machine was really slow;
>>>>>>>>>> the test harness timeout was the maximum time we could give the test.
>>>>>>>>>
>>>>>>>>> Do we have confidence that the harness timeout is sufficient to handle
>>>>>>>>> the intermittent failures?
>>>>>>>> Really a good question :)
>>>>>>>>
>>>>>>>> Here is new version:
>>>>>>>>     http://cr.openjdk.java.net/~sjiang/JDK-8050115/01/
>>>>>>>>
>>>>>>>> I added a deadlock check every 1 second, hoping to get more info in
>>>>>>>> case of failure.
>>>>>>>
>>>>>>> The following comment seems to imply that this check is not
>>>>>>> very useful:
>>>>>>>
>>>>>>>  112             // This won't show up as a deadlock in CTRL-\ or in
>>>>>>>  113             // ThreadMXBean.findDeadlockedThreads(), because they don't
>>>>>>>  114             // see that thread A is waiting for thread B (B.join()), and
>>>>>>>  115             // thread B is waiting for a lock held by thread A
>>>>>>>
>>>>>>> best regards,
>>>>>>>
>>>>>>> -- daniel
>>>>>>>
>>>>>>>>
>>>>>>>> I also changed the sleep time to 100ms; 10ms seems too short, as Daniel
>>>>>>>> pointed out.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shanliang
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Style nit: add a space after 'while' -> while (cond) {
>>>>>>>>>> OK, I will do it before pushing.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Shanliang
>>>>>>>>>>>
>>>>>>>>>>> David
>>>>>>>>>>> -----
>>>>>>>>>>>
>>>>>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8050115
>>>>>>>>>>>> webrev: http://cr.openjdk.java.net/~sjiang/JDK-8050115/00/
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Shanliang
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
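
The suggestion to take the current time again after each check, so that the
time spent checking is not counted toward the next interval, might look
roughly like this. It is only a sketch under assumptions: ThrottledChecker
and maybeCheck are hypothetical names, and the clock is injected so the
behaviour can be exercised without sleeping. The actual test reads
System.currentTimeMillis() directly.

```java
import java.util.function.LongSupplier;

// Hypothetical illustration of the "re-read the clock after checking" idea.
final class ThrottledChecker {
    private final long checkingTime;   // minimum ms between two checks
    private final LongSupplier clock;  // e.g. System::currentTimeMillis
    private long checkedTime;          // when the last check finished

    ThrottledChecker(long checkingTimeMs, LongSupplier clock) {
        this.checkingTime = checkingTimeMs;
        this.clock = clock;
        this.checkedTime = clock.getAsLong();
    }

    /** Runs the check if the interval has elapsed; returns true if it ran. */
    boolean maybeCheck(Runnable check) {
        if (clock.getAsLong() - checkedTime < checkingTime) {
            return false;
        }
        check.run();
        // Take the current time again AFTER the check, so a slow check
        // (such as a full thread dump) does not shorten the next interval.
        checkedTime = clock.getAsLong();
        return true;
    }
}
```

In a polling loop this would be called once per iteration, for example
checker.maybeCheck(() -> printThreadInfo()), where printThreadInfo is a
placeholder for whatever diagnostic the test performs.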


From jaroslav.bachorik at oracle.com  Tue Sep 23 15:59:19 2014
From: jaroslav.bachorik at oracle.com (Jaroslav Bachorik)
Date: Tue, 23 Sep 2014 17:59:19 +0200
Subject: jmx-dev RFR 8057149:
 sun/management/jmxremote/startstop/JMXStartStopTest.java fails with
 "Starting agent on port ... should report port in use"
Message-ID: <542198D7.4010804@oracle.com>

Please, review this test change

Issue : https://bugs.openjdk.java.net/browse/JDK-8057149
Webrev: http://cr.openjdk.java.net/~jbachorik/8057149/webrev.00

The test uses 'jcmd' to start/stop the JMX agent dynamically with
various parameters. Some of the tests use a server socket to simulate
starting the agent when the desired port is not available. However,
it seems that in certain situations the agent does not provide a
concise error message and fails with a RuntimeException instead. This
seems to be timing-related, so the solution is to retry the
'jcmd' command a few times with a delay, to provide some cushion for the
timing-related problems.

Thanks,

-JB-
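
The retry-with-delay approach described above could be sketched like this.
It is not the code from the webrev: Retry and untilSuccess are hypothetical
names, and the attempt is modelled as a BooleanSupplier that returns true
when the expected output was seen, standing in for one invocation of 'jcmd'.

```java
import java.util.function.BooleanSupplier;

// Hypothetical retry helper; 'attempt' stands in for one 'jcmd' invocation.
final class Retry {
    /**
     * Runs 'attempt' up to maxAttempts times, sleeping delayMs after each
     * failure. Returns the number of attempts used, or -1 if none succeeded
     * or the thread was interrupted while waiting.
     */
    static int untilSuccess(int maxAttempts, long delayMs, BooleanSupplier attempt) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return i;
            }
            if (i < maxAttempts) {
                System.out.println("Attempt " + i + " failed, retrying in "
                        + delayMs + " ms");
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException e) {
                    // Preserve the interrupt status and give up.
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }
}
```

The number of attempts and the delay are tuning knobs; on a slow machine
they would probably need to scale with the harness timeout factor.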

From staffan.larsen at oracle.com  Tue Sep 23 18:20:56 2014
From: staffan.larsen at oracle.com (Staffan Larsen)
Date: Tue, 23 Sep 2014 20:20:56 +0200
Subject: jmx-dev RFR 8057149:
	sun/management/jmxremote/startstop/JMXStartStopTest.java
	fails with "Starting agent on port ... should report port in use"
In-Reply-To: <542198D7.4010804@oracle.com>
References: <542198D7.4010804@oracle.com>
Message-ID: <56CD5367-1331-4202-82D4-4AAF7FB0FA02@oracle.com>

I think we should add some logging to say that we are retrying.

nit: some weird indention on line 623.

Otherwise looks good.

/Staffan

On 23 sep 2014, at 17:59, Jaroslav Bachorik <jaroslav.bachorik at oracle.com> wrote:

> Please, review this test change
> 
> Issue : https://bugs.openjdk.java.net/browse/JDK-8057149
> Webrev: http://cr.openjdk.java.net/~jbachorik/8057149/webrev.00
> 
> The test is using 'jcmd' to start/stop the JMX agent dynamically with various parameters. Some of the tests use a server socket to simulate starting the agent with the desired port not being available. However, it seems that in certain situations the agent does not provide a conscious error message and fails with RuntimeException instead. This seems to be timing related and therefore the solution is to retry the 'jcmd' command a few times with a delay to provide some cushion for the timing related problems.
> 
> Thanks,
> 
> -JB-


From jaroslav.bachorik at oracle.com  Wed Sep 24 09:35:15 2014
From: jaroslav.bachorik at oracle.com (Jaroslav Bachorik)
Date: Wed, 24 Sep 2014 11:35:15 +0200
Subject: jmx-dev RFR 8057149:
 sun/management/jmxremote/startstop/JMXStartStopTest.java fails with
 "Starting agent on port ... should report port in use"
In-Reply-To: <56CD5367-1331-4202-82D4-4AAF7FB0FA02@oracle.com>
References: <542198D7.4010804@oracle.com>
	<56CD5367-1331-4202-82D4-4AAF7FB0FA02@oracle.com>
Message-ID: <54229053.40502@oracle.com>

Thanks, Staffan

On 09/23/2014 08:20 PM, Staffan Larsen wrote:
> I think we should add some logging to say that we are retrying.
>
> nit: some weird indention on line 623.

I added logging and improved the retry detection logic a bit (in the 
previous version the retry counter could have got decremented multiple 
times during one invocation of 'jcmd' - even though it was very unlikely).

The weird indentation is gone.

http://cr.openjdk.java.net/~jbachorik/8057149/webrev.01

-JB-

>
> Otherwise looks good.
>
> /Staffan
>
> On 23 sep 2014, at 17:59, Jaroslav Bachorik <jaroslav.bachorik at oracle.com> wrote:
>
>> Please, review this test change
>>
>> Issue : https://bugs.openjdk.java.net/browse/JDK-8057149
>> Webrev: http://cr.openjdk.java.net/~jbachorik/8057149/webrev.00
>>
>> The test is using 'jcmd' to start/stop the JMX agent dynamically with various parameters. Some of the tests use a server socket to simulate starting the agent with the desired port not being available. However, it seems that in certain situations the agent does not provide a conscious error message and fails with RuntimeException instead. This seems to be timing related and therefore the solution is to retry the 'jcmd' command a few times with a delay to provide some cushion for the timing related problems.
>>
>> Thanks,
>>
>> -JB-
>


From staffan.larsen at oracle.com  Wed Sep 24 09:45:29 2014
From: staffan.larsen at oracle.com (Staffan Larsen)
Date: Wed, 24 Sep 2014 11:45:29 +0200
Subject: jmx-dev RFR 8057149:
	sun/management/jmxremote/startstop/JMXStartStopTest.java
	fails with "Starting agent on port ... should report port in use"
In-Reply-To: <54229053.40502@oracle.com>
References: <542198D7.4010804@oracle.com>
	<56CD5367-1331-4202-82D4-4AAF7FB0FA02@oracle.com>
	<54229053.40502@oracle.com>
Message-ID: <11FC3CA1-DEC6-4C0A-A60A-FA3DA1991233@oracle.com>

Looks good!

Thanks,
/Staffan

On 24 sep 2014, at 11:35, Jaroslav Bachorik <jaroslav.bachorik at oracle.com> wrote:

> Thanks, Staffan
> 
> On 09/23/2014 08:20 PM, Staffan Larsen wrote:
>> I think we should add some logging to say that we are retrying.
>> 
>> nit: some weird indention on line 623.
> 
> I added logging and improved the retry detection logic a bit (in the previous version the retry counter could have got decremented multiple times during one invocation of 'jcmd' - even though it was very unlikely).
> 
> The weird indentation is gone.
> 
> http://cr.openjdk.java.net/~jbachorik/8057149/webrev.01
> 
> -JB-
> 
>> 
>> Otherwise looks good.
>> 
>> /Staffan
>> 
>> On 23 sep 2014, at 17:59, Jaroslav Bachorik <jaroslav.bachorik at oracle.com> wrote:
>> 
>>> Please, review this test change
>>> 
>>> Issue : https://bugs.openjdk.java.net/browse/JDK-8057149
>>> Webrev: http://cr.openjdk.java.net/~jbachorik/8057149/webrev.00
>>> 
>>> The test is using 'jcmd' to start/stop the JMX agent dynamically with various parameters. Some of the tests use a server socket to simulate starting the agent with the desired port not being available. However, it seems that in certain situations the agent does not provide a conscious error message and fails with RuntimeException instead. This seems to be timing related and therefore the solution is to retry the 'jcmd' command a few times with a delay to provide some cushion for the timing related problems.
>>> 
>>> Thanks,
>>> 
>>> -JB-
>> 
> 


From daniel.fuchs at oracle.com  Wed Sep 24 10:15:31 2014
From: daniel.fuchs at oracle.com (Daniel Fuchs)
Date: Wed, 24 Sep 2014 12:15:31 +0200
Subject: jmx-dev RFR 8057149:
 sun/management/jmxremote/startstop/JMXStartStopTest.java fails with
 "Starting agent on port ... should report port in use"
In-Reply-To: <54229053.40502@oracle.com>
References: <542198D7.4010804@oracle.com>	<56CD5367-1331-4202-82D4-4AAF7FB0FA02@oracle.com>
	<54229053.40502@oracle.com>
Message-ID: <542299C3.6080106@oracle.com>

Hi Jaroslav,

619                     final boolean[] retry = new boolean[]{false};

I wonder whether using an AtomicBoolean would be safer.
It is not immediately clear whether 'retry' might be modified
in a different thread - but using an AtomicBoolean would make it
immediately clear that it does not matter ;-)

best regards,

-- daniel
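
A sketch of the AtomicBoolean variant, under assumptions: runCommand is a
hypothetical stand-in for invoking 'jcmd' and feeding its output lines to a
callback, and the "RuntimeException" match is only illustrative. The flag
is set at most once per invocation, so the retry counter drops by one no
matter how many lines match.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical illustration; not the code from the webrev.
final class RetryFlagDemo {
    /** Stand-in for running 'jcmd': passes each output line to the handler. */
    static void runCommand(List<String> fakeOutput, Consumer<String> lineHandler) {
        fakeOutput.forEach(lineHandler);
    }

    /** Returns the number of retries left after one invocation. */
    static int remainingAfter(List<String> output, int retries) {
        // AtomicBoolean instead of 'new boolean[]{false}': safe even if the
        // handler happens to run on another thread.
        AtomicBoolean retry = new AtomicBoolean(false);
        runCommand(output, line -> {
            if (line.contains("RuntimeException")) {
                retry.set(true);
            }
        });
        // Decrement once per invocation, regardless of how many lines matched.
        return retry.get() ? retries - 1 : retries;
    }
}
```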

On 24/09/14 11:35, Jaroslav Bachorik wrote:
> Thanks, Staffan
>
> On 09/23/2014 08:20 PM, Staffan Larsen wrote:
>> I think we should add some logging to say that we are retrying.
>>
>> nit: some weird indention on line 623.
>
> I added logging and improved the retry detection logic a bit (in the
> previous version the retry counter could have got decremented multiple
> times during one invocation of 'jcmd' - even though it was very unlikely).
>
> The weird indentation is gone.
>
> http://cr.openjdk.java.net/~jbachorik/8057149/webrev.01
>
> -JB-
>
>>
>> Otherwise looks good.
>>
>> /Staffan
>>
>> On 23 sep 2014, at 17:59, Jaroslav Bachorik
>> <jaroslav.bachorik at oracle.com> wrote:
>>
>>> Please, review this test change
>>>
>>> Issue : https://bugs.openjdk.java.net/browse/JDK-8057149
>>> Webrev: http://cr.openjdk.java.net/~jbachorik/8057149/webrev.00
>>>
>>> The test is using 'jcmd' to start/stop the JMX agent dynamically with
>>> various parameters. Some of the tests use a server socket to simulate
>>> starting the agent with the desired port not being available.
>>> However, it seems that in certain situations the agent does not
>>> provide a conscious error message and fails with RuntimeException
>>> instead. This seems to be timing related and therefore the solution
>>> is to retry the 'jcmd' command a few times with a delay to provide
>>> some cushion for the timing related problems.
>>>
>>> Thanks,
>>>
>>> -JB-
>>
>