RFR for JDK-8030057: speed up forceLogSnapshot and checkAnnotations

Sat Dec 21 01:58:52 UTC 2013

Hi Tristan,

First of all, these are two completely separate changes. They're sort-of related 
in that they involve replacing sleep() loops and polling with different 
constructs, but other than that, they're completely independent. As you'll see 
from my review comments, there are different issues going on with each as well. 
It might be worthwhile to consider splitting these into separate subtasks, one 
for each of these changes, so that if one of the changes converges more quickly 
(seems like) it can go in independently of the others.

CheckAnnotations

This is a pretty complicated test to begin with, and the changes make it even 
more complicated. It adds two threads, a single Lock, two conditions, and stream 
subclasses that notify these conditions. It's not clear to me why the low-level 
constructs from j.u.concurrent.locks are necessary. At first glance the idea of 
having a notifying specialization of a ByteArrayOutputStream is reasonable. I'd 
expect the lock and condition to be inside the stream object, though, not 
external to it. In addition, the condition being notified is simply "something 
has been written". We don't actually know that all the output from the 
subprocess has been read. The old version waited around for a while and assumed 
that it had collected it all. This of course is racy; but having the notifying 
stream is still racy, if it happens that the output is written in multiple chunks.

Note that if we want to get subprocess output as it arrives, there are two 
threads hidden inside of RMI's TestLibrary.JavaVM class that collect stdout and 
stderr from the subprocess. By default they simply write to whatever stream they 
are given. An alternative approach would be to plumb an interface to these 
threads and have the output sent directly to the test as it comes in. (I think 
one of the other RMI tests does this, though it does it in a particularly clumsy 
way that I wouldn't recommend emulating.)

But stepping back a bit, the original test is over-complicated to begin with, 
and I'm reluctant to add all these additional mechanisms in this proposed 
change, because they add more complexity, and also because I can't quite 
convince myself that they're correct or that they even solve the problem that 
exists in the original test.

The original test probably needs to be reconceived entirely. The idea is to 
activate several objects in several different activation groups (which are child 
JVMs of the RMID process), and to make sure that the output from each is 
"annotated" with a prefix that indicates which activation group it came from. It 
seems to me that a much simpler way to approach the problem is to activate 
several objects and have them each emit a unique bit of output. Then, shut 
everything down, collect ALL the output, and check to make sure that each 
object's unique output is on a line that was annotated properly. This completely 
avoids threads, locks, conditions, timing loops, and race conditions.

If you think this alternative approach would be an effective way to proceed, I'd 
be happy to help you out with it.

ForceLogSnapshot

This is another fairly complicated test that is partially helped by the addition 
of CountDownLatches. Basically, it registers 50 activatable+restartable objects 
and 50 activatable (but not restartable) objects with the activation system. 
Then it restarts the activation system. The test wants to ensure that the 
restartable objects are all restarted, but that *none* of the non-restartable 
objects are restarted.

In the first case, having a CountDownLatch that starts at 50 and awaits for them 
all to be restarted makes perfect sense. The await() call has a timeout to make 
sure we don't wait excessively long for all 50 to be restarted, but the 
CountDownLatch allows the test to continue immediately as soon as all 50 have 
been restarted. For this case, it's also not necessary to have a retry loop in 
waitAllStarted(). We just wait once until the latch is decremented to zero, or 
until we time out. It might make sense to raise the timeout here, since there's 
a lot of activity; 30 or even 60 seconds might be reasonable. In the normal case 
the restarts will all occur quickly, so we'll only wait the full timeout if 
there's a failure.

In the second case, we expect zero objects to be restarted, so having a 
CountDownLatch starting at 50 doesn't make sense. There is no notion of having 
the test proceed immediately as soon as "all" of some events have happened. One 
choice here is to wait around for a while and count how many of the 
non-restartable objects actually get restarted. An AtomicInteger would be 
sufficient for that. An alternative might be to have a CountDownLatch 
initialized to 1. If any one of the non-restartable objects is restarted, we 
know immediately that the test has failed and we can report that. We might not 
care if additional non-restartable objects are restarted after we've observed 
the first one having been restarted erroneously. (Then again, we might.) But if 
others do, decrementing the CountDownLatch further after it's reached zero does 
nothing, so that's OK.

Unfortunately, the successful case needs the test to wait around for a while, 
because it's trying to verify a negative, i.e., that no non-restartable objects 
were actually restarted. I'm not sure whether that's better done with a 
CountDownLatch(1) plus timeout, or a sleep call followed by a check of an 
AtomicInteger. I think the latter is a bit easier understand, since the 
"latching" behavior in this case is the error, not the normal condition.

Finally, these changes are intended to improve performance. It would be good to 
see some measurements to ensure that the performance has actually improved. The 
CountDownLatch stuff is an improvement over the original array of booleans, and 
it potentially makes the code cleaner, but I think its potential for performance 
improvement is limited.

On my system the test run takes about 80 seconds. (It actually does all of the 
above twice: once after shutting down rmid, and once after causing the 
activation group JVMs to exit ("crash").) The howManyRestarted() loop for 
restartable objects sleeps for ten seconds before polling the number of objects 
restarted. On average it might wait five seconds too long; given the two test 
scenarios, that's a savings of ten seconds. The howManyRestarted() loop for the 
objects that *aren't* supposed to be restarted waits for 20 seconds 
unconditionally. With two scenarios run, that's 40 seconds right there, fully 
half of the time of the test run. (Where does the other half of the time go?) 
That time is unavoidable waiting around to make sure the wrong thing (restarting 
a non-restartable object) doesn't happen.

I think some of the CountDownLatch stuff will improve the test code, but let's 
not make any assumptions about performance.

s'marks

On 12/18/13 7:37 AM, Tristan Yan wrote:
> Hi Everyone
>
> Please help to review the code change for bug JDK-8030057.
>
>
> http://cr.openjdk.java.net/~tyan/JDK-8030057/webrev.00/
> <http://cr.openjdk.java.net/%7Etyan/JDK-8030057/webrev.00/>
> Description:
>
> Performance improvement for two RMI tests
>
> java/rmi/activation/Activatable/forceLogSnapshot
>
> method waitAllStarted is using recursive sleep to poll 50 restartedObject to be
> true, we can use modern CountDownLatch to implement blocking-time wait.
>
> Also suggest shorten waiting time to 5 seconds.
>
> java/rmi/activation/Activatable/checkAnnotations
>
> We can subclass ByteArrayOutputStream which support notification when data was
> written. Also use two thread wait output string and error string to be not null
>
> Thank you
> Tristan