RFR for JDK-8030057: speed up forceLogSnapshot and checkAnnotations
stuart.marks at oracle.com
Sat Dec 21 01:58:52 UTC 2013
First of all, these are two completely separate changes. They're sort-of related
in that they involve replacing sleep() loops and polling with different
constructs, but other than that, they're completely independent. As you'll see
from my review comments, there are different issues going on with each as well.
It might be worthwhile to consider splitting these into separate subtasks, one
for each of these changes, so that if one of the changes converges more quickly
(seems like) it can go in independently of the others.
This is a pretty complicated test to begin with, and the changes make it even
more complicated. It adds two threads, a single Lock, two conditions, and stream
subclasses that notify these conditions. It's not clear to me why the low-level
constructs from j.u.concurrent.locks are necessary. At first glance the idea of
having a notifying specialization of a ByteArrayOutputStream is reasonable. I'd
expect the lock and condition to be inside the stream object, though, not
external to it. In addition, the condition being notified is simply "something
has been written". We don't actually know that all the output from the
subprocess has been read. The old version waited around for a while and assumed
that it had collected it all. This of course is racy; but having the notifying
stream is still racy, if it happens that the output is written in multiple chunks.
Note that if we want to get subprocess output as it arrives, there are two
threads hidden inside of RMI's TestLibrary.JavaVM class that collect stdout and
stderr from the subprocess. By default they simply write to whatever stream they
are given. An alternative approach would be to plumb an interface to these
threads and have the output sent directly to the test as it comes in. (I think
one of the other RMI tests does this, though it does it in a particularly clumsy
way that I wouldn't recommend emulating.)
But stepping back a bit, the original test is over-complicated to begin with,
and I'm reluctant to add all these additional mechanisms in this proposed
change, because they add more complexity, and also because I can't quite
convince myself that they're correct or that they even solve the problem that
exists in the original test.
The original test probably needs to be reconceived entirely. The idea is to
activate several objects in several different activation groups (which are child
JVMs of the RMID process), and to make sure that the output from each is
"annotated" with a prefix that indicates which activation group it came from. It
seems to me that a much simpler way to approach the problem is to activate
several objects and have them each emit a unique bit of output. Then, shut
everything down, collect ALL the output, and check to make sure that each
object's unique output is on a line that was annotated properly. This completely
avoids threads, locks, conditions, timing loops, and race conditions.
If you think this alternative approach would be an effective way to proceed, I'd
be happy to help you out with it.
This is another fairly complicated test that is partially helped by the addition
of CountDownLatches. Basically, it registers 50 activatable+restartable objects
and 50 activatable (but not restartable) objects with the activation system.
Then it restarts the activation system. The test wants to ensure that the
restartable objects are all restarted, but that *none* of the non-restartable
objects are restarted.
In the first case, having a CountDownLatch that starts at 50 and awaits for them
all to be restarted makes perfect sense. The await() call has a timeout to make
sure we don't wait excessively long for all 50 to be restarted, but the
CountDownLatch allows the test to continue immediately as soon as all 50 have
been restarted. For this case, it's also not necessary to have a retry loop in
waitAllStarted(). We just wait once until the latch is decremented to zero, or
until we time out. It might make sense to raise the timeout here, since there's
a lot of activity; 30 or even 60 seconds might be reasonable. In the normal case
the restarts will all occur quickly, so we'll only wait the full timeout if
there's a failure.
In the second case, we expect zero objects to be restarted, so having a
CountDownLatch starting at 50 doesn't make sense. There is no notion of having
the test proceed immediately as soon as "all" of some events have happened. One
choice here is to wait around for a while and count how many of the
non-restartable objects actually get restarted. An AtomicInteger would be
sufficient for that. An alternative might be to have a CountDownLatch
initialized to 1. If any one of the non-restartable objects is restarted, we
know immediately that the test has failed and we can report that. We might not
care if additional non-restartable objects are restarted after we've observed
the first one having been restarted erroneously. (Then again, we might.) But if
others do, decrementing the CountDownLatch further after it's reached zero does
nothing, so that's OK.
Unfortunately, the successful case needs the test to wait around for a while,
because it's trying to verify a negative, i.e., that no non-restartable objects
were actually restarted. I'm not sure whether that's better done with a
CountDownLatch(1) plus timeout, or a sleep call followed by a check of an
AtomicInteger. I think the latter is a bit easier understand, since the
"latching" behavior in this case is the error, not the normal condition.
Finally, these changes are intended to improve performance. It would be good to
see some measurements to ensure that the performance has actually improved. The
CountDownLatch stuff is an improvement over the original array of booleans, and
it potentially makes the code cleaner, but I think its potential for performance
improvement is limited.
On my system the test run takes about 80 seconds. (It actually does all of the
above twice: once after shutting down rmid, and once after causing the
activation group JVMs to exit ("crash").) The howManyRestarted() loop for
restartable objects sleeps for ten seconds before polling the number of objects
restarted. On average it might wait five seconds too long; given the two test
scenarios, that's a savings of ten seconds. The howManyRestarted() loop for the
objects that *aren't* supposed to be restarted waits for 20 seconds
unconditionally. With two scenarios run, that's 40 seconds right there, fully
half of the time of the test run. (Where does the other half of the time go?)
That time is unavoidable waiting around to make sure the wrong thing (restarting
a non-restartable object) doesn't happen.
I think some of the CountDownLatch stuff will improve the test code, but let's
not make any assumptions about performance.
On 12/18/13 7:37 AM, Tristan Yan wrote:
> Hi Everyone
> Please help to review the code change for bug JDK-8030057.
> Performance improvement for two RMI tests
> method waitAllStarted is using recursive sleep to poll 50 restartedObject to be
> true, we can use modern CountDownLatch to implement blocking-time wait.
> Also suggest shorten waiting time to 5 seconds.
> We can subclass ByteArrayOutputStream which support notification when data was
> written. Also use two thread wait output string and error string to be not null
> Thank you
More information about the core-libs-dev