Review request: 8004042 : Arrrghs.java test failed on windows with access error.
A fix for intermittent test failures causing grief on Windows, particularly Windows 7 and 2008 systems: http://cr.openjdk.java.net/~ddehaven/8004042/webrev.1/ The underlying cause is as of yet unknown, but I was able to create an environment that caused similar failures by running "watch -n 0.2 cat JTwork/scratch/atest.bat" in another shell then running jtreg. Without the patch it would fail very regularly (especially when isolated and looped 1,000 times), with the patch it always passes. -DrD-
On 12/5/12 2:44 PM, David DeHaven wrote:
A fix for intermittent test failures causing grief on Windows, particularly Windows 7 and 2008 systems: http://cr.openjdk.java.net/~ddehaven/8004042/webrev.1/
The underlying cause is as of yet unknown, but I was able to create an environment that caused similar failures by running "watch -n 0.2 cat JTwork/scratch/atest.bat" in another shell then running jtreg. Without the patch it would fail very regularly (especially when isolated and looped 1,000 times), with the patch it always passes.
We had been discussing/speculating that the problem is the virus scanner checking the old incarnation of the file at the same time we want to create a new one. If so, and if it were to hold the file open with exclusive access, that would explain the AccessDeniedException instead of FileAlreadyExistsException (in the CREATE_NEW case). (I still don't have an explanation for why the file deletion doesn't throw an exception. That's what seems to cause jtreg some grief in its cleanup phase, and for that reason we've added retry logic to jtreg.) In any case I think backing off and retrying is probably what's getting us the benefit, not the adjustment of the file creation flags. In any case I'd increase the number of retry attempts. The test environment is surprisingly hostile. In the RMI tests we try to open an unused network port and retry 10 times if that fails, and sometimes that's not enough. (See Jim Gish's recent changeset.) InterruptedException shouldn't be ignored. Jtreg will interrupt the test if it times out, so this interruption should be handled gracefully. Perhaps, wrap the InterruptedException in an IOException and rethrow it? (Since the caller is clearly prepared to handle IOException.) Terminating the loop and returning normally doesn't seem right, since the file wasn't created successfully. This is only style, but perhaps it would be good to get rid of the 'done' boolean and replace the 'done = true' statement with a 'return'. This simplifies things a bit, I think. s'marks
A fix for intermittent test failures causing grief on Windows, particularly Windows 7 and 2008 systems: http://cr.openjdk.java.net/~ddehaven/8004042/webrev.1/
The underlying cause is as of yet unknown, but I was able to create an environment that caused similar failures by running "watch -n 0.2 cat JTwork/scratch/atest.bat" in another shell then running jtreg. Without the patch it would fail very regularly (especially when isolated and looped 1,000 times), with the patch it always passes.
We had been discussing/speculating that the problem is the virus scanner checking the old incarnation of the file at the same time we want to create a new one. If so, and if it were to hold the file open with exclusive access, that would explain the AccessDeniedException instead of FileAlreadyExistsException (in the CREATE_NEW case).
(I still don't have an explanation for why the file deletion doesn't throw an exception. That's what seems to cause jtreg some grief in its cleanup phase, and for that reason we've added retry logic to jtreg.)
I see a lot of quirky filesystem behavior just using Windows 7, especially compared to Unixish systems … but maybe that's just me and my environment :)
In any case I think backing off and retrying is probably what's getting us the benefit, not the adjustment of the file creation flags.
I agree… I also think this is deeper than just the virus scanner, but I've not the expertise to validate that claim.
In any case I'd increase the number of retry attempts. The test environment is surprisingly hostile. In the RMI tests we try to open an unused network port and retry 10 times if that fails, and sometimes that's not enough. (See Jim Gish's recent changeset.)
For 8004317 (I think you pushed it yesterday)? I'll take a look at it. That was my initial value. I backed it off thinking it was overly aggressive, but if the environment is that hostile perhaps it's not.
InterruptedException shouldn't be ignored. Jtreg will interrupt the test if it times out, so this interruption should be handled gracefully. Perhaps, wrap the InterruptedException in an IOException and rethrow it? (Since the caller is clearly prepared to handle IOException.) Terminating the loop and returning normally doesn't seem right, since the file wasn't created successfully.
Good point! I'll add that. I think I should through RuntimeException in both cases, to be consistent.
This is only style, but perhaps it would be good to get rid of the 'done' boolean and replace the 'done = true' statement with a 'return'. This simplifies things a bit, I think.
I rewrote it to use a for loop instead and it's much cleaner. I'll post a new webrev later today. -DrD-
New webrev posted: http://cr.openjdk.java.net/~ddehaven/8004042/webrev.2/ Summary: - Cleaned it up a bit by change to a for loop (eliminating done flag) - Throws RuntimeException instead of IOException, handles InterruptedException - Bumped retry count to 10 - Changed @run mode to "main/othervm" as suggested by Kumar -DrD-
OK, looks pretty good. But.... One little thing is still bothering me. Suppose we get interrupted from the sleep() and bail out on that basis. If we got to the point where we're sleeping, we must have caught an AccessDeniedException previously. Unfortunately this exception is discarded if we were interrupted. This might lose valuable diagnostic information. So, what do we do with it? How about adding: ie.addSuppressed(cause); in the catch InterruptedException clause, before throwing the RuntimeException. (There's another issue which is that if there were previous retries, the ADEs from them are thrown away. But maybe we should save that one for another day.) s'marks On 12/6/12 9:41 AM, David DeHaven wrote:
New webrev posted: http://cr.openjdk.java.net/~ddehaven/8004042/webrev.2/
Summary: - Cleaned it up a bit by change to a for loop (eliminating done flag) - Throws RuntimeException instead of IOException, handles InterruptedException - Bumped retry count to 10 - Changed @run mode to "main/othervm" as suggested by Kumar
-DrD-
OK, looks pretty good. But....
One little thing is still bothering me. Suppose we get interrupted from the sleep() and bail out on that basis. If we got to the point where we're sleeping, we must have caught an AccessDeniedException previously. Unfortunately this exception is discarded if we were interrupted. This might lose valuable diagnostic information. So, what do we do with it? How about adding:
ie.addSuppressed(cause);
in the catch InterruptedException clause, before throwing the RuntimeException.
Another good point...
(There's another issue which is that if there were previous retries, the ADEs from them are thrown away. But maybe we should save that one for another day.)
I had the same thought, but aside from collecting and reporting all of them somehow I'm not sure what could be done about it. Maybe instead of: cause = ade; do: if (cause != null) { cause.addSuppressed(ade); } else { cause = ade; } Then they'll at least all be reported when RuntimeException is thrown. -DrD-
OK, looks pretty good. But....
One little thing is still bothering me. Suppose we get interrupted from the sleep() and bail out on that basis. If we got to the point where we're sleeping, we must have caught an AccessDeniedException previously. Unfortunately this exception is discarded if we were interrupted. This might lose valuable diagnostic information. So, what do we do with it? How about adding:
ie.addSuppressed(cause);
in the catch InterruptedException clause, before throwing the RuntimeException. Another good point...
(There's another issue which is that if there were previous retries, the ADEs from them are thrown away. But maybe we should save that one for another day.) I had the same thought, but aside from collecting and reporting all of them somehow I'm not sure what could be done about it.
Maybe instead of: cause = ade;
do: if (cause != null) { cause.addSuppressed(ade); } else { cause = ade; }
Then they'll at least all be reported when RuntimeException is thrown. Very good point indeed, how about we also capture/print unconditionally how many retries were made, this might give us some indication of the gremlin we are dealing with.
Kumar
-DrD-
On 12/7/12 7:38 AM, David DeHaven wrote:
(There's another issue which is that if there were previous retries, the ADEs from them are thrown away. But maybe we should save that one for another day.)
I had the same thought, but aside from collecting and reporting all of them somehow I'm not sure what could be done about it.
Maybe instead of: cause = ade;
do: if (cause != null) { cause.addSuppressed(ade); } else { cause = ade; }
Then they'll at least all be reported when RuntimeException is thrown.
Ah, ok, this isn't bad at all! Let's go with this. Did you need somebody to push this for you? s'marks
(There's another issue which is that if there were previous retries, the ADEs from them are thrown away. But maybe we should save that one for another day.)
I had the same thought, but aside from collecting and reporting all of them somehow I'm not sure what could be done about it.
Maybe instead of: cause = ade;
do: if (cause != null) { cause.addSuppressed(ade); } else { cause = ade; }
Then they'll at least all be reported when RuntimeException is thrown.
Ah, ok, this isn't bad at all! Let's go with this.
Sounds like a plan :)
Did you need somebody to push this for you?
Kumar is going to push for me. -DrD-
(There's another issue which is that if there were previous retries, the ADEs from them are thrown away. But maybe we should save that one for another day.)
I had the same thought, but aside from collecting and reporting all of them somehow I'm not sure what could be done about it.
Maybe instead of: cause = ade;
do: if (cause != null) { cause.addSuppressed(ade); } else { cause = ade; }
Then they'll at least all be reported when RuntimeException is thrown.
Ah, ok, this isn't bad at all! Let's go with this.
Sounds like a plan :)
Did you need somebody to push this for you?
Kumar is going to push for me.
I updated the webrev, unfortunately I lost count and updated the same webrev so it's still at webrev.2. I'm doing a final sanity check and will submit a JPRT run when that's done. http://cr.openjdk.java.net/~ddehaven/8004042/webrev.2/ -DrD-
Hopefully final webrev for this change: http://cr.openjdk.java.net/~ddehaven/8004042/webrev.3/ I moved the createBatFile method to TestHelper as I thought it could be useful to other tests and added more diagnostic info. It turns out the culprit on my machine was Windows' "Application Experience" service. Each time the batch file is launched svchost.exe tries to access the file, if it doesn't release it's file locks quickly enough then when java attempts to create a new file it fails with AccessDeniedException. In procmon this was showing up as a "DELETE PENDING" result, followed by svchost.exe getting the same result at least twice then giving up. Disabling that service allowed me to run my filesystem tests unpatched without failing. I don't see any other option to fix this as this is somewhat expected of asynchronous access in a multitasking environment. Maybe there's some means to synchronously delete files... -DrD-
On 12/7/12 1:39 PM, David DeHaven wrote:
Hopefully final webrev for this change: http://cr.openjdk.java.net/~ddehaven/8004042/webrev.3/
I moved the createBatFile method to TestHelper as I thought it could be useful to other tests and added more diagnostic info.
It turns out the culprit on my machine was Windows' "Application Experience" service. Each time the batch file is launched svchost.exe tries to access the file, if it doesn't release it's file locks quickly enough then when java attempts to create a new file it fails with AccessDeniedException. In procmon this was showing up as a "DELETE PENDING" result, followed by svchost.exe getting the same result at least twice then giving up. Disabling that service allowed me to run my filesystem tests unpatched without failing.
I don't see any other option to fix this as this is somewhat expected of asynchronous access in a multitasking environment. Maybe there's some means to synchronously delete files...
Ah, the mysteries of Windows. The test looks fine. s'marks
On 07/12/2012 21:39, David DeHaven wrote:
It turns out the culprit on my machine was Windows' "Application Experience" service. Each time the batch file is launched svchost.exe tries to access the file, if it doesn't release it's file locks quickly enough then when java attempts to create a new file it fails with AccessDeniedException. In procmon this was showing up as a "DELETE PENDING" result, followed by svchost.exe getting the same result at least twice then giving up. Disabling that service allowed me to run my filesystem tests unpatched without failing.
Good sleuthing, I guess some of us owe at least one AV vendor a written apology :-) I wonder if we should submit a bug to Microsoft, it's very surprising to have a service running by default that causes interference like this. This investigation also makes me wonder if java/lang/Runtime/exec/WinCommand.java might be another victim. -Alan
participants (4)
-
Alan Bateman
-
David DeHaven
-
Kumar Srinivasan
-
Stuart Marks