RFR: JDK-8210337: runtime/NMT/VirtualAllocTestType.java failed on RuntimeException missing from stdout/stderr
gary.adams at oracle.com
gary.adams at oracle.com
Thu Nov 8 03:43:52 UTC 2018
There is recovery code that cleans up old attach files,
but a permission denied error would prevent the clean
up operation from taking place. We're looking at a case of
a file owned by another user as well as a pid recycling
and failed clean up situation.
On 11/7/18 3:09 PM, Chris Plummer wrote:
> Are you saying that the PID was recycled, so the issue is a test run
> from long ago leaving behind the file? If so, I thought there was
> something in the attach code that cleaned up these PID files that were
> left behind. In general I would not count this as an infra issue. PID
> files left behind are the fault of the JVM. Maybe there is something
> wrong with the cleanup code on solaris, or maybe we don't clean up on
> any platform, but cycle through PIDs a lot faster on solaris.
>
> Chris
>
> On 11/7/18 11:20 AM, Gary Adams wrote:
>> If there are no further suggestions on JDK-8210337,
>> I plan to close it out as cannot reproduce.
>>
>> Similar bugs had been filed for the "Permission denied" error
>> from the openDoor request failure and each was attributed
>> to an infrastructure issue. e.g. another user with the same
>> pid left a temporary file that is blocking the current test
>> from attaching correctly.
>>
>> On 10/4/18, 1:49 PM, Gary Adams wrote:
>>> My delay and retry did not fix the problem with permission denied.
>>>
>>> When I was diagnosing the problem I instrumented the code
>>> to catch an IOException and call checkPermission to get
>>> more detail about the IOException. The error reported
>>> from calling checkPermission was ENOENT (stat).
>>>
>>> The code change I then proposed was catch the IOException,
>>> delay, and retry the open. That fixed the problem of
>>> ENOENT, but had nothing to do with "permission denied".
>>>
>>> On 10/4/18, 1:25 PM, Chris Plummer wrote:
>>>> But I also thought you said the delay and retry fixed the problem.
>>>> How could fix the problem if it is just duplicating something that
>>>> is already in place?
>>>>
>>>> Chris
>>>>
>>>> On 10/4/18 9:48 AM, Gary Adams wrote:
>>>>> My delay and retry just duplicated the openDoor retry.
>>>>> The normal processing of FileNotFoundException(ENOENT) is to retry
>>>>> several times until the file is available.
>>>>>
>>>>> But the original problem reported is a "Permission denied"
>>>>> (EACCESS or EPERM).
>>>>> Delay and retry will not resolve a permissions error.
>>>>>
>>>>> On 10/4/18, 12:30 PM, Chris Plummer wrote:
>>>>>> Didn't the retry after 100ms delay work? If yes, why would it if
>>>>>> the problem is that a java_pid was not cleaned up?
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>> On 10/4/18 8:54 AM, Gary Adams wrote:
>>>>>>> First, let me retract the proposed change,
>>>>>>> it is not the right solution to the problem originally
>>>>>>> reported.
>>>>>>>
>>>>>>> Second, as a bit of explanation consider the code fragments below.
>>>>>>>
>>>>>>> The high level processing calls openDoor which is willing to retry
>>>>>>> the operation as long as the error is flagged specifically
>>>>>>> as a FileNotFoundException.
>>>>>>>
>>>>>>> VirtualMachineImpl.java:72
>>>>>>> VirtualMachineImpl.c:81
>>>>>>>
>>>>>>> During my testing I had added a check VirtualMachineImpl.java:214
>>>>>>> and when an IOException was detected made a call to checkPermissions
>>>>>>> to get more detailed information about the IOException. The error
>>>>>>> I saw was an ENOENT from the stat call. And not the detailed
>>>>>>> checks for
>>>>>>> specific permissions issues (VirtualMachineImpl.c:143)
>>>>>>>
>>>>>>> VirtualMachineImpl.c:118
>>>>>>> VirtualMachineImpl.c:147
>>>>>>>
>>>>>>> What I missed in the original proposed solution was a
>>>>>>> FileNotFoundException
>>>>>>> extends IOException. That means my delay and retry just
>>>>>>> duplicates the higher
>>>>>>> level retry around the openDoor call.
>>>>>>>
>>>>>>> Third, the original error message logged in the bug report :
>>>>>>>
>>>>>>> java.io.IOException: Permission denied
>>>>>>> at jdk.attach/sun.tools.attach.VirtualMachineImpl.open(Native
>>>>>>> Method)
>>>>>>>
>>>>>>> had to have come from
>>>>>>>
>>>>>>> VirtualMachineImpl.c:70
>>>>>>> VirtualMachineImpl.c:84
>>>>>>>
>>>>>>> which means the actual open call reported the file does exist
>>>>>>> but the permissions do not allow the file to be accessed.
>>>>>>> That also means the normal mechanism of removing leftover
>>>>>>> java_pid files would not have cleaned up another user's
>>>>>>> java_pid files.
>>>>>>>
>>>>>>> =====
>>>>>>> src/jdk.attach/solaris/classes/sun/tools/attach/VirtualMachineImpl.java:
>>>>>>> ...
>>>>>>> 67 // Opens the door file to the target VM. If
>>>>>>> the file is not
>>>>>>> 68 // found it might mean that the attach
>>>>>>> mechanism isn't started in the
>>>>>>> 69 // target VM so we attempt to start it and retry.
>>>>>>> 70 try {
>>>>>>> 71 fd = openDoor(pid);
>>>>>>> 72 } catch (FileNotFoundException fnf1) {
>>>>>>> 73 File f = createAttachFile(pid);
>>>>>>> 74 try {
>>>>>>> 75 sigquit(pid);
>>>>>>> 76
>>>>>>> 77 // give the target VM time to start
>>>>>>> the attach mechanism
>>>>>>> 78 final int delay_step = 100;
>>>>>>> 79 final long timeout = attachTimeout();
>>>>>>> 80 long time_spend = 0;
>>>>>>> 81 long delay = 0;
>>>>>>> 82 do {
>>>>>>> 83 // Increase timeout on each
>>>>>>> attempt to reduce polling
>>>>>>> 84 delay += delay_step;
>>>>>>> 85 try {
>>>>>>> 86 Thread.sleep(delay);
>>>>>>> 87 } catch (InterruptedException x) { }
>>>>>>> 88 try {
>>>>>>> 89 fd = openDoor(pid);
>>>>>>> 90 } catch (FileNotFoundException fnf2) {
>>>>>>> 91 // pass
>>>>>>> 92 }
>>>>>>> 93
>>>>>>> 94 time_spend += delay;
>>>>>>> 95 if (time_spend > timeout/2 && fd
>>>>>>> == -1) {
>>>>>>> 96 // Send QUIT again to give
>>>>>>> target VM the last chance to react
>>>>>>> 97 sigquit(pid);
>>>>>>> 98 }
>>>>>>> 99 } while (time_spend <= timeout && fd
>>>>>>> == -1);
>>>>>>> 100 if (fd == -1) {
>>>>>>> 101 throw new AttachNotSupportedException(
>>>>>>> 102 String.format("Unable to open door %s: " +
>>>>>>> 103 "target process %d doesn't
>>>>>>> respond within %dms " +
>>>>>>> 104 "or HotSpot VM not loaded",
>>>>>>> socket_path, pid, time_spend));
>>>>>>> 105 }
>>>>>>> ...
>>>>>>> 212 // The door is attached to .java_pid<pid> in the
>>>>>>> temporary directory.
>>>>>>> 213 private int openDoor(int pid) throws IOException {
>>>>>>> 214 socket_path = tmpdir + "/.java_pid" + pid;
>>>>>>> 215 fd = open(socket_path);
>>>>>>> 216
>>>>>>> 217 // Check that the file owner/permission to
>>>>>>> avoid attaching to
>>>>>>> 218 // bogus process
>>>>>>> 219 try {
>>>>>>> 220 checkPermissions(socket_path);
>>>>>>> 221 } catch (IOException ioe) {
>>>>>>> 222 close(fd);
>>>>>>> 223 throw ioe;
>>>>>>> 224 }
>>>>>>> 225 return fd;
>>>>>>> 226 }
>>>>>>>
>>>>>>> =====
>>>>>>> src/jdk.attach/solaris/native/libattach/VirtualMachineImpl.c:
>>>>>>> ...
>>>>>>> 59 JNIEXPORT jint JNICALL
>>>>>>> Java_sun_tools_attach_VirtualMachineImpl_open
>>>>>>> 60 (JNIEnv *env, jclass cls, jstring path)
>>>>>>> 61 {
>>>>>>> 62 jboolean isCopy;
>>>>>>> 63 const char* p = GetStringPlatformChars(env, path,
>>>>>>> &isCopy);
>>>>>>> 64 if (p == NULL) {
>>>>>>> 65 return 0;
>>>>>>> 66 } else {
>>>>>>> 67 int fd;
>>>>>>> 68 int err = 0;
>>>>>>> 69
>>>>>>> 70 fd = open(p, O_RDWR);
>>>>>>> 71 if (fd == -1) {
>>>>>>> 72 err = errno;
>>>>>>> 73 }
>>>>>>> 74
>>>>>>> 75 if (isCopy) {
>>>>>>> 76 JNU_ReleaseStringPlatformChars(env, path, p);
>>>>>>> 77 }
>>>>>>> 78
>>>>>>> 79 if (fd == -1) {
>>>>>>> 80 if (err == ENOENT) {
>>>>>>> 81 JNU_ThrowByName(env,
>>>>>>> "java/io/FileNotFoundException", NULL);
>>>>>>> 82 } else {
>>>>>>> 83 char* msg = strdup(strerror(err));
>>>>>>> 84 JNU_ThrowIOException(env, msg);
>>>>>>> 85 if (msg != NULL) {
>>>>>>> 86 free(msg);
>>>>>>> 87 }
>>>>>>> 88 }
>>>>>>> 89 }
>>>>>>> 90 return fd;
>>>>>>> 91 }
>>>>>>> 92 }
>>>>>>> ...
>>>>>>> 99 JNIEXPORT void JNICALL
>>>>>>> Java_sun_tools_attach_VirtualMachineImpl_checkPermissions
>>>>>>> 100 (JNIEnv *env, jclass cls, jstring path)
>>>>>>> 101 {
>>>>>>> 102 jboolean isCopy;
>>>>>>> 103 const char* p = GetStringPlatformChars(env, path,
>>>>>>> &isCopy);
>>>>>>> 104 if (p != NULL) {
>>>>>>> 105 struct stat64 sb;
>>>>>>> 106 uid_t uid, gid;
>>>>>>> 107 int res;
>>>>>>> 108
>>>>>>> 109 memset(&sb, 0, sizeof(struct stat64));
>>>>>>> 110
>>>>>>> 111 /*
>>>>>>> 112 * Check that the path is owned by the
>>>>>>> effective uid/gid of this
>>>>>>> 113 * process. Also check that group/other access
>>>>>>> is not allowed.
>>>>>>> 114 */
>>>>>>> 115 uid = geteuid();
>>>>>>> 116 gid = getegid();
>>>>>>> 117
>>>>>>> 118 res = stat64(p, &sb);
>>>>>>> 119 if (res != 0) {
>>>>>>> 120 /* save errno */
>>>>>>> 121 res = errno;
>>>>>>> 122 }
>>>>>>> 123
>>>>>>> 124 if (res == 0) {
>>>>>>> 125 char msg[100];
>>>>>>> 126 jboolean isError = JNI_FALSE;
>>>>>>> 127 if (sb.st_uid != uid && uid != ROOT_UID) {
>>>>>>> 128 snprintf(msg, sizeof(msg),
>>>>>>> 129 "file should be owned by the
>>>>>>> current user (which is %d) but is owned by %d", uid, sb.st_uid);
>>>>>>> 130 isError = JNI_TRUE;
>>>>>>> 131 } else if (sb.st_gid != gid && uid !=
>>>>>>> ROOT_UID) {
>>>>>>> 132 snprintf(msg, sizeof(msg),
>>>>>>> 133 "file's group should be the
>>>>>>> current group (which is %d) but the group is %d", gid, sb.st_gid);
>>>>>>> 134 isError = JNI_TRUE;
>>>>>>> 135 } else if ((sb.st_mode &
>>>>>>> (S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH)) != 0) {
>>>>>>> 136 snprintf(msg, sizeof(msg),
>>>>>>> 137 "file should only be readable and
>>>>>>> writable by the owner but has 0%03o access", sb.st_mode & 0777);
>>>>>>> 138 isError = JNI_TRUE;
>>>>>>> 139 }
>>>>>>> 140 if (isError) {
>>>>>>> 141 char buf[256];
>>>>>>> 142 snprintf(buf, sizeof(buf), "well-known
>>>>>>> file %s is not secure: %s", p, msg);
>>>>>>> 143 JNU_ThrowIOException(env, buf);
>>>>>>> 144 }
>>>>>>> 145 } else {
>>>>>>> 146 char* msg = strdup(strerror(res));
>>>>>>> 147 JNU_ThrowIOException(env, msg);
>>>>>>> 148 if (msg != NULL) {
>>>>>>> 149 free(msg);
>>>>>>> 150 }
>>>>>>> 151 }
>>>>>>>
>>>>>>> On 10/2/18, 6:23 PM, David Holmes wrote:
>>>>>>>> Minor correction: EPERM -> EACCES for Solaris
>>>>>>>>
>>>>>>>> Hard to see how to get a transient EACCES when opening a file
>>>>>>>> ... though as it is really a door I guess there could be
>>>>>>>> additional complexity.
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> On 3/10/2018 7:54 AM, Chris Plummer wrote:
>>>>>>>>> On 10/2/18 2:38 PM, David Holmes wrote:
>>>>>>>>>> Chris,
>>>>>>>>>>
>>>>>>>>>> On 3/10/2018 6:57 AM, Chris Plummer wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 10/2/18 1:44 PM, gary.adams at oracle.com wrote:
>>>>>>>>>>>> The general attach sequence ...
>>>>>>>>>>>>
>>>>>>>>>>>> src/jdk.attach/solaris/classes/sun/tools/attach/VirtualMachineImpl.java
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> the attacher creates an attach_pid file in a directory
>>>>>>>>>>>> where the attachee is runnning
>>>>>>>>>>>> issues a signal to the attacheee
>>>>>>>>>>>>
>>>>>>>>>>>> loops waiting for the java_pid file to be created
>>>>>>>>>>>> default timeout is 10 seconds
>>>>>>>>>>>>
>>>>>>>>>>> So getting a FileNotFoundException while in this loop is OK,
>>>>>>>>>>> but IOException is not.
>>>>>>>>>>>
>>>>>>>>>>>> src/hotspot/os/solaris/attachListener_solaris.cpp
>>>>>>>>>>>>
>>>>>>>>>>>> attachee creates the java_pid file
>>>>>>>>>>>> listens til the attacher opens the door
>>>>>>>>>>>>
>>>>>>>>>>> I'm don't think this is related, but JDK-8199811 made a fix
>>>>>>>>>>> in attachListener_solaris.cpp to make it wait up to 10
>>>>>>>>>>> seconds for initialization to complete before failing the
>>>>>>>>>>> enqueue.
>>>>>>>>>>>
>>>>>>>>>>>> ...
>>>>>>>>>>>> Not sure when a bare IOException is thrown rather than the
>>>>>>>>>>>> more specific FileNotFoundException.
>>>>>>>>>>> Where is the IOException originating from? I wonder if the
>>>>>>>>>>> issue is that the file is in the process of being created,
>>>>>>>>>>> but is not fully created yet. Maybe it is there, but
>>>>>>>>>>> owner/group/permissions have not been set yet, and this
>>>>>>>>>>> results in an IOException instead of FileNotFoundException.
>>>>>>>>>>
>>>>>>>>>> The exception is shown in the bug report:
>>>>>>>>>>
>>>>>>>>>> [java.io.IOException: Permission denied
>>>>>>>>>> at jdk.attach/sun.tools.attach.VirtualMachineImpl.open(Native
>>>>>>>>>> Method)
>>>>>>>>>> at
>>>>>>>>>> jdk.attach/sun.tools.attach.VirtualMachineImpl.openDoor(VirtualMachineImpl.java:215)
>>>>>>>>>>
>>>>>>>>>> at
>>>>>>>>>> jdk.attach/sun.tools.attach.VirtualMachineImpl.<init>(VirtualMachineImpl.java:71)
>>>>>>>>>>
>>>>>>>>>> at
>>>>>>>>>> jdk.attach/sun.tools.attach.AttachProviderImpl.attachVirtualMachine(AttachProviderImpl.java:58)
>>>>>>>>>>
>>>>>>>>>> at
>>>>>>>>>> jdk.attach/com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:207)
>>>>>>>>>>
>>>>>>>>>> at
>>>>>>>>>> jdk.jcmd/sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:114)
>>>>>>>>>> at jdk.jcmd/sun.tools.jcmd.JCmd.main(JCmd.java:98)
>>>>>>>>>>
>>>>>>>>>> And if you look at the native code the EPERM from open will
>>>>>>>>>> cause IOException to be thrown.
>>>>>>>>>>
>>>>>>>>>> ./jdk.attach/solaris/native/libattach/VirtualMachineImpl.c
>>>>>>>>>>
>>>>>>>>>> JNIEXPORT jint JNICALL
>>>>>>>>>> Java_sun_tools_attach_VirtualMachineImpl_open
>>>>>>>>>> (JNIEnv *env, jclass cls, jstring path)
>>>>>>>>>> {
>>>>>>>>>> jboolean isCopy;
>>>>>>>>>> const char* p = GetStringPlatformChars(env, path, &isCopy);
>>>>>>>>>> if (p == NULL) {
>>>>>>>>>> return 0;
>>>>>>>>>> } else {
>>>>>>>>>> int fd;
>>>>>>>>>> int err = 0;
>>>>>>>>>>
>>>>>>>>>> fd = open(p, O_RDWR);
>>>>>>>>>> if (fd == -1) {
>>>>>>>>>> err = errno;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> if (isCopy) {
>>>>>>>>>> JNU_ReleaseStringPlatformChars(env, path, p);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> if (fd == -1) {
>>>>>>>>>> if (err == ENOENT) {
>>>>>>>>>> JNU_ThrowByName(env,
>>>>>>>>>> "java/io/FileNotFoundException", NULL);
>>>>>>>>>> } else {
>>>>>>>>>> char* msg = strdup(strerror(err));
>>>>>>>>>> JNU_ThrowIOException(env, msg);
>>>>>>>>>> if (msg != NULL) {
>>>>>>>>>> free(msg);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We should add the path to the exception message.
>>>>>>>>>>
>>>>>>>>> Thanks David. So if EPERM is the error and a retry 100ms later
>>>>>>>>> works, I think that supports my hypothesis that the file is
>>>>>>>>> not quite fully created. So Gary's fix is probably fine. The
>>>>>>>>> only other possible fix I can think of that wouldn't require
>>>>>>>>> an explicit delay (or multiple retries) is probably not worth
>>>>>>>>> the complexity. It would require that the attachee create two
>>>>>>>>> files, and the attacher try to open the second file first.
>>>>>>>>> When it either opens or returns EPERM, you know the first file
>>>>>>>>> can safety be opened.
>>>>>>>>>
>>>>>>>>> Chris
>>>>>>>>>> David
>>>>>>>>>> -----
>>>>>>>>>>
>>>>>>>>>>> Chris
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 10/2/18 4:11 PM, Chris Plummer wrote:
>>>>>>>>>>>>> Can you summarize how the attach handshaking is suppose to
>>>>>>>>>>>>> work? I'm just wondering why the attacher would ever be
>>>>>>>>>>>>> looking for the file before the attachee has created it.
>>>>>>>>>>>>> It seems a proper handshake would prevent this. Maybe
>>>>>>>>>>>>> there's some sort of visibility issue where the attachee
>>>>>>>>>>>>> has indeed created the file, but it is not immediately
>>>>>>>>>>>>> visible to the attacher process.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Chris
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/2/18 12:27 PM, gary.adams at oracle.com wrote:
>>>>>>>>>>>>>> The problem reproduced pretty quickly.
>>>>>>>>>>>>>> I added a call to checkPermission and revealed the
>>>>>>>>>>>>>> "file not found" from the stat call when the IOException
>>>>>>>>>>>>>> was detected.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There has been some flakiness from the Solaris test
>>>>>>>>>>>>>> machines today,
>>>>>>>>>>>>>> so I'll continue with the testing a bit longer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/2/18 3:12 PM, Chris Plummer wrote:
>>>>>>>>>>>>>>> Without the fix was this issue easy enough to reproduce
>>>>>>>>>>>>>>> that you can be sure this is resolving it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Chris
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 10/2/18 8:16 AM, Gary Adams wrote:
>>>>>>>>>>>>>>>> Solaris debug builds are failing tests that use the
>>>>>>>>>>>>>>>> attach interface.
>>>>>>>>>>>>>>>> An IOException is reported when the java_pid file is
>>>>>>>>>>>>>>>> not opened.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It appears that the attempt to attach is taking place
>>>>>>>>>>>>>>>> too quickly.
>>>>>>>>>>>>>>>> This workaround will allow the open operation to be
>>>>>>>>>>>>>>>> retried
>>>>>>>>>>>>>>>> after a short pause.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Webrev:
>>>>>>>>>>>>>>>> http://cr.openjdk.java.net/~gadams/8210337/webrev/
>>>>>>>>>>>>>>>> Issue: https://bugs.openjdk.java.net/browse/JDK-8210337
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Testing is in progress.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20181107/71d7aa75/attachment-0001.html>
More information about the serviceability-dev
mailing list