RFR: 8286030: Avoid JVM crash when containers share the same /tmp dir [v2]
Ioi Lam
iklam at openjdk.org
Thu Jul 7 22:02:29 UTC 2022
On Thu, 7 Jul 2022 14:17:19 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:
>> Ioi Lam has updated the pull request incrementally with one additional commit since the last revision:
>>
>> @tstuefe comments
>
> src/hotspot/os/posix/perfMemory_posix.cpp line 708:
>
>> 706: fd = ::open(filename, O_RDONLY);
>> 707: if (fd >= 0) {
>> 708: is_locked = (flock(fd, LOCK_EX|LOCK_NB) != 0);
>
> Out of the n error conditions for flock(), only EWOULDBLOCK indicates a occupied lock. I would handle the rest differently.
Fixed.
> src/hotspot/os/posix/perfMemory_posix.cpp line 720:
>
>> 718:
>> 719: return is_locked;
>> 720: }
>
> The interface of this function seems weirdly unbalanced and threw me off at first. We hand in fd=-1, and it opens the fd, maybe closes the fd, maybe returns a valid fd (via reference, one has to look closely), maybe returns an invalid but valid-looking fd...
>
> I would disentangle this a bit either bei moving the open and the close out to the caller and making this a simple
>
>
> static bool is_locked_by_another_process(int fd);
>
>
> or, alternatively, just inlining the whole section into the one place where we use it.
I inlined the whole function and added more error handling code.
> src/hotspot/os/posix/perfMemory_posix.cpp line 770:
>
>> 768: // create_sharedmem_file() and is_locked_by_another_process().
>> 769: // If it's already locked by another process, then obviously it's
>> 770: // not stale
>
> Period missing.
Fixed
> src/hotspot/os/posix/perfMemory_posix.cpp line 781:
>
>> 779: // signal the process, then the file is assumed to
>> 780: // be stale and is removed because the files for such a
>> 781: // process should be in a different user specific directory.
>
> I am not sure this is good. If two conflicting hotspots share the same PID in tmp, from two different users, this will probably be a setup error. Is the best way really to provoke SIGBUS in the other VM? Seems a bit harsh.
>
> Also terminology would be wrong. Its not stale then, since the target process probably exists, is a VM, and uses that file.
We will get a permission error from the `kill(pid, 0)` call only after we have successfully grabbed the flock on the file. Note that if the file was created by a live JVM process that has the flock fix (i.e., this PR), regardless of which user owns the process, we will never come to here.
That the value of the `pid` variable is misleading. It is the NSPID of another JVM that created the file. If the current JVM process runs in a different PID namespace, it cannot reliably determine whether the file is stale or not.
In general, I don't think we can trust `pid` at all when containers are involved. But that's OK -- if you want to use Java in containers that share the /tmp directory, you must upgrade to a JVM that has the flock fix. Otherwise the behavior is undefined.
Otherwise, if you are:
- Not using containers. or
- Using containers that don't share /tmp
The logic for handling the `kill(pid, 0)` error is not changed by this PR, so we are bug-for-bug compatible with older JVMs. If you think the behavior should be changed, may that should be done in a separate PR?
Or, if you run into problems like "my hsperf files are randomly deleted", a simple fix is to upgrade the JVM to one that has the flock fix :-)
-------------
PR: https://git.openjdk.org/jdk/pull/9406
More information about the hotspot-runtime-dev
mailing list