SIGBUS on linux in perfMemory_init
Ioi Lam
ioi.lam at oracle.com
Fri May 13 04:35:54 UTC 2022
Hi All,
I've created a preliminary patch based on the suggestions from Vitaly
and Nico.
I would love to hear feedback before I start a formal PR.
https://github.com/openjdk/jdk/compare/master...iklam:8286030-containers-share-tmp-dir
Goals
=====
+ Allow JVM processes in different containers to share the /tmp directory
(See https://bugs.openjdk.java.net/browse/JDK-8286030 for use cases)
+ Serviceability tools such as jps and jcmd should be able to access all
such processes.
Non-goals
=========
+ The /tmp directory can be shared only among JVMs updated with this
patch. We do not support such sharing with older JVMs.
+ When using a shared /tmp directory, you must use the updated
jps/jcmd/etc tools. Tools from older JDKs don't work.
Compatibility
=============
+ When the /tmp directory is NOT shared across containers, the behavior
of the updated JVM is exactly the same as before. The updated JVM can
co-exist with older JVMs, and can be accessed by jps/jcmd/etc tools from
older JDKs.
+ The jps/jcmd/etc tools in the updated JDK can work with JVMs
from older JDKs.
Design
======
AttachID -- this is the number used for /tmp/hsperfdata_$USER/<n> and
/tmp/.java_pid<n>
When a JVM starts, it chooses its AttachID by:
  for (id = os::current_process_id(); ; id = random_id()) {
      if ((flock /tmp/hsperfdata_$USER/$id) == SUCCESS) {
          break;
      }
  }
Note that random_id() always returns a negative integer, so we will not
step on the pid of another valid process.
See the LINUX version of mmap_create_shared().
Cleaning up unused perfdata files
=================================
See unlink_sharedmem_file_if_stale() in the patch. We use both flock()
and kill(pid, 0) to check liveness. This is a little complicated
because we need to co-exist with older JVMs.
Note that if the pid is positive, it could be from an older JVM, which
doesn't do flock(). Therefore, for such pids we also use kill(pid, 0).
This is racy, but it's no worse than before.
Discovering JVMs
================
See LocalVmManager.activeVms() -> PlatformSupport.activeVms().
See PlatformSupport.getAttachID().
The Linux version no longer scans /tmp directories. Instead, it scans
the /proc/*/maps files for mapped hsperfdata files. This way, it can
detect the AttachID of each JVM process. The AttachID may be
randomized, so it can differ from the JVM's host pid and nspid.
Testing
=======
I've done some quick manual testing and it seems to work when several
containers share the same /tmp directory. My command line is as follows
(/tmp inside the container is mapped to /tmp/dockertmp on the host file
system):
docker run -it --tty=true --rm \
-v /tmp/dockertmp:/tmp my-java-container \
java -cp / -Xlog:perf+memops=debug -Xlog:attach=trace Wait
I'll try to write a jtreg test case. Since we can't run as root, I'll
try to use podman and run the tests in rootless mode.
About -XX:+PerfDisableSharedMem
===============================
With my patch, if you share /tmp across containers and run Java with
-XX:+PerfDisableSharedMem
+ jps will not list these processes
+ "jcmd $HOSTPID" works for the first process that you connect to, but
might fail with another process. This is because two processes may use
the same socket file, e.g., /tmp/.java_pid1
Both of the above are the same as before, so there's no regression.
To avoid making this patch more complicated, I plan to fix this in a
future bug by mapping an empty /tmp/hsperfdata_$USER/$id file.
Please let me know what you think.
Thanks
- Ioi
On 5/7/2022 9:55 PM, Ioi Lam wrote:
>
>
> On 5/6/2022 1:40 AM, Severin Gehwolf wrote:
>> On Thu, 2022-05-05 at 13:48 -0700, Ioi Lam wrote:
>>>
>>> On 5/3/2022 8:41 AM, Nico Williams wrote:
>>>> On Fri, Apr 29, 2022 at 09:44:00AM -0400, Vitaly Davidovich wrote:
>>>>> As for possible solutions, would it be possible to use the global PID
>>>>> instead of the namespaced PID to "regain" the uniqueness invariant
>>>>> of the
>>>>> PID? Also, might it make sense to flock() the file to prevent another
>>>>> process from mucking with it?
>>>> My unsolicited, outsider opinions:
>>>>
>>>> - Sharing /tmp across containers is a Bad Idea (tm).
>>>>
>>>> - Sharing /tmp across related containers (in a pod) is not _as_
>>>> bad an
>>>> idea.
>>>>
>>>> (It might be a way to implement some cross-container
>>>> communications,
>>>> though it would be better to have an explicit mechanism for that
>>>> rather than the rather-generic /tmp.)
>>>>
>>>> - Containerizing apps that *do* communicate over /tmp might be one
>>>> reason one might configure a shared /tmp in a pod.
>>>>
>>>> Some support for such a configuration might be needed.
>>>>
>>>> (Alternatively, pods that share /tmp should also share a PID
>>>> namespace.)
>>>>
>>>> - Since there is an option to not have an mmap'ed hsperf file,
>>>> it might
>>>> be nice to have an option to use the global PID for naming hsperf
>>>> files. Or, better, implement an automatic mechanism for
>>>> detecting
>>>> conflict and switching to global PID for naming hsperf files (or
>>>> switching to anonymous hsperf mmaps).
>>>>
>>>> - In any case, on systems that have a real flock(2), using
>>>> flock(2) for
>>>> liveness testing is better than kill(2) with signal 0 -- the
>>>> latter
>>>> has false positives, while the former does not [provided
>>>> O_CLOEXEC is
>>>> used].
>>>>
>>>> For this reason, and though I am not too sympathetic to the
>>>> situation
>>>> that caused this crash, I believe that it would be better to have
>>>> some sort of fix for this problem than to declare it a
>>>> non-problem
>>>> and not-fix it.
>>>>
>>>>
>>>> I would like to expand on Vitaly's mention of flock(2). Using the
>>>> global PID would leave the JVM unable to use kill(2) with signal 0 for
>>>> liveness detection during hsperf garbage file collection. Using
>>>> kill(2)
>>>> with signal 0 for liveness is not that reliable anyways because of PID
>>>> reuse -- it can have false positives.
>>>>
>>>> A better mechanism for liveness detection would be to have the owning
>>>> JVM take an exclusive (LOCK_EX) flock(2) on the hsperf file at
>>>> startup,
>>>> and for hsperf garbage file collection to try (LOCK_NB) to get an
>>>> exclusive lock (LOCK_EX) on a candidate hsperf garbage file as a
>>>> liveness detection mechanism.
>>>>
>>>> When using the namespaced PID the kill(2) with signal 0 method of
>>>> liveness detection should still be used for backwards-compatibility
>>>> in,
>>>> e.g., jvisualvm.
>>>>
>>>> Using flock(2) would be less portable than kill(2) with signal 0, but
>>>> already there is a bunch of Linux-specific code here looking through
>>>> /proc, and Linux does have a real flock(2).
>>>>
>>>> An adaptive, zero-conf hsperf file naming scheme might use the
>>>> namespaced PID if available (i.e., if an exclusive flock(2) could be
>>>> obtained on the file), or the global PID if not, with some
>>>> indication in
>>>> the name of the file's name of which kind of PID was used.
>>> Hi Nico,
>>>
>>> I read your message again and now I totally agree with using
>>> flock(2) :-)
>>>
>>> As you said, we should start with getpid(). That way the behavior is
>>> compatible with older versions of jcmd tools, especially when Java is
>>> used outside of containers.
>>>
>>> One thing I realized is that if we have a collision, we don't need to
>>> use a globally unique ID. We just need an ID that's unique in the
>>> directory being written into.
>>>
>>> I think we can do this on the VM side:
>>>
>>> String id = getpid();
>>> while (true) {
>>>     String file = "/tmp/hsperfdata_" + username() + "/" + id;
>>>     if (get_exclusive_access(file)) {
>>>         // I won the contest and
>>>         // (a) the file didn't exist, or
>>>         // (b) the file existed but the JVM that used it has died
>>>         return file;
>>>     }
>>>     // Add an "x" here so we don't collide with the getpid() of
>>>     // another process
>>>     id = "x" + random();
>>> }
>>>
>>> On the tools side, we can do the pid -> rendezvous file mapping as I
>>> described in the other e-mail.
>> If we could limit using this special trick to when it's actually
>> needed, then this would be my preference. For one, it mostly keeps
>> compatibility with older JVMs, and for two, this isn't a very common
>> use case, so it wouldn't penalize the 90% of use cases which aren't
>> affected by this.
>>
>> On the other hand, 'man proc' tells me this about /proc/*/environ:
>>
>> """
>> This file contains the initial environment that was set when the
>> currently executing program was started via execve(2). [...]
>>
>> If, after an execve(2), the process modifies its environment (e.g.,
>> by calling functions such as putenv(3) or modifying the environ(7)
>> variable directly), this file will not reflect those changes.
>>
>> [...]
>>
>> Permission to access this file is governed by a ptrace access mode
>> PTRACE_MODE_READ_FSCREDS check; see ptrace(2).
>> """
>>
>> So doing the publication of the file that was used in a reliable way
>> will be a challenge. Both approaches, shared memory mapping and setting
>> the environment will need PTRACE_MODE_READ_FSCREDS which I think isn't
>> generally granted for containers.
>
> I think publishing via /proc/*/environ is going to be problematic, but
> using the maps file seems fine. Here are my experiments.
>
> My conclusion is:
>
> If a Java process is visible to "jps" today, "jps" can also read its
> /proc/id/maps file.
>
> ============== test 1 ================
> Docker running with cgroup v1 on Ubuntu 20.04.3 LTS
>
> ubuntu at minikube-cgv1:~$ jps -J-version
> openjdk version "11.0.15" 2022-04-19
> OpenJDK Runtime Environment (build 11.0.15+10-Ubuntu-0ubuntu0.20.04.1)
> OpenJDK 64-Bit Server VM (build 11.0.15+10-Ubuntu-0ubuntu0.20.04.1,
> mixed mode, sharing)
> ubuntu at minikube-cgv1:~$ ps -ef | grep Wait
> ubuntu 3490 3280 2 21:34 pts/1 00:00:00 docker run -it
> --tty=true --rm my-java-app java -cp / Wait
> root 3541 3515 2 21:34 pts/0 00:00:00 java -cp / Wait
> ubuntu 3598 1247 0 21:34 pts/0 00:00:00 grep --color=auto
> Wait
>
> ubuntu at minikube-cgv1:~$ jps
> 3611 Jps
> ubuntu at minikube-cgv1:~$ sudo jps
> 3541 Wait
> 3630 Jps
> ubuntu at minikube-cgv1:~$ wc /proc/3541/maps
> wc: /proc/3541/maps: Permission denied
> ubuntu at minikube-cgv1:~$ ls /proc/3541/root
> ls: cannot access '/proc/3541/root': Permission denied
> ubuntu at minikube-cgv1:~$ sudo wc /proc/3541/maps
> 181 968 12074 /proc/3541/maps
> ubuntu at minikube-cgv1:~$ sudo grep hsperf /proc/3541/maps
> 7f4ebca2b000-7f4ebca33000 rw-s 00000000 00:33
> 1818692 /tmp/hsperfdata_root/1
>
> ============== test 2 ================
> podman rootless + cgroupv2 on Ubuntu 21.10
>
> ubuntu at podman-tester:~$ jps -J-version
> openjdk version "17.0.1" 2021-10-19
> OpenJDK Runtime Environment (build 17.0.1+12-Ubuntu-121.10)
> OpenJDK 64-Bit Server VM (build 17.0.1+12-Ubuntu-121.10, mixed mode,
> sharing)
> ubuntu at podman-tester:~$ ps -ef | grep Wait
> ubuntu 1531 1468 0 21:19 pts/0 00:00:01 podman run -it
> --tty=true --rm my-java-app java -cp / Wait
> ubuntu 1571 1556 0 21:19 pts/0 00:00:01 java -cp / Wait
> ubuntu 1946 1686 0 21:23 pts/1 00:00:00 grep --color=auto
> Wait
> ubuntu at podman-tester:~$ jps
> 1778 Jps
> 1571 Wait
> ubuntu at podman-tester:~$ cat /^C
> ubuntu at podman-tester:~$ sudo jps
> 1571 Wait
> 1805 Jps
> ubuntu at podman-tester:~$ grep hsperf /proc/1571/maps
> 7f88dc40c000-7f88dc414000 rw-s 00000000 08:01
> 792115 /tmp/hsperfdata_root/1
> ubuntu at podman-tester:~$ sudo grep hsperf /proc/1571/maps
> 7f88dc40c000-7f88dc414000 rw-s 00000000 08:01
> 792115 /tmp/hsperfdata_root/1
> ubuntu at podman-tester:~$ ls -l /proc/1571/root/tmp/hsperfdata_root/1
> -rw------- 1 ubuntu ubuntu 32768 May 7 21:24
> /proc/1571/root/tmp/hsperfdata_root/1
>
>
>
More information about the hotspot-runtime-dev
mailing list