RFR: 8343191: Cgroup v1 subsystem fails to set subsystem path [v3]
Sergey Chernyshev
schernyshev at openjdk.org
Fri Nov 8 21:20:46 UTC 2024
On Thu, 7 Nov 2024 22:31:21 GMT, Sergey Chernyshev <schernyshev at openjdk.org> wrote:
>> Cgroup V1 subsystem fails to initialize mounted controllers properly in certain cases, which may leave controllers undetected/inactive. We observed the behavior in CloudFoundry deployments; it also affects host systems.
>>
>> The relevant /proc/self/mountinfo line is
>>
>>
>> 2207 2196 0:43 /system.slice/garden.service/garden/good/2f57368b-0eda-4e52-64d8-af5c /sys/fs/cgroup/cpu,cpuacct ro,nosuid,nodev,noexec,relatime master:25 - cgroup cgroup rw,cpu,cpuacct
>>
>>
>> /proc/self/cgroup:
>>
>>
>> 11:cpu,cpuacct:/system.slice/garden.service/garden/bad/2f57368b-0eda-4e52-64d8-af5c
>>
>>
>> Here, Java runs inside a containerized process that is being moved between cgroups due to load balancing.
>>
>> Let's examine the condition at line 64 here https://github.com/openjdk/jdk/blob/55a7cf14453b6cd1de91362927b2fa63cba400a1/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp#L59-L72
>> It is always FALSE and the branch is never taken. The issue was spotted earlier by @jerboaa in [JDK-8288019](https://bugs.openjdk.org/browse/JDK-8288019).
>>
>> The original logic was intended to find the common prefix of `_root` and `cgroup_path` and concatenate the remaining suffix to the `_mount_point` (lines 67-68). That could lead to the following results:
>>
>> Example input
>>
>> _root = "/a"
>> cgroup_path = "/a/b"
>> _mount_point = "/sys/fs/cgroup/cpu,cpuacct"
>>
>>
>> result _path
>>
>> "/sys/fs/cgroup/cpu,cpuacct/b"
>>
>>
>> Here, cgroup_path comes from /proc/self/cgroup 3rd column. The man page (https://man7.org/linux/man-pages/man7/cgroups.7.html#NOTES) for control groups states:
>>
>>
>> ...
>> /proc/pid/cgroup (since Linux 2.6.24)
>> This file describes control groups to which the process
>> with the corresponding PID belongs. The displayed
>> information differs for cgroups version 1 and version 2
>> hierarchies.
>> For each cgroup hierarchy of which the process is a
>> member, there is one entry containing three colon-
>> separated fields:
>>
>> hierarchy-ID:controller-list:cgroup-path
>>
>> For example:
>>
>> 5:cpuacct,cpu,cpuset:/daemons
>> ...
>> [3] This field contains the pathname of the control group
>> in the hierarchy to which the process belongs. This
>> pathname is relative to the mount point of the
>> hierarchy.
>>
>>
>> This explicitly states the "pathname is relative t...
>
> Sergey Chernyshev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
>
> - Merge branch 'master' into JDK-8343191
> - patch reimplemented
> - fix the logic that skips duplicate controller's mount points
> - 8343191: Cgroup v1 subsystem fails to set subsystem path
It looks to me that v2 mode is not affected, at least not in the way v1 is. In v2 mode, the cgroup is mounted either at the leaf node (private namespace) or as the complete hierarchy at /sys/fs/cgroup (host namespace).
In host mode it works right away, as the full hierarchy is accessible. With a cgroup v2 created like this:
sudo mkdir -p /sys/fs/cgroup/test
echo 200000000 | sudo tee /sys/fs/cgroup/test/memory.max
The result would be
[0.000s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.001s][debug][os,container] Detected cgroups v2 unified hierarchy
[0.001s][trace][os,container] Adjusting controller path for memory: /sys/fs/cgroup/test
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/test/memory.max
[0.001s][trace][os,container] Memory Limit is: 199999488
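The reported limit (199999488) is slightly below the 200000000 written to memory.max. That is consistent with the kernel keeping the limit in whole pages and rounding down. A minimal arithmetic check follows; the 4 KiB page size is an assumption (the post does not state it), and `round_down_to_page` is a hypothetical helper name used only for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, not stated in the original post


def round_down_to_page(limit_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Round a byte limit down to a whole number of pages."""
    return (limit_bytes // page_size) * page_size


# The value written to memory.max in the example above.
print(round_down_to_page(200_000_000))
```

With these assumptions the result matches the value in the trace output, so the difference is not a parsing problem on the JVM side.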
In the private namespace (the default setting on v2 hosts), things may break when the process is migrated between cgroups (a docker issue?). It may look as if the cgroup files are not mapped at all, while `cgroup_path` appears to be set relative to the old cgroup (the old cgroup isn't mapped, though).
[0.000s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.001s][debug][os,container] Detected cgroups v2 unified hierarchy
[0.001s][trace][os,container] Adjusting controller path for memory: /sys/fs/cgroup/../../test
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../../test/memory.max
[0.001s][debug][os,container] Open of file /sys/fs/cgroup/../../test/memory.max failed, No such file or directory
[0.001s][trace][os,container] Memory Limit failed: -2
[0.001s][trace][os,container] Memory Limit is: -2
[0.001s][debug][os,container] container memory limit failed: -2, using host value 4105613312
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../../memory.max
[0.001s][debug][os,container] Open of file /sys/fs/cgroup/../../memory.max failed, No such file or directory
[0.001s][trace][os,container] Memory Limit failed: -2
[0.001s][trace][os,container] Memory Limit is: -2
[0.001s][debug][os,container] container memory limit failed: -2, using host value 4105613312
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/../memory.max
[0.001s][debug][os,container] Open of file /sys/fs/cgroup/../memory.max failed, No such file or directory
[0.001s][trace][os,container] Memory Limit failed: -2
[0.001s][trace][os,container] Memory Limit is: -2
[0.001s][debug][os,container] container memory limit failed: -2, using host value 4105613312
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/memory.max
[0.001s][debug][os,container] Open of file /sys/fs/cgroup/memory.max failed, No such file or directory
[0.001s][trace][os,container] Memory Limit failed: -2
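The order of the probed paths in the log above can be reproduced with a short sketch: start at the mount point plus the full `cgroup_path`, then drop one trailing path component at a time until only the mount point is left. This is only an illustration of the sequence visible in the trace, not HotSpot's actual code; `candidate_dirs` is a hypothetical name, and the `cgroup_path` value `/../../test` is inferred from the logged paths.

```python
from typing import List


def candidate_dirs(mount_point: str, cgroup_path: str) -> List[str]:
    """List directories to probe, from the full path up to the mount point."""
    parts = [p for p in cgroup_path.split("/") if p]
    # n counts how many leading components of cgroup_path are kept.
    return [
        "/".join([mount_point, *parts[:n]])
        for n in range(len(parts), -1, -1)
    ]


# cgroup_path as inferred from the log (an assumption).
for d in candidate_dirs("/sys/fs/cgroup", "/../../test"):
    print(d + "/memory.max")
```

None of these candidates exists in the container after the migration, which is why every probe in the log ends with "No such file or directory".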
The following script
docker run --tty=true --rm --volume=$JAVA_HOME:/jdk --memory 400m ubuntu:latest \
sh -c "N=$(ls -la /sys/fs/cgroup | wc -l) ; sleep 10 ; echo $N ; ls -la /sys/fs/cgroup | wc -l" &
sleep 10
HOSTPID=$(sudo ps -ef | awk '/cgroup/ && !/docker/ && !/awk/ && !/grep/ { print $2 }')
echo $HOSTPID | sudo tee /sys/fs/cgroup/test/cgroup.procs > /dev/null
sleep 5
will display
74
1
which means there are no files in `/sys/fs/cgroup` after migration. This does not seem to be something that can be fixed in Java (and it has little to do with this PR either).
When moved into a subgroup, such as
sudo docker run --tty=true --rm --volume=$JAVA_HOME:/jdk --memory 400m ubuntu:latest \
sh -c "sleep 10 ; /jdk/bin/java -Xlog:os+container=trace -version" &
sleep 5
HOSTPID=$(sudo ps -ef | awk '/container=trace/ && !/docker/ && !/awk/ { print $2 }')
CGPATH=$(cat /proc/$HOSTPID/cgroup | cut -f3 -d: )
sudo mkdir -p "/sys/fs/cgroup$CGPATH/test"
echo $HOSTPID | sudo tee "/sys/fs/cgroup$CGPATH/test/cgroup.procs" > /dev/null
sleep 10
the cgroup will be mounted at /sys/fs/cgroup, and the correct memory limit, inherited from the parent, is displayed (thanks to the controller path adjustment).
[0.001s][debug][os,container] Detected cgroups v2 unified hierarchy
[0.001s][trace][os,container] Adjusting controller path for memory: /sys/fs/cgroup/test
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/test/memory.max
[0.001s][debug][os,container] Open of file /sys/fs/cgroup/test/memory.max failed, No such file or directory
[0.001s][trace][os,container] Memory Limit failed: -2
[0.001s][trace][os,container] Memory Limit is: -2
[0.001s][debug][os,container] container memory limit failed: -2, using host value 4105613312
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/memory.max
[0.001s][trace][os,container] Memory Limit is: 419430400
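The fallback seen in this log (the subgroup has no memory.max, so the parent's value is used) can be simulated without a real cgroup mount. The sketch below builds a temporary directory that stands in for /sys/fs/cgroup and walks upward until a limit file is readable. `first_readable_limit` is a hypothetical helper, and skipping a literal `max` value is a simplification of this sketch, not a statement about what the JVM does. The value 419430400 is 400 * 1024 * 1024, matching `--memory 400m`.

```python
import os
import tempfile
from typing import Optional


def first_readable_limit(mount_point: str, cgroup_path: str,
                         filename: str = "memory.max") -> Optional[int]:
    """Walk from mount_point + cgroup_path up to mount_point and
    return the first numeric limit found, or None."""
    parts = [p for p in cgroup_path.split("/") if p]
    for n in range(len(parts), -1, -1):
        path = os.path.join(mount_point, *parts[:n], filename)
        try:
            with open(path) as f:
                text = f.read().strip()
        except FileNotFoundError:
            continue  # corresponds to "No such file or directory" in the log
        if text != "max":  # simplification: treat "max" as "keep looking"
            return int(text)
    return None


# A temporary directory stands in for /sys/fs/cgroup (assumption for the demo).
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "test"))  # subgroup without memory.max
    with open(os.path.join(root, "memory.max"), "w") as f:
        f.write("419430400\n")
    print(first_readable_limit(root, "/test"))
```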
-------------
PR Comment: https://git.openjdk.org/jdk/pull/21808#issuecomment-2465764229