RFR: 8343191: Cgroup v1 subsystem fails to set subsystem path

Sergey Chernyshev schernyshev at openjdk.org
Thu Nov 7 18:35:01 UTC 2024


On Thu, 31 Oct 2024 15:00:25 GMT, Sergey Chernyshev <schernyshev at openjdk.org> wrote:

> Cgroup V1 subsystem fails to initialize mounted controllers properly in certain cases, which may leave controllers undetected or inactive. We observed this behavior in CloudFoundry deployments; it also affects host systems.
> 
> The relevant /proc/self/mountinfo line is
> 
> 
> 2207 2196 0:43 /system.slice/garden.service/garden/good/2f57368b-0eda-4e52-64d8-af5c /sys/fs/cgroup/cpu,cpuacct ro,nosuid,nodev,noexec,relatime master:25 - cgroup cgroup rw,cpu,cpuacct
> 
> 
> /proc/self/cgroup:
> 
> 
> 11:cpu,cpuacct:/system.slice/garden.service/garden/bad/2f57368b-0eda-4e52-64d8-af5c
> 
> 
> Here, Java runs inside a containerized process that is being moved between cgroups due to load balancing.
> 
> Let's examine the condition at line 64 here https://github.com/openjdk/jdk/blob/55a7cf14453b6cd1de91362927b2fa63cba400a1/src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp#L59-L72
> It is always FALSE and the branch is never taken. The issue was spotted earlier by @jerboaa in [JDK-8288019](https://bugs.openjdk.org/browse/JDK-8288019). 
> 
> The original logic was intended to find the common prefix of `_root` and `cgroup_path` and concatenate the remaining suffix to the `_mount_point` (lines 67-68). That could lead to the following results:
> 
> Example input
> 
> _root = "/a"
> cgroup_path = "/a/b"
> _mount_point = "/sys/fs/cgroup/cpu,cpuacct"
> 
> 
> result _path
> 
> "/sys/fs/cgroup/cpu,cpuacct/b"
> 
> 
> Here, cgroup_path comes from /proc/self/cgroup 3rd column. The man page (https://man7.org/linux/man-pages/man7/cgroups.7.html#NOTES) for control groups states:
> 
> 
> ...
>        /proc/pid/cgroup (since Linux 2.6.24)
>               This file describes control groups to which the process
>               with the corresponding PID belongs.  The displayed
>               information differs for cgroups version 1 and version 2
>               hierarchies.
>               For each cgroup hierarchy of which the process is a
>               member, there is one entry containing three colon-
>               separated fields:
> 
>                   hierarchy-ID:controller-list:cgroup-path
> 
>               For example:
> 
>                   5:cpuacct,cpu,cpuset:/daemons
> ...
>               [3]  This field contains the pathname of the control group
>                    in the hierarchy to which the process belongs. This
>                    pathname is relative to the mount point of the
>                    hierarchy.
> 
> 
> This explicitly states that the "pathname is relative to the mount point of the hierarchy". Hence, the correct result could have been
> 
> 
> /sys/fs/cgroup/cpu,cpuacct/a/b
> 
> 
> Howe...
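
For reference, the prefix handling described in the quoted text (strip `_root` from the front of `cgroup_path`, append the rest to `_mount_point`) can be sketched in shell. This is only an illustration of the description, not the HotSpot code; the function name `cg1_strip_root` is made up for this example.

```shell
# Hypothetical helper, not part of the JDK: strip the leading `root` from
# `cgroup_path` and append the remaining suffix to `mount_point`.
# Returns non-zero when `root` is not a leading path component of `cgroup_path`.
cg1_strip_root() {
    root=$1
    cgroup_path=$2
    mount_point=$3
    case "$cgroup_path" in
        "$root"/*)
            # Remove the literal prefix; the suffix keeps its leading slash.
            suffix=${cgroup_path#"$root"}
            printf '%s\n' "$mount_point$suffix"
            ;;
        *)
            return 1
            ;;
    esac
}
```

With the example input from the quote, `cg1_strip_root /a /a/b /sys/fs/cgroup/cpu,cpuacct` prints `/sys/fs/cgroup/cpu,cpuacct/b`.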

Here's an updated version of the patch. The long-standing behavior was to leave `_path` uninitialized when `_root` is neither `"/"` nor equal to `cgroup_path`. The issue can be reproduced as follows.

Create a new cgroup for memory

sudo mkdir -p /sys/fs/cgroup/memory/test


Run the following script

docker run --tty=true --rm --volume=$JAVA_HOME:/jdk --memory 400m ubuntu:latest \
    sh -c "sleep 10 ; /jdk/bin/java -Xlog:os+container=trace -version" | grep Memory\ Limit &
sleep 10
HOSTPID=$(sudo ps -ef | awk '/container=trace/ && !/docker/ && !/awk/ { print $2 }')
echo $HOSTPID | sudo tee /sys/fs/cgroup/memory/test/cgroup.procs
sleep 10

In the above script, a containerized process (`/bin/sh`) is moved to cgroup `/test` before `/jdk/bin/java` gets executed. Java inherits cgroup `/test` from its parent process, so its `_root` will be `/docker/<CONTAINER_ID>` and its `cgroup_path` will be `/test`.
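
To check these two values by hand, something along the following lines may help. It is a rough sketch, assuming the cgroup v1 `mountinfo` layout shown in the PR description (root in the 4th field, controllers in the super options after the ` - ` separator) and the three-field `/proc/<pid>/cgroup` format; the helper names `cg1_root` and `cg1_path` are made up for this example.

```shell
# Hypothetical helpers for inspecting cgroup v1 data; not part of the JDK.

# Print the root (4th mountinfo field) of the cgroup v1 mount that carries
# controller $1. $2 is the mountinfo file (default: /proc/self/mountinfo).
cg1_root() {
    awk -v ctrl="$1" '
        {
            # Find the "-" separator; fs type, source and super options follow.
            sep = 0
            for (i = 1; i <= NF; i++) if ($i == "-") { sep = i; break }
            if (sep == 0 || $(sep + 1) != "cgroup") next
            n = split($(sep + 3), opts, ",")
            for (j = 1; j <= n; j++) if (opts[j] == ctrl) { print $4; exit }
        }' "${2:-/proc/self/mountinfo}"
}

# Print the cgroup path (3rd colon-separated field) for controller $1.
# $2 is the cgroup file (default: /proc/self/cgroup).
cg1_path() {
    awk -F: -v ctrl="$1" '
        {
            n = split($2, list, ",")
            for (j = 1; j <= n; j++) if (list[j] == ctrl) { print $3; exit }
        }' "${2:-/proc/self/cgroup}"
}
```

Inside the container, `cg1_root memory` and `cg1_path memory` would then be expected to show the `_root` and `cgroup_path` values discussed above.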


The result with $JAVA_HOME pointing to a JDK before the fix:

9804
[0.001s][trace][os,container] Memory Limit failed: -2
[0.001s][trace][os,container] Memory Limit failed: -2
[0.002s][trace][os,container] Memory Limit failed: -2
[0.043s][trace][os,container] Memory Limit failed: -2


With the updated JDK:

10001
[0.001s][trace  ][os,container] Memory Limit is: 419430400
[0.001s][trace  ][os,container] Memory Limit is: 419430400
[0.002s][trace  ][os,container] Memory Limit is: 419430400
[0.035s][trace  ][os,container] Memory Limit is: 419430400

The updated version falls back to the mount point (only when `_root` is other than `"/"`).
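
The fallback rule can be summarized with a small shell sketch. This is not the patch itself, only an illustration of the rule stated above; the function name `cg1_resolve_path` is made up, and the handling of a `"/"` root (append `cgroup_path` to the mount point) is an assumption that this message does not spell out.

```shell
# Hypothetical helper, not HotSpot code: pick the directory from which the
# controller files would be read, given root, cgroup_path and mount_point.
cg1_resolve_path() {
    root=$1
    cgroup_path=$2
    mount_point=$3
    if [ "$root" = "/" ]; then
        # Assumption: with a "/" root the cgroup path is relative to the mount point.
        if [ "$cgroup_path" = "/" ]; then
            printf '%s\n' "$mount_point"
        else
            printf '%s\n' "$mount_point$cgroup_path"
        fi
    else
        # Fallback described above: use the mount point itself.
        printf '%s\n' "$mount_point"
    fi
}
```

For the reproducer, `cg1_resolve_path "/docker/<CONTAINER_ID>" /test /sys/fs/cgroup/memory` prints `/sys/fs/cgroup/memory` (the mount point value here is illustrative), which is where the 400m limit is visible inside the container.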

**Testing**

- Standard tiers (1-3)
- jtreg:test/jdk/jdk/internal/platform
- jtreg:test/hotspot/jtreg/containers
- gtest:cgroupTest

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21808#issuecomment-2462243544


More information about the serviceability-dev mailing list