RFR: 8286212: Cgroup v1 initialization causes NPE on some systems [v3]
Ioi Lam
iklam at openjdk.java.net
Thu May 19 20:02:36 UTC 2022
On Thu, 19 May 2022 09:06:18 GMT, Severin Gehwolf <sgehwolf at openjdk.org> wrote:
>> src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 63:
>>
>>> 61: } else {
>>> 62: char *p = strstr(cgroup_path, _root);
>>> 63: if (p != NULL && p == cgroup_path) {
>>
>> What happens if cgroup_path is "/foo/bar" and _root is "/fo"?
>>
>> Maybe this block should be combined with the new `else` block you're adding? However, the behavior seems opposite between these two blocks of code:
>>
>> The upper block: _root is a prefix of cgroup_path, we take the **tail** of cgroup_path
>>
>>
>> ```
>> TestCase substring_match = {
>> "/sys/fs/cgroup/memory", // mount_path
>> "/user.slice/user-1000.slice", // root_path
>> "/user.slice/user-1000.slice/user@1001.service", // cgroup_path
>> "/sys/fs/cgroup/memory/user@1001.service" // expected_path
>> };
>> ```
>>
>>
>> The lower block: The beginning of _root is a prefix of cgroup_path, we take the **head** of cgroup_path
>>
>>
>> ```
>> TestCase prefix_matched_cg = {
>> "/sys/fs/cgroup/memory", // mount_path
>> "/user.slice/user-1000.slice/session-50.scope", // root_path
>> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
>> "/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path
>> };
>> ```
>>
>>
>> I find the behavior hard to understand. I think the rules (and reasons) should be added to the comment block above the function.
>
>> What happens if cgroup_path is "/foo/bar" and _root is "/fo"?
>
> With a mount path of `/bar` this ends up being `/bar/o/bar`. Pretty strange, but then again it's a bit of a contrived example as those paths come from `/proc` parsing. Anyway, this is code that got added with [JDK-8146115](https://bugs.openjdk.java.net/browse/JDK-8146115). It's not something I've written and to be honest, I'm not sure this branch is needed, but I didn't want to change the existing behaviour with this patch. I have no more insight than you in terms of why that approach has been taken.
>
>> Maybe this block should be combined with the new `else` block you're adding?
>
> Maybe, but I'm not sure if it would break something.
>
>> However, the behavior seems opposite between these two blocks of code:
>>
>> The upper block: _root is a prefix of cgroup_path, we take the **tail** of cgroup_path
>>
>> ```
>> TestCase substring_match = {
>> "/sys/fs/cgroup/memory", // mount_path
>> "/user.slice/user-1000.slice", // root_path
>> "/user.slice/user-1000.slice/user@1001.service", // cgroup_path
>> "/sys/fs/cgroup/memory/user@1001.service" // expected_path
>> };
>> ```
>
> Yes. Though, I cannot comment on why that has been chosen. It's been there since day one :-/
>
>> The lower block: The beginning of _root is a prefix of cgroup_path, we take the **head** of cgroup_path
>>
>> ```
>> TestCase prefix_matched_cg = {
>> "/sys/fs/cgroup/memory", // mount_path
>> "/user.slice/user-1000.slice/session-50.scope", // root_path
>> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
>> "/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path
>> };
>> ```
>>
>> I find the behavior hard to understand. I think the rules (and reasons) should be added to the comment block above the function.
>
> The reason I've gone down the path of using the head of cgroup_path is this document (in conjunction with what the user was observing on an affected system):
> https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
>
> The user was observing paths as listed in the test:
>
> "/user.slice/user-1000.slice/session-50.scope", // root_path
> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
>
> This very much looks systemd-managed. Given that, and knowing that systemd puts processes into `scopes` or `services` and groups them via `slices`, and that cgroups are hierarchical (i.e. limits of `/foo/bar` also apply to `/foo/bar/baz`), it seems likely that if there are any limits, they'll be on `/user.slice/user-1000.slice` within the mounted controller. Unfortunately, I'm not able to reproduce this myself.
I am wondering if the problem is this:
We have systemd running on the host, and a different copy of systemd that runs inside the container.
- They both set up `/user.slice/user-1000.slice/session-??.scope` within their own file systems
- For some reason, when you're looking inside the container, `/proc/self/cgroup` might use a path in the containerized file system whereas `/proc/self/mountinfo` uses a path in the host file system. These two paths may look alike but they have absolutely no relation to each other.
I have asked the reporter for more information:
https://gist.github.com/gaol/4d96eace8290e6549635fdc0ea41d0b4?permalink_comment_id=4172593#gistcomment-4172593
Meanwhile, I think the current method of finding "which directory under /sys/fs/cgroup/memory controls my memory usage" is broken. As mentioned above, the paths you get from `/proc/self/cgroup` and `/proc/self/mountinfo` have no relation to each other, but we use them anyway to get our answer, with many ad-hoc methods that are not documented in the code.
Maybe we should do this instead?
- Read /proc/self/cgroup
- Find the `10:memory:<path>` line
- If `/sys/fs/cgroup/memory/<path>/tasks` contains my PID, this is the path
- Otherwise, scan all `tasks` files under `/sys/fs/cgroup/memory/`. Exactly one of them contains my PID.
For example, here's a test with docker:
INSIDE CONTAINER
# cat /proc/self/cgroup | grep memory
10:memory:/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050
# cat /proc/self/mountinfo | grep memory
801 791 0:42 /docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050 /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime master:23 - cgroup cgroup rw,memory
# cat /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
cat: /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks: No such file or directory
# cat /sys/fs/cgroup/memory/tasks | grep $$
1
ON HOST
# cat /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
37494
# cat /proc/37494/status | grep NSpid
NSpid: 37494 1
-------------
PR: https://git.openjdk.java.net/jdk/pull/8629
More information about the serviceability-dev mailing list