RFR: 8286212: Cgroup v1 initialization causes NPE on some systems [v3]
Severin Gehwolf
sgehwolf at openjdk.java.net
Mon May 23 09:27:45 UTC 2022
On Thu, 19 May 2022 20:18:50 GMT, Ioi Lam <iklam at openjdk.org> wrote:
>> I am wondering if the problem is this:
>>
>> We have systemd running on the host, and a different copy of systemd that runs inside the container.
>>
>> - They both set up `/user.slice/user-1000.slice/session-??.scope` within their own file systems
>> - For some reason, when you're looking inside the container, `/proc/self/cgroup` might use a path in the containerized file system whereas `/proc/self/mountinfo` uses a path in the host file system. These two paths may look alike but they have absolutely no relation to each other.
>>
>> I have asked the reporter for more information:
>>
>> https://gist.github.com/gaol/4d96eace8290e6549635fdc0ea41d0b4?permalink_comment_id=4172593#gistcomment-4172593
>>
>> Meanwhile, I think the current method of finding "which directory under /sys/fs/cgroup/memory controls my memory usage" is broken. As mentioned above, the paths you get from `/proc/self/cgroup` and `/proc/self/mountinfo` have no relation to each other, but we use them anyway to get our answer, with many ad-hoc methods that are not documented in the code.
>>
>> Maybe we should do this instead?
>>
>> - Read /proc/self/cgroup
>> - Find the `10:memory:<path>` line
>> - If `/sys/fs/cgroup/memory/<path>/tasks` contains my PID, this is the path
>> - Otherwise, scan all `tasks` files under `/sys/fs/cgroup/memory/`. Exactly one of them contains my PID.
>>
>> For example, here's a test with docker:
>>
>>
>> INSIDE CONTAINER
>> # cat /proc/self/cgroup | grep memory
>> 10:memory:/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050
>> # cat /proc/self/mountinfo | grep memory
>> 801 791 0:42 /docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050 /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime master:23 - cgroup cgroup rw,memory
>> # cat /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
>> cat: /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks: No such file or directory
>> # cat /sys/fs/cgroup/memory/tasks | grep $$
>> 1
>>
>> ON HOST
>> # cat /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
>> 37494
>> # cat /proc/37494/status | grep NSpid
>> NSpid: 37494 1
>
> Also, I think the current PR could produce the wrong answer, if systemd is indeed running inside the container, and we have:
>
>
> "/user.slice/user-1000.slice/session-50.scope", // root_path
> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
>
>
> The PR gives /sys/fs/cgroup/memory/user.slice/user-1000.slice/, which specifies the overall memory limit for user-1000. However, the correct answer may be /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-3.scope, which may have a smaller memory limit, and the JVM may end up allocating a larger heap than allowed.
Yes, if we can figure out which file is the right one. This is largely undocumented territory. The correct fix would be to a) find the correct path into the namespace hierarchy the process is a part of, and b) starting at the leaf node, walk up the hierarchy and find the **lowest** limits. Doing this would be very expensive!
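
For illustration only, here is a rough sketch of what step (b) could look like for the cgroup v1 memory controller. This is not what the JDK does today; it assumes the controller mount point and the leaf path are already known, and it relies on the fact that the v1 "unlimited" value is just a very large number, so taking the minimum across the hierarchy naturally picks up a smaller limit set on a parent:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LowestLimitSketch {

    // Hypothetical helper: walk from the leaf cgroup up to the controller
    // mount root and return the smallest memory.limit_in_bytes found.
    // An "unlimited" node reports a huge value, so min() still finds a
    // tighter limit configured higher up the hierarchy.
    static long lowestMemoryLimit(Path controllerRoot, Path leaf) throws Exception {
        long lowest = Long.MAX_VALUE;
        for (Path p = leaf; p != null && p.startsWith(controllerRoot); p = p.getParent()) {
            Path limitFile = p.resolve("memory.limit_in_bytes");
            if (Files.exists(limitFile)) {
                long limit = Long.parseLong(Files.readString(limitFile).trim());
                lowest = Math.min(lowest, limit);
            }
        }
        return lowest; // Long.MAX_VALUE here would mean "no limit file found at all"
    }

    public static void main(String[] args) throws Exception {
        // Illustrative paths, matching the session-scope example above.
        Path root = Paths.get("/sys/fs/cgroup/memory");
        Path leaf = root.resolve("user.slice/user-1000.slice/session-3.scope");
        System.out.println(lowestMemoryLimit(root, leaf));
    }
}

And this would have to happen per controller (memory, cpu, cpuset, ...) at every place the JVM re-reads the limits, which is why it would be so costly.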
Aside: Current container detection in the JVM/JDK is notoriously imprecise. It's largely based on common setups (containers like docker). The heuristics assume that memory limits are reported inside the container at the leaf node. If that's not the case, the detected limits will be wrong: the JVM will report the memory as unlimited even though it is, for example, constrained at a parent node. This can be reproduced on a cgroups v2 system with a systemd slice using memory limits. We've worked around this in OpenJDK for cgroups v1 with https://bugs.openjdk.java.net/browse/JDK-8217338
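
To make that cgroups v2 scenario concrete, here is a small hypothetical illustration (paths and the limit value are made up; the slice limit would come from e.g. a systemd unit with MemoryMax= set). A leaf-only check reads "max" and concludes "unlimited", while the real constraint sits one level up in the slice:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LeafVsParent {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths: a session scope inside a memory-limited user
        // slice on a unified (cgroup v2) hierarchy.
        Path leaf  = Paths.get("/sys/fs/cgroup/user.slice/user-1000.slice/session-3.scope/memory.max");
        Path slice = Paths.get("/sys/fs/cgroup/user.slice/user-1000.slice/memory.max");
        // The leaf typically reports "max", i.e. no limit of its own ...
        System.out.println("leaf  memory.max: " + Files.readString(leaf).trim());
        // ... while the enclosing slice holds the actual limit, e.g. "2147483648".
        System.out.println("slice memory.max: " + Files.readString(slice).trim());
    }
}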
-------------
PR: https://git.openjdk.java.net/jdk/pull/8629