RFR: 8286212: Cgroup v1 initialization causes NPE on some systems [v3]
Severin Gehwolf
sgehwolf at openjdk.java.net
Mon May 23 19:13:58 UTC 2022
On Mon, 23 May 2022 09:24:19 GMT, Severin Gehwolf <sgehwolf at openjdk.org> wrote:
>> Also, I think the current PR could produce the wrong answer, if systemd is indeed running inside the container, and we have:
>>
>>
>> "/user.slice/user-1000.slice/session-50.scope", // root_path
>> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
>>
>>
>> The PR gives /sys/fs/cgroup/memory/user.slice/user-1000.slice/, which specifies the overall memory limit for user-1000. However, the correct answer may be /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-3.scope, which may have a smaller memory limit, and the JVM may end up allocating a larger heap than allowed.
>
> Yes, if we can decide which file is the right one. This is largely undocumented territory. The correct fix is to a) find the correct path to the namespace hierarchy the process is a part of, and b) starting at the leaf node, walk up the hierarchy and find the **lowest** limits. Doing this would be very expensive!
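
To make step b) concrete, here is a minimal sketch of walking from the leaf cgroup directory up to the controller mount point and keeping the smallest limit seen. It is an illustration only, not what the PR or the JDK does: the class and method names are hypothetical, and it assumes the cgroup v1 file `memory.limit_in_bytes` holds a plain decimal number that fits in a `long`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;

// Hypothetical helper, for illustration only.
final class LowestMemoryLimit {
    // cgroup v1 interface file holding the memory limit (assumption: v1 layout).
    static final String LIMIT_FILE = "memory.limit_in_bytes";

    // Reads the limit in one cgroup directory; empty if missing or unparsable.
    static OptionalLong readLimit(Path dir) {
        try {
            String text = Files.readString(dir.resolve(LIMIT_FILE)).trim();
            return OptionalLong.of(Long.parseLong(text));
        } catch (IOException | NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    // Walks from leaf up to (and including) mountPoint, returning the lowest limit found.
    static OptionalLong lowestLimit(Path mountPoint, Path leaf) {
        long lowest = Long.MAX_VALUE;
        boolean found = false;
        for (Path dir = leaf; dir != null && dir.startsWith(mountPoint); dir = dir.getParent()) {
            OptionalLong limit = readLimit(dir);
            if (limit.isPresent()) {
                found = true;
                lowest = Math.min(lowest, limit.getAsLong());
            }
        }
        return found ? OptionalLong.of(lowest) : OptionalLong.empty();
    }
}
```

Each level costs one file read, so the expense grows with the depth of the hierarchy, which is the cost concern raised above.
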
>
> Aside: Current container detection in the JVM/JDK is notoriously imprecise. It's largely based on common setups (containers like docker). The heuristics assume that memory limits are reported inside the container at the leaf node. If, however, that's not the case, the detected limits will be wrong (the limit will be detected as unlimited even though, for example, memory is constrained at the parent). This can for example be reproduced on a cgroups v2 system with a systemd slice using memory limits. We've worked around this in OpenJDK for cgroups v1 with https://bugs.openjdk.java.net/browse/JDK-8217338
> Maybe we should do this instead?
>
> * Read /proc/self/cgroup
>
> * Find the `10:memory:<path>` line
>
> * If `/sys/fs/cgroup/memory/<path>/tasks` contains my PID, this is the path
>
> * Otherwise, scan all `tasks` files under `/sys/fs/cgroup/memory/`. Exactly one of them contains my PID.
Something like that seems most promising, but it would have to be `cgroup.procs`, not `tasks`, as `tasks` lists task IDs (i.e. Linux threads), not process IDs. We could keep the two common cases, i.e. the host and docker cases in the test, as short circuits.
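
A rough sketch of the lookup discussed above, using `cgroup.procs` rather than `tasks`. This is only an illustration under assumptions, not the change proposed in the PR: the class and method names are hypothetical, the mount point is passed in by the caller (for example `/sys/fs/cgroup/memory`), and error handling during the directory scan is omitted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only.
final class CgroupPathFinder {
    // Extracts <path> from the "N:<controllers>:<path>" line that lists the memory controller.
    static Optional<String> memoryPathFromProcCgroup(List<String> lines) {
        for (String line : lines) {
            String[] parts = line.split(":", 3);
            if (parts.length == 3 && List.of(parts[1].split(",")).contains("memory")) {
                return Optional.of(parts[2]);
            }
        }
        return Optional.empty();
    }

    // True if the given cgroup.procs file lists the PID; false if it cannot be read.
    static boolean containsPid(Path procsFile, long pid) {
        try {
            String wanted = Long.toString(pid);
            return Files.readAllLines(procsFile).stream().anyMatch(l -> l.trim().equals(wanted));
        } catch (IOException e) {
            return false;
        }
    }

    // Short circuit: try <mountPoint>/<cgroupPath> first; otherwise scan every cgroup.procs below mountPoint.
    static Optional<Path> findCgroupDir(Path mountPoint, String cgroupPath, long pid) throws IOException {
        Path candidate = mountPoint.resolve(cgroupPath.replaceFirst("^/+", ""));
        if (containsPid(candidate.resolve("cgroup.procs"), pid)) {
            return Optional.of(candidate);
        }
        try (Stream<Path> walk = Files.walk(mountPoint)) {
            return walk.filter(p -> p.getFileName() != null
                            && p.getFileName().toString().equals("cgroup.procs"))
                    .filter(p -> containsPid(p, pid))
                    .map(Path::getParent)
                    .findFirst();
        }
    }
}
```

A caller would feed it the lines of `/proc/self/cgroup` and `ProcessHandle.current().pid()`. The full scan is only reached when the short circuit fails, so the common host and docker cases stay cheap.
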
-------------
PR: https://git.openjdk.java.net/jdk/pull/8629
More information about the core-libs-dev mailing list