RFR(S): 8239785: Cgroups: Incorrect detection logic on old systems in hotspot

Bob Vandette bob.vandette at oracle.com
Thu Mar 5 13:41:26 UTC 2020



> On Mar 5, 2020, at 4:36 AM, Severin Gehwolf <sgehwolf at redhat.com> wrote:
> 
> Hi Bob,
> 
> On Wed, 2020-03-04 at 15:39 -0500, Bob Vandette wrote:
>>> On Mar 4, 2020, at 1:59 PM, Severin Gehwolf <sgehwolf at redhat.com> wrote:
>>> 
>>> Hi Bob,
>>> 
>>> Updated webrev:
>>> http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8239785/05/webrev/
>>> 
>>> Testing: hotspot docker tests on Linux fastdebug/release cgroup v1 and
>>> cgroup v2. New regression test. jdk submit. All pass.
>>> 
>>> On Fri, 2020-02-28 at 15:08 -0500, Bob Vandette wrote:
>>>> It still looks like the _mount_path can be corrupted if both cgroupv1 and cgroupv2 are mounted.
>>>> You’ll only initialize the _mount_path if cgroupv2 is detected in the cgroups file, but if cgroupv1
>>>> is also mounted, you’ll still clobber the _mount_path.
>>> 
>>> How so? By the time we start processing /proc/self/mountinfo on line
>>> 170, we have already looked at /proc/cgroups, so is_cgroupsV2 is set
>>> to true (A) or false (B).
>>> 
>>> Cases A:
>>> 1. We are on an old system with *no* cgroup controllers (v1 or v2)
>>>   mounted (this bug). We are fine to continue as no cgroups entry
>>>   will be found in /proc/self/mountinfo. Thus, no clobbering.
>>> 2. We are on a true unified-hierarchy-only system. In that case
>>>   the sscanf for v1 would match (early match), but the 'memory' et al.
>>>   controllers would not be found. Thus, no clobbering.
>>> 
>>> Cases B:
>>> 1. Hybrid hierarchy: controllers mounted via cgroup v1, but
>>>   cgroup v2 mounted in addition for system use. is_cgroupsV2 will
>>>   prevent the cgroup v2 controller path lookup logic from being
>>>   entered (see the sketch below). No clobbering.
>>> 2. cgroup v1 only hierarchy (legacy): the same logic applies. The v2
>>>   mount-path setup won't be entered since is_cgroupsV2 == false.
>>>   No clobbering.
>>> 
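>>> In other words, the guard has this shape (a sketch; only the
>>> is_cgroupsV2 name is taken from the webrev, the control flow around
>>> it is assumed):
>>>
>>>   // decided once from /proc/cgroups, before mountinfo is parsed
>>>   if (is_cgroupsV2) {
>>>     // only the v2 branch may set the unified _mount_path
>>>   } else {
>>>     // only the per-controller v1 branch may set _mount_path
>>>   }
>>>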
>>> What am I missing? FWIW, I've played around with the idea of keeping
>>> cgroup v1 and cgroup v2 mount paths separate[1]. If that's preferred,
>>> happy to go this route.
>> 
>> The case that I think is a problem is …
>> 
>> cgroupv2 is detected in /proc/cgroups, so is_cgroupv2 is true
>> both cgroupv1 and cgroupv2 devices are mounted
> 
> That's not possible. is_cgroupv2 cannot be true in that case.
> 
>> At line cgroupSubsystem_linux.cpp:292 in your new webrev, you process the cgroupv1
>> devices and clobber _mount_path.
>> 
>> What keeps this from occurring?
> 
> In that case 'is_cgroupv2 == false': having any cgroups v1 controllers
> mounted results in non-zero hierarchy IDs in /proc/cgroups, which in
> turn forces is_cgroupv2 to false.
> 
> From man 7 cgroups:
> 
> """"
>   /proc files
>       /proc/cgroups (since Linux 2.6.24)
> 
>              The fields in this file are, from left to right:
> 
>              1. The name of the controller.
> 
>              2. The unique ID of the cgroup hierarchy on which this controller is mounted.  If multiple cgroups v1 controllers are bound to the
>                 same hierarchy, then each will show the same hierarchy ID in this field.  The value in this field will be 0 if:
> 
>                   a) the controller is not mounted on a cgroups v1 hierarchy;
> 
>                   b) the controller is bound to the cgroups v2 single unified hierarchy; or
> 
>                   c) the controller is disabled (see below).
> """
> 
> The only way a controller can have hierarchy ID == 0 on a cgroups v1
> system is case a). The IDs are never all 0 when cgroup v1 and cgroup v2
> are *both* mounted (hybrid).
> One example of such a /proc/cgroups is on my F30 workstation (hybrid):
> 
> $ cat /proc/cgroups 
> #subsys_name	hierarchy	num_cgroups	enabled
> cpuset	5	1	1
> cpu	3	1	1
> cpuacct	3	1	1
> blkio	11	1	1
> memory	6	136	1
> devices	9	75	1
> freezer	7	1	1
> net_cls	4	1	1
> perf_event	2	1	1
> net_prio	4	1	1
> hugetlb	10	1	1
> pids	8	81	1
> 
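> As a standalone illustration of that check, a minimal sketch in plain
> C++ (not the actual hotspot code, which parses this file with its own
> helpers):
>
>   #include <cstdio>
>
>   // True only if every enabled controller in /proc/cgroups reports
>   // hierarchy ID 0, i.e. nothing is bound to a cgroups v1 hierarchy.
>   static bool all_controllers_unbound_or_v2() {
>     FILE* f = fopen("/proc/cgroups", "r");
>     if (f == NULL) return false;
>     char line[256], name[64];
>     int hierarchy_id, num_cgroups, enabled;
>     bool result = true;
>     while (fgets(line, sizeof(line), f) != NULL) {
>       if (line[0] == '#') continue;  // skip the header line
>       if (sscanf(line, "%63s %d %d %d", name, &hierarchy_id,
>                  &num_cgroups, &enabled) != 4) continue;
>       if (enabled != 0 && hierarchy_id != 0) {
>         result = false;  // some controller sits on a v1 hierarchy
>         break;
>       }
>     }
>     fclose(f);
>     return result;
>   }
>
> On the hybrid listing above this returns false, because memory, cpu,
> etc. all carry non-zero hierarchy IDs.
>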
> And the corresponding mountinfo:
> 
> $ grep cgroup /proc/self/mountinfo 
> 30 23 0:26 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:4 - tmpfs tmpfs ro,seclabel,mode=755
> 31 30 0:27 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:5 - cgroup2 cgroup2 rw,seclabel,nsdelegate
> 32 30 0:28 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,seclabel,xattr,name=systemd
> 35 30 0:31 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:7 - cgroup cgroup rw,seclabel,perf_event
> 36 30 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:8 - cgroup cgroup rw,seclabel,cpu,cpuacct
> 37 30 0:33 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:9 - cgroup cgroup rw,seclabel,net_cls,net_prio
> 38 30 0:34 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:10 - cgroup cgroup rw,seclabel,cpuset
> 39 30 0:35 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,seclabel,memory
> 40 30 0:36 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,seclabel,freezer
> 41 30 0:37 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,seclabel,pids
> 42 30 0:38 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,seclabel,devices
> 43 30 0:39 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,seclabel,hugetlb
> 44 30 0:40 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,seclabel,blkio
> 
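> A matching sketch for such mountinfo lines (again standalone and
> hypothetical, not the webrev's parser): the filesystem type after the
> ' - ' separator distinguishes v1 ('cgroup') from v2 ('cgroup2').
>
>   #include <cstdio>
>   #include <cstring>
>
>   // Extract mount point and fs type from one /proc/self/mountinfo
>   // line. The optional fields end at the " - " separator.
>   static bool parse_cgroup_mount(const char* line,
>                                  char (&mnt)[256], char (&fs)[64]) {
>     // Skip mount ID, parent ID, major:minor and root; keep mount point.
>     if (sscanf(line, "%*d %*d %*d:%*d %*s %255s", mnt) != 1) return false;
>     const char* sep = strstr(line, " - ");
>     if (sep == NULL || sscanf(sep + 3, "%63s", fs) != 1) return false;
>     return strcmp(fs, "cgroup") == 0      // a cgroup v1 controller mount
>         || strcmp(fs, "cgroup2") == 0;    // the cgroup v2 unified mount
>   }
>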
> Does that make sense?

Ok, that makes sense.

Thanks for the clarification.

Bob.

> 
> Thanks,
> Severin
> 
>>> While thinking about this issue, I've decided to actually write a test
>>> (using WhiteBox) so that all these cases are covered and will continue
>>> to work. Running the test, CgroupSubsystemFactory, with fastdebug shows
>>> that there is no clobbering, if that matters. It's also a way to
>>> reproduce the issue, so we now have a regression test too
>>> (testCgroupv1NoMounts). Lines 325-332 in [2] are the actual fix.
>>> Without them, the test fails.
>>> 
>>>> Maybe you should just do the cgroupv2 processing if /proc/cgroups has
>>>> cgroupv2 enabled, and if that fails, continue looking for cgroup v1
>>>> mounts.
>>> 
>>> FWIW, I've amended the cgroup v1 logic to also check the file system
>>> type, which needs to be 'cgroup' (not 'cgroup2') for cgroup v1. This
>>> isn't strictly needed, but it more accurately reflects the intent IMO.
>> 
>> Sounds good.
>> 
>>>> Do you really need to do this?  The BSS section should be zeroed.
>>>> 
>>>> 55   for (int i = 0; i < CG_INFO_LENGTH; i++) {
>>>> 56     cg_infos[i]._name = NULL;
>>>> 57     cg_infos[i]._cgroup_path = NULL;
>>>> 58     cg_infos[i]._root_mount_path = NULL;
>>>> 59     cg_infos[i]._mount_path = NULL;
>>>> 60   }
>>> 
>>> I believe so. The CgroupInfo array is stack-allocated and we should
>>> initialize its values properly. That said, we should do that in the
>>> constructor. Done now.
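>>>
>>> For reference, such a constructor might look like this (a sketch built
>>> from the quoted field names; the actual webrev code may differ):
>>>
>>>   class CgroupInfo {
>>>    public:
>>>     // Zero out all pointers so cleanup can free them unconditionally.
>>>     CgroupInfo() :
>>>       _name(NULL),
>>>       _cgroup_path(NULL),
>>>       _root_mount_path(NULL),
>>>       _mount_path(NULL) {}
>>>    private:
>>>     char* _name;
>>>     char* _cgroup_path;
>>>     char* _root_mount_path;
>>>     char* _mount_path;
>>>   };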
>> 
>> You are correct.  I was only looking at the diffs and not the full source.
>> 
>>>> You don’t need to check for NULL in cleanup; os::free can handle NULL
>>>> pointers.
>>>> 
>>>> 339 void CgroupSubsystemFactory::cleanup(CgroupInfo* cg_infos) {
>>>> 340   assert(cg_infos != NULL, "Invariant");
>>>> 341   for (int i = 0; i < CG_INFO_LENGTH; i++) {
>>>> 342     if (cg_infos[i]._name != NULL) {
>>>> 343       os::free(cg_infos[i]._name);
>>>> 344     }
>>>> 345     if (cg_infos[i]._cgroup_path != NULL) {
>>>> 346       os::free(cg_infos[i]._cgroup_path);
>>>> 347     }
>>>> 348     if (cg_infos[i]._root_mount_path != NULL) {
>>>> 349       os::free(cg_infos[i]._root_mount_path);
>>>> 350     }
>>>> 351     if (cg_infos[i]._mount_path != NULL) {
>>>> 352       os::free(cg_infos[i]._mount_path);
>>>> 353     }
>>>> 354   }
>>>> 355 }
>>> 
>>> Fixed.
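>>>
>>> The cleanup then collapses to unconditional frees, along these lines
>>> (a sketch of the simplification; os::free, like free(3), accepts NULL):
>>>
>>>   void CgroupSubsystemFactory::cleanup(CgroupInfo* cg_infos) {
>>>     assert(cg_infos != NULL, "Invariant");
>>>     for (int i = 0; i < CG_INFO_LENGTH; i++) {
>>>       os::free(cg_infos[i]._name);          // os::free(NULL) is a no-op
>>>       os::free(cg_infos[i]._cgroup_path);
>>>       os::free(cg_infos[i]._root_mount_path);
>>>       os::free(cg_infos[i]._mount_path);
>>>     }
>>>   }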
>>> 
>>> Let me know what you think about the updated webrev.
>>> 
>>> Thanks,
>>> Severin
>>> 
>>> [1] http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8239785/02/webrev/
>>> [2] http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8239785/05/webrev/src/hotspot/os/linux/cgroupSubsystem_linux.cpp.html
>>> 
>>>> Bob.
>>>> 
>>>> 
>>>> 
>>>>> On Feb 28, 2020, at 11:42 AM, Severin Gehwolf <sgehwolf at redhat.com> wrote:
>>>>> 
>>>>> Hi Bob,
>>>>> 
>>>>> Thanks for the review!
>>>>> 
>>>>> On Tue, 2020-02-25 at 15:08 -0500, Bob Vandette wrote:
>>>>>> http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8239785/01/webrev/src/hotspot/os/linux/cgroupSubsystem_linux.cpp.sdiff.html
>>>>>> 
>>>>>> If both cgroupv1 and v2 controllers are mounted, you might have some
>>>>>> problems. You might want to only look for cgroup2 mounts if
>>>>>> is_cgroupV2 is true, and otherwise only look for cgroup mounts.
>>>>> 
>>>>> OK.
>>>>> 
>>>>>> 194           cg_infos[i]._mount_path = os::strdup(tmp_mount_point);
>>>>>>
>>>>>> This will stomp the pointer:
>>>>>>
>>>>>> 213           cg_infos[memory_idx]._mount_path = os::strdup(tmpmount);
>>>>> 
>>>>> You are right. If we are on a hybrid hierarchy and for some reason
>>>>> the cgroup v2 mount shows up *after* cgroup mounts in
>>>>> /proc/self/mountinfo, we'd get the mount path wrong. So ordering
>>>>> matters! The reason this worked for me on hybrid is that cgroup2
>>>>> mounts come *before* cgroup mounts on the system I've tested.
>>>>> 
>>>>>> You should free all strdup’d pointers in cg_infos in places where
>>>>>> you fail and return NULL.
>>>>>>
>>>>>> 81       cg_infos[memory_idx]._name = os::strdup(name);
>>>>>> 213           cg_infos[memory_idx]._mount_path = os::strdup(tmpmount);
>>>>>> 214           cg_infos[memory_idx]._root_mount_path = os::strdup(tmproot);
>>>>> 
>>>>> Done now. More thoughts?
>>>>> 
>>>>> Updated webrev: 
>>>>> http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8239785/03/webrev/
>>>>> 
>>>>> Thanks,
>>>>> Severin
>>>>> 
>>>>>>> On Feb 25, 2020, at 2:22 PM, Severin Gehwolf <sgehwolf at redhat.com> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Could I please get reviews of this cgroup-related patch? After
>>>>>>> JDK-8230305, old systems with no mounted cgroup controllers would
>>>>>>> get detected as cgroups v2 (wrong). Then, when initializing the v2
>>>>>>> subsystem, it would fail. The trace about the cgroupv2 mount point
>>>>>>> not being found is misleading in this case. While the outcome is
>>>>>>> similar pre/post patch (NULL cgroup subsystem), I'd like to be
>>>>>>> explicit about this case.
>>>>>>>
>>>>>>> The suggested fix is to look at /proc/self/mountinfo in order to
>>>>>>> correctly detect whether this is cgroup v2, or cgroup v1 with no
>>>>>>> mounted controllers. In the latter case we just stop initialization,
>>>>>>> as we'd fail later in the cgroupv1 code anyway. This aligns the
>>>>>>> hotspot code with core-libs after JDK-8239559.
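>>>>>>>
>>>>>>> In outline, the added check looks like this (hypothetical flag
>>>>>>> names; the actual webrev code differs):
>>>>>>>
>>>>>>>   // /proc/cgroups said v2, but no cgroup2 entry was found in
>>>>>>>   // /proc/self/mountinfo: old system, nothing mounted at all.
>>>>>>>   if (is_cgroupsV2 && !cgroup2_mount_found) {
>>>>>>>     return NULL; // stop here; cgroupv1 code would fail later anyway
>>>>>>>   }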
>>>>>>> 
>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8239785
>>>>>>> webrev: 
>>>>>>> http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8239785/01/webrev/
>>>>>>> 
>>>>>>> Testing: jdk-submit, hotspot docker tests on cgroup v1 and
>>>>>>> cgroup v2. All pass.
>>>>>>> 
>>>>>>> Thoughts?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Severin
>>>>>>> 
>>> 
>>> 
>>> 
> 


