RFR: 8146115 - Improve docker container detection and resource configuration usage

Alex Bagehot ceeaspb at gmail.com
Fri Oct 6 07:20:34 UTC 2017


Hi Robbin,


On Thursday, October 5, 2017, Robbin Ehn <robbin.ehn at oracle.com> wrote:

> Hi Alex, just a short question,
>
> You said something about "Marathon/Mesos[1] and Kubernetes[2] use shares
> and quotas"
> If you only use shares and quotas, do you not care about numa? (read: trust
> the kernel)
> One would think that you would set up a cgroup per numa node and split those
> into cgroups with shares/quotas.


It's a good point.
I certainly care about numa; we test, I think, much as you do, numactl'ing
driver/server processes to stay in control of that variable.

Kubernetes doesn't handle this yet [1]. Neither does Mesos [2].

Thanks
Alex

[1] https://github.com/kubernetes/kubernetes/issues/49964
[2] https://issues.apache.org/jira/plugins/servlet/mobile#issue/MESOS-6548 /
https://issues.apache.org/jira/plugins/servlet/mobile#issue/MESOS-5342


> Thanks, Robbin
>
> On 10/05/2017 06:43 PM, Alex Bagehot wrote:
>
>> Hi David,
>>
>> On Wed, Oct 4, 2017 at 10:51 PM, David Holmes <david.holmes at oracle.com>
>> wrote:
>>
>>> Hi Alex,
>>>
>>> Can you tell me how shares/quotas are actually implemented in terms of
>>> allocating "cpus" to processes when shares/quotas are being applied?
>>>
>>
>>
>> The allocation of cpus to processes/threads (tasks, as the kernel sees them),
>> or the other way round, is called balancing, and is done by scheduling
>> domains[3].
>>
>> cpu shares use CFS "group" scheduling[1] to apply the share to all the
>> tasks (threads) in the container. The container's cpu shares weight maps
>> directly to a task's weight in CFS, which, since the task is part of a group,
>> is divided by the number of tasks in the group (i.e. a default container share
>> of 1024 with 2 threads in the container/group results in each thread/task
>> having a weight of 512[4]). These are the same weight values used by nice[2].
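>>
>> To make that concrete, here is a rough sketch (not the HotSpot change under
>> review; it assumes the cgroup v1 cpu controller is mounted at
>> /sys/fs/cgroup/cpu, as Docker does by default, and it takes the task count as
>> an argument purely for illustration) that reads the container's share and
>> shows the per-task weight split described above:
>>
>>     import java.nio.file.Files;
>>     import java.nio.file.Paths;
>>
>>     public class ShareWeight {
>>         public static void main(String[] args) throws Exception {
>>             // cpu.shares is the group's weight; 1024 corresponds to the
>>             // weight of one nice-0 task.
>>             long shares = Long.parseLong(Files.readAllLines(
>>                 Paths.get("/sys/fs/cgroup/cpu/cpu.shares")).get(0).trim());
>>             // With N runnable tasks in the group, CFS splits the group
>>             // weight between them, e.g. 1024 shares / 2 tasks = 512 each.
>>             int tasks = args.length > 0 ? Integer.parseInt(args[0]) : 2;
>>             System.out.println("container shares = " + shares
>>                 + ", approx per-task weight = " + (shares / tasks));
>>         }
>>     }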
>>
>> You can observe the task weight and other scheduler numbers in
>> /proc/sched_debug [4]. You can also trace scheduler activity in the kernel,
>> which typically tells you the tasks involved, the cpu, and the event
>> (switch, wakeup, etc.).
>>
>>
>>> For example in a 12 cpu system if I have a 50% share do I get all 12 CPUs
>>> for 50% of a "quantum" each, or do I get 6 CPUs for a full quantum each?
>>>
>>>
>> You get 12 cpus for 50% of the time on average if there is another
>> workload that has the same weight as you and is consuming as much as it
>> can.
>> If there's nothing else running on the machine you get 12 cpus for 100% of
>> the time with a cpu-shares-only config (i.e. the burst capacity).
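>>
>> As a toy calculation of the above (made-up numbers, nothing measured):
>>
>>     public class ShareOfMachine {
>>         public static void main(String[] args) {
>>             int cpus = 12;            // cpus visible to the scheduler
>>             long myShares = 1024;     // this container's cpu shares
>>             long otherShares = 1024;  // an equally weighted, always-busy neighbour
>>
>>             // CFS gives each group cpu time in proportion to its weight, so
>>             // against an equal busy neighbour we average 50% of every cpu...
>>             double fraction = (double) myShares / (myShares + otherShares);
>>             System.out.printf("~%.0f%% of each of %d cpus = %.1f cpus' worth of time%n",
>>                     fraction * 100, cpus, fraction * cpus);
>>             // ...and 100% of all 12 cpus when nothing else is runnable (the
>>             // burst case).
>>         }
>>     }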
>>
>> I validated that the share was balanced over all the cpus by running Linux
>> perf events and checking that there were cpu samples on all cpus. There are
>> bound to be other ways of doing it too.
>>
>>
>>
>>> When we try to use the "number of processors" to control the number of
>>> threads created, or the number of partitions in a task, then we really
>>> want
>>> to know how many CPUs we can actually be concurrently running on!
>>>
>>>
>> Makes sense to check. Hopefully there aren't any major errors or omissions
>> in the above.
>> Thanks,
>> Alex
>>
>> [1] https://lwn.net/Articles/240474/
>> [2] https://github.com/torvalds/linux/blob/368f89984bb971b9f8b69eeb85ab19a89f985809/kernel/sched/core.c#L6735
>> [3] https://lwn.net/Articles/80911/ / http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf
>>
>> [4]
>>
>> cfs_rq[13]:/system.slice/docker-f5681788d6daab249c90810fe60da429a2565b901ff34245922a578635b5d607.scope
>>   .exec_clock                    : 0.000000
>>   .MIN_vruntime                  : 0.000001
>>   .min_vruntime                  : 8090.087297
>>   .max_vruntime                  : 0.000001
>>   .spread                        : 0.000000
>>   .spread0                       : -124692718.052832
>>   .nr_spread_over                : 0
>>   .nr_running                    : 1
>>   .load                          : 1024
>>   .runnable_load_avg             : 1023
>>   .blocked_load_avg              : 0
>>   .tg_load_avg                   : 2046
>>   .tg_load_contrib               : 1023
>>   .tg_runnable_contrib           : 1023
>>   .tg->runnable_avg              : 2036
>>   .tg->cfs_bandwidth.timer_active: 0
>>   .throttled                     : 0
>>   .throttle_count                : 0
>>   .se->exec_start                : 236081964.515645
>>   .se->vruntime                  : 24403993.326934
>>   .se->sum_exec_runtime          : 8091.135873
>>   .se->load.weight               : 512
>>   .se->avg.runnable_avg_sum      : 45979
>>   .se->avg.runnable_avg_period   : 45979
>>   .se->avg.load_avg_contrib      : 511
>>   .se->avg.decay_count           : 0
>>
>>
>>
>>> Thanks,
>>> David
>>>
>>>
>>> On 5/10/2017 6:01 AM, Alex Bagehot wrote:
>>>
>>>> Hi,
>>>>
>>>> On Wed, Oct 4, 2017 at 7:51 PM, Bob Vandette <bob.vandette at oracle.com>
>>>> wrote:
>>>>
>>>>
>>>> On Oct 4, 2017, at 2:30 PM, Robbin Ehn <robbin.ehn at oracle.com> wrote:
>>>>>
>>>>>>
>>>>>> Thanks Bob for looking into this.
>>>>>>
>>>>>> On 10/04/2017 08:14 PM, Bob Vandette wrote:
>>>>>>
>>>>>>> Robbin,
>>>>>>> I’ve looked into this issue and you are correct.  I do have to examine
>>>>>>> both the sched_getaffinity results as well as the cgroup cpu subsystem
>>>>>>> configuration files in order to provide a reasonable value for
>>>>>>> active_processors.  If I was only interested in cpusets, I could simply
>>>>>>> rely on the getaffinity call but I also want to factor in shares and
>>>>>>> quotas as well.
>>>>>>
>>>>>> We had a quick discussion at the office; we actually do think that you
>>>>>> could skip reading the shares and quotas.
>>>>>> It really depends on what the user expects: if he gives us 4 CPUs at 50%
>>>>>> or 2 full CPUs, what does he expect the difference to be?
>>>>>> One could argue that he 'knows' he will only use at most 50% and thus we
>>>>>> can act as if he is giving us 4 full CPUs.
>>>>>> But I'll leave that up to you, just a thought we had.
>>>>>
>>>>> It’s my opinion that we should do something if someone makes the effort to
>>>>> configure their containers to use quotas or shares.  There are many
>>>>> different opinions on what that right “something” is.
>>>>>
>>>>>
>>>> It might be interesting to look at some real instances of how java
>>>> might[3] be deployed in containers.
>>>> Marathon/Mesos[1] and Kubernetes[2] use shares and quotas so this is a
>>>> vast chunk of deployments that need both of them today.
>>>>
>>>>
>>>>
>>>>> Many developers that are trying to deploy apps that use containers say
>>>>> they don’t like cpusets.  This is too limiting for them especially when
>>>>> the server configurations vary within their organization.
>>>>>
>>>>>
>>>> True, however Kubernetes has an alpha feature[5] where it allocates
>>>> cpusets to containers that request a whole number of cpus. Previously,
>>>> without cpusets, any container could run on any cpu, which we know might
>>>> not be good for some workloads that want isolation. A request for a
>>>> fractional or burstable amount of cpu would be allocated from a shared cpu
>>>> pool. So although manual allocation of cpusets will be flaky[3],
>>>> automation should be able to make it work.
>>>>
>>>>
>>>>
>>>>> From everything I’ve read, including source code, there seems to be a
>>>>> consensus that shares and quotas are being used as a way to specify a
>>>>> fraction of a system (number of cpus).
>>>>>
>>>> A refinement[6] on this is:
>>>> Shares can be used for guaranteed cpu - you will always get your share.
>>>> Quota[4] is a limit/constraint - you can never get more than the quota.
>>>> So, given the limit (described below) on how many shares will be allocated
>>>> on a host, you can have burstable (or overcommit) capacity if your shares
>>>> are less than your quota.
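>>>>
>>>> A small numeric illustration of that split (made-up values, using the
>>>> docker run flags --cpu-shares/--cpu-quota/--cpu-period only as an example):
>>>>
>>>>     public class BurstExample {
>>>>         public static void main(String[] args) {
>>>>             long shares   = 512;      // --cpu-shares=512  -> ~0.5 cpu guaranteed, by the N*1024 convention
>>>>             long quotaUs  = 200000;   // --cpu-quota=200000 with
>>>>             long periodUs = 100000;   // --cpu-period=100000 -> hard cap of 2 cpus
>>>>
>>>>             double guaranteed = shares / 1024.0;
>>>>             double limit = (double) quotaUs / periodUs;
>>>>             // Guarantee < limit, so the container can burst from its 0.5 cpu
>>>>             // share up to 2 cpus when the host has idle capacity.
>>>>             System.out.println("guaranteed ~" + guaranteed
>>>>                 + " cpu, capped at " + limit + " cpus");
>>>>         }
>>>>     }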
>>>>
>>>>
>>>>
>>>>> Docker added --cpus which is implemented using quotas and periods.  They
>>>>> adjust these two parameters to provide a way of calculating the number of
>>>>> cpus that will be available to a process (quota/period).  Amazon also
>>>>> documents that cpu shares are defined to be a multiple of 1024, where 1024
>>>>> represents a single cpu and a share value of N*1024 represents N cpus.
>>>>>
>>>> Kubernetes and Mesos/Marathon also use the N*1024 shares per host to
>>>> allocate resources automatically.
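>>>>
>>>> As a rough sketch of how those two conventions could be turned into a cpu
>>>> count (this is not Bob's patch; it assumes the cgroup v1 files Docker
>>>> writes under /sys/fs/cgroup/cpu and rounds up, which is only one possible
>>>> policy):
>>>>
>>>>     import java.nio.file.Files;
>>>>     import java.nio.file.Paths;
>>>>
>>>>     public class ContainerCpus {
>>>>         static long readLong(String path) throws Exception {
>>>>             return Long.parseLong(
>>>>                 Files.readAllLines(Paths.get(path)).get(0).trim());
>>>>         }
>>>>
>>>>         public static void main(String[] args) throws Exception {
>>>>             String base = "/sys/fs/cgroup/cpu/";
>>>>             long quota  = readLong(base + "cpu.cfs_quota_us");   // -1 means no quota
>>>>             long period = readLong(base + "cpu.cfs_period_us");  // typically 100000
>>>>             long shares = readLong(base + "cpu.shares");         // 1024 per "cpu" by convention
>>>>
>>>>             int host = Runtime.getRuntime().availableProcessors();
>>>>             // docker --cpus=N sets quota = N * period
>>>>             int fromQuota  = quota > 0 ? (int) Math.ceil((double) quota / period) : host;
>>>>             // orchestrators request N cpus as N * 1024 shares
>>>>             int fromShares = (int) Math.ceil(shares / 1024.0);
>>>>
>>>>             System.out.println("quota/period -> " + fromQuota
>>>>                 + " cpus, shares/1024 -> " + fromShares);
>>>>         }
>>>>     }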
>>>>
>>>> Hopefully this provides some background on what a couple of
>>>> orchestration
>>>> systems that will be running java are doing currently in this area.
>>>> Thanks,
>>>> Alex
>>>>
>>>>
>>>> [1] https://github.com/apache/mesos/commit/346cc8dd528a28a6e1f1cbdb4c95b8bdea2f6070
>>>> (now out of date but appears to be a reasonable intro:
>>>> https://zcox.wordpress.com/2014/09/17/cpu-resources-in-docker-mesos-and-marathon/ )
>>>> [1a] https://youtu.be/hJyAfC-Z2xk?t=2439
>>>>
>>>> [2] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
>>>>
>>>> [3] https://youtu.be/w1rZOY5gbvk?t=2479
>>>>
>>>> [4] https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
>>>> https://landley.net/kdocs/ols/2010/ols2010-pages-245-254.pdf
>>>> https://lwn.net/Articles/428175/
>>>>
>>>> [5] https://github.com/kubernetes/community/blob/43ce57ac476b9f2ce3f0220354a075e095a0d469/contributors/design-proposals/node/cpu-manager.md
>>>> / https://github.com/kubernetes/kubernetes/commit/00f0e0f6504ad8dd85fcbbd6294cd7cf2475fc72
>>>> / https://vimeo.com/226858314
>>>>
>>>> [6] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
>>>>
>>>>
>>>>> Of course these are just conventions.  This is why I provided a way of
>>>>> specifying the number of CPUs so folks deploying Java services can be
>>>>> certain they get what they want.
>>>>>
>>>>> Bob.
>>>>>
>>>>>
>>>>> I had assumed that when sched_setaffinity was called (in your case by
>>>>>>
>>>>>>>
>>>>>>> numactl) that the
>>>>>>
>>>>>
>>>>> cgroup cpu config files would be updated to reflect the current
>>>>>>
>>>>>>>
>>>>>>> processor affinity for the
>>>>>>
>>>>>
>>>>> running process. This is not correct.  I have updated my changeset and
>>>>>>
>>>>>>>
>>>>>>> have successfully
>>>>>>
>>>>>
>>>>> run with your examples below.  I’ll post a new webrev soon.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I see, thanks again!
>>>>>>
>>>>>> /Robbin
>>>>>>
>>>>>>> Thanks,
>>>>>>> Bob.
>>>>>>>
>>>>>>>>> I still want to include the flag for at least one Java release in the
>>>>>>>>> event that the new behavior causes some regression in behavior.  I’m
>>>>>>>>> trying to make the detection robust so that it will fall back to the
>>>>>>>>> current behavior in the event that cgroups is not configured as
>>>>>>>>> expected, but I’d like to have a way of forcing the issue.  JDK 10 is
>>>>>>>>> not supposed to be a long term support release which makes it a good
>>>>>>>>> target for this new behavior.
>>>>>>>>> I agree with David that once we commit to cgroups, we should extract
>>>>>>>>> all VM configuration data from that source.  There’s more information
>>>>>>>>> available for cpusets than just processor affinity that we might want
>>>>>>>>> to consider when calculating the number of processors to assume for
>>>>>>>>> the VM.  There’s exclusivity and effective cpu data available in
>>>>>>>>> addition to the cpuset string.
>>>>>>>>
>>>>>>>> cgroup only contains limits, not the real hard limits.
>>>>>>>> You must consider the affinity mask. We that have numa nodes do:
>>>>>>>>
>>>>>>>> [rehn at rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java
>>>>>>>> -Xlog:os=debug -cp . ForEver | grep proc
>>>>>>>> [0.001s][debug][os] Initial active processor count set to 16
>>>>>>>>
>>>>>>>> [rehn at rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java
>>>>>>>> -Xlog:os=debug -XX:+UseContainerSupport -cp . ForEver | grep proc
>>>>>>>> [0.001s][debug][os] Initial active processor count set to 32
>>>>>>>>
>>>>>>>> when benchmarking, all the time, and that count must be 16, otherwise
>>>>>>>> the flag is really bad for us.
>>>>>>>> So the flag actually breaks the little numa support we have now.
>>>>>>>>
>>>>>>>> Thanks, Robbin
>>>>>>>>
>>>>>>>
>>>>>
>>>>>

