RFR: 8146115 - Improve docker container detection and resource configuration usage

Robbin Ehn robbin.ehn at oracle.com
Thu Oct 5 19:17:10 UTC 2017


Hi Alex, just a short question,

You said something about "Marathon/Mesos[1] and Kubernetes[2] use shares and quotas".
If you only use shares and quotas, do you not care about numa? (read: trust the kernel)
One would think that you would set up a cgroup per numa node and split those into cgroups with shares/quotas.

Thanks, Robbin

On 10/05/2017 06:43 PM, Alex Bagehot wrote:
> Hi David,
> 
> On Wed, Oct 4, 2017 at 10:51 PM, David Holmes <david.holmes at oracle.com>
> wrote:
> 
>> Hi Alex,
>>
>> Can you tell me how shares/quotas are actually implemented in terms of
>> allocating "cpus" to processes when shares/quotas are being applied?
> 
> 
> The allocation of cpus to processes/threads (tasks, as the kernel sees them),
> or the other way round, is called balancing, which is done by scheduling
> domains[3].
> 
> cpu shares use CFS "group" scheduling[1] to apply the share to all the
> tasks (threads) in the container. The container cpu shares weight maps
> directly to a task's weight in CFS, which, since the task is part of a group,
> is divided by the number of tasks in the group (i.e. a default container share
> of 1024 with 2 threads in the container/group would result in each
> thread/task having a weight of 512 [4]). These are the same values used by nice[2].
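> 
> As a purely illustrative sketch (assuming a cgroup v1 cpu controller mounted
> at /sys/fs/cgroup/cpu, and a made-up runnable task count), that mapping looks
> roughly like this:
> 
>     // Sketch only: read the container's cpu.shares and show how CFS group
>     // scheduling splits that weight across the runnable tasks in the group.
>     #include <fstream>
>     #include <iostream>
> 
>     int main() {
>       long shares = 1024;                                // kernel default
>       std::ifstream f("/sys/fs/cgroup/cpu/cpu.shares");  // cgroup v1 path
>       long v;
>       if (f >> v) shares = v;
> 
>       long runnable_tasks = 2;  // hypothetical; really from /proc/sched_debug
> 
>       // Each task's effective CFS weight is the group weight divided by the
>       // number of runnable tasks in the group (1024 / 2 = 512 here).
>       std::cout << "group weight " << shares
>                 << ", per-task weight ~" << shares / runnable_tasks << "\n";
>     }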
> 
> You can observe the task weight and other scheduler numbers in
> /proc/sched_debug [4]. You can also trace scheduler activity in the kernel,
> which typically tells you the tasks involved, the cpu, and the event
> (switch, wakeup, etc.).
> 
> 
>> For example in a 12 cpu system if I have a 50% share do I get all 12 CPUs
>> for 50% of a "quantum" each, or do I get 6 CPUs for a full quantum each?
>>
> 
> You get 12 cpus for 50% of the time on average if there is another
> workload that has the same weight as you and is consuming as much as it can.
> If there's nothing else running on the machine you get 12 cpus for 100% of
> the time with a cpu-shares-only config (i.e. the burst capacity).
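> 
> As a quick worked version of those numbers (a sketch only, using the 12-cpu
> case above and a default 1024 weight for each group):
> 
>     // Sketch: expected cpu time for a group on a 12-cpu host under CFS shares.
>     #include <iostream>
> 
>     int main() {
>       const double cpus = 12.0;
>       const double my_weight = 1024.0;        // our group's cpu.shares
>       const double competing_weight = 1024.0; // an equally weighted, busy group
> 
>       // Under contention we get weight/(sum of weights) of every cpu's time:
>       std::cout << "contended: ~"
>                 << cpus * my_weight / (my_weight + competing_weight)
>                 << " cpu-seconds per second (all 12 cpus, half the time each)\n";
>       // With nothing else runnable, shares alone impose no cap at all:
>       std::cout << "idle host: " << cpus << " cpu-seconds per second\n";
>     }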
> 
> I validated that the share was balanced over all the cpus by running Linux
> perf events and checking that there were cpu samples on all cpus. There are
> bound to be other ways of doing it too.
> 
> 
>>
>> When we try to use the "number of processors" to control the number of
>> threads created, or the number of partitions in a task, then we really want
>> to know how many CPUs we can actually be concurrently running on!
>>
> 
> Makes sense to check. Hopefully there aren't any major errors or omissions
> in the above.
> Thanks,
> Alex
> 
> [1] https://lwn.net/Articles/240474/
> [2] https://github.com/torvalds/linux/blob/368f89984bb971b9f8b69eeb85ab19a89f985809/kernel/sched/core.c#L6735
> [3] https://lwn.net/Articles/80911/ / http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf
> 
> [4]
> 
> cfs_rq[13]:/system.slice/docker-f5681788d6daab249c90810fe60da429a2565b901ff34245922a578635b5d607.scope
>    .exec_clock                    : 0.000000
>    .MIN_vruntime                  : 0.000001
>    .min_vruntime                  : 8090.087297
>    .max_vruntime                  : 0.000001
>    .spread                        : 0.000000
>    .spread0                       : -124692718.052832
>    .nr_spread_over                : 0
>    .nr_running                    : 1
>    .load                          : 1024
>    .runnable_load_avg             : 1023
>    .blocked_load_avg              : 0
>    .tg_load_avg                   : 2046
>    .tg_load_contrib               : 1023
>    .tg_runnable_contrib           : 1023
>    .tg->runnable_avg              : 2036
>    .tg->cfs_bandwidth.timer_active: 0
>    .throttled                     : 0
>    .throttle_count                : 0
>    .se->exec_start                : 236081964.515645
>    .se->vruntime                  : 24403993.326934
>    .se->sum_exec_runtime          : 8091.135873
>    .se->load.weight               : 512
>    .se->avg.runnable_avg_sum      : 45979
>    .se->avg.runnable_avg_period   : 45979
>    .se->avg.load_avg_contrib      : 511
>    .se->avg.decay_count           : 0
> 
>>
>> Thanks,
>> David
>>
>>
>> On 5/10/2017 6:01 AM, Alex Bagehot wrote:
>>
>>> Hi,
>>>
>>> On Wed, Oct 4, 2017 at 7:51 PM, Bob Vandette <bob.vandette at oracle.com>
>>> wrote:
>>>
>>>
>>>> On Oct 4, 2017, at 2:30 PM, Robbin Ehn <robbin.ehn at oracle.com> wrote:
>>>>>
>>>>> Thanks Bob for looking into this.
>>>>>
>>>>> On 10/04/2017 08:14 PM, Bob Vandette wrote:
>>>>>
>>>>>> Robbin,
>>>>>> I've looked into this issue and you are correct.  I do have to examine
>>>>>> both the sched_getaffinity results as well as the cgroup cpu subsystem
>>>>>> configuration files in order to provide a reasonable value for
>>>>>> active_processors.  If I was only interested in cpusets, I could simply
>>>>>> rely on the getaffinity call but I also want to factor in shares and
>>>>>> quotas as well.
>>>>>>
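>>>>>> Roughly the kind of combination I mean, as a sketch only (not the
>>>>>> actual changeset; cgroup v1 paths assumed):
>>>>>>
>>>>>>     // Sketch: derive an active processor count from both the affinity
>>>>>>     // mask and the cgroup cpu controller (quota/period and shares).
>>>>>>     #include <sched.h>
>>>>>>     #include <unistd.h>
>>>>>>     #include <algorithm>
>>>>>>     #include <fstream>
>>>>>>     #include <iostream>
>>>>>>
>>>>>>     static long read_long(const char* path, long fallback) {
>>>>>>       std::ifstream f(path);
>>>>>>       long v = fallback;
>>>>>>       return (f >> v) ? v : fallback;
>>>>>>     }
>>>>>>
>>>>>>     int main() {
>>>>>>       // 1. What the affinity mask (e.g. set by numactl/taskset) allows.
>>>>>>       cpu_set_t mask;
>>>>>>       CPU_ZERO(&mask);
>>>>>>       long cpus = sysconf(_SC_NPROCESSORS_ONLN);
>>>>>>       if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
>>>>>>         cpus = CPU_COUNT(&mask);
>>>>>>       }
>>>>>>
>>>>>>       // 2. What the cgroup cpu controller limits us to (v1 file names).
>>>>>>       long quota  = read_long("/sys/fs/cgroup/cpu/cpu.cfs_quota_us", -1);
>>>>>>       long period = read_long("/sys/fs/cgroup/cpu/cpu.cfs_period_us", 100000);
>>>>>>       long shares = read_long("/sys/fs/cgroup/cpu/cpu.shares", -1);
>>>>>>
>>>>>>       if (quota > 0 && period > 0)
>>>>>>         cpus = std::min(cpus, std::max(1L, quota / period));
>>>>>>       if (shares > 0)
>>>>>>         cpus = std::min(cpus, std::max(1L, shares / 1024)); // 1024 = one cpu
>>>>>>
>>>>>>       std::cout << "active processor count: " << cpus << "\n";
>>>>>>     }
>>>>>>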
>>>>>
>>>>> We had a quick discussion at the office; we actually do think that you
>>>>> could skip reading the shares and quotas.
>>>>> It really depends on what the user expects: if he gives us 4 cpus at
>>>>> 50% or 2 full cpus, what does he expect the difference to be?
>>>>> One could argue that he 'knows' that he will only use at most 50%, and
>>>>> thus we can act as if he is giving us 4 full cpus.
>>>>> But I'll leave that up to you, just a thought we had.
>>>>
>>>> It's my opinion that we should do something if someone makes the effort
>>>> to configure their containers to use quotas or shares.  There are many
>>>> different opinions on what the right "something" is.
>>>>
>>>>
>>> It might be interesting to look at some real instances of how java
>>> might[3]
>>> be deployed in containers.
>>> Marathon/Mesos[1] and Kubernetes[2] use shares and quotas, so this is a
>>> vast chunk of deployments that need both of them today.
>>>
>>>
>>>
>>>> Many developers who are trying to deploy apps that use containers say
>>>> they don't like cpusets.  This is too limiting for them, especially when
>>>> the server configurations vary within their organization.
>>>>
>>>>
>>> True, however Kubernetes has an alpha feature[5] where it allocates
>>> cpusets to containers that request a whole number of cpus. Previously,
>>> without cpusets, any container could run on any cpu, which we know might
>>> not be good for some workloads that want isolation. A request for a
>>> fractional or burstable amount of cpu would be allocated from a shared
>>> cpu pool. So although manual allocation of cpusets will be flaky[3],
>>> automation should be able to make it work.
>>>
>>>
>>>
>>>>   From everything I’ve read including source code, there seems to be a
>>>> consensus that
>>>> shares and quotas are being used as a way to specify a fraction of a
>>>> system (number of cpus).
>>>>
>>>>
>>> A refinement[6] on this is:
>>> Shares can be used for guaranteed cpu - you will always get your share.
>>> Quota[4] is a limit/constraint - you can never get more than the quota.
>>> So given the limit (described below) on how many shares will be allocated
>>> on a host, you can have burstable (or overcommit) capacity if your shares
>>> are less than your quota.
>>>
>>>
>>>
>>>> Docker added --cpus, which is implemented using quotas and periods.  They
>>>> adjust these two parameters to provide a way of calculating the number of
>>>> cpus that will be available to a process (quota/period).  Amazon also
>>>> documents that cpu shares are defined to be a multiple of 1024, where 1024
>>>> represents a single cpu and a share value of N*1024 represents N cpus.
>>>>
>>>>
>>> Kubernetes and Mesos/Marathon also use the N*1024 shares per host to
>>> allocate resources automatically.
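>>>
>>> To make that arithmetic concrete, a small sketch (not any particular
>>> orchestrator's code; the request/limit numbers are made up) of how a cpu
>>> request and limit map onto the cgroup values under those conventions:
>>>
>>>     // Sketch: map a fractional cpu request/limit to cgroup cpu values
>>>     // using the N*1024-shares and quota/period conventions above.
>>>     #include <iostream>
>>>
>>>     int main() {
>>>       const double cpu_request = 2.5;     // guaranteed share, in cpus
>>>       const double cpu_limit   = 4.0;     // hard cap, in cpus
>>>       const long   period_us   = 100000;  // default CFS period
>>>
>>>       long shares   = static_cast<long>(cpu_request * 1024);     // 2560
>>>       long quota_us = static_cast<long>(cpu_limit * period_us);  // 400000
>>>
>>>       std::cout << "cpu.shares        = " << shares    << "\n";
>>>       std::cout << "cpu.cfs_period_us = " << period_us << "\n";
>>>       std::cout << "cpu.cfs_quota_us  = " << quota_us  << "\n";
>>>       // On the JVM side the cap can be recovered as quota/period = 4 cpus
>>>       // and the guaranteed fraction as shares/1024 = 2.5 cpus.
>>>     }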
>>>
>>> Hopefully this provides some background on what a couple of orchestration
>>> systems that will be running java are doing currently in this area.
>>> Thanks,
>>> Alex
>>>
>>>
>>> [1] https://github.com/apache/mesos/commit/346cc8dd528a28a6e1f1cbdb4c95b8bdea2f6070 / (now out of date but appears to be a reasonable intro: https://zcox.wordpress.com/2014/09/17/cpu-resources-in-docker-mesos-and-marathon/ )
>>> [1a] https://youtu.be/hJyAfC-Z2xk?t=2439
>>>
>>> [2] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
>>>
>>> [3] https://youtu.be/w1rZOY5gbvk?t=2479
>>>
>>> [4] https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
>>> https://landley.net/kdocs/ols/2010/ols2010-pages-245-254.pdf
>>> https://lwn.net/Articles/428175/
>>>
>>> [5] https://github.com/kubernetes/community/blob/43ce57ac476b9f2ce3f0220354a075e095a0d469/contributors/design-proposals/node/cpu-manager.md /
>>> https://github.com/kubernetes/kubernetes/commit/00f0e0f6504ad8dd85fcbbd6294cd7cf2475fc72 / https://vimeo.com/226858314
>>>
>>> [6] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
>>>
>>>
>>>> Of course these are just conventions.  This is why I provided a way of
>>>> specifying the number of CPUs so folks deploying Java services can be
>>>> certain they get what they want.
>>>>
>>>> Bob.
>>>>
>>>>
>>>>>> I had assumed that when sched_setaffinity was called (in your case by
>>>>>> numactl) that the cgroup cpu config files would be updated to reflect
>>>>>> the current processor affinity for the running process. This is not
>>>>>> correct.  I have updated my changeset and have successfully run with
>>>>>> your examples below.  I'll post a new webrev soon.
>>>>>
>>>>> I see, thanks again!
>>>>>
>>>>> /Robbin
>>>>>
>>>>>> Thanks,
>>>>>> Bob.
>>>>>>
>>>>>>>
>>>>>>>> I still want to include the flag for at least one Java release in the
>>>>>>>> event that the new behavior causes some regression in behavior.  I'm
>>>>>>>> trying to make the detection robust so that it will fall back to the
>>>>>>>> current behavior in the event that cgroups is not configured as
>>>>>>>> expected, but I'd like to have a way of forcing the issue.  JDK 10 is
>>>>>>>> not supposed to be a long term support release, which makes it a good
>>>>>>>> target for this new behavior.
>>>>>>>> I agree with David that once we commit to cgroups, we should extract
>>>>>>>> all VM configuration data from that source.  There's more information
>>>>>>>> available for cpusets than just processor affinity that we might want
>>>>>>>> to consider when calculating the number of processors to assume for
>>>>>>>> the VM.  There's exclusivity and effective cpu data available in
>>>>>>>> addition to the cpuset string.
>>>>>>>
>>>>>>> cgroup only contains limits, not the real hard limits.
>>>>>>> You must consider the affinity mask. We who have numa nodes run this
>>>>>>> all the time when benchmarking:
>>>>>>>
>>>>>>> [rehn at rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java -Xlog:os=debug -cp . ForEver | grep proc
>>>>>>> [0.001s][debug][os] Initial active processor count set to 16
>>>>>>> [rehn at rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java -Xlog:os=debug -XX:+UseContainerSupport -cp . ForEver | grep proc
>>>>>>> [0.001s][debug][os] Initial active processor count set to 32
>>>>>>>
>>>>>>> That must be set to 16, otherwise the flag is really bad for us.
>>>>>>> So the flag actually breaks the little numa support we have now.
>>>>>>>
>>>>>>> Thanks, Robbin
>>>>>>>
>>>>>>
>>>>
>>>>

