RFR: 8146115 - Improve docker container detection and resource configuration usage

Alex Bagehot ceeaspb at gmail.com
Thu Oct 5 16:43:13 UTC 2017


Hi David,

On Wed, Oct 4, 2017 at 10:51 PM, David Holmes <david.holmes at oracle.com>
wrote:

> Hi Alex,
>
> Can you tell me how shares/quotas are actually implemented in terms of
> allocating "cpus" to processes when shares/quotas are being applied?


The allocation of cpus to processes/threads (tasks, as the kernel sees them),
or the other way round, is called load balancing, and is done across
scheduling domains[3].

cpu shares use CFS "group" scheduling[1] to apply the share to all the
tasks (threads) in the container. The container's cpu shares value maps
directly to a group weight in CFS, which is then divided across the tasks in
the group (i.e. a default container share of 1024 with 2 threads in the
container/group results in each thread/task having a weight of 512[4]).
These are the same weight values used by nice[2].
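
As a rough sketch of that arithmetic (the 1024 default and the even split
across the group's runnable tasks are the conventions described above, not
the exact kernel calculation):

// Sketch: approximate per-task CFS weight inside a container/group.
// Assumes the 1024-per-default-share convention and an even split
// across the group's runnable tasks.
public class GroupWeight {

    static long perTaskWeight(long containerShares, int runnableTasks) {
        return containerShares / Math.max(1, runnableTasks);
    }

    public static void main(String[] args) {
        // 1024 shares and 2 runnable threads -> ~512 each, which matches
        // the se->load.weight of 512 in the /proc/sched_debug output in [4].
        System.out.println(perTaskWeight(1024, 2));
    }
}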

You can observe the task weight and other scheduler numbers in
/proc/sched_debug [4]. You can also trace scheduler activity in the kernel,
which typically tells you the tasks involved, the cpu, and the event
(switch, wakeup, etc.).
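
For example, a throwaway sketch like the following pulls out the container's
cfs_rq lines (the field names match the sched_debug output in [4], though the
file's exact format varies between kernel versions):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Sketch: filter /proc/sched_debug for a container's scheduler entries.
// Pass part of the container's cgroup scope name as the first argument;
// ".scope" is only a default that happens to match docker/systemd scopes.
public class SchedDebug {
    public static void main(String[] args) throws IOException {
        String filter = args.length > 0 ? args[0] : ".scope";
        try (Stream<String> lines = Files.lines(Paths.get("/proc/sched_debug"))) {
            lines.filter(l -> l.contains(filter)
                           || l.contains("load.weight")
                           || l.contains("nr_running"))
                 .forEach(System.out::println);
        }
    }
}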


> For example in a 12 cpu system if I have a 50% share do I get all 12 CPUs
> for 50% of a "quantum" each, or do I get 6 CPUs for a full quantum each?
>

You get 12 cpus for 50% of the time on average, if there is another
workload with the same weight as you that is consuming as much as it can.
If there's nothing else running on the machine, you get 12 cpus for 100% of
the time with a cpu-shares-only config (i.e. the burst capacity).
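
Putting numbers on the 12-cpu example (plain proportional-share arithmetic,
assuming both groups are fully runnable on every cpu):

// Sketch: expected cpu time under CFS shares when two groups are both busy.
public class ShareMath {
    public static void main(String[] args) {
        int cpus = 12;
        long myShares = 512;     // a "50% share" relative to...
        long otherShares = 512;  // ...a competing group with equal weight
        double fraction = (double) myShares / (myShares + otherShares);
        // Scheduled on all 12 cpus, but only ~50% of the time on each,
        // i.e. about 6 cpus' worth of time in total.
        System.out.printf("%d cpus for %.0f%% of the time (~%.1f cpus worth)%n",
                cpus, fraction * 100, cpus * fraction);
    }
}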

I validated that the share was balanced over all the cpus by running Linux
perf events and checking that there were cpu samples on all cpus. There are
bound to be other ways of doing it too.


>
> When we try to use the "number of processors" to control the number of
> threads created, or the number of partitions in a task, then we really want
> to know how many CPUs we can actually be concurrently running on!
>

Makes sense to check. Hopefully there aren't any major errors or omissions
in the above.
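
For reference, a rough sketch of how the cgroup values could be turned into a
processor count (the /sys/fs/cgroup/cpu paths are the common docker mount
point, and the ceil()/min() policy here is just one possible choice, not
necessarily what the webrev does):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: derive a processor count from cgroup v1 cpu controller files.
// Paths assume the usual /sys/fs/cgroup/cpu mount inside the container;
// the rounding and min() policy are illustrative, not the HotSpot code.
public class ContainerCpus {
    private static final String CPU_CG = "/sys/fs/cgroup/cpu";

    static long readLong(String file) throws IOException {
        return Long.parseLong(Files.readAllLines(Paths.get(CPU_CG, file)).get(0).trim());
    }

    public static void main(String[] args) throws IOException {
        // Reflects the affinity mask, as in the numactl example quoted below.
        int affinity = Runtime.getRuntime().availableProcessors();
        long quota  = readLong("cpu.cfs_quota_us");   // -1 means no quota set
        long period = readLong("cpu.cfs_period_us");  // typically 100000
        long shares = readLong("cpu.shares");         // 1024 == one cpu by convention

        int fromQuota  = quota > 0 ? (int) Math.ceil((double) quota / period) : affinity;
        int fromShares = (int) Math.ceil((double) shares / 1024);

        int active = Math.max(1, Math.min(affinity, Math.min(fromQuota, fromShares)));
        System.out.println("active processors: " + active);
    }
}

Whether shares should feed into that count at all is of course the judgement
call being discussed further down the thread.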
Thanks,
Alex

[1] https://lwn.net/Articles/240474/
[2] https://github.com/torvalds/linux/blob/368f89984bb971b9f8b69eeb85ab19a89f985809/kernel/sched/core.c#L6735
[3] https://lwn.net/Articles/80911/ / http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf

[4]

cfs_rq[13]:/system.slice/docker-f5681788d6daab249c90810fe60da429a2565b901ff34245922a578635b5d607.scope
  .exec_clock                    : 0.000000
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 8090.087297
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : -124692718.052832
  .nr_spread_over                : 0
  .nr_running                    : 1
  .load                          : 1024
  .runnable_load_avg             : 1023
  .blocked_load_avg              : 0
  .tg_load_avg                   : 2046
  .tg_load_contrib               : 1023
  .tg_runnable_contrib           : 1023
  .tg->runnable_avg              : 2036
  .tg->cfs_bandwidth.timer_active: 0
  .throttled                     : 0
  .throttle_count                : 0
  .se->exec_start                : 236081964.515645
  .se->vruntime                  : 24403993.326934
  .se->sum_exec_runtime          : 8091.135873
  .se->load.weight               : 512
  .se->avg.runnable_avg_sum      : 45979
  .se->avg.runnable_avg_period   : 45979
  .se->avg.load_avg_contrib      : 511
  .se->avg.decay_count           : 0


>
> Thanks,
> David
>
>
> On 5/10/2017 6:01 AM, Alex Bagehot wrote:
>
>> Hi,
>>
>> On Wed, Oct 4, 2017 at 7:51 PM, Bob Vandette <bob.vandette at oracle.com>
>> wrote:
>>
>>
>>> On Oct 4, 2017, at 2:30 PM, Robbin Ehn <robbin.ehn at oracle.com> wrote:
>>>>
>>>> Thanks Bob for looking into this.
>>>>
>>>> On 10/04/2017 08:14 PM, Bob Vandette wrote:
>>>>
>>>>> Robbin,
>>>>> I’ve looked into this issue and you are correct.  I do have to examine
>>>>> both the sched_getaffinity results as well as the cgroup cpu subsystem
>>>>> configuration files in order to provide a reasonable value for
>>>>> active_processors.  If I was only interested in cpusets, I could simply
>>>>> rely on the getaffinity call but I also want to factor in shares and
>>>>> quotas as well.
>>>>>
>>>>
>>>> We had a quick discussion at the office; we actually think that you
>>>> could skip reading the shares and quotas.
>>>> It really depends on what the user expects: if he gives us 4 cpus at
>>>> 50% or 2 full cpus, what does he expect the difference to be?
>>>> One could argue that he 'knows' he will only use max 50% and thus
>>>> we can act as if he is giving us 4 full cpus.
>>>> But I'll leave that up to you, just a thought we had.
>>>>
>>>
>>> It’s my opinion that we should do something if someone makes the effort to
>>> configure their containers to use quotas or shares.  There are many
>>> different opinions on what the right “something” is.
>>>
>>>
>> It might be interesting to look at some real instances of how java might[3]
>> be deployed in containers. Marathon/Mesos[1] and Kubernetes[2] use shares
>> and quotas so this is a vast chunk of deployments that need both of them
>> today.
>>
>>
>>
>>> Many developers that are trying to deploy apps that use containers say
>>> they don’t like
>>> cpusets.  This is too limiting for them especially when the server
>>> configurations vary
>>> within their organization.
>>>
>>>
>> True, however Kubernetes has an alpha feature[5] where it allocates cpusets
>> to containers that request a whole number of cpus. Previously, without
>> cpusets, any container could run on any cpu, which we know might not be good
>> for some workloads that want isolation. A request for a fractional or
>> burstable amount of cpu would be allocated from a shared cpu pool. So
>> although manual allocation of cpusets will be flaky[3], automation should
>> be able to make it work.
>>
>>
>>
>>>  From everything I’ve read including source code, there seems to be a
>>> consensus that
>>> shares and quotas are being used as a way to specify a fraction of a
>>> system (number of cpus).
>>>
>>>
>> A refinement[6] on this is:
>> Shares can be used for guaranteed cpu - you will always get your share.
>> Quota[4] is a limit/constraint - you can never get more than the quota.
>> So, given the limit (described below) on how many shares will be allocated
>> on a host, you can have burstable (or overcommit) capacity if your shares
>> are less than your quota.
>>
>>
>>
>>> Docker added --cpus, which is implemented using quotas and periods.  They
>>> adjust these two parameters to provide a way of calculating the number of
>>> cpus that will be available to a process (quota/period).  Amazon also
>>> documents that cpu shares are defined to be a multiple of 1024, where 1024
>>> represents a single cpu and a share value of N*1024 represents N cpus.
>>>
>>>
>> Kubernetes and Mesos/Marathon also use the N*1024 shares per host to
>> allocate resources automatically.
>>
>> Hopefully this provides some background on what a couple of orchestration
>> systems that will be running java are doing currently in this area.
>> Thanks,
>> Alex
>>
>>
>> [1] https://github.com/apache/mesos/commit/346cc8dd528a28a6e1f1cbdb4c95b8bdea2f6070
>> / (now out of date but appears to be a reasonable intro:
>> https://zcox.wordpress.com/2014/09/17/cpu-resources-in-docker-mesos-and-marathon/ )
>> [1a] https://youtu.be/hJyAfC-Z2xk?t=2439
>>
>> [2] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
>>
>> [3] https://youtu.be/w1rZOY5gbvk?t=2479
>>
>> [4] https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
>> https://landley.net/kdocs/ols/2010/ols2010-pages-245-254.pdf
>> https://lwn.net/Articles/428175/
>>
>> [5] https://github.com/kubernetes/community/blob/43ce57ac476b9f2ce3f0220354a075e095a0d469/contributors/design-proposals/node/cpu-manager.md
>> / https://github.com/kubernetes/kubernetes/commit/00f0e0f6504ad8dd85fcbbd6294cd7cf2475fc72
>> / https://vimeo.com/226858314
>>
>> [6] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
>>
>>
>>> Of course these are just conventions.  This is why I provided a way of
>>> specifying the number of CPUs so folks deploying Java services can be
>>> certain they get what they want.
>>>
>>> Bob.
>>>
>>>
>>>>> I had assumed that when sched_setaffinity was called (in your case by
>>>>> numactl) that the cgroup cpu config files would be updated to reflect the
>>>>> current processor affinity for the running process. This is not correct.
>>>>> I have updated my changeset and have successfully run with your examples
>>>>> below.  I’ll post a new webrev soon.
>>>>>
>>>>
>>>> I see, thanks again!
>>>>
>>>> /Robbin
>>>>
>>>>> Thanks,
>>>>> Bob.
>>>>>
>>>>>>
>>>>>>> I still want to include the flag for at least one Java release in the
>>>>>>> event that the new behavior causes some regression in behavior.  I’m
>>>>>>> trying to make the detection robust so that it will fall back to the
>>>>>>> current behavior in the event that cgroups is not configured as
>>>>>>> expected, but I’d like to have a way of forcing the issue.  JDK 10 is
>>>>>>> not supposed to be a long term support release which makes it a good
>>>>>>> target for this new behavior.
>>>>>>>
>>>>>>> I agree with David that once we commit to cgroups, we should extract
>>>>>>> all VM configuration data from that source.  There’s more information
>>>>>>> available for cpusets than just processor affinity that we might want
>>>>>>> to consider when calculating the number of processors to assume for the
>>>>>>> VM.  There’s exclusivity and effective cpu data available in addition
>>>>>>> to the cpuset string.
>>>>>>>
>>>>>>
>>>>>> cgroup only contains limits, not the real hard limits.
>>>>>> You must consider the affinity mask. Those of us that have numa nodes do:
>>>>>>
>>>>>> [rehn at rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java
>>>>>> -Xlog:os=debug -cp . ForEver | grep proc
>>>>>> [0.001s][debug][os] Initial active processor count set to 16
>>>>>> [rehn at rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java
>>>>>> -Xlog:os=debug -XX:+UseContainerSupport -cp . ForEver | grep proc
>>>>>> [0.001s][debug][os] Initial active processor count set to 32
>>>>>>
>>>>>> when benchmarking all the time, and that must be set to 16 otherwise
>>>>>> the flag is really bad for us.
>>>>>> So the flag actually breaks the little numa support we have now.
>>>>>>
>>>>>> Thanks, Robbin
>>>>>>
>>>>>
>>>
>>>

