RFR: 8146115 - Improve docker container detection and resource configuration usage

David Holmes david.holmes at oracle.com
Thu Oct 5 22:12:30 UTC 2017


Hi Bob,

On 6/10/2017 3:57 AM, Bob Vandette wrote:
> 
>> On Oct 5, 2017, at 12:43 PM, Alex Bagehot <ceeaspb at gmail.com> wrote:
>>
>> Hi David,
>>
>> On Wed, Oct 4, 2017 at 10:51 PM, David Holmes <david.holmes at oracle.com> wrote:
>>
>>     Hi Alex,
>>
>>     Can you tell me how shares/quotas are actually implemented in
>>     terms of allocating "cpus" to processes when shares/quotas are
>>     being applied? 
>>
>>
>> The allocation of cpus to processes/threads (tasks, as the kernel sees 
>> them), or the other way round, is called balancing, which is done by 
>> scheduling domains[3].
>>
>> cpu shares use CFS "group" scheduling[1] to apply the share to all the 
>> tasks (threads) in the container. The container's cpu shares value maps 
>> directly to a task's weight in CFS, which, given it is part of a group, 
>> is divided by the number of tasks in the group (i.e. a default 
>> container share of 1024 with 2 threads in the container/group would 
>> result in each thread/task having a weight of 512 [4]). These are the 
>> same values used by nice[2].
>>
>> You can observe the task weight and other scheduler numbers in 
>> /proc/sched_debug [4]. You can also trace scheduler activity in the 
>> kernel, which typically tells you the tasks involved, the cpu, and the 
>> event (switch, wakeup, etc.).
>>
>>     For example in a 12 cpu system if I have a 50% share do I get all
>>     12 CPUs for 50% of a "quantum" each, or do I get 6 CPUs for a full
>>     quantum each?
>>
>>
>> You get 12 cpus for 50% of the time on average if there is another 
>> workload that has the same weight as you and is consuming as much as 
>> it can.
>> If there's nothing else running on the machine you get 12 cpus for 
>> 100% of the time with a cpu-shares-only config (i.e. the burst capacity).
>>
>> I validated that the share was balanced over all the cpus by running 
>> Linux perf events and checking that there were cpu samples on all 
>> cpus. There are bound to be other ways of doing it too.
>>
>>
>>     When we try to use the "number of processors" to control the
>>     number of threads created, or the number of partitions in a task,
>>     then we really want to know how many CPUs we can actually be
>>     concurrently running on!
> 
> I’m not sure that's the primary question for serverless container 
> execution.  Just because you might happen to burst and have available
> to you more CPU time than you specified in your shares doesn’t mean
> that a multi-threaded application running in one of these containers 
> should configure itself to use all available host processors.  This 
> would result in overburdening the system at times of high load.

And conversely, if you restrict yourself to the "share" of processors you 
get over time (i.e. 6 instead of 12) then you can severely impact the 
performance (response time in particular) of the VM and the application 
running on it.

But I don't see how this can overburden the system. If your app is 
running alone you get to use all 12 cpus for 100% of the time and life 
is good. If another app starts up then your 100% drops proportionately. 
If you schedule 12 apps all with a 1/12 share then everyone gets up to 
12 cpus for 1/12 of the time. It's only if you try to schedule a set of 
apps with a total utilization greater than 1 that the system becomes 
overloaded.

> The Java runtime, at startup, configures several subsystems to use a 
> number of threads based on the number of available processors.  These
> subsystems include things like the GC threads, the JIT compiler threads
> and thread pools.
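
As an aside, for anyone following along, here is the kind of sizing that
hangs off that one number. This is only a rough sketch of my own - the GC
thread formula merely approximates the HotSpot heuristic and is not the
actual VM code:

import java.util.concurrent.ForkJoinPool;

public class CpuSizing {
    public static void main(String[] args) {
        // The single number everything keys off.
        int ncpus = Runtime.getRuntime().availableProcessors();

        // java.util.concurrent sizes its common pool from the same number.
        int commonPool = ForkJoinPool.getCommonPoolParallelism();

        // Approximation of the HotSpot parallel GC worker heuristic:
        // all cpus up to 8, then roughly 5/8 of the cpus beyond that.
        int gcThreads = ncpus <= 8 ? ncpus : 8 + (ncpus - 8) * 5 / 8;

        System.out.printf("availableProcessors=%d commonPool=%d ~gcThreads=%d%n",
                          ncpus, commonPool, gcThreads);
    }
}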

> The problem I am trying to solve is to come up with a single number
> of CPUs based on container knowledge that can be used for the Java
> runtime subsystem to configure itself.  I believe that we should
> trust the implementor of the Mesos or Kubernetes setup and honor 
> their wishes when coming up with this number and not just use the
> processor affinity or number of cpus in the cpuset.

I don't agree, as has been discussed before. It's perfectly fine, even 
desirable, in my opinion to have 12 threads executing concurrently for 
50% of the time, rather than only 6 threads for 100% (assuming the 
scheduling technology is even clever enough to realize it can grant your 
threads 100%).

Over time the amount of work your app can execute is the same, but the 
time taken for an individual subtask can vary. If you are just doing 
one-shot batch processing then it makes no difference. If you're running 
an app that itself services incoming requests then the response time to 
individual requests can be impacted. To take the worst-case scenario, 
imagine you get 12 concurrent requests that would each take 1/12 of your 
cpu quota. With 12 threads on 12 cpus you can service all 12 requests 
with a response time of 1/12 time units. But with 6 threads on 6 cpus 
you can only service 6 requests with a 1/12 response time, and the other 
6 will have a 1/6 response time.
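
Spelled out as a toy calculation (my own illustration; it ignores
scheduling overhead and assumes each request needs a full cpu for 1/12 of
a time unit):

public class ResponseTimes {
    public static void main(String[] args) {
        double service = 1.0 / 12;         // cpu time each request needs
        int requests = 12;
        for (int workers : new int[] {12, 6}) {
            System.out.println(workers + " threads on " + workers + " cpus:");
            for (int i = 0; i < requests; i++) {
                // Requests are served in batches of 'workers'; a request in
                // batch b completes after (b + 1) service times.
                double completion = (i / workers + 1) * service;
                System.out.printf("  request %2d completes at %.3f%n", i, completion);
            }
        }
    }
}

Same total work either way, but half the requests see double the response
time in the 6-cpu case.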

> The challenge is determining the right algorithm that doesn’t penalize 
> the VM.

Agreed. But I think the current algorithm may penalize the VM, and more 
importantly the application it is running.

> My current implementation does this:
> 
> total available logical processors =
>     min(cpusets, sched_getaffinity, shares/1024, quota/period)
> 
> All fractional units are rounded up to the next whole number.
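
Just so we're looking at the same thing, here is my reading of that
calculation as a rough sketch. It assumes cgroup v1 files under
/sys/fs/cgroup/cpu and uses availableProcessors() as a stand-in for the
cpuset/affinity part; it is an illustration only, not the proposed
hotspot code:

import java.nio.file.Files;
import java.nio.file.Paths;

public class ContainerCpus {
    public static void main(String[] args) {
        // Stand-in for min(cpusets, sched_getaffinity): the current JDK
        // already folds the affinity mask into this value.
        int limit = Runtime.getRuntime().availableProcessors();

        long shares = readLong("/sys/fs/cgroup/cpu/cpu.shares");
        long quota  = readLong("/sys/fs/cgroup/cpu/cpu.cfs_quota_us");
        long period = readLong("/sys/fs/cgroup/cpu/cpu.cfs_period_us");

        if (shares > 0) {
            // Convention discussed in this thread: 1024 shares == one cpu.
            limit = Math.min(limit, (int) Math.ceil(shares / 1024.0));
        }
        if (quota > 0 && period > 0) {
            // A quota of -1 means "no bandwidth limit".
            limit = Math.min(limit, (int) Math.ceil((double) quota / period));
        }
        System.out.println("container cpu limit = " + limit);
    }

    private static long readLong(String path) {
        try {
            return Long.parseLong(Files.readAllLines(Paths.get(path)).get(0).trim());
        } catch (Exception e) {
            return -1;   // file missing or not running in a container
        }
    }
}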

My point has always been that I just don't think producing a single 
number from all these factors is the right/best way to deal with this. I 
think we really want to be able to answer the question "how many 
processors can I concurrently execute on" distinct from the question of 
"how much of a time slice will I get on each of those processors". To me 
"how many" is the question that "availableProcessors" should be 
answering - and only that question. How much "share" do I get is a 
different question, and perhaps one that the VM and the application need 
to be able to ask.
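
To make that distinction concrete, this is roughly the shape I have in
mind - hypothetical names, a sketch rather than a proposal for an actual
API:

// Hypothetical sketch only: keep the two questions separate instead of
// collapsing them into one number.
class CpuView {
    // "How many processors can I concurrently execute on?"
    // In my view this is all availableProcessors should answer
    // (cpusets / affinity only).
    static int concurrentProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    // "How much of a time slice do I get on them?"
    // Expressed as cpus-worth of time per period; a negative quota
    // means no bandwidth limit.
    static double cpuTimeBudget(long quotaUs, long periodUs) {
        return quotaUs < 0 ? Double.POSITIVE_INFINITY
                           : (double) quotaUs / periodUs;
    }

    public static void main(String[] args) {
        System.out.println("concurrent processors = " + concurrentProcessors());
        System.out.println("budget for 50ms quota / 100ms period = "
                + cpuTimeBudget(50_000, 100_000));
    }
}

A GC or thread pool could then size its worker count from the first
number and, if it cares, throttle or batch its work based on the second.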

BTW sched_getaffinity should already account for cpusets ??

Cheers,
David

> Bob.
> 
>>
>> Makes sense to check. Hopefully there aren't any major errors or 
>> omissions in the above.
>> Thanks,
>> Alex
>>
>> [1] https://lwn.net/Articles/240474/
>> [2] https://github.com/torvalds/linux/blob/368f89984bb971b9f8b69eeb85ab19a89f985809/kernel/sched/core.c#L6735
>> [3] https://lwn.net/Articles/80911/ /
>>     http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf
>>
>> [4]
>>
>> cfs_rq[13]:/system.slice/docker-f5681788d6daab249c90810fe60da429a2565b901ff34245922a578635b5d607.scope
>>   .exec_clock                     : 0.000000
>>   .MIN_vruntime                   : 0.000001
>>   .min_vruntime                   : 8090.087297
>>   .max_vruntime                   : 0.000001
>>   .spread                         : 0.000000
>>   .spread0                        : -124692718.052832
>>   .nr_spread_over                 : 0
>>   .nr_running                     : 1
>>   .load                           : 1024
>>   .runnable_load_avg              : 1023
>>   .blocked_load_avg               : 0
>>   .tg_load_avg                    : 2046
>>   .tg_load_contrib                : 1023
>>   .tg_runnable_contrib            : 1023
>>   .tg->runnable_avg               : 2036
>>   .tg->cfs_bandwidth.timer_active : 0
>>   .throttled                      : 0
>>   .throttle_count                 : 0
>>   .se->exec_start                 : 236081964.515645
>>   .se->vruntime                   : 24403993.326934
>>   .se->sum_exec_runtime           : 8091.135873
>>   .se->load.weight                : 512
>>   .se->avg.runnable_avg_sum       : 45979
>>   .se->avg.runnable_avg_period    : 45979
>>   .se->avg.load_avg_contrib       : 511
>>   .se->avg.decay_count            : 0
>>
>>
>>     Thanks,
>>     David
>>
>>
>>     On 5/10/2017 6:01 AM, Alex Bagehot wrote:
>>
>>         Hi,
>>
>>         On Wed, Oct 4, 2017 at 7:51 PM, Bob Vandette
>>         <bob.vandette at oracle.com> wrote:
>>
>>
>>                 On Oct 4, 2017, at 2:30 PM, Robbin Ehn
>>                 <robbin.ehn at oracle.com> wrote:
>>
>>                 Thanks Bob for looking into this.
>>
>>                 On 10/04/2017 08:14 PM, Bob Vandette wrote:
>>
>>                     Robbin,
>>                     I’ve looked into this issue and you are correct.
>>                     I do have to examine both the sched_getaffinity
>>                     results as well as the cgroup cpu subsystem
>>                     configuration files in order to provide a
>>                     reasonable value for active_processors.  If I was
>>                     only interested in cpusets, I could simply rely
>>                     on the getaffinity call, but I also want to
>>                     factor in shares and quotas as well.
>>
>>
>>                 We had a quick discussion at the office; we actually
>>                 do think that you could skip reading the shares and
>>                 quotas.
>>
>>                 It really depends on what the user expects: if he
>>                 gives us 4 cpus at 50% or 2 full cpus, what does he
>>                 expect the difference to be?
>>
>>                 One could argue that he 'knows' that he will only use
>>                 max 50% and thus we can act as if he is giving us 4
>>                 full cpus.
>>
>>                 But I'll leave that up to you, just a thought we had.
>>
>>
>>             It’s my opinion that we should do something if someone
>>             makes the effort to configure their containers to use
>>             quotas or shares.  There are many different opinions on
>>             what the right “something” is.
>>
>>
>>         It might be interesting to look at some real instances of how
>>         java might[3]
>>         be deployed in containers.
>>         Marathon/Mesos[1] and Kubernetes[2] use shares and quotas so
>>         this is a vast
>>         chunk of deployments that need both of them today.
>>
>>
>>
>>             Many developers that are trying to deploy apps that use
>>             containers say
>>             they don’t like
>>             cpusets.  This is too limiting for them especially when
>>             the server
>>             configurations vary
>>             within their organization.
>>
>>
>>         True, however Kubernetes has an alpha feature[5] where it
>>         allocates cpusets
>>         to containers that request a whole number of cpus. Previously
>>         without
>>         cpusets any container could run on any cpu which we know might
>>         not be good
>>         for some workloads that want isolation. A request for a
>>         fractional or
>>         burstable amount of cpu would be allocated from a shared cpu
>>         pool. So
>>         although manual allocation of cpusets will be flakey[3] ,
>>         automation should
>>         be able to make it work.
>>
>>
>>
>>              From everything I’ve read including source code, there
>>             seems to be a
>>             consensus that
>>             shares and quotas are being used as a way to specify a
>>             fraction of a
>>             system (number of cpus).
>>
>>
>>         A refinement[6] on this is:
>>         Shares can be used for guaranteed cpu - you will always get
>>         your share.
>>         Quota[4] is a limit/constraint - you can never get more than
>>         the quota.
>>         So given the below limit of how many shares will be allocated
>>         on a host you
>>         can have burstable(or overcommit) capacity if your shares are
>>         less than
>>         your quota.
>>
>>
>>
>>             Docker added --cpus, which is implemented using quotas and
>>             periods.  They adjust these two parameters to provide a
>>             way of calculating the number of cpus that will be
>>             available to a process (quota/period).  Amazon also
>>             documents that cpu shares are defined to be a multiple of
>>             1024, where 1024 represents a single cpu and a share value
>>             of N*1024 represents N cpus.
>>
>>
>>         Kubernetes and Mesos/Marathon also use the N*1024 shares per
>>         host to
>>         allocate resources automatically.
>>
>>         Hopefully this provides some background on what a couple of
>>         orchestration
>>         systems that will be running java are doing currently in this
>>         area.
>>         Thanks,
>>         Alex
>>
>>
>>         [1] https://github.com/apache/mesos/commit/346cc8dd528a28a6e1f1cbdb4c95b8bdea2f6070
>>         / (now out of date but appears to be a reasonable intro:
>>         https://zcox.wordpress.com/2014/09/17/cpu-resources-in-docker-mesos-and-marathon/ )
>>         [1a] https://youtu.be/hJyAfC-Z2xk?t=2439
>>
>>         [2] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
>>
>>         [3] https://youtu.be/w1rZOY5gbvk?t=2479
>>
>>         [4]
>>         https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
>>         https://landley.net/kdocs/ols/2010/ols2010-pages-245-254.pdf
>>         https://lwn.net/Articles/428175/
>>
>>         [5]
>>         https://github.com/kubernetes/community/blob/43ce57ac476b9f2ce3f0220354a075e095a0d469/contributors/design-proposals/node/cpu-manager.md
>>         / https://github.com/kubernetes/kubernetes/commit/00f0e0f6504ad8dd85fcbbd6294cd7cf2475fc72
>>         / https://vimeo.com/226858314
>>
>>
>>         [6] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
>>
>>
>>             Of course these are just conventions.  This is why I
>>             provided a way of
>>             specifying the
>>             number of CPUs so folks deploying Java services can be
>>             certain they get
>>             what they want.
>>
>>             Bob.
>>
>>
>>                     I had assumed that when sched_setaffinity was
>>                     called (in your case by numactl) that the cgroup
>>                     cpu config files would be updated to reflect the
>>                     current processor affinity for the running
>>                     process. This is not correct.  I have updated my
>>                     changeset and have successfully run with your
>>                     examples below.  I’ll post a new webrev soon.
>>
>>
>>                 I see, thanks again!
>>
>>                 /Robbin
>>
>>                     Thanks,
>>                     Bob.
>>
>>
>>                             I still want to include the flag for at
>>                             least one Java release in the event that
>>                             the new behavior causes some regression
>>                             in behavior.  I’m trying to make the
>>                             detection robust so that it will fall
>>                             back to the current behavior in the event
>>                             that cgroups is not configured as
>>                             expected, but I’d like to have a way of
>>                             forcing the issue.  JDK 10 is not
>>                             supposed to be a long term support
>>                             release, which makes it a good target for
>>                             this new behavior.
>>
>>                             I agree with David that once we commit to
>>                             cgroups, we should extract all VM
>>                             configuration data from that source.
>>                             There’s more information available for
>>                             cpusets than just processor affinity that
>>                             we might want to consider when
>>                             calculating the number of processors to
>>                             assume for the VM.  There’s exclusivity
>>                             and effective cpu data available in
>>                             addition to the cpuset string.
>>
>>
>>                         cgroup only contains the limits, not the real
>>                         hard limits.  You must consider the affinity
>>                         mask.  Those of us with numa nodes do:
>>
>>                         [rehn at rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java -Xlog:os=debug -cp . ForEver | grep proc
>>                         [0.001s][debug][os] Initial active processor count set to 16
>>                         [rehn at rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java -Xlog:os=debug -XX:+UseContainerSupport -cp . ForEver | grep proc
>>                         [0.001s][debug][os] Initial active processor count set to 32
>>
>>                         We do this all the time when benchmarking,
>>                         and that count must be set to 16, otherwise
>>                         the flag is really bad for us.
>>
>>                         So the flag actually breaks the little numa
>>                         support we have now.
>>
>>                         Thanks, Robbin
>>
>>
>>
>>
> 

