RFR: 8146115 - Improve docker container detection and resource configuration usage

David Holmes david.holmes at oracle.com
Fri Oct 6 23:28:14 UTC 2017


On 7/10/2017 1:34 AM, Bob Vandette wrote:
> 
>> On Oct 5, 2017, at 6:12 PM, David Holmes <David.Holmes at oracle.com> wrote:
>>
>> Hi Bob,
>>
>> On 6/10/2017 3:57 AM, Bob Vandette wrote:
>>>> On Oct 5, 2017, at 12:43 PM, Alex Bagehot <ceeaspb at gmail.com> wrote:
>>>>
>>>> Hi David,
>>>>
>>>> On Wed, Oct 4, 2017 at 10:51 PM, David Holmes <david.holmes at oracle.com> wrote:
>>>>
>>>>     Hi Alex,
>>>>
>>>>     Can you tell me how shares/quotas are actually implemented in
>>>>     terms of allocating "cpus" to processes when shares/quotas are
>>>>     being applied?
>>>>
>>>> The allocation of cpus to processes/threads(tasks as the kernel sees them) or the other way round is called balancing, which is done by Scheduling domains[3].
>>>>
>>>> cpu shares use CFS "group" scheduling[1] to apply the share to all the tasks (threads) in the container. The container's cpu-shares weight maps directly to a task's weight in CFS which, since the task is part of a group, is divided by the number of tasks in the group (i.e. a default container share of 1024 with 2 threads in the container/group would result in each thread/task having a 512 weight[4]). These are the same weight values used by nice[2].
>>>>
>>>> You can observe the task weight and other scheduler numbers in /proc/sched_debug [4]. You can also kernel trace scheduler activity which typically tells you the tasks involved, the cpu, the event: switch or wakeup, etc.
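The group-weight division described above can be sketched as a toy calculation (illustrative Java, not kernel code; the class and method names are made up):

```java
// Toy model of the CFS group-scheduling weight split described above:
// a container's cpu-shares value acts as the group weight and is divided
// evenly among the runnable tasks in the group. Illustrative only.
public class SharesWeight {
    static long perTaskWeight(long containerShares, int runnableTasks) {
        return containerShares / runnableTasks;
    }

    public static void main(String[] args) {
        // Default share of 1024 with 2 threads -> 512 per task, matching
        // the ".se->load.weight : 512" line in the /proc/sched_debug dump.
        System.out.println(perTaskWeight(1024, 2)); // 512
    }
}
```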
>>>>
>>>>     For example in a 12 cpu system if I have a 50% share do I get all
>>>>     12 CPUs for 50% of a "quantum" each, or do I get 6 CPUs for a full
>>>>     quantum each?
>>>>
>>>>
>>>> You get 12 cpus for 50% of the time on the average if there is another workload that has the same weight as you and is consuming as much as it can.
>>>> If there's nothing else running on the machine you get 12 cpus for 100% of the time with a cpu shares only config (ie. the burst capacity).
>>>>
>>>> I validated that the share was balanced over all the cpus by running linux perf events and checking that there were cpu samples on all cpus. There's bound to be other ways of doing it also.
>>>>
>>>>
>>>>     When we try to use the "number of processors" to control the
>>>>     number of threads created, or the number of partitions in a task,
>>>>     then we really want to know how many CPUs we can actually be
>>>>     concurrently running on!
>>> I’m not sure that’s the primary question for serverless container execution.  Just because you might happen to burst and have available
>>> to you more CPU time than you specified in your shares doesn’t mean
>>> that a multi-threaded application running in one of these containers should configure itself to use all available host processors.  This would result in over-burdening the system at times of high load.
>>
>> And conversely if you restrict yourself to the "share" of processors you get over time (ie 6 instead of 12) then you can severely impact the performance (response time in particular) of the VM and the application running on the VM.
> 
> So if someone configures an 88 way system to use 1/88 share, you don’t think they expect a highly threaded
> application to run slower than if they didn’t restrict the shares??   The whole idea about shares is to SHARE the
> system.  Yes, you’d have better performance when the system is idle and only running a single application but that’s
> not what these container frameworks are trying to accomplish.  They want to get the best performance when running many
> many processes.  That’s what I’m optimizing for.

In what I described you are SHARING the system. You're also getting the 
most benefit from a lightly loaded system.

To me the conceptual model for a 1/88 share of an 88-way system is that 
you get 88 processors that appear to run at 1/88 the speed of the 
physical ones. Not that you get 1 real full speed processor.

>>
>> But I don't see how this can overburden the system. If your app is running alone you get to use all 12 cpus for 100% of the time and life is good. If another app starts up then your 100% drops proportionately. If you schedule 12 apps all with a 1/12 share then everyone gets up to 12 cpus for 1/12 of the time. It's only if you try to schedule a set of apps with a total utilization greater than 1 that the system becomes overloaded.
> 
> In my above example, If we run the VM ergonomics based on 88 CPUs, then we are wasting a lot of memory on thread stacks and when
> many of these processes are running,  the system will context switch a lot more than it would if we restricted the creation of threads to
> the share amount.

Context switching is a function of threads and time. My way uses more 
threads and less time (per unit of work); yours uses fewer threads and 
more time. Seems like zero-sum to me.

Memory use is a different matter, but only because you can restrict 
memory independently of cpus. So you will need to ensure your memory 
quotas can accommodate the number of threads you expect to run - regardless.

David
-----

> Bob.
> 
> 
>>
>>> The Java runtime, at startup, configures several subsystems to use a number of threads based on the number of available
>>> processors.  These subsystems include things like GC
>>> threads, the JIT compiler and thread pools.
>>
>>> The problem I am trying to solve is to come up with a single number
>>> of CPUs based on container knowledge that can be used for the Java
>>> runtime subsystem to configure itself.  I believe that we should
>>> trust the implementor of the Mesos or Kubernetes setup and honor their wishes when coming up with this number and not just use the
>>> processor affinity or number of cpus in the cpuset.
>>
>> I don't agree, as has been discussed before. It's perfectly fine, even desirable, in my opinion to have 12 threads executing concurrently for 50% of the time, rather than only 6 threads for 100% (assuming the scheduling technology is even clever enough to realize it can grant your threads 100%).
>>
>> Over time the amount of work your app can execute is the same, but the time taken for an individual subtask can vary. If you are just doing one-shot batch processing then it makes no difference. If you're running an app that itself services incoming requests then the response time to individual requests can be impacted. To take the worst-case scenario, imagine you get 12 concurrent requests that would each take 1/12 of your cpu quota. With 12 threads on 12 cpus you can service all 12 requests with a response time of 1/12 time units. But with 6 threads on 6 cpus you can only service 6 requests with a 1/12 response time, and the other 6 will have a 1/6 response time.
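The worst-case arithmetic in that scenario can be made explicit with a small sketch (illustrative Java; it models requests as fixed-size jobs run in waves across the available cpus, ignoring scheduler details):

```java
// Models the response-time example above: n concurrent requests, each
// needing `work` time units of cpu, run in waves over `cpus` processors.
// The last request finishes after ceil(n / cpus) waves. Illustrative only.
public class ResponseTime {
    static double lastResponseTime(int cpus, int requests, double work) {
        int waves = (int) Math.ceil((double) requests / cpus);
        return waves * work;
    }

    public static void main(String[] args) {
        double work = 1.0 / 12;  // each request needs 1/12 of the quota
        System.out.println(lastResponseTime(12, 12, work)); // 1 wave: 1/12
        System.out.println(lastResponseTime(6, 12, work));  // 2 waves: 1/6
    }
}
```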
>>
>>> The challenge is determining the right algorithm that doesn’t penalize the VM.
>>
>> Agreed. But I think the current algorithm may penalize the VM, and more importantly the application it is running.
>>
>>> My current implementation does this:
>>> total available logical processors = min (cpusets,sched_getaffinity,shares/1024, quota/period)
>>> All fractional units are rounded up to the next whole number.
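For concreteness, the described computation might look like the following (a sketch only, not the actual HotSpot code; parameter names are invented, and a value <= 0 here means "limit not configured"):

```java
// Sketch of the described algorithm: available processors is the minimum
// of the cpuset size, the sched_getaffinity count, shares/1024 and
// quota/period, with fractional results rounded up to a whole number.
public class ContainerCpus {
    static int availableProcessors(int cpusetCpus, int affinityCpus,
                                   long shares, long quotaUs, long periodUs) {
        int limit = Math.min(cpusetCpus, affinityCpus);
        if (shares > 0) {                  // shares convention: 1024 per cpu
            limit = Math.min(limit, (int) Math.ceil(shares / 1024.0));
        }
        if (quotaUs > 0 && periodUs > 0) { // CFS bandwidth limit
            limit = Math.min(limit, (int) Math.ceil((double) quotaUs / periodUs));
        }
        return Math.max(limit, 1);         // never report zero processors
    }

    public static void main(String[] args) {
        // quota 150000us over a 100000us period = 1.5 cpus, rounded up to 2
        System.out.println(availableProcessors(8, 8, -1, 150000, 100000)); // 2
        // a 1024 share (1/88) on an 88-way box collapses to 1
        System.out.println(availableProcessors(88, 88, 1024, -1, -1));     // 1
    }
}
```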
>>
>> My point has always been that I just don't think producing a single number from all these factors is the right/best way to deal with this. I think we really want to be able to answer the question "how many processors can I concurrently execute on" distinct from the question of "how much of a time slice will I get on each of those processors". To me "how many" is the question that "availableProcessors" should be answering - and only that question. How much "share" do I get is a different question, and perhaps one that the VM and the application need to be able to ask.
>>
>> BTW sched_getaffinity should already account for cpusets ??
>>
>> Cheers,
>> David
>>
>>> Bob.
>>>>
>>>> Makes sense to check. Hopefully there aren't any major errors or omissions in the above.
>>>> Thanks,
>>>> Alex
>>>>
>>>> [1] https://lwn.net/Articles/240474/
>>>> [2] https://github.com/torvalds/linux/blob/368f89984bb971b9f8b69eeb85ab19a89f985809/kernel/sched/core.c#L6735
>>>> [3] https://lwn.net/Articles/80911/ / http://www.i3s.unice.fr/~jplozi/wastedcores/files/extended_talk.pdf
>>>>
>>>> [4]
>>>>
>>>> cfs_rq[13]:/system.slice/docker-f5681788d6daab249c90810fe60da429a2565b901ff34245922a578635b5d607.scope
>>>>   .exec_clock: 0.000000
>>>>   .MIN_vruntime: 0.000001
>>>>   .min_vruntime: 8090.087297
>>>>   .max_vruntime: 0.000001
>>>>   .spread: 0.000000
>>>>   .spread0: -124692718.052832
>>>>   .nr_spread_over: 0
>>>>   .nr_running: 1
>>>>   .load: 1024
>>>>   .runnable_load_avg: 1023
>>>>   .blocked_load_avg: 0
>>>>   .tg_load_avg: 2046
>>>>   .tg_load_contrib: 1023
>>>>   .tg_runnable_contrib: 1023
>>>>   .tg->runnable_avg: 2036
>>>>   .tg->cfs_bandwidth.timer_active: 0
>>>>   .throttled: 0
>>>>   .throttle_count: 0
>>>>   .se->exec_start: 236081964.515645
>>>>   .se->vruntime: 24403993.326934
>>>>   .se->sum_exec_runtime: 8091.135873
>>>>   .se->load.weight: 512
>>>>   .se->avg.runnable_avg_sum: 45979
>>>>   .se->avg.runnable_avg_period: 45979
>>>>   .se->avg.load_avg_contrib: 511
>>>>   .se->avg.decay_count: 0
>>>>
>>>>
>>>>     Thanks,
>>>>     David
>>>>
>>>>
>>>>     On 5/10/2017 6:01 AM, Alex Bagehot wrote:
>>>>
>>>>         Hi,
>>>>
>>>>         On Wed, Oct 4, 2017 at 7:51 PM, Bob Vandette
>>>>         <bob.vandette at oracle.com> wrote:
>>>>
>>>>
>>>>                 On Oct 4, 2017, at 2:30 PM, Robbin Ehn
>>>>                 <robbin.ehn at oracle.com> wrote:
>>>>
>>>>                 Thanks Bob for looking into this.
>>>>
>>>>                 On 10/04/2017 08:14 PM, Bob Vandette wrote:
>>>>
>>>>                     Robbin,
>>>>                     I’ve looked into this issue and you are correct.  I
>>>>                     do have to examine both the sched_getaffinity
>>>>                     results as well as the cgroup cpu subsystem
>>>>                     configuration files in order to provide a reasonable
>>>>                     value for active_processors.  If I was only
>>>>                     interested in cpusets, I could simply rely on the
>>>>                     getaffinity call, but I also want to factor in
>>>>                     shares and quotas as well.
>>>>
>>>>
>>>>                 We had a quick discussion at the office; we actually do
>>>>                 think that you could skip reading the shares and quotas.
>>>>
>>>>                 It really depends on what the user expects: if he gives
>>>>                 us 4 cpus at 50%, or 2 full cpus, what does he expect
>>>>                 the difference would be?
>>>>
>>>>                 One could argue that he 'knows' that he will only use
>>>>                 max 50% and thus we can act as if he is giving us 4 full
>>>>                 cpus.
>>>>
>>>>                 But I'll leave that up to you, just a thought we had.
>>>>
>>>>
>>>>             It’s my opinion that we should do something if someone makes
>>>>             the effort to configure their containers to use quotas or
>>>>             shares.  There are many different opinions on what the right
>>>>             “something” is.
>>>>
>>>>
>>>>         It might be interesting to look at some real instances of how
>>>>         java might[3]
>>>>         be deployed in containers.
>>>>         Marathon/Mesos[1] and Kubernetes[2] use shares and quotas so
>>>>         this is a vast
>>>>         chunk of deployments that need both of them today.
>>>>
>>>>
>>>>
>>>>             Many developers that are trying to deploy apps that use
>>>>             containers say
>>>>             they don’t like
>>>>             cpusets.  This is too limiting for them especially when
>>>>             the server
>>>>             configurations vary
>>>>             within their organization.
>>>>
>>>>
>>>>         True, however Kubernetes has an alpha feature[5] where it
>>>>         allocates cpusets
>>>>         to containers that request a whole number of cpus. Previously
>>>>         without
>>>>         cpusets any container could run on any cpu which we know might
>>>>         not be good
>>>>         for some workloads that want isolation. A request for a
>>>>         fractional or
>>>>         burstable amount of cpu would be allocated from a shared cpu
>>>>         pool. So
>>>>         although manual allocation of cpusets will be flakey[3] ,
>>>>         automation should
>>>>         be able to make it work.
>>>>
>>>>
>>>>
>>>>              From everything I’ve read including source code, there
>>>>             seems to be a
>>>>             consensus that
>>>>             shares and quotas are being used as a way to specify a
>>>>             fraction of a
>>>>             system (number of cpus).
>>>>
>>>>
>>>>         A refinement[6] on this is:
>>>>         Shares can be used for guaranteed cpu - you will always get
>>>>         your share.
>>>>         Quota[4] is a limit/constraint - you can never get more than
>>>>         the quota.
>>>>         So given the below limit of how many shares will be allocated
>>>>         on a host you
>>>>         can have burstable(or overcommit) capacity if your shares are
>>>>         less than
>>>>         your quota.
>>>>
>>>>
>>>>
>>>>             Docker added —cpus which is implemented using quotas and
>>>>             periods.  They
>>>>             adjust these
>>>>             two parameters to provide a way of calculating the number
>>>>             of cpus that
>>>>             will be available
>>>>             to a process (quota/period).  Amazon also documents that
>>>>             cpu shares are
>>>>             defined to be a multiple of 1024.
>>>>             Where 1024 represents a single cpu and a share value of
>>>>             N*1024 represents
>>>>             N cpus.
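The two conventions described above (docker's --cpus implemented as quota/period, and 1024 shares per cpu) reduce to simple arithmetic. A sketch, assuming docker's documented default CFS period of 100000 microseconds:

```java
// Converts the documented conventions to raw cgroup values:
// --cpus=N is written as quota = N * period (default period 100000us),
// and a share of N cpus is N * 1024. Illustrative arithmetic only.
public class DockerCpuConventions {
    static final long DEFAULT_PERIOD_US = 100_000;

    static long quotaForCpus(double cpus) {
        return Math.round(cpus * DEFAULT_PERIOD_US);
    }

    static long sharesForCpus(int cpus) {
        return cpus * 1024L;
    }

    public static void main(String[] args) {
        System.out.println(quotaForCpus(1.5)); // --cpus=1.5 -> quota 150000
        System.out.println(sharesForCpus(2));  // 2 cpus' worth -> 2048 shares
    }
}
```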
>>>>
>>>>
>>>>         Kubernetes and Mesos/Marathon also use the N*1024 shares per
>>>>         host to
>>>>         allocate resources automatically.
>>>>
>>>>         Hopefully this provides some background on what a couple of
>>>>         orchestration
>>>>         systems that will be running java are doing currently in this
>>>>         area.
>>>>         Thanks,
>>>>         Alex
>>>>
>>>>
>>>>         [1] https://github.com/apache/mesos/commit/346cc8dd528a28a6e1f1cbdb4c95b8bdea2f6070
>>>>         (now out of date but appears to be a reasonable intro:
>>>>         https://zcox.wordpress.com/2014/09/17/cpu-resources-in-docker-mesos-and-marathon/ )
>>>>         [1a] https://youtu.be/hJyAfC-Z2xk?t=2439
>>>>
>>>>         [2] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
>>>>
>>>>         [3] https://youtu.be/w1rZOY5gbvk?t=2479
>>>>
>>>>         [4] https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
>>>>         https://landley.net/kdocs/ols/2010/ols2010-pages-245-254.pdf
>>>>         https://lwn.net/Articles/428175/
>>>>
>>>>         [5] https://github.com/kubernetes/community/blob/43ce57ac476b9f2ce3f0220354a075e095a0d469/contributors/design-proposals/node/cpu-manager.md
>>>>         / https://github.com/kubernetes/kubernetes/commit/00f0e0f6504ad8dd85fcbbd6294cd7cf2475fc72
>>>>         / https://vimeo.com/226858314
>>>>
>>>>         [6] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
>>>>
>>>>
>>>>             Of course these are just conventions.  This is why I
>>>>             provided a way of
>>>>             specifying the
>>>>             number of CPUs so folks deploying Java services can be
>>>>             certain they get
>>>>             what they want.
>>>>
>>>>             Bob.
>>>>
>>>>
>>>>                     I had assumed that when sched_setaffinity was called
>>>>                     (in your case by numactl) that the cgroup cpu config
>>>>                     files would be updated to reflect the current
>>>>                     processor affinity for the running process.  This is
>>>>                     not correct.  I have updated my changeset and have
>>>>                     successfully run with your examples below.  I’ll
>>>>                     post a new webrev soon.
>>>>
>>>>
>>>>                 I see, thanks again!
>>>>
>>>>                 /Robbin
>>>>
>>>>                     Thanks,
>>>>                     Bob.
>>>>
>>>>
>>>>                             I still want to include the flag for at
>>>>                             least one Java release in the event that the
>>>>                             new behavior causes some regression in
>>>>                             behavior.  I’m trying to make the detection
>>>>                             robust so that it will fall back to the
>>>>                             current behavior in the event that cgroups
>>>>                             is not configured as expected, but I’d like
>>>>                             to have a way of forcing the issue.  JDK 10
>>>>                             is not supposed to be a long term support
>>>>                             release, which makes it a good target for
>>>>                             this new behavior.
>>>>
>>>>                             I agree with David that once we commit to
>>>>                             cgroups, we should extract all VM
>>>>                             configuration data from that source.
>>>>                             There’s more information available for
>>>>                             cpusets than just processor affinity that we
>>>>                             might want to consider when calculating the
>>>>                             number of processors to assume for the VM.
>>>>                             There’s exclusivity and effective cpu data
>>>>                             available in addition to the cpuset string.
>>>>
>>>>
>>>>                         cgroup only contains the limits, not the real
>>>>                         hard limits.
>>>>                         You must consider the affinity mask. Those of us
>>>>                         that have numa nodes do:
>>>>
>>>>                         [rehn at rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java -Xlog:os=debug -cp . ForEver | grep proc
>>>>                         [0.001s][debug][os] Initial active processor count set to 16
>>>>                         [rehn at rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java -Xlog:os=debug -XX:+UseContainerSupport -cp . ForEver | grep proc
>>>>                         [0.001s][debug][os] Initial active processor count set to 32
>>>>
>>>>                         We benchmark all the time, and that count must
>>>>                         be 16; otherwise the flag is really bad for us.
>>>>
>>>>                         So the flag actually breaks the little numa
>>>>                         support we have now.
>>>>
>>>>                         Thanks, Robbin
>>>>
>>>>
>>>>
>>>>
> 

