RFR: 8146115 - Improve docker container detection and resource configuration usage

David Holmes david.holmes at oracle.com
Wed Oct 4 21:51:47 UTC 2017


Hi Alex,

Can you tell me how shares/quotas are actually implemented in terms of 
allocating "cpus" to processes when shares/quotas are being applied? For 
example, in a 12-CPU system, if I have a 50% share do I get all 12 CPUs 
for 50% of a "quantum" each, or do I get 6 CPUs for a full quantum each?

When we try to use the "number of processors" to control the number of 
threads created, or the number of partitions in a task, then we really 
want to know how many CPUs we can actually be concurrently running on!

Thanks,
David
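
For illustration of why that reported count matters, here is a minimal Java
sketch (not from this thread; the class name and the numbers in the comments
are hypothetical) of the common pattern of sizing a worker pool from the
JVM-reported processor count. If the JVM reports the host's CPUs rather than
the CPUs the container can actually run on concurrently, the pool below ends
up oversized and its threads spend time throttled rather than running.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSizing {
        public static void main(String[] args) {
            // Frameworks typically use this count to size thread pools,
            // fork/join parallelism, GC threads, and task partitioning.
            int cpus = Runtime.getRuntime().availableProcessors();

            // If the JVM sees the host's 12 CPUs while the container is only
            // allowed the equivalent of 2, this pool has far too many threads.
            ExecutorService pool = Executors.newFixedThreadPool(cpus);
            System.out.println("Sized worker pool to " + cpus + " threads");
            pool.shutdown();
        }
    }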

On 5/10/2017 6:01 AM, Alex Bagehot wrote:
> Hi,
> 
> On Wed, Oct 4, 2017 at 7:51 PM, Bob Vandette <bob.vandette at oracle.com>
> wrote:
> 
>>
>>> On Oct 4, 2017, at 2:30 PM, Robbin Ehn <robbin.ehn at oracle.com> wrote:
>>>
>>> Thanks Bob for looking into this.
>>>
>>> On 10/04/2017 08:14 PM, Bob Vandette wrote:
>>>> Robbin,
>>>> I've looked into this issue and you are correct.  I have to examine both
>>>> the sched_getaffinity results and the cgroup cpu subsystem configuration
>>>> files in order to provide a reasonable value for active_processors.  If I
>>>> were only interested in cpusets, I could simply rely on the getaffinity
>>>> call, but I also want to factor in shares and quotas.
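
(As an aside, a rough Java sketch of combining those two sources; this is an
illustration only, not the code in the webrev. It assumes the cgroup v1 cpu
controller is mounted at /sys/fs/cgroup/cpu, which real code would have to
discover from /proc/self/mountinfo and /proc/self/cgroup.)

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class EffectiveCpus {
        // Assumed cgroup v1 mount point for the cpu controller.
        private static final Path CPU_CG = Paths.get("/sys/fs/cgroup/cpu");

        public static void main(String[] args) {
            // On Linux this already reflects the sched_getaffinity mask.
            int limit = Runtime.getRuntime().availableProcessors();

            long quota  = readLong(CPU_CG.resolve("cpu.cfs_quota_us"), -1);
            long period = readLong(CPU_CG.resolve("cpu.cfs_period_us"), 100_000);
            long shares = readLong(CPU_CG.resolve("cpu.shares"), -1);

            if (quota > 0 && period > 0) {
                // Round up so a quota of 1.5 periods still allows 2 CPUs.
                limit = Math.min(limit, (int) Math.ceil((double) quota / period));
            }
            if (shares > 0) {
                // The "1024 shares == one CPU" convention discussed below.
                limit = Math.min(limit, Math.max(1, (int) (shares / 1024)));
            }
            System.out.println("active_processors candidate: " + limit);
        }

        private static long readLong(Path p, long fallback) {
            try {
                return Long.parseLong(Files.readAllLines(p).get(0).trim());
            } catch (Exception e) {
                return fallback; // not in a container, or controller not mounted here
            }
        }
    }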
>>>
>>> We had a quick discussion at the office; we actually think you could skip
>>> reading the shares and quotas.
>>> It really depends on what the user expects: if he gives us 4 CPUs at 50%
>>> or 2 full CPUs, what does he expect the difference to be?
>>> One could argue that he 'knows' he will only use at most 50%, and thus we
>>> can act as if he is giving us 4 full CPUs.
>>> But I'll leave that up to you; just a thought we had.
>>
>> It’s my opinion that we should do something if someone makes the effort to
>> configure their
>> containers to use quotas or shares.  There are many different opinions on
>> what the right "something" is.
>>
> 
> It might be interesting to look at some real instances of how java might[3]
> be deployed in containers.
> Marathon/Mesos[1] and Kubernetes[2] use shares and quotas so this is a vast
> chunk of deployments that need both of them today.
> 
> 
>>
>> Many developers who are trying to deploy apps in containers say they don't
>> like cpusets.  This is too limiting for them, especially when server
>> configurations vary within their organization.
>>
> 
> True; however, Kubernetes has an alpha feature[5] that allocates cpusets to
> containers that request a whole number of cpus. Previously, without cpusets,
> any container could run on any cpu, which we know might not be good for some
> workloads that want isolation. A request for a fractional or burstable
> amount of cpu would be allocated from a shared cpu pool. So although manual
> allocation of cpusets will be flaky[3], automation should be able to make it
> work.
> 
> 
>>
>> From everything I've read, including source code, there seems to be a
>> consensus that shares and quotas are used as a way to specify a fraction of
>> a system (a number of cpus).
>>
> 
> A refinement[6] on this is:
> Shares can be used for guaranteed cpu - you will always get your share.
> Quota[4] is a limit/constraint - you can never get more than the quota.
> So, given the limit on how many shares will be allocated on a host, you can
> have burstable (or overcommit) capacity if your shares are less than your
> quota.
> 
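
(A small illustrative sketch of that request/limit split, using the Kubernetes
mapping described in [6]; the 500m request and 2000m limit are made-up example
values, and the arithmetic is my reading of that document rather than anything
authoritative.)

    public class K8sCpuMapping {
        public static void main(String[] args) {
            int requestMilliCpu = 500;    // guaranteed share ("requests.cpu")
            int limitMilliCpu   = 2000;   // hard cap ("limits.cpu")

            // Requests map to cpu.shares, limits map to the CFS quota with the
            // default 100 ms period, so this container is guaranteed roughly
            // half a CPU but can burst up to 2 CPUs while the host has slack.
            long shares   = requestMilliCpu * 1024L / 1000;   // 512
            long periodUs = 100_000;
            long quotaUs  = limitMilliCpu * periodUs / 1000;  // 200000

            System.out.println("cpu.shares=" + shares
                    + " cfs_quota_us=" + quotaUs
                    + " cfs_period_us=" + periodUs);
        }
    }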
> 
>>
>> Docker added --cpus, which is implemented using quotas and periods.  They
>> adjust these two parameters to provide a way of calculating the number of
>> cpus that will be available to a process (quota/period).  Amazon also
>> documents that cpu shares are defined to be a multiple of 1024, where 1024
>> represents a single cpu and a share value of N*1024 represents N cpus.
>>
> 
> Kubernetes and Mesos/Marathon also use the N*1024 shares per host to
> allocate resources automatically.
> 
> Hopefully this provides some background on what a couple of orchestration
> systems that will be running java are doing currently in this area.
> Thanks,
> Alex
> 
> 
> [1] https://github.com/apache/mesos/commit/346cc8dd528a28a6e1f1cbdb4c95b8bdea2f6070
> (now out of date but appears to be a reasonable intro:
> https://zcox.wordpress.com/2014/09/17/cpu-resources-in-docker-mesos-and-marathon/)
> [1a] https://youtu.be/hJyAfC-Z2xk?t=2439
> 
> [2] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
> 
> [3] https://youtu.be/w1rZOY5gbvk?t=2479
> 
> [4] https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
> https://landley.net/kdocs/ols/2010/ols2010-pages-245-254.pdf
> https://lwn.net/Articles/428175/
> 
> [5] https://github.com/kubernetes/community/blob/43ce57ac476b9f2ce3f0220354a075e095a0d469/contributors/design-proposals/node/cpu-manager.md
> / https://github.com/kubernetes/kubernetes/commit/00f0e0f6504ad8dd85fcbbd6294cd7cf2475fc72
> / https://vimeo.com/226858314
> 
> [6] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
> 
> 
>> Of course these are just conventions.  This is why I provided a way of
>> specifying the
>> number of CPUs so folks deploying Java services can be certain they get
>> what they want.
>>
>> Bob.
>>
>>>
>>>> I had assumed that when sched_setaffinity was called (in your case by
>>>> numactl), the cgroup cpu config files would be updated to reflect the
>>>> current processor affinity of the running process.  This is not correct.
>>>> I have updated my changeset and have successfully run with your examples
>>>> below.  I'll post a new webrev soon.
>>>
>>> I see, thanks again!
>>>
>>> /Robbin
>>>
>>>> Thanks,
>>>> Bob.
>>>>>
>>>>>> I still want to include the flag for at least one Java release in the
>>>>>> event that the new behavior causes some regression.  I'm trying to make
>>>>>> the detection robust so that it will fall back to the current behavior
>>>>>> if cgroups is not configured as expected, but I'd like to have a way of
>>>>>> forcing the issue.  JDK 10 is not supposed to be a long-term support
>>>>>> release, which makes it a good target for this new behavior.
>>>>>> I agree with David that once we commit to cgroups, we should extract
>>>>>> all VM configuration data from that source.  There's more information
>>>>>> available for cpusets than just processor affinity that we might want
>>>>>> to consider when calculating the number of processors to assume for the
>>>>>> VM.  There's exclusivity and effective cpu data available in addition
>>>>>> to the cpuset string.
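
(For example, the cpuset controller exposes the effective set as a range-list
string; here is a small hedged sketch of counting CPUs from such a string.
The "0-7,16-23" value is made up, and the cgroup v1 path in the comment is an
assumption about where the file is mounted.)

    public class CpusetCount {
        // Counts CPUs in a cpuset list string such as the contents of
        // /sys/fs/cgroup/cpuset/cpuset.effective_cpus (cgroup v1 layout assumed).
        static int countCpus(String cpuList) {
            int count = 0;
            for (String part : cpuList.trim().split(",")) {
                if (part.isEmpty()) continue;
                String[] range = part.split("-");
                if (range.length == 2) {
                    count += Integer.parseInt(range[1]) - Integer.parseInt(range[0]) + 1;
                } else {
                    count += 1;
                }
            }
            return count;
        }

        public static void main(String[] args) {
            System.out.println(countCpus("0-7,16-23"));   // prints 16
        }
    }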
>>>>>
>>>>> The cgroup only contains configured limits, not the real hard limits.
>>>>> You must also consider the affinity mask.  Those of us with NUMA nodes do:
>>>>>
>>>>> [rehn@rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java -Xlog:os=debug -cp . ForEver | grep proc
>>>>> [0.001s][debug][os] Initial active processor count set to 16
>>>>> [rehn@rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java -Xlog:os=debug -XX:+UseContainerSupport -cp . ForEver | grep proc
>>>>> [0.001s][debug][os] Initial active processor count set to 32
>>>>>
>>>>> We do this all the time when benchmarking, and that count must be 16;
>>>>> otherwise the flag is really bad for us.
>>>>> So the flag actually breaks the little NUMA support we have now.
>>>>>
>>>>> Thanks, Robbin
>>
>>

