RFR: 8146115 - Improve docker container detection and resource configuration usage
Alex Bagehot
ceeaspb at gmail.com
Wed Oct 4 20:01:09 UTC 2017
Hi,
On Wed, Oct 4, 2017 at 7:51 PM, Bob Vandette <bob.vandette at oracle.com>
wrote:
>
> > On Oct 4, 2017, at 2:30 PM, Robbin Ehn <robbin.ehn at oracle.com> wrote:
> >
> > Thanks Bob for looking into this.
> >
> > On 10/04/2017 08:14 PM, Bob Vandette wrote:
> >> Robbin,
> >> I’ve looked into this issue and you are correct. I do have to examine
> >> both the sched_getaffinity results and the cgroup cpu subsystem
> >> configuration files in order to provide a reasonable value for
> >> active_processors. If I were only interested in cpusets, I could simply
> >> rely on the getaffinity call, but I also want to factor in shares and
> >> quotas as well.
> >
> > We had a quick discussion at the office; we actually do think that you
> > could skip reading the shares and quotas.
> > It really depends on what the user expects: if he gives us 4 CPUs at 50%
> > or 2 full CPUs, what does he expect the difference to be?
> > One could argue that he 'knows' that he will only use at most 50%, and
> > thus we can act as if he is giving us 4 full CPUs.
> > But I'll leave that up to you, just a thought we had.
>
> It’s my opinion that we should do something if someone makes the effort to
> configure their containers to use quotas or shares. There are many
> different opinions on what that right “something” is.
>
It might be interesting to look at some real instances of how Java might[3]
be deployed in containers.
Marathon/Mesos[1] and Kubernetes[2] use shares and quotas, so this is a vast
chunk of deployments that needs both of them today.
>
> Many developers that are trying to deploy apps that use containers say
> they don’t like cpusets. This is too limiting for them, especially when
> the server configurations vary within their organization.
>
True, however Kubernetes has an alpha feature[5] where it allocates cpusets
to containers that request a whole number of cpus. Previously, without
cpusets, any container could run on any cpu, which we know might not be good
for some workloads that want isolation. A request for a fractional or
burstable amount of cpu would be allocated from a shared cpu pool. So
although manual allocation of cpusets will be flaky[3], automation should
be able to make it work.
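For reference, the cpuset a container ends up with is just a list such as
"0-3,8" in cpuset.cpus, so counting the cpus it grants is straightforward.
A minimal sketch (only an illustration, not the changeset under review; it
assumes the cgroup v1 cpuset controller is mounted under /sys/fs/cgroup,
which varies by distro, and the class name is made up):

import java.nio.file.Files;
import java.nio.file.Paths;

public class CpusetCount {
    // Count cpus in a cpuset list such as "0-3,8" (cgroup v1 format).
    static int countCpus(String list) {
        int count = 0;
        for (String part : list.trim().split(",")) {
            part = part.trim();
            if (part.isEmpty()) continue;
            String[] range = part.split("-");
            count += range.length == 2
                    ? Integer.parseInt(range[1]) - Integer.parseInt(range[0]) + 1
                    : 1;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Path assumed for a cgroup v1 cpuset controller.
        String cpus = Files.readAllLines(
                Paths.get("/sys/fs/cgroup/cpuset/cpuset.cpus")).get(0);
        System.out.println(cpus + " -> " + countCpus(cpus) + " cpus");
    }
}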
>
> From everything I’ve read, including source code, there seems to be a
> consensus that shares and quotas are being used as a way to specify a
> fraction of a system (number of cpus).
>
A refinement[6] on this is:
Shares can be used for guaranteed cpu - you will always get your share.
Quota[4] is a limit/constraint - you can never get more than the quota.
So, given a limit on how many shares will be allocated on a host (see
below), you can have burstable (or overcommit) capacity if your shares are
less than your quota.
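To make that split concrete with made-up numbers (only a sketch, not the
proposed hotspot code), assuming the 1024-shares-per-cpu convention and the
CFS quota/period semantics from [4], a container given 512 shares and a
200000us quota over a 100000us period would look like this:

public class SharesVsQuota {
    public static void main(String[] args) {
        // Hypothetical container settings, for illustration only.
        long shares = 512, quotaUs = 200_000, periodUs = 100_000;

        // Shares act as a guaranteed floor under contention: 512/1024 = 0.5 cpu.
        double guaranteed = shares / 1024.0;

        // Quota acts as a hard ceiling even on an idle host: 200000/100000 = 2 cpus.
        double ceiling = (double) quotaUs / periodUs;

        System.out.printf("guaranteed=%.1f cpus, ceiling=%.1f cpus%n",
                          guaranteed, ceiling);
    }
}

Everything between the 0.5-cpu floor and the 2-cpu ceiling is the burstable
(overcommit) range mentioned above.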
>
> Docker added --cpus, which is implemented using quotas and periods. They
> adjust these two parameters to provide a way of calculating the number of
> cpus that will be available to a process (quota/period). Amazon also
> documents that cpu shares are defined to be a multiple of 1024, where 1024
> represents a single cpu and a share value of N*1024 represents N cpus.
>
Kubernetes and Mesos/Marathon also use the N*1024 shares convention to
allocate resources on a host automatically.
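To illustrate how those two conventions could be combined from userspace
(again only a rough sketch, not the changeset under review; it assumes a
cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu, which varies by
distro, and the class name is made up):

import java.nio.file.Files;
import java.nio.file.Paths;

public class ContainerCpus {
    // Read a single integer from a cgroup v1 file, falling back if it is
    // missing (not in a container, different mount point, cgroup v2, ...).
    private static long readLong(String file, long fallback) {
        try {
            return Long.parseLong(Files.readAllLines(
                    Paths.get("/sys/fs/cgroup/cpu", file)).get(0).trim());
        } catch (Exception e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        int online  = Runtime.getRuntime().availableProcessors();
        long quota  = readLong("cpu.cfs_quota_us", -1);   // -1 means "no limit"
        long period = readLong("cpu.cfs_period_us", 100_000);
        long shares = readLong("cpu.shares", 1024);

        // Docker's --cpus is expressed as quota/period.
        double quotaCpus = quota > 0 ? (double) quota / period : online;
        // Orchestrators treat 1024 shares as one cpu's worth of weight.
        double shareCpus = shares / 1024.0;

        System.out.printf("online=%d quotaCpus=%.2f shareCpus=%.2f%n",
                          online, quotaCpus, shareCpus);
    }
}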
Hopefully this provides some background on what a couple of the
orchestration systems that will be running Java are currently doing in this
area.
Thanks,
Alex
[1] https://github.com/apache/mesos/commit/346cc8dd528a28a6e1f1cbdb4c95b8bdea2f6070
    (now out of date but appears to be a reasonable intro:
    https://zcox.wordpress.com/2014/09/17/cpu-resources-in-docker-mesos-and-marathon/)
[1a] https://youtu.be/hJyAfC-Z2xk?t=2439
[2] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
[3] https://youtu.be/w1rZOY5gbvk?t=2479
[4] https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
    https://landley.net/kdocs/ols/2010/ols2010-pages-245-254.pdf
    https://lwn.net/Articles/428175/
[5] https://github.com/kubernetes/community/blob/43ce57ac476b9f2ce3f0220354a075e095a0d469/contributors/design-proposals/node/cpu-manager.md
    https://github.com/kubernetes/kubernetes/commit/00f0e0f6504ad8dd85fcbbd6294cd7cf2475fc72
    https://vimeo.com/226858314
[6] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run
> Of course these are just conventions. This is why I provided a way of
> specifying the number of CPUs, so folks deploying Java services can be
> certain they get what they want.
>
> Bob.
>
> >
> >> I had assumed that when sched_setaffinity was called (in your case by
> >> numactl), the cgroup cpu config files would be updated to reflect the
> >> current processor affinity of the running process. This is not correct.
> >> I have updated my changeset and have successfully run with your examples
> >> below. I’ll post a new webrev soon.
> >
> > I see, thanks again!
> >
> > /Robbin
> >
> >> Thanks,
> >> Bob.
> >>>
> >>>> I still want to include the flag for at least one Java release in the
> >>>> event that the new behavior causes some regression. I’m trying to make
> >>>> the detection robust so that it will fall back to the current behavior
> >>>> in the event that cgroups is not configured as expected, but I’d like
> >>>> to have a way of forcing the issue. JDK 10 is not supposed to be a
> >>>> long-term support release, which makes it a good target for this new
> >>>> behavior.
> >>>> I agree with David that once we commit to cgroups, we should extract
> >>>> all VM configuration data from that source. There’s more information
> >>>> available for cpusets than just processor affinity that we might want
> >>>> to consider when calculating the number of processors to assume for
> >>>> the VM. There’s exclusivity and effective cpu data available in
> >>>> addition to the cpuset string.
> >>>
> >>> The cgroup files only contain the configured limits, not the real hard
> >>> limits.
> >>> You must consider the affinity mask. Those of us with NUMA nodes do:
> >>>
> >>> [rehn at rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java
> >>>   -Xlog:os=debug -cp . ForEver | grep proc
> >>> [0.001s][debug][os] Initial active processor count set to 16
> >>> [rehn at rehn-ws dev]$ numactl --cpunodebind=1 --membind=1 java
> >>>   -Xlog:os=debug -XX:+UseContainerSupport -cp . ForEver | grep proc
> >>> [0.001s][debug][os] Initial active processor count set to 32
> >>>
> >>> when benchmarking, all the time, and that count must be set to 16;
> >>> otherwise the flag is really bad for us.
> >>> So the flag actually breaks the little NUMA support we have now.
> >>>
> >>> Thanks, Robbin
>
>