[RFC containers] 8281181 JDK's interpretation of CPU Shares causes underutilization
David Holmes
david.holmes at oracle.com
Thu Feb 3 09:19:10 UTC 2022
Hi Ioi,
For the benefit of the mailing list discussion ...
On 3/02/2022 5:30 pm, Ioi Lam wrote:
> Please see the bug report [1] for detailed description and test cases.
>
> I'd like to have some discussion before we can decide what to do.
>
> I discovered this issue when analyzing JDK-8279484 [2]. Under Kubernetes
> (minikube), Runtime.availableProcessors() returns 1, despite the fact
> that the machine has 32 CPUs, the Kubernetes node has a single
> deployment, and no CPU limits were set.
>
> Specifically, I want to understand why the JDK is using
> CgroupSubsystem::cpu_shares() to limit the number of CPUs used by the
> Java process.
Because we were asked to by customers deploying in containers.
> In cgroup, there are other ways that are designed specifically for
> limiting the number of CPUs, i.e., CgroupSubsystem::cpu_quota(). Why is
> using cpu_quota() alone not enough? Why did we choose the current
> approach of considering both cpu_quota() and cpu_shares()?
Because people were using both (whether that made sense or not) and so
we needed a policy on what to do if both were set.
> My guess is that sometimes people don't limit the actual number of CPUs
> per container, but instead use CPU Shares to set the relative scheduling
> priority between containers.
>
> I.e., they run "docker run --cpu-shares=1234" without using the "--cpus"
> flag.
>
> If this is indeed the reason, I can understand the (good) intention, but
> the solution seems awfully insufficient.
>
> CPU Shares is a *relative* number. How much CPU is allocated to you
> depends on
>
> - how many other processes are actively running
> - what their CPU Shares are
>
> The above information can change dynamically, as other processes may be
> added or removed, and they can change between active and idle states.
>
> However, the JVM treats CPU Shares as an *absolute/static* number, and
> sets the CPU quota of the current process using this very simplistic
> formula.
From old discussion and the code I believe the thought was that a share
was relative to the per-CPU default of 1024 shares. So we use that to
determine the fraction of each CPU that should be assigned, and we
should then use that to determine the available number of CPUs. But that
isn't what we actually do - we only calculate the fraction and round it
up to get the number of CPUs, and that is wrong (and typically gives
only 1 CPU because shares < 1024). I speculate that what was intended
was to map from having an X% share of each CPU to instead having access
to X% of the total CPUs (at 100% of each). Mathematically this has some
basis, but it actually makes no practical sense from a throughput or
response-time perspective. If I'm allowed 50% of the CPU per time period
to do my calculations, I want 100% of each CPU for half of the period,
as that potentially minimises the elapsed time till I have a result.
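The two interpretations can be sketched roughly like this. This is a
minimal illustration only, not the actual HotSpot code; the class and
method names are hypothetical, and the real code also special-cases
shares == 1024 as "no limit" (as the table below shows):

```java
// Minimal sketch of the two interpretations discussed above.
// NOT the actual HotSpot code; names and structure are illustrative.
public class SharesSketch {
    // cgroup convention: 1024 shares corresponds to one full CPU
    static final int PER_CPU_SHARES = 1024;

    // What the JDK effectively does today: take shares/1024 as a
    // CPU count directly (rounded up), ignoring the host CPU count.
    static int currentBehaviour(int shares) {
        return Math.max(1, (int) Math.ceil((double) shares / PER_CPU_SHARES));
    }

    // What was (I speculate) intended: treat shares/1024 as a fraction
    // of each CPU, then scale by the total number of host CPUs.
    static int speculatedIntent(int shares, int hostCpus) {
        double fraction = (double) shares / PER_CPU_SHARES;
        int cpus = (int) Math.ceil(fraction * hostCpus);
        return Math.max(1, Math.min(hostCpus, cpus));
    }

    public static void main(String[] args) {
        // shares=512 on a 32-CPU host: today -> 1 CPU, speculated -> 16
        System.out.println(currentBehaviour(512));      // 1
        System.out.println(speculatedIntent(512, 32));  // 16
    }
}
```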
> Value of /sys/fs/cgroup/cpu.shares -> cpu quota:
>
> 1023 -> 1 CPU
> 1024 -> no limit (huh??)
> 2048 -> 2 CPUs
> 4096 -> 4 CPUs
>
> This seems just wrong to me. There's no way you can get a "correct"
> result without knowing anything about other processes that are running
> at the same time.
As I said above and in the bug report I think this was an error and the
intent was to then multiply by the number of actual processors.
> The net effect is when Java is running under a container, more likely
> that not, the JVM will limit itself to a single CPU. This seems really
> inefficient to me.
Yes.
> What should we do?
We could just adjust the calculation as I suggested.
Or, given that a share (aka weight) is meaningless without knowing the
total weight in the system, we could just ignore it. The app then gets
access to all CPUs, and it is up to the container to track actual usage
and impose any limits configured.
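Unlike shares, the quota/period pair does give an absolute limit without
knowing anything about other processes. A rough sketch of a quota-only
calculation, under the assumption that shares are ignored entirely
(class, method, and parameter names are hypothetical, not JDK API):

```java
// Hedged sketch: derive an effective CPU count from the cgroup
// CPU quota and period alone, ignoring shares entirely.
// By cgroup convention, a quota of -1 means "no limit configured".
public class QuotaSketch {
    static int effectiveCpus(long quotaUs, long periodUs, int hostCpus) {
        if (quotaUs < 0) {
            return hostCpus; // no quota: all host CPUs are available
        }
        int cpus = (int) Math.ceil((double) quotaUs / periodUs);
        return Math.max(1, Math.min(hostCpus, cpus));
    }

    public static void main(String[] args) {
        // "docker run --cpus=1.5" sets quota=150000, period=100000
        System.out.println(effectiveCpus(150000, 100000, 32)); // 2
        System.out.println(effectiveCpus(-1, 100000, 32));     // 32
    }
}
```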
I've always thought that these cgroups mechanisms were fundamentally
flawed and that if the intent was to define a resource limited
environment, then the environment should report what resources were
available by the normal APIs. They got this right with cpu-sets by
integrating with sched_getaffinity; but for shares and quotas it has
been left to the applications to try and figure out what that should
mean - and that makes no sense to me.
Cheers,
David
> Thanks
> - Ioi
>
> [1] https://bugs.openjdk.java.net/browse/JDK-8281181
> [2] https://bugs.openjdk.java.net/browse/JDK-8279484