[RFC containers] 8281181 JDK's interpretation of CPU Shares causes underutilization
David Holmes
david.holmes at oracle.com
Mon Feb 7 01:12:21 UTC 2022
Just for the record ...
On 3/02/2022 7:19 pm, David Holmes wrote:
> Hi Ioi,
>
> For the benefit of the mailing list discussion ...
>
> On 3/02/2022 5:30 pm, Ioi Lam wrote:
>> Please see the bug report [1] for detailed description and test cases.
>>
>> I'd like to have some discussion before we can decide what to do.
>>
>> I discovered this issue when analyzing JDK-8279484 [2]. Under
>> Kubernetes (minikube), Runtime.availableProcessors() returns 1,
>> despite the fact that the machine has 32 CPUs, the Kubernetes node has
>> a single deployment, and no CPU limits were set.
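The symptom reproduces with a one-liner inside the container (this just
restates what the report observed):

    // The JVM's notion of usable CPUs; in the minikube setup described
    // above this printed 1 on a 32-CPU machine.
    System.out.println(Runtime.getRuntime().availableProcessors());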
>>
>> Specifically, I want to understand why the JDK is using
>> CgroupSubsystem::cpu_shares() to limit the number of CPUs used by the
>> Java process.
>
> Because we were asked to by customers deploying in containers.
>
>> In cgroups, there are other mechanisms designed specifically for
>> limiting the number of CPUs, e.g., CgroupSubsystem::cpu_quota(). Why
>> is using cpu_quota() alone not enough? Why did we choose the current
>> approach of considering both cpu_quota() and cpu_shares()?
>
> Because people were using both (whether that made sense or not) and so
> we needed a policy on what to do if both were set.
>
>> My guess is that sometimes people don't limit the actual number of
>> CPUs per container, but instead use CPU Shares to set the relative
>> scheduling priority between containers.
>>
>> I.e., they run "docker run --cpu-shares=1234" without using the
>> "--cpus" flag.
>>
>> If this is indeed the reason, I can understand the (good) intention,
>> but the solution seems awfully insufficient.
>>
>> CPU Shares is a *relative* number. How much CPU is allocated to you
>> depends on
>>
>> - how many other processes are actively running
>> - what their CPU Shares are
>>
>> The above information can change dynamically, as other processes may
>> be added or removed, and they can change between active and idle states.
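To make the relativity concrete, here is a hypothetical sketch (not
anything in the JDK) of what the effective CPU fraction actually
depends on:

    // A cgroup's CPU fraction under contention is its shares divided by
    // the total shares of all *actively running* sibling cgroups - a
    // quantity no static formula inside the JVM can know.
    static double effectiveCpuFraction(long myShares, long[] activeSiblingShares) {
        long total = myShares;
        for (long s : activeSiblingShares) {
            total += s;
        }
        return (double) myShares / total;
    }
    // shares=1024 beside one busy sibling with shares=2048 -> ~0.33;
    // beside no busy siblings -> 1.0, for the very same shares value.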
>>
>> However, the JVM treats CPU Shares as an *absolute/static* number, and
>> derives the current process's CPU limit from it using this very
>> simplistic formula.
>
> From old discussions and the code, I believe the thought was that a
> share was relative to the per-CPU default shares value of 1024. So we
> use that to determine the fraction of each CPU that should be assigned,
> and we should then use that to determine the available number of CPUs.
> But that isn't what we actually do - we only calculate the fraction and
> round it up to get the number of CPUs, and that is wrong (it typically
> gives only 1 CPU because shares < 1024). I speculate that what was
> intended was to map from having an X% share of each CPU to instead
> having access to X% of the total CPUs (at 100% of each). Mathematically
> this has some basis, but it makes no practical sense from a throughput
> or response-time perspective. If I'm allowed 50% of the CPU per time
> period to do my calculations, I want 100% of each CPU for half of the
> period, as that potentially minimises the elapsed time till I have a
> result.
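In code form, the mapping I speculate was intended (a sketch using the
1024 per-CPU default; not the literal hotspot source):

    // Speculated intent: scale the per-CPU fraction by the host CPU count.
    static int intendedCpus(int shares, int hostCpus) {
        double fractionPerCpu = shares / 1024.0;            // 512 -> 0.5
        return (int) Math.ceil(fractionPerCpu * hostCpus);  // 0.5 * 32 -> 16
    }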
>
>> Value of /sys/fs/cgroup/cpu.shares -> CPU limit computed by the JVM:
>>
>> 1023 -> 1 CPU
>> 1024 -> no limit (huh??)
>> 2048 -> 2 CPUs
>> 4096 -> 4 CPUs
>>
>> This seems just wrong to me. There's no way you can get a "correct"
>> result without knowing anything about other processes that are running
>> at the same time.
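For reference, the table above falls out of roughly this calculation (a
simplified sketch of how CgroupSubsystem::active_processor_count treats
shares; not the literal source):

    static int sharesToCpuLimit(int shares, int hostCpus) {
        final int PER_CPU_SHARES = 1024;    // the per-CPU default
        if (shares == PER_CPU_SHARES) {
            return hostCpus;                // treated as "unset": no limit
        }
        int count = (int) Math.ceil((double) shares / PER_CPU_SHARES);
        return Math.min(count, hostCpus);   // 1023 -> 1, 2048 -> 2, 4096 -> 4
    }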
>
> As I said above and in the bug report I think this was an error and the
> intent was to then multiply by the number of actual processors.
No, it was not an error. See the discussion Severin referenced:
http://mail.openjdk.java.net/pipermail/hotspot-dev/2019-January/036093.html
David
-----
>> The net effect is that when Java is running under a container, more
>> likely than not the JVM will limit itself to a single CPU. This seems
>> really inefficient to me.
>
> Yes.
>
>> What should we do?
>
> We could just adjust the calculation as I suggested.
>
> Or, given that share (aka weight) is meaningless without knowing the
> total weight in the system, we could just ignore it. The app then gets
> access to all CPUs and it is up to the container to track actual usage
> and impose any limits configured.
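Whichever way we go, deployments hitting this today can already
side-step the heuristic with existing switches:

    java -XX:ActiveProcessorCount=<n> ...  # pin the CPU count the JVM reports
    java -XX:-UseContainerSupport ...      # disable container detection (Linux only)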
>
> I've always thought that these cgroup mechanisms were fundamentally
> flawed: if the intent was to define a resource-limited environment,
> then the environment should report what resources are available via
> the normal APIs. They got this right with cpu-sets by integrating with
> sched_getaffinity; but for shares and quotas it has been left to
> applications to try to figure out what those values should mean - and
> that makes no sense to me.
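To illustrate the cpu-sets contrast (my example, not from the report):
with an explicit cpuset the limit arrives through the ordinary API,
with no cgroup parsing required:

    // With e.g. "docker run --cpuset-cpus=0-3 ...", sched_getaffinity
    // already reports 4 usable CPUs, so this prints 4 without the JVM
    // reading anything from the cgroup filesystem.
    System.out.println(Runtime.getRuntime().availableProcessors());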
>
> Cheers,
> David
>
>> Thanks
>> - Ioi
>>
>> [1] https://bugs.openjdk.java.net/browse/JDK-8281181
>> [2] https://bugs.openjdk.java.net/browse/JDK-8279484