RFR: 8367319: Add os interfaces to get machine and container values separately [v2]
Erik Österlund
eosterlund at openjdk.org
Thu Oct 23 09:20:48 UTC 2025
On Wed, 22 Oct 2025 06:43:33 GMT, Erik Österlund <eosterlund at openjdk.org> wrote:
>>> In other words - are you against the idea of having an implicit API that gives you either the container or the "machine"?
>>
>> @fisk I am "against" trying to shoe-horn a poorly-defined legacy API into a dichotomy of "container value" versus "machine value". The concepts are not at all clean or well-defined for both variants.
>>
>>> CPU quotas when running in a container.
>>
>> Then you need an explicit API for that. No contortion of "available processors" is going to tell you what your quota is.
>>
>>> The "machine" numbers when running in a container (which get overridden by container numbers).
>>
>> Sounds simple but is it well-defined? Can you even ask the question (again I come back to available processors where sched_getaffinity will already account for the presence of the container if tasksets are used - what answer would you want for the "machine"?).
>>
>> Aside: even if you can ask the question what use are these values if you are running within the container? Is there some means to bypass the container constraints?
>
>> > In other words - are you against the idea of having an implicit API that gives you either the container or the "machine"?
>>
>>
>>
>> @fisk I am "against" trying to shoe-horn a poorly-defined legacy API into a dichotomy of "container value" versus "machine value". The concepts are not at all clean or well-defined for both variants.
>>
>
> Okay. Let's start with the problem domain then.
>
>>
>> > CPU quotas when running in a container.
>>
>>
>>
>> Then you need an explicit API for that. No contortion of "available processors" is going to tell you what your quota is.
>>
>
> While that is true, I was hoping for an API that isn't just the exact Linux implementation that screams of Linux for no good reason. What I mean by that is that having the available processors as a double seems general enough that potentially other OS containers could use it, whether or not they implemented CPU throttling in the same way. Having an explicit fraction instead does not help me much as a user.
>
>>
>> > The "machine" numbers when running in a container (which get overridden by container numbers).
>>
>>
>>
>> Sounds simple but is it well-defined? Can you even ask the question (again I come back to available processors where sched_getaffinity will already account for the presence of the container if tasksets are used - what answer would you want for the "machine"?).
>>
>
> What I specifically care about is how many cores the JVM is allowed to be scheduled to run on. These cores are shared across containers and utilizing them has a shared latency impact.
>
> The container layer might throttle how often you are allowed to run on said processors and for how long. But neither cgroup CPU limit constrains how many processors the JVM is allowed to run on. It might run on all of them for a very short period of time.
>
>>
>> Aside: even if you can ask the question what use are these values if you are running within the container? Is there some means to bypass the container constraints?
>
> So the reason this matters to me is that GC latency with a concurrent GC is all about CPU utilization. As CPU utilization grows, response times grow too across all percentiles.
>
> So when in a position to pick how to pay for GC in terms of memory vs CPU, I want to increasingly bias the cost more towards using memory when the shared cores the container is allowed to run on get saturated, despite the individual container using less CPU.
>
> In other words, there are trade-offs between latency, CPU and memory. All of these ar...
> @fisk there seem to be some assumptions involved in how the container actually operates to achieve its configured limits. If your CPU quota is 50% you don't know whether you will get all cores for 50% of the time or 50% of the cores for all the time - unless your container environment tells you how it operates. My recollection was that you could limit both the number of CPUs and the quota of CPU time allocated. So this would very much be a CGroups API not a generic "container" API (but although we call it a Container API it has been totally shaped by CGroups anyway ...)
It is true that you can a) configure CPU quotas, b) configure CPU weights, and c) configure the set of CPUs a cgroup container may run on. However, I'm not sure I follow how the proposed API makes assumptions about the cgroup implementation. The API exposes only the CPU quotas/limits: the number of processors, on average, that the container is allowed to run on. Are you concerned that some exotic new container implementation might not map cleanly onto an average number of available processors? While that is possible, it sounds like a strange container environment, and I'd cross that bridge when/if we get there. But perhaps that was not your point?
> You then want to be able to ask the execution environment that executes the container what resources it has available for processes (like the container). This is where I am less convinced that a suitable API exists for you to query these values from a process inside the container environment. Calling this outer layer "machine" is not really accurate, but "os" is already taken and you can't change the semantics of the existing os:: API.
Right. There is no intention to change the semantics of the existing os:: API. If the word “machine” is not a good one, perhaps we can find a better one. Perhaps “system”, since it refers to system-level resources?
> But again I think it a mistake to take the existing os:: API and define Container and Machine versions of it, because that existing os:: API does not really map cleanly to what you want to ask. I would look at what you want to ask and define Machine and Container APIs that do that, without trying to tie it back to the existing os API (implementation can of course be shared underneath).
The questions I have are:
1) How close am I to hitting a memory limit at the container level?
2) How close am I to hitting a memory limit at the system level?
3) How close am I to hitting a CPU limit at the container level?
4) How close am I to hitting a CPU limit at the system level?
I think the proposed APIs do map cleanly to what I want to ask (+/- names). Exposing the corresponding resource usage and limits at the container and system levels seems like a good way to answer those questions. So far we have modelled the answers in exactly the same way whether you run inside or outside a container, except that you only get to ask about the innermost layer, never the outer one.
Other than that, I think the main notable difference from the other os:: APIs we have is that the CPU limit is an average at the container level. That is why it is represented as a double, which reflects that better. The existing os:: API rounds up to one when you have 0.125 cores available on average, because it was not designed to express an average. That is probably a good enough approximation for many use cases, but sometimes better precision is needed. Would it perhaps help if the word “average” or “avg” was in the name of the CPU limit function, to avoid confusion?
-------------
PR Comment: https://git.openjdk.org/jdk/pull/27646#issuecomment-3435906478
More information about the hotspot-dev
mailing list