RFR: 8146115 - Improve docker container detection and resource configuration usage
Dmitry Samersoff
dmitry.samersoff at bell-sw.com
Mon Oct 23 07:37:14 UTC 2017
Bob,
I compiled and ran .02 on AArch64 Linux and it works as expected.
-Dmitry
On 23.10.2017 00:52, David Holmes wrote:
> Hi Bob,
>
> Changes seem fine.
>
> I'll take up the issue of whether this should be enabled by default in
> the CSR.
>
> Thanks,
> David
>
> On 21/10/2017 4:44 AM, Bob Vandette wrote:
>> Here’s an updated webrev that hopefully takes care of all remaining
>> comments.
>>
>> http://cr.openjdk.java.net/~bobv/8146115/webrev.02
>>
>> I added the deprecation of the UseCGroupMemoryLimitForHeap option this
>> round since
>> this experimental option should no longer be necessary.
>>
>>
>> Bob.
>>
>>
>>> On Oct 13, 2017, at 9:34 AM, David Holmes <David.Holmes at oracle.com>
>>> wrote:
>>>
>>> Reading back through my suggestion for os.hpp,
>>> initialize_container_support should just be init_container_support.
>>>
>>> Thanks,
>>> David
>>>
>>> On 13/10/2017 11:14 PM, Bob Vandette wrote:
>>>>> On Oct 12, 2017, at 11:08 PM, David Holmes
>>>>> <david.holmes at oracle.com> wrote:
>>>>>
>>>>> Hi Bob,
>>>>>
>>>>> On 13/10/2017 1:43 AM, Bob Vandette wrote:
>>>>>>> On Oct 11, 2017, at 9:04 PM, David Holmes
>>>>>>> <david.holmes at oracle.com> wrote:
>>>>>>>
>>>>>>> Hi Bob,
>>>>>>>
>>>>>>> On 12/10/2017 5:11 AM, Bob Vandette wrote:
>>>>>>>> Here’s an updated webrev for this RFE that contains changes and
>>>>>>>> cleanups based on feedback I’ve received so far.
>>>>>>>> I’m still investigating the best approach for reacting to cpu
>>>>>>>> shares and quotas. I do not believe doing nothing is the answer.
>>>>>>>
>>>>>>> I do. :) Let me try this again. When you run outside of a
>>>>>>> container you don't get 100% of the CPUs - you have to share with
>>>>>>> whatever else is running on the system. You get a fraction of CPU
>>>>>>> time based on the load. We don't try to communicate load
>>>>>>> information to the VM/application so it can adapt. Within a
>>>>>>> container setting shares/quotas is just a way of setting an
>>>>>>> artificial load. So why should we be treating it any differently?
>>>>>> Because today we optimize for a lightly loaded system, and when
>>>>>> running serverless applications in containers we should be
>>>>>> optimizing for a fully loaded system. If developers don’t want
>>>>>> this, they can simply not use shares or quotas and they’ll have
>>>>>> exactly the behavior they have today. I think we just have to
>>>>>> document the new behavior (and how to turn it off) so people know
>>>>>> what to expect.
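>>>>>> Concretely, the kind of mapping being debated would look roughly
>>>>>> like this (an illustrative sketch only, not the webrev code - the
>>>>>> exact policy is still being worked out):
>>>>>>
>>>>>>   // Sketch: derive a processor count from cgroup v1 cpu quota,
>>>>>>   // period and shares, capped at the host count.  MIN2/MAX2 are
>>>>>>   // HotSpot's helpers; the names here are illustrative.
>>>>>>   static int container_cpu_count(int host_cpus, int quota,
>>>>>>                                  int period, int shares) {
>>>>>>     int quota_cpus = (quota > 0 && period > 0)
>>>>>>                          ? (quota + period - 1) / period : host_cpus;
>>>>>>     int share_cpus = (shares > 0) ? MAX2(shares / 1024, 1) : host_cpus;
>>>>>>     return MIN2(MIN2(quota_cpus, share_cpus), host_cpus);
>>>>>>   }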
>>>>>
>>>>> The person deploying the app may not have control over how the app
>>>>> is deployed in terms of shares/quotas. It all depends on how (and by
>>>>> whom) the containers are managed. A big part of my concern here is
>>>>> that I don't know exactly how all this is organized, who knows what
>>>>> in advance, and what they can control.
>>>>>
>>>>> But I'll let this drop, other than raising an additional concern. I
>>>>> don't think just allowing the user to hardwire the number of
>>>>> processors to use will necessarily solve the problem with what
>>>>> available_processors() returns. I'm concerned the execution of the
>>>>> VM may occur in a context where the number of processors is not
>>>>> known in advance, and the user can not disable shares/quotas. In
>>>>> that case we may need to have a flag that says to ignore
>>>>> shares/quotas in the processor count calculation.
>>>> I’m not sure that’s a high-probability issue. It’s my understanding
>>>> that whoever is configuring the container
>>>> management will be specifying the resources required to run these
>>>> applications, which comes with a
>>>> guarantee of those resources. If this issue does come up, I do have
>>>> the -XX:-UseContainerSupport big
>>>> switch that turns all of this off. It will, however, disable the
>>>> memory support as well.
>>>>>
>>>>>> You seem to discount the added cost of 100s of VMs creating lots
>>>>>> of unnecessary threads. With the current JDK 10 code base, on a
>>>>>> heavily loaded system with 88 processors, VmData grows from 60MB
>>>>>> (1 CPU) to 376MB (88 CPUs). This is only mapped memory, and it
>>>>>> depends heavily on how deep in the stack these threads go before it
>>>>>> impacts VmRSS, but it shows the potential downside of having 100s
>>>>>> of VMs each thinking they own the entire machine.
>>>>>
>>>>> I agree that the default ergonomics does not scale well. Anyone
>>>>> doing any serious Java deployment tunes the VM explicitly and does
>>>>> not rely on the defaults. How will they do that in a container
>>>>> environment? I don't know.
>>>>>
>>>>> I would love to see some actual deployment scenarios/experiences
>>>>> for this to understand things better.
>>>> This is one of the reasons I want to get this support out in JDK 10,
>>>> to get some feedback under real scenarios.
>>>>>
>>>>>> I haven’t even done any experiments to determine the added context
>>>>>> switching cost if the VM decides to use excessive
>>>>>> pthreads.
>>>>>>>
>>>>>>> That's not to say an API to provide load/shares/quota information
>>>>>>> may not be useful, but that is a separate issue to what the
>>>>>>> "active processor count" should report.
>>>>>> I don’t have a problem with active processor count reporting the
>>>>>> number of processors we have, but I do have a problem
>>>>>> with our current usage of this information within the VM and Core
>>>>>> libraries.
>>>>>
>>>>> That is a somewhat separate issue. One worth pursuing separately.
>>>> We should look at this as part of the “Container aware Java” JEP.
>>>>>
>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~bobv/8146115/webrev.01
>>>>>>>> Updates:
>>>>>>>> 1. I had to move the processing of AggressiveHeap since the
>>>>>>>> container memory size needs to be known before this can be
>>>>>>>> processed.
>>>>>>>
>>>>>>> I don't like the placement of this - we don't call os:: init
>>>>>>> functions from inside Arguments - we manage the initialization
>>>>>>> sequence from Threads::create_vm. Seems to me that container
>>>>>>> initialization can/should happen in os::init_before_ergo, and the
>>>>>>> AggressiveHeap processing can occur at the start of
>>>>>>> Arguments::apply_ergo().
>>>>>>>
>>>>>>> That said we need to be sure nothing touched by
>>>>>>> set_aggressive_heap_flags will be used before we now reach that
>>>>>>> code - there are a lot of flags being set in there.
>>>>>> This is exactly the reason why I put the call where I did. I put
>>>>>> the call to set_aggressive_heap_flags in finalize_vm_init_args
>>>>>> because that is exactly what this call is doing: it’s finalizing
>>>>>> flags used after the parsing. The impacted flags are definitely
>>>>>> used shortly afterwards, before init_before_ergo is called.
>>>>>
>>>>> I see that now and it is very unfortunate because I really do not
>>>>> like what you had to do here. As you can tell from the logic in
>>>>> create_vm we have always refactored to ensure we can progressively
>>>>> manage the interleaving of OS initialization with Arguments
>>>>> processing. So having a deep part of Argument processing go off and
>>>>> call some more OS initialization is not nice. That said I can't see
>>>>> a way around it without very unreasonable refactoring.
>>>>>
>>>>> But I do have a couple of changes I'd like to request please:
>>>>>
>>>>> 1. Move the call to os::initialize_container_support() up a level
>>>>> to before the call to finalize_vm_init_args(), with a more
>>>>> elaborate comment:
>>>>>
>>>>> // We need to ensure processor and memory resources have been properly
>>>>> // configured - which may rely on arguments we just processed - before
>>>>> // doing the final argument processing. Any argument processing that
>>>>> // needs to know about processor and memory resources must occur after
>>>>> // this point.
>>>>>
>>>>> os::initialize_container_support();
>>>>>
>>>>> // Do final processing now that all arguments have been parsed
>>>>> result = finalize_vm_init_args(patch_mod_javabase);
>>>>>
>>>>> 2. Simplify and modify os.hpp as follows:
>>>>>
>>>>> + LINUX_ONLY(static void pd_initialize_container_support();)
>>>>>
>>>>> public:
>>>>>    static void init(void);                       // Called before command line parsing
>>>>>
>>>>> +  static void initialize_container_support() {  // Called during command line parsing
>>>>> +    LINUX_ONLY(pd_initialize_container_support();)
>>>>> +  }
>>>>>
>>>>>    static void init_before_ergo(void);            // Called after command line parsing
>>>>>                                                   // before VM ergonomics
>>>>>
>>>>> 3. In thread.cpp add a comment here:
>>>>>
>>>>> // Parse arguments
>>>>> + // Note: this internally calls os::initialize_container_support()
>>>>> jint parse_result = Arguments::parse(args);
>>>> All very reasonable changes.
>>>> Thanks,
>>>> Bob.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>>>
>>>>>>>> 2. I no longer use the cpuset.cpus contents since
>>>>>>>> sched_getaffinity reports the correct results
>>>>>>>> even if someone manually updates the cgroup data. I originally
>>>>>>>> didn’t think this was the case, since
>>>>>>>> sched_setaffinity doesn’t automatically update the cpuset file
>>>>>>>> contents, but the inverse is true.
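>>>>>>>> For reference, the affinity query amounts to something like this
>>>>>>>> (sketch only, not the webrev code):
>>>>>>>>
>>>>>>>>   #include <sched.h>
>>>>>>>>
>>>>>>>>   // Count the processors this process may run on; the affinity
>>>>>>>>   // mask already reflects any cpuset the container applies.
>>>>>>>>   static int query_processor_count() {
>>>>>>>>     cpu_set_t cpus;
>>>>>>>>     CPU_ZERO(&cpus);
>>>>>>>>     if (sched_getaffinity(0, sizeof(cpu_set_t), &cpus) == 0) {
>>>>>>>>       return CPU_COUNT(&cpus);
>>>>>>>>     }
>>>>>>>>     return -1;  // caller falls back to another mechanism
>>>>>>>>   }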
>>>>>>>
>>>>>>> Ok.
>>>>>>>
>>>>>>>> 3. I ifdef’d the container function support in
>>>>>>>> src/hotspot/share/runtime/os.hpp to avoid putting stubs in all
>>>>>>>> other os
>>>>>>>> platform directories. I can do this if it’s absolutely necessary.
>>>>>>>
>>>>>>> You should not need to do this if initialization moves as I
>>>>>>> suggested above. os::init_before_ergo() in os_linux.cpp can call
>>>>>>> OSContainer::init().
>>>>>>> No need for os::initialize_container_support() or
>>>>>>> os::pd_initialize_container_support.
>>>>>> But os::init_before_ergo is in shared code.
>>>>>
>>>>> Yep my bad - point is moot now anyway.
>>>>>
>>>>> <snip>
>>>>>
>>>>>>> src/hotspot/os/linux/os_linux.cpp/.hpp
>>>>>>>
>>>>>>> 187     log_trace(os)("available container memory: " JULONG_FORMAT, avail_mem);
>>>>>>> 188     return avail_mem;
>>>>>>> 189   } else {
>>>>>>> 190     log_debug(os,container)("container memory usage call failed: " JLONG_FORMAT, mem_usage);
>>>>>>>
>>>>>>> Why "trace" (the third logging level) to show the information,
>>>>>>> but "debug" (the second level) to show failed calls? You use
>>>>>>> debug in other files for basic info. Overall I'm unclear on your
>>>>>>> use of debug versus trace for the logging.
>>>>>> I use trace for noisy information that is not reporting errors and
>>>>>> debug for failures that are informational and not fatal.
>>>>>> In this case, the call could return -1 or -2: -1 is unlimited and
>>>>>> -2 is an error. In either case we fall back to the
>>>>>> standard system call to get available memory. I would have used
>>>>>> warning, but these messages were occurring during a test run and
>>>>>> causing test failures.
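>>>>>> The fallback pattern is roughly (sketch only; the helper names
>>>>>> here are illustrative, not necessarily the webrev's):
>>>>>>
>>>>>>   // A negative result means "unlimited" (-1) or "error" (-2); in
>>>>>>   // either case fall back to the host's view of available memory.
>>>>>>   julong available_memory() {
>>>>>>     jlong mem_limit = container_memory_limit();  // illustrative helper
>>>>>>     jlong mem_usage = container_memory_usage();  // illustrative helper
>>>>>>     if (mem_limit > 0 && mem_usage > 0 && mem_limit > mem_usage) {
>>>>>>       return (julong)(mem_limit - mem_usage);
>>>>>>     }
>>>>>>     log_debug(os, container)("container memory usage unavailable: "
>>>>>>                              JLONG_FORMAT, mem_usage);
>>>>>>     return host_available_memory();  // illustrative: existing sysinfo path
>>>>>>   }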
>>>>>
>>>>> Okay. Thanks for clarifying.
>>>>>
>>>>>>>
>>>>>>> ---
>>>>>>>
>>>>>>> src/hotspot/os/linux/osContainer_linux.cpp
>>>>>>>
>>>>>>> Dead code:
>>>>>>>
>>>>>>> 376 #if 0
>>>>>>> 377 os::Linux::print_container_info(tty);
>>>>>>> ...
>>>>>>> 390 #endif
>>>>>> I left it in for standalone testing. Should I use some other #if?
>>>>>
>>>>> We don't generally leave dead code in the runtime code. Do you
>>>>> see this as useful after you've finalized the changes?
>>>>>
>>>>> Is this testing just for showing the logging? Is it worth making
>>>>> this a logging-controlled call? Is it suitable for a Gtest test?
>>>>>
>>>>> Thanks,
>>>>> David
>>>>> -----
>>>>>
>>>>>> Bob.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>> Bob.
>>