RFR: 2178143: VM crashes if the number of bound CPUs changed during runtime

David Holmes david.holmes at oracle.com
Thu Mar 21 05:28:29 UTC 2013


Hi Yumin,

On 21/03/2013 2:37 PM, Yumin Qi wrote:
<snip>
> I think this is only a workaround and not a solution for the specific
> use case; perhaps there is no perfect solution for it. If customers
> decide to use this flag, they should be aware that performance is not
> the first consideration. To use the number of available processors we
> would need to add code to get that number, rather than hotspot's
> os::active_processor_count(), which returns the number of live
> processors. So do you think I could just use a flag, -XX:+AssumeMP, to
> work around this problem? Since the GC threads will not be based on
> this assumption, I agree (Harold pointed this out as well) that one
> flag is simpler. In fact, the is_MP() code is obsolete on today's
> computers, which are all multi-core (even cell-phone chips), so I
> think the better solution is to remove the calls to is_MP() everywhere
> in hotspot. I will prepare another webrev with -XX:+AssumeMP for the
> next code review.

As per my follow-up email I think AssumeMP is the way to go for now. We 
just need to decide whether the default should be true or false.
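For illustration, here is a minimal standalone sketch (not the actual 
webrev) of how such a flag could gate is_MP(); AssumeMP and 
_processor_count are stand-ins for the proposed flag and the processor 
count sampled at VM startup:

  // Hedged sketch, not HotSpot source: a -XX:+AssumeMP style flag
  // forcing is_MP() to report true even if only one CPU was online
  // when the count was sampled.
  #include <iostream>

  static bool AssumeMP         = true;  // hypothetical default under discussion
  static int  _processor_count = 1;     // sampled once at VM startup

  static bool is_MP() {
    // If AssumeMP is set, emit MP-safe synchronization regardless of
    // how many CPUs were online at startup.
    return AssumeMP || _processor_count != 1;
  }

  int main() {
    std::cout << "is_MP() = " << std::boolalpha << is_MP() << std::endl;
    return 0;
  }

With AssumeMP defaulting to true, a VM started on a box with only one 
CPU online still emits MP-safe code, at some cost on genuinely 
uniprocessor machines.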

Removing is_MP() altogether is more problematic because of the range of 
platforms this has to work on. A build-time solution would define 
is_MP() as a constant for those platforms that want that, and then the 
is_MP calls would not appear in the generated code - while still 
allowing other platforms to insert MP code only when needed. But that 
disallows any runtime configuration for the platforms we assume are 
always MP.
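A sketch of that build-time alternative, assuming an invented ALWAYS_MP 
define rather than any real HotSpot macro:

  // Platforms known to always be MP compile is_MP() down to a constant,
  // so "if (is_MP())" guards fold away; other platforms keep the
  // runtime test (and with it the possibility of runtime configuration).
  static int _processor_count = 1;  // sampled once at VM startup

  #ifdef ALWAYS_MP
  static inline bool is_MP() { return true; }   // UP guards become dead code
  #else
  static inline bool is_MP() { return _processor_count != 1; }
  #endif

  int main() {
    // Built with -DALWAYS_MP the test below folds away at compile time;
    // otherwise it is decided once at runtime.
    return is_MP() ? 0 : 1;
  }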

> For number of GC Threads:
> unsigned int Abstract_VM_Version::nof_parallel_worker_threads(
>                                                       unsigned int num,
>                                                       unsigned int den,
>                                                       unsigned int switch_pt) {
>    if (FLAG_IS_DEFAULT(ParallelGCThreads)) {
>      assert(ParallelGCThreads == 0, "Default ParallelGCThreads is not 0");
>      // For very large machines, there are diminishing returns
>      // for large numbers of worker threads.  Instead of
>      // hogging the whole system, use a fraction of the workers for every
>      // processor after the first 8.  For example, on a 72 cpu machine
>      // and a chosen fraction of 5/8
>      // use 8 + (72 - 8) * (5/8) == 48 worker threads.
>      unsigned int ncpus = (unsigned int) os::active_processor_count();
>      return (ncpus <= switch_pt) ?
>             ncpus :
>            (switch_pt + ((ncpus - switch_pt) * num) / den);
>    } else {
>      return ParallelGCThreads;
>    }
> }
>
> the call to this function is
> unsigned int Abstract_VM_Version::calc_parallel_worker_threads() {
>    return nof_parallel_worker_threads(5, 8, 8);
> }
>
> We can see that, if active_processor_count is 1, it will return 1, and
> the VM will run with a single GC thread. So the better choices may be:
>
> 1) get the processor count, not the active processor count, for
> ParallelGCThreads; that is a decision for the GC team.

Given this occurs at VM startup there may not be any difference. It 
depends on the OS and on any "container" facility (like Solaris zones) 
how many processors are seen to "exist" versus how many are seen to be 
"available".

But this is a GC ergonomics issue distinct from the is_MP problem.
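For concreteness, here is the quoted sizing formula as a standalone 
program (restated verbatim apart from taking ncpus as a parameter), 
showing why a VM that starts with one CPU online gets a single parallel 
GC worker while a 72-cpu machine gets 48:

  #include <cstdio>

  // num/den is the fraction of CPUs used beyond switch_pt
  // (5/8 and 8 in the quoted caller).
  static unsigned int nof_parallel_worker_threads(unsigned int ncpus,
                                                  unsigned int num,
                                                  unsigned int den,
                                                  unsigned int switch_pt) {
    return (ncpus <= switch_pt) ? ncpus
                                : switch_pt + ((ncpus - switch_pt) * num) / den;
  }

  int main() {
    unsigned int cpus[] = { 1, 8, 16, 72 };
    for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
      printf("%2u cpus -> %2u GC workers\n",
             cpus[i], nof_parallel_worker_threads(cpus[i], 5, 8, 8));
    }
    return 0;  // prints 1 -> 1, 8 -> 8, 16 -> 13, 72 -> 48
  }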

> 2) The recommended usage is
>
>    -XX:+AssumeMP -XX:ParallelGCThreads=<number>

It is hard to know whether the people launching the VM will have the 
knowledge needed to choose an appropriate value here.

David
-----

> Thanks
> Yumin
>
>>> 2178143: VM crashes if the number of bound CPUs changed during runtime.
>>>
>>> Situation: a customer first configures only one CPU online and turns
>>> the others offline before starting a Java application; after the Java
>>> program has started, more CPUs are brought back online. Since the VM
>>> started on a single CPU, os::is_MP() returns false, but once more CPUs
>>> are available the OS will schedule the application on multiple CPUs,
>>> causing SEGVs in various places where data consistency was broken. The
>>> solution is to supply a flag to assume the VM is running on MP, so
>>> that locking is always performed.
>>>
>>> http://cr.openjdk.java.net/~minqi/2178143/
>>>
>>> Thanks
>>> Yumin
>


