code review (round 2) for memory commit failure fix (8013057)

Daniel D. Daugherty daniel.daugherty at oracle.com
Thu Jun 6 19:49:23 PDT 2013


David,

Thanks for the re-review. Replies are embedded below.


On 6/6/13 8:05 PM, David Holmes wrote:
> Hi Dan,
>
> On 6/06/2013 10:17 AM, Daniel D. Daugherty wrote:
>> Greetings,
>>
>> Since we're in a holding pattern with RT_Baseline, I've taken
>> the opportunity to make the code style changes that Stefan
>> requested.
>>
>> New webrev:
>>
>>      http://cr.openjdk.java.net/~dcubed/8013057-webrev/3-hsx25/
>>
>> Stefan has already re-reviewed and approves. JPRT-hotspotwest job
>> is in progress.
>
> For the record I find it somewhat confusing to have these different 
> failure semantics depending on the platform. It might simplify both 
> the pd code and the shared code if any mmap failure were simply a 
> fatal error. But we're not taking that path so ...
>
> I don't really understand the "recoverable mmap error" logic:
>
> 2557   case EBADF:
> 2558   case EINVAL:
> 2559   case ENOTSUP:
>
> these are, we hope, errors caused by failed pre-conditions and so we 
> expect that at the point these issues are detected the mmap has been a 
> no-op and so no reservation has been lost. Fair enough.

Prior to this fix, we called os::commit_memory() and it came back
with true or false and the caller "did something" with that failure
mode. The idea is that the upper layers have some idea about the
overall semantics of what they are trying to do and are in the best
position to decide whether to fail or retry later.
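
For example, a call site that treated the failure as fatal looked
roughly like this (paraphrasing the card table expansion code that is
quoted further down; details elided):

  // pre-fix contract: os::commit_memory() returns a bool and the
  // caller decides what a failure means in its own context
  if (!os::commit_memory((char*)new_committed.start(),
                         new_committed.byte_size())) {
    // this particular caller chooses to exit; others retry or back off
    vm_exit_out_of_memory(new_committed.byte_size(), OOM_MMAP_ERROR,
                          "card table expansion");
  }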

Along comes our realization that the underlying mmap() API doesn't
quite behave the way we expect on all OSes. If all the OSes were
equally screwed up, we would fix this a different way. However, it
is just Linux and Solaris that screw this up so we limit the new
vm_exit_out_of_memory() calls for this mmap() failure to those two
OSes.

Any failures that make it back from os::commit_memory() to the upper
layers will continue to be handled at those upper layers the way they
were handled before. I don't think we're ready to completely rewrite
the whole reserve_memory/commit_memory model. That would be a bit
much since this has to be backported.
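
To be concrete about the API surface: os::commit_memory() still returns
a bool, and the only addition is os::commit_memory_or_exit() for call
sites that already exited on failure. Roughly (declarations paraphrased
from the webrev; the alignment_hint overloads are omitted):

  // os.hpp (paraphrased): the bool-returning contract is unchanged
  static bool commit_memory(char* addr, size_t bytes, bool executable);
  // new entry point for callers that want the VM to exit on failure
  static void commit_memory_or_exit(char* addr, size_t bytes,
                                    bool executable, const char* mesg);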

> But to me these indicate programming errors in our code so why not 
> fail-fast anyway?

Maybe they are programming errors and maybe they are crazy commit_memory()
probes in an attempt to adapt to some page size thing. I can't say that
I understand all the reasons that the above three errnos are returned
nor do I need to for this bug fix. What I do know is that these three
errnos were permitted to trickle back before and it appears to be safe
to continue to let them do so since they won't mess up existing mappings.
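
If it helps, the decision boils down to classifying the errno before
deciding whether to hand the failure back or exit. A simplified sketch
of the Linux/Solaris commit path (the helper name here is just for
illustration; the webrev itself uses an inline switch, quoted below):

  #include <errno.h>

  // EBADF/EINVAL/ENOTSUP indicate a failed pre-condition, so the mmap()
  // was a no-op and the existing 'reserved' mapping was not disturbed
  static bool recoverable_mmap_error(int err) {
    switch (err) {
    case EBADF:
    case EINVAL:
    case ENOTSUP:
      return true;   // safe to hand back to the caller, as before
    default:
      return false;  // the reservation may have been clobbered
    }
  }

  // ... later, when a commit fails with 'err' (paraphrased):
  if (!recoverable_mmap_error(err)) {
    warn_fail_commit_memory(addr, size, exec, err);
    vm_exit_out_of_memory(size, OOM_MMAP_ERROR,
                          "committing reserved memory.");
  }
  // otherwise return the failure to the caller the way we always have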


> If these occur we are going to lose the detailed information by 
> returning false and this may not even trigger an upper layer failure 
> allowing a programming error to go undetected.

Sorry, this is no different from how we behaved before, so I'm not planning
to worry about that for this fix. As I said above, we're not rewriting
the whole reserve_memory/commit_memory model.


> Is there a reason that:
>
> int os::Linux::commit_memory_impl
>
> is not simply
>
> static int commit_memory_impl? It seems to be purely internal and so 
> doesn't need to be part of the os::Linux exported API. Ditto for the 
> Solaris version.

Seemed like the place to put it since there are other similar
peers at the os::Linux level. I used 'static' functions for
the things that were strictly helper functions. That also seemed
like the thing to do.
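
Just to illustrate the two placements being discussed (signatures
paraphrased from the webrev, bodies and the alignment_hint overload
omitted):

  // as in the webrev: declared in os_linux.hpp next to its pd-level peers
  class Linux {
   public:
    static int commit_memory_impl(char* addr, size_t size, bool exec);
    // ... other pd helpers ...
  };

  // David's suggestion: a file-scope helper private to os_linux.cpp
  static int commit_memory_impl(char* addr, size_t size, bool exec);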



> ---
>
> Otherwise ok. This is messy issue, thanks for taking the time to try 
> and make it cleaner.

Thanks! I'm not planning to make any changes based on your review.
Please let me know if you're OK with that.

Credit for making it cleaner goes to Stefan K. He kept after me
about the readability of this code and about not overloading too
many things into one construct.

Dan


>
> Cheers,
> David
> -----
>
>> Dan
>>
>>
>>
>> On 6/5/13 7:01 AM, Daniel D. Daugherty wrote:
>>> Stefan,
>>>
>>> Thanks for the re-review! And for several off-line reviews
>>> as we were trying to work out this 'round 2' version.
>>>
>>>
>>> On 6/5/13 2:02 AM, Stefan Karlsson wrote:
>>>> On 06/04/2013 07:32 PM, Daniel D. Daugherty wrote:
>>>>> Greetings,
>>>>>
>>>>> I have another revised version of the proposed fix for the following
>>>>> bug:
>>>>>
>>>>>     8013057 assert(_needs_gc ||
>>>>> SafepointSynchronize::is_at_safepoint())
>>>>>             failed: only read at safepoint
>>>>>
>>>>> Here are the (round 2) webrev URLs:
>>>>>
>>>>> OpenJDK: http://cr.openjdk.java.net/~dcubed/8013057-webrev/2-hsx25/
>>>>> Internal:
>>>>> http://javaweb.us.oracle.com/~ddaugher/8013057-webrev/2-hsx25/
>>>>
>>>> Looks good. Thanks for fixing this and making sure that this new code
>>>> works well with the error handling in the GC code.
>>>
>>> Thanks! The new commit_memory_or_exit() cleaned up call sites in
>>> several places so I appreciate your feedback.
>>>
>>>
>>>> I have two code style comments that are not necessary to fix, but it
>>>> would make me happy if they were.
>>>>
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8013057-webrev/2-hsx25/src/share/vm/runtime/os.hpp.udiff.html 
>>>>
>>>>
>>>>
>>>> +// Executable parameter flag for os::commit_memory() and
>>>> +// os::commit_memory_or_exit().
>>>> +enum ExecMemFlag {
>>>> +  ExecMem = true
>>>> +};
>>>>
>>>>
>>>> Why is this an Enum and not a const bool when it's used as a bool?
>>>
>>> Hmmm... It's an Enum because that seemed to be the style used in other
>>> places
>>> in the VM when we didn't want to pass "true /* some comment */" as a
>>> parameter. Do you happen to have a "const bool" example I could look 
>>> at?
>>> I'll poke around, but if you could point me at an example, I would
>>> appreciate it...
>>>
>>>
>>>> http://cr.openjdk.java.net/~dcubed/8013057-webrev/2-hsx25/src/os/linux/vm/os_linux.cpp.udiff.html 
>>>>
>>>>
>>>>
>>>> +  // see if the error is one we can let the caller handle
>>>> +  switch (err) {
>>>> +  case EBADF:
>>>> +  case EINVAL:
>>>> +  case ENOTSUP:
>>>> +    // let the caller deal with these errors
>>>> +    break;
>>>> +
>>>> +  default:
>>>> +    // Any remaining errors on this OS can cause our reserved mapping
>>>> +    // to be lost. That can cause confusion where different data
>>>> +    // structures think they have the same memory mapped. The worst
>>>> +    // scenario is if both the VM and a library think they have the
>>>> +    // same memory mapped.
>>>> +    warn_fail_commit_memory(addr, size, exec, err);
>>>> +    vm_exit_out_of_memory(size, OOM_MMAP_ERROR, "committing reserved memory.");
>>>> +    break;
>>>> +  }
>>>>
>>>> and
>>>>
>>>> +    switch (err) {
>>>> +    case EBADF:
>>>> +    case EINVAL:
>>>> +    case ENOTSUP:
>>>> +      // these errors are OK for retrying with small pages
>>>> +      break;
>>>> +
>>>> +    default:
>>>> +      // Any remaining errors on this OS can cause our reserved mapping
>>>> +      // to be lost. That can cause confusion where different data
>>>> +      // structures think they have the same memory mapped. The worst
>>>> +      // scenario is if both the VM and a library think they have the
>>>> +      // same memory mapped.
>>>> +      //
>>>> +      // However, it is not clear that this loss of our reserved mapping
>>>> +      // happens with large pages on Linux or that we cannot recover
>>>> +      // from the loss. For now, we just issue a warning and we don't
>>>> +      // call vm_exit_out_of_memory(). This issue is being tracked by
>>>> +      // JBS-8007074.
>>>> +      warn_fail_commit_memory(addr, size, alignment_hint, exec, err);
>>>> +#if 0
>>>> +      vm_exit_out_of_memory(size, OOM_MMAP_ERROR,
>>>> +                            "committing reserved memory.");
>>>> +#endif
>>>> +      break;
>>>> +    }
>>>>
>>>> These constructs are rather long and duplicated. I'd prefer if this
>>>> could be extracted out into a helper function so that the logic of
>>>> commit_memory_impl is a bit clearer.
>>>
>>> I agree that this could be improved. You had recommended this:
>>>
>>> if (!recoverable_error(err)) {
>>>   warn_fail_commit_memory(addr, size, exec, err);
>>>   vm_exit_out_of_memory(size, OOM_MMAP_ERROR, "committing reserved memory.");
>>> }
>>>
>>> and this:
>>>
>>> if (!recoverable_error(err)) {
>>>   warn_fail_commit_memory(addr, size, alignment_hint, exec, err);
>>> #if 0
>>>   vm_exit_out_of_memory(size, OOM_MMAP_ERROR, "committing reserved memory.");
>>> #endif
>>> }
>>>
>>> I concur that the identical parts can be refactored into
>>> recoverable_error(), but the "However, it is not clear..."
>>> paragraph needs to stay with the disabled vm_exit_out_of_memory()
>>> call. BTW, Coleen requested (off-thread) that the call be
>>> commented out instead of #if'ed out.
>>>
>>> I'll definitely make these changes if I have to re-roll the
>>> patch for non-code style changes. Depending on how this
>>> changeset lines up with RT_Baseline's schedule this week,
>>> I may even make the changes anyway.
>>>
>>> Again, thanks for the many review cycles.
>>>
>>> Dan
>>>
>>>
>>>>
>>>> thanks,
>>>> StefanK
>>>>
>>>>>
>>>>> Testing:
>>>>> - Aurora Adhoc vm.quick batch for all OSes in the following configs:
>>>>>   {Client VM, Server VM} x {fastdebug} x {-Xmixed}
>>>>> - I've created a standalone Java stress test with a shell script
>>>>>   wrapper that reproduces the failing code paths on my Solaris X86
>>>>>   server and on Ron's DevOps Linux machine. This test will not be
>>>>>   integrated since running the machine out of swap space is very
>>>>>   disruptive (crashes the window system, causes various services to
>>>>>   exit, etc.)
>>>>>
>>>>> There are three parts to this fix:
>>>>>
>>>>> 1) Detect commit memory failures on Linux and Solaris where the
>>>>>    previous reservation can be lost and call vm_exit_out_of_memory()
>>>>>    to report the resource exhaustion. Add os::commit_memory_or_exit()
>>>>>    API to provide more consistent handling of vm_exit_out_of_memory()
>>>>>    calls.
>>>>> 2) Change existing os::commit_memory() calls to make the executable
>>>>>    status of memory more clear; this makes security analysis easier.
>>>>> 3) Clean up some platform dependent layer calls that were resulting
>>>>>    in extra NMT accounting. Clean up some mmap() return value checks.
>>>>>
>>>>> Gory details are below. As always, comments, questions and
>>>>> suggestions are welcome.
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>> Gory Details:
>>>>>
>>>>> The VirtualSpace data structure is built on top of the ReservedSpace
>>>>> data structure. VirtualSpace presumes that failed os::commit_memory()
>>>>> calls do not affect the underlying ReservedSpace memory mappings.
>>>>> That assumption is true on MacOS X and Windows, but it is not true
>>>>> on Linux or Solaris. The mmap() system call on Linux or Solaris can
>>>>> lose previous mappings in the event of certain errors. On MacOS X,
>>>>> the mmap() system call clearly states that previous mappings are
>>>>> replaced only on success. On Windows, a different set of APIs are
>>>>> used and they do not document any loss of previous mappings.
>>>>>
>>>>> The solution is to implement the proper failure checks in the
>>>>> os::commit_memory() implementations on Linux and Solaris. On MacOS X
>>>>> and Windows, no additional checks are needed.
>>>>>
>>>>> During code review round 1, there was a request from the GC team to
>>>>> provide an os::commit_memory_or_exit() entry point in order to 
>>>>> preserve
>>>>> the existing error messages on all platforms. This entry point allows
>>>>> code like this:
>>>>>
>>>>> src/share/vm/gc_implementation/parallelScavenge/cardTableExtension.cpp: 
>>>>>
>>>>>
>>>>>  568       if (!os::commit_memory((char*)new_committed.start(),
>>>>>  569                              new_committed.byte_size())) {
>>>>>  570         vm_exit_out_of_memory(new_committed.byte_size(), OOM_MMAP_ERROR,
>>>>>  571                               "card table expansion");
>>>>>
>>>>> to be replaced with code like this:
>>>>>
>>>>>  568       os::commit_memory_or_exit((char*)new_committed.start(),
>>>>>  569                                 new_committed.byte_size(), !ExecMem,
>>>>>  570                                 "card table expansion");
>>>>>
>>>>> All uses of os::commit_memory() have been visited and those locations
>>>>> that previously exited on error have been updated to use the new 
>>>>> entry
>>>>> point. This new entry point cleans up the original call sites and the
>>>>> vm_exit_out_of_memory() calls are now consistent on all platforms.
>>>>>
>>>>> As a secondary change, while visiting all os::commit_memory() 
>>>>> calls, I
>>>>> also updated them to use the new ExecMem enum in order to make the
>>>>> executable status of the memory more clear. Since executable 
>>>>> memory can
>>>>> be an attack vector, it is prudent to make the executable status of
>>>>> memory crystal clear. This also allowed me to remove the default
>>>>> executable flag value of 'false'. Now all new uses of commit_memory()
>>>>> must be clear about the executable status of the memory.
>>>>>
>>>>> There are also tertiary changes where some of the pd_commit_memory()
>>>>> calls were calling os::commit_memory() instead of calling their 
>>>>> sibling
>>>>> os::pd_commit_memory(). This resulted in double NMT tracking so this
>>>>> has also been fixed. There were also some incorrect mmap() return
>>>>> value checks which have been fixed.
>>>>>
>>>>> Just to be clear: This fix simply properly detects the "out of swap
>>>>> space" condition on Linux and Solaris and causes the VM to fail in a
>>>>> more orderly fashion with a message that looks like this:
>>>>>
>>>>> The Java process' stderr will show:
>>>>>
>>>>> INFO: os::commit_memory(0xfffffd7fb2522000, 4096, 4096, 0) failed; errno=11
>>>>> #
>>>>> # There is insufficient memory for the Java Runtime Environment to continue.
>>>>> # Native memory allocation (mmap) failed to map 4096 bytes for committing reserved memory.
>>>>> # An error report file with more information is saved as:
>>>>> # /work/shared/bugs/8013057/looper.03/hs_err_pid9111.log
>>>>>
>>>>> The hs_err_pid file will have the more verbose info:
>>>>>
>>>>> #
>>>>> # There is insufficient memory for the Java Runtime Environment to continue.
>>>>> # Native memory allocation (mmap) failed to map 4096 bytes for committing reserved memory.
>>>>> # Possible reasons:
>>>>> #   The system is out of physical RAM or swap space
>>>>> #   In 32 bit mode, the process size limit was hit
>>>>> # Possible solutions:
>>>>> #   Reduce memory load on the system
>>>>> #   Increase physical memory or swap space
>>>>> #   Check if swap backing store is full
>>>>> #   Use 64 bit Java on a 64 bit OS
>>>>> #   Decrease Java heap size (-Xmx/-Xms)
>>>>> #   Decrease number of Java threads
>>>>> #   Decrease Java thread stack sizes (-Xss)
>>>>> #   Set larger code cache with -XX:ReservedCodeCacheSize=
>>>>> # This output file may be truncated or incomplete.
>>>>> #
>>>>> #  Out of Memory Error (/work/shared/bug_hunt/hsx_rt_latest/exp_8013057/src/os/solaris/vm/os_solaris.cpp:2791), pid=9111, tid=21
>>>>> #
>>>>> # JRE version: Java(TM) SE Runtime Environment (8.0-b89) (build 1.8.0-ea-b89)
>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b33-bh_hsx_rt_exp_8013057_dcubed-product-fastdebug mixed mode solaris-amd64 compressed oops)
>>>>> # Core dump written. Default location: /work/shared/bugs/8013057/looper.03/core or core.9111
>>>>> #
>>>>>
>>>>> You might be wondering why we are assuming that the failed mmap()
>>>>> commit operation has lost the 'reserved memory' mapping.
>>>>>
>>>>>     We have no good way to determine if the 'reserved memory' mapping
>>>>>     is lost. Since all the other threads are not idle, it is possible
>>>>>     for another thread to have 'reserved' the same memory space for a
>>>>>     different data structure. Our thread could observe that the
>>>>>     memory is still 'reserved' but we have no way to know whether
>>>>>     that reservation is still ours.
>>>>>
>>>>> You might be wondering why we can't recover from this transient
>>>>> resource availability issue.
>>>>>
>>>>>     We could retry the failed mmap() commit operation, but we would
>>>>>     again run into the issue that we no longer know which data
>>>>>     structure 'owns' the 'reserved' memory mapping. In particular,
>>>>>     the memory could be reserved by native code calling mmap()
>>>>>     directly so the VM really has no way to recover from this
>>>>>     failure.
>>>>>
>>>>> You might be wondering why part of this work is deferred:
>>>>>
>>>>> 2654     default:
>>>>> 2655       // Any remaining errors on this OS can cause our reserved mapping
>>>>> 2656       // to be lost. That can cause confusion where different data
>>>>> 2657       // structures think they have the same memory mapped. The worst
>>>>> 2658       // scenario is if both the VM and a library think they have the
>>>>> 2659       // same memory mapped.
>>>>> 2660       //
>>>>> 2661       // However, it is not clear that this loss of our reserved mapping
>>>>> 2662       // happens with large pages on Linux or that we cannot recover
>>>>> 2663       // from the loss. For now, we just issue a warning and we don't
>>>>> 2664       // call vm_exit_out_of_memory(). This issue is being tracked by
>>>>> 2665       // JBS-8007074.
>>>>> 2666       warn_fail_commit_memory(addr, size, alignment_hint, exec, err);
>>>>> 2667 #if 0
>>>>> 2668       vm_exit_out_of_memory(size, OOM_MMAP_ERROR,
>>>>> 2669                             "committing reserved memory.");
>>>>> 2670 #endif
>>>>> 2671       break;
>>>>>
>>>>>     When lines 2668-2669 are enabled and UseHugeTLBFS is specified,
>>>>>     then the VM will exit because no more huge/large pages are
>>>>>     available. It is not yet clear that this transition from large to
>>>>>     small pages is actually unsafe, but we don't yet have proof that
>>>>>     it is safe either. More research will be done via JBS-8007074.
>>>>>
>>>>
>>>
>>>
>>>
>>


