code review for memory commit failure fix (8013057)

Thu May 23 09:39:27 PDT 2013

David,

Thanks for the quick review!

On 5/23/13 3:53 AM, David Holmes wrote:
> Hi Dan,
>
> I think I am in full agreement with your approach here, but need a 
> little time to digest the details.

No problem, we need to be absolutely paranoid about this fix.
However, I'm going to rework the fix based on Stefan's review
so hold off on code specifics. Feel free to keep mulling over
the theory behind this 'fix'.

>
> Meanwhile I noticed that something seems wrong with the error message:
>
> > # Native memory allocation (mmap) failed to map 4096 bytes for out of
> > swap space

Yes, I saw the same thing when I was testing and I convinced myself
that my message was similar to other uses and thus it was "OK".
I'll see what I can do about fixing that up.

>
> The invocation:
>
> vm_exit_out_of_memory(bytes, OOM_MMAP_ERROR, "out of swap space");
>
> should be passing information about what action failed, not why it 
> failed (based on how the message gets used). Grepping through I can 
> see a lot of other uses of vm_exit_out_of_memory that will also 
> produce grammatically strange messages eg:
>
> failed to map 4096 bytes for unable to create Unicode strings for 
> String table rehash
>
> failed to map 4096 bytes for Cannot create GC thread. Out of system 
> resources.
>
> failed to map 4096 bytes for Out of swap space to map in thread stack.

I don't think I'll revisit the other places since this fix will
likely be applied to HSX-24 (and possibly older). However, a new
tracking bug for the 'grammatically strange messages' will be filed.
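
For illustration, here is a minimal C++ sketch of why the message reads
oddly. The template is a hypothetical stand-in inferred from the output
quoted above, not the actual HotSpot format string, and oom_message() and
the "committing reserved memory" detail string are made-up names. The
last argument lands after "for", so it has to describe the action that
failed rather than the cause.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in for the VM's message template. The wording is
// inferred from the output quoted in this thread; the real format
// string in HotSpot may differ.
std::string oom_message(const char* api, std::size_t bytes,
                        const char* detail) {
  char buf[256];
  // 'detail' is substituted after "for", so it should name the failed
  // action (what we were trying to do), not the reason for the failure.
  std::snprintf(buf, sizeof(buf),
                "Native memory allocation (%s) failed to map %zu bytes for %s",
                api, bytes, detail);
  return std::string(buf);
}
```

Passing "out of swap space" reproduces the odd text, while an action
phrase such as "committing reserved memory" yields a sentence that
parses.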

Dan

>
> David
> -----
>
> On 23/05/2013 11:37 AM, Daniel D. Daugherty wrote:
>> Greetings,
>>
>> I have a proposed fix for the following bug:
>>
>>      8013057 assert(_needs_gc || SafepointSynchronize::is_at_safepoint())
>>              failed: only read at safepoint
>>
>> Here are the webrev URLs:
>>
>> OpenJDK: http://cr.openjdk.java.net/~dcubed/8013057-webrev/0-hsx25/
>> Internal: http://javaweb.us.oracle.com/~ddaugher/8013057-webrev/0-hsx25/
>>
>> Testing:
>> - vm.quick on Win32 and Solaris X86 in the following configs:
>>    {Client VM, Server VM} x {product, fastdebug} x {-Xmixed, -Xcomp}
>> - I also have an Aurora Adhoc vm.quick batch for all OSes in the
>>    following configs:
>>    {Client VM, Server VM} x {fastdebug} x {-Xmixed}
>> - I've created a standalone Java stress test with a shell script
>>    wrapper that reproduces the failing code paths on my Solaris X86
>>    server. This test will not be integrated since running the machine
>>    out of swap space is very disruptive (crashes the window system,
>>    causes various services to exit, etc.)
>>
>> Gory details are below. As always, comments, questions and
>> suggestions are welcome.
>>
>> Dan
>>
>>
>> Gory Details:
>>
>> The VirtualSpace data structure is built on top of the ReservedSpace
>> data structure. VirtualSpace presumes that failed os::commit_memory()
>> calls do not affect the underlying ReservedSpace memory mappings.
>> That assumption is true on MacOS X and Windows, but it is not true
>> on Linux or Solaris. The mmap() system call on Linux or Solaris can
>> lose previous mappings in the event of certain errors. On MacOS X,
>> the mmap() documentation clearly states that previous mappings are
>> replaced only on success. On Windows, a different set of APIs is
>> used and their documentation does not mention any loss of previous
>> mappings.
>>
>> The solution is to add a new os::commit_reserved_memory() call and
>> implement the proper failure checks on Linux and Solaris. VirtualSpace
>> is changed to use os::commit_reserved_memory(). On MacOS X and Windows,
>> os::commit_reserved_memory() is simply a wrapper around the original
>> os::commit_memory().
>>
>> There is also a secondary change where some of the pd_commit_memory()
>> calls were calling os::commit_memory() instead of calling their sibling
>> os::pd_commit_memory(). This resulted in double NMT tracking so this
>> has also been fixed.
>>
>> Just to be clear: This fix simply detects the "out of swap space"
>> condition on Linux and Solaris and causes the VM to fail in a more
>> orderly fashion with messages that look like this:
>>
>> The Java process' stderr will show:
>>
>> INFO: os::commit_reserved_memory(0xfffffd7fb2522000, 4096, 4096, 0) failed; errno=11
>> #
>> # There is insufficient memory for the Java Runtime Environment to continue.
>> # Native memory allocation (mmap) failed to map 4096 bytes for out of swap space
>> # An error report file with more information is saved as:
>> # /work/shared/bugs/8013057/looper.03/hs_err_pid9111.log
>>
>> The hs_err_pid file will have the more verbose info:
>>
>> #
>> # There is insufficient memory for the Java Runtime Environment to continue.
>> # Native memory allocation (mmap) failed to map 4096 bytes for out of swap space
>> # Possible reasons:
>> #   The system is out of physical RAM or swap space
>> #   In 32 bit mode, the process size limit was hit
>> # Possible solutions:
>> #   Reduce memory load on the system
>> #   Increase physical memory or swap space
>> #   Check if swap backing store is full
>> #   Use 64 bit Java on a 64 bit OS
>> #   Decrease Java heap size (-Xmx/-Xms)
>> #   Decrease number of Java threads
>> #   Decrease Java thread stack sizes (-Xss)
>> #   Set larger code cache with -XX:ReservedCodeCacheSize=
>> # This output file may be truncated or incomplete.
>> #
>> #  Out of Memory Error (/work/shared/bug_hunt/hsx_rt_latest/exp_8013057/src/os/solaris/vm/os_solaris.cpp:2791), pid=9111, tid=21
>> #
>> # JRE version: Java(TM) SE Runtime Environment (8.0-b89) (build 1.8.0-ea-b89)
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b33-bh_hsx_rt_exp_8013057_dcubed-product-fastdebug mixed mode solaris-amd64 compressed oops)
>> # Core dump written. Default location: /work/shared/bugs/8013057/looper.03/core or core.9111
>> #
>>
>> You might be wondering why we added os::commit_reserved_memory()
>> instead of changing os::commit_memory().
>>
>>      We wanted to limit the potential impact. The new checks are only
>>      needed when committing previously reserved memory on Linux and
>>      Solaris. It appears that os::commit_memory() does not require
>>      a previous os::reserve_memory() call and in those cases, we
>>      didn't want to make the code path unrecoverable.
>>
>> You might be wondering why we are assuming that the failed mmap()
>> commit operation has lost the 'reserved memory' mapping.
>>
>>      We have no good way to determine if the 'reserved memory' mapping
>>      is lost. Since the other threads are not idle, it is possible
>>      for another thread to have 'reserved' the same memory space for a
>>      different data structure. Our thread could observe that the memory
>>      is still 'reserved' but we have no way to know that the reservation
>>      isn't ours.
>>
>> You might be wondering why we can't recover from this transient
>> resource availability issue.
>>
>>      We could retry the failed mmap() commit operation, but we would
>>      again run into the issue that we no longer know which data
>>      structure 'owns' the 'reserved' memory mapping.
>>
>> It looks like PSVirtualSpace::expand_by(size_t bytes),
>> PSVirtualSpace::expand_into(PSVirtualSpace* other_space, size_t bytes),
>> PSVirtualSpaceHighToLow::expand_by(size_t bytes), and
>> PSVirtualSpaceHighToLow::expand_into(PSVirtualSpace* other_space,
>> size_t bytes) in
>> src/share/vm/gc_implementation/parallelScavenge/psVirtualspace.cpp
>> have the same issue. Why aren't you fixing them?
>>
>>      Those functions do indeed appear to be suffering from the same
>>      assumption, but those calls are in GC code. A follow-up bug will be
>>      filed so that someone on the GC team can properly diagnose, fix and
>>      test that code.
>>
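
The reserve-then-commit pattern described in the gory details can be
sketched in C++ roughly as follows. This is a minimal illustration
under stated assumptions, not the code in the webrev: reserve_sketch()
and commit_reserved_sketch() are hypothetical names, std::abort()
stands in for the VM's vm_exit_out_of_memory() path, and the mmap()
flags are the usual Linux/Solaris ones for an anonymous private
mapping. The point is that a failed MAP_FIXED commit over a reserved
range is treated as unrecoverable, because the old reservation may no
longer be intact.

```cpp
#include <sys/mman.h>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Reserve address space only: no access rights and no swap reservation.
// Returns nullptr if the reservation fails.
void* reserve_sketch(std::size_t bytes) {
  void* p = mmap(nullptr, bytes, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}

// Commit part of a previously reserved range by remapping it in place.
// If mmap() fails here, the earlier reservation may already be gone, so
// this sketch reports the error and terminates instead of returning a
// failure code to a caller that would assume the range is still ours.
void commit_reserved_sketch(void* addr, std::size_t bytes) {
  void* res = mmap(addr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0);
  if (res == MAP_FAILED) {
    int err = errno;  // save errno before calling anything else
    std::fprintf(stderr, "commit_reserved_sketch(%p, %zu) failed; errno=%d\n",
                 addr, bytes, err);
    std::abort();     // stand-in for an orderly VM exit
  }
}
```

A caller would reserve a range once and then commit pieces of it as
needed. A plain commit that is not backed by an earlier reservation
could still return an error to its caller, which is the reason given
above for adding a separate entry point rather than changing
os::commit_memory().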