code review (round 1) for memory commit failure fix (8013057)
Daniel D. Daugherty
daniel.daugherty at oracle.com
Tue May 28 09:05:45 PDT 2013
Thanks for the re-review!
For those watching closely, Zhengyu was the preliminary reviewer for the
initial version of the fix. His original review was done off-thread.
Dan
On 5/28/13 7:22 AM, Zhengyu Gu wrote:
> Looks good from NMT point of view.
>
> Thanks,
>
> -Zhengyu
>
> On May 24, 2013, at 2:23 PM, Daniel D. Daugherty wrote:
>
>> Greetings,
>>
>> I have a revised version of the proposed fix for the following bug:
>>
>> 8013057 assert(_needs_gc || SafepointSynchronize::is_at_safepoint())
>> failed: only read at safepoint
>>
>> Here are the (round 1) webrev URLs:
>>
>> OpenJDK: http://cr.openjdk.java.net/~dcubed/8013057-webrev/1-hsx25/
>> Internal: http://javaweb.us.oracle.com/~ddaugher/8013057-webrev/1-hsx25/
>>
>> Testing:
>> - Aurora Adhoc vm.quick batch for all OSes in the following configs:
>> {Client VM, Server VM} x {fastdebug} x {-Xmixed}
>> - I've created a standalone Java stress test with a shell script
>> wrapper that reproduces the failing code paths on my Solaris X86
>> server. This test will not be integrated since running the machine
>> out of swap space is very disruptive (crashes the window system,
>> causes various services to exit, etc.)
>>
>> Gory details are below. As always, comments, questions and
>> suggestions are welcome.
>>
>> Dan
>>
>>
>> Gory Details:
>>
>> The VirtualSpace data structure is built on top of the ReservedSpace
>> data structure. VirtualSpace presumes that failed os::commit_memory()
>> calls do not affect the underlying ReservedSpace memory mappings.
>> That assumption is true on MacOS X and Windows, but it is not true
>> on Linux or Solaris. The mmap() system call on Linux or Solaris can
>> lose previous mappings in the event of certain errors. On MacOS X,
>> the mmap() man page clearly states that previous mappings are
>> replaced only on success. On Windows, a different set of APIs is
>> used and they do not document any loss of previous mappings.
>>
>> The solution is to implement the proper failure checks in the
>> os::commit_memory() implementations on Linux and Solaris. On MacOS X
>> and Windows, no additional checks are needed.
>>
>> There is also a secondary change where some of the pd_commit_memory()
>> implementations were calling os::commit_memory() instead of calling
>> their sibling os::pd_commit_memory(). This resulted in double NMT
>> tracking so this has also been fixed. There were also some incorrect
>> mmap() return value checks which have been fixed.
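
The double tracking described above is easy to reproduce in miniature. The sketch below is not HotSpot code: the toy_os namespace, the byte counter standing in for NMT, and the "hinted" function variants are all hypothetical names chosen for illustration. It only shows why a platform-layer function that re-enters the shared wrapper gets counted twice, and why calling the sibling platform-layer function avoids that.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for NMT: total bytes recorded as committed.
static std::size_t g_tracked_bytes = 0;

namespace toy_os {

// Platform layer: would perform the real commit; does no tracking.
bool pd_commit_memory(char* addr, std::size_t bytes) {
  assert(bytes > 0);
  (void)addr;
  return true;  // pretend the commit succeeded
}

// Shared layer: delegates to the platform layer, then tracks once.
bool commit_memory(char* addr, std::size_t bytes) {
  bool ok = pd_commit_memory(addr, bytes);
  if (ok) g_tracked_bytes += bytes;
  return ok;
}

// Buggy platform variant: re-enters the shared layer, which tracks.
bool pd_commit_memory_hinted_buggy(char* addr, std::size_t bytes,
                                   std::size_t hint) {
  (void)hint;
  return commit_memory(addr, bytes);
}

// Fixed platform variant: calls its untracked sibling instead.
bool pd_commit_memory_hinted(char* addr, std::size_t bytes,
                             std::size_t hint) {
  (void)hint;
  return pd_commit_memory(addr, bytes);
}

// Shared-layer wrapper over the buggy variant: tracks a second time.
bool commit_memory_hinted_buggy(char* addr, std::size_t bytes,
                                std::size_t hint) {
  bool ok = pd_commit_memory_hinted_buggy(addr, bytes, hint);
  if (ok) g_tracked_bytes += bytes;
  return ok;
}

// Shared-layer wrapper over the fixed variant: tracks exactly once.
bool commit_memory_hinted(char* addr, std::size_t bytes,
                          std::size_t hint) {
  bool ok = pd_commit_memory_hinted(addr, bytes, hint);
  if (ok) g_tracked_bytes += bytes;
  return ok;
}

}  // namespace toy_os
```

With these definitions, a single 4096-byte commit through commit_memory_hinted_buggy() records 8192 bytes, while the same commit through commit_memory_hinted() records 4096.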
>>
>> Just to be clear: This fix simply detects the "out of swap space"
>> condition properly on Linux and Solaris and causes the VM to fail
>> in a more orderly fashion with a message that looks like this:
>>
>> The Java process' stderr will show:
>>
>> INFO: os::commit_memory(0xfffffd7fb2522000, 4096, 4096, 0) failed; errno=11
>> #
>> # There is insufficient memory for the Java Runtime Environment to continue.
>> # Native memory allocation (mmap) failed to map 4096 bytes for committing reserved memory.
>> # An error report file with more information is saved as:
>> # /work/shared/bugs/8013057/looper.03/hs_err_pid9111.log
>>
>> The hs_err_pid file will have the more verbose info:
>>
>> #
>> # There is insufficient memory for the Java Runtime Environment to continue.
>> # Native memory allocation (mmap) failed to map 4096 bytes for committing reserved memory.
>> # Possible reasons:
>> # The system is out of physical RAM or swap space
>> # In 32 bit mode, the process size limit was hit
>> # Possible solutions:
>> # Reduce memory load on the system
>> # Increase physical memory or swap space
>> # Check if swap backing store is full
>> # Use 64 bit Java on a 64 bit OS
>> # Decrease Java heap size (-Xmx/-Xms)
>> # Decrease number of Java threads
>> # Decrease Java thread stack sizes (-Xss)
>> # Set larger code cache with -XX:ReservedCodeCacheSize=
>> # This output file may be truncated or incomplete.
>> #
>> # Out of Memory Error (/work/shared/bug_hunt/hsx_rt_latest/exp_8013057/src/os/solaris/vm/os_solaris.cpp:2791), pid=9111, tid=21
>> #
>> # JRE version: Java(TM) SE Runtime Environment (8.0-b89) (build 1.8.0-ea-b89)
>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b33-bh_hsx_rt_exp_8013057_dcubed-product-fastdebug mixed mode solaris-amd64 compressed oops)
>> # Core dump written. Default location: /work/shared/bugs/8013057/looper.03/core or core.9111
>> #
>>
>> You might be wondering why we are assuming that the failed mmap()
>> commit operation has lost the 'reserved memory' mapping.
>>
>> We have no good way to determine if the 'reserved memory' mapping
>> is lost. Since the other threads are still running, it is possible
>> for another thread to have 'reserved' the same memory space for a
>> different data structure. Our thread could observe that the memory
>> is still 'reserved' but we have no way to know that the reservation
>> isn't ours.
>>
>> You might be wondering why we can't recover from this transient
>> resource availability issue.
>>
>> We could retry the failed mmap() commit operation, but we would
>> again run into the issue that we no longer know which data
>> structure 'owns' the 'reserved' memory mapping. In particular, the
>> memory could be reserved by native code calling mmap() directly so
>> the VM really has no way to recover from this failure.
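
The policy described in the quoted message (check the commit's mmap() result, and stop the process when the failure may have cost us the reservation) can be sketched as follows. This is a minimal illustration, not the code in the webrev: reserve_region(), commit_region() and is_recoverable_commit_error() are hypothetical names, the set of errno values treated as recoverable is an assumption, and std::abort() stands in for the VM's orderly out-of-memory exit. Note that mmap() reports failure by returning MAP_FAILED, not a null pointer.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Assumption for illustration: errors that indicate a bad request
// rather than resource exhaustion leave the old mapping intact.
static bool is_recoverable_commit_error(int err) {
  switch (err) {
    case EBADF:
    case EINVAL:
    case ENOTSUP:
      return true;   // let the caller handle it
    default:
      return false;  // e.g. ENOMEM, EAGAIN: reservation may be gone
  }
}

// Reserve address space without backing store; nullptr on failure.
static char* reserve_region(std::size_t bytes) {
  void* p = ::mmap(nullptr, bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  return p == MAP_FAILED ? nullptr : static_cast<char*>(p);
}

// Commit part of a reserved region by mapping over it in place.
static bool commit_region(char* addr, std::size_t bytes) {
  assert(addr != nullptr && bytes > 0);
  void* p = ::mmap(addr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0);
  if (p != MAP_FAILED) {
    return true;
  }
  int err = errno;  // save before any call that might change it
  if (is_recoverable_commit_error(err)) {
    errno = err;
    return false;
  }
  // We cannot tell whether the reservation is still ours, so stop
  // here instead of continuing with a mapping we may not own.
  std::fprintf(stderr, "commit_region(%p, %zu) failed; errno=%d\n",
               static_cast<void*>(addr), bytes, err);
  std::abort();
}
```

A caller would reserve once with reserve_region() and then commit pages on demand with commit_region(), treating a false return as an ordinary failure; any other failure never returns to the caller.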