code review (round 1) for memory commit failure fix (8013057)

Daniel D. Daugherty daniel.daugherty at oracle.com
Tue May 28 07:52:40 PDT 2013


Thanks!

Dan


On 5/27/13 6:37 PM, David Holmes wrote:
> Perfect!
>
> Thanks Dan.
>
> David
> -----
>
> On 28/05/2013 12:43 AM, Daniel D. Daugherty wrote:
>> Thanks for the rereview!
>>
>>
>> On 5/26/13 10:50 PM, David Holmes wrote:
>>> Hi Dan,
>>>
>>> This looks good to me. My only minor suggestion is that in:
>>>
>>>   warning("INFO: os::commit_memory(" PTR_FORMAT ", " SIZE_FORMAT
>>>           ", %d) failed; errno=%d", addr, size, exec, err);
>>>
>>> you also use strerror to output more meaningful text about the error.
>>
>> Had to reread this a couple of times. My brain kept getting stuck on
>> the fact that I didn't use strerror... Once I realized that you meant:
>>
>>     you can also use strerror to output more meaningful text about the
>>     error.
>>
>> it made more sense. Not enough coffee yet... Yes, I can use strerror()
>> and that would be better. I don't normally use strerror() because I prefer
>> numeric values. Old habits die hard...
>>
>> How about something like this:
>>
>> $ diff src/os/bsd/vm/os_bsd.cpp{.cr1,}
>> 2002c2002,2003
>> <           ", %d) failed; errno=%d", addr, size, exec, err);
>> ---
>>  >           ", %d) failed; error='%s' (errno=%d)", addr, size, exec,
>>  >           strerror(err), err);
>>
>> $ diff src/os/linux/vm/os_linux.cpp{.cr1,}
>> 2574c2574,2575
>> <             ", %d) failed; errno=%d", addr, size, exec, err);
>> ---
>>  >             ", %d) failed; error='%s' (errno=%d)", addr, size, exec,
>>  >             strerror(err), err);
>> 2616,2617c2617,2618
>> <               ", " SIZE_FORMAT ", %d) failed; errno=%d", addr, size,
>> <               alignment_hint, exec, err);
>> ---
>>  >               ", " SIZE_FORMAT ", %d) failed; error='%s' (errno=%d)", addr,
>>  >               size, alignment_hint, exec, strerror(err), err);
>>
>> $ diff src/os/solaris/vm/os_solaris.cpp{.cr1,}
>> 2754c2754,2755
>> <             ", %d) failed; errno=%d", addr, bytes, exec, errno);
>> ---
>>  >             ", %d) failed; error='%s' (errno=%d)", addr, bytes, exec,
>>  >             strerror(err), err);
>>
>>
>> so we get the 'meaningful text' and the errno (in parens)...
>>
>> Dan
>>
>>
>>
>>>
>>> Thanks,
>>> David
>>>
>>> On 25/05/2013 4:23 AM, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> I have a revised version of the proposed fix for the following bug:
>>>>
>>>>      8013057 assert(_needs_gc || SafepointSynchronize::is_at_safepoint())
>>>>              failed: only read at safepoint
>>>>
>>>> Here are the (round 1) webrev URLs:
>>>>
>>>> OpenJDK: http://cr.openjdk.java.net/~dcubed/8013057-webrev/1-hsx25/
>>>> Internal: http://javaweb.us.oracle.com/~ddaugher/8013057-webrev/1-hsx25/
>>>>
>>>> Testing:
>>>> - Aurora Adhoc vm.quick batch for all OSes in the following configs:
>>>>    {Client VM, Server VM} x {fastdebug} x {-Xmixed}
>>>> - I've created a standalone Java stress test with a shell script
>>>>    wrapper that reproduces the failing code paths on my Solaris X86
>>>>    server. This test will not be integrated since running the machine
>>>>    out of swap space is very disruptive (crashes the window system,
>>>>    causes various services to exit, etc.)
>>>>
>>>> Gory details are below. As always, comments, questions and
>>>> suggestions are welcome.
>>>>
>>>> Dan
>>>>
>>>>
>>>> Gory Details:
>>>>
>>>> The VirtualSpace data structure is built on top of the ReservedSpace
>>>> data structure. VirtualSpace presumes that failed os::commit_memory()
>>>> calls do not affect the underlying ReservedSpace memory mappings.
>>>> That assumption is true on MacOS X and Windows, but it is not true
>>>> on Linux or Solaris. The mmap() system call on Linux or Solaris can
>>>> lose previous mappings in the event of certain errors. On MacOS X,
>>>> the mmap() system call clearly states that previous mappings are
>>>> replaced only on success. On Windows, a different set of APIs is
>>>> used and they do not document any loss of previous mappings.
>>>>
>>>> The solution is to implement the proper failure checks in the
>>>> os::commit_memory() implementations on Linux and Solaris. On MacOS X
>>>> and Windows, no additional checks are needed.
>>>>
>>>> There is also a secondary change where some of the pd_commit_memory()
>>>> calls were calling os::commit_memory() instead of calling their sibling
>>>> os::pd_commit_memory(). This resulted in double NMT tracking so this
>>>> has also been fixed. There were also some incorrect mmap() return
>>>> value checks which have been fixed.
>>>>
>>>> Just to be clear: This fix simply properly detects the "out of swap
>>>> space" condition on Linux and Solaris and causes the VM to fail in a
>>>> more orderly fashion with a message that looks like this:
>>>>
>>>> The Java process' stderr will show:
>>>>
>>>> INFO: os::commit_memory(0xfffffd7fb2522000, 4096, 4096, 0) failed; errno=11
>>>> #
>>>> # There is insufficient memory for the Java Runtime Environment to continue.
>>>> # Native memory allocation (mmap) failed to map 4096 bytes for committing reserved memory.
>>>> # An error report file with more information is saved as:
>>>> # /work/shared/bugs/8013057/looper.03/hs_err_pid9111.log
>>>>
>>>> The hs_err_pid file will have the more verbose info:
>>>>
>>>> #
>>>> # There is insufficient memory for the Java Runtime Environment to continue.
>>>> # Native memory allocation (mmap) failed to map 4096 bytes for committing reserved memory.
>>>> # Possible reasons:
>>>> #   The system is out of physical RAM or swap space
>>>> #   In 32 bit mode, the process size limit was hit
>>>> # Possible solutions:
>>>> #   Reduce memory load on the system
>>>> #   Increase physical memory or swap space
>>>> #   Check if swap backing store is full
>>>> #   Use 64 bit Java on a 64 bit OS
>>>> #   Decrease Java heap size (-Xmx/-Xms)
>>>> #   Decrease number of Java threads
>>>> #   Decrease Java thread stack sizes (-Xss)
>>>> #   Set larger code cache with -XX:ReservedCodeCacheSize=
>>>> # This output file may be truncated or incomplete.
>>>> #
>>>> #  Out of Memory Error (/work/shared/bug_hunt/hsx_rt_latest/exp_8013057/src/os/solaris/vm/os_solaris.cpp:2791), pid=9111, tid=21
>>>> #
>>>> # JRE version: Java(TM) SE Runtime Environment (8.0-b89) (build 1.8.0-ea-b89)
>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.0-b33-bh_hsx_rt_exp_8013057_dcubed-product-fastdebug mixed mode solaris-amd64 compressed oops)
>>>> # Core dump written. Default location: /work/shared/bugs/8013057/looper.03/core or core.9111
>>>> #
>>>>
>>>> You might be wondering why we are assuming that the failed mmap()
>>>> commit operation has lost the 'reserved memory' mapping.
>>>>
>>>>      We have no good way to determine if the 'reserved memory' mapping
>>>>      is lost. Since all the other threads are not idle, it is possible
>>>>      for another thread to have 'reserved' the same memory space for a
>>>>      different data structure. Our thread could observe that the memory
>>>>      is still 'reserved' but we have no way to know that the reservation
>>>>      isn't ours.
>>>>
>>>> You might be wondering why we can't recover from this transient
>>>> resource availability issue.
>>>>
>>>>      We could retry the failed mmap() commit operation, but we would
>>>>      again run into the issue that we no longer know which data
>>>>      structure 'owns' the 'reserved' memory mapping. In particular, the
>>>>      memory could be reserved by native code calling mmap() directly so
>>>>      the VM really has no way to recover from this failure.
>>
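
For readers following the thread, here is a minimal C++ sketch of the two
points discussed above: checking a commit-style mmap() call against
MAP_FAILED (rather than NULL), and reporting both the strerror() text and
the numeric errno. It assumes Linux. The names reserve_region(),
commit_reserved() and commit_failure_message() are hypothetical and chosen
for illustration only; they are not the HotSpot functions from the webrev,
and plain %p/%zu stand in for PTR_FORMAT/SIZE_FORMAT.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <sys/mman.h>

// Hypothetical helper: builds a message in the style proposed in the thread,
// with the strerror() text first and the numeric errno in parentheses.
std::string commit_failure_message(const void* addr, size_t size, bool exec,
                                   int err) {
  char buf[256];
  std::snprintf(buf, sizeof(buf),
                "INFO: os::commit_memory(%p, %zu, %d) failed; "
                "error='%s' (errno=%d)",
                addr, size, exec ? 1 : 0, std::strerror(err), err);
  return std::string(buf);
}

// Hypothetical helper: reserves address space without backing store.
// Returns MAP_FAILED on error, as mmap() itself does.
void* reserve_region(size_t size) {
  return ::mmap(nullptr, size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

// Hypothetical helper: commits memory on top of an existing reservation.
// Returns 0 on success, or the saved errno value on failure.
int commit_reserved(void* addr, size_t size, bool exec) {
  int prot = PROT_READ | PROT_WRITE | (exec ? PROT_EXEC : 0);
  void* res = ::mmap(addr, size, prot,
                     MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0);
  if (res == MAP_FAILED) {  // mmap() reports failure with MAP_FAILED, not NULL
    int err = errno;        // save errno before another call can overwrite it
    std::fprintf(stderr, "%s\n",
                 commit_failure_message(addr, size, exec, err).c_str());
    return err;
  }
  return 0;
}
```

In the fix described in the thread, a failed commit on Linux or Solaris is
treated as fatal because the reservation may have been lost. This sketch
only returns the errno value and leaves that policy to the caller.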



More information about the hotspot-runtime-dev mailing list