code review (round 1) for memory commit failure fix (8013057)

David Holmes david.holmes at oracle.com
Tue May 28 21:41:30 PDT 2013


This is making my head hurt. :(

I had missed the fact that this updated form basically caused dead code
at those places which already checked for os::commit_memory failing. 
That is certainly undesirable. And looking at this again, particularly 
in relation to the large pages issue (where we would, it seems, need to 
reestablish the reservation after a failure) perhaps the more 
appropriate fix is for all callers of os::commit_memory to check for 
failure and have the call sites determine whether vm_exit is needed.

David

On 29/05/2013 8:50 AM, Daniel D. Daugherty wrote:
> More replies embedded below...
>
>
> On 5/28/13 3:51 PM, Stefan Karlsson wrote:
>> On 5/28/13 11:03 PM, Daniel D. Daugherty wrote:
>>> On 5/28/13 2:47 PM, Stefan Karlsson wrote:
>>>> On 5/28/13 10:08 PM, Daniel D. Daugherty wrote:
>>>>> Stefan,
>>>>>
>>>>> Thanks for getting back to me! Replies embedded below...
>>>>>
>>>>>
>>>>> On 5/28/13 1:02 PM, Stefan Karlsson wrote:
>>>>>> On 5/28/13 5:55 PM, Daniel D. Daugherty wrote:
>>>>>>> Stefan,
>>>>>>>
>>>>>>> Thanks for the re-review! Replies embedded below.
>>>>>>>
>>>>>>>
>>>>>>> On 5/28/13 2:56 AM, Stefan Karlsson wrote:
>>>>>>>> On 05/24/2013 08:23 PM, Daniel D. Daugherty wrote:
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> I have a revised version of the proposed fix for the following
>>>>>>>>> bug:
>>>>>>>>>
>>>>>>>>>     8013057 assert(_needs_gc ||
>>>>>>>>> SafepointSynchronize::is_at_safepoint())
>>>>>>>>>             failed: only read at safepoint
>>>>>>>>>
>>>>>>>>> Here are the (round 1) webrev URLs:
>>>>>>>>>
>>>>>>>>> OpenJDK:
>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8013057-webrev/1-hsx25/
>>>>>>>>> Internal:
>>>>>>>>> http://javaweb.us.oracle.com/~ddaugher/8013057-webrev/1-hsx25/
>>>>>>>>
>>>>>>>> Your patch exits the VM if we fail to get memory in
>>>>>>>> os::commit_memory. There are a couple of places where we already
>>>>>>>> have error messages, what should we do about them? For example:
>>>>>>>>   if (!os::commit_memory((char*)guard_page, _page_size,
>>>>>>>> _page_size)) {
>>>>>>>>     // Do better than this for Merlin
>>>>>>>>     vm_exit_out_of_memory(_page_size, OOM_MMAP_ERROR, "card
>>>>>>>> table last card");
>>>>>>>>   }
>>>>>>>
>>>>>>> Actually, my patch exits the VM only on Linux and Solaris, and only
>>>>>>> for certain mmap() error code values, so we shouldn't do anything with
>>>>>>> existing code that exits on os::commit_memory() failures.
>>>>>>>
>>>>>>> For the specific example above, it is in platform independent code in
>>>>>>> src/share/vm/memory/cardTableModRefBS.cpp so we definitely don't want
>>>>>>> to remove that check. That would be bad for non-Linux and non-Solaris
>>>>>>> platforms.
>>>>>>
>>>>>> I talked to the GC team today, and we don't want to lose the
>>>>>> hints that tell what the memory is allocated for. The code you're
>>>>>> suggesting will report the following, for the example above, on
>>>>>> Linux/Solaris:
>>>>>>
>>>>>> +      warning("INFO: os::commit_memory(" PTR_FORMAT ", " SIZE_FORMAT
>>>>>> +              ", " SIZE_FORMAT ", %d) failed; errno=%d", addr, size,
>>>>>> +              alignment_hint, exec, err);
>>>>>> +      vm_exit_out_of_memory(size, OOM_MMAP_ERROR,
>>>>>> +                            "committing reserved memory.");
>>>>>>
>>>>>> and for the other platforms:
>>>>>>
>>>>>>            vm_exit_out_of_memory(_page_size, OOM_MMAP_ERROR, "card table
>>>>>>        last card");
>>>>>>
>>>>>> We'd like to get the "card table last card" string into the
>>>>>> Linux/Solaris error message, if possible.
>>>>>
>>>>> Well, it is software so just about anything is possible... :-)
>>>>>
>>>>> How about a new optional parameter to os::commit_memory() where an
>>>>> "alt_mesg" can be passed in? If alt_mesg != NULL (and we're on a
>>>>> platform that calls vm_exit_out_of_memory() in the platform dependent
>>>>> code), we'll use the alt_mesg instead of "committing reserved memory".
>>>>> Does that sound acceptable?
>>>>>
>>>>> Of course, I'll have to revisit and potentially update many more calls
>>>>> to os::commit_memory().
>>>> I need to think about that. I don't think it looks that nice if we
>>>> duplicate that string like:
>>>> if (!os::commit_memory(..., "card table last card")) {
>>>>   vm_exit_out_of_memory(_page_size, OOM_MMAP_ERROR, "card table last
>>>> card");
>>>> }
>>>
>>> Could solve that with a local string...
>
> I guess you didn't like the local string idea?
>
>
>>>
>>>
>>>> Maybe we could have a:
>>>> os::commit_memory_or_exit(..., "card table last card")
>>>
>>> So we would have both commit_memory() and commit_memory_or_exit()?
>>> Not liking that... Also that would require more new code on MacOS X
>>> and Windows that we don't currently need.
>>>
>>
>> But now you have two vm_exit_out_of_memory calls in the path and you
>> need to figure out which one is going to be called. I'm not liking
>> that either.
>>
>> I'm sure you could make commit_memory_or_exit a thin wrapper around
>> os::commit_memory.
>>
>> os::commit_memory_or_exit(..., msg) {
>>   if (!os::commit_memory(..., no_exit_on_OOM)) {
>>     vm_exit_out_of_memory(...)
>>   }
>> }
>>
>> or whatever makes the caller code a bit cleaner.
>
> The original version of this patch had two thin wrappers around
> the internal guts of platform specific commit memory. One was
> the original os::commit_memory() and the other was the new
> os::commit_reserved_memory(). I added the new API for the wrong
> reasons which you pointed out (thank you!). So I undid the
> new API and rolled the logic into the existing API.
>
> And now, for a completely different reason, you're asking me
> to add back in another API. Are you trying to make my head
> explode? :D :D :D
>
> So what you're looking for is to replace a block like this:
>
>    if (!os::commit_memory((char*)guard_page, _page_size, _page_size)) {
>      // Do better than this for Merlin
>      vm_exit_out_of_memory(_page_size, OOM_MMAP_ERROR, "card table last
> card");
>    }
>
> with the new function call like this:
>
>    os::commit_memory_or_exit((char*)guard_page, _page_size, _page_size,
>                              "card table last card");
>
> where there is no return value because if the new function fails
> to commit the memory _for any reason_, then we will call
> vm_exit_out_of_memory() from the platform specific code. This will only
> be used by existing places that have their own error handling for
> os::commit_memory() failures. Cool.
>
> My existing change to VirtualSpace() stays the same because on platforms
> where we can recover from os::commit_memory() failures we will do so.
> I think I'm OK with this, but it will increase the footprint of the fix
> to larger than it was in 'round 0' (in number of files touched).
>
>
>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~dcubed/8013057-webrev/1-hsx25/src/os/linux/vm/os_linux.cpp.frames.html
>>>>>>>>
>>>>>>>>  Lines 2564-2578:
>>>>>>>>
>>>>>>>> I don't think we want to exit immediately if we can't get large
>>>>>>>> pages. With this change JVMs will fail to run if too few large
>>>>>>>> pages have been set up. I think this will affect a lot of users.
>>>>>>>>
>>>>>>>> This code path should probably be handled by:
>>>>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8007074
>>>>>>>
>>>>>>> I suppose that depends on whether the failure to get large pages
>>>>>>> causes the underlying reservation mapping to get lost.
>>>>>> Yes, we'll lose the reservation but we'll get it back a couple of
>>>>>> lines below in os::commit_memory. That's why 8007074 wasn't found
>>>>>> immediately.
>>>>>
>>>>> Ummmmm. As soon as the reservation is lost, then something else in
>>>>> the same process can come along and mmap() that memory... It could be
>>>>> a native library over which you have no control...
>>>>
>>>> I agree about the dangers with this code. I'm just concerned that
>>>> this patch will shut down JVMs on machines that end up using all
>>>> their large pages or get their large pages fragmented. This will probably
>>>> make the default Large Pages implementation in HotSpot unusable on
>>>> Linux.
>>>
>>> Hopefully more testing will help me understand more about how things
>>> work on Linux. Right now, I'm having trouble just getting to the
>>> regular failure point let alone a large pages failure...
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>> It also
>>>>>>> depends on the errno value when large page mmap() fails.
>>>>>>
>>>>>> ENOMEM
>>>>>>
>>>>>>>
>>>>>>> If you prefer, I can back out lines 2606-2621 in the following
>>>>>>> function:
>>>>>>>
>>>>>>> 2592 bool os::pd_commit_memory(char* addr, size_t size, size_t alignment_hint,
>>>>>>> 2593                        bool exec) {
>>>>>>> 2594   if (UseHugeTLBFS && alignment_hint > (size_t)vm_page_size()) {
>>>>>>> 2595     int prot = exec ? PROT_READ|PROT_WRITE|PROT_EXEC : PROT_READ|PROT_WRITE;
>>>>>>> 2596     uintptr_t res =
>>>>>>> 2597       (uintptr_t) ::mmap(addr, size, prot,
>>>>>>> 2598                          MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_HUGETLB,
>>>>>>> 2599                          -1, 0);
>>>>>>> 2600     if (res != (uintptr_t) MAP_FAILED) {
>>>>>>> 2601       if (UseNUMAInterleaving) {
>>>>>>> 2602         numa_make_global(addr, size);
>>>>>>> 2603       }
>>>>>>> 2604       return true;
>>>>>>> 2605     }
>>>>>>> 2606
>>>>>>> 2607     int err = errno;  // save errno from mmap() call above
>>>>>>> 2608     switch (err) {
>>>>>>> 2609     case EBADF:
>>>>>>> 2610     case EINVAL:
>>>>>>> 2611     case ENOTSUP:
>>>>>>> 2612       break;
>>>>>>> 2613
>>>>>>> 2614     default:
>>>>>>> 2615       warning("INFO: os::commit_memory(" PTR_FORMAT ", " SIZE_FORMAT
>>>>>>> 2616               ", " SIZE_FORMAT ", %d) failed; errno=%d", addr, size,
>>>>>>> 2617               alignment_hint, exec, err);
>>>>>>> 2618       vm_exit_out_of_memory(size, OOM_MMAP_ERROR,
>>>>>>> 2619                             "committing reserved memory.");
>>>>>>> 2620       break;
>>>>>>> 2621     }
>>>>>>> 2622     // Fall through and try to use small pages
>>>>>>> 2623   }
>>>>>>> 2624
>>>>>>> 2625   if (pd_commit_memory(addr, size, exec)) {
>>>>>>> 2626     realign_memory(addr, size, alignment_hint);
>>>>>>> 2627     return true;
>>>>>>> 2628   }
>>>>>>> 2629   return false;
>>>>>>> 2630 }
>>>>>>> and then I can leave that potential location for your work on
>>>>>>> 8007074.
>>>>>>
>>>>>> I think that would be good.
>>>>>
>>>>> Based on your reply above that the reservation can be lost, now
>>>>> I need to be convinced why you think it is OK to leave it for
>>>>> now...
>>>>
>>>> I just see that we have two bad solutions at the moment: one that
>>>> exits the JVM and the other that _might_ crash the JVM. I think
>>>> neither is satisfactory for the users.
>>>
>>> The current situation is worse. What happens if the transient out of
>>> memory condition doesn't crash the JVM, it just corrupts that JVM in
>>> subtle ways... and the JVM just propagates the corrupted data.... :-(
>> If you think this is the correct way forward, could you do this as a
>> patch for 8007074 and handle the politics around doing this?
>
> Hmmm. Let's see. Your choices are:
>
> 1) definitive failure message from the VM when you've run out of
>     resources (even temporarily)
> 2) possible random corruption of the Java process without (easy) detection
>
> Which do you think political people will prefer?
>
> Basically, what's happening here is that Linux and Solaris both lose
> critical information on resource exhaustion. In this specific case, both
> Linux and Solaris are weaker than MacOS X and Win*. Unless you can think
> of a way to add an interceptor layer to prevent mmap() from screwing us
> on these two platforms...
>
>
>> Just to be clear. Doing the following:
>> $ echo 1 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>
>> will most likely cause all subsequent JVM invocations to fail at
>> startup with this patch.
>
> So I presume that a count of zero in that field means "don't even
> try to ask for huge pages"...
>
> We'll have to figure out if that is really what's going to happen.
> I do think you're right, but I'll verify it just for completeness.
>
> Dan
>
>
>>
>> thanks,
>> StefanK
>>
>>> Dan
>>>
>>>
>>>>
>>>> StefanK
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Before I back out the code, I would be interested in exercising it
>>>>>>> in a large page config on a Linux machine. Can you give me some info
>>>>>>> about how to enable large pages on Linux?
>>>>>>
>>>>>> I have the instructions in the bug report. Contact me if that
>>>>>> doesn't work.
>>>>>
>>>>> I thought of that after I sent my reply this morning. I'm doing
>>>>> basic testing on Ron D's Linux machine right now. Large pages
>>>>> will be next...
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I'll have to see about getting access to a Linux machine. I don't
>>>>>>> have one of those in my lab. (More on that below)
>>>>>>>
>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~dcubed/8013057-webrev/1-hsx25/src/os/linux/vm/os_linux.cpp.frames.html
>>>>>>>>
>>>>>>>>  Lines 2614-2620:
>>>>>>>>
>>>>>>>> Do we ever end up failing like this on Linux? Have you been able
>>>>>>>> to reproduce this on Linux?
>>>>>>>
>>>>>>> According to this bug:
>>>>>>>
>>>>>>>     JDK-6843484 os::commit_memory() failures are not handled
>>>>>>> properly on linux
>>>>>>> https://jbs.oracle.com/bugs/browse/JDK-6843484
>>>>>>>
>>>>>>> we do run into this issue on Linux. However, I have not tried my
>>>>>>> reproducer on a Linux machine. I'm a bit worried about doing that
>>>>>>> since I swamped my local Solaris server so badly that I had to power
>>>>>>> cycle it Friday night.
>>>>>>>
>>>>>>> I will investigate getting access to a remote Linux machine and I'll
>>>>>>> check into how they get rebooted, get unstuck, etc... I'm worried
>>>>>>> about screwing up someone else's machine so I'll see about getting
>>>>>>> my own...
>>>>>>
>>>>>> OK.
>>>>>>
>>>>>> StefanK
>>>>>>>
>>>>>>> Dan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> StefanK
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Testing:
>>>>>>>>> - Aurora Adhoc vm.quick batch for all OSes in the following
>>>>>>>>> configs:
>>>>>>>>>   {Client VM, Server VM} x {fastdebug} x {-Xmixed}
>>>>>>>>> - I've created a standalone Java stress test with a shell script
>>>>>>>>>   wrapper that reproduces the failing code paths on my Solaris X86
>>>>>>>>>   server. This test will not be integrated since running the
>>>>>>>>> machine
>>>>>>>>>   out of swap space is very disruptive (crashes the window system,
>>>>>>>>>   causes various services to exit, etc.)
>>>>>>>>>
>>>>>>>>> Gory details are below. As always, comments, questions and
>>>>>>>>> suggestions are welcome.
>>>>>>>>>
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Gory Details:
>>>>>>>>>
>>>>>>>>> The VirtualSpace data structure is built on top of the
>>>>>>>>> ReservedSpace
>>>>>>>>> data structure. VirtualSpace presumes that failed
>>>>>>>>> os::commit_memory()
>>>>>>>>> calls do not affect the underlying ReservedSpace memory mappings.
>>>>>>>>> That assumption is true on MacOS X and Windows, but it is not true
>>>>>>>>> on Linux or Solaris. The mmap() system call on Linux or Solaris
>>>>>>>>> can
>>>>>>>>> lose previous mappings in the event of certain errors. On MacOS X,
>>>>>>>>> the mmap() system call clearly states that previous mappings are
>>>>>>>>> replaced only on success. On Windows, a different set of APIs are
>>>>>>>>> used and they do not document any loss of previous mappings.
>>>>>>>>>
>>>>>>>>> The solution is to implement the proper failure checks in the
>>>>>>>>> os::commit_memory() implementations on Linux and Solaris. On
>>>>>>>>> MacOS X
>>>>>>>>> and Windows, no additional checks are needed.
>>>>>>>>>
>>>>>>>>> There is also a secondary change where some of the
>>>>>>>>> pd_commit_memory()
>>>>>>>>> calls were calling os::commit_memory() instead of calling their
>>>>>>>>> sibling
>>>>>>>>> os::pd_commit_memory(). This resulted in double NMT tracking so
>>>>>>>>> this
>>>>>>>>> has also been fixed. There were also some incorrect mmap() return
>>>>>>>>> value checks which have been fixed.
>>>>>>>>>
>>>>>>>>> Just to be clear: This fix simply properly detects the "out of
>>>>>>>>> swap
>>>>>>>>> space" condition on Linux and Solaris and causes the VM to fail
>>>>>>>>> in a
>>>>>>>>> more orderly fashion with a message that looks like this:
>>>>>>>>>
>>>>>>>>> The Java process' stderr will show:
>>>>>>>>>
>>>>>>>>> INFO: os::commit_memory(0xfffffd7fb2522000, 4096, 4096, 0)
>>>>>>>>> failed; errno=11
>>>>>>>>> #
>>>>>>>>> # There is insufficient memory for the Java Runtime Environment
>>>>>>>>> to continue.
>>>>>>>>> # Native memory allocation (mmap) failed to map 4096 bytes for
>>>>>>>>> committing reserved memory.
>>>>>>>>> # An error report file with more information is saved as:
>>>>>>>>> # /work/shared/bugs/8013057/looper.03/hs_err_pid9111.log
>>>>>>>>>
>>>>>>>>> The hs_err_pid file will have the more verbose info:
>>>>>>>>>
>>>>>>>>> #
>>>>>>>>> # There is insufficient memory for the Java Runtime Environment
>>>>>>>>> to continue.
>>>>>>>>> # Native memory allocation (mmap) failed to map 4096 bytes for
>>>>>>>>> committing reserved memory.
>>>>>>>>> # Possible reasons:
>>>>>>>>> #   The system is out of physical RAM or swap space
>>>>>>>>> #   In 32 bit mode, the process size limit was hit
>>>>>>>>> # Possible solutions:
>>>>>>>>> #   Reduce memory load on the system
>>>>>>>>> #   Increase physical memory or swap space
>>>>>>>>> #   Check if swap backing store is full
>>>>>>>>> #   Use 64 bit Java on a 64 bit OS
>>>>>>>>> #   Decrease Java heap size (-Xmx/-Xms)
>>>>>>>>> #   Decrease number of Java threads
>>>>>>>>> #   Decrease Java thread stack sizes (-Xss)
>>>>>>>>> #   Set larger code cache with -XX:ReservedCodeCacheSize=
>>>>>>>>> # This output file may be truncated or incomplete.
>>>>>>>>> #
>>>>>>>>> #  Out of Memory Error
>>>>>>>>> (/work/shared/bug_hunt/hsx_rt_latest/exp_8013057/src/os/s
>>>>>>>>> olaris/vm/os_solaris.cpp:2791), pid=9111, tid=21
>>>>>>>>> #
>>>>>>>>> # JRE version: Java(TM) SE Runtime Environment (8.0-b89) (build
>>>>>>>>> 1.8.0-ea-b89)
>>>>>>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM
>>>>>>>>> (25.0-b33-bh_hsx_rt_exp_8013057_dcu
>>>>>>>>> bed-product-fastdebug mixed mode solaris-amd64 compressed oops)
>>>>>>>>> # Core dump written. Default location:
>>>>>>>>> /work/shared/bugs/8013057/looper.03/core
>>>>>>>>> or core.9111
>>>>>>>>> #
>>>>>>>>>
>>>>>>>>> You might be wondering why we are assuming that the failed mmap()
>>>>>>>>> commit operation has lost the 'reserved memory' mapping.
>>>>>>>>>
>>>>>>>>>     We have no good way to determine if the 'reserved memory'
>>>>>>>>> mapping
>>>>>>>>>     is lost. Since all the other threads are not idle, it is
>>>>>>>>> possible
>>>>>>>>>     for another thread to have 'reserved' the same memory space
>>>>>>>>> for a
>>>>>>>>>     different data structure. Our thread could observe that the
>>>>>>>>> memory
>>>>>>>>>     is still 'reserved' but we have no way to know that the
>>>>>>>>> reservation
>>>>>>>>>     isn't ours.
>>>>>>>>>
>>>>>>>>> You might be wondering why we can't recover from this transient
>>>>>>>>> resource availability issue.
>>>>>>>>>
>>>>>>>>>     We could retry the failed mmap() commit operation, but we
>>>>>>>>> would
>>>>>>>>>     again run into the issue that we no longer know which data
>>>>>>>>>     structure 'owns' the 'reserved' memory mapping. In
>>>>>>>>> particular, the
>>>>>>>>>     memory could be reserved by native code calling mmap()
>>>>>>>>> directly so
>>>>>>>>>     the VM really has no way to recover from this failure.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


More information about the hotspot-runtime-dev mailing list