code review (round 1) for memory commit failure fix (8013057)
Daniel D. Daugherty
daniel.daugherty at oracle.com
Wed May 29 09:08:41 PDT 2013
Thanks for not getting tired of discussing this issue...
Replies embedded below...
Also, the reply chain was getting too convoluted so I've
trimmed it down...
Dan
On 5/29/13 6:33 AM, Stefan Karlsson wrote:
> On 05/29/2013 12:50 AM, Daniel D. Daugherty wrote:
>> On 5/28/13 3:51 PM, Stefan Karlsson wrote:
>>> On 5/28/13 11:03 PM, Daniel D. Daugherty wrote:
>>>> On 5/28/13 2:47 PM, Stefan Karlsson wrote:
>>>>> On 5/28/13 10:08 PM, Daniel D. Daugherty wrote:
>>>>>>
>>>>>>>>>> Here are the (round 1) webrev URLs:
>>>>>>>>>>
>>>>>>>>>> OpenJDK:
>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8013057-webrev/1-hsx25/
>>>>>>>>>> Internal:
>>>>>>>>>> http://javaweb.us.oracle.com/~ddaugher/8013057-webrev/1-hsx25/
>>>>>>>>>
>>>>>>>>> Your patch exits the VM if we fail to get memory in
>>>>>>>>> os::commit_memory. There are a couple of places where we
>>>>>>>>> already have error messages, what should we do about them? For
>>>>>>>>> example:
>>>>>>>>> if (!os::commit_memory((char*)guard_page, _page_size, _page_size)) {
>>>>>>>>>   // Do better than this for Merlin
>>>>>>>>>   vm_exit_out_of_memory(_page_size, OOM_MMAP_ERROR, "card table last card");
>>>>>>>>> }
>>>>>>>>
>>>>>>>> Actually, my patch exits the VM only on Linux and Solaris, and only
>>>>>>>> for certain mmap() error code values, so we shouldn't do anything
>>>>>>>> with existing code that exits on os::commit_memory() failures.
>>>>>>>>
>>>>>>>> For the specific example above, it is in platform independent
>>>>>>>> code in
>>>>>>>> src/share/vm/memory/cardTableModRefBS.cpp so we definitely
>>>>>>>> don't want
>>>>>>>> to remove that check. That would be bad for non-Linux and
>>>>>>>> non-Solaris
>>>>>>>> platforms.
>>>>>>>
>>>>>>> I talked to the GC team today, and we don't want to lose the
>>>>>>> hints that tell what the memory is allocated for. The code
>>>>>>> you're suggesting will report the following, for the example
>>>>>>> above, on Linux/Solaris:
>>>>>>>
>>>>>>> +    warning("INFO: os::commit_memory(" PTR_FORMAT ", " SIZE_FORMAT
>>>>>>> +            ", " SIZE_FORMAT ", %d) failed; errno=%d", addr, size,
>>>>>>> +            alignment_hint, exec, err);
>>>>>>> +    vm_exit_out_of_memory(size, OOM_MMAP_ERROR,
>>>>>>> +                          "committing reserved memory.");
>>>>>>>
>>>>>>> and for the other platforms:
>>>>>>>
>>>>>>> vm_exit_out_of_memory(_page_size, OOM_MMAP_ERROR, "card table last card");
>>>>>>>
>>>>>>> We'd like to get the "card table last card" string into the Linux/Solaris error message, if possible.
>>>>>>
>>>>>> Well, it is software so just about anything is possible... :-)
>>>>>>
>>>>>> How about a new optional parameter to os::commit_memory() where an
>>>>>> "alt_mesg" can be passed in? If alt_mesg != NULL (and we're on a
>>>>>> platform that calls vm_exit_out_of_memory() in the platform dependent
>>>>>> code), we'll use the alt_mesg instead of "committing reserved
>>>>>> memory". Does that sound acceptable?
>>>>>>
>>>>>> Of course, I'll have to revisit and potentially update many more
>>>>>> calls
>>>>>> to os::commit_memory().
>>>>> I need to think about that. I don't think it looks that nice if we
>>>>> duplicate that string like:
>>>>>   if (!os::commit_memory(..., "card table last card")) {
>>>>>     vm_exit_out_of_memory(_page_size, OOM_MMAP_ERROR, "card table last card");
>>>>>   }
>>>>
>>>> Could solve that with a local string...
>>
>> I guess you didn't like the local string idea?
>
> Not really. It doesn't solve the problem that we get different error
> messages on different platforms, and it adds to the noise at the call
> site.
Agreed. Let's jettison that idea.
>
>>>>> Maybe we could have a:
>>>>> os::commit_memory_or_exit(..., "card table last card")
>>>>
>>>> So we would have both commit_memory() and commit_memory_or_exit()?
>>>> Not liking that... Also that would require more new code on MacOS X
>>>> and Windows that we don't currently need.
>>>>
>>>
>>> But now you have two vm_exit_out_of_memory calls in the path and you
>>> need to figure out which one is going to be called. I'm not liking
>>> that either.
>>>
>>> I'm sure you could make commit_memory_or_exit a thin wrapper around
>>> os::commit_memory.
>>>
>>> os::commit_memory_or_exit(..., msg) {
>>>   if (!os::commit_memory(..., no_exit_on_OOM)) {
>>>     vm_exit_out_of_memory(...)
>>>   }
>>> }
>>>
>>> or whatever that makes the caller code a bit cleaner.
>>
>> The original version of this patch had two thin wrappers around
>> the internal guts of platform specific commit memory. One was
>> the original os::commit_memory() and the other was the new
>> os::commit_reserved_memory(). I added the new API for the wrong
>> reasons which you pointed out (thank you!). So I undid the
>> new API and rolled the logic into the existing API.
>>
>> And now, for a completely different reason, you're asking me
>> to add back in another API. Are you trying to make my head
>> explode? :D :D :D
>>
>> So what you're looking for is to replace a block like this:
>>
>> if (!os::commit_memory((char*)guard_page, _page_size, _page_size)) {
>>   // Do better than this for Merlin
>>   vm_exit_out_of_memory(_page_size, OOM_MMAP_ERROR, "card table last card");
>> }
>>
>> with the new function call like this:
>>
>> os::commit_memory_or_exit((char*)guard_page, _page_size, _page_size,
>>                           "card table last card");
>>
>> where there is no return value because if the new function fails
>> to commit the memory _for any reason_, then we will call
>> vm_exit_out_of_memory() from the platform specific code. This will only
>> be used by existing places that have their own error handling for
>> os::commit_memory() failures. Cool.
>
> Yes, for the places that already call vm_exit_out_of_memory().
Cool. Getting closure here... :-)
>
>> My existing change to VirtualSpace() stays the same because on platforms
>> where we can recover from os::commit_memory() failures we will do so.
>> I think I'm OK with this, but it will increase the footprint of the fix
>> to larger than it was in 'round 0' (in number of files touched).
>
> OK. Maybe you could commit most of the os::commit_memory_or_exit
> change as a separate change, done before the fix for 8013057.
I'm here now so I think I want to try doing the whole change in
one webrev (except for the part that blows up large pages on
Linux; see below). Of course, my head might explode, we'll have
to see...
>>>>>>>>
>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8013057-webrev/1-hsx25/src/os/linux/vm/os_linux.cpp.frames.html
>>>>>>>>>
>>>>>>>>> Lines 2564-2578:
>>>>>>>>>
>>>>>>>>> I don't think we want to exit immediately if we can't get
>>>>>>>>> large pages. With this change, JVMs will stop running if too few
>>>>>>>>> large pages have been set up. I think this will affect a lot of
>>>>>>>>> users.
>>>>>>>>>
>>>>>>>>> This code path should probably be handled by:
>>>>>>>>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=8007074
>>>>>>>>
>>>>>>>> I suppose that depends on whether the failure to get large pages
>>>>>>>> causes the underlying reservation mapping to get lost.
>>>>>>> Yes, we'll lose the reservation but we'll get it back a couple
>>>>>>> of rows below in os::commit_memory. That's why 8007074 wasn't
>>>>>>> found immediately.
>>>>>>
>>>>>> Ummmmm. As soon as the reservation is lost, then something else in
>>>>>> the same process can come along and mmap() that memory... It could be
>>>>>> a native library over which you have no control...
>>>>>
>>>>> I agree about the dangers with this code. I'm just concerned that
>>>>> this patch will shut down JVMs on machines that end up using all
>>>>> large pages or get their large pages fragmented. This will
>>>>> probably make the default Large Pages implementation in HotSpot
>>>>> unusable on Linux.
>>>>
>>>> Hopefully more testing will help me understand more about how things
>>>> work on Linux. Right now, I'm having trouble just getting to the
>>>> regular failure point let alone a large pages failure...
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> It also
>>>>>>>> depends on the errno value when large page mmap() fails.
>>>>>>>
>>>>>>> ENOMEM
>>>>>>>
>>>>>>>>
>>>>>>>> If you prefer, I can back out lines 2606-2621 in the following
>>>>>>>> function:
>>>>>>>>
>>>>>>>> 2592 bool os::pd_commit_memory(char* addr, size_t size, size_t alignment_hint,
>>>>>>>> 2593 bool exec) {
>>>>>>>> 2594 if (UseHugeTLBFS && alignment_hint > (size_t)vm_page_size()) {
>>>>>>>> 2595 int prot = exec ? PROT_READ|PROT_WRITE|PROT_EXEC : PROT_READ|PROT_WRITE;
>>>>>>>> 2596 uintptr_t res =
>>>>>>>> 2597 (uintptr_t) ::mmap(addr, size, prot,
>>>>>>>> 2598 MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS|MAP_HUGETLB,
>>>>>>>> 2599 -1, 0);
>>>>>>>> 2600 if (res != (uintptr_t) MAP_FAILED) {
>>>>>>>> 2601 if (UseNUMAInterleaving) {
>>>>>>>> 2602 numa_make_global(addr, size);
>>>>>>>> 2603 }
>>>>>>>> 2604 return true;
>>>>>>>> 2605 }
>>>>>>>> 2606
>>>>>>>> 2607 int err = errno; // save errno from mmap() call above
>>>>>>>> 2608 switch (err) {
>>>>>>>> 2609 case EBADF:
>>>>>>>> 2610 case EINVAL:
>>>>>>>> 2611 case ENOTSUP:
>>>>>>>> 2612 break;
>>>>>>>> 2613
>>>>>>>> 2614 default:
>>>>>>>> 2615 warning("INFO: os::commit_memory(" PTR_FORMAT ", " SIZE_FORMAT
>>>>>>>> 2616 ", " SIZE_FORMAT ", %d) failed; errno=%d", addr, size,
>>>>>>>> 2617 alignment_hint, exec, err);
>>>>>>>> 2618 vm_exit_out_of_memory(size, OOM_MMAP_ERROR,
>>>>>>>> 2619 "committing reserved memory.");
>>>>>>>> 2620 break;
>>>>>>>> 2621 }
>>>>>>>> 2622 // Fall through and try to use small pages
>>>>>>>> 2623 }
>>>>>>>> 2624
>>>>>>>> 2625 if (pd_commit_memory(addr, size, exec)) {
>>>>>>>> 2626 realign_memory(addr, size, alignment_hint);
>>>>>>>> 2627 return true;
>>>>>>>> 2628 }
>>>>>>>> 2629 return false;
>>>>>>>> 2630 }
>>>>>>>> and then I can leave that potential location for your work on
>>>>>>>> 8007074.
>>>>>>>
>>>>>>> I think that would be good.
>>>>>>
>>>>>> Based on your reply above that the reservation can be lost, now
>>>>>> I need to be convinced why you think it is OK to leave it for
>>>>>> now...
>>>>>
>>>>> I just see that we have two bad solutions at the moment: one that
>>>>> exits the JVM and the other that _might_ crash the JVM. I think
>>>>> neither is satisfactory for the users.
>>>>
>>>> The current situation is worse. What happens if the transient out of
>>>> memory condition doesn't crash the JVM, it just corrupts that JVM in
>>>> subtle ways... and the JVM just propagates the corrupted data.... :-(
>>> If you think this is the correct way forward, could you do this as a
>>> patch for 8007074 and handle the politics around doing this?
>>
>> Hmmm. Let's see. Your choices are:
>>
>> 1) definitive failure message from the VM when you've run out of
>> resources (even temporarily)
>> 2) possible random corruption of the Java process without (easy)
>> detection
>>
>> Which do you think political people will prefer?
>
> I'll defer this question to users that run and reap the benefits of
> using Large Pages on Linux.
>
> Currently, the only safe workaround for this, with and without this
> patch, is to turn off UseHugeTLBFS. Unless you are willing to accept the
> risk that the JVM will shut down, crash, or corrupt memory.
OK. I think we're on the same page (no pun intended). I just reread
what I wrote above and I realize that I wasn't clear. For my fix for
8013057, I intend to #ifdef out/comment out the code block that would
screw up large pages, with a comment that you and I agree upon. I want
to leave appropriate bread crumbs for your work on 8007074.
>
>> Basically, what's happening here is that Linux and Solaris both lose
>> critical information on resource exhaustion. In this specific case, both
>> Linux and Solaris are weaker than MacOS X and Win*. Unless you can think
>> of a way to add an interceptor layer to prevent mmap() from screwing us
>> on these two platforms...
>>
>>
>>> Just to be clear. Doing the following:
>>> $ echo 1 | sudo tee
>>> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>
>>> will most likely cause all subsequent JVM invocations to fail at
>>> startup with this patch.
>>
>> So I presume that a count of zero in that field means "don't even
>> try to ask for huge pages"...
>
> Yes. See os::Linux::hugetlbfs_sanity_check.
Cool. Thanks for the pointer. Although, I have to admit that this
large pages stuff makes me want to run for the mountains... :-)
>
>> We'll have to figure out if that is really what's going to happen.
>> I do think you're right, but I'll verify it just for completeness.
>
> OK. With the patch:
>
> $ echo 1 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/
>
> $ build/linux/linux_amd64_compiler2/product/hotspot alloc.AllocArrays
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007d7000000, 44040192, 2097152, 0) failed; errno=12
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (mmap) failed to map 44040192 bytes for committing reserved memory.
> # An error report file with more information is saved as:
> # /home/stefank/hg/hsx-gc.test/hs_err_pid29174.log
I did not mean for you to do the testing for me! Looks like things
are broken exactly as you predicted! Definitely not going to include
that part of the patch (as functional code).
> and with some more large pages, we fail after startup:
>
> $ cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> 309
>
> $ build/linux/linux_amd64_compiler2/product/hotspot alloc.AllocArrays
> Running AllocArrays
> Duration : 60 seconds
> ObjectSize : [1024-20480] bytes
> NullBeforeAlloc : true
> Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007e7000000, 268435456, 2097152, 0) failed; errno=12
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (mmap) failed to map 268435456 bytes for committing reserved memory.
> # An error report file with more information is saved as:
> # /home/stefank/hg/hsx-gc.test/hs_err_pid29526.log
So when we have large pages, we try to allocate and then commit as
many large pages as the system will give us. When we run out, we
drop back to small pages. If more large pages become available later,
we'll try to use those. Lather, rinse, repeat.
OK. I think I know where I need to investigate for the next iteration
of the fix...
Again, thanks for your continued input.
Dan