RFR(s): 8077276: allocating heap with UseLargePages and HugeTLBFS may trash existing memory mappings (linux)

Stefan Karlsson stefan.karlsson at oracle.com
Thu Apr 23 08:48:11 UTC 2015



On 2015-04-22 17:20, Thomas Stüfe wrote:
> Hi Stefan,
>
> new webrev: 
> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.02/webrev/ 
> <http://cr.openjdk.java.net/%7Estuefe/webrevs/8077276/webrev.02/webrev/>
>
> See my answers inline.
>
> On Wed, Apr 22, 2015 at 2:20 PM, Stefan Karlsson 
> <stefan.karlsson at oracle.com <mailto:stefan.karlsson at oracle.com>> wrote:
>
>     Hi Thomas,
>
>     On 2015-04-21 18:31, Thomas Stüfe wrote:
>>     Hi Stefan,
>>
>>     thanks for reviewing!
>>
>>     Here is the new webrev:
>>     http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/
>>     <http://cr.openjdk.java.net/%7Estuefe/webrevs/8077276/webrev.01/webrev/>
>>
>>     I worked in all your remarks except one: the test unfortunately
>>     cannot just assert if allocation fails, because that may happen
>>     sometimes, especially with small allocation sizes
>>     (os::Linux::reserve_memory_special_huge_tlbfs_mixed() will refuse
>>     to allocate if the area cannot accomodate one large page). The
>>     test also did not assert prior to my change, so I did not want to
>>     change that behaviour.
>>
>>     I changed some more things:
>>
>>     1) Before my change,
>>     os::Linux::reserve_memory_special_huge_tlbfs_mixed() would call
>>     os::reserve_memory() or os::reserve_memory_aligned(), depending
>>     on whether req_addr was NULL or not. That was a bit unclean (a
>>     lower-level API calling a high-level API) and led to the weird
>>     workaround at os_linux.cpp:3497, where the NMT Memory tracker had
>>     to deregister a allocation which it expected to be registered
>>     again later at an upper level. Yikes.
>>
>>     Because I changed the first case to use low-level mmap, I also
>>     changed the second - replaced os::reserve_memory_aligned with low
>>     level coding. Actually I merged both branches together as good as
>>     possible, so the new code (os_linux.cpp:3484 ff) now handles
>>     alignment as well: If req_addr is set, allocation is attempted at
>>     req_addr and alignment is ignored. If req_addr is NULL, alignment
>>     is honored. This way I could get rid of the workaround for the
>>     NMT completely.
>>
>>     2) I completely rewrote the test function. It now tests three
>>     scenarios, see comment. I hope the code is clear and easy to
>>     understand.
>
>     http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/src/os/linux/vm/os_linux.cpp.frames.html
>     <http://cr.openjdk.java.net/%7Estuefe/webrevs/8077276/webrev.01/webrev/src/os/linux/vm/os_linux.cpp.frames.html>
>
>     ---
>     +  start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>     +    MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>     +  if (start == MAP_FAILED) {
>          return NULL;
>        }
>
>     You've changed the behavior of this function. It used to return
>     NULL if reserve_memory couldn't reserve memory at the requested
>     address. Your proposed patch can return another address. This will
>     be a problem for the upper layers when using compressed oops. I
>     think it would be better if you keep the old behavior.
>
>
> Arguably the new behaviour is better: The code which allocates the 
> heap for compressed Oops actually deals with this case quite nicely by 
> doing a range check for the address, not an identity check (see 
> virtualspace.cpp "ReservedHeapSpace::try_reserve_range()"). The other 
> code path in "ReservedSpace::initialize()" at least handles this case 
> (if returned address is unequal to requested address, releases memory 
> again). So if the OS cooperates by returning an address in the 
> vicinity of the requested address, it would actually be better to use 
> this address if it fits instead of retrying the allocation.
>
> However, I am not completely sure about other callers of 
> os::reserve_memory_special(), and changing the behaviour would be 
> beyond the scope of my change anyway, so I restored the old behaviour 
> as you wished. Now we either return the requested address or NULL if a 
> requested address was given.

Sounds good to me.

>
>
>     ---
>     Instead of inlining the reserve memory code into
>     reserve_memory_special_huge_tlbfs_mixed, could you extract it into
>     function? So that we have one function that's responsible for
>     reservinb PROT_NONE memory, and one that commits the small, large,
>     small pages.
>
>
> Ok, did this. Because there is already a anon_mmap(), which fits the 
> req_addr != NULL case, I just reused this and added a sister function 
> "anon_mmap_aligned" for the alignment case.

I didn't try to argue that you should split up the code, but rather move 
the entire reserve code that you wrote for webrev.01, and move it into a 
separate function. The code would then look like:

char* start = reserve_memory_something(bytes, req_addr, alignment);
if (start == NULL) {
   return NULL;
}

and the reader of reserve_memory_special_huge_tlbfs_mixed could simply 
ignore the details of reserve_memory_something, unless they were 
important for some reason. That way you could also write separate tests 
for reserve_memory_something (or whatever it would be called).

---
You are now using anon_mmap which has the comment:

3070     // anon_mmap() should only get called during VM initialization,
3071     // don't need lock (actually we can skip locking even it can be 
called
3072     // from multiple threads, because _highest_vm_reserved_address 
is just a
3073     // hint about the upper limit of non-stack memory regions.)
3074     if ((address)addr + bytes > _highest_vm_reserved_address) {
3075       _highest_vm_reserved_address = (address)addr + bytes;
3076     }

So, it's not entirely clear to me that you should be using this 
function. I don't know what _highest_vm_reserved_address is used for. 
Your new function, anon_mmap_aligned isn't updating this variable.

Could we go back and use the reserve code you had in webrev.01, so that 
we don't have to consider the impact of this?

>     ---
>     -  size_t large_page_size = os::large_page_size();
>     +  char* start;
>
>     +  size_t large_page_size = os::large_page_size();
>
>     There's no need anymore to declare 'start' before it's set. So
>     this should be incorporated into:
>
>     3489   start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>     3490     MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>
>     ---
>     3480   // first reserve - but not commit - the address range in
>     small pages.
>     ...
>     3529   // Commit small-paged leading area
>
>     Could you start your comments with a capital and with a period?
>
>
> Done. Note that there are comments where the first word is a parameter 
> name - if that name is lowercase, I would rather not write it upper case.

3082 // reserve small-paged memory, optionally aligned to given boundary.
3083 // alignment must be a multiple of allocation granularity. bytes 
must be

and some in the test as well.

>
>     ---
>     3481   // Note: if a wish address was given
>
>     There's no other reference to a "wish address". Could you reword
>     it to be req_addr or "requested address"?
>
>
> Done.
>
>
>     ---
>     3481   // Note: if a wish address was given, we still do not use
>     MAP_FIXED - do not want to trash
>     3482   // existing mappings. We hand down req_addr to mmap and
>     hope the OS does its best.
>
>     Could you move this comment down to the mmap call?
>
>
> Done.
>
>
>     ---
>     3529   // Commit small-paged leading area
>     ...
>     3540   // Commit large paged area
>
>     Shouldn't it say large-paged as well?
>
>
> Done.
>
>
>     ---
>     3483   // Note2: if wish address was given, alignment is ignored
>
>     Previously, we asserted that our in-parameters were aligned. Why
>     did you replace that with a comment?
>
>
> Ok, I restored the asserts, but I really am not happy with that. I 
> find the asserts way too strict and they limit the usefulness of the 
> functions:
>
> - "assert(is_ptr_aligned(req_addr, alignment), "Must be");"
>
> Why must this be? Why do I even have to input an alignment value if I 
> input a requested address? It would make more sense just to say that 
> alignment is ignored if req_addr is set.
> Note that this was not really well tested beforehand, because there 
> were no tests for a non-null req_addr.
>
> - "assert(is_size_aligned(bytes, alignment), "Must be");"
>
> This is even worse. Why should the size of the allocation be aligned 
> to the alignment? Why can't I allocate a small chunk of memory at a 
> larger alignment? I do not understand this limit at all.
>
> But ok, I restored the asserts and had to limit the test accordingly 
> to not to trigger those asserts.

You are probably right, and those alignment restrictions should probably 
have been put in the higher layers were it's needed. But, please, don't 
change this kind of things in an unrelated bug fix. It needs a separate 
RFE so that it gets enough scrutiny to make sure that we don't break 
something.

Thanks,
StefanK

>     ---
>     +  size_t extra_size = ...
>     +    char* const end0 = start0 + bytes;
>     +    char* const end = start + extra_size;
>
>     It's not obvious from the variable names what they represent,
>     IMHO. Could you try to give them more descriptive names?
>
>
> Done.
>
>     I'll take a look at the tests after next review round.
>
>
> Ok, thanks for taking the time!
>
> ..Thomas
>
>     Thanks,
>     StefanK
>
>
>>
>>     Thanks!
>>
>>     ..Thomas
>>
>>
>>     On Wed, Apr 15, 2015 at 1:55 PM, Stefan Karlsson
>>     <stefan.karlsson at oracle.com <mailto:stefan.karlsson at oracle.com>>
>>     wrote:
>>
>>         Hi Thomas,
>>
>>         Thanks for fixing this  bug.
>>
>>         Some comments inlined:
>>
>>         On 2015-04-09 17:08, Thomas Stüfe wrote:
>>
>>             Hi all,
>>
>>             please review this fix to huge page allocation on Linux.
>>
>>             bug: https://bugs.openjdk.java.net/browse/JDK-8077276
>>             webrev:
>>             http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/
>>             <http://cr.openjdk.java.net/%7Estuefe/webrevs/8077276/webrev.00/webrev/>
>>
>>
>>         http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/src/os/linux/vm/os_linux.cpp.udiff.html
>>         <http://cr.openjdk.java.net/%7Estuefe/webrevs/8077276/webrev.00/webrev/src/os/linux/vm/os_linux.cpp.udiff.html>
>>
>>         +  const int prot = exec ? PROT_READ|PROT_WRITE|PROT_EXEC :
>>         PROT_READ|PROT_WRITE;
>>             char* start;
>>            if (req_addr != NULL) {
>>              assert(is_ptr_aligned(req_addr, alignment), "Must be");
>>              assert(is_size_aligned(bytes, alignment), "Must be");
>>         -    start = os::reserve_memory(bytes, req_addr);
>>         -    assert(start == NULL || start == req_addr, "Must be");
>>         +    // do NOT use MAP_FIXED here - do not trash existing
>>         mappings.
>>         +    start = (char*) ::mmap(req_addr, bytes, prot,
>>         + MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>         +    if (start == MAP_FAILED) {
>>         +      start = NULL;
>>         +    }
>>         +    if (!is_ptr_aligned(start, alignment)) {
>>         +      ::munmap(start, bytes);
>>         +      start = NULL;
>>         +    }
>>
>>
>>         Previously, the small page area was mapped with PROT_NONE,
>>         but now you change it to RW. Is there a reason why you
>>         changed that?
>>
>>
>>            result = ::mmap(lp_start, lp_bytes, prot,
>>          MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED|MAP_HUGETLB,
>>                            -1, 0);
>>            if (result == MAP_FAILED) {
>>         -    warn_on_large_pages_failure(req_addr, bytes, errno);
>>         +    warn_on_large_pages_failure(start, bytes, errno);
>>
>>         Shouldn't we report lp_start instead of start here?
>>
>>
>>         http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/raw_files/new/src/os/linux/vm/os_linux.cpp
>>         <http://cr.openjdk.java.net/%7Estuefe/webrevs/8077276/webrev.00/webrev/raw_files/new/src/os/linux/vm/os_linux.cpp>
>>
>>         There are a number of things that I would like to change with
>>         the the test.
>>
>>         Can you use mapping == NULL in null checks instead of just if
>>         (mapping). The same for the asserts.
>>
>>         I think the test will be easier to understand if you get rid
>>         of the for loop and extract code into helper functions, or do
>>         a small amount of code duplication:
>>
>>             // do the test twice: once with a req_addr=NULL, once
>>         with a non-NULL
>>             // req_addr; the latter shall test if we trash existing
>>         mappings
>>             for (int run = 0; run < 2; run ++) {
>>
>>         I think you should fail or exit the test if you can't get the
>>         expected mapping, instead of continuing and have :
>>
>>               if (addr) {
>>                 test_log("... mapped [" PTR_FORMAT "-" PTR_FORMAT
>>         ").", addr, addr + size);
>>               } else {
>>                 test_log("... failed.");
>>               }
>>
>>               if (addr != NULL) {
>>
>>         Thanks,
>>         StefanK
>>
>>
>>
>>             os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>>             first establishes a
>>             mapping with small pages over the whole requested range,
>>             then exchanges the
>>             parts of the mapping which are aligned to large page size
>>             with large pages.
>>
>>             Now, when establishing the first mapping, it uses
>>             os::reserve_memory() with
>>             a potentially non-null req_addr. In that case,
>>             os::reserve_memory() will do
>>             a mmap(MAP_FIXED).
>>
>>             This will trash any pre-existing mappings.
>>
>>             Note that I could not willingly reproduce the error with
>>             an unmodified VM.
>>             But I added a reproduction case which demonstrated that
>>             if one were to call
>>             os::Linux::reserve_memory_special_huge_tlbfs_mixed() with
>>             a non-NULL
>>             req_addr, existing mappings would be trashed. Depending
>>             on where we are in
>>             address space, we also would overwrite libc structures
>>             and crash
>>             immediately.
>>
>>             The repro case is part of the change, see changes in
>>             test_reserve_memory_special_huge_tlbfs_mixed(), and can
>>             be executed with:
>>
>>             ./images/jdk/bin/java -XX:+UseLargePages
>>             -XX:+UseHugeTLBFS -XX:-UseSHM
>>             -XX:+ExecuteInternalVMTests -XX:+VerboseInternalVMTests
>>
>>             The fix: instead of using os::reserve_memory(),
>>             os::Linux::reserve_memory_special_huge_tlbfs_mixed() now
>>             calls mmap()
>>             directly with the non-NULL req_addr, but without
>>             MAP_FIXED. This means the
>>             OS may do its best to allocate at req_addr, but will not
>>             trash any
>>             mappings. This also follows the pattern in
>>             os::Linux::reserve_memory_special_huge_tlbfs_only(),
>>             which is a sister
>>             function of
>>             os::Linux::reserve_memory_special_huge_tlbfs_mixed().
>>
>>             Note also discussion at:
>>             http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-April/017823.html.
>>
>>             Thanks for reviewing!
>>
>>             Kind Regards, Thomas
>>
>>
>>
>
>



More information about the hotspot-runtime-dev mailing list