RFR(s): 8077276: allocating heap with UseLargePages and HugeTLBFS may trash existing memory mappings (linux)

Stefan Karlsson stefan.karlsson at oracle.com
Mon Apr 27 08:25:10 UTC 2015


On 2015-04-24 11:41, Thomas Stüfe wrote:
> Hi Stefan,
>
> ok. I see it makes sense to limit the scope of the change as much as 
> possible to the actual error and do semantic changes separately.
>
> I worked for a long time with os::reserve_memory() and friends, 
> especially because I had to port it to AIX here at SAP. That was 
> annoyingly difficult because of AIX specifics, so it basically ended 
> up being a completely new implementation. I guess that is the reason 
> I am a bit "looser" around this code and tend to change too much - 
> it is hard to get this "ownership" feeling out of my bones.
>
> On a side note, I am quite happy now that SAP is contributing to the 
> OpenJDK, so now I have the chance to actually bring some changes 
> upstream :)

Thanks for fixing the bugs and cleaning out our inconsistent APIs. :)

>
> Ok, what you suggest makes sense and I will change the fix accordingly.
>
> Some more remarks inline:
>
>
>
> On Thu, Apr 23, 2015 at 9:53 PM, Stefan Karlsson 
> <stefan.karlsson at oracle.com> wrote:
>
>     Hi Thomas,
>
>     On 2015-04-23 18:09, Thomas Stüfe wrote:
>>     Hi Stefan,
>>
>>     new webrev:
>>     http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.03/webrev/
>
>     I think I've realized that we have different views on what the API
>     for the reserve function/code should be.
>
>     The code you have written has two more or less mutually exclusive
>     execution modes:
>     1) req_addr != NULL => ignore the alignment
>     2) req_addr == NULL => adhere to the alignment
>
>     This is reflected in the code, in the name of the new function, and
>     in the comments.
>
>     If we take the previous implementation of the code, before your
>     patches:
>     3482   if (req_addr != NULL) {
>     3483     assert(is_ptr_aligned(req_addr, alignment), "Must be");
>     3484     assert(is_size_aligned(bytes, alignment), "Must be");
>     3485     start = os::reserve_memory(bytes, req_addr);
>     3486     assert(start == NULL || start == req_addr, "Must be");
>     3487   } else {
>     3488     start = os::reserve_memory_aligned(bytes, alignment);
>     3489   }
>     3490
>     3491   if (start == NULL) {
>     3492     return NULL;
>     3493   }
>     3494
>     3495   assert(is_ptr_aligned(start, alignment), "Must be");
>
>
>     If we had extracted that into a function, the API would have been:
>     INPUT:
>     alignment - requested alignment
>
>
> So... alignment can be 0 or not? It was never tested; the old code 
> somewhat assumes that it is never 0.

Currently, we never pass in 0 as the alignment, unless it changed recently.

>     req_addr - requested address (must be 'alignment' aligned)
>     bytes - the size of the memory chunk to reserve (must be
>     [arguably] 'alignment' aligned)
>
>
> Almost. bytes had to be aligned to alignment only for the req_addr != 
> NULL case, which does not make much sense.

The alignment check was done here:

char* os::reserve_memory_aligned(size_t size, size_t alignment) {
   assert((alignment & (os::vm_allocation_granularity() - 1)) == 0,
       "Alignment must be a multiple of allocation granularity (page size)");
   assert((size & (alignment -1)) == 0, "size must be 'alignment' aligned");
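
As an aside, both asserts rely on the alignment being a power of two, which
is what HotSpot passes in here. A rough standalone restatement of the bit
trick (hypothetical helper names, not the HotSpot is_size_aligned /
is_ptr_aligned utilities):

#include <stddef.h>
#include <stdint.h>

static bool size_is_pow2_aligned(size_t size, size_t alignment) {
  // For a power-of-two alignment, "size is a multiple of alignment" is the
  // same as "all bits of size below the alignment bit are zero".
  return (size & (alignment - 1)) == 0;
}

static bool ptr_is_pow2_aligned(const void* p, size_t alignment) {
  return ((uintptr_t)p & (alignment - 1)) == 0;
}

A non-power-of-two alignment would not be checked correctly by this mask
trick.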

>
>
>     OUTPUT:
>     address - a memory chunk that is 'alignment' aligned and 'bytes'
>     large, that starts at 'req_addr' if 'req_addr' is given, or NULL
>     on failure.
>
>     Your current code is written to allow for a broader use case than
>     the current code needs, where req_addr doesn't have to be
>     'alignment' aligned. I would much prefer if you kept the INPUT
>     and OUTPUT requirements above and made sure that the function
>     implemented those. And _if_ we need to alter the current
>     restrictions, we can go back and do it as a separate change.
>
> ok!
>
>     The code could be written as (not tested with a compiler):
>
>     static char* anon_mmap_aligned(size_t bytes, size_t alignment,
>     char* req_addr) {
>       // Input requirements.
>       assert(is_ptr_aligned(req_addr, alignment), "Must be");
>       assert(is_size_aligned(bytes, alignment), "Must be");
>
>       size_t extra_size = bytes;
>       if (alignment > 0) {
>         extra_size += alignment;
>       }
>
>       char* start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>     MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE,
>                                    -1, 0);
>       if (start == MAP_FAILED) {
>         return NULL;
>       }
>
>       if (start != req_addr) {
>         ::munmap(start, extra_size);
>         return NULL;
>       }
>
>       if (alignment > 0) {
>         char* const start_aligned = (char*) align_ptr_up(start,
>     alignment);
>         char* const end_aligned = start_aligned + bytes;
>         char* const end = start + extra_size;
>         if (start_aligned > start) {
>           ::munmap(start, start_aligned - start);
>         }
>         if (end_aligned < end) {
>           ::munmap(end_aligned, end - end_aligned);
>         }
>         start = start_aligned;
>       }
>
>       // Output requirements.
>       assert(req_addr == NULL || start == req_addr, "Must be");
>       assert(is_ptr_aligned(start, alignment), "Must be");
>       return start;
>     }
>
>     and we could even get rid of the alignment > 0 checks, if we want to:
>
>
>
> No, because you'd change the behaviour in the req_addr != NULL case 
> compared to the old code. You would always mmap more than needed, 
> then discard the rest. Aside from being unnecessary, this would also 
> change mapping placement, because the OS must fit a larger mapping.

Yes, I realized this overnight.
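
As an aside, spelled out standalone: the req_addr != NULL branch boils down
to handing the requested address to mmap() purely as a hint - never with
MAP_FIXED - and keeping the result only if the kernel honored the hint. A
minimal, untested sketch (the function name is made up for illustration;
this is not the webrev code):

#include <stddef.h>
#include <sys/mman.h>

static char* reserve_at_hint(char* req_addr, size_t bytes) {
  // req_addr is only a hint; without MAP_FIXED the kernel never replaces an
  // existing mapping (which is the bug in JDK-8077276).
  char* start = (char*) ::mmap(req_addr, bytes, PROT_NONE,
                               MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
  if (start == MAP_FAILED) {
    return NULL;
  }
  if (start != req_addr) {
    // The kernel placed the mapping elsewhere; give it back and report
    // failure, matching the "req_addr or NULL" contract discussed above.
    ::munmap(start, bytes);
    return NULL;
  }
  return start;
}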

Thanks,
StefanK

>
>     static char* anon_map_aligned(size_t bytes, size_t alignment,
>     char* req_addr) {
>       // Input requirements.
>       assert(is_ptr_aligned(req_addr, alignment), "Must be");
>       assert(is_size_aligned(bytes, alignment), "Must be");
>
>       size_t extra_size = bytes + alignment;
>
>       char* start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>     MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE,
>                                    -1, 0);
>       if (start == MAP_FAILED) {
>         return NULL;
>       }
>
>       if (start != req_addr) {
>         ::munmap(start, extra_size);
>         return NULL;
>       }
>
>       // Potentially adjust to get an aligned mmapping.
>       char* const start_aligned = (char*) align_ptr_up(start, alignment);
>       char* const end_aligned = start_aligned + bytes;
>       char* const end = start + extra_size;
>       if (start_aligned > start) {
>         ::munmap(start, start_aligned - start);
>       }
>       if (end_aligned < end) {
>         ::munmap(end_aligned, end - end_aligned);
>       }
>       start = start_aligned;
>
>       // Output requirements.
>       assert(req_addr == NULL || start == req_addr, "Must be");
>       assert(is_ptr_aligned(start, alignment), "Must be");
>       return start;
>     }
>
>     And as a side-note, even with these restrictions, we're not far
>     from implementing anon_mmap as:
>
>     static char* anon_mmap(char* requested_addr, size_t bytes, bool
>     fixed) {
>       return anon_mmap_aligned(bytes, 0, requested_addr);
>     }
>
>     we just need to cater for the fixed flag. (The input parameters
>     are in a different order in the current patch. That should
>     probably be fixed.)
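
For illustration only, catering for the fixed flag might look roughly like
this (an untested sketch building on the anon_mmap_aligned proposal above;
the real anon_mmap in os_linux.cpp also does bookkeeping that is left out
here):

static char* anon_mmap(char* requested_addr, size_t bytes, bool fixed) {
  if (!fixed) {
    // Non-fixed request: requested_addr is only a hint, no extra alignment.
    return anon_mmap_aligned(bytes, 0, requested_addr);
  }
  // Fixed request: the caller guarantees the range is free to be taken over;
  // MAP_FIXED is deliberately kept out of anon_mmap_aligned.
  assert(requested_addr != NULL, "Must be");
  char* addr = (char*) ::mmap(requested_addr, bytes, PROT_NONE,
                              MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE|MAP_FIXED,
                              -1, 0);
  return addr == MAP_FAILED ? NULL : addr;
}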
>
>     ---
>
>     Smaller style issues:
>
>     - Don't NULL-check with if (p), use if (p != NULL)
>     - i ++ should be written as i++
>
>
> ok, will change.
>
>     Thanks,
>     StefanK
>
>
> Thanks, Thomas
>
>
>>     Best Regards, Thomas
>>
>>
>>     On Thu, Apr 23, 2015 at 4:16 PM, Stefan Karlsson
>>     <stefan.karlsson at oracle.com>
>>     wrote:
>>
>>         Hi Thomas,
>>
>>
>>         On 2015-04-23 14:55, Thomas Stüfe wrote:
>>>         Hi Stefan,
>>>
>>>         lets talk a bit more to get an understanding before I change
>>>         the webrev again:
>>>
>>>         On Thu, Apr 23, 2015 at 10:48 AM, Stefan Karlsson
>>>         <stefan.karlsson at oracle.com> wrote:
>>>
>>>
>>>
>>>             On 2015-04-22 17:20, Thomas Stüfe wrote:
>>>>             Hi Stefan,
>>>>
>>>>             new webrev:
>>>>             http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.02/webrev/
>>>>
>>>>             See my answers inline.
>>>>
>>>>             On Wed, Apr 22, 2015 at 2:20 PM, Stefan Karlsson
>>>>             <stefan.karlsson at oracle.com> wrote:
>>>>
>>>>                 Hi Thomas,
>>>>
>>>>                 On 2015-04-21 18:31, Thomas Stüfe wrote:
>>>>>                 Hi Stefan,
>>>>>
>>>>>                 thanks for reviewing!
>>>>>
>>>>>                 Here is the new webrev:
>>>>>                 http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/
>>>>>
>>>>>                 I worked in all your remarks except one: the test
>>>>>                 unfortunately cannot just assert if allocation
>>>>>                 fails, because that may happen sometimes,
>>>>>                 especially with small allocation sizes
>>>>>                 (os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>>>>>                 will refuse to allocate if the area cannot
>>>>>                 accomodate one large page). The test also did not
>>>>>                 assert prior to my change, so I did not want to
>>>>>                 change that behaviour.
>>>>>
>>>>>                 I changed some more things:
>>>>>
>>>>>                 1) Before my change,
>>>>>                 os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>>>>>                 would call os::reserve_memory() or
>>>>>                 os::reserve_memory_aligned(), depending on whether
>>>>>                 req_addr was NULL or not. That was a bit unclean
>>>>>                 (a lower-level API calling a high-level API) and
>>>>>                 led to the weird workaround at os_linux.cpp:3497,
>>>>>                 where the NMT Memory tracker had to deregister an
>>>>>                 allocation which it expected to be registered
>>>>>                 again later at an upper level. Yikes.
>>>>>
>>>>>                 Because I changed the first case to use low-level
>>>>>                 mmap, I also changed the second - replaced
>>>>>                 os::reserve_memory_aligned with low-level code.
>>>>>                 Actually I merged both branches together as well
>>>>>                 as possible, so the new code (os_linux.cpp:3484
>>>>>                 ff) now handles alignment as well: If req_addr is
>>>>>                 set, allocation is attempted at req_addr and
>>>>>                 alignment is ignored. If req_addr is NULL,
>>>>>                 alignment is honored. This way I could get rid of
>>>>>                 the workaround for the NMT completely.
>>>>>
>>>>>                 2) I completely rewrote the test function. It now
>>>>>                 tests three scenarios, see comment. I hope the
>>>>>                 code is clear and easy to understand.
>>>>
>>>>                 http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/src/os/linux/vm/os_linux.cpp.frames.html
>>>>
>>>>                 ---
>>>>                 +  start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>>>>                 +                         MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>>>                 +  if (start == MAP_FAILED) {
>>>>                      return NULL;
>>>>                    }
>>>>
>>>>                 You've changed the behavior of this function. It
>>>>                 used to return NULL if reserve_memory couldn't
>>>>                 reserve memory at the requested address. Your
>>>>                 proposed patch can return another address. This
>>>>                 will be a problem for the upper layers when using
>>>>                 compressed oops. I think it would be better if you
>>>>                 keep the old behavior.
>>>>
>>>>
>>>>             Arguably the new behaviour is better: The code which
>>>>             allocates the heap for compressed Oops actually deals
>>>>             with this case quite nicely by doing a range check for
>>>>             the address, not an identity check (see
>>>>             virtualspace.cpp
>>>>             "ReservedHeapSpace::try_reserve_range()"). The other
>>>>             code path in "ReservedSpace::initialize()" at least
>>>>             handles this case (if the returned address is unequal to
>>>>             the requested address, it releases the memory again). So
>>>>             if the OS
>>>>             cooperates by returning an address in the vicinity of
>>>>             the requested address, it would actually be better to
>>>>             use this address if it fits instead of retrying the
>>>>             allocation.
>>>>
>>>>             However, I am not completely sure about other callers
>>>>             of os::reserve_memory_special(), and changing the
>>>>             behaviour would be beyond the scope of my change
>>>>             anyway, so I restored the old behaviour as you wished.
>>>>             Now we either return the requested address or NULL if a
>>>>             requested address was given.
>>>
>>>             Sounds good to me.
>>>
>>>
>>>         Ok, I'll keep it that way.
>>>
>>>>
>>>>
>>>>                 ---
>>>>                 Instead of inlining the reserve memory code into
>>>>                 reserve_memory_special_huge_tlbfs_mixed, could you
>>>>                 extract it into a function? So that we have one
>>>>                 function that's responsible for reserving PROT_NONE
>>>>                 memory, and one that commits the small, large,
>>>>                 small pages.
>>>>
>>>>
>>>>             Ok, did this. Because there is already an anon_mmap(),
>>>>             which fits the req_addr != NULL case, I just reused
>>>>             this and added a sister function "anon_mmap_aligned"
>>>>             for the alignment case.
>>>
>>>             I didn't try to argue that you should split up the code,
>>>             but rather that you should move the entire reserve code
>>>             you wrote for webrev.01 into a separate function. The
>>>             code would then look like:
>>>
>>>             char* start = reserve_memory_something(bytes, req_addr,
>>>             alignment);
>>>             if (start == NULL) {
>>>               return NULL;
>>>             }
>>>
>>>             and the reader of
>>>             reserve_memory_special_huge_tlbfs_mixed could simply
>>>             ignore the details of reserve_memory_something, unless
>>>             they were important for some reason. That way you could
>>>             also write separate tests for reserve_memory_something
>>>             (or whatever it would be called).
>>>
>>>             ---
>>>             You are now using anon_mmap which has the comment:
>>>
>>>             3070     // anon_mmap() should only get called during VM initialization,
>>>             3071     // don't need lock (actually we can skip locking even it can be called
>>>             3072     // from multiple threads, because _highest_vm_reserved_address is just a
>>>             3073     // hint about the upper limit of non-stack memory regions.)
>>>             3074     if ((address)addr + bytes > _highest_vm_reserved_address) {
>>>             3075       _highest_vm_reserved_address = (address)addr + bytes;
>>>             3076     }
>>>
>>>             So, it's not entirely clear to me that you should be
>>>             using this function. I don't know what
>>>             _highest_vm_reserved_address is used for. Your new
>>>             function, anon_mmap_aligned, isn't updating this variable.
>>>
>>>
>>>         The comment is obsolete and does not reflect reality;
>>>         anon_mmap is used for os::reserve_memory(), which in turn
>>>         can be used at any time in the life of the VM.
>>>
>>>         The comment - and the associated variable
>>>         "_highest_vm_reserved_address" - only affects the
>>>         LinuxThreads implementation which is legacy and should be
>>>         removed (I just asked this question on hotspot-dev and
>>>         will do the cleanup if I get the green light).
>>>
>>>         It has also been broken forever, since a lot of allocations
>>>         happen in the VM and JDK where _highest_vm_reserved_address is
>>>         just not updated.
>>
>>         OK. It would be good to get rid of it at some point then.
>>
>>>
>>>             Could we go back and use the reserve code you had in
>>>             webrev.01, so that we don't have to consider the impact
>>>             of this?
>>>
>>>
>>>         It actually changes less than my last webrev:
>>>
>>>         - old code was using os::reserve_memory() and
>>>         os::reserve_memory_aligned().
>>>            os::reserve_memory() directly called anon_mmap().
>>>          os::reserve_memory_aligned() called os::reserve_memory(),
>>>         which directly called anon_mmap().
>>>
>>>         - the new code:
>>>             for the req_addr != NULL case, it calls anon_mmap directly;
>>>         the only difference to before is that we now do not specify MAP_FIXED.
>>>             for the req_addr==NULL case, it calls the new function
>>>         anon_mmap_aligned(), which did not exist before, so no
>>>         compatibility issues can arise.
>>>
>>>         I can live with both my webrev.01 proposal and this one, but
>>>         the solution to factor out all the allocation to a single
>>>         function would be my least favorite choice: I do not really
>>>         like factoring out a function which only exists to be called
>>>         from a single other function, and which is too awkward to be
>>>         used any other way.
>>
>>         The reason for moving it out to a function is to get better
>>         "separation of concerns" and hide the implementation details,
>>         it doesn't really matter if it's used only once.
>>
>>>         And your "reserved_memory_something()"  would be quite
>>>         awkward: bundling mmap allocation for the req_addr case
>>>         (which duplicates what anon_mmap() does), plus adding
>>>         support for allocating at an alignment. It does not easily
>>>         fit together imho.
>>
>>         But you almost already had the "awkward" code in webrev.01,
>>         and it was quite easy to understand. With the requirement
>>         that req_addr is aligned if an alignment is passed in, the
>>         code would be even easier. It could probably be used to
>>         implement anon_mmap with just a few small changes. And by
>>         doing that we would get rid of some code duplication as well.
>>         Not sure it's worth it, though.
>>
>>>
>>>         Of course, it is all a matter of taste, and if you insist,
>>>         I'll do it. The most important thing is to fix the bug.
>>
>>         OK. Then, please, extract it out to a separate function so
>>         that reserve_memory_special_huge_tlbfs_mixed only has one
>>         call out to a function that reserves the memory. It's OK if
>>         you want to implement it as an anon_mmap/anon_mmap_aligned
>>         pair or inline the code (as in webrev.01), as long as I don't
>>         have to see the code in reserve_memory_special_huge_tlbfs_mixed.
>>
>>         Thanks,
>>         StefanK
>>
>>
>>>
>>>>                 ---
>>>>                 -  size_t large_page_size = os::large_page_size();
>>>>                 +  char* start;
>>>>
>>>>                 +  size_t large_page_size = os::large_page_size();
>>>>
>>>>                 There's no need anymore to declare 'start' before
>>>>                 it's set. So this should be incorporated into:
>>>>
>>>>                 3489   start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>>>>                 3490                          MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>>>
>>>>                 ---
>>>>                 3480   // first reserve - but not commit - the
>>>>                 address range in small pages.
>>>>                 ...
>>>>                 3529   // Commit small-paged leading area
>>>>
>>>>                 Could you start your comments with a capital and
>>>>                 with a period?
>>>>
>>>>
>>>>             Done. Note that there are comments where the first word
>>>>             is a parameter name - if that name is lowercase, I
>>>>             would rather not write it upper case.
>>>
>>>             3082 // reserve small-paged memory, optionally aligned to given boundary.
>>>             3083 // alignment must be a multiple of allocation granularity. bytes must be
>>>
>>>             and some in the test as well.
>>>
>>>
>>>         Okay, will change.
>>>
>>>>
>>>>                 ---
>>>>                 3481   // Note: if a wish address was given
>>>>
>>>>                 There's no other reference to a "wish address".
>>>>                 Could you reword it to be req_addr or "requested
>>>>                 address"?
>>>>
>>>>
>>>>             Done.
>>>>
>>>>
>>>>                 ---
>>>>                 3481   // Note: if a wish address was given, we still do not use MAP_FIXED - do not want to trash
>>>>                 3482   // existing mappings. We hand down req_addr to mmap and hope the OS does its best.
>>>>
>>>>                 Could you move this comment down to the mmap call?
>>>>
>>>>
>>>>             Done.
>>>>
>>>>
>>>>                 ---
>>>>                 3529   // Commit small-paged leading area
>>>>                 ...
>>>>                 3540   // Commit large paged area
>>>>
>>>>                 Shouldn't it say large-paged as well?
>>>>
>>>>
>>>>             Done.
>>>>
>>>>
>>>>                 ---
>>>>                 3483   // Note2: if wish address was given, alignment is ignored
>>>>
>>>>                 Previously, we asserted that our in-parameters were
>>>>                 aligned. Why did you replace that with a comment?
>>>>
>>>>
>>>>             Ok, I restored the asserts, but I really am not happy
>>>>             with that. I find the asserts way too strict and they
>>>>             limit the usefulness of the functions:
>>>>
>>>>             - "assert(is_ptr_aligned(req_addr, alignment), "Must be");"
>>>>
>>>>             Why must this be? Why do I even have to input an
>>>>             alignment value if I input a requested address? It
>>>>             would make more sense just to say that alignment is
>>>>             ignored if req_addr is set.
>>>>             Note that this was not really well tested beforehand,
>>>>             because there were no tests for a non-null req_addr.
>>>>
>>>>             - "assert(is_size_aligned(bytes, alignment), "Must be");"
>>>>
>>>>             This is even worse. Why should the size of the
>>>>             allocation be aligned to the alignment? Why can't I
>>>>             allocate a small chunk of memory at a larger alignment?
>>>>             I do not understand this limit at all.
>>>>
>>>>             But ok, I restored the asserts and had to limit the
>>>>             test accordingly so as not to trigger those asserts.
>>>
>>>             You are probably right, and those alignment restrictions
>>>             should probably have been put in the higher layers where
>>>             they are needed. But, please, don't change this kind of
>>>             thing in an unrelated bug fix. It needs a separate RFE
>>>             so that it gets enough scrutiny to make sure that we
>>>             don't break something.
>>>
>>>
>>>         I agree with you. It is out of scope for this change.
>>>
>>>         Kind Regards, Thomas
>>>
>>>             Thanks,
>>>             StefanK
>>>
>>>
>>>>                 ---
>>>>                 +  size_t extra_size = ...
>>>>                 +    char* const end0 = start0 + bytes;
>>>>                 +    char* const end = start + extra_size;
>>>>
>>>>                 It's not obvious from the variable names what they
>>>>                 represent, IMHO. Could you try to give them more
>>>>                 descriptive names?
>>>>
>>>>
>>>>             Done.
>>>>
>>>>                 I'll take a look at the tests after the next review round.
>>>>
>>>>
>>>>             Ok, thanks for taking the time!
>>>>
>>>>             ..Thomas
>>>>
>>>>                 Thanks,
>>>>                 StefanK
>>>>
>>>>
>>>>>
>>>>>                 Thanks!
>>>>>
>>>>>                 ..Thomas
>>>>>
>>>>>
>>>>>                 On Wed, Apr 15, 2015 at 1:55 PM, Stefan Karlsson
>>>>>                 <stefan.karlsson at oracle.com> wrote:
>>>>>
>>>>>                     Hi Thomas,
>>>>>
>>>>>                     Thanks for fixing this bug.
>>>>>
>>>>>                     Some comments inlined:
>>>>>
>>>>>                     On 2015-04-09 17:08, Thomas Stüfe wrote:
>>>>>
>>>>>                         Hi all,
>>>>>
>>>>>                         please review this fix to huge page
>>>>>                         allocation on Linux.
>>>>>
>>>>>                         bug:
>>>>>                         https://bugs.openjdk.java.net/browse/JDK-8077276
>>>>>                         webrev:
>>>>>                         http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/
>>>>>
>>>>>
>>>>>                     http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/src/os/linux/vm/os_linux.cpp.udiff.html
>>>>>
>>>>>                     +  const int prot = exec ? PROT_READ|PROT_WRITE|PROT_EXEC : PROT_READ|PROT_WRITE;
>>>>>                        char* start;
>>>>>                        if (req_addr != NULL) {
>>>>>                          assert(is_ptr_aligned(req_addr, alignment), "Must be");
>>>>>                          assert(is_size_aligned(bytes, alignment), "Must be");
>>>>>                     -    start = os::reserve_memory(bytes, req_addr);
>>>>>                     -    assert(start == NULL || start == req_addr, "Must be");
>>>>>                     +    // do NOT use MAP_FIXED here - do not trash existing mappings.
>>>>>                     +    start = (char*) ::mmap(req_addr, bytes, prot,
>>>>>                     +                           MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>>>>                     +    if (start == MAP_FAILED) {
>>>>>                     +      start = NULL;
>>>>>                     +    }
>>>>>                     +    if (!is_ptr_aligned(start, alignment)) {
>>>>>                     +      ::munmap(start, bytes);
>>>>>                     +      start = NULL;
>>>>>                     +    }
>>>>>
>>>>>
>>>>>                     Previously, the small page area was mapped
>>>>>                     with PROT_NONE, but now you change it to RW.
>>>>>                     Is there a reason why you changed that?
>>>>>
>>>>>
>>>>>                        result = ::mmap(lp_start, lp_bytes, prot,
>>>>>                                        MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED|MAP_HUGETLB,
>>>>>                                        -1, 0);
>>>>>                        if (result == MAP_FAILED) {
>>>>>                     -    warn_on_large_pages_failure(req_addr, bytes, errno);
>>>>>                     +    warn_on_large_pages_failure(start, bytes, errno);
>>>>>
>>>>>                     Shouldn't we report lp_start instead of start
>>>>>                     here?
>>>>>
>>>>>
>>>>>                     http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/raw_files/new/src/os/linux/vm/os_linux.cpp
>>>>>
>>>>>                     There are a number of things that I would like
>>>>>                     to change with the test.
>>>>>
>>>>>                     Can you use mapping == NULL in null checks
>>>>>                     instead of just if (mapping)? The same goes for the
>>>>>                     asserts.
>>>>>
>>>>>                     I think the test will be easier to understand
>>>>>                     if you get rid of the for loop and extract
>>>>>                     code into helper functions, or do a small
>>>>>                     amount of code duplication:
>>>>>
>>>>>                         // do the test twice: once with a req_addr=NULL, once with a non-NULL
>>>>>                         // req_addr; the latter shall test if we trash existing mappings
>>>>>                         for (int run = 0; run < 2; run ++) {
>>>>>
>>>>>                     I think you should fail or exit the test if
>>>>>                     you can't get the expected mapping, instead of
>>>>>                     continuing and having:
>>>>>
>>>>>                           if (addr) {
>>>>>                             test_log("... mapped [" PTR_FORMAT "-" PTR_FORMAT ").", addr, addr + size);
>>>>>                           } else {
>>>>>                             test_log("... failed.");
>>>>>                           }
>>>>>
>>>>>                           if (addr != NULL) {
>>>>>
>>>>>                     Thanks,
>>>>>                     StefanK
>>>>>
>>>>>
>>>>>
>>>>>                         os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>>>>>                         first establishes a
>>>>>                         mapping with small pages over the whole
>>>>>                         requested range, then exchanges the
>>>>>                         parts of the mapping which are aligned to
>>>>>                         large page size with large pages.
>>>>>
>>>>>                         Now, when establishing the first mapping,
>>>>>                         it uses os::reserve_memory() with
>>>>>                         a potentially non-null req_addr. In that
>>>>>                         case, os::reserve_memory() will do
>>>>>                         a mmap(MAP_FIXED).
>>>>>
>>>>>                         This will trash any pre-existing mappings.
>>>>>
>>>>>                         Note that I could not reproduce the error
>>>>>                         at will with an unmodified VM.
>>>>>                         But I added a reproduction case which
>>>>>                         demonstrated that if one were to call
>>>>>                         os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>>>>>                         with a non-NULL
>>>>>                         req_addr, existing mappings would be
>>>>>                         trashed. Depending on where we are in
>>>>>                         address space, we also would overwrite
>>>>>                         libc structures and crash
>>>>>                         immediately.
>>>>>
>>>>>                         The repro case is part of the change, see
>>>>>                         changes in
>>>>>                         test_reserve_memory_special_huge_tlbfs_mixed(),
>>>>>                         and can be executed with:
>>>>>
>>>>>                         ./images/jdk/bin/java -XX:+UseLargePages
>>>>>                         -XX:+UseHugeTLBFS -XX:-UseSHM
>>>>>                         -XX:+ExecuteInternalVMTests
>>>>>                         -XX:+VerboseInternalVMTests
>>>>>
>>>>>                         The fix: instead of using
>>>>>                         os::reserve_memory(),
>>>>>                         os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>>>>>                         now calls mmap()
>>>>>                         directly with the non-NULL req_addr, but
>>>>>                         without MAP_FIXED. This means the
>>>>>                         OS may do its best to allocate at
>>>>>                         req_addr, but will not trash any
>>>>>                         mappings. This also follows the pattern in
>>>>>                         os::Linux::reserve_memory_special_huge_tlbfs_only(),
>>>>>                         which is a sister
>>>>>                         function of
>>>>>                         os::Linux::reserve_memory_special_huge_tlbfs_mixed().
>>>>>
>>>>>                         Note also discussion at:
>>>>>                         http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-April/017823.html.
>>>>>
>>>>>                         Thanks for reviewing!
>>>>>
>>>>>                         Kind Regards, Thomas
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>


