RFR(s): 8077276: allocating heap with UseLargePages and HugeTLBFS may trash existing memory mappings (linux)

Thu Apr 23 14:16:39 UTC 2015

Hi Thomas,

On 2015-04-23 14:55, Thomas Stüfe wrote:
> Hi Stefan,
>
> lets talk a bit more to get an understanding before I change the 
> webrev again:
>
> On Thu, Apr 23, 2015 at 10:48 AM, Stefan Karlsson 
> <stefan.karlsson at oracle.com <mailto:stefan.karlsson at oracle.com>> wrote:
>
>
>
>     On 2015-04-22 17:20, Thomas Stüfe wrote:
>>     Hi Stefan,
>>
>>     new webrev:
>>     http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.02/webrev/
>>     <http://cr.openjdk.java.net/%7Estuefe/webrevs/8077276/webrev.02/webrev/>
>>
>>     See my answers inline.
>>
>>     On Wed, Apr 22, 2015 at 2:20 PM, Stefan Karlsson
>>     <stefan.karlsson at oracle.com <mailto:stefan.karlsson at oracle.com>>
>>     wrote:
>>
>>         Hi Thomas,
>>
>>         On 2015-04-21 18:31, Thomas Stüfe wrote:
>>>         Hi Stefan,
>>>
>>>         thanks for reviewing!
>>>
>>>         Here is the new webrev:
>>>         http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/
>>>         <http://cr.openjdk.java.net/%7Estuefe/webrevs/8077276/webrev.01/webrev/>
>>>
>>>         I worked in all your remarks except one: the test
>>>         unfortunately cannot just assert if allocation fails,
>>>         because that may happen sometimes, especially with small
>>>         allocation sizes
>>>         (os::Linux::reserve_memory_special_huge_tlbfs_mixed() will
>>>         refuse to allocate if the area cannot accomodate one large
>>>         page). The test also did not assert prior to my change, so I
>>>         did not want to change that behaviour.
>>>
>>>         I changed some more things:
>>>
>>>         1) Before my change,
>>>         os::Linux::reserve_memory_special_huge_tlbfs_mixed() would
>>>         call os::reserve_memory() or os::reserve_memory_aligned(),
>>>         depending on whether req_addr was NULL or not. That was a
>>>         bit unclean (a lower-level API calling a high-level API) and
>>>         led to the weird workaround at os_linux.cpp:3497, where the
>>>         NMT Memory tracker had to deregister a allocation which it
>>>         expected to be registered again later at an upper level. Yikes.
>>>
>>>         Because I changed the first case to use low-level mmap, I
>>>         also changed the second - replaced
>>>         os::reserve_memory_aligned with low level coding. Actually I
>>>         merged both branches together as good as possible, so the
>>>         new code (os_linux.cpp:3484 ff) now handles alignment as
>>>         well: If req_addr is set, allocation is attempted at
>>>         req_addr and alignment is ignored. If req_addr is NULL,
>>>         alignment is honored. This way I could get rid of the
>>>         workaround for the NMT completely.
>>>
>>>         2) I completely rewrote the test function. It now tests
>>>         three scenarios, see comment. I hope the code is clear and
>>>         easy to understand.
>>
>>         http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/src/os/linux/vm/os_linux.cpp.frames.html
>>         <http://cr.openjdk.java.net/%7Estuefe/webrevs/8077276/webrev.01/webrev/src/os/linux/vm/os_linux.cpp.frames.html>
>>
>>         ---
>>         +  start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>>         + MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>         +  if (start == MAP_FAILED) {
>>              return NULL;
>>            }
>>
>>         You've changed the behavior of this function. It used to
>>         return NULL if reserve_memory couldn't reserve memory at the
>>         requested address. Your proposed patch can return another
>>         address. This will be a problem for the upper layers when
>>         using compressed oops. I think it would be better if you keep
>>         the old behavior.
>>
>>
>>     Arguably the new behaviour is better: The code which allocates
>>     the heap for compressed Oops actually deals with this case quite
>>     nicely by doing a range check for the address, not an identity
>>     check (see virtualspace.cpp
>>     "ReservedHeapSpace::try_reserve_range()"). The other code path in
>>     "ReservedSpace::initialize()" at least handles this case (if
>>     returned address is unequal to requested address, releases memory
>>     again). So if the OS cooperates by returning an address in the
>>     vicinity of the requested address, it would actually be better to
>>     use this address if it fits instead of retrying the allocation.
>>
>>     However, I am not completely sure about other callers of
>>     os::reserve_memory_special(), and changing the behaviour would be
>>     beyond the scope of my change anyway, so I restored the old
>>     behaviour as you wished. Now we either return the requested
>>     address or NULL if a requested address was given.
>
>     Sounds good to me.
>
>
> Ok, I'll keep it that way.
>
>>
>>
>>         ---
>>         Instead of inlining the reserve memory code into
>>         reserve_memory_special_huge_tlbfs_mixed, could you extract it
>>         into function? So that we have one function that's
>>         responsible for reservinb PROT_NONE memory, and one that
>>         commits the small, large, small pages.
>>
>>
>>     Ok, did this. Because there is already a anon_mmap(), which fits
>>     the req_addr != NULL case, I just reused this and added a sister
>>     function "anon_mmap_aligned" for the alignment case.
>
>     I didn't try to argue that you should split up the code, but
>     rather move the entire reserve code that you wrote for webrev.01,
>     and move it into a separate function. The code would then look like:
>
>     char* start = reserve_memory_something(bytes, req_addr, alignment);
>     if (start == NULL) {
>       return NULL;
>     }
>
>     and the reader of reserve_memory_special_huge_tlbfs_mixed could
>     simply ignore the details of reserve_memory_something, unless they
>     were important for some reason. That way you could also write
>     separate tests for reserve_memory_something (or whatever it would
>     be called).
>
>     ---
>     You are now using anon_mmap which has the comment:
>
>     3070     // anon_mmap() should only get called during VM
>     initialization,
>     3071     // don't need lock (actually we can skip locking even it
>     can be called
>     3072     // from multiple threads, because
>     _highest_vm_reserved_address is just a
>     3073     // hint about the upper limit of non-stack memory regions.)
>     3074     if ((address)addr + bytes > _highest_vm_reserved_address) {
>     3075       _highest_vm_reserved_address = (address)addr + bytes;
>     3076     }
>
>     So, it's not entirely clear to me that you should be using this
>     function. I don't know what _highest_vm_reserved_address is used
>     for. Your new function, anon_mmap_aligned isn't updating this
>     variable.
>
>
> The comment is obsolete and does not reflect reality; anon_mmap is 
> used for os::reserve_memory(), which in turn can be used at any time 
> in the life of the VM.
>
> The comment - and the associated variable 
> "_highest_vm_reserved_address" - only affects the LinuxThreads 
> implementation which is legacy and should be removed (I just did ask 
> this question on hotspot-dev and will do the cleanup if I get green 
> light).
>
> It is also broken since forever, since a lot of allocations happen in 
> VM and JDK where _highest_vm_reserved_address is just not updated.

OK. It would be good to get rid of it at some point then.

>
>     Could we go back and use the reserve code you had in webrev.01, so
>     that we don't have to consider the impact of this?
>
>
> It actually changes less than my last webrev:
>
> - old code was using os::reserve_memory() and 
> os::reserve_memory_aligned().
>    os::reserve_memory() directly called anon_mmap().
>    os::reserve_memory_aligned() called os::reserve_memory(), which 
> directly called anon_mmap().
>
> - the new code:
>     for the req_addr != NULL, calls anon_mmap directly, only 
> difference to before is that we now do not specify MAP_FIXED.
>     for the req_addr==NULL case, it calls the new function 
> anon_mmap_aligned(), which did not exist before, so no compatibility 
> issues can arise.
>
> I can live with both my webrev.01 proposal and this one, but the 
> solution to factor out all the allocation to a single function would 
> be my least favorite choice: I do not really like factoring out a 
> function which only exists to be called from a single other function, 
> and which is too awkward to be used any other way.

The reason for moving it out to a function is to get better "separation 
of concerns" and hide the implementation details, it doesn't really 
matter if it's used only once.

> And your "reserved_memory_something()"  would be quite awkward: 
> bundling mmap allocation for the req_addr case (which duplicates what 
> anon_mmap() does), plus adding support for allocating at an alignment. 
> It does not easily fit together imho.

But you almost already had the "awkward" code in webrev.01, and it was 
quite easy to understand. With the requirement that req_addr is aligned 
if an alignment is passed in, then the code would be even easier. It 
could probably be used to implement anon_mmap with just a few small 
changes. And by doing that we would get rid of some code duplication as 
well. Not sure it worth it though.

>
> Of course, it is all a matter of taste, and if you insist, I'll do it. 
> Most important is to fix the bug.

OK. Then, please, extract it out to a separate function so that 
reserve_memory_special_huge_tlbfs_mixed only have one call out to a 
function that reserves the memory. It's OK if you want to implement it 
as a anon_mmap/anon_mmap_aligned pair or inline the code (as in 
webrev.01), as long as I don't have to see the code in 
reserve_memory_special_huge_tlbfs_mixed.

Thanks,
StefanK

>
>>         ---
>>         -  size_t large_page_size = os::large_page_size();
>>         +  char* start;
>>
>>         +  size_t large_page_size = os::large_page_size();
>>
>>         There's no need anymore to declare 'start' before it's set.
>>         So this should be incorporated into:
>>
>>         3489   start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>>         3490 MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>
>>         ---
>>         3480   // first reserve - but not commit - the address range
>>         in small pages.
>>         ...
>>         3529   // Commit small-paged leading area
>>
>>         Could you start your comments with a capital and with a period?
>>
>>
>>     Done. Note that there are comments where the first word is a
>>     parameter name - if that name is lowercase, I would rather not
>>     write it upper case.
>
>     3082 // reserve small-paged memory, optionally aligned to given
>     boundary.
>     3083 // alignment must be a multiple of allocation granularity.
>     bytes must be
>
>     and some in the test as well.
>
>
> Okay, will change.
>
>>
>>         ---
>>         3481   // Note: if a wish address was given
>>
>>         There's no other reference to a "wish address". Could you
>>         reword it to be req_addr or "requested address"?
>>
>>
>>     Done.
>>
>>
>>         ---
>>         3481   // Note: if a wish address was given, we still do not
>>         use MAP_FIXED - do not want to trash
>>         3482   // existing mappings. We hand down req_addr to mmap
>>         and hope the OS does its best.
>>
>>         Could you move this comment down to the mmap call?
>>
>>
>>     Done.
>>
>>
>>         ---
>>         3529   // Commit small-paged leading area
>>         ...
>>         3540   // Commit large paged area
>>
>>         Shouldn't it say large-paged as well?
>>
>>
>>     Done.
>>
>>
>>         ---
>>         3483   // Note2: if wish address was given, alignment is ignored
>>
>>         Previously, we asserted that our in-parameters were aligned.
>>         Why did you replace that with a comment?
>>
>>
>>     Ok, I restored the asserts, but I really am not happy with that.
>>     I find the asserts way too strict and they limit the usefulness
>>     of the functions:
>>
>>     - "assert(is_ptr_aligned(req_addr, alignment), "Must be");"
>>
>>     Why must this be? Why do I even have to input an alignment value
>>     if I input a requested address? It would make more sense just to
>>     say that alignment is ignored if req_addr is set.
>>     Note that this was not really well tested beforehand, because
>>     there were no tests for a non-null req_addr.
>>
>>     - "assert(is_size_aligned(bytes, alignment), "Must be");"
>>
>>     This is even worse. Why should the size of the allocation be
>>     aligned to the alignment? Why can't I allocate a small chunk of
>>     memory at a larger alignment? I do not understand this limit at all.
>>
>>     But ok, I restored the asserts and had to limit the test
>>     accordingly to not to trigger those asserts.
>
>     You are probably right, and those alignment restrictions should
>     probably have been put in the higher layers were it's needed. But,
>     please, don't change this kind of things in an unrelated bug fix.
>     It needs a separate RFE so that it gets enough scrutiny to make
>     sure that we don't break something.
>
>
> I agree with you. It is out of scope for this change.
>
> Kind Regards, Thomas
>
>     Thanks,
>     StefanK
>
>
>>         ---
>>         +  size_t extra_size = ...
>>         +    char* const end0 = start0 + bytes;
>>         +    char* const end = start + extra_size;
>>
>>         It's not obvious from the variable names what they represent,
>>         IMHO. Could you try to give them more descriptive names?
>>
>>
>>     Done.
>>
>>         I'll take a look at the tests after next review round.
>>
>>
>>     Ok, thanks for taking the time!
>>
>>     ..Thomas
>>
>>         Thanks,
>>         StefanK
>>
>>
>>>
>>>         Thanks!
>>>
>>>         ..Thomas
>>>
>>>
>>>         On Wed, Apr 15, 2015 at 1:55 PM, Stefan Karlsson
>>>         <stefan.karlsson at oracle.com
>>>         <mailto:stefan.karlsson at oracle.com>> wrote:
>>>
>>>             Hi Thomas,
>>>
>>>             Thanks for fixing this  bug.
>>>
>>>             Some comments inlined:
>>>
>>>             On 2015-04-09 17:08, Thomas Stüfe wrote:
>>>
>>>                 Hi all,
>>>
>>>                 please review this fix to huge page allocation on Linux.
>>>
>>>                 bug: https://bugs.openjdk.java.net/browse/JDK-8077276
>>>                 webrev:
>>>                 http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/
>>>                 <http://cr.openjdk.java.net/%7Estuefe/webrevs/8077276/webrev.00/webrev/>
>>>
>>>
>>>             http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/src/os/linux/vm/os_linux.cpp.udiff.html
>>>             <http://cr.openjdk.java.net/%7Estuefe/webrevs/8077276/webrev.00/webrev/src/os/linux/vm/os_linux.cpp.udiff.html>
>>>
>>>             +  const int prot = exec ?
>>>             PROT_READ|PROT_WRITE|PROT_EXEC : PROT_READ|PROT_WRITE;
>>>                 char* start;
>>>                if (req_addr != NULL) {
>>>              assert(is_ptr_aligned(req_addr, alignment), "Must be");
>>>              assert(is_size_aligned(bytes, alignment), "Must be");
>>>             -    start = os::reserve_memory(bytes, req_addr);
>>>             -    assert(start == NULL || start == req_addr, "Must be");
>>>             +    // do NOT use MAP_FIXED here - do not trash
>>>             existing mappings.
>>>             +    start = (char*) ::mmap(req_addr, bytes, prot,
>>>             + MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>>             +    if (start == MAP_FAILED) {
>>>             +      start = NULL;
>>>             +    }
>>>             +    if (!is_ptr_aligned(start, alignment)) {
>>>             +      ::munmap(start, bytes);
>>>             +      start = NULL;
>>>             +    }
>>>
>>>
>>>             Previously, the small page area was mapped with
>>>             PROT_NONE, but now you change it to RW. Is there a
>>>             reason why you changed that?
>>>
>>>
>>>                result = ::mmap(lp_start, lp_bytes, prot,
>>>              MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED|MAP_HUGETLB,
>>>                                -1, 0);
>>>                if (result == MAP_FAILED) {
>>>             - warn_on_large_pages_failure(req_addr, bytes, errno);
>>>             + warn_on_large_pages_failure(start, bytes, errno);
>>>
>>>             Shouldn't we report lp_start instead of start here?
>>>
>>>
>>>             http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/raw_files/new/src/os/linux/vm/os_linux.cpp
>>>             <http://cr.openjdk.java.net/%7Estuefe/webrevs/8077276/webrev.00/webrev/raw_files/new/src/os/linux/vm/os_linux.cpp>
>>>
>>>             There are a number of things that I would like to change
>>>             with the the test.
>>>
>>>             Can you use mapping == NULL in null checks instead of
>>>             just if (mapping). The same for the asserts.
>>>
>>>             I think the test will be easier to understand if you get
>>>             rid of the for loop and extract code into helper
>>>             functions, or do a small amount of code duplication:
>>>
>>>                 // do the test twice: once with a req_addr=NULL,
>>>             once with a non-NULL
>>>                 // req_addr; the latter shall test if we trash
>>>             existing mappings
>>>                 for (int run = 0; run < 2; run ++) {
>>>
>>>             I think you should fail or exit the test if you can't
>>>             get the expected mapping, instead of continuing and have :
>>>
>>>                   if (addr) {
>>>                     test_log("... mapped [" PTR_FORMAT "-"
>>>             PTR_FORMAT ").", addr, addr + size);
>>>                   } else {
>>>                     test_log("... failed.");
>>>                   }
>>>
>>>                   if (addr != NULL) {
>>>
>>>             Thanks,
>>>             StefanK
>>>
>>>
>>>
>>>                 os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>>>                 first establishes a
>>>                 mapping with small pages over the whole requested
>>>                 range, then exchanges the
>>>                 parts of the mapping which are aligned to large page
>>>                 size with large pages.
>>>
>>>                 Now, when establishing the first mapping, it uses
>>>                 os::reserve_memory() with
>>>                 a potentially non-null req_addr. In that case,
>>>                 os::reserve_memory() will do
>>>                 a mmap(MAP_FIXED).
>>>
>>>                 This will trash any pre-existing mappings.
>>>
>>>                 Note that I could not willingly reproduce the error
>>>                 with an unmodified VM.
>>>                 But I added a reproduction case which demonstrated
>>>                 that if one were to call
>>>                 os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>>>                 with a non-NULL
>>>                 req_addr, existing mappings would be trashed.
>>>                 Depending on where we are in
>>>                 address space, we also would overwrite libc
>>>                 structures and crash
>>>                 immediately.
>>>
>>>                 The repro case is part of the change, see changes in
>>>                 test_reserve_memory_special_huge_tlbfs_mixed(), and
>>>                 can be executed with:
>>>
>>>                 ./images/jdk/bin/java -XX:+UseLargePages
>>>                 -XX:+UseHugeTLBFS -XX:-UseSHM
>>>                 -XX:+ExecuteInternalVMTests -XX:+VerboseInternalVMTests
>>>
>>>                 The fix: instead of using os::reserve_memory(),
>>>                 os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>>>                 now calls mmap()
>>>                 directly with the non-NULL req_addr, but without
>>>                 MAP_FIXED. This means the
>>>                 OS may do its best to allocate at req_addr, but will
>>>                 not trash any
>>>                 mappings. This also follows the pattern in
>>>                 os::Linux::reserve_memory_special_huge_tlbfs_only(),
>>>                 which is a sister
>>>                 function of
>>>                 os::Linux::reserve_memory_special_huge_tlbfs_mixed().
>>>
>>>                 Note also discussion at:
>>>                 http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-April/017823.html.
>>>
>>>                 Thanks for reviewing!
>>>
>>>                 Kind Regards, Thomas
>>>
>>>
>>>
>>
>>
>
>