RFR(s): 8077276: allocating heap with UseLargePages and HugeTLBFS may trash existing memory mappings (linux)

Thomas Stüfe thomas.stuefe at gmail.com
Thu Apr 23 16:09:33 UTC 2015


Hi Stefan,

new webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.03/webrev/

Best Regards, Thomas


On Thu, Apr 23, 2015 at 4:16 PM, Stefan Karlsson <stefan.karlsson at oracle.com
> wrote:

>  Hi Thomas,
>
>
> On 2015-04-23 14:55, Thomas Stüfe wrote:
>
> Hi Stefan,
>
>  let's talk a bit more to get an understanding before I change the webrev
> again:
>
> On Thu, Apr 23, 2015 at 10:48 AM, Stefan Karlsson <
> stefan.karlsson at oracle.com> wrote:
>
>>
>>
>> On 2015-04-22 17:20, Thomas Stüfe wrote:
>>
>> Hi Stefan,
>>
>>  new webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.02/webrev/
>>
>>  See my answers inline.
>>
>> On Wed, Apr 22, 2015 at 2:20 PM, Stefan Karlsson <
>> stefan.karlsson at oracle.com> wrote:
>>
>>>  Hi Thomas,
>>>
>>> On 2015-04-21 18:31, Thomas Stüfe wrote:
>>>
>>> Hi Stefan,
>>>
>>>  thanks for reviewing!
>>>
>>>  Here is the new webrev:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/
>>>
>>>  I incorporated all your remarks except one: the test unfortunately cannot
>>> just assert if allocation fails, because that may happen sometimes,
>>> especially with small allocation sizes (os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>>> will refuse to allocate if the area cannot accommodate one large page).
>>> The test also did not assert prior to my change, so I did not want to
>>> change that behaviour.
>>>
>>>  I changed some more things:
>>>
>>>  1) Before my change, os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>>> would call os::reserve_memory() or os::reserve_memory_aligned(), depending
>>> on whether req_addr was NULL or not. That was a bit unclean (a lower-level
>>> API calling a high-level API) and led to the weird workaround at
>>> os_linux.cpp:3497, where the NMT memory tracker had to deregister an
>>> allocation which it expected to be registered again later at an upper
>>> level. Yikes.
>>>
>>> Because I changed the first case to use low-level mmap, I also changed
>>> the second - replaced os::reserve_memory_aligned with low level coding.
>>> Actually I merged both branches together as well as possible, so the new
>>> code (os_linux.cpp:3484 ff) now handles alignment as well: If req_addr is
>>> set, allocation is attempted at req_addr and alignment is ignored. If
>>> req_addr is NULL, alignment is honored. This way I could get rid of the
>>> workaround for the NMT completely.
>>>
>>>  2) I completely rewrote the test function. It now tests three
>>> scenarios, see comment. I hope the code is clear and easy to understand.
>>>
>>>
>>>
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/src/os/linux/vm/os_linux.cpp.frames.html
>>>
>>> ---
>>> +  start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>>> +    MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>> +  if (start == MAP_FAILED) {
>>>       return NULL;
>>>    }
>>>
>>> You've changed the behavior of this function. It used to return NULL if
>>> reserve_memory couldn't reserve memory at the requested address. Your
>>> proposed patch can return another address. This will be a problem for the
>>> upper layers when using compressed oops. I think it would be better if you
>>> keep the old behavior.
>>>
>>
>>  Arguably the new behaviour is better: The code which allocates the heap
>> for compressed Oops actually deals with this case quite nicely by doing a
>> range check for the address, not an identity check (see virtualspace.cpp
>> "ReservedHeapSpace::try_reserve_range()"). The other code path in
>> "ReservedSpace::initialize()" at least handles this case (if the returned
>> address differs from the requested address, it releases the memory again). So if the
>> OS cooperates by returning an address in the vicinity of the requested
>> address, it would actually be better to use this address if it fits instead
>> of retrying the allocation.
>>
>>  However, I am not completely sure about other callers of
>> os::reserve_memory_special(), and changing the behaviour would be beyond
>> the scope of my change anyway, so I restored the old behaviour as you
>> wished. Now we either return the requested address or NULL if a requested
>> address was given.
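The restored contract can be sketched as follows (a standalone illustration, not the actual os_linux.cpp code; the function name is made up): reserve with the request address as a hint only, and if the kernel places the mapping anywhere other than the requested address, release it and fail.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>

// Sketch: try to reserve at req_addr without MAP_FIXED, so no existing
// mapping is trashed. If a req_addr was given and the kernel picked a
// different spot, release the mapping and return NULL, per the old contract.
static char* reserve_at_or_fail(char* req_addr, size_t bytes) {
  char* p = (char*) ::mmap(req_addr, bytes, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  if (p == MAP_FAILED) {
    return NULL;
  }
  if (req_addr != NULL && p != req_addr) {
    // Kernel chose another address; the old behavior gives up here.
    ::munmap(p, bytes);
    return NULL;
  }
  return p;
}
```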
>>
>>
>>  Sounds good to me.
>>
>>
>  Ok, I'll keep it that way.
>
>>
>>
>>>
>>> ---
>>> Instead of inlining the reserve memory code into
>>> reserve_memory_special_huge_tlbfs_mixed, could you extract it into a
>>> function? So that we have one function that's responsible for reserving
>>> PROT_NONE memory, and one that commits the small, large, and small pages.
>>>
>>>
>>  Ok, did this. Because there is already an anon_mmap(), which fits the
>> req_addr != NULL case, I just reused this and added a sister function
>> "anon_mmap_aligned" for the alignment case.
>>
>>
>>  I didn't try to argue that you should split up the code, but rather that
>> you should take the entire reserve code you wrote for webrev.01 and move
>> it into a separate function. The code would then look like:
>>
>> char* start = reserve_memory_something(bytes, req_addr, alignment);
>> if (start == NULL) {
>>   return NULL;
>> }
>>
>> and the reader of reserve_memory_special_huge_tlbfs_mixed could simply
>> ignore the details of reserve_memory_something, unless they were important
>> for some reason. That way you could also write separate tests for
>> reserve_memory_something (or whatever it would be called).
>>
>> ---
>> You are now using anon_mmap which has the comment:
>>
>> 3070     // anon_mmap() should only get called during VM initialization,
>> 3071     // don't need lock (actually we can skip locking even it can be
>> called
>> 3072     // from multiple threads, because _highest_vm_reserved_address
>> is just a
>> 3073     // hint about the upper limit of non-stack memory regions.)
>> 3074     if ((address)addr + bytes > _highest_vm_reserved_address) {
>> 3075       _highest_vm_reserved_address = (address)addr + bytes;
>> 3076     }
>>
>> So, it's not entirely clear to me that you should be using this function.
>> I don't know what _highest_vm_reserved_address is used for. Your new
>> function, anon_mmap_aligned isn't updating this variable.
>>
>>
>  The comment is obsolete and does not reflect reality; anon_mmap is used
> for os::reserve_memory(), which in turn can be used at any time in the life
> of the VM.
>
>  The comment - and the associated variable "_highest_vm_reserved_address"
> - only affects the LinuxThreads implementation, which is legacy and should
> be removed (I just asked this question on hotspot-dev and will do the
> cleanup if I get the green light).
>
>  It has also been broken forever, since a lot of allocations happen in the
> VM and the JDK without _highest_vm_reserved_address being updated.
>
>
> OK. It would be good to get rid of it at some point then.
>
>
>
>>  Could we go back and use the reserve code you had in webrev.01, so that
>> we don't have to consider the impact of this?
>>
>>
>  It actually changes less than my last webrev:
>
>  - old code was using os::reserve_memory() and
> os::reserve_memory_aligned().
>    os::reserve_memory() directly called anon_mmap().
>    os::reserve_memory_aligned() called os::reserve_memory(), which
> directly called anon_mmap().
>
>  - the new code:
>     for the req_addr != NULL case, it calls anon_mmap directly; the only
> difference from before is that we now do not specify MAP_FIXED.
>     for the req_addr==NULL case, it calls the new function
> anon_mmap_aligned(), which did not exist before, so no compatibility issues
> can arise.
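For the req_addr == NULL case, an over-reserve-and-trim approach behind a function like anon_mmap_aligned() could look roughly like this (a hedged sketch with illustrative names; the real hotspot implementation differs in details, and alignment is assumed to be a power of two at least as large as the page size):

```cpp
#include <sys/mman.h>
#include <stdint.h>
#include <cassert>
#include <cstddef>

// Sketch: reserve 'alignment' extra bytes, then unmap the misaligned head
// and the unused tail, leaving an aligned reservation of exactly 'bytes'.
static char* anon_mmap_aligned_sketch(size_t bytes, size_t alignment) {
  size_t extra = bytes + alignment;
  char* start = (char*) ::mmap(NULL, extra, PROT_NONE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                               -1, 0);
  if (start == MAP_FAILED) {
    return NULL;
  }
  char* aligned = (char*) (((uintptr_t) start + alignment - 1)
                           & ~(uintptr_t) (alignment - 1));
  // Trim the unaligned head...
  if (aligned > start) {
    ::munmap(start, aligned - start);
  }
  // ...and the tail beyond the requested size.
  char* end = start + extra;
  if (aligned + bytes < end) {
    ::munmap(aligned + bytes, end - (aligned + bytes));
  }
  return aligned;
}
```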
>
>  I can live with both my webrev.01 proposal and this one, but the
> solution to factor out all the allocation to a single function would be my
> least favorite choice: I do not really like factoring out a function which
> only exists to be called from a single other function, and which is too
> awkward to be used any other way.
>
>
> The reason for moving it out to a function is to get better "separation of
> concerns" and to hide the implementation details; it doesn't really matter
> if it's used only once.
>
>  And your "reserve_memory_something()" would be quite awkward: bundling
> mmap allocation for the req_addr case (which duplicates what anon_mmap()
> does), plus adding support for allocating at an alignment. The two do not
> easily fit together, imho.
>
>
> But you almost already had the "awkward" code in webrev.01, and it was
> quite easy to understand. With the requirement that req_addr is aligned if
> an alignment is passed in, the code would be even easier. It could
> probably be used to implement anon_mmap with just a few small changes, and
> by doing that we would get rid of some code duplication as well. Not sure
> it's worth it though.
>
>
>  Of course, it is all a matter of taste, and if you insist, I'll do it.
> Most important is to fix the bug.
>
>
> OK. Then, please, extract it out to a separate function so that
> reserve_memory_special_huge_tlbfs_mixed only has one call out to a
> function that reserves the memory. It's OK if you want to implement it as
> an anon_mmap/anon_mmap_aligned pair or inline the code (as in webrev.01),
> as long as I don't have to see the code in
> reserve_memory_special_huge_tlbfs_mixed.
>
> Thanks,
> StefanK
>
>
>
>     ---
>>> -  size_t large_page_size = os::large_page_size();
>>> +  char* start;
>>>
>>> +  size_t large_page_size = os::large_page_size();
>>>
>>> There's no need anymore to declare 'start' before it's set. So this
>>> should be incorporated into:
>>>
>>> 3489   start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>>> 3490     MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>>
>>> ---
>>> 3480   // first reserve - but not commit - the address range in small
>>> pages.
>>> ...
>>> 3529   // Commit small-paged leading area
>>>
>>> Could you start your comments with a capital and with a period?
>>>
>>
>>  Done. Note that there are comments where the first word is a parameter
>> name - if that name is lowercase, I would rather not capitalize it.
>>
>>
>>  3082 // reserve small-paged memory, optionally aligned to given boundary.
>> 3083 // alignment must be a multiple of allocation granularity. bytes
>> must be
>>
>> and some in the test as well.
>>
>
>  Okay, will change.
>
>>
>>
>>>
>>> ---
>>> 3481   // Note: if a wish address was given
>>>
>>> There's no other reference to a "wish address". Could you reword it to
>>> be req_addr or "requested address"?
>>>
>>
>>  Done.
>>
>>
>>>
>>> ---
>>> 3481   // Note: if a wish address was given, we still do not use
>>> MAP_FIXED - do not want to trash
>>> 3482   // existing mappings. We hand down req_addr to mmap and hope the
>>> OS does its best.
>>>
>>> Could you move this comment down to the mmap call?
>>>
>>
>>  Done.
>>
>>
>>>
>>> ---
>>> 3529   // Commit small-paged leading area
>>> ...
>>> 3540   // Commit large paged area
>>>
>>> Shouldn't it say large-paged as well?
>>>
>>
>>  Done.
>>
>>
>>>
>>> ---
>>> 3483   // Note2: if wish address was given, alignment is ignored
>>>
>>> Previously, we asserted that our in-parameters were aligned. Why did you
>>> replace that with a comment?
>>>
>>>
>>  Ok, I restored the asserts, but I really am not happy with that. I find
>> the asserts way too strict and they limit the usefulness of the functions:
>>
>>  - "assert(is_ptr_aligned(req_addr, alignment), "Must be");"
>>
>>  Why must this be? Why do I even have to input an alignment value if I
>> input a requested address? It would make more sense just to say that
>> alignment is ignored if req_addr is set.
>> Note that this was not really well tested beforehand, because there were
>> no tests for a non-null req_addr.
>>
>>  - "assert(is_size_aligned(bytes, alignment), "Must be");"
>>
>>  This is even worse. Why should the size of the allocation be aligned to
>> the alignment? Why can't I allocate a small chunk of memory at a larger
>> alignment? I do not understand this limit at all.
>>
>>  But ok, I restored the asserts and had to limit the test accordingly so
>> as not to trigger those asserts.
>>
>>
>>  You are probably right, and those alignment restrictions should
>> probably have been put in the higher layers where they're needed. But,
>> please, don't change this kind of thing in an unrelated bug fix. It needs
>> a separate RFE so that it gets enough scrutiny to make sure that we don't
>> break something.
>>
>>
>  I agree with you. It is out of scope for this change.
>
>  Kind Regards, Thomas
>
>
>
>>  Thanks,
>> StefanK
>>
>>
>>
>>
>>>  ---
>>> +  size_t extra_size = ...
>>> +    char* const end0 = start0 + bytes;
>>> +    char* const end = start + extra_size;
>>>
>>> It's not obvious from the variable names what they represent, IMHO.
>>> Could you try to give them more descriptive names?
>>>
>>>
>>  Done.
>>
>>
>>>  I'll take a look at the tests after next review round.
>>>
>>>
>>  Ok, thanks for taking the time!
>>
>>  ..Thomas
>>
>>
>>>  Thanks,
>>> StefanK
>>>
>>>
>>>
>>>  Thanks!
>>>
>>>  ..Thomas
>>>
>>>
>>> On Wed, Apr 15, 2015 at 1:55 PM, Stefan Karlsson <
>>> stefan.karlsson at oracle.com> wrote:
>>>
>>>> Hi Thomas,
>>>>
>>>> Thanks for fixing this bug.
>>>>
>>>> Some comments inlined:
>>>>
>>>> On 2015-04-09 17:08, Thomas Stüfe wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> please review this fix to huge page allocation on Linux.
>>>>>
>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8077276
>>>>> webrev:
>>>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/
>>>>>
>>>>
>>>>
>>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/src/os/linux/vm/os_linux.cpp.udiff.html
>>>>
>>>> +  const int prot = exec ? PROT_READ|PROT_WRITE|PROT_EXEC :
>>>> PROT_READ|PROT_WRITE;
>>>>     char* start;
>>>>    if (req_addr != NULL) {
>>>>      assert(is_ptr_aligned(req_addr, alignment), "Must be");
>>>>      assert(is_size_aligned(bytes, alignment), "Must be");
>>>> -    start = os::reserve_memory(bytes, req_addr);
>>>> -    assert(start == NULL || start == req_addr, "Must be");
>>>> +    // do NOT use MAP_FIXED here - do not trash existing mappings.
>>>> +    start = (char*) ::mmap(req_addr, bytes, prot,
>>>> +      MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>>> +    if (start == MAP_FAILED) {
>>>> +      start = NULL;
>>>> +    }
>>>> +    if (!is_ptr_aligned(start, alignment)) {
>>>> +      ::munmap(start, bytes);
>>>> +      start = NULL;
>>>> +    }
>>>>
>>>>
>>>> Previously, the small page area was mapped with PROT_NONE, but now you
>>>> change it to RW. Is there a reason why you changed that?
>>>>
>>>>
>>>>    result = ::mmap(lp_start, lp_bytes, prot,
>>>>                    MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED|MAP_HUGETLB,
>>>>                    -1, 0);
>>>>    if (result == MAP_FAILED) {
>>>> -    warn_on_large_pages_failure(req_addr, bytes, errno);
>>>> +    warn_on_large_pages_failure(start, bytes, errno);
>>>>
>>>> Shouldn't we report lp_start instead of start here?
>>>>
>>>>
>>>>
>>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/raw_files/new/src/os/linux/vm/os_linux.cpp
>>>>
>>>> There are a number of things that I would like to change with the
>>>> test.
>>>>
>>>> Can you use mapping == NULL in null checks instead of just if
>>>> (mapping)? The same for the asserts.
>>>>
>>>> I think the test will be easier to understand if you get rid of the for
>>>> loop and extract code into helper functions, or do a small amount of code
>>>> duplication:
>>>>
>>>>     // do the test twice: once with a req_addr=NULL, once with a
>>>> non-NULL
>>>>     // req_addr; the latter shall test if we trash existing mappings
>>>>     for (int run = 0; run < 2; run ++) {
>>>>
>>>> I think you should fail or exit the test if you can't get the expected
>>>> mapping, instead of continuing and having:
>>>>
>>>>       if (addr) {
>>>>         test_log("... mapped [" PTR_FORMAT "-" PTR_FORMAT ").", addr,
>>>> addr + size);
>>>>       } else {
>>>>         test_log("... failed.");
>>>>       }
>>>>
>>>>       if (addr != NULL) {
>>>>
>>>> Thanks,
>>>> StefanK
>>>>
>>>>
>>>>
>>>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() first establishes
>>>>> a mapping with small pages over the whole requested range, then
>>>>> exchanges the parts of the mapping which are aligned to large page
>>>>> size with large pages.
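The head/middle/tail split described above boils down to two alignment operations (a standalone sketch with illustrative names, not the actual hotspot code; lp_size must be a power of two):

```cpp
#include <stdint.h>
#include <cassert>
#include <cstddef>

struct MixedLayout {
  uintptr_t small_head_end;  // first large-page-aligned address (lp_start)
  uintptr_t large_end;       // end of the large-paged middle (lp_end)
};

// Sketch: compute where the small-paged head ends and the large-paged
// middle ends inside [start, start + bytes). Everything before
// small_head_end and after large_end stays small-paged.
static MixedLayout split_mixed(uintptr_t start, size_t bytes, size_t lp_size) {
  MixedLayout l;
  l.small_head_end = (start + lp_size - 1) & ~(uintptr_t)(lp_size - 1); // align up
  l.large_end      = (start + bytes)       & ~(uintptr_t)(lp_size - 1); // align down
  return l;
}
```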
>>>>>
>>>>> Now, when establishing the first mapping, it uses os::reserve_memory()
>>>>> with a potentially non-null req_addr. In that case,
>>>>> os::reserve_memory() will do a mmap(MAP_FIXED).
>>>>>
>>>>> This will trash any pre-existing mappings.
>>>>>
>>>>> Note that I could not reproduce the error at will with an unmodified
>>>>> VM. But I added a reproduction case which demonstrated that if one
>>>>> were to call os::Linux::reserve_memory_special_huge_tlbfs_mixed() with
>>>>> a non-NULL req_addr, existing mappings would be trashed. Depending on
>>>>> where we are in address space, we could also overwrite libc structures
>>>>> and crash immediately.
>>>>>
>>>>> The repro case is part of the change, see the changes in
>>>>> test_reserve_memory_special_huge_tlbfs_mixed(), and can be executed
>>>>> with:
>>>>>
>>>>> ./images/jdk/bin/java -XX:+UseLargePages -XX:+UseHugeTLBFS -XX:-UseSHM
>>>>> -XX:+ExecuteInternalVMTests -XX:+VerboseInternalVMTests
>>>>>
>>>>> The fix: instead of using os::reserve_memory(),
>>>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() now calls mmap()
>>>>> directly with the non-NULL req_addr, but without MAP_FIXED. This means
>>>>> the OS may do its best to allocate at req_addr, but will not trash any
>>>>> mappings. This also follows the pattern in
>>>>> os::Linux::reserve_memory_special_huge_tlbfs_only(), which is a sister
>>>>> function of os::Linux::reserve_memory_special_huge_tlbfs_mixed().
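The core of the fix — a hint-only mmap() never tears down an existing mapping — can be demonstrated in isolation (a Linux-specific standalone sketch, not the hotspot code itself):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Sketch: create a writable mapping with known contents, then ask mmap()
// for that same address WITHOUT MAP_FIXED. The kernel places the new
// mapping elsewhere and leaves the existing one untouched. With MAP_FIXED,
// the existing mapping (and its data) would have been silently replaced.
static bool hinted_mmap_preserves_existing(size_t bytes) {
  char* existing = (char*) ::mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (existing == MAP_FAILED) {
    return false;
  }
  existing[0] = 'X';  // Data that MAP_FIXED would have destroyed.

  char* p = (char*) ::mmap(existing, bytes, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  // The old mapping must still be intact, and the new one placed elsewhere.
  bool preserved = (existing[0] == 'X')
                   && (p == MAP_FAILED || p != existing);
  if (p != MAP_FAILED) {
    ::munmap(p, bytes);
  }
  ::munmap(existing, bytes);
  return preserved;
}
```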
>>>>>
>>>>> Note also the discussion at:
>>>>> http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-April/017823.html
>>>>>
>>>>> Thanks for reviewing!
>>>>>
>>>>> Kind Regards, Thomas
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>


More information about the hotspot-runtime-dev mailing list