RFR(s): 8077276: allocating heap with UseLargePages and HugeTLBFS may trash existing memory mappings (linux)

Thu Apr 23 12:55:44 UTC 2015

Hi Stefan,

lets talk a bit more to get an understanding before I change the webrev
again:

On Thu, Apr 23, 2015 at 10:48 AM, Stefan Karlsson <
stefan.karlsson at oracle.com> wrote:

>
>
> On 2015-04-22 17:20, Thomas Stüfe wrote:
>
> Hi Stefan,
>
>  new webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.02/webrev/
>
>  See my answers inline.
>
> On Wed, Apr 22, 2015 at 2:20 PM, Stefan Karlsson <
> stefan.karlsson at oracle.com> wrote:
>
>>  Hi Thomas,
>>
>> On 2015-04-21 18:31, Thomas Stüfe wrote:
>>
>> Hi Stefan,
>>
>>  thanks for reviewing!
>>
>>  Here is the new webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/
>>
>>  I worked in all your remarks except one: the test unfortunately cannot
>> just assert if allocation fails, because that may happen sometimes,
>> especially with small allocation sizes (os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>> will refuse to allocate if the area cannot accomodate one large page).
>> The test also did not assert prior to my change, so I did not want to
>> change that behaviour.
>>
>>  I changed some more things:
>>
>>  1) Before my change, os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>> would call os::reserve_memory() or os::reserve_memory_aligned(), depending
>> on whether req_addr was NULL or not. That was a bit unclean (a lower-level
>> API calling a high-level API) and led to the weird workaround at
>> os_linux.cpp:3497, where the NMT Memory tracker had to deregister a
>> allocation which it expected to be registered again later at an upper
>> level. Yikes.
>>
>> Because I changed the first case to use low-level mmap, I also changed
>> the second - replaced os::reserve_memory_aligned with low level coding.
>> Actually I merged both branches together as good as possible, so the new
>> code (os_linux.cpp:3484 ff) now handles alignment as well: If req_addr is
>> set, allocation is attempted at req_addr and alignment is ignored. If
>> req_addr is NULL, alignment is honored. This way I could get rid of the
>> workaround for the NMT completely.
>>
>>  2) I completely rewrote the test function. It now tests three
>> scenarios, see comment. I hope the code is clear and easy to understand.
>>
>>
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/src/os/linux/vm/os_linux.cpp.frames.html
>>
>> ---
>> +  start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>> +    MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>> +  if (start == MAP_FAILED) {
>>       return NULL;
>>    }
>>
>> You've changed the behavior of this function. It used to return NULL if
>> reserve_memory couldn't reserve memory at the requested address. Your
>> proposed patch can return another address. This will be a problem for the
>> upper layers when using compressed oops. I think it would be better if you
>> keep the old behavior.
>>
>
>  Arguably the new behaviour is better: The code which allocates the heap
> for compressed Oops actually deals with this case quite nicely by doing a
> range check for the address, not an identity check (see virtualspace.cpp
> "ReservedHeapSpace::try_reserve_range()"). The other code path in
> "ReservedSpace::initialize()" at least handles this case (if returned
> address is unequal to requested address, releases memory again). So if the
> OS cooperates by returning an address in the vicinity of the requested
> address, it would actually be better to use this address if it fits instead
> of retrying the allocation.
>
>  However, I am not completely sure about other callers of
> os::reserve_memory_special(), and changing the behaviour would be beyond
> the scope of my change anyway, so I restored the old behaviour as you
> wished. Now we either return the requested address or NULL if a requested
> address was given.
>
>
> Sounds good to me.
>
>
Ok, I'll keep it that way.

>
>
>>
>> ---
>> Instead of inlining the reserve memory code into
>> reserve_memory_special_huge_tlbfs_mixed, could you extract it into
>> function? So that we have one function that's responsible for reservinb
>> PROT_NONE memory, and one that commits the small, large, small pages.
>>
>>
>  Ok, did this. Because there is already a anon_mmap(), which fits the
> req_addr != NULL case, I just reused this and added a sister function
> "anon_mmap_aligned" for the alignment case.
>
>
> I didn't try to argue that you should split up the code, but rather move
> the entire reserve code that you wrote for webrev.01, and move it into a
> separate function. The code would then look like:
>
> char* start = reserve_memory_something(bytes, req_addr, alignment);
> if (start == NULL) {
>   return NULL;
> }
>
> and the reader of reserve_memory_special_huge_tlbfs_mixed could simply
> ignore the details of reserve_memory_something, unless they were important
> for some reason. That way you could also write separate tests for
> reserve_memory_something (or whatever it would be called).
>
> ---
> You are now using anon_mmap which has the comment:
>
> 3070     // anon_mmap() should only get called during VM initialization,
> 3071     // don't need lock (actually we can skip locking even it can be
> called
> 3072     // from multiple threads, because _highest_vm_reserved_address is
> just a
> 3073     // hint about the upper limit of non-stack memory regions.)
> 3074     if ((address)addr + bytes > _highest_vm_reserved_address) {
> 3075       _highest_vm_reserved_address = (address)addr + bytes;
> 3076     }
>
> So, it's not entirely clear to me that you should be using this function.
> I don't know what _highest_vm_reserved_address is used for. Your new
> function, anon_mmap_aligned isn't updating this variable.
>
>
The comment is obsolete and does not reflect reality; anon_mmap is used for
os::reserve_memory(), which in turn can be used at any time in the life of
the VM.

The comment - and the associated variable "_highest_vm_reserved_address" -
only affects the LinuxThreads implementation which is legacy and should be
removed (I just did ask this question on hotspot-dev and will do the
cleanup if I get green light).

It is also broken since forever, since a lot of allocations happen in VM
and JDK where _highest_vm_reserved_address is just not updated.

> Could we go back and use the reserve code you had in webrev.01, so that we
> don't have to consider the impact of this?
>
>
It actually changes less than my last webrev:

- old code was using os::reserve_memory() and os::reserve_memory_aligned().
   os::reserve_memory() directly called anon_mmap().
   os::reserve_memory_aligned() called os::reserve_memory(), which directly
called anon_mmap().

- the new code:
    for the req_addr != NULL, calls anon_mmap directly, only difference to
before is that we now do not specify MAP_FIXED.
    for the req_addr==NULL case, it calls the new function
anon_mmap_aligned(), which did not exist before, so no compatibility issues
can arise.

I can live with both my webrev.01 proposal and this one, but the solution
to factor out all the allocation to a single function would be my least
favorite choice: I do not really like factoring out a function which only
exists to be called from a single other function, and which is too awkward
to be used any other way. And your "reserved_memory_something()"  would be
quite awkward: bundling mmap allocation for the req_addr case (which
duplicates what anon_mmap() does), plus adding support for allocating at an
alignment. It does not easily fit together imho.

Of course, it is all a matter of taste, and if you insist, I'll do it. Most
important is to fix the bug.

---
>> -  size_t large_page_size = os::large_page_size();
>> +  char* start;
>>
>> +  size_t large_page_size = os::large_page_size();
>>
>> There's no need anymore to declare 'start' before it's set. So this
>> should be incorporated into:
>>
>> 3489   start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>> 3490     MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>
>> ---
>> 3480   // first reserve - but not commit - the address range in small
>> pages.
>> ...
>> 3529   // Commit small-paged leading area
>>
>> Could you start your comments with a capital and with a period?
>>
>
>  Done. Note that there are comments where the first word is a parameter
> name - if that name is lowercase, I would rather not write it upper case.
>
>
> 3082 // reserve small-paged memory, optionally aligned to given boundary.
> 3083 // alignment must be a multiple of allocation granularity. bytes must
> be
>
> and some in the test as well.
>

Okay, will change.

>
>
>>
>> ---
>> 3481   // Note: if a wish address was given
>>
>> There's no other reference to a "wish address". Could you reword it to be
>> req_addr or "requested address"?
>>
>
>  Done.
>
>
>>
>> ---
>> 3481   // Note: if a wish address was given, we still do not use
>> MAP_FIXED - do not want to trash
>> 3482   // existing mappings. We hand down req_addr to mmap and hope the
>> OS does its best.
>>
>> Could you move this comment down to the mmap call?
>>
>
>  Done.
>
>
>>
>> ---
>> 3529   // Commit small-paged leading area
>> ...
>> 3540   // Commit large paged area
>>
>> Shouldn't it say large-paged as well?
>>
>
>  Done.
>
>
>>
>> ---
>> 3483   // Note2: if wish address was given, alignment is ignored
>>
>> Previously, we asserted that our in-parameters were aligned. Why did you
>> replace that with a comment?
>>
>>
>  Ok, I restored the asserts, but I really am not happy with that. I find
> the asserts way too strict and they limit the usefulness of the functions:
>
>  - "assert(is_ptr_aligned(req_addr, alignment), "Must be");"
>
>  Why must this be? Why do I even have to input an alignment value if I
> input a requested address? It would make more sense just to say that
> alignment is ignored if req_addr is set.
> Note that this was not really well tested beforehand, because there were
> no tests for a non-null req_addr.
>
>  - "assert(is_size_aligned(bytes, alignment), "Must be");"
>
>  This is even worse. Why should the size of the allocation be aligned to
> the alignment? Why can't I allocate a small chunk of memory at a larger
> alignment? I do not understand this limit at all.
>
>  But ok, I restored the asserts and had to limit the test accordingly to
> not to trigger those asserts.
>
>
> You are probably right, and those alignment restrictions should probably
> have been put in the higher layers were it's needed. But, please, don't
> change this kind of things in an unrelated bug fix. It needs a separate RFE
> so that it gets enough scrutiny to make sure that we don't break something.
>
>
I agree with you. It is out of scope for this change.

Kind Regards, Thomas

> Thanks,
> StefanK
>
>
>
>
>>  ---
>> +  size_t extra_size = ...
>> +    char* const end0 = start0 + bytes;
>> +    char* const end = start + extra_size;
>>
>> It's not obvious from the variable names what they represent, IMHO. Could
>> you try to give them more descriptive names?
>>
>>
>  Done.
>
>
>>  I'll take a look at the tests after next review round.
>>
>>
>  Ok, thanks for taking the time!
>
>  ..Thomas
>
>
>>  Thanks,
>> StefanK
>>
>>
>>
>>  Thanks!
>>
>>  ..Thomas
>>
>>
>> On Wed, Apr 15, 2015 at 1:55 PM, Stefan Karlsson <
>> stefan.karlsson at oracle.com> wrote:
>>
>>> Hi Thomas,
>>>
>>> Thanks for fixing this  bug.
>>>
>>> Some comments inlined:
>>>
>>> On 2015-04-09 17:08, Thomas Stüfe wrote:
>>>
>>>> Hi all,
>>>>
>>>> please review this fix to huge page allocation on Linux.
>>>>
>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8077276
>>>> webrev:
>>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/
>>>>
>>>
>>>
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/src/os/linux/vm/os_linux.cpp.udiff.html
>>>
>>> +  const int prot = exec ? PROT_READ|PROT_WRITE|PROT_EXEC :
>>> PROT_READ|PROT_WRITE;
>>>     char* start;
>>>    if (req_addr != NULL) {
>>>      assert(is_ptr_aligned(req_addr, alignment), "Must be");
>>>      assert(is_size_aligned(bytes, alignment), "Must be");
>>> -    start = os::reserve_memory(bytes, req_addr);
>>> -    assert(start == NULL || start == req_addr, "Must be");
>>> +    // do NOT use MAP_FIXED here - do not trash existing mappings.
>>> +    start = (char*) ::mmap(req_addr, bytes, prot,
>>> +      MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>> +    if (start == MAP_FAILED) {
>>> +      start = NULL;
>>> +    }
>>> +    if (!is_ptr_aligned(start, alignment)) {
>>> +      ::munmap(start, bytes);
>>> +      start = NULL;
>>> +    }
>>>
>>>
>>> Previously, the small page area was mapped with PROT_NONE, but now you
>>> change it to RW. Is there a reason why you changed that?
>>>
>>>
>>>    result = ::mmap(lp_start, lp_bytes, prot,
>>>                    MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED|MAP_HUGETLB,
>>>                    -1, 0);
>>>    if (result == MAP_FAILED) {
>>> -    warn_on_large_pages_failure(req_addr, bytes, errno);
>>> +    warn_on_large_pages_failure(start, bytes, errno);
>>>
>>> Shouldn't we report lp_start instead of start here?
>>>
>>>
>>>
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/raw_files/new/src/os/linux/vm/os_linux.cpp
>>>
>>> There are a number of things that I would like to change with the the
>>> test.
>>>
>>> Can you use mapping == NULL in null checks instead of just if (mapping).
>>> The same for the asserts.
>>>
>>> I think the test will be easier to understand if you get rid of the for
>>> loop and extract code into helper functions, or do a small amount of code
>>> duplication:
>>>
>>>     // do the test twice: once with a req_addr=NULL, once with a non-NULL
>>>     // req_addr; the latter shall test if we trash existing mappings
>>>     for (int run = 0; run < 2; run ++) {
>>>
>>> I think you should fail or exit the test if you can't get the expected
>>> mapping, instead of continuing and have :
>>>
>>>       if (addr) {
>>>         test_log("... mapped [" PTR_FORMAT "-" PTR_FORMAT ").", addr,
>>> addr + size);
>>>       } else {
>>>         test_log("... failed.");
>>>       }
>>>
>>>       if (addr != NULL) {
>>>
>>> Thanks,
>>> StefanK
>>>
>>>
>>>
>>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() first establishes a
>>>> mapping with small pages over the whole requested range, then exchanges
>>>> the
>>>> parts of the mapping which are aligned to large page size with large
>>>> pages.
>>>>
>>>> Now, when establishing the first mapping, it uses os::reserve_memory()
>>>> with
>>>> a potentially non-null req_addr. In that case, os::reserve_memory()
>>>> will do
>>>> a mmap(MAP_FIXED).
>>>>
>>>> This will trash any pre-existing mappings.
>>>>
>>>> Note that I could not willingly reproduce the error with an unmodified
>>>> VM.
>>>> But I added a reproduction case which demonstrated that if one were to
>>>> call
>>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() with a non-NULL
>>>> req_addr, existing mappings would be trashed. Depending on where we are
>>>> in
>>>> address space, we also would overwrite libc structures and crash
>>>> immediately.
>>>>
>>>> The repro case is part of the change, see changes in
>>>> test_reserve_memory_special_huge_tlbfs_mixed(), and can be executed
>>>> with:
>>>>
>>>> ./images/jdk/bin/java -XX:+UseLargePages -XX:+UseHugeTLBFS -XX:-UseSHM
>>>> -XX:+ExecuteInternalVMTests -XX:+VerboseInternalVMTests
>>>>
>>>> The fix: instead of using os::reserve_memory(),
>>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() now calls mmap()
>>>> directly with the non-NULL req_addr, but without MAP_FIXED. This means
>>>> the
>>>> OS may do its best to allocate at req_addr, but will not trash any
>>>> mappings. This also follows the pattern in
>>>> os::Linux::reserve_memory_special_huge_tlbfs_only(), which is a sister
>>>> function of os::Linux::reserve_memory_special_huge_tlbfs_mixed().
>>>>
>>>> Note also discussion at:
>>>>
>>>> http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-April/017823.html
>>>> .
>>>>
>>>> Thanks for reviewing!
>>>>
>>>> Kind Regards, Thomas
>>>>
>>>
>>>
>>
>>
>
>