RFR(s): 8077276: allocating heap with UseLargePages and HugeTLBFS may trash existing memory mappings (linux)

Thomas Stüfe thomas.stuefe at gmail.com
Wed Apr 22 15:20:00 UTC 2015


Hi Stefan,

new webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.02/webrev/

See my answers inline.

On Wed, Apr 22, 2015 at 2:20 PM, Stefan Karlsson <stefan.karlsson at oracle.com
> wrote:

>  Hi Thomas,
>
> On 2015-04-21 18:31, Thomas Stüfe wrote:
>
> Hi Stefan,
>
>  thanks for reviewing!
>
>  Here is the new webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/
>
>  I worked in all your remarks except one: the test unfortunately cannot
> just assert if allocation fails, because that may happen sometimes,
> especially with small allocation sizes (os::Linux::reserve_memory_special_huge_tlbfs_mixed()
> will refuse to allocate if the area cannot accomodate one large page).
> The test also did not assert prior to my change, so I did not want to
> change that behaviour.
>
>  I changed some more things:
>
>  1) Before my change, os::Linux::reserve_memory_special_huge_tlbfs_mixed()
> would call os::reserve_memory() or os::reserve_memory_aligned(), depending
> on whether req_addr was NULL or not. That was a bit unclean (a lower-level
> API calling a high-level API) and led to the weird workaround at
> os_linux.cpp:3497, where the NMT Memory tracker had to deregister a
> allocation which it expected to be registered again later at an upper
> level. Yikes.
>
> Because I changed the first case to use low-level mmap, I also changed the
> second - replaced os::reserve_memory_aligned with low level coding.
> Actually I merged both branches together as good as possible, so the new
> code (os_linux.cpp:3484 ff) now handles alignment as well: If req_addr is
> set, allocation is attempted at req_addr and alignment is ignored. If
> req_addr is NULL, alignment is honored. This way I could get rid of the
> workaround for the NMT completely.
>
>  2) I completely rewrote the test function. It now tests three scenarios,
> see comment. I hope the code is clear and easy to understand.
>
>
>
> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/src/os/linux/vm/os_linux.cpp.frames.html
>
> ---
> +  start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
> +    MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
> +  if (start == MAP_FAILED) {
>      return NULL;
>    }
>
> You've changed the behavior of this function. It used to return NULL if
> reserve_memory couldn't reserve memory at the requested address. Your
> proposed patch can return another address. This will be a problem for the
> upper layers when using compressed oops. I think it would be better if you
> keep the old behavior.
>

Arguably the new behaviour is better: The code which allocates the heap for
compressed Oops actually deals with this case quite nicely by doing a range
check for the address, not an identity check (see virtualspace.cpp
"ReservedHeapSpace::try_reserve_range()"). The other code path in
"ReservedSpace::initialize()" at least handles this case (if returned
address is unequal to requested address, releases memory again). So if the
OS cooperates by returning an address in the vicinity of the requested
address, it would actually be better to use this address if it fits instead
of retrying the allocation.

However, I am not completely sure about other callers of
os::reserve_memory_special(), and changing the behaviour would be beyond
the scope of my change anyway, so I restored the old behaviour as you
wished. Now we either return the requested address or NULL if a requested
address was given.


>
> ---
> Instead of inlining the reserve memory code into
> reserve_memory_special_huge_tlbfs_mixed, could you extract it into
> function? So that we have one function that's responsible for reservinb
> PROT_NONE memory, and one that commits the small, large, small pages.
>
>
Ok, did this. Because there is already a anon_mmap(), which fits the
req_addr != NULL case, I just reused this and added a sister function
"anon_mmap_aligned" for the alignment case.


> ---
> -  size_t large_page_size = os::large_page_size();
> +  char* start;
>
> +  size_t large_page_size = os::large_page_size();
>
> There's no need anymore to declare 'start' before it's set. So this should
> be incorporated into:
>
> 3489   start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
> 3490     MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>
> ---
> 3480   // first reserve - but not commit - the address range in small
> pages.
> ...
> 3529   // Commit small-paged leading area
>
> Could you start your comments with a capital and with a period?
>

Done. Note that there are comments where the first word is a parameter name
- if that name is lowercase, I would rather not write it upper case.


>
> ---
> 3481   // Note: if a wish address was given
>
> There's no other reference to a "wish address". Could you reword it to be
> req_addr or "requested address"?
>

Done.


>
> ---
> 3481   // Note: if a wish address was given, we still do not use MAP_FIXED
> - do not want to trash
> 3482   // existing mappings. We hand down req_addr to mmap and hope the OS
> does its best.
>
> Could you move this comment down to the mmap call?
>

Done.


>
> ---
> 3529   // Commit small-paged leading area
> ...
> 3540   // Commit large paged area
>
> Shouldn't it say large-paged as well?
>

Done.


>
> ---
> 3483   // Note2: if wish address was given, alignment is ignored
>
> Previously, we asserted that our in-parameters were aligned. Why did you
> replace that with a comment?
>
>
Ok, I restored the asserts, but I really am not happy with that. I find the
asserts way too strict and they limit the usefulness of the functions:

- "assert(is_ptr_aligned(req_addr, alignment), "Must be");"

Why must this be? Why do I even have to input an alignment value if I input
a requested address? It would make more sense just to say that alignment is
ignored if req_addr is set.
Note that this was not really well tested beforehand, because there were no
tests for a non-null req_addr.

- "assert(is_size_aligned(bytes, alignment), "Must be");"

This is even worse. Why should the size of the allocation be aligned to the
alignment? Why can't I allocate a small chunk of memory at a larger
alignment? I do not understand this limit at all.

But ok, I restored the asserts and had to limit the test accordingly to not
to trigger those asserts.


> ---
> +  size_t extra_size = ...
> +    char* const end0 = start0 + bytes;
> +    char* const end = start + extra_size;
>
> It's not obvious from the variable names what they represent, IMHO. Could
> you try to give them more descriptive names?
>
>
Done.


> I'll take a look at the tests after next review round.
>
>
Ok, thanks for taking the time!

..Thomas


> Thanks,
> StefanK
>
>
>
>  Thanks!
>
>  ..Thomas
>
>
> On Wed, Apr 15, 2015 at 1:55 PM, Stefan Karlsson <
> stefan.karlsson at oracle.com> wrote:
>
>> Hi Thomas,
>>
>> Thanks for fixing this  bug.
>>
>> Some comments inlined:
>>
>> On 2015-04-09 17:08, Thomas Stüfe wrote:
>>
>>> Hi all,
>>>
>>> please review this fix to huge page allocation on Linux.
>>>
>>> bug: https://bugs.openjdk.java.net/browse/JDK-8077276
>>> webrev:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/
>>>
>>
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/src/os/linux/vm/os_linux.cpp.udiff.html
>>
>> +  const int prot = exec ? PROT_READ|PROT_WRITE|PROT_EXEC :
>> PROT_READ|PROT_WRITE;
>>     char* start;
>>    if (req_addr != NULL) {
>>      assert(is_ptr_aligned(req_addr, alignment), "Must be");
>>      assert(is_size_aligned(bytes, alignment), "Must be");
>> -    start = os::reserve_memory(bytes, req_addr);
>> -    assert(start == NULL || start == req_addr, "Must be");
>> +    // do NOT use MAP_FIXED here - do not trash existing mappings.
>> +    start = (char*) ::mmap(req_addr, bytes, prot,
>> +      MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>> +    if (start == MAP_FAILED) {
>> +      start = NULL;
>> +    }
>> +    if (!is_ptr_aligned(start, alignment)) {
>> +      ::munmap(start, bytes);
>> +      start = NULL;
>> +    }
>>
>>
>> Previously, the small page area was mapped with PROT_NONE, but now you
>> change it to RW. Is there a reason why you changed that?
>>
>>
>>    result = ::mmap(lp_start, lp_bytes, prot,
>>                    MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED|MAP_HUGETLB,
>>                    -1, 0);
>>    if (result == MAP_FAILED) {
>> -    warn_on_large_pages_failure(req_addr, bytes, errno);
>> +    warn_on_large_pages_failure(start, bytes, errno);
>>
>> Shouldn't we report lp_start instead of start here?
>>
>>
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/raw_files/new/src/os/linux/vm/os_linux.cpp
>>
>> There are a number of things that I would like to change with the the
>> test.
>>
>> Can you use mapping == NULL in null checks instead of just if (mapping).
>> The same for the asserts.
>>
>> I think the test will be easier to understand if you get rid of the for
>> loop and extract code into helper functions, or do a small amount of code
>> duplication:
>>
>>     // do the test twice: once with a req_addr=NULL, once with a non-NULL
>>     // req_addr; the latter shall test if we trash existing mappings
>>     for (int run = 0; run < 2; run ++) {
>>
>> I think you should fail or exit the test if you can't get the expected
>> mapping, instead of continuing and have :
>>
>>       if (addr) {
>>         test_log("... mapped [" PTR_FORMAT "-" PTR_FORMAT ").", addr,
>> addr + size);
>>       } else {
>>         test_log("... failed.");
>>       }
>>
>>       if (addr != NULL) {
>>
>> Thanks,
>> StefanK
>>
>>
>>
>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() first establishes a
>>> mapping with small pages over the whole requested range, then exchanges
>>> the
>>> parts of the mapping which are aligned to large page size with large
>>> pages.
>>>
>>> Now, when establishing the first mapping, it uses os::reserve_memory()
>>> with
>>> a potentially non-null req_addr. In that case, os::reserve_memory() will
>>> do
>>> a mmap(MAP_FIXED).
>>>
>>> This will trash any pre-existing mappings.
>>>
>>> Note that I could not willingly reproduce the error with an unmodified
>>> VM.
>>> But I added a reproduction case which demonstrated that if one were to
>>> call
>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() with a non-NULL
>>> req_addr, existing mappings would be trashed. Depending on where we are
>>> in
>>> address space, we also would overwrite libc structures and crash
>>> immediately.
>>>
>>> The repro case is part of the change, see changes in
>>> test_reserve_memory_special_huge_tlbfs_mixed(), and can be executed with:
>>>
>>> ./images/jdk/bin/java -XX:+UseLargePages -XX:+UseHugeTLBFS -XX:-UseSHM
>>> -XX:+ExecuteInternalVMTests -XX:+VerboseInternalVMTests
>>>
>>> The fix: instead of using os::reserve_memory(),
>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() now calls mmap()
>>> directly with the non-NULL req_addr, but without MAP_FIXED. This means
>>> the
>>> OS may do its best to allocate at req_addr, but will not trash any
>>> mappings. This also follows the pattern in
>>> os::Linux::reserve_memory_special_huge_tlbfs_only(), which is a sister
>>> function of os::Linux::reserve_memory_special_huge_tlbfs_mixed().
>>>
>>> Note also discussion at:
>>> http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-April/017823.html
>>> .
>>>
>>> Thanks for reviewing!
>>>
>>> Kind Regards, Thomas
>>>
>>
>>
>
>


More information about the hotspot-runtime-dev mailing list