RFR(s): 8077276: allocating heap with UseLargePages and HugeTLBFS may trash existing memory mappings (linux)
Thomas Stüfe
thomas.stuefe at gmail.com
Thu Apr 23 12:55:44 UTC 2015
Hi Stefan,
lets talk a bit more to get an understanding before I change the webrev
again:
On Thu, Apr 23, 2015 at 10:48 AM, Stefan Karlsson <
stefan.karlsson at oracle.com> wrote:
>
>
> On 2015-04-22 17:20, Thomas Stüfe wrote:
>
> Hi Stefan,
>
> new webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.02/webrev/
>
> See my answers inline.
>
> On Wed, Apr 22, 2015 at 2:20 PM, Stefan Karlsson <
> stefan.karlsson at oracle.com> wrote:
>
>> Hi Thomas,
>>
>> On 2015-04-21 18:31, Thomas Stüfe wrote:
>>
>> Hi Stefan,
>>
>> thanks for reviewing!
>>
>> Here is the new webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/
>>
>> I worked in all your remarks except one: the test unfortunately cannot
>> just assert if allocation fails, because that may happen sometimes,
>> especially with small allocation sizes (os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>> will refuse to allocate if the area cannot accomodate one large page).
>> The test also did not assert prior to my change, so I did not want to
>> change that behaviour.
>>
>> I changed some more things:
>>
>> 1) Before my change, os::Linux::reserve_memory_special_huge_tlbfs_mixed()
>> would call os::reserve_memory() or os::reserve_memory_aligned(), depending
>> on whether req_addr was NULL or not. That was a bit unclean (a lower-level
>> API calling a high-level API) and led to the weird workaround at
>> os_linux.cpp:3497, where the NMT Memory tracker had to deregister a
>> allocation which it expected to be registered again later at an upper
>> level. Yikes.
>>
>> Because I changed the first case to use low-level mmap, I also changed
>> the second - replaced os::reserve_memory_aligned with low level coding.
>> Actually I merged both branches together as good as possible, so the new
>> code (os_linux.cpp:3484 ff) now handles alignment as well: If req_addr is
>> set, allocation is attempted at req_addr and alignment is ignored. If
>> req_addr is NULL, alignment is honored. This way I could get rid of the
>> workaround for the NMT completely.
>>
>> 2) I completely rewrote the test function. It now tests three
>> scenarios, see comment. I hope the code is clear and easy to understand.
>>
>>
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/src/os/linux/vm/os_linux.cpp.frames.html
>>
>> ---
>> + start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>> + MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>> + if (start == MAP_FAILED) {
>> return NULL;
>> }
>>
>> You've changed the behavior of this function. It used to return NULL if
>> reserve_memory couldn't reserve memory at the requested address. Your
>> proposed patch can return another address. This will be a problem for the
>> upper layers when using compressed oops. I think it would be better if you
>> keep the old behavior.
>>
>
> Arguably the new behaviour is better: The code which allocates the heap
> for compressed Oops actually deals with this case quite nicely by doing a
> range check for the address, not an identity check (see virtualspace.cpp
> "ReservedHeapSpace::try_reserve_range()"). The other code path in
> "ReservedSpace::initialize()" at least handles this case (if returned
> address is unequal to requested address, releases memory again). So if the
> OS cooperates by returning an address in the vicinity of the requested
> address, it would actually be better to use this address if it fits instead
> of retrying the allocation.
>
> However, I am not completely sure about other callers of
> os::reserve_memory_special(), and changing the behaviour would be beyond
> the scope of my change anyway, so I restored the old behaviour as you
> wished. Now we either return the requested address or NULL if a requested
> address was given.
>
>
> Sounds good to me.
>
>
Ok, I'll keep it that way.
>
>
>>
>> ---
>> Instead of inlining the reserve memory code into
>> reserve_memory_special_huge_tlbfs_mixed, could you extract it into
>> function? So that we have one function that's responsible for reservinb
>> PROT_NONE memory, and one that commits the small, large, small pages.
>>
>>
> Ok, did this. Because there is already a anon_mmap(), which fits the
> req_addr != NULL case, I just reused this and added a sister function
> "anon_mmap_aligned" for the alignment case.
>
>
> I didn't try to argue that you should split up the code, but rather move
> the entire reserve code that you wrote for webrev.01, and move it into a
> separate function. The code would then look like:
>
> char* start = reserve_memory_something(bytes, req_addr, alignment);
> if (start == NULL) {
> return NULL;
> }
>
> and the reader of reserve_memory_special_huge_tlbfs_mixed could simply
> ignore the details of reserve_memory_something, unless they were important
> for some reason. That way you could also write separate tests for
> reserve_memory_something (or whatever it would be called).
>
> ---
> You are now using anon_mmap which has the comment:
>
> 3070 // anon_mmap() should only get called during VM initialization,
> 3071 // don't need lock (actually we can skip locking even it can be
> called
> 3072 // from multiple threads, because _highest_vm_reserved_address is
> just a
> 3073 // hint about the upper limit of non-stack memory regions.)
> 3074 if ((address)addr + bytes > _highest_vm_reserved_address) {
> 3075 _highest_vm_reserved_address = (address)addr + bytes;
> 3076 }
>
> So, it's not entirely clear to me that you should be using this function.
> I don't know what _highest_vm_reserved_address is used for. Your new
> function, anon_mmap_aligned isn't updating this variable.
>
>
The comment is obsolete and does not reflect reality; anon_mmap is used for
os::reserve_memory(), which in turn can be used at any time in the life of
the VM.
The comment - and the associated variable "_highest_vm_reserved_address" -
only affects the LinuxThreads implementation which is legacy and should be
removed (I just did ask this question on hotspot-dev and will do the
cleanup if I get green light).
It is also broken since forever, since a lot of allocations happen in VM
and JDK where _highest_vm_reserved_address is just not updated.
> Could we go back and use the reserve code you had in webrev.01, so that we
> don't have to consider the impact of this?
>
>
It actually changes less than my last webrev:
- old code was using os::reserve_memory() and os::reserve_memory_aligned().
os::reserve_memory() directly called anon_mmap().
os::reserve_memory_aligned() called os::reserve_memory(), which directly
called anon_mmap().
- the new code:
for the req_addr != NULL, calls anon_mmap directly, only difference to
before is that we now do not specify MAP_FIXED.
for the req_addr==NULL case, it calls the new function
anon_mmap_aligned(), which did not exist before, so no compatibility issues
can arise.
I can live with both my webrev.01 proposal and this one, but the solution
to factor out all the allocation to a single function would be my least
favorite choice: I do not really like factoring out a function which only
exists to be called from a single other function, and which is too awkward
to be used any other way. And your "reserved_memory_something()" would be
quite awkward: bundling mmap allocation for the req_addr case (which
duplicates what anon_mmap() does), plus adding support for allocating at an
alignment. It does not easily fit together imho.
Of course, it is all a matter of taste, and if you insist, I'll do it. Most
important is to fix the bug.
---
>> - size_t large_page_size = os::large_page_size();
>> + char* start;
>>
>> + size_t large_page_size = os::large_page_size();
>>
>> There's no need anymore to declare 'start' before it's set. So this
>> should be incorporated into:
>>
>> 3489 start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>> 3490 MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>
>> ---
>> 3480 // first reserve - but not commit - the address range in small
>> pages.
>> ...
>> 3529 // Commit small-paged leading area
>>
>> Could you start your comments with a capital and with a period?
>>
>
> Done. Note that there are comments where the first word is a parameter
> name - if that name is lowercase, I would rather not write it upper case.
>
>
> 3082 // reserve small-paged memory, optionally aligned to given boundary.
> 3083 // alignment must be a multiple of allocation granularity. bytes must
> be
>
> and some in the test as well.
>
Okay, will change.
>
>
>>
>> ---
>> 3481 // Note: if a wish address was given
>>
>> There's no other reference to a "wish address". Could you reword it to be
>> req_addr or "requested address"?
>>
>
> Done.
>
>
>>
>> ---
>> 3481 // Note: if a wish address was given, we still do not use
>> MAP_FIXED - do not want to trash
>> 3482 // existing mappings. We hand down req_addr to mmap and hope the
>> OS does its best.
>>
>> Could you move this comment down to the mmap call?
>>
>
> Done.
>
>
>>
>> ---
>> 3529 // Commit small-paged leading area
>> ...
>> 3540 // Commit large paged area
>>
>> Shouldn't it say large-paged as well?
>>
>
> Done.
>
>
>>
>> ---
>> 3483 // Note2: if wish address was given, alignment is ignored
>>
>> Previously, we asserted that our in-parameters were aligned. Why did you
>> replace that with a comment?
>>
>>
> Ok, I restored the asserts, but I really am not happy with that. I find
> the asserts way too strict and they limit the usefulness of the functions:
>
> - "assert(is_ptr_aligned(req_addr, alignment), "Must be");"
>
> Why must this be? Why do I even have to input an alignment value if I
> input a requested address? It would make more sense just to say that
> alignment is ignored if req_addr is set.
> Note that this was not really well tested beforehand, because there were
> no tests for a non-null req_addr.
>
> - "assert(is_size_aligned(bytes, alignment), "Must be");"
>
> This is even worse. Why should the size of the allocation be aligned to
> the alignment? Why can't I allocate a small chunk of memory at a larger
> alignment? I do not understand this limit at all.
>
> But ok, I restored the asserts and had to limit the test accordingly to
> not to trigger those asserts.
>
>
> You are probably right, and those alignment restrictions should probably
> have been put in the higher layers were it's needed. But, please, don't
> change this kind of things in an unrelated bug fix. It needs a separate RFE
> so that it gets enough scrutiny to make sure that we don't break something.
>
>
I agree with you. It is out of scope for this change.
Kind Regards, Thomas
> Thanks,
> StefanK
>
>
>
>
>> ---
>> + size_t extra_size = ...
>> + char* const end0 = start0 + bytes;
>> + char* const end = start + extra_size;
>>
>> It's not obvious from the variable names what they represent, IMHO. Could
>> you try to give them more descriptive names?
>>
>>
> Done.
>
>
>> I'll take a look at the tests after next review round.
>>
>>
> Ok, thanks for taking the time!
>
> ..Thomas
>
>
>> Thanks,
>> StefanK
>>
>>
>>
>> Thanks!
>>
>> ..Thomas
>>
>>
>> On Wed, Apr 15, 2015 at 1:55 PM, Stefan Karlsson <
>> stefan.karlsson at oracle.com> wrote:
>>
>>> Hi Thomas,
>>>
>>> Thanks for fixing this bug.
>>>
>>> Some comments inlined:
>>>
>>> On 2015-04-09 17:08, Thomas Stüfe wrote:
>>>
>>>> Hi all,
>>>>
>>>> please review this fix to huge page allocation on Linux.
>>>>
>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8077276
>>>> webrev:
>>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/
>>>>
>>>
>>>
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/src/os/linux/vm/os_linux.cpp.udiff.html
>>>
>>> + const int prot = exec ? PROT_READ|PROT_WRITE|PROT_EXEC :
>>> PROT_READ|PROT_WRITE;
>>> char* start;
>>> if (req_addr != NULL) {
>>> assert(is_ptr_aligned(req_addr, alignment), "Must be");
>>> assert(is_size_aligned(bytes, alignment), "Must be");
>>> - start = os::reserve_memory(bytes, req_addr);
>>> - assert(start == NULL || start == req_addr, "Must be");
>>> + // do NOT use MAP_FIXED here - do not trash existing mappings.
>>> + start = (char*) ::mmap(req_addr, bytes, prot,
>>> + MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>> + if (start == MAP_FAILED) {
>>> + start = NULL;
>>> + }
>>> + if (!is_ptr_aligned(start, alignment)) {
>>> + ::munmap(start, bytes);
>>> + start = NULL;
>>> + }
>>>
>>>
>>> Previously, the small page area was mapped with PROT_NONE, but now you
>>> change it to RW. Is there a reason why you changed that?
>>>
>>>
>>> result = ::mmap(lp_start, lp_bytes, prot,
>>> MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED|MAP_HUGETLB,
>>> -1, 0);
>>> if (result == MAP_FAILED) {
>>> - warn_on_large_pages_failure(req_addr, bytes, errno);
>>> + warn_on_large_pages_failure(start, bytes, errno);
>>>
>>> Shouldn't we report lp_start instead of start here?
>>>
>>>
>>>
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/raw_files/new/src/os/linux/vm/os_linux.cpp
>>>
>>> There are a number of things that I would like to change with the the
>>> test.
>>>
>>> Can you use mapping == NULL in null checks instead of just if (mapping).
>>> The same for the asserts.
>>>
>>> I think the test will be easier to understand if you get rid of the for
>>> loop and extract code into helper functions, or do a small amount of code
>>> duplication:
>>>
>>> // do the test twice: once with a req_addr=NULL, once with a non-NULL
>>> // req_addr; the latter shall test if we trash existing mappings
>>> for (int run = 0; run < 2; run ++) {
>>>
>>> I think you should fail or exit the test if you can't get the expected
>>> mapping, instead of continuing and have :
>>>
>>> if (addr) {
>>> test_log("... mapped [" PTR_FORMAT "-" PTR_FORMAT ").", addr,
>>> addr + size);
>>> } else {
>>> test_log("... failed.");
>>> }
>>>
>>> if (addr != NULL) {
>>>
>>> Thanks,
>>> StefanK
>>>
>>>
>>>
>>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() first establishes a
>>>> mapping with small pages over the whole requested range, then exchanges
>>>> the
>>>> parts of the mapping which are aligned to large page size with large
>>>> pages.
>>>>
>>>> Now, when establishing the first mapping, it uses os::reserve_memory()
>>>> with
>>>> a potentially non-null req_addr. In that case, os::reserve_memory()
>>>> will do
>>>> a mmap(MAP_FIXED).
>>>>
>>>> This will trash any pre-existing mappings.
>>>>
>>>> Note that I could not willingly reproduce the error with an unmodified
>>>> VM.
>>>> But I added a reproduction case which demonstrated that if one were to
>>>> call
>>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() with a non-NULL
>>>> req_addr, existing mappings would be trashed. Depending on where we are
>>>> in
>>>> address space, we also would overwrite libc structures and crash
>>>> immediately.
>>>>
>>>> The repro case is part of the change, see changes in
>>>> test_reserve_memory_special_huge_tlbfs_mixed(), and can be executed
>>>> with:
>>>>
>>>> ./images/jdk/bin/java -XX:+UseLargePages -XX:+UseHugeTLBFS -XX:-UseSHM
>>>> -XX:+ExecuteInternalVMTests -XX:+VerboseInternalVMTests
>>>>
>>>> The fix: instead of using os::reserve_memory(),
>>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() now calls mmap()
>>>> directly with the non-NULL req_addr, but without MAP_FIXED. This means
>>>> the
>>>> OS may do its best to allocate at req_addr, but will not trash any
>>>> mappings. This also follows the pattern in
>>>> os::Linux::reserve_memory_special_huge_tlbfs_only(), which is a sister
>>>> function of os::Linux::reserve_memory_special_huge_tlbfs_mixed().
>>>>
>>>> Note also discussion at:
>>>>
>>>> http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-April/017823.html
>>>> .
>>>>
>>>> Thanks for reviewing!
>>>>
>>>> Kind Regards, Thomas
>>>>
>>>
>>>
>>
>>
>
>
More information about the hotspot-runtime-dev
mailing list