RFR(s): 8077276: allocating heap with UseLargePages and HugeTLBFS may trash existing memory mappings (linux)
Stefan Karlsson
stefan.karlsson at oracle.com
Thu Apr 23 14:16:39 UTC 2015
Hi Thomas,
On 2015-04-23 14:55, Thomas Stüfe wrote:
> Hi Stefan,
>
> let's talk a bit more to get an understanding before I change the
> webrev again:
>
> On Thu, Apr 23, 2015 at 10:48 AM, Stefan Karlsson
> <stefan.karlsson at oracle.com> wrote:
>
>
>
> On 2015-04-22 17:20, Thomas Stüfe wrote:
>> Hi Stefan,
>>
>> new webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.02/webrev/
>>
>> See my answers inline.
>>
>> On Wed, Apr 22, 2015 at 2:20 PM, Stefan Karlsson
>> <stefan.karlsson at oracle.com> wrote:
>>
>> Hi Thomas,
>>
>> On 2015-04-21 18:31, Thomas Stüfe wrote:
>>> Hi Stefan,
>>>
>>> thanks for reviewing!
>>>
>>> Here is the new webrev:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/
>>>
>>> I incorporated all your remarks except one: the test
>>> unfortunately cannot just assert if the allocation fails,
>>> because that may legitimately happen sometimes, especially with small
>>> allocation sizes
>>> (os::Linux::reserve_memory_special_huge_tlbfs_mixed() will
>>> refuse to allocate if the area cannot accommodate one large
>>> page). The test also did not assert prior to my change, so I
>>> did not want to change that behaviour.
>>>
>>> I changed some more things:
>>>
>>> 1) Before my change,
>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() would
>>> call os::reserve_memory() or os::reserve_memory_aligned(),
>>> depending on whether req_addr was NULL or not. That was a
>>> bit unclean (a lower-level API calling a higher-level API) and
>>> led to the weird workaround at os_linux.cpp:3497, where the
>>> NMT memory tracker had to deregister an allocation which it
>>> expected to be registered again later at an upper level. Yikes.
>>>
>>> Because I changed the first case to use low-level mmap, I
>>> also changed the second - replaced
>>> os::reserve_memory_aligned() with low-level code. I actually
>>> merged both branches together as well as possible, so the
>>> new code (os_linux.cpp:3484 ff) now handles alignment as
>>> well: if req_addr is set, allocation is attempted at
>>> req_addr and alignment is ignored. If req_addr is NULL,
>>> alignment is honored. This way I could get rid of the
>>> NMT workaround completely.
>>>
>>> 2) I completely rewrote the test function. It now tests
>>> three scenarios, see comment. I hope the code is clear and
>>> easy to understand.
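
A minimal standalone sketch of the reservation logic described above - the helper name and details are assumptions for illustration, not the actual webrev code. It assumes alignment is a power of two:

  #include <sys/mman.h>
  #include <stdint.h>
  #include <stddef.h>

  // Reserve a PROT_NONE range: honor req_addr as a hint only (never MAP_FIXED)
  // when it is given, or honor alignment when req_addr is NULL.
  static char* reserve_noaccess(size_t bytes, size_t alignment, char* req_addr) {
    if (req_addr != NULL) {
      // req_addr is just a hint; without MAP_FIXED existing mappings survive.
      // Alignment is ignored in this branch.
      char* p = (char*) ::mmap(req_addr, bytes, PROT_NONE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
      return (p == MAP_FAILED) ? NULL : p;
    }
    // No requested address: over-reserve by the alignment, then trim.
    size_t extra = bytes + alignment;
    char* p = (char*) ::mmap(NULL, extra, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
      return NULL;
    }
    uintptr_t mask = (uintptr_t) alignment - 1;
    char* aligned = (char*) (((uintptr_t) p + mask) & ~mask);
    if (aligned > p) {
      ::munmap(p, aligned - p);                                   // leading slack
    }
    if (p + extra > aligned + bytes) {
      ::munmap(aligned + bytes, (p + extra) - (aligned + bytes)); // trailing slack
    }
    return aligned;
  }

The point of the non-MAP_FIXED hint is exactly what the thread discusses: the kernel may or may not place the mapping at req_addr, but it will never destroy an existing mapping there.
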
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.01/webrev/src/os/linux/vm/os_linux.cpp.frames.html
>>
>> ---
>> +  start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>> +                         MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>> +  if (start == MAP_FAILED) {
>>      return NULL;
>>    }
>>
>> You've changed the behavior of this function. It used to
>> return NULL if reserve_memory couldn't reserve memory at the
>> requested address. Your proposed patch can return another
>> address. This will be a problem for the upper layers when
>> using compressed oops. I think it would be better if you keep
>> the old behavior.
>>
>>
>> Arguably the new behaviour is better: The code which allocates
>> the heap for compressed Oops actually deals with this case quite
>> nicely by doing a range check for the address, not an identity
>> check (see virtualspace.cpp
>> "ReservedHeapSpace::try_reserve_range()"). The other code path in
>> "ReservedSpace::initialize()" at least handles this case (if
>> returned address is unequal to requested address, releases memory
>> again). So if the OS cooperates by returning an address in the
>> vicinity of the requested address, it would actually be better to
>> use this address if it fits instead of retrying the allocation.
>>
>> However, I am not completely sure about other callers of
>> os::reserve_memory_special(), and changing the behaviour would be
>> beyond the scope of my change anyway, so I restored the old
>> behaviour as you wished. Now we either return the requested
>> address or NULL if a requested address was given.
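
Purely for illustration, a tiny standalone sketch of that restored contract - the helper name is made up and this is not the webrev code:

  #include <sys/mman.h>
  #include <stddef.h>

  // Either the requested address is returned, or NULL - never a different
  // address - so callers such as the compressed-oops code see the old behaviour.
  static char* reserve_at_or_fail(char* req_addr, size_t bytes) {
    char* start = (char*) ::mmap(req_addr, bytes, PROT_NONE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (start == MAP_FAILED) {
      return NULL;
    }
    if (start != req_addr) {
      ::munmap(start, bytes);   // kernel placed it elsewhere: undo and report failure
      return NULL;
    }
    return start;
  }
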
>
> Sounds good to me.
>
>
> Ok, I'll keep it that way.
>
>>
>>
>> ---
>> Instead of inlining the reserve-memory code into
>> reserve_memory_special_huge_tlbfs_mixed, could you extract it
>> into a function? So that we have one function that's
>> responsible for reserving PROT_NONE memory, and one that
>> commits the small, large, small pages.
>>
>>
>> Ok, did this. Because there is already an anon_mmap(), which fits
>> the req_addr != NULL case, I just reused it and added a sister
>> function "anon_mmap_aligned" for the alignment case.
>
> I didn't try to argue that you should split up the code, but
> rather that you should take the entire reserve code you wrote for
> webrev.01 and move it into a separate function. The code would then look like:
>
>   char* start = reserve_memory_something(bytes, req_addr, alignment);
>   if (start == NULL) {
>     return NULL;
>   }
>
> and the reader of reserve_memory_special_huge_tlbfs_mixed could
> simply ignore the details of reserve_memory_something, unless they
> were important for some reason. That way you could also write
> separate tests for reserve_memory_something (or whatever it would
> be called).
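
For context, a rough sketch of the commit step that would stay behind in reserve_memory_special_huge_tlbfs_mixed once the reservation is factored out - names and layout are assumptions for illustration, not the HotSpot code:

  #include <sys/mman.h>
  #include <stddef.h>

  // Commit the small-paged head, the large-paged middle and the small-paged
  // tail of an already reserved (PROT_NONE) range [start, start + bytes).
  // MAP_FIXED is safe here because it only replaces our own reservation.
  static bool commit_mixed(char* start, size_t bytes,
                           char* lp_start, size_t lp_bytes, int prot) {
    if (lp_start > start) {                       // small-paged leading area
      if (::mmap(start, lp_start - start, prot,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED) {
        return false;
      }
    }
    if (::mmap(lp_start, lp_bytes, prot,          // large-paged area
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_HUGETLB, -1, 0) == MAP_FAILED) {
      return false;
    }
    char* lp_end = lp_start + lp_bytes;
    char* end = start + bytes;
    if (end > lp_end) {                           // small-paged trailing area
      if (::mmap(lp_end, end - lp_end, prot,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED) {
        return false;
      }
    }
    return true;
  }
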
>
> ---
> You are now using anon_mmap which has the comment:
>
> 3070 // anon_mmap() should only get called during VM initialization,
> 3071 // don't need lock (actually we can skip locking even it can be called
> 3072 // from multiple threads, because _highest_vm_reserved_address is just a
> 3073 // hint about the upper limit of non-stack memory regions.)
> 3074   if ((address)addr + bytes > _highest_vm_reserved_address) {
> 3075     _highest_vm_reserved_address = (address)addr + bytes;
> 3076   }
>
> So, it's not entirely clear to me that you should be using this
> function. I don't know what _highest_vm_reserved_address is used
> for. Your new function, anon_mmap_aligned, isn't updating this
> variable.
>
>
> The comment is obsolete and does not reflect reality; anon_mmap is
> used by os::reserve_memory(), which in turn can be used at any time
> in the life of the VM.
>
> The comment - and the associated variable
> "_highest_vm_reserved_address" - only affects the LinuxThreads
> implementation, which is legacy code and should be removed (I just asked
> about this on hotspot-dev and will do the cleanup if I get the green
> light).
>
> It has also been broken forever, since a lot of allocations happen in the
> VM and JDK where _highest_vm_reserved_address is simply not updated.
OK. It would be good to get rid of it at some point then.
>
> Could we go back and use the reserve code you had in webrev.01, so
> that we don't have to consider the impact of this?
>
>
> It actually changes less than my last webrev:
>
> - old code was using os::reserve_memory() and
> os::reserve_memory_aligned().
> os::reserve_memory() directly called anon_mmap().
> os::reserve_memory_aligned() called os::reserve_memory(), which
> directly called anon_mmap().
>
> - the new code:
> for the req_addr != NULL case, it calls anon_mmap directly; the only
> difference from before is that we now do not specify MAP_FIXED.
> For the req_addr == NULL case, it calls the new function
> anon_mmap_aligned(), which did not exist before, so no compatibility
> issues can arise.
>
> I can live with both my webrev.01 proposal and this one, but the
> solution of factoring out all the allocation into a single function would
> be my least favorite choice: I do not really like factoring out a
> function which only exists to be called from a single other function,
> and which is too awkward to be used any other way.
The reason for moving it out to a function is to get better "separation
of concerns" and to hide the implementation details; it doesn't really
matter if it's used only once.
> And your "reserved_memory_something()" would be quite awkward:
> bundling mmap allocation for the req_addr case (which duplicates what
> anon_mmap() does), plus adding support for allocating at an alignment.
> It does not easily fit together imho.
But you almost already had the "awkward" code in webrev.01, and it was
quite easy to understand. With the requirement that req_addr is aligned
if an alignment is passed in, the code would be even simpler. It
could probably be used to implement anon_mmap with just a few small
changes, and by doing that we would get rid of some code duplication as
well. Not sure it's worth it, though.
>
> Of course, it is all a matter of taste, and if you insist, I'll do it.
> The most important thing is to fix the bug.
OK. Then, please, extract it out to a separate function so that
reserve_memory_special_huge_tlbfs_mixed only has one call out to a
function that reserves the memory. It's OK if you want to implement it
as an anon_mmap/anon_mmap_aligned pair or inline the code (as in
webrev.01), as long as I don't have to see the code in
reserve_memory_special_huge_tlbfs_mixed.
Thanks,
StefanK
>
>> ---
>> - size_t large_page_size = os::large_page_size();
>> + char* start;
>>
>> + size_t large_page_size = os::large_page_size();
>>
>> There's no longer any need to declare 'start' before it's set,
>> so this should be incorporated into:
>>
>> 3489   start = (char*) ::mmap(req_addr, extra_size, PROT_NONE,
>> 3490                          MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>
>> ---
>> 3480   // first reserve - but not commit - the address range in small pages.
>> ...
>> 3529   // Commit small-paged leading area
>>
>> Could you start your comments with a capital letter and end them with a period?
>>
>>
>> Done. Note that there are comments where the first word is a
>> parameter name - if that name is lowercase, I would rather not
>> capitalize it.
>
> 3082 // reserve small-paged memory, optionally aligned to given boundary.
> 3083 // alignment must be a multiple of allocation granularity. bytes must be
>
> and some in the test as well.
>
>
> Okay, will change.
>
>>
>> ---
>> 3481 // Note: if a wish address was given
>>
>> There's no other reference to a "wish address". Could you
>> reword it to be req_addr or "requested address"?
>>
>>
>> Done.
>>
>>
>> ---
>> 3481   // Note: if a wish address was given, we still do not use MAP_FIXED - do not want to trash
>> 3482   // existing mappings. We hand down req_addr to mmap and hope the OS does its best.
>>
>> Could you move this comment down to the mmap call?
>>
>>
>> Done.
>>
>>
>> ---
>> 3529 // Commit small-paged leading area
>> ...
>> 3540 // Commit large paged area
>>
>> Shouldn't it say large-paged as well?
>>
>>
>> Done.
>>
>>
>> ---
>> 3483 // Note2: if wish address was given, alignment is ignored
>>
>> Previously, we asserted that our in-parameters were aligned.
>> Why did you replace that with a comment?
>>
>>
>> Ok, I restored the asserts, but I really am not happy with that.
>> I find the asserts way too strict and they limit the usefulness
>> of the functions:
>>
>> - "assert(is_ptr_aligned(req_addr, alignment), "Must be");"
>>
>> Why must this be? Why do I even have to pass in an alignment value
>> if I pass in a requested address? It would make more sense just to
>> say that alignment is ignored if req_addr is set.
>> Note that this was not really well tested beforehand, because
>> there were no tests for a non-null req_addr.
>>
>> - "assert(is_size_aligned(bytes, alignment), "Must be");"
>>
>> This is even worse. Why should the size of the allocation be
>> aligned to the alignment? Why can't I allocate a small chunk of
>> memory at a larger alignment? I do not understand this limit at all.
>>
>> But ok, I restored the asserts and had to limit the test
>> accordingly so as not to trigger those asserts.
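
For what it's worth, a small standalone demo (not HotSpot code) of the case questioned above - a reservation smaller than its alignment, e.g. 4 KB placed on a 2 MB boundary - which works fine without the size being a multiple of the alignment:

  #include <sys/mman.h>
  #include <stdint.h>
  #include <stdio.h>

  int main() {
    const size_t bytes = 4096;                    // small allocation
    const size_t alignment = 2 * 1024 * 1024;     // large-page-sized alignment
    size_t extra = bytes + alignment;
    char* p = (char*) mmap(NULL, extra, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
      return 1;
    }
    uintptr_t mask = (uintptr_t) alignment - 1;
    char* aligned = (char*) (((uintptr_t) p + mask) & ~mask);
    if (aligned > p) {
      munmap(p, aligned - p);                                   // drop leading slack
    }
    if (p + extra > aligned + bytes) {
      munmap(aligned + bytes, (p + extra) - (aligned + bytes)); // drop trailing slack
    }
    printf("4 KB reserved at %p, 2 MB aligned: %d\n",
           (void*) aligned, (int) (((uintptr_t) aligned & mask) == 0));
    return 0;
  }
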
>
> You are probably right, and those alignment restrictions should
> probably have been put in the higher layers where they are needed. But,
> please, don't change this kind of thing in an unrelated bug fix.
> It needs a separate RFE so that it gets enough scrutiny to make
> sure that we don't break something.
>
>
> I agree with you. It is out of scope for this change.
>
> Kind Regards, Thomas
>
> Thanks,
> StefanK
>
>
>> ---
>> + size_t extra_size = ...
>> + char* const end0 = start0 + bytes;
>> + char* const end = start + extra_size;
>>
>> It's not obvious from the variable names what they represent,
>> IMHO. Could you try to give them more descriptive names?
>>
>>
>> Done.
>>
>> I'll take a look at the tests after next review round.
>>
>>
>> Ok, thanks for taking the time!
>>
>> ..Thomas
>>
>> Thanks,
>> StefanK
>>
>>
>>>
>>> Thanks!
>>>
>>> ..Thomas
>>>
>>>
>>> On Wed, Apr 15, 2015 at 1:55 PM, Stefan Karlsson
>>> <stefan.karlsson at oracle.com> wrote:
>>>
>>> Hi Thomas,
>>>
>>> Thanks for fixing this bug.
>>>
>>> Some comments inlined:
>>>
>>> On 2015-04-09 17:08, Thomas Stüfe wrote:
>>>
>>> Hi all,
>>>
>>> please review this fix to huge page allocation on Linux.
>>>
>>> bug: https://bugs.openjdk.java.net/browse/JDK-8077276
>>> webrev:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/
>>>
>>>
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/src/os/linux/vm/os_linux.cpp.udiff.html
>>>
>>> +    const int prot = exec ? PROT_READ|PROT_WRITE|PROT_EXEC : PROT_READ|PROT_WRITE;
>>>      char* start;
>>>      if (req_addr != NULL) {
>>>        assert(is_ptr_aligned(req_addr, alignment), "Must be");
>>>        assert(is_size_aligned(bytes, alignment), "Must be");
>>> -      start = os::reserve_memory(bytes, req_addr);
>>> -      assert(start == NULL || start == req_addr, "Must be");
>>> +      // do NOT use MAP_FIXED here - do not trash existing mappings.
>>> +      start = (char*) ::mmap(req_addr, bytes, prot,
>>> +                             MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
>>> +      if (start == MAP_FAILED) {
>>> +        start = NULL;
>>> +      }
>>> +      if (!is_ptr_aligned(start, alignment)) {
>>> +        ::munmap(start, bytes);
>>> +        start = NULL;
>>> +      }
>>>
>>>
>>> Previously, the small page area was mapped with
>>> PROT_NONE, but now you change it to RW. Is there a
>>> reason why you changed that?
>>>
>>>
>>>      result = ::mmap(lp_start, lp_bytes, prot,
>>>                      MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED|MAP_HUGETLB,
>>>                      -1, 0);
>>>      if (result == MAP_FAILED) {
>>> -      warn_on_large_pages_failure(req_addr, bytes, errno);
>>> +      warn_on_large_pages_failure(start, bytes, errno);
>>>
>>> Shouldn't we report lp_start instead of start here?
>>>
>>>
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8077276/webrev.00/webrev/raw_files/new/src/os/linux/vm/os_linux.cpp
>>>
>>> There are a number of things that I would like to change
>>> about the test.
>>>
>>> Could you use mapping == NULL in null checks instead of
>>> just if (mapping)? The same for the asserts.
>>>
>>> I think the test will be easier to understand if you get
>>> rid of the for loop and extract code into helper
>>> functions, or do a small amount of code duplication:
>>>
>>>      // do the test twice: once with a req_addr=NULL, once with a non-NULL
>>>      // req_addr; the latter shall test if we trash existing mappings
>>>      for (int run = 0; run < 2; run ++) {
>>>
>>> I think you should fail or exit the test if you can't
>>> get the expected mapping, instead of continuing with:
>>>
>>>      if (addr) {
>>>        test_log("... mapped [" PTR_FORMAT "-" PTR_FORMAT ").", addr, addr + size);
>>>      } else {
>>>        test_log("... failed.");
>>>      }
>>>
>>>      if (addr != NULL) {
>>>
>>> Thanks,
>>> StefanK
>>>
>>>
>>>
>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() first establishes a
>>> mapping with small pages over the whole requested range, then exchanges the
>>> parts of the mapping which are aligned to large page size with large pages.
>>>
>>> Now, when establishing the first mapping, it uses os::reserve_memory() with
>>> a potentially non-null req_addr. In that case, os::reserve_memory() will do
>>> a mmap(MAP_FIXED).
>>>
>>> This will trash any pre-existing mappings.
>>>
>>> Note that I could not willingly reproduce the error with an unmodified VM.
>>> But I added a reproduction case which demonstrated that if one were to call
>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() with a non-NULL
>>> req_addr, existing mappings would be trashed. Depending on where we are in
>>> address space, we also would overwrite libc structures and crash
>>> immediately.
>>>
>>> The repro case is part of the change, see changes in
>>> test_reserve_memory_special_huge_tlbfs_mixed(), and can be executed with:
>>>
>>> ./images/jdk/bin/java -XX:+UseLargePages -XX:+UseHugeTLBFS -XX:-UseSHM
>>> -XX:+ExecuteInternalVMTests -XX:+VerboseInternalVMTests
>>>
>>> The fix: instead of using os::reserve_memory(),
>>> os::Linux::reserve_memory_special_huge_tlbfs_mixed() now calls mmap()
>>> directly with the non-NULL req_addr, but without MAP_FIXED. This means the
>>> OS may do its best to allocate at req_addr, but will not trash any
>>> mappings. This also follows the pattern in
>>> os::Linux::reserve_memory_special_huge_tlbfs_only(), which is a sister
>>> function of os::Linux::reserve_memory_special_huge_tlbfs_mixed().
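
To make the failure mode concrete, a tiny standalone demo (not HotSpot code) of what MAP_FIXED does to an existing mapping:

  #include <sys/mman.h>
  #include <stdio.h>
  #include <string.h>

  int main() {
    const size_t sz = 4096;
    char* existing = (char*) mmap(NULL, sz, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (existing == MAP_FAILED) {
      return 1;
    }
    strcpy(existing, "precious data");

    // With MAP_FIXED the existing mapping is silently replaced; without
    // MAP_FIXED the kernel would simply have picked a different address.
    char* overlap = (char*) mmap(existing, sz, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    printf("same address: %d, data now: \"%s\"\n",
           (int) (overlap == existing), existing);   // the data is gone - page is zeroed
    return 0;
  }
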
>>>
>>> Note also discussion at:
>>> http://mail.openjdk.java.net/pipermail/hotspot-dev/2015-April/017823.html.
>>>
>>> Thanks for reviewing!
>>>
>>> Kind Regards, Thomas
>>>
>>>
>>>
>>
>>
>
>