RFR(M): 8152995: Solaris os::available_memory() doesn't do what we think it does

Erik Osterlund erik.osterlund at oracle.com
Fri Apr 8 16:52:57 UTC 2016


Hi Daniel,

Thanks for the long reply.

> On 8 apr. 2016, at 17:18, Daniel D. Daugherty <daniel.daugherty at oracle.com> wrote:
> 
> Hi Erik,
> 
> Lots of info here. Thanks!
> 
> Just a few embedded replies below...
> 
> 
>> On 4/8/16 6:36 AM, Erik Österlund wrote:
>> Hi Daniel,
>> 
>> Thanks for having a look at this.
>> 
>>> On 2016-04-06 18:31, Daniel D. Daugherty wrote:
>>> Erik,
>>> 
>>> Thanks for adding Runtime to this discussion. The topic is definitely
>>> of interest to Runtime folks...
>>> 
>>> More below...
>>> 
>>> 
>>>> On 2016-04-06 16:09, Erik Österlund wrote:
>>>> Hi,
>>>> 
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8152995
>>>> CR: http://cr.openjdk.java.net/~eosterlund/8152995/webrev.00/
>>>> 
>>>> On Solaris, the os::available_memory() function is currently
>>>> calculated with sysconf(_SC_AVPHYS_PAGES).
>>> 
>>> The Solaris man page for sysconf(_SC_AVPHYS_PAGES):
>>> 
>>> _SC_AVPHYS_PAGES  Number of physical memory pages not
>>>                  currently in use by system
>> 
>> Yes. After drilling down into the details, it returns the amount of physical memory not used by the virtual memory system. But when you mmap without NORESERVE, no physical memory is actually paged in until the memory starts to be touched. Therefore this metric corresponds to how much memory is available *for the kernel* to satisfy page faults, rather than how much memory is available *for applications* to satisfy allocations. These two quantities are orthogonal: neither one tells you anything about the other. It is safe to say that what is being asked for is the memory available to satisfy allocations, not how much memory the kernel has at its disposal for internal use.
> 
> These are two interesting distinctions:
> 
> - how much memory is available *for the kernel* to satisfy page faults
> - how much memory is available *for applications* to satisfy allocations
> 
> and I'm having difficulty understanding why you are splitting hairs
> this way. Yes, I'm saying there's not much difference here.
> 
> If the kernel has a page available to satisfy a page fault, then an
> application has that page available for use to satisfy an allocation.
> In my mind, an allocation is a use of the page and not a reservation
> of a page. Let's take a couple of steps back...
> 
> The JVM does advanced memory allocation. By that I mean that the JVM
> does not simply use malloc()/free() to allocate and deallocate memory.
> The malloc()/free() world is simple and you won't run into swap space
> issues. OK, you can run into swap space issues, but not the same swap
> space issues we're talking about.
> 
> So the JVM's use of advanced memory allocation means that we have to
> be prepared for more complicated memory allocation failures. We also
> have to run on a system that is properly configured for applications
> that use advanced memory allocation. The JVM's use of a reserve, commit
> and touch model means that we have to be prepared for things to go
> wrong at each stage of the advanced memory allocation model.
> 
> When the JVM reserves a page:
> - if the reservation works, then all that means is that the kernel
>  _thinks_ that it can satisfy a _future_ use of that page
> - if the reservation doesn't work, then the JVM dies (and it should)
> 
> When the JVM commits a page:
> - if the commit works, then all that means is that the kernel has
>  been able to allocate a backing page (swap) for the reserved page.
>  Note: the RAM page may not have been touched yet, all the application
>  did so far is change the mapping on the page that it had reserved
> - On Solaris, if a backing page (swap) cannot be allocated at this
>  time, then the commit attempt fails and we get:
> 
>  #define EAGAIN  11      /* Resource temporarily unavailable */
> 
>  On Solaris and Linux, if the attempt to change a reserved page to
>  a committed page fails, then we have to kill the JVM because we may
>  have lost the reservation. MacOS X is clear that the reservation is
>  not lost and I think Windows is the same as MacOS X.
> - Note that just because we have reserved a page, that does not mean
>  we'll be able to convert it to a committed page in the future. When
>  the kernel allowed us to reserve the page, that was its best guess
>  at the time of the reservation.
> 
> When the JVM touches a page:
> - At this point, the reserved RAM page is finally used and the Solaris
>  kernel won't count it in sysconf(_SC_AVPHYS_PAGES).
> 
> 
> So how does the above relate to these two statements:
> 
> - how much memory is available *for the kernel* to satisfy page faults
> - how much memory is available *for applications* to satisfy allocations

By allocations I mean malloc/sbrk/mmap without NORESERVE, which is what is used when committing memory.

When the JVM commits memory, the kernel "reserves" backing storage of some sort (not to be confused with the JVM's notion of reserve, which means hogging virtual address space rather than actual memory), so that this memory may safely be touched later without running out of memory.

The current number does *not* correspond to how much memory can be committed by the Java heap or malloc, as it has no connection to how much more memory the kernel can reserve with backing storage. That is what I meant by satisfying allocations.

A useful example of how it works now:
1) There is 100G "available".
2) The JVM expands its heap by committing (in JVM terminology: mmap without NORESERVE), say 50G of memory, which is promised backing storage.
3) There is still 100G "available".

What is actually reported is that there are 100G worth of pages that can be paged in to back the kernel-reserved memory that was committed. One would expect to instead see 50G available, since the kernel can only reserve backing storage for another 50G.

You can run my little program to verify this.

> My point is that if the kernel tells you it has a page available,
> then the application can presume that it has a page available to
> satisfy an allocation. However, the application has to be prepared
> for the case where that available page is taken by another process
> and react accordingly.

No it can't - that's the problem. It only means the kernel can page that much memory in; it does not mean the kernel has that much memory left to reserve as backing storage for commits etc. Again, my little program demonstrates this.

I hope we understand each other better now. I will reply to the rest once we reach mutual understanding.

Thanks,
/Erik

> 
>> 
>>> 
>>>> Unfortunately this does not match intended semantics. The intended
>>>> semantics is to return how much memory can be allocated by mmap into
>>>> physical memory. But, what _SC_AVPHYS_PAGES does is to return how many
>>>> physical pages are available to be used by virtual memory as backing
>>>> storage on-demand once it is touched, without any connection
>>>> whatsoever to virtual memory.
>>> 
>>> This part has me curious:
>>> 
>>> > The intended semantics is to return how much memory can be allocated
>>> > by mmap into physical memory.
>>> 
>>> since I don't understand where you found the "intended semantics".
>> 
>> Unfortunately I can't tell you this in the open list.
> 
> Please add a confidential note to the bug report. :-)
> 
> 
>> 
>>> Only one of the platforms has any comments about available_memory:
>>> 
>>> src/os/bsd/vm/os_bsd.cpp:
>>> 
>>> // available here means free
>>> julong os::Bsd::available_memory() {
>>> 
>>> the rest just don't say...
>>> 
>>> Personally, I've always interpreted available_memory() to mean
>>> available physical pages, as in pages that are not in use now.
>>> This matches the definition of _SC_AVPHYS_PAGES above...
>> 
>> So your version of available memory depends on how much memory is "in use". In use by whom? As you can see, this also leaves room for interpretation - use by the kernel to satisfy page faults, or use by applications to satisfy allocations. You just flipped it.
>> 
>> And yes, interpretations have indeed been personal and not very well defined, which is obvious from the code. Let me summarize the current situation:
>> 
>> Windows: Returns the amount of physical memory (RAM) that is available and can be used to satisfy allocations.
>> 
>> Linux: Returns the amount of physical memory in the freelist that can be used to satisfy allocations. But there is in fact a lot more memory available. The freelist can be seen as memory wasted by the OS. It tries to use physical memory for things like file caches to boost performance, but that memory is in fact available if anyone starts allocating. So it will return less memory than is in fact available, contrary to the name of the function. This is also misleading and arguably wrong.
>> 
>> BSD: Seems to also return the amount of free rather than available memory, to satisfy allocations, contrary to the name of the function.
>> 
>> Solaris: Returns the amount of physical memory not paged in, to satisfy page faults, regardless of the amount of memory available to satisfy allocations.
> 
> I'm not quite sure why you are describing Solaris this way. Based on
> the man page 1-liner:
> 
> _SC_AVPHYS_PAGES  Number of physical memory pages not
>                  currently in use by system
> 
> Solaris sounds the same as Windows above.
> 
> 
>> 
>> In summary:
>> * All OS implementations seem to consistently disregard swap space and not count it as available memory.
>> * The meaning is mixed between available memory to satisfy application allocations, available memory for the kernel to use to satisfy page faults, and immediately free memory (rather than available memory) to satisfy allocations.
>> 
>> I think that due to the lack of a clear definition, people have used whichever interpretation they preferred.
> 
> The fact that the different platforms have different implied
> meanings for os::available_memory() is the real bug here. This
> kind of inconsistency is "not a good thing" (TM). :-)
> 
> 
>> 
>>> 
>>>> Even if we mmap to commit heap memory without NORESERVE, the
>>>> _SC_AVPHYS_PAGES metric does not change its value - at least not until
>>>> somebody actually touches the mmaped memory and it starts becoming
>>>> backed by actual physical memory. So the JVM can in theory commit the
>>>> whole physical memory, and _SC_AVPHYS_PAGES will still reply that all
>>>> that memory is still available given that it has not been touched yet.
>>> 
>>> Yes, I believe that is exactly how things work and I think
>>> that available_memory() is returning that information
>>> correctly.
>> 
>> Solaris is AFAIK the only OS to make this interpretation, and it is arguably not useful for the users as it does not correspond to memory that applications can use to satisfy allocations.
> 
> I'll have to wait for your next reply to understand your distinction
> between an available physical page and a page that can be used to
> satisfy an applications page allocation request. I'm just not seeing
> it yet...
> 
> 
>> But due to the lack of a definition, I can't say you are wrong, just have opinions about it and point out Solaris is the only OS with this interpretation, and that people have been very confused about it.
> 
> Based on what you've written here, it looks to me like Solaris and
> Windows have the same interpretation, but, again, I'll wait for your
> reply...
> 
> 
>> 
>>> 
>>>> It is likely that this is related to random swap-related test
>>>> failures, where too many JVMs are created based on this metric.
>>> 
>>> Please explain further. What do you mean by "too many JVMs are
>>> created based on this metric"?
>> 
>> Well I guess the problem is that people run many JVMs at the same time and then some JVM crashes complaining that it is out of swap memory and hence can't run any longer, while available_memory says there are multiple gigabytes of available memory. This is very confusing for users. The reason is, as I said, that the memory has not been paged in yet, and therefore there appears to be lots of memory available, but none of it can be used to satisfy allocations, leading to failures.
> 
> Hmmm... there might be the beginnings of something here that will
> help the two of us reach the same place... I'll have to come back
> to this paragraph after I see your next reply...
> 
> 
>> 
>>> 
>>>> Even
>>>> if it is not, the os::available_memory() call is still broken in its
>>>> current state and should be fixed regardless.
>>> 
>>> I'm not yet convinced that available_memory() is broken
>>> and needs to be fixed. I don't see available_memory() being
>>> used in a lot of places and those uses that I do see are
>>> mostly just reports of the value...
>>> 
>>> So what am I missing about how os::available_memory() is
>>> being used?
>> 
>> The reports are very problematic though. It's better to not report anything than to report misleading, incorrect numbers. If we report a number, then that number should be correct and useful.
> 
> Well... we haven't yet reached agreement on what correct means here.
> Hopefully we will soon...
> 
> 
>> I also think that whether available_memory() is broken or not has nothing to do with how often it is being used.
> 
> My point here was two fold:
> 
> 1) The two of us don't yet agree that it is broken.
> 2) Even if it was broken, I didn't and still don't see how the broken
>   information is leading to an issue. The JVM is not using the value
>   of os::available_memory() for any allocation policy what so ever.
> 
> 
>> It reports misleading numbers, and that leads to unnecessary confusion.
> 
> Again, we don't (yet) agree that the numbers are misleading. At this
> point, I'll agree that the different platforms give different answers
> and that's a problem.
> 
> 
>> People are very confused about Solaris swap issues.
> 
> Yes, they are. I've written about how advanced memory allocation
> algorithms require smart swap space management several times now.
> 
> 
>> A contributing reason is that reported values are not what people think they are. Therefore it needs a fix.
> 
> I agree that we need to do something about making what
> os::available_memory() reports consistent and documented.
> I agree that there is confusion. It will be difficult to
> get agreement on what os::available_memory() _should_
> report. There will be differing opinions... :-)
> 
> I think I've made my preference clear. I want the lowest level
> answer (free pages of memory from the kernel's POV). I want the
> system to trust me to know that the answer to my query can and
> will change right after I receive that answer. And the system
> should trust me to code my application accordingly.
> 
> Erik, this is a wonderful thread. I'll leave it up to you as
> for how much should be copied to the bug report.
> 
> Dan
> 
>> 
>> Thanks,
>> /Erik
>> 
>>> Dan
>>> 
>>> 
>>> 
>>>> My proposed fix uses kstat to get the available memory that can be
>>>> mmapped (which actually relates to virtual memory). It then uses
>>>> swapctl() to find out the amount of free swap, subtracting that from
>>>> the kstat value, to make sure we do not count swap memory as being
>>>> available for grabbing, to mimic the current behaviour of other
>>>> platforms. The code iterates over the potentially many swap resources
>>>> and adds up the free swap memory.
>>>> 
>>>> kstat gives us all memory that can be made available, including memory
>>>> already used by the OS for things like file caches, and swap memory.
>>>> When this value is 0, mmap will fail. That's why I calculate the
>>>> amount of swap and remove that, assuming it is okay to use memory that
>>>> isn't immediately available but can be made available, as long as it
>>>> does not involve paging to the swap memory.
>>>> 
>>>> Testing:
>>>> * JPRT
>>>> * Made my own test program, which can be found in the comments of the
>>>> bug, to report on memory values, so I could verify what is going on and
>>>> that when the new os::available_memory() becomes 0, that is indeed when
>>>> paging to swap starts happening (verified using vmstat).
>>>> 
>>>> I need a sponsor to push this if anyone is interested.
>>>> 
>>>> Thanks,
>>>> /Erik
> 



More information about the hotspot-gc-dev mailing list