RFR(M): 8152995: Solaris os::available_memory() doesn't do what we think it does

Fri Apr 8 15:18:15 UTC 2016

Hi Erik,

Lots of info here. Thanks!

Just a few embedded replies below...

On 4/8/16 6:36 AM, Erik Österlund wrote:
> Hi Daniel,
>
> Thanks for having a look at this.
>
> On 2016-04-06 18:31, Daniel D. Daugherty wrote:
>> Erik,
>>
>> Thanks for adding Runtime to this discussion. The topic is definitely
>> of interest to Runtime folks...
>>
>> More below...
>>
>>
>> On 2016-04-06 16:09, Erik Österlund wrote:
>>> Hi,
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8152995
>>> CR: http://cr.openjdk.java.net/~eosterlund/8152995/webrev.00/
>>>
>>> On Solaris, the os::available_memory() function is currently
>>> calculated with sysconf(_SC_AVPHYS_PAGES).
>>
>> The Solaris man page for sysconf(_SC_AVPHYS_PAGES):
>>
>> _SC_AVPHYS_PAGES  Number of physical memory pages not
>>                   currently in use by system
>
> Yes. After drilling down the details, it returns the amount of 
> physical memory not used by the virtual memory system. But when you 
> mmap without NORESERVE, no physical memory is actually paged in until 
> memory starts to be touched. Therefore this metric corresponds to how 
> much memory is available *for the kernel* to satisfy page faults, 
> rather than how much memory is available *for applications* to satisfy 
> allocations. These two interpretations are completely orthogonal: the 
> amount of memory available for the kernel has no relation to the 
> amount of memory available for applications, and the amount of memory 
> available for applications has no relation to the amount of memory 
> available in the kernel. It is safe to say that what is being asked 
> for, is available memory to satisfy allocations, and not how much 
> memory the kernel has at its disposal for internal use.

These are two interesting distinctions:

- how much memory is available *for the kernel* to satisfy page faults
- how much memory is available *for applications* to satisfy allocations

and I'm having difficulty understanding why you are splitting hairs
this way. Yes, I'm saying there's not much difference here.

If the kernel has a page available to satisfy a page fault, then an
application has that page available for use to satisfy an allocation.
In my mind, an allocation is a use of the page and not a reservation
of a page. Let's take a couple of steps back...

The JVM does advanced memory allocation. By that I mean that the JVM
does not simply use malloc()/free() to allocate and deallocate memory.
The malloc()/free() world is simple and you won't run into swap space
issues. OK, you can run into swap space issues, but not the same swap
space issues we're talking about.

So the JVM's use of advanced memory allocation means that we have to
be prepared for more complicated memory allocation failures. We also
have to be run in a system that is properly configured for applications
that use advanced memory allocation. The JVM's use of a reserve, commit
and touch model means that we have to be prepared for things to go
wrong at each stage of the advanced memory allocation model.

When the JVM reserves a page:
- if the reservation works, then all that means is that the kernel
   _thinks_ that it can satisfy a _future_ use of that page
- if the reservation doesn't work, then the JVM dies (and it should)

When the JVM commits a page:
- if the commit works, then all that means is that the kernel has
   been able to allocate a backing page (swap) for the reserved page.
   Note: the RAM page may not have been touched yet, all the application
   did so far is change the mapping on the page that it had reserved
- On Solaris, if a backing page (swap) cannot be allocated at this
   time, then the commit attempt fails and we get:

   #define EAGAIN  11      /* Resource temporarily unavailable */

   On Solaris and Linux, if the attempt to change a reserved page to
   a committed page fails, then we have to kill the JVM because we may
   have lost the reservation. MacOS X is clear that the reservation is
   not lost and I think Windows is the same as MacOS X.
- Note that just because we have reserved a page, that does not mean
   we'll be able to convert it to a committed page in the future. When
   the kernel allowed us to reserve the page, that was its best guess
   at the time of the reservation.

When the JVM touches a page:
- At this point, the reserved RAM page is finally used and the Solaris
   kernel won't count it in sysconf(_SC_AVPHYS_PAGES).

So how does the above relate to these two statements:

- how much memory is available *for the kernel* to satisfy page faults
- how much memory is available *for applications* to satisfy allocations

My point is that if the kernel tells you it has a page available,
then the application can presume that it has a page available to
satisfy an allocation. However, the application has to be prepared
for the case where that available page is taken by another process
and react accordingly.
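
To make the three stages concrete, here is a minimal sketch in C. It is
not HotSpot code: the helper name reserve_commit_touch() and the
stage_counts struct are made up for illustration, and the use of
mmap() with MAP_NORESERVE for the reserve step and a MAP_FIXED remap
for the commit step is an assumption about one common way to do this.
It samples sysconf(_SC_AVPHYS_PAGES) around each stage so the counts
can be compared on a given system.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANON and MAP_NORESERVE on glibc */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Free-page counts sampled around each stage (hypothetical helper type). */
typedef struct {
    long start, reserved, committed, touched;
} stage_counts;

/* Runs reserve -> commit -> touch on an anonymous region of len bytes.
 * Returns 0 on success, -1 if the reserve or commit step fails. */
int reserve_commit_touch(size_t len, stage_counts *out) {
    out->start = sysconf(_SC_AVPHYS_PAGES);

    /* Reserve: address space only, no access rights, no swap accounting. */
    char *base = mmap(NULL, len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANON | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return -1;
    out->reserved = sysconf(_SC_AVPHYS_PAGES);

    /* Commit: remap in place without MAP_NORESERVE so that backing
     * store is accounted for. This is the step that can fail. */
    if (mmap(base, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0) == MAP_FAILED) {
        /* The old reservation may already be gone at this point, so
         * real code cannot simply carry on. Best-effort cleanup only. */
        munmap(base, len);
        return -1;
    }
    out->committed = sysconf(_SC_AVPHYS_PAGES);

    /* Touch: only now does the kernel have to hand over RAM pages. */
    memset(base, 1, len);
    out->touched = sysconf(_SC_AVPHYS_PAGES);

    munmap(base, len);
    return 0;
}
```

Calling reserve_commit_touch(64UL * 1024 * 1024, &counts) and printing
the four fields shows where the count moves on a given machine. Other
processes change the count as well, so any single run is only a rough
indication.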

>
>>
>>> Unfortunately this does not match intended semantics. The intended
>>> semantics is to return how much memory can be allocated by mmap into
>>> physical memory. But, what _SC_AVPHYS_PAGES does is to return how many
>>> physical pages are available to be used by virtual memory as backing
>>> storage on-demand once it is touched, without any connection
>>> whatsoever to virtual memory.
>>
>> This part has me curious:
>>
>> > The intended semantics is to return how much memory can be allocated
>> > by mmap into physical memory.
>>
>> since I don't understand where you found the "intended semantics".
>
> Unfortunately I can't tell you this in the open list.

Please add a confidential note to the bug report. :-)

>
>> Only one of the platforms has any comments about available_memory:
>>
>> src/os/bsd/vm/os_bsd.cpp:
>>
>> // available here means free
>> julong os::Bsd::available_memory() {
>>
>> the rest just don't say...
>>
>> Personally, I've always interpreted available_memory() to mean
>> available physical pages, as in pages that are not in use now.
>> This matches the definition of _SC_AVPHYS_PAGES above...
>
> So your version of available memory depends on how much memory is "in 
> use". In use by who? As you can see, this also leaves room for 
> interpretations - use by the kernel to satisfy page faults or use by 
> applications to satisfy allocations. You just flipped it.
>
> And yes, interpretations have indeed been personal and not very well 
> defined, which is obvious from the code. Let me summarize the current 
> situation:
>
> Windows: Returns the amount of physical memory (RAM) that is available 
> and can be used to satisfy allocations.
>
> Linux: Returns the amount of physical memory in the freelist that can 
> be used to satisfy allocations. But there is in fact a lot more memory 
> available. The freelist can be seen as memory wasted by the OS. It 
> tries to use physical memory for things like file caches to boost 
> performance, but that memory is in fact available if anyone starts 
> allocating. So it will return less memory than is in fact available, 
> contrary to the name of the function. This is also misleading and 
> arguably wrong.
>
> BSD: Seems to also return the amount of free rather than available 
> memory, to satisfy allocations, contrary to the name of the function.
>
> Solaris: Returns the amount of physical memory not paged in, to
> satisfy page faults, independently of the amount of memory available
> to satisfy allocations.

I'm not quite sure why you are describing Solaris this way. Based on
the man page 1-liner:

_SC_AVPHYS_PAGES  Number of physical memory pages not
                   currently in use by system

Solaris sounds the same as Windows above.
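
For reference, a minimal sketch of the query being discussed, in C.
The function names are hypothetical and this is not the actual
os_solaris.cpp code; it only assumes that the value is derived from
sysconf(_SC_AVPHYS_PAGES) scaled by the page size, as described at the
top of the thread. The total from _SC_PHYS_PAGES is included for
comparison.

```c
#include <unistd.h>

/* Converts a page count from sysconf() into bytes, or -1 on error. */
static long long pages_to_bytes(long pages) {
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages < 0 || page_size < 0)
        return -1;
    return (long long)pages * page_size;
}

/* Physical memory not currently in use by the system, in bytes. */
long long avphys_bytes(void) {
    return pages_to_bytes(sysconf(_SC_AVPHYS_PAGES));
}

/* Total physical memory, in bytes. */
long long total_phys_bytes(void) {
    return pages_to_bytes(sysconf(_SC_PHYS_PAGES));
}
```

The result is a snapshot: it can be stale by the time the caller looks
at it, so it is best treated as a hint rather than a guarantee.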

>
> In summary:
> * All OS implementations seem to consistently disregard swap space and
> not consider it available memory.
> * The meaning is mixed between available memory to satisfy application 
> allocations, available memory for the kernel to use to satisfy page 
> faults, and immediately free memory (rather than available memory) to 
> satisfy allocations.
>
> I think that due to the lack of a clear definition, people have used 
> any preferred interpretation.

The fact that the different platforms have different implied
meanings for os::available_memory() is the real bug here. This
kind of inconsistency is "not a good thing" (TM). :-)
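
The "free" versus "available" gap quoted above for Linux can be seen
directly with a small C sketch. It is Linux-only and assumes that
/proc/meminfo is readable; the helper name meminfo_kb() is made up for
illustration. The MemAvailable field is only present on newer kernels,
so the helper returns -1 for any field it cannot find.

```c
#include <stdio.h>
#include <string.h>

/* Returns the value in kB of one /proc/meminfo field such as "MemFree",
 * or -1 if the file or the field is not present. */
long meminfo_kb(const char *key) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (f == NULL)
        return -1;

    char line[256];
    long value = -1;
    size_t klen = strlen(key);

    while (fgets(line, sizeof line, f) != NULL) {
        /* Lines look like "MemFree:   123456 kB". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            if (sscanf(line + klen + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}
```

Comparing meminfo_kb("MemFree") with meminfo_kb("MemAvailable") on a
machine with a warm file cache shows how far apart the two readings of
"available" can be.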

>
>>
>>> Even if we mmap to commit heap memory without NORESERVE, the
>>> _SC_AVPHYS_PAGES metric does not change its value - at least not until
>>> somebody actually touches the mmaped memory and it starts becoming
>>> backed by actual physical memory. So the JVM can in theory commit the
>>> whole physical memory, and _SC_AVPHYS_PAGES will still reply that all
>>> that memory is still available given that it has not been touched yet.
>>
>> Yes, I believe that is exactly how things work and I think
>> that available_memory() is returning that information
>> correctly.
>
> Solaris is AFAIK the only OS to make this interpretation, and it is 
> arguably not useful for the users as it does not correspond to memory 
> that applications can use to satisfy allocations.

I'll have to wait for your next reply to understand your distinction
between an available physical page and a page that can be used to
satisfy an application's page allocation request. I'm just not seeing
it yet...

> But due to the lack of a definition, I can't say you are wrong, just 
> have opinions about it and point out Solaris is the only OS with this 
> interpretation, and that people have been very confused about it.

Based on what you've written here, it looks to me like Solaris and
Windows have the same interpretation, but, again, I'll wait for your
reply...

>
>>
>>> It is likely that this is related to random swap-related test
>>> failures, where too many JVMs are created based on this metric.
>>
>> Please explain further. What do you mean by "too many JVMs are
>> created based on this metric"?
>
> Well I guess the problem is that people run many JVMs at the same time 
> and then some JVM crashes complaining that it is out of swap memory 
> and hence can't run any longer, while available_memory says there are 
> multiple gigabytes of available memory. This is very confusing for 
> users. The reason is, as I said, that the memory has not been paged in 
> yet, and therefore there appears to be lots of memory available, but 
> none of it can be used to satisfy allocations, leading to failures.

Hmmm... there might be the beginnings of something here that will
help the two of us reach the same place... I'll have to come back
to this paragraph after I see your next reply...

>
>>
>>> Even
>>> if it is not, the os::available_memory() call is still broken in its
>>> current state and should be fixed regardless.
>>
>> I'm not yet convinced that available_memory() is broken
>> and needs to be fixed. I don't see available_memory() being
>> used in a lot of places and those uses that I do see are
>> mostly just reports of the value...
>>
>> So what am I missing about how os::available_memory() is
>> being used?
>
> The reports are very problematic though. It's better to not report 
> anything than to report misleading, incorrect numbers. If we report a 
> number, then that number should be correct and useful.

Well... we haven't yet reached agreement on what correct means here.
Hopefully we will soon...

> I also think that whether available_memory() is broken or not has 
> nothing to do with how often it is being used.

My point here was twofold:

1) The two of us don't yet agree that it is broken.
2) Even if it were broken, I didn't and still don't see how the broken
    information is leading to an issue. The JVM is not using the value
    of os::available_memory() for any allocation policy whatsoever.

> It reports misleading numbers, and that leads to unnecessary confusion.

Again, we don't (yet) agree that the numbers are misleading. At this
point, I'll agree that the different platforms give different answers
and that's a problem.

> People are very confused about Solaris swap issues.

Yes, they are. I've written about how advanced memory allocation
algorithms require smart swap space management several times now.

> A contributing reason is that reported values are not what people 
> think they are. Therefore it needs a fix.

I agree that we need to do something about making what
os::available_memory() reports consistent and documented.
I agree that there is confusion. It will be difficult to
get agreement on what os::available_memory() _should_
report. There will be differing opinions... :-)

I think I've made my preference clear. I want the lowest level
answer (free pages of memory from the kernel's POV). I want the
system to trust me to know that the answer to my query can and
will change right after I receive that answer. And the system
should trust me to code my application accordingly.

Erik, this is a wonderful thread. I'll leave it up to you as
to how much should be copied to the bug report.

Dan

>
> Thanks,
> /Erik
>
>> Dan
>>
>>
>>
>>> My proposed fix uses kstat to get the available memory that can be
>>> mmapped (which actually relates to virtual memory). It then uses
>>> swapctl() to find out the amount of free swap, subtracting that from
>>> the kstat value, to make sure we do not count swap memory as being
>>> available for grabbing, to mimic the current behaviour of other
>>> platforms. The code iterates over the potentially many swap resources
>>> and adds up the free swap memory.
>>>
>>> kstat gives us all memory that can be made available, including memory
>>> already used by the OS for things like file caches, and swap memory.
>>> When this value is 0, mmap will fail. That's why I calculate the
>>> amount of swap and remove that, assuming it is okay to use memory that
>>> isn't immediately available but can be made available, as long as it
>>> does not involve paging to the swap memory.
>>>
>>> Testing:
>>> * JPRT
>>> * Made my own test program that can be found in the comments of the
>>> BUG to report on memory values, so I could verify what is going on and
>>> that when the new os::available_memory() becomes 0, is indeed when
>>> paging to swap starts happening using vmstat.
>>>
>>> I need a sponsor to push this if anyone is interested.
>>>
>>> Thanks,
>>> /Erik
>>
>