Thoughts about improving the metaspace allocator

Thomas Stüfe thomas.stuefe at gmail.com
Fri Feb 1 12:53:00 UTC 2019


Hi all,

(Not sure which mailing list is the best fit, so I am starting with
hs-gc. Please feel free to move it.)

JEP "Dynamic Max Memory Limit" has the aim to increase elasticity of java
heap memory consumption. I wonder whether the same would make sense for
metaspace? Granted, we  typically use way less memory for metaspace than
for the heap, but there are quite a few corners where memory is wasted -
mainly in situations where many classloaders come and go leaving metaspace
chunks marooned in the VM.

In particular, the following two areas waste the most memory:
- metaspace memory sitting in the freelist (not owned by any loader)
- metaspace pinned to live loaders: chunks owned by a loader which does not
allocate from them anymore, so the unused remainder cannot serve anyone
else.

All this memory is wasted in the sense that, while it could be reused
should new classes be loaded in the future, that may never happen - and in
the meantime the memory remains part of the VM process.

--

How metaspace is currently returned to the OS:

Memory for metaspace is allocated in 2MB-sized mappings (VirtualSpaceNode),
which are kept in a chain; the chain grows as more memory is needed. When a
loader requests metaspace, a chunk (Metachunk) is carved off the top
VirtualSpaceNode and handed out. These Metachunks come in various sizes
between 1K and 64K.

When a classloader dies, it returns all its Metachunks to the metaspace
allocator, which puts them into a freelist for possible reuse by a future
class loader. Should all chunks in a VirtualSpaceNode become free, the
VirtualSpaceNode itself is removed from its chain and unmapped.
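
As a rough mental model, here is a heavily simplified sketch; the names
mirror the HotSpot types, but the code itself is illustrative only:

  #include <cstddef>

  // A chunk handed out to a loader. The header lives at the start of
  // the chunk; the payload follows it in memory.
  struct Metachunk {
    size_t     size;   // chunk size in bytes, 1K .. 64K
    Metachunk* next;   // freelist link while the chunk is free
  };

  // One ~2MB mapping in the chain.
  struct VirtualSpaceNode {
    char*             base;  // start of the mapping
    char*             top;   // high-water mark; new chunks carved here
    char*             end;   // base + 2MB
    VirtualSpaceNode* next;  // chain link

    // Carve a chunk off the top of this node; nullptr if exhausted.
    Metachunk* allocate_chunk(size_t size) {
      if (top + size > end) return nullptr;
      Metachunk* c = reinterpret_cast<Metachunk*>(top);
      c->size = size;
      c->next = nullptr;
      top += size;
      return c;
    }
  };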

This means memory is returned somewhat arbitrarily: all chunks within a
2MB area must become free before the node can be unmapped. Whether that
ever happens depends heavily on fragmentation: a single class loader
holding a 1K chunk hostage keeps the whole 2MB node alive.

In addition, this does not work at all for the compressed class space:
there we do not have a chain of mappings but a single large mapping, which
never gets unmapped. So memory for the compressed class space is never
returned to the OS.

-----

First idea: uncommit free Metachunks

Metachunks sitting in the freelist do nobody any good, so one could
theoretically uncommit them for as long as they are not needed - while
keeping the address range reserved, no?

The problem is that Metachunks are not guaranteed to span multiple pages
and may in fact often be smaller than one page. Also, the Metachunk header
must not be compromised, so we cannot uncommit the first page of a chunk,
since it contains that header. In reality we would therefore only be able
to uncommit the payload area of larger chunks (medium and humongous),
which are 32K or larger.

Fortunately, all this has been greatly simplified - more by accident than
by design - by "8198423: Improve metaspace chunk allocation": there we made
it so that chunks returned to the freelist are automatically fused with
free neighboring chunks to form larger chunks. That change also introduced
the rule that all chunks must be aligned to their size, so e.g. 4K chunks
are 4K-aligned, and so on.

This means free metachunks have a natural tendency to coalesce into larger
chunks, and that those chunks are nicely aligned. That makes uncommitting
their payload easy and rewarding.
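
To illustrate the mechanics - this is a sketch using raw POSIX calls, not
the patch itself; HotSpot would go through os::uncommit_memory() instead,
and the page size and two-page threshold are simplifying assumptions:

  #include <sys/mman.h>
  #include <cstddef>

  static const size_t PAGE = 4096;

  // Uncommit the payload of a free chunk while keeping the address
  // range reserved. The first page stays committed because it holds
  // the Metachunk header. Since chunks are aligned to their size
  // (8198423), any chunk of at least two pages has a page-aligned
  // payload.
  bool uncommit_chunk_payload(char* chunk_base, size_t chunk_size) {
    if (chunk_size < 2 * PAGE) {
      return false;  // header and payload share a page; nothing to gain
    }
    char*  payload = chunk_base + PAGE;
    size_t len     = chunk_size - PAGE;
    // Replace the committed pages with a PROT_NONE reservation: the
    // backing memory is dropped, but the range stays reserved.
    void* p = mmap(payload, len, PROT_NONE,
                   MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                   -1, 0);
    return p == payload;
  }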

Here is a patch which does just that. The patch is very minimal:

http://cr.openjdk.java.net/~stuefe/webrevs/autouncommit-metachunks/webrev.00/webrev/index.html

To test whether this works, I wrote a small test which creates 1000 class
loaders, each loading 10 classes, using up ~200M of metaspace. Then I
started unloading them in random order until all were unloaded. The random
unloading causes high fragmentation.

In stock hotspot, we can see that the released memory is kept in the
freelist, but almost no memory is given back to the OS until the very end:

Alive  RSS (KB)  Freelist (KB)
 1000    377780             28
  900    378412          18428
  800    375168          37012
  700    375240          55412
  600    375328          73996
  500    375328          92028
  400    375328         110428
  300    372136         128758
  200    372008         145110
  100    357672         149357

That is not surprising: the memory is highly fragmented, and only at the
last step did a node become completely free so that it could be unmapped.

With my patch, one sees RSS dipping much earlier:

Alive  RSS (KB)  Freelist (KB)
 1000    390464             18
  900    380564          18418
  800    366232          36818
  700    351504          55218
  600    326172          73618
  500    310928          92570
  400    296396         110418
  300    280360         128748
  200    264948         145110
  100    245540         149357

The freelist content is identical, but it now consists of chunks whose
payload has been uncommitted, so RSS starts going down early. At the last
step, with 100 loaders still alive, about 100MB has been given back to the
system.

Of course, this random scenario benefits most from my patch. Savings are
smaller when class loaders are released in LIFO fashion, because metaspace
is then more clustered and Metachunks are more likely to neighbor chunks
of the same loader.

(We may improve this patch by moving the headers out of the Metachunks
altogether, keeping chunk metadata separate from the payloads.)

(I did not look closely at the cost of committing/uncommitting. One may
have to be a bit smarter than this patch to avoid expensive
commit/uncommit cycles, e.g. always leave a certain number of free chunks
committed.)
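
Such a hysteresis could look roughly like this - a sketch only, reusing
the uncommit_chunk_payload() helper from above; the reserve threshold is
an invented tuning knob, not an existing HotSpot flag:

  #include <cstddef>

  bool uncommit_chunk_payload(char* chunk_base, size_t chunk_size);

  struct FreeChunk { char* base; size_t size; FreeChunk* next; };

  // Keep a committed reserve of free chunks so that short-lived
  // free/allocate cycles do not pay for repeated commit/uncommit calls.
  static const size_t COMMITTED_RESERVE = 16 * 1024 * 1024;  // e.g. 16MB

  void maybe_uncommit(FreeChunk* freelist, size_t& committed_free_bytes) {
    for (FreeChunk* c = freelist;
         c != nullptr && committed_free_bytes > COMMITTED_RESERVE;
         c = c->next) {
      if (uncommit_chunk_payload(c->base, c->size)) {
        committed_free_bytes -= c->size;  // payload no longer backed
      }
    }
  }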

So, this may be a valid - more fluid and smooth - alternative to unmapping
VirtualSpaceNodes as a way of giving memory back to the OS.

-----

Thinking further: do we then even need the virtual space list?

IIUC the VirtualSpaceList exists for two reasons:

1) to make it possible to grow indefinitely without having to deal with an
upper limit.
2) to make it possible to give freed memory back to the OS

(1) One could argue we never really reached this goal. Most of our
customers actually specify MaxMetaspaceSize to limit metaspace. More
importantly, we have to specify CompressedClassSpaceSize in any case, and
that limits metaspace growth even if MaxMetaspaceSize is not specified.
(2) This would arguably not be needed anymore with my patch - especially
if we moved the Metachunk headers somewhere else.

So, instead of the virtual space list we could reserve the non-class
metaspace portion as one contiguous region up front, as we do for the
class space, and commit it as needed (see the sketch below). We would only
have to sacrifice the notion of limitless expansion.
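
For illustration - a sketch using raw POSIX calls; HotSpot itself would
use os::reserve_memory()/os::commit_memory(), and the reservation size is
just an example value:

  #include <sys/mman.h>
  #include <cstddef>

  // Reserve address space only; no backing memory is committed yet.
  char* reserve_metaspace(size_t reserved_size) {
    void* base = mmap(nullptr, reserved_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return base == MAP_FAILED ? nullptr : static_cast<char*>(base);
  }

  // Commit a page-aligned range inside the reservation on demand.
  bool commit_range(char* addr, size_t size) {
    return mprotect(addr, size, PROT_READ | PROT_WRITE) == 0;
  }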

Getting rid of the VirtualSpaceList in favor of one large mapping would
have the following advantages:

- Simplicity. The metaspace code has grown quite complex over time, and
every bit we retire helps maintenance.
- Fewer mappings: the virtual space list can get quite long, and that
shows up as a lot of memory mappings, at least on Linux. There is a limit
to the number of mappings a process may have, and we have hit it in the
past with customers. These mappings also cause overhead in the Linux
kernel.
- Less waste at the VirtualSpaceNode level. Not large by any means, but it
still counts.

----------

Thinking even further: do we still need the class/non-class dichotomy?
(This is more of an actual question; I am really unsure about this.)

Let's say we get rid of the virtual space list and now have two large
memory mappings side by side, the non-class and the class space. Why do we
need two?

We could theoretically combine them into a single area, which would be
just "the metaspace" and contain both class and non-class data.

This would have the following pros and cons:

+ Again, simplicity. Getting rid of this dichotomy would really simplify
the code. It would also be easier to understand and to explain to
customers, and only one switch would be needed for sizing.
+ We would save quite a bit of wasted memory, especially with many small
loaders: currently each loader has to allocate at least two chunks (a
class and a non-class chunk), which effectively doubles the per-loader
overhead.

But I see some cons too:

- For compressed class pointers to work, the total size of the class
space must not exceed 3GB. This limit would now apply to the combined size
of class and non-class metadata. I do not know - do we ever exceed 3GB of
total metaspace?
- Increasing the size would make it less likely that the space fits into
the lower 32GB of the address space, where zero-based addressing can be
used for compressed Klass* pointers.
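
For context, this is roughly how the compressed Klass* encoding works - a
sketch only; HotSpot's actual code lives elsewhere, and the base/shift
values here are illustrative:

  #include <cstdint>

  struct Klass;  // opaque

  static char*     klass_base  = nullptr;  // base of the encoding range
  static const int klass_shift = 3;        // log2 of Klass alignment

  // A narrow Klass pointer is a 32-bit, shifted offset from klass_base.
  // With a shift of 3, 32 bits span at most 32GB; zero-based mode
  // (klass_base == nullptr, so decoding needs no add) therefore requires
  // the whole encoding range to lie below 32GB.
  uint32_t encode_klass(Klass* k) {
    return (uint32_t)(((char*)k - klass_base) >> klass_shift);
  }

  Klass* decode_klass(uint32_t narrow) {
    return (Klass*)(klass_base + ((uintptr_t)narrow << klass_shift));
  }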

---

Thank you for your time. What do you think?

Kind Regards, Thomas