Proposal for improvements to the metaspace chunk allocator
Lindenmaier, Goetz
goetz.lindenmaier at sap.com
Mon Feb 19 14:05:16 UTC 2018
Hi Thomas,
thanks for posting this change. I think it will help a
lot, especially with the class space.
I agree that this does not necessarily require a JEP. So could
you please open a bug and post an RFR to hotspot-runtime-dev?
Thanks for the extensive documentation, the code is easy
to understand that way!
Maybe put the text from this mail into the bug?
It's very helpful and easier to locate there than in
the mail archive.
My comments:
take_from_committed():
Do I understand correctly that this only takes the next
needed piece of memory? And that, if the size passed to
the current call is bigger than that of the last call,
the alignment must be fixed up, which is why you add what you call padding?
Is this also called for humongous chunks?
If not, for simplicity, I would have implemented this by just taking
the next medium chunk (which would always be aligned), splitting
it into the needed size and adding all the rest to
the corresponding free lists. But no change needed here,
I just want to understand. (Probably this is not feasible
because the humongous ones are not aligned to the medium chunk size...)
I think the naming "padding chunks" is a bit misleading.
It sounds as if the chunks were wasted, but as they
are added to the free lists they are not lost.
dict.leo gives "offcut" for "Verschnitt" ... not a word
familiar to me, but at least the German translation and the
word-for-word translation fit the situation better, I think.
Feel free to keep it as is, though.
In your mail you discuss the additional fields you
add. In case adding _is_class to Metachunk is considered
a problem (I don't think so), can't you compute the property
"is_class()" by comparing the metachunk address with the
possible range of the compressed class space? These 3GB are
only reserved for the class space ...
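Just to illustrate what I mean - only a sketch, and the way to obtain the
class space range is made up here, the real accessors may look different:

  #include <cstddef>
  #include <cstdint>

  // Sketch: derive "is this a class space chunk?" from the chunk address
  // alone, given the base and size of the reserved compressed class space.
  static bool is_class_chunk(const void* chunk_addr,
                             const void* class_space_base,
                             size_t class_space_size) {
    const uintptr_t p  = (uintptr_t) chunk_addr;
    const uintptr_t lo = (uintptr_t) class_space_base;
    return p >= lo && p < lo + class_space_size;
  }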
TestVirtualSpaceNode_test() is empty. Maybe remove it altogether?
A lot of the methods are passed 'true' or 'false' to indicate
whether they are meant for the class or the metaspace manager. Maybe you
could define an enum with values like is_class and is_metaspace, to
make these calls more self-explanatory?
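For example, something along these lines (just a sketch, the names and the
call site are made up):

  // Sketch: an enum instead of a bare bool makes call sites self-explanatory.
  enum ChunkSpace { is_class, is_metaspace };

  // A (made-up) call site would then read
  //   get_chunk_manager(is_class);
  // instead of
  //   get_chunk_manager(true);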
Minor nit: as you anyway normalize #defines to ASSERT, you
might want to fix the remaining two or three #defines in metaspace.cpp
from PRODUCT to ASSERT/DEBUG, too.
Best regards,
Goetz.
-----Original Message-----
From: hotspot-dev [mailto:hotspot-dev-bounces at openjdk.java.net] On Behalf Of Thomas Stüfe
Sent: Thursday, February 8, 2018 12:58 PM
To: HotSpot Open Source Developers <hotspot-dev at openjdk.java.net>
Subject: RFR: Proposal for improvements to the metaspace chunk allocator
Hi,
We would like to contribute a patch developed at SAP which has been live in
our VM for some time. It improves the metaspace chunk allocation: reduces
fragmentation and raises the chance of reusing free metaspace chunks.
The patch: http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-05--2/webrev/
In very short, this patch helps with a number of pathological cases where
metaspace chunks are free but cannot be reused because they are of the
wrong size. For example, the metaspace freelist could be full of small
chunks, which would not be reusable if we need larger chunks. So, we could
get metaspace OOMs even in situations where the metaspace was far from
exhausted. Our patch adds the ability to split and merge metaspace chunks
dynamically and thus removes the "size-lock-in" problem.
Note that there have been other attempts to get a grip on this problem, see
e.g. "SpaceManager::get_small_chunks_and_allocate()". But arguably our
patch attempts a more complete solution.
In 2016 I discussed the idea for this patch with some folks off-list, among
them Jon Matsimutso. He then advised me to create a JEP. So I did: [1].
However, in the meantime changes to the JEP process were discussed [2], and I am
not sure anymore that this patch even needs a JEP. It may be moderately
complex and hence carries the risk inherent in any patch, but its effects
would not be externally visible (if you discount seeing fewer metaspace
OOMs). So, I'd prefer to handle this as a simple RFE.
--
How this patch works:
1) When a class loader dies, its metaspace chunks are freed and returned to
the freelist for reuse by the next class loader. With the patch, upon
returning a chunk to the freelist, an attempt is made to merge it with its
neighboring chunks - should they happen to be free too - to form a larger
chunk, which is then placed in the free list.
As a result, the freelist should be populated by larger chunks at the
expense of smaller chunks. In other words, all free chunks should always be
as "coalesced as possible".
2) When a class loader needs a new chunk and a chunk of the requested size
cannot be found in the free list, before carving out a new chunk from the
virtual space, we first check if there is a larger chunk in the free list.
If there is, that larger chunk is chopped up into n smaller chunks. One of
them is returned to the caller, the others are re-added to the freelist.
(1) and (2) together have the effect of removing the size-lock-in for
chunks. If fragmentation allows it, small chunks are dynamically combined
to form larger chunks, and larger chunks are split on demand.
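To illustrate the combined effect of (1) and (2), here is a small
self-contained toy model. Note that this is only an illustration: it uses a
pure power-of-two, buddy-style scheme for brevity, whereas the actual patch
keeps the existing fixed chunk sizes and works on the real ChunkManager
free lists.

  #include <cstddef>
  #include <iostream>
  #include <set>
  #include <vector>

  // Toy model of (1) merge-on-free and (2) split-on-demand.
  class ToyChunkManager {
    static const size_t kMaxOrder = 5;           // chunk sizes 1..32 slots
    std::vector<std::set<size_t> > free_lists_;  // per order: start slots of free chunks

  public:
    ToyChunkManager() : free_lists_(kMaxOrder + 1) {
      free_lists_[kMaxOrder].insert(0);          // start with one maximal free chunk
    }

    // (2) Split-on-demand: if no chunk of the wanted order is free, split the
    // next larger free chunk, returning one half to the free list each time.
    size_t allocate(size_t order) {
      size_t o = order;
      while (o <= kMaxOrder && free_lists_[o].empty()) o++;
      if (o > kMaxOrder) return (size_t)-1;      // out of memory
      size_t start = *free_lists_[o].begin();
      free_lists_[o].erase(free_lists_[o].begin());
      while (o > order) {                        // split down to the requested size
        o--;
        free_lists_[o].insert(start + ((size_t)1 << o));  // give back the upper half
      }
      return start;
    }

    // (1) Merge-on-free: return the chunk and coalesce it with its free
    // neighbor for as long as possible, so the free lists end up holding
    // maximally merged chunks.
    void deallocate(size_t start, size_t order) {
      while (order < kMaxOrder) {
        size_t buddy = start ^ ((size_t)1 << order);       // neighbor of same size
        std::set<size_t>::iterator it = free_lists_[order].find(buddy);
        if (it == free_lists_[order].end()) break;         // neighbor not free: stop
        free_lists_[order].erase(it);
        start = start < buddy ? start : buddy;             // merged chunk start
        order++;
      }
      free_lists_[order].insert(start);
    }
  };

  int main() {
    ToyChunkManager cm;
    size_t a = cm.allocate(0);   // a 1-slot chunk; forces splits of larger chunks
    size_t b = cm.allocate(0);
    cm.deallocate(a, 0);
    cm.deallocate(b, 0);         // after this, everything is merged back
    std::cout << "allocated at " << a << " and " << b << std::endl;
    return 0;
  }

The point of the toy is merely that freeing both small chunks merges
everything back into one maximal chunk, so a later large request can be
satisfied again.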
--
What this patch does not do:
This is not a rewrite of the chunk allocator - most of the mechanisms stay
intact. Specifically, chunk sizes remain unchanged, and so do the chunk
allocation processes (which class loaders get handed which chunk size, and
when). Almost everything this patch does affects only the internal workings of
the ChunkManager.
Also note that I refrained from doing any cleanups, since I wanted
reviewers to be able to gauge this patch without filtering noise.
Unfortunately this patch adds some complexity. But there are many future
opportunities for code cleanup and simplification, some of which we already
discussed in existing RFEs ([3], [4]). All of them are out of the scope for
this particular patch.
--
Details:
Before the patch, the following rules held:
- All chunk sizes are multiples of the smallest chunk size ("specialized
chunks")
- All chunk sizes of larger chunks are also clean multiples of the next
smaller chunk size (e.g. for class space, the ratio of
specialized/small/medium chunks is 1:2:32)
- All chunk start addresses are aligned to the smallest chunk size (more or
less accidentally, see metaspace_reserve_alignment).
The patch makes the last rule explicit and more strict:
- All (non-humongous) chunk start addresses are now aligned to their own
chunk size. So, e.g. medium chunks are allocated at addresses which are a
multiple of medium chunk size. This rule is not extended to humongous
chunks, whose start addresses continue to be aligned to the smallest chunk
size.
The reason for this new alignment rule is that it makes it cheap both to
find a chunk's predecessors and to check which chunks are free.
When a class loader dies and its chunk is returned to the freelist, all we
have is its address. In order to merge it with its neighbors to form a
larger chunk, we need to find those neighbors, including those preceding
the returned chunk. Prior to this patch that was not easy - one would have
to iterate chunks starting at the beginning of the VirtualSpaceNode. But
due to the new alignment rule, we now know where the prospective larger
chunk must start - at the next lower larger-chunk-size-aligned boundary. We
also know that currently a smaller chunk must start there (*).
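In code, that boundary computation is trivial (sketch only; it relies on the
chunk sizes in bytes being powers of two, which they are with the current
chunk geometry):

  #include <cstdint>

  // Sketch: the prospective larger chunk containing 'chunk_addr' must start
  // at the next lower boundary aligned to the larger chunk size.
  static inline uintptr_t prospective_larger_chunk_start(uintptr_t chunk_addr,
                                                         uintptr_t larger_chunk_bytes) {
    return chunk_addr & ~(larger_chunk_bytes - 1);   // align down
  }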
In order to check the free-ness of chunks quickly, each VirtualSpaceNode
now keeps a bitmap which describes its occupancy. One bit in this bitmap
corresponds to a range of the smallest chunk size, starting at
an address aligned to the smallest chunk size. Because of the alignment
rules above, such a range belongs to one single chunk. The bit is 1 if the
associated chunk is in use by a class loader, 0 if it is free.
When we have calculated the address range a prospective larger chunk would
span, we need to check whether all chunks in that range are free. Only then
can we merge them. We do that by querying the bitmap. Note that the most
common use case here is forming medium chunks from smaller chunks. With the
new alignment rules, the bitmap portion covering a medium chunk now always
happens to be 16 or 32 bits in size and is 16- or 32-bit aligned, so reading
the bitmap in many cases becomes a simple 16- or 32-bit load.
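A toy version of that check, just to illustrate the idea (this is not the
OccupancyMap code from the patch, which, as described above, can do
whole-word reads for the common cases):

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Toy illustration of the occupancy bitmap: one bit per smallest-chunk-sized
  // slot; a candidate merge range is only usable if every bit in it is 0.
  struct ToyOccupancyMap {
    std::vector<uint8_t> bits;   // packed, 8 slots per byte

    explicit ToyOccupancyMap(size_t num_slots) : bits((num_slots + 7) / 8, 0) {}

    void set_in_use(size_t slot, bool in_use) {
      if (in_use) bits[slot / 8] |=  (uint8_t)(1u << (slot % 8));
      else        bits[slot / 8] &= (uint8_t)~(1u << (slot % 8));
    }

    bool range_is_free(size_t first_slot, size_t num_slots) const {
      for (size_t i = first_slot; i < first_slot + num_slots; i++) {
        if (bits[i / 8] & (1u << (i % 8))) return false;   // slot in use
      }
      return true;
    }
  };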
Only if the whole range is free do we need to iterate the chunks in that
range: pull them from the freelist, combine them into one new larger chunk,
and re-add that one to the freelist.
(*) Humongous chunks make this a bit more complicated. Since the new
alignment rule does not extend to them, a humongous chunk could still
straddle the lower or upper boundary of the prospective larger chunk. So I
gave the occupancy map a second layer, which is used to mark the start of
chunks.
An alternative approach could have been to make humongous chunk size and
start address always a multiple of the largest non-humongous chunk size
(medium chunks). That would have caused a bit of waste per humongous chunk
(<64K) in exchange for simpler coding and a simpler occupancy map.
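Simplified, the straddle check enabled by that second layer looks roughly
like this (illustration only, not necessarily the exact form used in the
patch):

  #include <cstddef>
  #include <vector>

  // 'starts[i]' is true iff a chunk starts at slot i. A candidate merge range
  // is only safe if no chunk straddles its boundaries: a chunk must start
  // exactly at the range's first slot, and another chunk (or the end of the
  // node) must start right behind its last slot.
  static bool no_chunk_straddles_range(const std::vector<bool>& starts,
                                       size_t first_slot, size_t num_slots) {
    const size_t end_slot = first_slot + num_slots;
    const bool lower_ok = starts[first_slot];
    const bool upper_ok = (end_slot == starts.size()) || starts[end_slot];
    return lower_ok && upper_ok;
  }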
--
The patch shows its best results in scenarios where a lot of smallish class
loaders are alive simultaneously. When dying, they leave contiguous
expanses of metaspace covered in small chunks, which can be merged nicely.
However, if class loader lifetimes vary more, we have more interleaving of
dead and alive small chunks, and hence chunk merging does not work as well
as it could.
For an example of a pathological case like this, see the example program: [5]
Executed like this: "java -XX:CompressedClassSpaceSize=10M -cp test3
test3.Example2" the test will load 3000 small classes in separate class
loaders, then throw them away and start loading large classes. The small
classes will have flooded the metaspace with small chunks, which are
unusable for the large classes. When executing with the rather limited
CompressedClassSpaceSize=10M, we will run into an OOM after loading about
800 large classes, having used only 40% of the class space; the rest is
wasted on unused small chunks. However, with our patch the example program
will manage to allocate ~2900 large classes before running into an OOM, and
class space will show almost no waste.
To demonstrate this, add -Xlog:gc+metaspace+freelist. After running into an
OOM, statistics and an ASCII representation of the class space will be
shown. The unpatched version will show large expanses of unused small
chunks; the patched variant will show almost no waste.
Note that the patch could be made more effective with a different size
ratio between small and medium chunks: in class space, that ratio is 1:16,
so 16 small chunks must happen to be free to form one larger chunk. With a
smaller ratio the chance of coalescing would be higher. So there may be
room for future improvement here: since we can now merge and split chunks
on demand, we could introduce more chunk sizes, potentially arriving at a
buddy-style allocator where we drop hard-wired chunk sizes in favor of a
dynamic model where the ratio between chunk sizes is always 1:2 and there
is, in theory, no limit to the chunk size. But this is just a thought
and well out of the scope of this patch.
--
What does this patch cost (memory):
- the occupancy bitmap adds 1 byte per 4K metaspace.
- MetaChunk headers get larger, since we add an enum and two bools to them.
Depending on what the C++ compiler does with that, chunk headers grow by
one or two MetaWords, reducing the payload size by that amount.
- The new alignment rules mean we may need to create padding chunks to
precede larger chunks. But since these padding chunks are added to the
freelist, they should be used up before the need for new padding chunks
arises. So, the maximum possible number of unused padding chunks should
be limited by design to about 64K.
The expectation is that the memory savings from this patch far outweigh its
added memory costs.
.. (performance):
We did not see measurable drops in standard benchmarks rising above the
normal noise. I also measured times for a program which stresses metaspace
chunk coalescing, with the same result.
I am open to suggestions on what else I should measure, and/or to independent
measurements.
--
Other details:
I removed SpaceManager::get_small_chunk_and_allocate() to reduce complexity
somewhat, because it was made mostly obsolete by this patch: since small
chunks are combined into larger chunks upon return to the freelist, in theory
we should not have that many free small chunks anymore anyway. However,
there may still be cases where we could benefit from this workaround, so I
am asking your opinion on this one.
About tests: There were two native tests - ChunkManagerReturnTest and
TestVirtualSpaceNode (the former was added by me last year) - which did not
make much sense anymore, since they relied heavily on internal behavior
which was made unpredictable with this patch.
To make up for these lost tests, I added a new gtest which attempts to
stress the many combinations of allocation patterns but does so from a layer
above the old tests. It now uses Metaspace::allocate() and friends. By
using that as the entry point for tests, I am less dependent on implementation
internals and still cover a lot of scenarios.
--
Review pointers:
Good points to start are
- ChunkManager::return_single_chunk() - specifically,
ChunkManager::attempt_to_coalesce_around_chunk() - here we merge chunks
upon return to the free list
- ChunkManager::free_chunks_get(): Here we now split large chunks into
smaller chunks on demand
- VirtualSpaceNode::take_from_committed(): chunks are now allocated according
to the alignment rules, and padding chunks are handled
- The OccupancyMap class is the helper class implementing the new occupancy
bitmap
The rest is mostly chaff: helper functions, added tests and verifications.
--
Thanks and Best Regards, Thomas
[1] https://bugs.openjdk.java.net/browse/JDK-8166690
[2] http://mail.openjdk.java.net/pipermail/jdk-dev/2017-November/000128.html
[3] https://bugs.openjdk.java.net/browse/JDK-8185034
[4] https://bugs.openjdk.java.net/browse/JDK-8176808
[5] https://bugs.openjdk.java.net/secure/attachment/63532/test3.zip