RFR(L): 8198423: Improve metaspace chunk allocation (was: Proposal for improvements to the metaspace chunk allocator)
Thomas Stüfe
thomas.stuefe at gmail.com
Thu Mar 1 10:36:37 UTC 2018
Hi Coleen,
thanks a lot for the review and the sponsoring offer!
New version (full):
http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-03-01/webrev-full/webrev/
incremental:
http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-03-01/webrev-incr/webrev/
Please find remarks inline:
On Tue, Feb 27, 2018 at 11:22 PM, <coleen.phillimore at oracle.com> wrote:
>
> Thomas, review comments:
>
> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev/src/hotspot/share/memory/metachunk.hpp.udiff.html
>
> +// ChunkIndex (todo: rename?) defines the type of chunk. Chunk types
>
>
> It's really both, isn't it? The type is the index into the free list or
> in use lists. The name seems fine.
>
>
You are right. What I meant was that a lot of code needs to know about the
different chunk sizes, but by naming it "Index" and adding enum values like
"NumberOfFreeLists" we expose implementation details no-one outside of
SpaceManager and ChunkManager cares about (namely, the fact that these
values are internally used as indices into arrays). A more neutral naming
would be something like "enum ChunkTypes { spec, small, ...,
NumberOfNonHumongousChunkTypes, NumberOfChunkTypes }".
However, I can leave this out for a possible future cleanup. The change is
big enough as it is.
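Just to illustrate the idea, a rough sketch of such a neutral enum (the names
are purely placeholders, not what the current patch uses):

  // Hypothetical, more neutral naming - illustration only:
  enum ChunkTypes {
    SpecializedChunkType = 0,
    SmallChunkType,
    MediumChunkType,
    NumberOfNonHumongousChunkTypes,                       // count of fixed-size chunk types
    HumongousChunkType = NumberOfNonHumongousChunkTypes,
    NumberOfChunkTypes                                    // includes humongous
  };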
> Can you add comments on the #endifs if the #ifdef is more than 2-3 lines
> above (it's a nit that bothers me).
>
> +#ifdef ASSERT
> + // A 32bit sentinel for debugging purposes.
> +#define CHUNK_SENTINEL 0x4d4554EF // "MET"
> +#define CHUNK_SENTINEL_INVALID 0xFEEEEEEF
> + uint32_t _sentinel;
> +#endif
> + const ChunkIndex _chunk_type;
> + const bool _is_class;
> + // Whether the chunk is free (in freelist) or in use by some class loader.
> bool _is_tagged_free;
> +#ifdef ASSERT
> + ChunkOrigin _origin;
> + int _use_count;
> +#endif
> +
>
>
I removed the #ifdef ASSERT around these members completely, following your
suggestion below that "origin" would be valuable in customer scenarios too.
By that logic, the other members are valuable as well: the sentinel helps
when examining memory dumps to find the start of chunks, and the in-use
counter is useful too. What do you think?
So, I leave the members in - which, depending on what the C++ compiler does
with enums and bools, may cost up to 128 bits of additional header space. I
think that is ok. In one of my earlier versions of this patch I hand-crafted
the header using chars and bitfields to be as small as possible, but that
seemed over-engineered.
However, any automatic verification accessing these debug members is still
only done in debug builds.
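For clarity, the Metachunk header then carries these members in all builds -
roughly like this (a sketch derived from the diff above, modulo exact layout):

  class Metachunk /* ... */ {
    // ...
    uint32_t         _sentinel;       // "MET" marker; helps spotting chunk starts in memory dumps
    const ChunkIndex _chunk_type;
    const bool       _is_class;
    bool             _is_tagged_free; // free (in freelist) or in use by a class loader
    ChunkOrigin      _origin;         // how this chunk came to be
    int              _use_count;      // how often the chunk was handed out to a class loader
    // ...
  };
  // Verification code touching these members remains #ifdef ASSERT only.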
>
> It seems that you could move _origin and _use_count into the ASSERT
> block above (maybe putting _use_count before _origin).
>
> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev/src/hotspot/share/memory/metaspace.cpp.udiff.html
>
> In take_from_committed, can the allocation of padding chunks be its own
> function like add_chunks_to_alignment() lines 1574-1615? The function is too
> long now.
>
>
I moved the padding chunk allocation into its own function as you suggested.
> I don't think coalescation is a word in English, at least my dictionary
> cannot find it. Although it makes sense in the context, just distracting.
>
>
I replaced "coalescation" with "chunk merging" throughout the code. Also
less of a tongue twister.
> + // Now check if in the coalescation area there are still life chunks.
>
>
> "live" chunks I guess. A sentence you won't read often :).
>
Now that I read it again, it almost sounds sinister :) Fixed.
>
> In free_chunks_get() can you handle the Humongous case first? The else for
> humongous chunk size is buried tons of lines below.
>
> Otherwise it might be helpful to the logic to make your addition to this
> function be a function you call like
> chunk = split_from_larger_free_chunk();
>
I did the latter. I moved the splitting of a larger chunk into its own
function. This causes a slight logic change: the new function
(ChunkManager::split_chunk()) splits an existing large free chunk into n
smaller free chunks and adds them all back to the freelist - including
the chunk we are about to return. That allows us to use the same exit path
- which removes the chunk from the freelist and adjusts all counters - in
the caller function ChunkManager::free_chunks_get(), instead of having to
return in the middle of the function.
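In pseudo-C++, the new flow looks roughly like this (simplified sketch, not
the literal patch code; the helpers are made-up names):

  // Simplified sketch of ChunkManager::free_chunks_get(), non-humongous path:
  Metachunk* ChunkManager::free_chunks_get(size_t word_size) {
    Metachunk* chunk = find_free_chunk_of_exact_size(word_size);   // hypothetical helper
    if (chunk == NULL) {
      Metachunk* larger = find_larger_free_chunk(word_size);       // hypothetical helper
      if (larger != NULL) {
        // split_chunk() splits 'larger' into n smaller free chunks and puts
        // *all* of them back into the freelist, including the one we want...
        split_chunk(word_size, larger);
        chunk = find_free_chunk_of_exact_size(word_size);          // ...so it is found here
      }
    }
    if (chunk != NULL) {
      // Common exit path: remove from freelist, adjust counters.
      remove_chunk_from_freelist_and_adjust_counters(chunk);       // hypothetical helper
    }
    return chunk;
  }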
To make the verification code more readable, I also moved the
"test-that-free-chunks-are-optimally-merged" check - which was quite
lengthy - from VirtualSpaceNode::verify() into a new function,
VirtualSpaceNode::verify_free_chunks_are_ideally_merged().
> You might want to keep the origin in product mode if it doesn't add to the
> chunk footprint. Might help with customer debugging.
>
>
See above
> Awesome looking test...
>
>
Thanks, I was worried it would be too complicated.
I changed it a bit because there were sporadic errors. Not a "real" error -
the test itself was faulty: the "metaspaces_in_use" counter was slightly
wrong in one corner case.
> I've read through most of this and thank you for adding this to at least
> partially solve the fragmentation problem. The irony is that we
> templatized the Dictionary from CMS so that we could use it for Metaspace,
> and that has splitting and coalescing, but it seems this code makes more
> sense than adapting that code (if that is even possible).
>
Well, it helps other metadata use cases too, no?
>
> Thank you for working on this. I'll sponsor this for you.
>
> Coleen
>
>
Thanks again!
I also updated my jdk-submit branch to include these latest changes; tests
are still running.
Kind Regards, Thomas
>
> On 2/26/18 9:20 AM, Thomas Stüfe wrote:
>
>> Hi all,
>>
>> I know this patch is a bit larger, but may I please have reviews and/or
>> other input?
>>
>> Issue: https://bugs.openjdk.java.net/browse/JDK-8198423
>> Latest version:
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev/
>>
>> For those who followed the mail thread, this is the incremental diff to the
>> last changes (it includes feedback Goetz gave me on- and off-list):
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev-incr/webrev/
>>
>> Thank you!
>>
>> Kind Regards, Thomas Stuefe
>>
>>
>>
>> On Thu, Feb 8, 2018 at 12:58 PM, Thomas Stüfe <thomas.stuefe at gmail.com>
>> wrote:
>>
>> Hi,
>>>
>>> We would like to contribute a patch developed at SAP which has been live
>>> in our VM for some time. It improves metaspace chunk allocation: it
>>> reduces fragmentation and raises the chance of reusing free metaspace
>>> chunks.
>>>
>>> The patch: http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-05--2/webrev/
>>>
>>> In short, this patch helps with a number of pathological cases where
>>> metaspace chunks are free but cannot be reused because they are of the
>>> wrong size. For example, the metaspace freelist could be full of small
>>> chunks, which would not be reusable if we need larger chunks. So, we could
>>> get metaspace OOMs even in situations where the metaspace was far from
>>> exhausted. Our patch adds the ability to split and merge metaspace chunks
>>> dynamically and thus removes the "size-lock-in" problem.
>>>
>>> Note that there have been other attempts to get a grip on this problem,
>>> see e.g. SpaceManager::get_small_chunk_and_allocate(). But arguably our
>>> patch attempts a more complete solution.
>>>
>>> In 2016 I discussed the idea for this patch with some folks off-list,
>>> among them Jon Masamitsu. He advised me to create a JEP, so I did: [1].
>>> However, meanwhile changes to the JEP process were discussed [2], and I am
>>> not sure anymore that this patch even needs a JEP. It may be moderately
>>> complex and hence carries the risk inherent in any patch, but its effects
>>> would not be externally visible (if you discount seeing fewer metaspace
>>> OOMs). So, I'd prefer to handle this as a simple RFE.
>>>
>>> --
>>>
>>> How this patch works:
>>>
>>> 1) When a class loader dies, its metaspace chunks are freed and returned
>>> to the freelist for reuse by the next class loader. With the patch, upon
>>> returning a chunk to the freelist, an attempt is made to merge it with its
>>> neighboring chunks - should they happen to be free too - to form a larger
>>> chunk, which is then placed in the free list.
>>>
>>> As a result, the freelist should be populated by larger chunks at the
>>> expense of smaller chunks. In other words, all free chunks should always
>>> be as "coalesced as possible".
>>>
>>> 2) When a class loader needs a new chunk and a chunk of the requested size
>>> cannot be found in the free list, before carving out a new chunk from the
>>> virtual space, we first check if there is a larger chunk in the free list.
>>> If there is, that larger chunk is chopped up into n smaller chunks. One of
>>> them is returned to the caller, the others are re-added to the freelist.
>>>
>>> (1) and (2) together have the effect of removing the size-lock-in for
>>> chunks. If fragmentation allows it, small chunks are dynamically combined
>>> to form larger chunks, and larger chunks are split on demand.
>>>
>>> --
>>>
>>> What this patch does not:
>>>
>>> This is not a rewrite of the chunk allocator - most of the mechanisms stay
>>> intact. Specifically, chunk sizes remain unchanged, and so does the chunk
>>> allocation policy (which class loaders get handed which chunk size, and
>>> when). Almost everything this patch does affects only the internal
>>> workings of the ChunkManager.
>>>
>>> Also note that I refrained from doing any cleanups, since I wanted
>>> reviewers to be able to gauge this patch without filtering noise.
>>> Unfortunately this patch adds some complexity. But there are many future
>>> opportunities for code cleanup and simplification, some of which we have
>>> already discussed in existing RFEs ([3], [4]). All of them are out of
>>> scope for this particular patch.
>>>
>>> --
>>>
>>> Details:
>>>
>>> Before the patch, the following rules held:
>>> - All chunk sizes are multiples of the smallest chunk size ("specialized
>>> chunks")
>>> - All chunk sizes of larger chunks are also clean multiples of the next
>>> smaller chunk size (e.g. for class space, the ratio of
>>> specialized/small/medium chunks is 1:2:32)
>>> - All chunk start addresses are aligned to the smallest chunk size (more
>>> or less accidentally, see metaspace_reserve_alignment).
>>> The patch makes the last rule explicit and more strict:
>>> - All (non-humongous) chunk start addresses are now aligned to their own
>>> chunk size. So, e.g. medium chunks are allocated at addresses which are a
>>> multiple of medium chunk size. This rule is not extended to humongous
>>> chunks, whose start addresses continue to be aligned to the smallest
>>> chunk
>>> size.
>>>
>>> The reason for this new alignment rule is that it makes it cheap both to
>>> find chunk predecessors of a chunk and to check which chunks are free.
>>>
>>> When a class loader dies and its chunk is returned to the freelist, all
>>> we
>>> have is its address. In order to merge it with its neighbors to form a
>>> larger chunk, we need to find those neighbors, including those preceding
>>> the returned chunk. Prior to this patch that was not easy - one would
>>> have
>>> to iterate chunks starting at the beginning of the VirtualSpaceNode. But
>>> due to the new alignment rule, we now know where the prospective larger
>>> chunk must start - at the next lower larger-chunk-size-aligned boundary.
>>> We
>>> also know that currently a smaller chunk must start there (*).
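To make that concrete, the lookup boils down to align-down arithmetic,
roughly like this (illustrative helper, not the patch's actual code):

  // Where would the enclosing prospective larger chunk have to start?
  // With the new rule, simply align the returned chunk's address down
  // to the larger chunk size.
  MetaWord* prospective_larger_chunk_start(MetaWord* returned_chunk,
                                           size_t larger_chunk_word_size) {
    const uintptr_t addr = (uintptr_t)returned_chunk;
    const uintptr_t larger_chunk_bytes = larger_chunk_word_size * BytesPerWord;
    return (MetaWord*)(addr - (addr % larger_chunk_bytes));
  }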
>>>
>>> In order to check the free-ness of chunks quickly, each VirtualSpaceNode
>>> now keeps a bitmap which describes its occupancy. One bit in this bitmap
>>> corresponds to a range the size of the smallest chunk size and starting
>>> at
>>> an address aligned to the smallest chunk size. Because of the alignment
>>> rules above, such a range belongs to one single chunk. The bit is 1 if
>>> the
>>> associated chunk is in use by a class loader, 0 if it is free.
>>>
>>> When we have calculated the address range a prospective larger chunk would
>>> span, we now need to check if all chunks in that range are free. Only then
>>> can we merge them. We do that by querying the bitmap. Note that the most
>>> common use case here is forming medium chunks from smaller chunks. With
>>> the new alignment rules, the bitmap portion covering a medium chunk now
>>> always happens to be 16 or 32 bits in size and is 16- or 32-bit aligned,
>>> so reading the bitmap in many cases becomes a simple 16- or 32-bit load.
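A sketch of what the bitmap query amounts to (illustrative only - the real
logic lives in the new OccupancyMap class, which additionally keeps a second
layer marking chunk starts, see (*) below):

  // One bit per smallest-chunk-sized area, counted from the node's bottom.
  size_t bit_index_for(const MetaWord* addr, const MetaWord* bottom,
                       size_t smallest_chunk_word_size) {
    return (size_t)(addr - bottom) / smallest_chunk_word_size;
  }

  // Are all chunks in the area covered by [from_bit, from_bit + num_bits) free?
  // For a medium-chunk-sized area these are 16 or 32 consecutive, aligned bits,
  // so in the common case this collapses to a single 16- or 32-bit load.
  bool area_is_free(const uint8_t* map, size_t from_bit, size_t num_bits) {
    for (size_t i = from_bit; i < from_bit + num_bits; i++) {
      if (map[i / 8] & (1u << (i % 8))) {
        return false;   // bit set: the chunk covering this area is in use
      }
    }
    return true;
  }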
>>>
>>> Only if the range is free do we need to iterate the chunks in that range:
>>> pull them from the freelist, combine them into one new larger chunk, and
>>> re-add that one to the freelist.
>>>
>>> (*) Humongous chunks make this a bit more complicated. Since the new
>>> alignment rule does not extend to them, a humongous chunk could still
>>> straddle the lower or upper boundary of the prospective larger chunk. So I
>>> gave the occupancy map a second layer, which is used to mark the start of
>>> chunks.
>>> An alternative approach could have been to make humongous chunk sizes and
>>> start addresses always a multiple of the largest non-humongous chunk size
>>> (medium chunks). That would have caused a bit of waste per humongous chunk
>>> (<64K) in exchange for simpler coding and a simpler occupancy map.
>>>
>>> --
>>>
>>> The patch shows its best results in scenarios where a lot of smallish
>>> class loaders are alive simultaneously. When dying, they leave contiguous
>>> expanses of metaspace covered in small chunks, which can be merged nicely.
>>> However, if class loader lifetimes vary more, we have more interleaving of
>>> dead and live small chunks, and hence chunk merging does not work as well
>>> as it could.
>>>
>>> For an example of a pathological case like this see example program: [5]
>>>
>>> Executed like this: "java -XX:CompressedClassSpaceSize=10M -cp test3
>>> test3.Example2", the test will load 3000 small classes in separate class
>>> loaders, then throw them away and start loading large classes. The small
>>> classes will have flooded the metaspace with small chunks, which are
>>> unusable for the large classes. When executing with the rather limited
>>> CompressedClassSpaceSize=10M, we will run into an OOM after loading about
>>> 800 large classes, having used only 40% of the class space; the rest is
>>> wasted on unused small chunks. However, with our patch the example program
>>> will manage to allocate ~2900 large classes before running into an OOM,
>>> and class space will show almost no waste.
>>>
>>> To demonstrate this, add -Xlog:gc+metaspace+freelist. After running into
>>> an OOM, statistics and an ASCII representation of the class space will be
>>> shown. The unpatched version will show large expanses of unused small
>>> chunks; the patched variant will show almost no waste.
>>>
>>> Note that the patch could be made more effective with a different size
>>> ratio between small and medium chunks: in class space, that ratio is 1:16,
>>> so 16 small chunks must happen to be free to form one larger chunk. With a
>>> smaller ratio the chance for successful merging would be larger. So there
>>> may be room for future improvement here: since we now can merge and split
>>> chunks on demand, we could introduce more chunk sizes, potentially
>>> arriving at a buddy-allocator style where we drop hard-wired chunk sizes
>>> for a dynamic model in which the ratio between chunk sizes is always 1:2
>>> and there is in theory no limit to the chunk size. But this is just a
>>> thought and well out of the scope of this patch.
>>>
>>> --
>>>
>>> What does this patch cost (memory):
>>>
>>> - The occupancy bitmap adds 1 byte per 4K of metaspace (see the note after
>>> this list).
>>> - MetaChunk headers get larger, since we add an enum and two bools to
>>> them. Depending on what the C++ compiler does with that, chunk headers
>>> grow by one or two MetaWords, reducing the payload size by that amount.
>>> - The new alignment rules mean we may need to create padding chunks to
>>> precede larger chunks. But since these padding chunks are added to the
>>> freelist, they should be used up before the need for new padding chunks
>>> arises. So, the maximum possible number of unused padding chunks should
>>> be limited by design to about 64K.
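(Back-of-the-envelope for the first point, under the assumption that both map
layers use one bit per smallest-chunk-sized area: the smallest chunk is 128
words, i.e. 1K on 64-bit, so two bits per 1K work out to 1 byte per 4K of
metaspace.)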
>>>
>>> The expectation is that the memory savings from this patch far outweigh
>>> its added memory costs.
>>>
>>> .. (performance):
>>>
>>> We did not see measurable drops in standard benchmarks beyond the normal
>>> noise. I also measured times for a program which stresses metaspace chunk
>>> merging, with the same result.
>>>
>>> I am open to suggestions as to what else I should measure, and/or to
>>> independent measurements.
>>>
>>> --
>>>
>>> Other details:
>>>
>>> I removed SpaceManager::get_small_chunk_and_allocate() to reduce
>>> complexity somewhat, because it was made mostly obsolete by this patch:
>>> since small chunks are combined into larger chunks upon return to the
>>> freelist, in theory we should not have that many free small chunks anymore
>>> anyway. However, there may still be cases where we could benefit from this
>>> workaround, so I am asking for your opinion on this one.
>>>
>>> About tests: There were two native tests - ChunkManagerReturnTest and
>>> TestVirtualSpaceNode (the former was added by me last year) - which did
>>> not make much sense anymore, since they relied heavily on internal
>>> behavior which was made unpredictable by this patch.
>>> To make up for these lost tests, I added a new gtest which attempts to
>>> stress the many combinations of allocation patterns, but does so from a
>>> layer above the old tests. It now uses Metaspace::allocate() and friends.
>>> By using that entry point for the tests, I am less dependent on
>>> implementation internals and still cover a lot of scenarios.
>>>
>>> --
>>>
>>> Review pointers:
>>>
>>> Good points to start are:
>>> - ChunkManager::return_single_chunk() - specifically,
>>> ChunkManager::attempt_to_coalesce_around_chunk() - here we merge chunks
>>> upon return to the free list
>>> - ChunkManager::free_chunks_get(): here we now split large chunks into
>>> smaller chunks on demand
>>> - VirtualSpaceNode::take_from_committed(): chunks are now allocated
>>> according to the alignment rules, and padding chunks are handled
>>> - The OccupancyMap class is the helper class implementing the new
>>> occupancy bitmap
>>>
>>> The rest is mostly chaff: helper functions, added tests and
>>> verifications.
>>>
>>> --
>>>
>>> Thanks and Best Regards, Thomas
>>>
>>> [1] https://bugs.openjdk.java.net/browse/JDK-8166690
>>> [2] http://mail.openjdk.java.net/pipermail/jdk-dev/2017-November/000128.html
>>> [3] https://bugs.openjdk.java.net/browse/JDK-8185034
>>> [4] https://bugs.openjdk.java.net/browse/JDK-8176808
>>> [5] https://bugs.openjdk.java.net/secure/attachment/63532/test3.zip
>>>