RFR(L): 8198423: Improve metaspace chunk allocation (was: Proposal for improvements to the metaspace chunk allocator)

Thomas Stüfe thomas.stuefe at gmail.com
Mon Mar 5 05:20:52 UTC 2018


On Thu, Mar 1, 2018 at 11:36 AM, Thomas Stüfe <thomas.stuefe at gmail.com>
wrote:

> Hi Coleen,
>
> thanks a lot for the review and the sponsoring offer!
>
> New version (full):
> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-03-01/webrev-full/webrev/
> incremental:
> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-03-01/webrev-incr/webrev/
>
> Please find remarks inline:
>
>
> On Tue, Feb 27, 2018 at 11:22 PM, <coleen.phillimore at oracle.com> wrote:
>
>>
>> Thomas, review comments:
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev/src/hotspot/share/memory/metachunk.hpp.udiff.html
>>
>> +// ChunkIndex (todo: rename?) defines the type of chunk. Chunk types
>>
>>
>> It's really both, isn't it?  The type is the index into the free list or
>> in use lists.  The name seems fine.
>>
>>
> You are right. What I meant was that a lot of code needs to know about the
> different chunk sizes, but by naming it "Index" and adding enum values like
> "NumberOfFreeLists" we expose implementation details that no-one outside of
> SpaceManager and ChunkManager cares about (namely, the fact that these
> values are internally used as indices into arrays). A more neutral naming
> would be something like "enum ChunkTypes { spec, small, ...,
> NumberOfNonHumongousChunkTypes, NumberOfChunkTypes }".
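>
> Just to illustrate what I mean - a rough sketch only, not part of the
> patch, and the exact names are made up:
>
>   // Hypothetical neutral naming; the values can still double as array
>   // indices internally without advertising that fact in their names.
>   enum ChunkTypes {
>     SpecializedChunkType           = 0,
>     SmallChunkType                 = 1,
>     MediumChunkType                = 2,
>     NumberOfNonHumongousChunkTypes = 3,
>     HumongousChunkType             = 3,
>     NumberOfChunkTypes             = 4
>   };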
>
> However, I can leave this out for a possible future cleanup. The change is
> big enough as it is.
>
>
>> Can you add comments on the #endifs if the #ifdef is more than 2-3 lines
>> above (it's a nit that bothers me).
>>
>> +#ifdef ASSERT
>> + // A 32bit sentinel for debugging purposes.
>> +#define CHUNK_SENTINEL 0x4d4554EF // "MET"
>> +#define CHUNK_SENTINEL_INVALID 0xFEEEEEEF
>> + uint32_t _sentinel;
>> +#endif
>> + const ChunkIndex _chunk_type;
>> + const bool _is_class;
>> + // Whether the chunk is free (in freelist) or in use by some class loader.
>>    bool _is_tagged_free;
>>  +#ifdef ASSERT
>> + ChunkOrigin _origin;
>> + int _use_count;
>> +#endif
>> +
>>
>>
> I removed the #ifdef ASSERT guards completely, following your suggestion
> below that "origin" would be valuable in customer scenarios too. By that
> logic, the other members are valuable too: the sentinel is valuable when
> examining memory dumps to see the start of chunks, and the in-use counter
> is useful too. What do you think?
>
> So, I leave the members in - which, depending on what the C++ compiler does
> to enums and bools, may cost up to 128 bits of additional header space. I
> think that is ok. In one of my earlier versions of this patch I
> hand-crafted the header using chars and bitfields to be as small as
> possible, but that seemed over-engineered.
>
> However, any automatic verifications accessing these debug members are
> still only done in debug builds.
>
>
>>
>> It seems that you could move origin and _use_count into the ASSERT
>> block above (maybe putting use_count before _origin).
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev/src/hotspot/share/memory/metaspace.cpp.udiff.html
>>
>> In take_from_committed, can the allocation of padding chunks be its own
>> function, like add_chunks_to_aligment(), for lines 1574-1615? The function
>> is too long now.
>>
>>
> I moved the padding chunk allocation into its own function, as you
> suggested.
>
>
>> I don't think coalescation is a word in English, at least my dictionary
>> cannot find it.  Although it makes sense in the context, just distracting.
>>
>>
> I replaced "coalescation" with "chunk merging" throughout the code. Also
> less of a tongue breaker.
>
>
>> + // Now check if in the coalescation area there are still life chunks.
>>
>>
>> "live" chunks I guess.   A sentence you won't read often :).
>>
>
> Now that I read it, it almost sounded sinister :) Fixed.
>
>
>>
>> In free_chunks_get() can you handle the Humongous case first? The else
>> for humongous chunk size is buried tons of lines below.
>>
>> Otherwise it might be helpful to the logic to make your addition to this
>> function be a function you call like
>>   chunk = split_from_larger_free_chunk();
>>
>
> I did the latter. I moved the splitting of a larger chunk into its own
> function. This causes a slight logic change: the new function
> (ChunkManager::split_chunk()) splits an existing large free chunk into n
> smaller free chunks and adds them all back to the freelist - including the
> chunk we are about to return. That allows us to use the same exit path -
> which removes the chunk from the freelist and adjusts all counters - in the
> caller function ChunkManager::free_chunks_get(), instead of having to
> return in the middle of the function.
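>
> In heavily simplified pseudo code, the new flow looks about like this
> (just a sketch - apart from split_chunk(), the helper names here are
> invented and not the real function names):
>
>   // ChunkManager::free_chunks_get(), simplified:
>   Metachunk* chunk = find_free_chunk_of_size(requested_size);
>   if (chunk == NULL) {
>     Metachunk* larger = find_larger_free_chunk(requested_size);
>     if (larger != NULL) {
>       // split_chunk() splits "larger" into n smaller free chunks and
>       // re-adds all of them - including the one we want - to the
>       // freelist.
>       split_chunk(requested_size, larger);
>       chunk = find_free_chunk_of_size(requested_size);
>     }
>   }
>   if (chunk != NULL) {
>     // The one common exit path: take the chunk out of the freelist
>     // and adjust all counters.
>     remove_chunk_from_freelist_and_adjust_counters(chunk);
>   }
>   return chunk;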
>
> To make the code more readable, I also moved the
> "test-that-free-chunks-are-optimally-merged" verification - which was
> quite lengthy - out of VirtualSpaceNode::verify() and into a new function,
> VirtualSpaceNode::verify_free_chunks_are_ideally_merged().
>
>
>> You might want to keep the origin in product mode if it doesn't add to
>> the chunk footprint.   Might help with customer debugging.
>>
>>
> See above
>
>
>> Awesome looking test...
>>
>>
> Thanks, I was worried it would be too complicated.
> I changed it a bit because there were sporadic errors. Not a "real" error -
> the test itself was faulty: the "metaspaces_in_use" counter was slightly
> wrong in one corner case.
>
>
>> I've read through most of this and thank you for adding this to at least
>> partially solve the fragmentation problem.  The irony is that we
>> templatized the Dictionary from CMS so that we could use it for Metaspace
>> and that has splitting and coalescing but it seems this code makes more
>> sense than adapting that code (if it's even possible).
>>
>
> Well, it helps other metadata use cases too, no?
>
>
>>
>> Thank you for working on this.  I'll sponsor this for you.
>>
> Coleen
>>
>>
>
> Thanks again!
>
> I also updated my jdk-submit branch to include these latest changes; tests
> are still running.
>
>
Tests in jdk-submit ran without errors.
Our nightly tests do not show errors on any of our platforms.

Best Regards, Thomas


> Kind Regards, Thomas
>
>
>
>>
>> On 2/26/18 9:20 AM, Thomas Stüfe wrote:
>>
>>> Hi all,
>>>
>>> I know this patch is a bit larger, but may I please have reviews and/or
>>> other input?
>>>
>>> Issue: https://bugs.openjdk.java.net/browse/JDK-8198423
>>> Latest version:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev/
>>>
>>> For those who followed the mail thread, this is the incremental diff to
>>> the last changes (it includes feedback Goetz gave me on- and off-list):
>>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev-incr/webrev/
>>>
>>> Thank you!
>>>
>>> Kind Regards, Thomas Stuefe
>>>
>>>
>>>
>>> On Thu, Feb 8, 2018 at 12:58 PM, Thomas Stüfe <thomas.stuefe at gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>>
>>>> We would like to contribute a patch developed at SAP which has been live
>>>> in our VM for some time. It improves metaspace chunk allocation: it
>>>> reduces fragmentation and raises the chance of reusing free metaspace
>>>> chunks.
>>>>
>>>> The patch:
>>>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-05--2/webrev/
>>>>
>>>> In short, this patch helps with a number of pathological cases where
>>>> metaspace chunks are free but cannot be reused because they are of the
>>>> wrong size. For example, the metaspace freelist could be full of small
>>>> chunks, which would not be reusable if we needed larger chunks. So, we
>>>> could get metaspace OOMs even in situations where the metaspace was far
>>>> from exhausted. Our patch adds the ability to split and merge metaspace
>>>> chunks dynamically and thus removes this "size-lock-in" problem.
>>>>
>>>> Note that there have been other attempts to get a grip on this problem,
>>>> see e.g. "SpaceManager::get_small_chunk_and_allocate()". But arguably
>>>> our patch attempts a more complete solution.
>>>>
>>>> In 2016 I discussed the idea for this patch with some folks off-list,
>>>> among them Jon Matsimutso. He advised me to create a JEP, so I did: [1].
>>>> However, changes to the JEP process have since been discussed [2], and I
>>>> am not sure anymore that this patch even needs a JEP. It may be
>>>> moderately complex and hence carries the risk inherent in any patch, but
>>>> its effects would not be externally visible (if you discount seeing
>>>> fewer metaspace OOMs). So, I'd prefer to handle this as a simple RFE.
>>>>
>>>> --
>>>>
>>>> How this patch works:
>>>>
>>>> 1) When a class loader dies, its metaspace chunks are freed and returned
>>>> to the freelist for reuse by the next class loader. With the patch, upon
>>>> returning a chunk to the freelist, an attempt is made to merge it with
>>>> its neighboring chunks - should they happen to be free too - to form a
>>>> larger chunk, which is then placed in the freelist.
>>>>
>>>> As a result, the freelist should be populated by larger chunks at the
>>>> expense of smaller chunks. In other words, all free chunks should always
>>>> be as "coalesced as possible".
>>>>
>>>> 2) When a class loader needs a new chunk and a chunk of the requested
>>>> size cannot be found in the free list, before carving out a new chunk
>>>> from the virtual space, we first check if there is a larger chunk in the
>>>> free list. If there is, that larger chunk is chopped up into n smaller
>>>> chunks. One of them is returned to the caller, the others are re-added
>>>> to the freelist.
>>>>
>>>> (1) and (2) together have the effect of removing the size-lock-in for
>>>> chunks. If fragmentation allows it, small chunks are dynamically
>>>> combined to form larger chunks, and larger chunks are split on demand.
>>>>
>>>> --
>>>>
>>>> What this patch does not:
>>>>
>>>> This is not a rewrite of the chunk allocator - most of the mechanisms
>>>> stay intact. Specifically, chunk sizes remain unchanged, and so do the
>>>> chunk allocation processes (when which class loaders get handed which
>>>> chunk size). Almost everything this patch does affects only the internal
>>>> workings of the ChunkManager.
>>>>
>>>> Also note that I refrained from doing any cleanups, since I wanted
>>>> reviewers to be able to gauge this patch without filtering noise.
>>>> Unfortunately this patch adds some complexity. But there are many future
>>>> opportunities for code cleanup and simplification, some of which we have
>>>> already discussed in existing RFEs ([3], [4]). All of them are out of
>>>> scope for this particular patch.
>>>>
>>>> --
>>>>
>>>> Details:
>>>>
>>>> Before the patch, the following rules held:
>>>> - All chunk sizes are multiples of the smallest chunk size ("specialized
>>>> chunks")
>>>> - All chunk sizes of larger chunks are also clean multiples of the next
>>>> smaller chunk size (e.g. for class space, the ratio of
>>>> specialized/small/medium chunks is 1:2:32)
>>>> - All chunk start addresses are aligned to the smallest chunk size (more
>>>> or less accidentally, see metaspace_reserve_alignment).
>>>> The patch makes the last rule explicit and more strict:
>>>> - All (non-humongous) chunk start addresses are now aligned to their own
>>>> chunk size. So, e.g., medium chunks are allocated at addresses which are
>>>> a multiple of the medium chunk size. This rule is not extended to
>>>> humongous chunks, whose start addresses continue to be aligned to the
>>>> smallest chunk size.
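>>>>
>>>> (As a concrete example - assuming the current class space chunk sizes,
>>>> where a medium chunk is 4K words, i.e. 32K on 64-bit - medium chunks
>>>> would now only ever start at 32K-aligned addresses.)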
>>>>
>>>> The reason for this new alignment rule is that it makes it cheap both to
>>>> find a chunk's predecessors and to check which chunks are free.
>>>>
>>>> When a class loader dies and its chunk is returned to the freelist, all
>>>> we have is its address. In order to merge it with its neighbors to form
>>>> a larger chunk, we need to find those neighbors, including those
>>>> preceding the returned chunk. Prior to this patch that was not easy -
>>>> one would have to iterate chunks starting at the beginning of the
>>>> VirtualSpaceNode. But due to the new alignment rule, we now know where
>>>> the prospective larger chunk must start - at the next lower
>>>> larger-chunk-size-aligned boundary. We also know that currently a
>>>> smaller chunk must start there (*).
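>>>>
>>>> In code, finding that boundary is just an align-down operation. A sketch
>>>> only (the real patch uses the existing alignment helpers):
>>>>
>>>>   // Start address of the prospective larger chunk (e.g. a medium
>>>>   // chunk) containing the returned chunk:
>>>>   MetaWord* candidate_start = (MetaWord*) align_down(
>>>>       (uintptr_t) returned_chunk->bottom(),
>>>>       larger_chunk_word_size * BytesPerWord);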
>>>>
>>>> In order to check the free-ness of chunks quickly, each VirtualSpaceNode
>>>> now keeps a bitmap which describes its occupancy. One bit in this bitmap
>>>> corresponds to a range the size of the smallest chunk size and starting
>>>> at an address aligned to the smallest chunk size. Because of the
>>>> alignment rules above, such a range belongs to one single chunk. The bit
>>>> is 1 if the associated chunk is in use by a class loader, 0 if it is
>>>> free.
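>>>>
>>>> As a sketch (invented names, not the actual OccupancyMap interface):
>>>>
>>>>   // One bit per smallest-chunk-sized area, counted from the start of
>>>>   // the VirtualSpaceNode:
>>>>   size_t bit_number_for(const MetaWord* p) const {
>>>>     return (size_t)(p - _node_base) / smallest_chunk_word_size;
>>>>   }
>>>>   bool is_in_use(const MetaWord* p) const {
>>>>     return get_bit(bit_number_for(p));  // 1 = in use, 0 = free
>>>>   }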
>>>>
>>>> When we have calculated the address range a prospective larger chunk
>>>> would span, we now need to check if all chunks in that range are free;
>>>> only then can we merge them. We do that by querying the bitmap. Note
>>>> that the most common use case here is forming medium chunks from smaller
>>>> chunks. With the new alignment rules, the bitmap portion covering a
>>>> medium chunk now always happens to be 16 or 32 bits in size and is 16-
>>>> or 32-bit aligned, so reading the bitmap in many cases becomes a simple
>>>> 16- or 32-bit load.
>>>>
>>>> Only if the range is free do we need to iterate the chunks in that
>>>> range: pull them from the freelist, combine them into one new larger
>>>> chunk, and re-add that one to the freelist.
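>>>>
>>>> Roughly - again just a sketch of the flow, with invented helper names:
>>>>
>>>>   const size_t larger_size = larger_chunk_word_size;
>>>>   if (occupancy_map->is_range_free(candidate_start, larger_size)) {
>>>>     // Pull the smaller chunks in that range from the freelist, fuse
>>>>     // them into one larger chunk and re-add that one.
>>>>     remove_chunks_in_range_from_freelist(candidate_start, larger_size);
>>>>     Metachunk* merged = init_chunk_header(candidate_start, larger_size);
>>>>     add_to_freelist(merged);
>>>>   }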
>>>>
>>>> (*) Humongous chunks make this a bit more complicated. Since the new
>>>> alignment rule does not extend to them, a humongous chunk could still
>>>> straddle the lower or upper boundary of the prospective larger chunk.
>>>> So I gave the occupancy map a second layer, which is used to mark the
>>>> start of chunks.
>>>> An alternative approach could have been to make humongous chunk sizes
>>>> and start addresses always a multiple of the largest non-humongous chunk
>>>> size (medium chunks). That would have caused a bit of waste per
>>>> humongous chunk (<64K) in exchange for simpler coding and a simpler
>>>> occupancy map.
>>>>
>>>> --
>>>>
>>>> The patch shows its best results in scenarios where a lot of smallish
>>>> class loaders are alive simultaneously. When dying, they leave
>>>> contiguous expanses of metaspace covered in small chunks, which can be
>>>> merged nicely. However, if class loader lifetimes vary more, we get more
>>>> interleaving of dead and live small chunks, and hence chunk merging does
>>>> not work as well as it could.
>>>>
>>>> For an example of a pathological case like this see example program: [5]
>>>>
>>>> Executed like this: "java -XX:CompressedClassSpaceSize=10M -cp test3
>>>> test3.Example2", the test will load 3000 small classes in separate class
>>>> loaders, then throw them away and start loading large classes. The small
>>>> classes will have flooded the metaspace with small chunks, which are
>>>> unusable for the large classes. When executing with the rather limited
>>>> CompressedClassSpaceSize=10M, we will run into an OOM after loading
>>>> about 800 large classes, having used only 40% of the class space; the
>>>> rest is wasted on unused small chunks. However, with our patch the
>>>> example program will manage to allocate ~2900 large classes before
>>>> running into an OOM, and the class space will show almost no waste.
>>>>
>>>> To demonstrate this, add -Xlog:gc+metaspace+freelist. After running into
>>>> an OOM, statistics and an ASCII representation of the class space will
>>>> be shown. The unpatched version will show large expanses of unused small
>>>> chunks; the patched variant will show almost no waste.
>>>>
>>>> Note that the patch could be made more effective with a different size
>>>> ratio between small and medium chunks: in class space, that ratio is
>>>> 1:16, so 16 small chunks must happen to be free to form one larger
>>>> chunk. With a smaller ratio the chance for successful merging would be
>>>> larger. So there may be room for future improvement here: since we can
>>>> now merge and split chunks on demand, we could introduce more chunk
>>>> sizes, potentially arriving at a buddy-ish allocator style where we drop
>>>> hard-wired chunk sizes for a dynamic model in which the ratio between
>>>> chunk sizes is always 1:2 and there is, in theory, no limit to the chunk
>>>> size. But this is just a thought and well out of the scope of this
>>>> patch.
>>>>
>>>> --
>>>>
>>>> What does this patch cost (memory):
>>>>
>>>>   - The occupancy bitmap adds 1 byte per 4K of metaspace (see the short
>>>> calculation after this list).
>>>>   - MetaChunk headers get larger, since we add an enum and two bools to
>>>> them. Depending on what the C++ compiler does with that, chunk headers
>>>> grow by one or two MetaWords, reducing the payload size by that amount.
>>>>   - The new alignment rules mean we may need to create padding chunks to
>>>> precede larger chunks. But since these padding chunks are added to the
>>>> freelist, they should be used up before the need for new padding chunks
>>>> arises. So, the maximum possible space occupied by unused padding chunks
>>>> should be limited by design to about 64K.
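>>>>
>>>> (The 1 byte per 4K ratio follows if one assumes the smallest chunk size
>>>> of 128 MetaWords - 1K on 64-bit - and the two occupancy map layers
>>>> mentioned above: 2 bits per 1K area = 1 byte per 4K of metaspace.)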
>>>>
>>>> The expectation is that the memory savings from this patch far outweigh
>>>> its added memory costs.
>>>>
>>>> .. (performance):
>>>>
>>>> We did not see measurable drops in standard benchmarks rising above the
>>>> normal noise. I also measured times for a program which stresses
>>>> metaspace chunk merging, with the same result.
>>>>
>>>> I am open to suggestions as to what else I should measure, and/or to
>>>> independent measurements.
>>>>
>>>> --
>>>>
>>>> Other details:
>>>>
>>>> I removed SpaceManager::get_small_chunk_and_allocate() to reduce
>>>> complexity somewhat, because it was made mostly obsolete by this patch:
>>>> since small chunks are combined into larger chunks upon return to the
>>>> freelist, in theory we should not have that many free small chunks
>>>> anymore anyway. However, there may still be cases where we could benefit
>>>> from this workaround, so I am asking for your opinion on this one.
>>>>
>>>> About tests: There were two native tests - ChunkManagerReturnTest and
>>>> TestVirtualSpaceNode (the former was added by me last year) - which did
>>>> not make much sense anymore, since they relied heavily on internal
>>>> behavior which was made unpredictable by this patch.
>>>> To make up for these lost tests, I added a new gtest which attempts to
>>>> stress the many combinations of allocation patterns, but does so from a
>>>> layer above the old tests. It now uses Metaspace::allocate() and
>>>> friends. By using that entry point for the tests, I am less dependent on
>>>> implementation internals and still cover a lot of scenarios.
>>>>
>>>> --
>>>>
>>>> Review pointers:
>>>>
>>>> Good points to start are
>>>> - ChunkManager::return_single_chunk() - specifically,
>>>> ChunkManager::attempt_to_coalesce_around_chunk() - here we merge chunks
>>>> upon return to the free list
>>>> - ChunkManager::free_chunks_get(): Here we now split large chunks into
>>>> smaller chunks on demand
>>>> - VirtualSpaceNode::take_from_committed(): chunks are allocated
>>>> according to the new alignment rules now; padding chunks are handled
>>>> here
>>>> - The OccupancyMap class is the helper class implementing the new
>>>> occupancy bitmap
>>>>
>>>> The rest is mostly chaff: helper functions, added tests and
>>>> verifications.
>>>>
>>>> --
>>>>
>>>> Thanks and Best Regards, Thomas
>>>>
>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8166690
>>>> [2] http://mail.openjdk.java.net/pipermail/jdk-dev/2017-November/000128.html
>>>> [3] https://bugs.openjdk.java.net/browse/JDK-8185034
>>>> [4] https://bugs.openjdk.java.net/browse/JDK-8176808
>>>> [5] https://bugs.openjdk.java.net/secure/attachment/63532/test3.zip
>>>>
>>>>
>>>>
>>>>
>>
>

