RFR(L): 8198423: Improve metaspace chunk allocation (was: Proposal for improvements to the metaspace chunk allocator)

Thomas Stüfe thomas.stuefe at gmail.com
Tue Mar 6 18:33:57 UTC 2018


On Tue, Mar 6, 2018 at 7:00 PM, <coleen.phillimore at oracle.com> wrote:

>
>
> On 3/6/18 1:10 AM, Thomas Stüfe wrote:
>
> Hi Coleen,
>
> We test nightly on Windows 32-bit. I'll go and run some tests on 32-bit
> Linux too.
>
> Thanks for the sponsoring offer! Goetz already reviewed this patch; would
> that be sufficient, or should I look for another reviewer from Oracle?
>
>
> That's sufficient.  I'll sponsor the patch for you.  Can you update the
> patch in the webrev?
> thanks,
> Coleen
>
>
Great!

Here you go:
http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-03-06/webrev/

I rebased to head and fixed the change description. This is still an mq
change though; do you need me to do a qfinish? (I usually like to avoid
that because the local change would then clash with the change you are
about to push.)

Kind Regards, Thomas

>
> Kind Regards, Thomas
>
>
> On Tue, Mar 6, 2018 at 12:59 AM, <coleen.phillimore at oracle.com> wrote:
>
>>
>> Hi Thomas,
>>
>> I've read through the new code.  I don't have any substantive comments.
>> Thank you for adding the functions.
>>
>> Has this been tested on any 32 bit platforms?   I will sponsor this when
>> you have another reviewer.
>>
>> Thanks for taking on the metaspace!
>>
>> Coleen
>>
>>
>> On 3/1/18 5:36 AM, Thomas Stüfe wrote:
>>
>> Hi Coleen,
>>
>> thanks a lot for the review and the sponsoring offer!
>>
>> New version (full):
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-03-01/webrev-full/webrev/
>> incremental:
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-03-01/webrev-incr/webrev/
>>
>> Please find remarks inline:
>>
>>
>> On Tue, Feb 27, 2018 at 11:22 PM, <coleen.phillimore at oracle.com> wrote:
>>
>>>
>>> Thomas, review comments:
>>>
>>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev/src/hotspot/share/memory/metachunk.hpp.udiff.html
>>>
>>> +// ChunkIndex (todo: rename?) defines the type of chunk. Chunk types
>>>
>>>
>>> It's really both, isn't it?  The type is the index into the free list or
>>> in use lists.  The name seems fine.
>>>
>>>
>> You are right. What I meant was that a lot of code needs to know about
>> the different chunk sizes, but by naming it "Index" and adding enum values
>> like "NumberOfFreeLists", we expose implementation details no one outside
>> of SpaceManager and ChunkManager cares about (namely, the fact that these
>> values are internally used as indices into arrays). A more neutral naming
>> would be something like "enum ChunkTypes { spec, small, ....,
>> NumberOfNonHumongousChunkTypes, NumberOfChunkTypes }".
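>>
>> Spelled out, the idea would be roughly the following (the enumerator
>> names are made up here; this is only a sketch of the naming scheme, not
>> code from the patch):
>>
>>   enum ChunkTypes {
>>     SpecializedChunkType = 0,
>>     SmallChunkType,
>>     MediumChunkType,
>>     NumberOfNonHumongousChunkTypes,
>>     HumongousChunkType = NumberOfNonHumongousChunkTypes,
>>     NumberOfChunkTypes
>>   };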
>>
>> However, I can leave this out for a possible future cleanup. The change
>> is big enough as it is.
>>
>>
>>> Can you add comments on the #endifs if the #ifdef is more than a couple
>>> (2-3) lines above? (It's a nit that bothers me.)
>>>
>>> +#ifdef ASSERT
>>> + // A 32bit sentinel for debugging purposes.
>>> +#define CHUNK_SENTINEL 0x4d4554EF // "MET"
>>> +#define CHUNK_SENTINEL_INVALID 0xFEEEEEEF
>>> + uint32_t _sentinel;
>>> +#endif
>>> + const ChunkIndex _chunk_type;
>>> + const bool _is_class;
>>> + // Whether the chunk is free (in freelist) or in use by some class
>>> loader.
>>>    bool _is_tagged_free;
>>>  +#ifdef ASSERT
>>> + ChunkOrigin _origin;
>>> + int _use_count;
>>> +#endif
>>> +
>>>
>>>
>> I removed the asserts completely, following your suggestion below that
>> "origin" would be valuable in customer scenarios too. By that logic, the
>> other members are valuable too: the sentinel is valuable when examining
>> memory dumps to see the start of chunks, and the in-use counter is useful
>> too. What do you think?
>>
>> So, I leave the members in - which, depending on what the C++ compiler does
>> to enums and bools, may cost up to 128 bits of additional header space. I
>> think that is ok. In one of my earlier versions of this patch I hand-crafted
>> the header using chars and bitfields to be as small as possible, but that
>> seemed over-engineered.
>>
>> However, I left out any automatic verifications accessing these debug
>> members. These are still only done in debug builds.
>>
>>
>>>
>>> It seems that you could move origin and _use_count into the ASSERT
>>> block above (maybe putting use_count before _origin).
>>>
>>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev/src/hotspot/share/memory/metaspace.cpp.udiff.html
>>>
>>> In take_from_committed, can the allocation of padding chunks be its own
>>> function like add_chunks_to_alignment(), lines 1574-1615? The function is
>>> too long now.
>>>
>>>
>> I moved the padding chunk allocation into its own function as you
>> suggested.
>>
>>
>>> I don't think coalescation is a word in English, at least my dictionary
>>> cannot find it.  Although it makes sense in the context, just distracting.
>>>
>>>
>> I replaced "coalescation" with "chunk merging" throughout the code. Also
>> less of a tongue breaker.
>>
>>
>>> + // Now check if in the coalescation area there are still life chunks.
>>>
>>>
>>> "live" chunks I guess.   A sentence you won't read often :).
>>>
>>
>> Now that I read it, it almost sounded sinister :) Fixed.
>>
>>
>>>
>>> In free_chunks_get() can you handle the Humongous case first? The else
>>> for humongous chunk size is buried tons of lines below.
>>>
>>> Otherwise it might be helpful to the logic to make your addition to this
>>> function be a function you call like
>>>   chunk = split_from_larger_free_chunk();
>>>
>>
>> I did the latter. I moved the splitting of a larger chunk into its own
>> function. This causes a slight logic change: the new function
>> (ChunkManager::split_chunk()) splits an existing large free chunk into n
>> smaller free chunks and adds them all back to the freelist - that includes
>> the chunk we are about to return. That allows us to use the same exit path
>> - which removes the chunk from the freelist and adjusts all counters - in
>> the caller function "ChunkManager::free_chunks_get" instead of having to
>> return in the middle of the function.
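>>
>> Roughly, the resulting control flow looks like this (a simplified sketch
>> only - some of the function and member names below are made up and may
>> not match the webrev exactly):
>>
>>   Metachunk* ChunkManager::free_chunks_get(size_t word_size) {
>>     Metachunk* chunk = find_free_chunk_of_size(word_size);
>>     if (chunk == NULL) {
>>       // No free chunk of the requested size. Attempt to split a larger
>>       // free chunk: the split carves it into n smaller free chunks and
>>       // adds all of them back to the freelist, including the one we are
>>       // about to hand out.
>>       chunk = split_from_larger_free_chunk(word_size);
>>     }
>>     if (chunk != NULL) {
>>       // Common exit path for both cases: remove the chunk from the
>>       // freelist and adjust all counters.
>>       remove_chunk_from_freelist(chunk);
>>     }
>>     return chunk;
>>   }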
>>
>> To make the test more readable, I also moved the
>> "test-that-free-chunks-are-optimally-merged" verification - which was
>> quite lengthy - from VirtualSpaceNode::verify() to a new function,
>> VirtualSpaceNode::verify_free_chunks_are_ideally_merged().
>>
>>
>>> You might want to keep the origin in product mode if it doesn't add to
>>> the chunk footprint.   Might help with customer debugging.
>>>
>>>
>> See above
>>
>>
>>> Awesome looking test...
>>>
>>>
>> Thanks, I was worried it would be too complicated.
>> I changed it a bit because there were sporadic errors. Not a "real"
>> error, just the test itself was faulty. The "metaspaces_in_use" counter was
>> slightly wrong in one corner case.
>>
>>
>>> I've read through most of this and thank you for adding this to at least
>>> partially solve the fragmentation problem.  The irony is that we
>>> templatized the Dictionary from CMS so that we could use it for Metaspace,
>>> and that has splitting and coalescing, but it seems this code makes more
>>> sense than adapting that code (if it's even possible).
>>>
>>
>> Well, it helps other metadata use cases too, no?
>>
>>
>>>
>>> Thank you for working on this.  I'll sponsor this for you.
>>>
>> Coleen
>>>
>>>
>>
>> Thanks again!
>>
>> I also updated my jdk-submit branch to include these latest changes;
>> tests are still running.
>>
>> Kind Regards, Thomas
>>
>>
>>
>>>
>>> On 2/26/18 9:20 AM, Thomas Stüfe wrote:
>>>
>>>> Hi all,
>>>>
>>>> I know this patch is a bit larger, but may I please have reviews and/or
>>>> other input?
>>>>
>>>> Issue: https://bugs.openjdk.java.net/browse/JDK-8198423
>>>> Latest version:
>>>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev/
>>>>
>>>> For those who followed the mail thread, this is the incremental diff to
>>>> the last changes (it includes feedback Goetz gave me on- and off-list):
>>>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev-incr/webrev/
>>>>
>>>> Thank you!
>>>>
>>>> Kind Regards, Thomas Stuefe
>>>>
>>>>
>>>>
>>>> On Thu, Feb 8, 2018 at 12:58 PM, Thomas Stüfe <thomas.stuefe at gmail.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>>
>>>>> We would like to contribute a patch developed at SAP which has been
>>>>> live in our VM for some time. It improves metaspace chunk allocation:
>>>>> it reduces fragmentation and raises the chance of reusing free
>>>>> metaspace chunks.
>>>>>
>>>>> The patch:
>>>>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-05--2/webrev/
>>>>>
>>>>> In short, this patch helps with a number of pathological cases where
>>>>> metaspace chunks are free but cannot be reused because they are of the
>>>>> wrong size. For example, the metaspace freelist could be full of small
>>>>> chunks, which would not be reusable if we need larger chunks. So, we
>>>>> could get metaspace OOMs even in situations where the metaspace was far
>>>>> from exhausted. Our patch adds the ability to split and merge metaspace
>>>>> chunks dynamically and thus removes the "size-lock-in" problem.
>>>>>
>>>>> Note that there have been other attempts to get a grip on this problem,
>>>>> see e.g. "SpaceManager::get_small_chunk_and_allocate()". But arguably
>>>>> our patch attempts a more complete solution.
>>>>>
>>>>> In 2016 I discussed the idea for this patch with some folks off-list,
>>>>> among them Jon Matsimutso. He advised me to create a JEP, so I did [1].
>>>>> However, changes to the JEP process have meanwhile been discussed [2],
>>>>> and I am not sure anymore that this patch even needs a JEP. It may be
>>>>> moderately complex and hence carries the risk inherent in any patch,
>>>>> but its effects would not be externally visible (if you discount seeing
>>>>> fewer metaspace OOMs). So, I'd prefer to handle this as a simple RFE.
>>>>>
>>>>> --
>>>>>
>>>>> How this patch works:
>>>>>
>>>>> 1) When a class loader dies, its metaspace chunks are freed and
>>>>> returned to the freelist for reuse by the next class loader. With the
>>>>> patch, upon returning a chunk to the freelist, an attempt is made to
>>>>> merge it with its neighboring chunks - should they happen to be free
>>>>> too - to form a larger chunk, which is then placed in the freelist.
>>>>>
>>>>> As a result, the freelist should be populated by larger chunks at the
>>>>> expense of smaller chunks. In other words, all free chunks should
>>>>> always be
>>>>> as "coalesced as possible".
>>>>>
>>>>> 2) When a class loader needs a new chunk and a chunk of the requested
>>>>> size cannot be found in the free list, before carving out a new chunk
>>>>> from the virtual space, we first check if there is a larger chunk in
>>>>> the free list. If there is, that larger chunk is chopped up into n
>>>>> smaller chunks. One of them is returned to the caller, the others are
>>>>> re-added to the freelist.
>>>>>
>>>>> (1) and (2) together have the effect of removing the size-lock-in for
>>>>> chunks. If fragmentation allows it, small chunks are dynamically
>>>>> combined
>>>>> to form larger chunks, and larger chunks are split on demand.
>>>>>
>>>>> --
>>>>>
>>>>> What this patch does not:
>>>>>
>>>>> This is not a rewrite of the chunk allocator - most of the mechanisms
>>>>> stay intact. Specifically, chunk sizes remain unchanged, and so do the
>>>>> chunk allocation processes (which class loaders get handed which chunk
>>>>> size, and when). Almost everything this patch does affects only the
>>>>> internal workings of the ChunkManager.
>>>>>
>>>>> Also note that I refrained from doing any cleanups, since I wanted
>>>>> reviewers to be able to gauge this patch without filtering noise.
>>>>> Unfortunately this patch adds some complexity. But there are many
>>>>> future opportunities for code cleanup and simplification, some of
>>>>> which we already discussed in existing RFEs ([3], [4]). All of them
>>>>> are out of scope for this particular patch.
>>>>>
>>>>> --
>>>>>
>>>>> Details:
>>>>>
>>>>> Before the patch, the following rules held:
>>>>> - All chunk sizes are multiples of the smallest chunk size
>>>>> ("specialized chunks").
>>>>> - All chunk sizes of larger chunks are also clean multiples of the next
>>>>> smaller chunk size (e.g. for class space, the ratio of
>>>>> specialized/small/medium chunks is 1:2:32).
>>>>> - All chunk start addresses are aligned to the smallest chunk size
>>>>> (more or less accidentally, see metaspace_reserve_alignment).
>>>>>
>>>>> The patch makes the last rule explicit and more strict:
>>>>> - All (non-humongous) chunk start addresses are now aligned to their
>>>>> own chunk size. So, e.g., medium chunks are allocated at addresses
>>>>> which are a multiple of the medium chunk size. This rule is not
>>>>> extended to humongous chunks, whose start addresses continue to be
>>>>> aligned to the smallest chunk size.
>>>>>
>>>>> The reason for this new alignment rule is that it makes it cheap both
>>>>> to find the predecessors of a chunk and to check which chunks are free.
>>>>>
>>>>> When a class loader dies and its chunk is returned to the freelist,
>>>>> all we
>>>>> have is its address. In order to merge it with its neighbors to form a
>>>>> larger chunk, we need to find those neighbors, including those
>>>>> preceding
>>>>> the returned chunk. Prior to this patch that was not easy - one would
>>>>> have
>>>>> to iterate chunks starting at the beginning of the VirtualSpaceNode.
>>>>> But
>>>>> due to the new alignment rule, we now know where the prospective larger
>>>>> chunk must start - at the next lower larger-chunk-size-aligned
>>>>> boundary. We
>>>>> also know that currently a smaller chunk must start there (*).
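>>>>>
>>>>> (Sketch of that address computation - the variable names are
>>>>> hypothetical, and the medium chunk is just the typical example:)
>>>>>
>>>>>   // 'returned' is the chunk just handed back to the freelist. Thanks
>>>>>   // to the new alignment rule, the prospective larger (here: medium)
>>>>>   // chunk that might be formed by merging must start at the next
>>>>>   // lower boundary aligned to the larger chunk size.
>>>>>   MetaWord* const merge_start =
>>>>>     align_down(returned->bottom(), medium_chunk_word_size * BytesPerWord);
>>>>>   MetaWord* const merge_end = merge_start + medium_chunk_word_size;
>>>>>   // All chunks inside [merge_start, merge_end) must be free for the
>>>>>   // merge to happen; this is what the occupancy bitmap below is for.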
>>>>>
>>>>> In order to check the free-ness of chunks quickly, each
>>>>> VirtualSpaceNode now keeps a bitmap which describes its occupancy.
>>>>> One bit in this bitmap corresponds to a range the size of the smallest
>>>>> chunk, starting at an address aligned to the smallest chunk size.
>>>>> Because of the alignment rules above, such a range belongs to a single
>>>>> chunk. The bit is 1 if the associated chunk is in use by a class
>>>>> loader, 0 if it is free.
>>>>>
>>>>> When we have calculated the address range a prospective larger chunk
>>>>> would span, we now need to check if all chunks in that range are free.
>>>>> Only then can we merge them. We do that by querying the bitmap. Note
>>>>> that the most common use case here is forming medium chunks from
>>>>> smaller chunks. With the new alignment rules, the bitmap portion
>>>>> covering a medium chunk now always happens to be 16 or 32 bits in size
>>>>> and 16- or 32-bit aligned, so reading the bitmap in many cases becomes
>>>>> a simple 16- or 32-bit load.
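>>>>>
>>>>> (Roughly, the query looks like this - the class is the new OccupancyMap
>>>>> mentioned further below, but the member names here are made up:)
>>>>>
>>>>>   // One bit per smallest-chunk-sized, smallest-chunk-aligned range;
>>>>>   // bit == 1 means "in use by a class loader", bit == 0 means "free".
>>>>>   bool OccupancyMap::is_region_free(MetaWord* from, MetaWord* to) {
>>>>>     size_t bit = (from - _base) / _smallest_chunk_word_size;
>>>>>     size_t end = (to   - _base) / _smallest_chunk_word_size;
>>>>>     for (; bit < end; bit++) {
>>>>>       if (get_bit(bit)) {
>>>>>         return false;   // at least one chunk in the range is in use
>>>>>       }
>>>>>     }
>>>>>     return true;        // the whole range is free, merging is possible
>>>>>   }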
>>>>>
>>>>> Only if the range is free do we need to iterate the chunks in that
>>>>> range: pull them from the freelist, combine them into one new larger
>>>>> chunk, and re-add that one to the freelist.
>>>>>
>>>>> (*) Humongous chunks make this a bit more complicated. Since the new
>>>>> alignment rule does not extend to them, a humongous chunk could still
>>>>> straddle the lower or upper boundary of the prospective larger chunk.
>>>>> So I gave the occupancy map a second layer, which is used to mark the
>>>>> start of chunks.
>>>>>
>>>>> An alternative approach could have been to make humongous chunk size
>>>>> and start address always a multiple of the largest non-humongous chunk
>>>>> size (medium chunks). That would have caused a bit of waste per
>>>>> humongous chunk (<64K) in exchange for simpler coding and a simpler
>>>>> occupancy map.
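>>>>>
>>>>> (The second map layer could then be queried along these lines - again
>>>>> only a sketch with made-up names:)
>>>>>
>>>>>   // Layer two: one bit per smallest-chunk-sized range, set only for
>>>>>   // the range at which a chunk starts. This lets us detect chunks
>>>>>   // (e.g. humongous ones) that start outside, but reach into, a
>>>>>   // candidate merge region.
>>>>>   bool OccupancyMap::is_chunk_start(MetaWord* p) {
>>>>>     return get_start_bit((p - _base) / _smallest_chunk_word_size);
>>>>>   }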
>>>>>
>>>>> --
>>>>>
>>>>> The patch shows its best results in scenarios where a lot of smallish
>>>>> class loaders are alive simultaneously. When dying, they leave
>>>>> continuous expanses of metaspace covered in small chunks, which can be
>>>>> merged nicely. However, if class loader lifetimes vary more, we have
>>>>> more interleaving of dead and live small chunks, and hence chunk
>>>>> merging does not work as well as it could.
>>>>>
>>>>> For an example of a pathological case like this, see the example
>>>>> program at [5].
>>>>>
>>>>> Executed like this: "java -XX:CompressedClassSpaceSize=10M -cp test3
>>>>> test3.Example2", the test will load 3000 small classes in separate
>>>>> class loaders, then throw them away and start loading large classes.
>>>>> The small classes will have flooded the metaspace with small chunks,
>>>>> which are unusable for the large classes. When executing with the
>>>>> rather limited CompressedClassSpaceSize=10M, we will run into an OOM
>>>>> after loading about 800 large classes, having used only 40% of the
>>>>> class space; the rest is wasted on unused small chunks. However, with
>>>>> our patch the example program will manage to allocate ~2900 large
>>>>> classes before running into an OOM, and the class space will show
>>>>> almost no waste.
>>>>>
>>>>> To demonstrate this, add -Xlog:gc+metaspace+freelist. After running
>>>>> into an OOM, statistics and an ASCII representation of the class space
>>>>> will be shown. The unpatched version will show large expanses of
>>>>> unused small chunks; the patched variant will show almost no waste.
>>>>>
>>>>> Note that the patch could be made more effective with a different size
>>>>> ratio between small and medium chunks: in class space, that ratio is
>>>>> 1:16,
>>>>> so 16 small chunks must happen to be free to form one larger chunk.
>>>>> With a
>>>>> smaller ratio the chance for coalescation would be larger. So there
>>>>> may be
>>>>> room for future improvement here: Since we now can merge and split
>>>>> chunks
>>>>> on demand, we could introduce more chunk sizes. Potentially arriving
>>>>> at a
>>>>> buddy-ish allocator style where we drop hard-wired chunk sizes for a
>>>>> dynamic model where the ratio between chunk sizes is always 1:2 and we
>>>>> could in theory have no limit to the chunk size? But this is just a
>>>>> thought
>>>>> and well out of the scope of this patch.
>>>>>
>>>>> --
>>>>>
>>>>> What does this patch cost (memory):
>>>>>
>>>>> - The occupancy bitmap adds 1 byte per 4K of metaspace.
>>>>> - MetaChunk headers get larger, since we add an enum and two bools to
>>>>> them. Depending on what the C++ compiler does with that, chunk headers
>>>>> grow by one or two MetaWords, reducing the payload size by that amount.
>>>>> - The new alignment rules mean we may need to create padding chunks to
>>>>> precede larger chunks. But since these padding chunks are added to the
>>>>> freelist, they should be used up before the need for new padding chunks
>>>>> arises. So, the maximum possible number of unused padding chunks should
>>>>> be limited by design to about 64K.
>>>>>
>>>>> The expectation is that the memory savings of this patch far outweigh
>>>>> its added memory costs.
>>>>>
>>>>> .. (performance):
>>>>>
>>>>> We did not see measurable drops in standard benchmarks rising above the
>>>>> normal noise. I also measured times for a program which stresses
>>>>> metaspace chunk coalescation, with the same result.
>>>>>
>>>>> I am open to suggestions on what else I should measure, and/or to
>>>>> independent measurements.
>>>>>
>>>>> --
>>>>>
>>>>> Other details:
>>>>>
>>>>> I removed SpaceManager::get_small_chunk_and_allocate() to reduce
>>>>> complexity somewhat, because it was made mostly obsolete by this patch:
>>>>> since small chunks are combined into larger chunks upon return to the
>>>>> freelist, in theory we should not have that many free small chunks
>>>>> anymore anyway. However, there may still be cases where we could
>>>>> benefit from this workaround, so I am asking for your opinion on this
>>>>> one.
>>>>>
>>>>> About tests: There were two native tests - ChunkManagerReturnTest and
>>>>> TestVirtualSpaceNode (the former was added by me last year) - which
>>>>> did not make much sense anymore, since they relied heavily on internal
>>>>> behavior which was made unpredictable by this patch.
>>>>> To make up for these lost tests, I added a new gtest which attempts to
>>>>> stress the many combinations of allocation patterns, but does so from
>>>>> a layer above the old tests. It now uses Metaspace::allocate() and
>>>>> friends. By using that point as the entry for tests, I am less
>>>>> dependent on implementation internals and still cover a lot of
>>>>> scenarios.
>>>>>
>>>>> --
>>>>>
>>>>> Review pointers:
>>>>>
>>>>> Good points to start are:
>>>>> - ChunkManager::return_single_chunk() - specifically,
>>>>> ChunkManager::attempt_to_coalesce_around_chunk() - here we merge
>>>>> chunks upon return to the free list
>>>>> - ChunkManager::free_chunks_get(): here we now split large chunks into
>>>>> smaller chunks on demand
>>>>> - VirtualSpaceNode::take_from_committed(): chunks are allocated
>>>>> according to the alignment rules now, and padding chunks are handled
>>>>> - The OccupancyMap class is the helper class implementing the new
>>>>> occupancy bitmap
>>>>>
>>>>> The rest is mostly chaff: helper functions, added tests and
>>>>> verifications.
>>>>>
>>>>> --
>>>>>
>>>>> Thanks and Best Regards, Thomas
>>>>>
>>>>> [1] https://bugs.openjdk.java.net/browse/JDK-8166690
>>>>> [2] http://mail.openjdk.java.net/pipermail/jdk-dev/2017-November/000128.html
>>>>> [3] https://bugs.openjdk.java.net/browse/JDK-8185034
>>>>> [4] https://bugs.openjdk.java.net/browse/JDK-8176808
>>>>> [5] https://bugs.openjdk.java.net/secure/attachment/63532/test3.zip
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>>
>
>

