RFR(L): 8198423: Improve metaspace chunk allocation (was: Proposal for improvements to the metaspace chunk allocator)
coleen.phillimore at oracle.com
Tue Mar 6 18:00:05 UTC 2018
On 3/6/18 1:10 AM, Thomas Stüfe wrote:
> Hi Coleen,
>
> We test nightly on Windows 32-bit. I'll go and run some tests on 32-bit
> Linux too.
>
> Thanks for the sponsoring offer! Goetz already reviewed this patch,
> would that be sufficient or should I look for another reviewer from
> Oracle?
That's sufficient. I'll sponsor the patch for you. Can you update the
patch in the webrev?
thanks,
Coleen
>
> Kind Regards, Thomas
>
>
> On Tue, Mar 6, 2018 at 12:59 AM, <coleen.phillimore at oracle.com> wrote:
>
>
> Hi Thomas,
>
> I've read through the new code. I don't have any substantive
> comments. Thank you for adding the functions.
>
> Has this been tested on any 32 bit platforms? I will sponsor
> this when you have another reviewer.
>
> Thanks for taking on the metaspace!
>
> Coleen
>
>
> On 3/1/18 5:36 AM, Thomas Stüfe wrote:
>> Hi Coleen,
>>
>> thanks a lot for the review and the sponsoring offer!
>>
>> New version (full):
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-03-01/webrev-full/webrev/
>> incremental:
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-03-01/webrev-incr/webrev/
>>
>> Please find remarks inline:
>>
>>
>> On Tue, Feb 27, 2018 at 11:22 PM, <coleen.phillimore at oracle.com> wrote:
>>
>>
>> Thomas, review comments:
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev/src/hotspot/share/memory/metachunk.hpp.udiff.html
>>
>> +// ChunkIndex (todo: rename?) defines the type of chunk.
>> Chunk types
>>
>>
>> It's really both, isn't it? The type is the index into the
>> free list or in use lists. The name seems fine.
>>
>>
>> You are right. What I meant was that a lot of code needs to know
>> about the different chunk sizes, but by naming it "Index" and
>> adding enum values like "NumberOfFreeLists" we expose
>> implementation details that no-one outside of SpaceManager and
>> ChunkManager cares about (namely, the fact that these values are
>> internally used as indices into arrays). A more neutral naming
>> would be something like "enum ChunkTypes { spec, small, ...,
>> NumberOfNonHumongousChunkTypes, NumberOfChunkTypes }".
>>
>> However, I can leave this out for a possible future cleanup. The
>> change is big enough as it is.
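>>
>> Just to illustrate what I mean (a sketch only, not part of the
>> patch; names are made up):
>>
>>   // Chunk types, ordered by size. The last entries are counts,
>>   // not real chunk types.
>>   enum ChunkTypes {
>>     SpecializedChunkType = 0,
>>     SmallChunkType,
>>     MediumChunkType,
>>     NumberOfNonHumongousChunkTypes,                      // = 3
>>     HumongousChunkType = NumberOfNonHumongousChunkTypes, // = 3
>>     NumberOfChunkTypes                                   // = 4
>>   };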
>>
>> Can you add comments on the #endifs if the matching #ifdef is
>> more than 2-3 lines above? (It's a nit that bothers me.)
>>
>> +#ifdef ASSERT
>> + // A 32bit sentinel for debugging purposes.
>> +#define CHUNK_SENTINEL 0x4d4554EF // "MET"
>> +#define CHUNK_SENTINEL_INVALID 0xFEEEEEEF
>> + uint32_t _sentinel;
>> +#endif
>> + const ChunkIndex _chunk_type;
>> + const bool _is_class;
>> + // Whether the chunk is free (in freelist) or in use by
>> some class loader.
>> bool _is_tagged_free;
>> +#ifdef ASSERT
>> + ChunkOrigin _origin;
>> + int _use_count;
>> +#endif
>> +
>>
>>
>> I removed the asserts completely, following your suggestion below
>> that "origin" would be valuable in customer scenarios too. By
>> that logic, the other members are valuable too: the sentinel is
>> valuable when examining memory dumps to see the start of chunks,
>> and the in-use counter is useful too. What do you think?
>>
>> So, I leave the members in - which, depending on what the C++
>> compiler does with enums and bools, may cost up to 128 bits of
>> additional header space. I think that is ok. In one of my earlier
>> versions of this patch I hand-crafted the header using chars and
>> bitfields to be as small as possible, but that seemed
>> over-engineered.
>>
>> However, I left out any automatic verifications accessing these
>> debug members. These are still only done in debug builds.
>>
>>
>> It seems that you could move _origin and _use_count into
>> the ASSERT block above (maybe putting _use_count before _origin).
>>
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev/src/hotspot/share/memory/metaspace.cpp.udiff.html
>>
>> In take_from_committed, can the allocation of padding chunks
>> be its own function, like add_chunks_to_alignment(), for lines
>> 1574-1615? The function is too long now.
>>
>>
>> I moved the padding chunk allocation into an own function as you
>> suggested.
>>
>> I don't think coalescation is a word in English, at least my
>> dictionary cannot find it. Although it makes sense in the
>> context, just distracting.
>>
>>
>> I replaced "coalescation" with "chunk merging" throughout the
>> code. Also less of a tongue twister.
>>
>> + // Now check if in the coalescation area there are still
>> life chunks.
>>
>>
>> "live" chunks I guess. A sentence you won't read often :).
>>
>>
>> Now that I read it it almost sounded sinister :) Fixed.
>>
>>
>> In free_chunks_get() can you handle the Humongous case first?
>> The else for humongous chunk size is buried tons of lines below.
>>
>> Otherwise it might be helpful to the logic to make your
>> addition to this function be a function you call like
>> chunk = split_from_larger_free_chunk();
>>
>>
>> I did the latter. I moved the splitting of a larger chunk to an
>> own function. This causes a slight logic change: the new function
>> (ChunkManager::split_chunk()) splits an existing large free
>> chunk into n smaller free chunks and adds them all back to the
>> freelist - that includes the chunk we are about to return. That
>> allows us to use the same exit path - which removes the chunk
>> from the freelist and adjusts all counters - in the caller
>> function "ChunkManager::free_chunks_get" instead of having to
>> return in the middle of the function.
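>>
>> Roughly, the resulting control flow looks like this (heavily
>> simplified pseudo code, not the actual patch; helper names and
>> signatures are made up):
>>
>>   Metachunk* ChunkManager::free_chunks_get(size_t word_size) {
>>     Metachunk* chunk = find_free_chunk_of_size(word_size);
>>     if (chunk == NULL) {
>>       Metachunk* larger = find_larger_free_chunk(word_size);
>>       if (larger != NULL) {
>>         // Split "larger" into n smaller free chunks - one of them
>>         // of the requested size - and put all of them back into
>>         // the freelist.
>>         split_chunk(word_size, larger);
>>         chunk = find_free_chunk_of_size(word_size);
>>       }
>>     }
>>     if (chunk != NULL) {
>>       // Common exit path: unlink from the freelist, adjust counters.
>>       remove_chunk_from_freelist(chunk);
>>     }
>>     return chunk;
>>   }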
>>
>> To make it more readable, I also moved the
>> "test-that-free-chunks-are-optimally-merged" verification - which
>> was quite lengthy - out of VirtualSpaceNode::verify() into a new
>> function, VirtualSpaceNode::verify_free_chunks_are_ideally_merged().
>>
>>
>> You might want to keep the origin in product mode if it
>> doesn't add to the chunk footprint. Might help with
>> customer debugging.
>>
>>
>> See above
>>
>> Awesome looking test...
>>
>>
>> Thanks, I was worried it would be too complicated.
>> I changed it a bit because there were sporadic errors. Not a
>> "real" error, just the test itself was faulty. The
>> "metaspaces_in_use" counter was slightly wrong in one corner case.
>>
>> I've read through most of this, and thank you for adding this
>> to at least partially solve the fragmentation problem. The
>> irony is that we templatized the Dictionary from CMS so that
>> we could use it for Metaspace, and that has splitting and
>> coalescing, but it seems this code makes more sense than
>> adapting that code (if it's even possible).
>>
>>
>> Well, it helps other metadata use cases too, no?
>>
>>
>> Thank you for working on this. I'll sponsor this for you.
>>
>> Coleen
>>
>>
>>
>> Thanks again!
>>
>> I also updated my jdk-submit branch to include these latest
>> changes; tests are still running.
>>
>> Kind Regards, Thomas
>>
>>
>> On 2/26/18 9:20 AM, Thomas Stüfe wrote:
>>
>> Hi all,
>>
>> I know this patch is a bit larger, but may I please have
>> reviews and/or
>> other input?
>>
>> Issue: https://bugs.openjdk.java.net/browse/JDK-8198423
>> Latest version:
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev/
>>
>> For those who followed the mail thread, this is the
>> incremental diff to the last changes (it includes feedback
>> Goetz gave me on- and off-list):
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-26/webrev-incr/webrev/
>>
>> Thank you!
>>
>> Kind Regards, Thomas Stuefe
>>
>>
>>
>> On Thu, Feb 8, 2018 at 12:58 PM, Thomas Stüfe
>> <thomas.stuefe at gmail.com> wrote:
>>
>> Hi,
>>
>> We would like to contribute a patch developed at SAP
>> which has been live
>> in our VM for some time. It improves the metaspace
>> chunk allocation:
>> reduces fragmentation and raises the chance of
>> reusing free metaspace
>> chunks.
>>
>> The patch:
>> http://cr.openjdk.java.net/~stuefe/webrevs/metaspace-coalescation/2018-02-05--2/webrev/
>>
>> In short, this patch helps with a number of
>> pathological cases where
>> metaspace chunks are free but cannot be reused
>> because they are of the
>> wrong size. For example, the metaspace freelist could
>> be full of small
>> chunks, which would not be reusable if we need larger
>> chunks. So, we could
>> get metaspace OOMs even in situations where the
>> metaspace was far from
>> exhausted. Our patch adds the ability to split and
>> merge metaspace chunks
>> dynamically and thus remove the "size-lock-in" problem.
>>
>> Note that there have been other attempts to get a
>> grip on this problem,
>> see e.g.
>> "SpaceManager::get_small_chunks_and_allocate()". But
>> arguably
>> our patch attempts a more complete solution.
>>
>> In 2016 I discussed the idea for this patch with some
>> folks off-list, among them Jon Masamitsu. He then advised me to
>> create a JEP. So I did:
>> [1]. However, meanwhile changes to the JEP process
>> were discussed [2], and
>> I am not sure anymore this patch even needs a
>> JEP. It may be
>> moderately complex and hence carries the risk
>> inherent in any patch, but
>> its effects would not be externally visible (if you
>> discount seeing fewer
>> metaspace OOMs). So, I'd prefer to handle this as a
>> simple RFE.
>>
>> --
>>
>> How this patch works:
>>
>> 1) When a class loader dies, its metaspace chunks are
>> freed and returned
>> to the freelist for reuse by the next class loader.
>> With the patch, upon
>> returning a chunk to the freelist, an attempt is made
>> to merge it with its
>> neighboring chunks - should they happen to be free
>> too - to form a larger
>> chunk, which is then placed in the free list.
>>
>> As a result, the freelist should be populated by
>> larger chunks at the
>> expense of smaller chunks. In other words, all free
>> chunks should always be
>> as "coalesced as possible".
>>
>> 2) When a class loader needs a new chunk and a chunk
>> of the requested size
>> cannot be found in the free list, before carving out
>> a new chunk from the
>> virtual space, we first check if there is a larger
>> chunk in the free list.
>> If there is, that larger chunk is chopped up into n
>> smaller chunks. One of
>> them is returned to the caller, the others are
>> re-added to the freelist.
>>
>> (1) and (2) together have the effect of removing the
>> size-lock-in for
>> chunks. If fragmentation allows it, small chunks are
>> dynamically combined
>> to form larger chunks, and larger chunks are split on
>> demand.
>>
>> --
>>
>> What this patch does not:
>>
>> This is not a rewrite of the chunk allocator - most
>> of the mechanisms stay
>> intact. Specifically, chunk sizes remain unchanged,
>> and so do chunk
>> allocation processes (which class loaders get
>> handed which chunk
>> size, and when). Almost everything this patch does affects only
>> internal workings of
>> the ChunkManager.
>>
>> Also note that I refrained from doing any cleanups,
>> since I wanted
>> reviewers to be able to gauge this patch without
>> filtering noise.
>> Unfortunately this patch adds some complexity. But
>> there are many future
>> opportunities for code cleanup and simplification,
>> some of which we already
>> discussed in existing RFEs ([3], [4]). All of them
>> are out of the scope for
>> this particular patch.
>>
>> --
>>
>> Details:
>>
>> Before the patch, the following rules held:
>> - All chunk sizes are multiples of the smallest chunk
>> size ("specialized
>> chunks")
>> - All chunk sizes of larger chunks are also clean
>> multiples of the next
>> smaller chunk size (e.g. for class space, the ratio of
>> specialized/small/medium chunks is 1:2:32)
>> - All chunk start addresses are aligned to the
>> smallest chunk size (more
>> or less accidentally, see metaspace_reserve_alignment).
>> The patch makes the last rule explicit and more strict:
>> - All (non-humongous) chunk start addresses are now
>> aligned to their own
>> chunk size. So, e.g. medium chunks are allocated at
>> addresses which are a
>> multiple of medium chunk size. This rule is not
>> extended to humongous
>> chunks, whose start addresses continue to be aligned
>> to the smallest chunk
>> size.
>>
>> The reason for this new alignment rule is that it
>> makes it cheap both to
>> find chunk predecessors of a chunk and to check which
>> chunks are free.
>>
>> When a class loader dies and its chunk is returned to
>> the freelist, all we
>> have is its address. In order to merge it with its
>> neighbors to form a
>> larger chunk, we need to find those neighbors,
>> including those preceding
>> the returned chunk. Prior to this patch that was not
>> easy - one would have
>> to iterate chunks starting at the beginning of the
>> VirtualSpaceNode. But
>> due to the new alignment rule, we now know where the
>> prospective larger
>> chunk must start - at the next lower
>> larger-chunk-size-aligned boundary. We
>> also know that currently a smaller chunk must start
>> there (*).
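>>
>> (In code, finding that prospective start is a cheap alignment
>> operation - illustrative sketch only, assuming power-of-two
>> chunk byte sizes:)
>>
>>   // Start address of the larger chunk which would contain p,
>>   // given that chunk starts are aligned to their own chunk size.
>>   static MetaWord* prospective_larger_chunk_start(MetaWord* p,
>>                                                   size_t larger_chunk_byte_size) {
>>     const uintptr_t mask = (uintptr_t)larger_chunk_byte_size - 1;
>>     return (MetaWord*)((uintptr_t)p & ~mask);
>>   }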
>>
>> In order to check the free-ness of chunks quickly,
>> each VirtualSpaceNode
>> now keeps a bitmap which describes its occupancy. One
>> bit in this bitmap
>> corresponds to a range the size of the smallest chunk
>> size and starting at
>> an address aligned to the smallest chunk size.
>> Because of the alignment
>> rules above, such a range belongs to one single
>> chunk. The bit is 1 if the
>> associated chunk is in use by a class loader, 0 if it
>> is free.
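>>
>> (Conceptually the lookup is just this - a sketch, the member
>> names are made up:)
>>
>>   // One bit per smallest-chunk-sized region of the node.
>>   // Bit set: region belongs to a chunk in use; clear: free.
>>   bool OccupancyMap::is_in_use(const MetaWord* p) const {
>>     const size_t idx = (size_t)(p - _node_base) / _smallest_chunk_word_size;
>>     return (_map[idx / 8] & (1u << (idx % 8))) != 0;
>>   }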
>>
>> When we have calculated the address range a
>> prospective larger chunk would
>> span, we now need to check if all chunks in that
>> range are free. Only then
>> we can merge them. We do that by querying the bitmap.
>> Note that the most
>> common use case here is forming medium chunks from
>> smaller chunks. With the
>> new alignment rules, the bitmap portion covering a
>> medium chunk now always
>> happens to be 16 or 32 bits in size and is 16- or
>> 32-bit aligned, so reading
>> the bitmap in many cases becomes a simple 16- or
>> 32-bit load.
>>
>> Only if the range is free do we need to iterate
>> the chunks in that
>> range: pull them from the freelist, combine them into
>> one new larger chunk, and
>> re-add that one to the freelist.
>>
>> (*) Humongous chunks make this a bit more
>> complicated. Since the new
>> alignment rule does not extend to them, a humongous
>> chunk could still
>> straddle the lower or upper boundary of the
>> prospective larger chunk. So I
>> gave the occupancy map a second layer, which is used
>> to mark the start of
>> chunks.
>> An alternative approach could have been to make
>> humongous chunk size and
>> start address always a multiple of the largest
>> non-humongous chunk size
>> (medium chunks). That would have caused a bit of
>> waste per humongous chunk
>> (<64K) in exchange for simpler coding and a simpler
>> occupancy map.
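>>
>> So, per smallest-chunk-sized region the map now answers two
>> questions (a sketch of the interface only):
>>
>>   // Layer 1: is this region part of a chunk in use?
>>   bool is_in_use(const MetaWord* p) const;
>>   // Layer 2: does a chunk start exactly at this region? Needed
>>   // to detect humongous chunks straddling the boundaries of the
>>   // prospective larger chunk.
>>   bool is_chunk_start(const MetaWord* p) const;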
>>
>> --
>>
>> The patch shows its best results in scenarios where a
>> lot of smallish
>> class loaders are alive simultaneously. When dying,
>> they leave contiguous
>> expanses of metaspace covered in small chunks, which
>> can be merged nicely.
>> However, if class loader life times vary more, we
>> have more interleaving of
>> dead and alive small chunks, and hence chunk merging
>> does not work as well
>> as it could.
>>
>> For an example of a pathological case like this, see
>> the example program: [5]
>>
>> Executed like this: "java
>> -XX:CompressedClassSpaceSize=10M -cp test3
>> test3.Example2" the test will load 3000 small classes
>> in separate class
>> loaders, then throw them away and start loading large
>> classes. The small
>> classes will have flooded the metaspace with small
>> chunks, which are
>> unusable for the large classes. When executing with
>> the rather limited
>> CompressedClassSpaceSize=10M, we will run into an OOM
>> after loading about
>> 800 large classes, having used only 40% of the class
>> space; the rest is
>> wasted on unused small chunks. However, with our
>> patch the example program
>> will manage to allocate ~2900 large classes before
>> running into an OOM, and
>> class space will show almost no waste.
>>
>> To demonstrate this, add -Xlog:gc+metaspace+freelist.
>> After running into
>> an OOM, statistics and an ASCII representation of the
>> class space will be
>> shown. The unpatched version will show large expanses
>> of unused small
>> chunks; the patched variant will show almost no waste.
>>
>> Note that the patch could be made more effective with
>> a different size
>> ratio between small and medium chunks: in class
>> space, that ratio is 1:16,
>> so 16 small chunks must happen to be free to form one
>> larger chunk. With a
>> smaller ratio the chance for coalescation would be
>> larger. So there may be
>> room for future improvement here: Since we now can
>> merge and split chunks
>> on demand, we could introduce more chunk sizes.
>> Potentially arriving at a
>> buddy-ish allocator style where we drop hard-wired
>> chunk sizes for a
>> dynamic model where the ratio between chunk sizes is
>> always 1:2 and we
>> could in theory have no limit to the chunk size? But
>> this is just a thought
>> and well out of the scope of this patch.
>>
>> --
>>
>> What does this patch cost (memory):
>>
>> - the occupancy bitmap adds 1 byte per 4K metaspace.
>> - Metachunk headers get larger, since we add an
>> enum and two bools to them.
>> Depending on what the C++ compiler does with that,
>> chunk headers grow by
>> one or two MetaWords, reducing the payload size by
>> that amount.
>> - The new alignment rules mean we may need to create
>> padding chunks to
>> precede larger chunks. But since these padding chunks
>> are added to the
>> freelist, they should be used up before the need for
>> new padding chunks
>> arises. So, the maximum possible amount of memory tied up in
>> unused padding chunks should
>> be limited by design to about 64K (see the sketch below).
>>
>> The expectation is that the memory savings by this
>> patch far outweigh its
>> added memory costs.
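>>
>> (For the padding chunks mentioned above, the idea when carving a
>> new chunk from the virtual space is roughly this - a simplified
>> sketch, not the actual code; the helper names are made up:)
>>
>>   // Fill the gap between the current allocation top and the next
>>   // address aligned to the requested chunk size with padding
>>   // chunks of the smallest chunk size, and give them to the
>>   // freelist so they can be reused like any other free chunk.
>>   const size_t a = chunk_word_size * BytesPerWord;  // power of 2
>>   MetaWord* const aligned_top =
>>       (MetaWord*)(((uintptr_t)_top + a - 1) & ~(uintptr_t)(a - 1));
>>   while (_top < aligned_top) {
>>     Metachunk* pad = create_padding_chunk_at(_top);  // smallest chunk size
>>     return_chunk_to_freelist(pad);
>>     _top += smallest_chunk_word_size;
>>   }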
>>
>> .. (performance):
>>
>> We did not see measurable drops in standard
>> benchmarks rising above the
>> normal noise. I also measured times for a program
>> which stresses metaspace
>> chunk coalescation, with the same result.
>>
>> I am open to suggestions on what else I should measure,
>> and/or to independent
>> measurements.
>>
>> --
>>
>> Other details:
>>
>> I removed
>> SpaceManager::get_small_chunk_and_allocate() to reduce
>> complexity somewhat, because it was made mostly
>> obsolete by this patch:
>> since small chunks are combined into larger chunks upon
>> return to the
>> freelist, in theory we should not have that many free
>> small chunks anymore
>> anyway. However, there may still be cases where we
>> could benefit from this
>> workaround, so I am asking your opinion on this one.
>>
>> About tests: There were two native tests -
>> ChunkManagerReturnTest and
>> TestVirtualSpaceNode (the former was added by me last
>> year) - which did not
>> make much sense anymore, since they relied heavily on
>> internal behavior
>> which was made unpredictable with this patch.
>> To make up for these lost tests, I added a new gtest
>> which attempts to
>> stress the many combinations of allocation patterns
>> but does so from a layer
>> above the old tests. It now uses
>> Metaspace::allocate() and friends. By
>> using that as the entry point for tests, I am less
>> dependent on implementation
>> internals and still cover a lot of scenarios.
>>
>> --
>>
>> Review pointers:
>>
>> Good points to start are
>> - ChunkManager::return_single_chunk() - specifically,
>> ChunkManager::attempt_to_coalesce_around_chunk() -
>> here we merge chunks
>> upon return to the free list
>> - ChunkManager::free_chunks_get(): Here we now split
>> large chunks into
>> smaller chunks on demand
>> - VirtualSpaceNode::take_from_committed(): chunks
>> are allocated
>> according to the alignment rules now, and padding chunks are handled here
>> - The OccupancyMap class is the helper class
>> implementing the new
>> occupancy bitmap
>>
>> The rest is mostly chaff: helper functions, added
>> tests and verifications.
>>
>> --
>>
>> Thanks and Best Regards, Thomas
>>
>> [1] https://bugs.openjdk.java.net/browse/JDK-8166690
>> [2] http://mail.openjdk.java.net/pipermail/jdk-dev/2017-November/000128.html
>> [3] https://bugs.openjdk.java.net/browse/JDK-8185034
>> [4] https://bugs.openjdk.java.net/browse/JDK-8176808
>> [5]
>> https://bugs.openjdk.java.net/secure/attachment/63532/test3.zip
>>
>>
>>
>>
>>
>
>