RFR (L): 8058354: SPECjvm2008-Derby -2.7% performance regression on Solaris-X64 starting with 9-b29

Jon Masamitsu jon.masamitsu at oracle.com
Mon Feb 2 21:21:23 UTC 2015


Thomas,

Thank you for the very detailed explanation.
You answered all my questions and more.

Jon

On 02/02/2015 08:58 AM, Thomas Schatzl wrote:
> Hi,
>
>    there have been multiple questions about this because my explanation
> was too confusing. Sorry. I will try to clear this up in the following.
>
> The main reason why there is the limitation of requiring large page
> alignment of the auxiliary data structures is that it simplifies
> management of their pages a lot, with little downside.
>
> Very long, sorry.
>
> On Fri, 2015-01-30 at 15:18 -0800, Jon Masamitsu wrote:
>> On 1/29/2015 2:30 AM, Thomas Schatzl wrote:
>>> ...
>>> AAAAAA AAAAAA AAAAAA  // heap area, each AAAAAA is a single region
>>>      |      |      |    // area covered by auxiliary pages
>>>         1       2       // auxiliary data pages
>> Are bitmaps an example of the auxiliary data?
> Prev/next bitmaps, BOT, card table, and card counts table (some helper
> structure for the hot card cache).
>
>> AAAAAA are 6 regions?
>>
> Yes, as an example.
>
>> And the preferred layout would be
>>
>>   AAAAAA AAAAAA AAAAAA  // heap area, each AAAAAA is a single region
>>
>> |      |      |      | // area covered by auxiliary pages
>>
>>      1       2          // auxiliary data pages
>>
>> so that (regions replaced by 0 have been uncommitted)
>>
>>   AA00AA 000000 A000AA  // heap area, each AAAAAA is a single region
>>
>> |      |      |      |    // area covered by auxiliary pages
>>
>>     1       2       // auxiliary data pages
>>
>>
>> then the page for 2 can be uncommitted.
> Yes.
>
>>> So if auxiliary data pages were unaligned, so that they correspond to
>>> uneven multiples of the heap, when uncommitting e.g. the second region
>>> (second set of AAAAAA), we would have to split the auxiliary data pages
>>> 1 and 2 into smaller ones.
>> You mean we would have to use small pages for the auxiliary data?
>> So we could uncommit the auxiliary data pages corresponding to the
>> heap uncommit more easily?
> This is option (1), which is bad performance-wise and caused the
> regression.
>
> ----------
>
> The other is option (2): track in detail which pages a particular
> region corresponds to (which might be a mix of small and large pages),
> and commit and uncommit them accordingly.
>
> Example that shows a mix of small/large pages used:
>
> (In the following I will use the BOT as a representative of the
> auxiliary data structures mentioned above. The problem is the same for
> all of them.)
>
>   1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3  // heap region layout
> |  S  |    L      |     L     |       // area covered by BOT page below
>    b1       b2          b3             // BOT page number
>
> I.e. the first row shows heap regions; the same number indicates the
> same region, so the first six "1"s represent the memory of region 1.
> The second row shows how the pages backing e.g. the BOT map to heap
> regions. In our setup there is one small page (S) followed by a
> sequence of large pages. (For simplicity I made small pages exactly
> half the size of large ones.)
> The third row is the BOT page number. E.g. page b1 is small, covering
> half of region 1; page b2 covers half of region 1 and half of region 2,
> and so on.
>
> Now, if I uncommit region 1, I can only uncommit b1, and need to
> remember that half of b2 is also unused.
>
> Then, when trying to uncommit b2, you need to check not only whether
> region 2 is uncommitted, but also whether b2 might still be in use for
> region 1 or 3. (Btw, this might lead to situations where we cannot
> uncommit anything at all, as there might always be something holding on
> to the BOT pages if only every other region is committed.)
>
> At the same time, when uncommitting b3, this might also cascade into b2
> and any following b4, i.e. we may need to uncommit those too.
>
> The reverse situation, committing new regions, adds complexity as
> well.
>
> E.g. assume that everything is uncommitted, and we want to commit
> region 2. Now I need to make sure to commit b2 and b3. In the same way,
> if you then want to commit region 3, you must make sure that you do not
> commit b3 again.
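>
> To make the bookkeeping concrete, here is a minimal sketch of what
> option (2) would need: per-page use counters (hypothetical names and
> structure, not the actual HotSpot code):
>
> #include <cstddef>
> #include <vector>
>
> // Hypothetical tracker for the pages backing the BOT. Each backing page
> // gets a use counter; a page may only be uncommitted once no committed
> // heap region maps onto it any more.
> struct BotPageTracker {
>   std::vector<int> use_count;   // one counter per BOT backing page
>
>   // Region gets committed: commit every BOT page it touches whose
>   // counter goes from 0 to 1.
>   void on_region_commit(size_t first_page, size_t last_page) {
>     for (size_t p = first_page; p <= last_page; p++) {
>       if (use_count[p]++ == 0) {
>         // commit_backing_page(p);   // hypothetical OS call
>       }
>     }
>   }
>
>   // Region gets uncommitted: a BOT page can only go away when its
>   // counter drops to zero, i.e. no neighbouring region still needs it.
>   void on_region_uncommit(size_t first_page, size_t last_page) {
>     for (size_t p = first_page; p <= last_page; p++) {
>       if (--use_count[p] == 0) {
>         // uncommit_backing_page(p); // hypothetical OS call
>       }
>     }
>   }
> };
>
> Even this simple counting scheme needs an extra lookup to find
> first_page/last_page for a region when page sizes are mixed, which is
> exactly the complexity the aligned layout avoids.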
>
> ----------
>
> Another option (3) would be to always use small pages in the border
> areas between heap regions, so that no single BOT page corresponds to
> more than one heap region.
>
> Still, you need to do some address calculations, splitting all regions
> into left/middle/right areas that need to be (un-)committed separately.
>
> That increases the number of required TLB entries by a factor of about
> 33 compared to being aligned (32M region size: 15 large pages + 512
> small pages per region vs. 16 large pages). Much worse for smaller
> region sizes.
>
> (e.g.
> 4M region size, 1 large + 512 small vs. 2 large pages, factor 256
> 2M region size, 512 small vs. 1 large, factor 512
> 1M region size, 256 small vs. 1/2 large)
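>
> The factors above are plain arithmetic; a small standalone check
> (assuming the x86 defaults of 2M large and 4K small pages):
>
> #include <cstdio>
>
> int main() {
>   const long K = 1024, M = K * K;
>   const long large_page = 2 * M, small_page = 4 * K;   // x86 page sizes
>   const long region_sizes[] = { 32 * M, 4 * M, 2 * M };
>
>   for (long region : region_sizes) {
>     long aligned = region / large_page;                 // all large pages
>     // Border splitting: one large page worth of each region is backed
>     // by small pages instead, the rest stays large.
>     long split = (region / large_page - 1) + large_page / small_page;
>     printf("%ldM region: %ld vs. %ld TLB entries, factor ~%.0f\n",
>            region / M, split, aligned, (double)split / aligned);
>   }
>   return 0;
> }
> // Prints factors of roughly 33, 256 and 512 for 32M, 4M and 2M regions.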
>
> I do not think option (3) is a good trade-off. It would particularly
> hurt small heaps (e.g. 1 or 2M region size, e.g. on 32 bit systems),
> making large pages basically useless there, because due to the
> splitting we would only use small pages anyway, as shown above. On
> large heaps (e.g. on 64 bit systems) it should not matter to waste at
> most one large page of committed and reserved space.
>
> Now we could play tricks like splitting/merging pages on demand, but
> that is even more complicated for no particular gain imo.
>
> ----------
>
> Option (4): Align the reserved space to a multiple of the large page
> size, i.e. what is implemented right now.
>
> Examples for this case:
>
>   1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3  // heap region layout
> |     L     |    L      |     L     | // area covered by BOT page below
>        b1         b2          b3       // BOT page number
>
>
> In this case the operations on the auxiliary data are simple, but the
> other cases are simple too:
>
> One BOT page corresponds to multiple regions:
>
>   1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 // heap region layout
> |     L     |    L      |     L     |  // area covered by BOT page below
>        b1         b2          b3        // BOT page number
>
> or if one region corresponds to multiple BOT pages:
>
>   1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2... // heap region layout
> |     L     |    L      |     L     |  // area covered by BOT page below
>        b1         b2          b3        // BOT page number
>
> In both of these cases there is absolutely no need to do any sort of
> complicated calculations about what to commit or uncommit.
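>
> In code terms, with the aligned layout the region-to-page mapping is a
> fixed-ratio index calculation; a simplified sketch (hypothetical names,
> not the actual G1 mapper code):
>
> #include <cstddef>
>
> // With the BOT base aligned to the page size, the backing pages of a
> // heap region follow from plain division; no neighbour checks needed.
> struct AlignedBotMapper {
>   size_t bot_bytes_per_region;   // BOT bytes describing one region
>   size_t page_size;              // backing page size (large or small)
>
>   size_t first_page(size_t region_index) const {
>     return region_index * bot_bytes_per_region / page_size;
>   }
>   size_t last_page(size_t region_index) const {
>     return ((region_index + 1) * bot_bytes_per_region - 1) / page_size;
>   }
> };
>
> Whether one page covers several regions or one region needs several
> pages, the page range for a region never depends on what neighbouring
> regions are doing.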
>
> ------------
>
> Maybe there are more possibilities.
>
>>> That does not seem to be a good tradeoff in complexity, given that the
>>> waste is at most one large page in reserved space (and unfortunately,
>>> due to the Linux large page implementation also in actually used space).
>> The waste is 1 large page for each auxiliary data structure?
> Yes, at most.
>
> Here is a summary of the downsides of this approach of aligning the
> start of the auxiliary memory to the large page size (if large pages
> are enabled):
>
> - on Solaris (or anything else that does not use the Linux large page
> code): the _reserved_ space may be up to one large page too large (if
> large pages are used). The change never commits the additional reserved
> space.
>
> E.g.: we need a BOT of 2M+4k size, so we reserve a space of 4M, and only
> ever commit the 2M+4k.
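>
> In code this is just an align-up of the reservation size, not of the
> commit size (a sketch; align_up here is a local helper, not the HotSpot
> macro):
>
> #include <cstddef>
> #include <cstdio>
>
> // Round x up to the next multiple of alignment (a power of two).
> static size_t align_up(size_t x, size_t alignment) {
>   return (x + alignment - 1) & ~(alignment - 1);
> }
>
> int main() {
>   const size_t K = 1024, M = K * K;
>   const size_t large_page = 2 * M;
>   const size_t bot_size   = 2 * M + 4 * K;            // needed BOT size
>
>   size_t reserved  = align_up(bot_size, large_page);  // 4M reserved
>   size_t committed = bot_size;                        // only 2M+4K committed
>   printf("reserved %zuK, committed %zuK\n", reserved / K, committed / K);
>   return 0;
> }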
>
> I can see if I can optimize this a little; the problem is that when
> trying to reserve memory while requesting large pages, the code paths
> in the Linux code either have asserts that fail if the passed size is
> not aligned to the required alignment (UseHugeTLBFS), or just silently
> fall back to small pages (UseSHM).
>
> I did not want to have a Solaris-specific code path in the G1 code,
> and imo the larger reservation does not really matter.
>
> - on Linux:
>    - when SHM is used: we reserve and possibly commit up to a single
> large page too much (actually up to large page size minus the commit
> granularity page size).
> The code in G1CollectedHeap::create_aux_memory_mapper() tries to always
> get large pages by aligning up the requested size to large page size,
> because otherwise, if the requested size is not aligned to commit size,
> it falls back to the small pages code (i.e. ignores the request to get
> large page memory outright).
>
>    - when HugeTLBFS is used (default): we reserve and commit up to a
> single large page too much.
>
> There is no code path that actually returns memory whose address is
> aligned to large page granularity but whose size is not aligned to that
> granularity, without asserts failing.
>
> ---------------
>
> Worst case memory impact:
>
> The BOT (and the other similarly sized data structures) only tries to
> use large pages if the heap is >= 1G (on x86, large page size * 512).
> So only in the case of such large heaps do you waste almost a large
> page (times three for the three data structures) - on Solaris only
> reserved space, on Linux also committed space. I.e. about 0.6% of
> memory in total.
>
> The bitmaps only try to use large pages for heaps >= 128M. This means
> that if you allocate a 129M heap, you waste almost 2M per bitmap, i.e.
> an additional ~3% of heap for the two bitmaps.
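>
> The percentages are simple arithmetic (assuming 2M large pages; a
> standalone check, not HotSpot code):
>
> #include <cstdio>
>
> int main() {
>   const double K = 1024.0, M = K * K, G = M * K;
>   const double large_page = 2 * M;
>
>   // >= 1G heap: BOT, card table and card counts table each waste at
>   // most almost one large page of committed space (on Linux).
>   printf("three aux structures on a 1G heap: %.2f%%\n",
>          100.0 * 3 * large_page / (1 * G));
>
>   // 129M heap: the two marking bitmaps each waste almost one large page.
>   printf("two bitmaps on a 129M heap:        %.2f%%\n",
>          100.0 * 2 * large_page / (129 * M));
>   return 0;
> }
> // Prints roughly 0.6% and 3%, matching the figures above.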
>
> We might want to tune that by e.g. only using large pages if there are
> at least X full large pages to use. Currently this X is one (the same
> value as before).
>
> Also, the question is whether it is extremely useful to use large pages
> with a 128M heap. Not sure, I would guess it does not matter.
>
> Difference to previous behavior:
>
> Before that change of virtual space handling (<9b29), on Solaris, we
> reserved and committed only the exact amount of memory afaics.
> On Linux, we reserved and committed an exact number of large pages, or
> used small pages in fallback paths. That means, if the size of a
> particular data structure was not aligned to large page size, on Linux
> we always used small pages.
>
> This difference in behavior could be changed back if wanted, i.e. if the
> size we want to allocate is not aligned to large page size, use small
> pages (and not spend extra memory).
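>
> The old-behavior fallback would amount to a simple size check (a sketch
> of the decision, not the actual ReservedSpace code):
>
> #include <cstddef>
>
> // Pre-9b29 style: use large pages only if the requested size is an
> // exact multiple of the large page size, otherwise fall back to small
> // pages and avoid committing any extra memory.
> static size_t choose_page_size(size_t bytes, size_t large_page,
>                                size_t small_page, bool use_large_pages) {
>   if (use_large_pages && (bytes % large_page) == 0) {
>     return large_page;
>   }
>   return small_page;
> }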
>
> -------------------
>
> Performance impact analysis:
>
> - on Solaris: we are back to original performance at least.
>
> - on Linux: if large pages are available, G1 should be faster than
> before as G1 uses large pages more aggressively now.
>
> -------------------
>
> Summarizing, I think the alignment requirement does not have a lot of
> impact on committed memory while keeping the code reasonably simple
> (and maybe somebody knows why it is not possible to get a reservation
> whose address is large page aligned but whose size is not a multiple of
> the large page size).
>
> There is the problem that G1 uses slightly more reserved memory, i.e. an
> extra large page at the end of the heap at most. This does not seem to
> be a problem. On 64 bit systems reservations are not an issue. On 32 bit
> systems I do not consider this a problem either as the increase is small
> in absolute terms (at most one large page).
>
> The gain from implementing option (2) or (3) above seems too small to
> me, considering the downsides listed.
>
> I think the open question is whether and how to handle Linux better,
> i.e. either (on Linux only?) falling back to small pages for memory
> sizes that are not divisible by the large page size (the old behavior),
> or improving the Linux ReservedSpace code to allow "special" kind
> aligned reservations that use large pages at the beginning and small
> pages at the end (which are pre-committed as before).
>
> Thanks,
>    Thomas
>
>



