RFR (L): 8058354: SPECjvm2008-Derby -2.7% performance regression on Solaris-X64 starting with 9-b29
Thomas Schatzl
thomas.schatzl at oracle.com
Mon Feb 2 16:58:23 UTC 2015
Hi,
there have been multiple questions about this because my explanation
was too confusing. Sorry. I will try to clear this up in the following.
The main reason for requiring large page alignment of the auxiliary
data structures is that it simplifies management of the pages a lot,
with little downside.
Very long, sorry.
On Fri, 2015-01-30 at 15:18 -0800, Jon Masamitsu wrote:
> On 1/29/2015 2:30 AM, Thomas Schatzl wrote:
> > ...
> > AAAAAA AAAAAA AAAAAA    // heap area, each AAAAAA is a single region
> > |         |         |   // area covered by auxiliary pages
> >      1         2        // auxiliary data pages
>
> Are bitmaps an example of the auxiliary data?
Prev/next bitmaps, BOT, card table, and the card counts table (a helper
structure for the hot card cache).
> AAAAAA are 6 regions?
>
Yes, as an example.
> And the preferred layout would be
>
> AAAAAA AAAAAA AAAAAA    // heap area, each AAAAAA is a single region
> |      |      |     |   // area covered by auxiliary pages
>    1      2             // auxiliary data pages
>
> so that (regions replaced by 0 have been uncommitted)
>
> AA00AA 000000 A000AA    // heap area, each AAAAAA is a single region
> |      |      |     |   // area covered by auxiliary pages
>    1      2             // auxiliary data pages
>
>
> then the page for 2 can be uncommitted.
Yes.
> > So if auxiliary data pages were unaligned, so that they correspond to
> > uneven multiples of the heap, when uncommitting e.g. the second region
> > (second set of AAAAAA), we would have to split the auxiliary data pages
> > 1 and 2 into smaller ones.
>
> You mean we would have to use small pages for the auxiliary data?
> So we could uncommit the auxiliary data pages corresponding to the
> heap uncommit more easily?
This is option (1), which is bad performance-wise and caused the
regression.
----------
The other is option (2): track in detail which pages a particular
region corresponds to (which might be a mix of small and large pages),
and commit and uncommit them accordingly.
Example that shows a mix of small/large pages used:
(In the following I will use the BOT as representative of the auxiliary
data structures mentioned above. The problem is the same for all of
them.)
1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3    // heap region layout
|  S  |     L     |     L     | ...     // area covered by BOT page below
   b1       b2          b3              // BOT page number
The first row shows heap regions; the same number indicates the same
region, i.e. the first six "1"s represent the memory of region 1.
The second row shows how the pages backing e.g. the BOT map to heap
regions. In our setup there is one small page (S) followed by a
sequence of large pages. (For simplicity I made small pages exactly
half the size of large ones.)
The third row is the BOT page number. E.g. page b1 is small, covering
the first half of region 1; page b2 covers the second half of region 1
and the first half of region 2; and so on.
Now, if I uncommit region 1, I can only uncommit b1, and need to
remember that half of b2 is also unused.
Then, when trying to uncommit b2, you not only need to check whether
region 2 is uncommitted, but also whether b2 might still be in use for
region 1 or 3.
(This might btw lead to situations where we cannot uncommit anything,
as there might always be something holding on to the BOT pages if only
every other region is committed.)
At the same time, uncommitting b3 might also cascade into b2 and any
following b4, i.e. we may need to uncommit those too.
The reverse situation, committing new regions, adds complexity as well.
E.g. assume that everything is uncommitted and we want to commit region
2. Now I need to make sure to commit b2 and b3. In the same way, if you
then want to commit region 3, you must make sure that you do not commit
b3 again.
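To make that concrete, below is a minimal sketch of the kind of
bookkeeping option (2) would need: per-page use counts, with
commit/uncommit on the zero/non-zero transitions. All names are made up
for illustration; this is not actual HotSpot code.

  #include <cstddef>
  #include <vector>

  // Tracks, for every BOT page, how many committed heap regions still
  // use (part of) it. A page may only be (un-)committed when its count
  // changes between zero and non-zero.
  struct AuxPageTracker {
    std::vector<size_t> _users; // per-page count of committed regions

    explicit AuxPageTracker(size_t num_pages) : _users(num_pages, 0) {}

    // 'pages' are the (possibly shared) page indices covering a region
    // being committed; computing them is itself extra complexity with
    // mixed small/large pages.
    void commit_region_pages(const std::vector<size_t>& pages) {
      for (size_t p : pages) {
        if (_users[p]++ == 0) {
          // first user appeared: os-commit page p here
        }
      }
    }

    void uncommit_region_pages(const std::vector<size_t>& pages) {
      for (size_t p : pages) {
        if (--_users[p] == 0) {
          // last user went away: os-uncommit page p here
        }
      }
    }
  };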
----------
Another option (3) would be to always use small pages in the border
areas between heap regions, to avoid a single BOT page covering parts
of multiple heap regions.
Still you need to do some address calculations, splitting all regions
into left/middle/right areas that need to be (un-)committed separately.
That increases the number of required TLB entries by at least a factor
of 33 compared to being aligned (32M region size: 15 large pages + 512
small pages per region vs. 16 large pages). It gets much worse for
smaller region sizes, e.g.:
4M region size: 1 large + 512 small vs. 2 large pages, factor 256
2M region size: 512 small vs. 1 large, factor 512
1M region size: 512 small vs. 1/2 large
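For reference, a quick check of these factors (my own arithmetic,
assuming 2M large and 4K small pages, with one large page's worth of
small pages spent per region border):

  #include <cstdio>

  int main() {
    const long large = 2L * 1024 * 1024, small_pg = 4L * 1024;
    const long regions[] = { 32L << 20, 4L << 20, 2L << 20 };
    for (long region : regions) {
      long aligned = region / large;                        // option (4)
      long split = (region / large - 1) + large / small_pg; // option (3)
      printf("%ldM region: %ld vs. %ld TLB entries, factor %.0f\n",
             region >> 20, split, aligned, (double)split / aligned);
    }
    return 0;
  }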
I do not think option (3) is a good trade-off. It would particularly
hurt small heaps (e.g. 1 or 2M region size, e.g. on 32 bit systems),
making enabling large pages basically useless, because due to the
splitting we would only use small pages there anyway, as shown above.
On large heaps (e.g. on 64 bit systems) it should not matter to waste
at most one large page in committed and reserved space.
Now we could play tricks like splitting/merging pages on demand, but
that is even more complicated for no particular gain imo.
----------
Option (4): Align the reserved space to a multiple of the large page
size, i.e. what is implemented right now.
Examples for this case:
1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3    // heap region layout
|     L     |     L     |     L     |   // area covered by BOT page below
      b1          b2          b3        // BOT page number
So in this case the operations on the auxiliary data are simple, and
the other cases are simple too:
One BOT page corresponds to multiple regions:
1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7   // heap region layout
|     L     |     L     |     L     |    // area covered by BOT page below
      b1          b2          b3         // BOT page number
or if one region corresponds to multiple BOT pages:
1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 ...  // heap region layout
|     L     |     L     |     L     |     // area covered by BOT page below
      b1          b2          b3          // BOT page number
In both of these cases there is absolutely no need to do any sort of
complicated calculations about what to commit or uncommit.
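With everything aligned, the region-to-page mapping is plain index
arithmetic in both directions. A sketch (not the actual
G1RegionToSpaceMapper code):

  #include <cstddef>

  // One page covers one or more whole regions:
  size_t page_of_region(size_t region_idx, size_t regions_per_page) {
    return region_idx / regions_per_page;
  }

  // One region covers one or more whole pages:
  size_t first_page_of_region(size_t region_idx, size_t pages_per_region) {
    return region_idx * pages_per_region;
  }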
------------
Maybe there are more possibilities.
> > That does not seem to be a good tradeoff in complexity, given that the
> > waste is at most one large page in reserved space (and unfortunately,
> > due to the Linux large page implementation also in actually used space).
>
> The waste is 1 large page for each auxiliary data structure?
Yes, at most.
Here is a summary of the downsides of this approach of aligning the
start of the auxiliary memory to large page size (if large pages are
enabled):
- on Solaris (or everything else that does not use the Linux large page
code): the _reserved_ space is one large page too large (if large pages
are used). The change never commits the additional reserved space.
E.g.: we need a BOT of 2M+4k size, so we reserve a space of 4M, and only
ever commit the 2M+4k.
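In code, the example looks roughly like this (align_up and the
reserve/commit calls are stand-ins for the real VM functions):

  #include <cstddef>

  // Round size up to the next multiple of a power-of-two alignment.
  size_t align_up(size_t size, size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
  }

  void reserve_bot_example() {
    const size_t large_page = 2u * 1024 * 1024;
    const size_t bot_size = large_page + 4 * 1024;          // 2M + 4k
    const size_t reserved = align_up(bot_size, large_page); // 4M
    // reserve(reserved);  // 4M of address space is reserved, ...
    // commit(bot_size);   // ... but only 2M+4k is ever committed.
    (void)reserved;
  }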
I can see if I can optimize this a little; the problem is that when
trying to reserve memory, the code paths in the Linux code either have
asserts that fail if the passed size is not aligned to the required
alignment when requesting large pages (UseHugeTLBFS), or just silently
fall back to small pages (UseSHM).
I did not want to have a Solaris specific code path in the G1 code, and
imo the larger reservation does not really matter.
- on Linux:
- when SHM is used: we reserve and possibly commit up to a single
large page too much (actually large page size minus the commit
granularity page size).
The code in G1CollectedHeap::create_aux_memory_mapper() tries to always
get large pages by aligning up the requested size to large page size,
because otherwise, if the requested size is not aligned to the commit
size, it falls back to the small page code (i.e. it ignores the request
for large page memory outright).
- when HugeTLBFS is used (default): we reserve and commit up to a
single large page too much.
There is no code path to get memory whose address is aligned to large
page granularity but whose size is not, without asserts failing.
---------------
Worst case memory impact:
The BOT (and other similar sized data structures) only try to use large
pages if the heap is >= 1G (on x86, large page size * 512), so only in
case of such large heaps, you waste almost large page size (times three
for the three data structures) - in Solaris only reserved space, on
Linux also committed space. I.e. 0.6% of memory in total.
The bitmaps only try to use large pages for heaps >= 128M. This means
that if you allocate a 129M heap, you waste almost 2M per bitmap, i.e.
an additional ~3% of heap for the two bitmaps.
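A quick check of these percentages (my arithmetic, assuming 2M large
pages):

  #include <cstdio>

  int main() {
    const double large_mb = 2.0;
    // >= 1G heap: up to ~one large page wasted per data structure:
    printf("%.1f%%\n", 3 * large_mb / 1024.0 * 100.0); // ~0.6% of 1G
    // 129M heap: up to ~one large page wasted per bitmap:
    printf("%.1f%%\n", 2 * large_mb / 129.0 * 100.0);  // ~3.1% of heap
    return 0;
  }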
We might want to tune that by e.g. only using large pages if there are
at least X full large pages to use. Currently this X is one (the same
value as before).
Also, the question is whether it is even that useful to use large pages
with a 128M heap. Not sure; I would guess it does not matter.
Difference to previous behavior:
Before that change of virtual space handling (<9b29), on Solaris, we
reserved and committed only the exact amount of memory afaics.
On Linux, we reserved and committed an exact number of large pages, or
used small pages in fallback paths. That means, if the size of a
particular data structure was not aligned to large page size, on Linux
we always used small pages.
This difference in behavior could be changed back if wanted, i.e. if the
size we want to allocate is not aligned to large page size, use small
pages (and not spend extra memory).
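As a policy, that old behavior would amount to something like the
following one-liner (hypothetical helper, not actual VM code):

  #include <cstddef>

  // Old behavior: only use large pages if the size is an exact multiple
  // of the large page size, otherwise fall back to small pages.
  bool use_large_pages(size_t size, size_t large_page_size) {
    return size % large_page_size == 0;
  }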
-------------------
Performance impact analysis:
- on Solaris: we are back to original performance at least.
- on Linux: if large pages are available, G1 should be faster than
before as G1 uses large pages more aggressively now.
-------------------
Summarizing, I think the alignment requirement does not have a lot of
impact on committed memory while it keeps the code reasonably simple
(and maybe somebody knows why it is not possible to get a reservation
whose address is large page aligned but whose size is not a multiple of
the large page size).
There is the problem that G1 uses slightly more reserved memory, i.e. an
extra large page at the end of the heap at most. This does not seem to
be a problem. On 64 bit systems reservations are not an issue. On 32 bit
systems I do not consider this a problem either as the increase is small
in absolute terms (at most one large page).
The gain from implementing option (2) or (3) above seems too small to
me considering the downsides listed.
I think the open question is if and how to handle Linux better, i.e.
either (on Linux only?) falling back to small pages for memory sizes
that are not divisible by the large page size (the old behavior), or
improving the Linux ReservedSpace code to allow "special" kind aligned
reservations using large pages at the beginning and small pages at the
end (that are pre-committed as before).
Thanks,
Thomas