RFR (M): 8136681: Factor out IHOP calculation from G1CollectorPolicy

Mon Nov 16 13:31:50 UTC 2015

Hi Jon,

  thanks a lot for all these reminders for better documentation. I have
been working too long on this functionality so that "everything is
clear" to me :)

New webrevs with hopefully more complete explanations at:
http://cr.openjdk.java.net/~tschatzl/8136681/webrev.1_to_2/
(incremental)
http://cr.openjdk.java.net/~tschatzl/8136681/webrev.2/ (changes)

On Fri, 2015-11-13 at 07:58 -0800, Jon Masamitsu wrote:
>Thomas,
> 
> This is partial.  If you send out a second webrev based on Mikael's
> review, I'll finish with that.
> 
> 
> http://cr.openjdk.java.net/~tschatzl/8136681/webrev/src/share/vm/gc/g1/g1CollectedHeap.hpp.frames.html
> 
> > 1370 // Returns the number of regions the humongous object of the 
> > given word size
> > 1371 // covers.
> 
> "covers" is not quite right since to me it says that the humongous 
> object completely uses the
> region.  I'd use "requires".

Fixed.

> http://cr.openjdk.java.net/~tschatzl/8136681/webrev/src/share/vm/gc/g1/g1IHOPControl.hpp.html
> 
>    49   // Update information about recent time during which allocations happened,
>    50   // how many allocations happened and an additional safety buffer.
> 
> // Update information about
> 
> I DON'T KNOW WHICH OF THESE IS MORE PRECISE.
> //   Time during which allocations occurred (sum of mutator execution time + GC pause times)
> OR
> //   Concurrent marking time (concurrent mark end - concurrent mark start)
> 
> //   Allocations in bytes during that time
> //   Safety buffer ???
> 
> I couldn't figure out what the safety buffer is supposed to be.  It seems to
> be the young gen size but don't know why.

I tried to explain in in the text. In short, in G1 the IHOP value is
based on old gen occupancy only. The problem is that the young gen also
needs to be allocated somewhere too.

Now you could just say, use the maximum young gen size. However this is
60% of the heap by default... so the adaptive IHOP algorithm uses a
measure of the young gen that is not bounded by G1ReservePercent.

The reason to use the unbounded value is because if the code used the
bounded one, it would cancel out with G1ReservePercent, because the
closer we get to G1ReservePercent, the smaller that bounded value would
get, which would make the current IHOP value rise etc, which would delay
the initiation of the marking.

That would end up loosing throughput as then the young gen gets smaller
and smaller (and GC frequency increases), it can take a long time until
G1 gets close enough to G1ReservePercent so that the other factors
(allocation rate, marking time) are used.

Basically initial mark will be delayed until young gen reaches its
minimum size, at which time G1 will continue to use that young gen size
until marking occurs. Which means that typically G1 will eat into
G1ReservePercent, which we also do not want.

Additionally it would get G1 more in trouble in regards to pause time,
giving it less room during that time.

Note that G1 young gen sizing already has some issues with young gen
sizing if it operates close to G1ReservePercent. If there is time I will
look at it again later.

>    33 // Manages the decision about the threshold when concurrent marking should start.
> 
> // IHOP = Initiating heap occupancy in percent of the entire heap
> 
>    34 class G1IHOPControl : public CHeapObj<mtGC> {
>    35  protected:
>    36   double _ihop_percent;
> // Occupancy in bytes.  This is always less than or equal to the (maximum or current maximum?) heap occupancy
>    37   size_t _target_occupancy;
> 
> 
> 	// Most recent complete allocation period (time from first young-only GC to
> 	// end of marking) in seconds.  Any period which is interrupted by a full GC
> 	// is ignored.
>    65    _last_allocation_time_s(0.0),
> 	// Number of bytes allocated during that period (mutator allocation + promotions)
>    66    _last_allocated_bytes(0),
> 	// Length of most recent marking cycle in seconds
>    67    _last_marking_length_s(0.0) { }
> 
> 	// Occupany in bytes at which a concurrent marking cycle is started. This count
> 	// includes the waste in regions occupied by humongous objects and waste at
> 	// the end of any old region that is not the currently being used for promotions
> 	// and the waste at the end of survivor regions.

I skipped that paragraph about which waste e.g. at end of TLABs or
humongous regions is used, and which not. IHOP calculation needs to
consider actual free space for allocation.

Unfortunately G1 is in two minds about this, i.e. used() for humongous
objects does not contain the "waste" at the end of the las region, but
used() for regular regions does.

> 
>    69   size_t get_conc_mark_start_threshold() { return (size_t) (_ihop_percent * _target_occupancy / 100.0); }
> 
> This might be better named update_marking_length() but can be done in a later cleanup.
> 

Done.

Thanks a lot for all these reminders for documentation. I have been
working too long on this functionality so that "everything is clear" to
me :)

New webrevs with hopefully more complete explanations at:
http://cr.openjdk.java.net/~tschatzl/8136681/webrev.1_to_2/
(incremental)
http://cr.openjdk.java.net/~tschatzl/8136681/webrev.2/ (changes)

>    	// Update marking length (concurrent marking end - concurret marking start)
>    77   virtual void update_time_to_mixed(double marking_length_s) {
> 
> Have you given much thought to what affect a to-space exhaustion should 
> have on
> IHOP?  I understand that it is not the design center for this but I 
> think that to-space
> exhaustion can confuse the statistics.   Maybe a reset of the statistics 
> and a dropping
> the IHOP to a small value (current heap occupancy or even 0) until you 
> get 3 successful
> marking cycles.  Or think about it later.

I already thought a little about how this should interact with regards
to the calculation: the idea is that basically the algorithm will notice
that there is a significant amount of additional allocation, and will
lower the IHOP threshold automatically. (Looking through the code I
think I saw some problems here, I will see to fix that)

If evacuation failure happens in a gc during marking, there are a few
options, I have not decided on what's best:

- do nothing because the user asked to run an impossible to manage
workload (combination of live set, allocation rate, and other options).

  - there is already some log output which that information can be
derived from.

- allow the user to set a maximum IHOP threshold. He could base this
value on the log messages he gets.
  - the user can already do that by increasing G1ReservePercent btw

- make sure that marking completes in time

  - let the mutator threads (or during young gcs while marking is
running) do some marking if we notice that we do not have enough time.
Not sure if it is worth the effort.

  - fix some bugs in marking :) that prevent that in extraordinary
conditions.

  - make sure that we always start marking early enough by making sure
that mixed gc reclaims enough memory. Planning some work here as part of
generic work on improving gc policy.

  - improve predictions: some work being done here.

- make evacuation failure really fast and hope it works out

  - do not immediately move young gen regions with evacuation failures
into young gen regions, but make them survivor.

  - there have already been lots of improvements here as you might have
noticed.

Thanks,
  Thomas