RFR: bug: Timely Reducing Unused Committed Memory

Thu Sep 6 15:51:09 UTC 2018

Hi again,

On Wed, 2018-09-05 at 23:52 +0000, Rodrigo Bruno wrote:
> Hi Thomas,
> 
> thank you for your fast reply!
> 
> Thomas Schatzl <thomas.schatzl at oracle.com> escreveu no dia quarta,
> 5/09/2018 à(s) 20:13:
> > Hi Rodrigo,
> > 
> > On Tue, 2018-09-04 at 15:04 +0000, Rodrigo Bruno wrote:
> > > Hi Thomas,
> > > 
> > 
> > [...]
> > 
> > > 
> > > I didn't devise any experiments to measure memory usage. However,
> > > we can obviously do that.

That's not required.

> > > 
> > > The new version of the patch regarding 
> > > 
> > > http://openjdk.java.net/jeps/8204089
> > > 
> > > can be found at
> > > 
> > > http://www.gsd.inesc-id.pt/~rbruno/webrev.zip
> > > 
> > > This version already includes idle compaction (Full GC is 
> > > optional) and your previous comments as well.
> > > 
> > > Let me know what you think. 
> > 
> > - small tidbit: the change did not update the type of the
> > _last_heap_resize variable; the webrev does not compile as is
> > because of the recent introduction to use -Wreorder.
> > 
> 
> Oh... I will fix that and send a new version very soon.

Don't worry. When trying this myselves, I found that we are lacking
some helper methods to the Ticks/Tickspan classes that I intend to
introduce in a bit... so I kept it.

> > - the test crashes here with some assert during resizing (after
> > concurrent marking) in our regression testing - it does not give
> > any new errors though, so that is good at least. 
> > I will definitely need to look into this - this may be an existing
> > issue with (in this case heap shrinking) :)
> > 
> 
> Which test? I was running the new test in Linux x86_64 slowdebug
> build and not hitting any assert. How did you run it?
> 
> Still regardind tests, it there a way to run the same test with
> multiple (different) command line arguments?

The problem is some assert in the resizing code that checks whether
some cached value of amount of bytes used in the heap is equal to the
actual bytes used.

This does not match if there is some region we are currently allocating
into, as we are only updating that cached value when that region is
full for various regions.

The resizing (resize_if_necessary_after_full_collection()) had the
implict assumption that there is no such regions, so that cached value
did not contain it.

It triggers e.g. with your test when running with -Xcomp.

>  
> > - it would have been a lot nicer if the sleep_before_next_cycle()
> > returned some integer/enum which can be acted upon outside of it.
> > I.e. it is imho much more readable to separate control/decision
> > making from acting on it compared to separating both in this case.
> > The current sleep_before_next_cycle() is kind of hacky :) 
> > I will look into this as well and propose an alternative. This may
> > take a day or two.
> 
> Okay, we can definitely improve the current version of this method.

So there is a new version of the change at http://cr.openjdk.java.net/~
tschatzl/jelastic/pgc-webrev.2/ (full) and http://cr.openjdk.java.net/~
tschatzl/jelastic/pgc-webrev.1_to_2 (diff) that fixes a few bugs, and
also makes that method nicer, actually reverting it to the former.

Let me explain what I changed:

- there is a race between execution of the VM operation to start the
concurrent phase (calling the VM_G1CollectForAllocation) and the actual
work done.
I.e. actual work outside of
G1ConcurrentMarkThread::sleep_before_next_cycle() could start before
the actual VM operation started.

Finally, it is possible (but really unlikely) that the work outside
completes before returning to
G1ConcurrentMarkThread::sleep_before_next_cycle(), where the CGC_lock
is grabbed again. However the VM operation itself also needs to
CGC_lock to notify G1ConcurrentMarkThread, deadlocking (at least some
code I had had that problem :))

So the only way out here unfortunately is to have a separate thread
doing idle detection and scheduling of the necessary VM operation. That
thread is only started if idle detection is active.

I think the code was sort-of aware of this issue by break'ing out of
the while loop and that ominous "TODO" comment :) 

- the mentioned failing assert.

- I think something bad would happen if the idle timeout triggered
while an existing concurrent marking would be under way (which could be
if you set the "load" threshold high enough). I did not actually try so
I may be wrong, but nevertheless.

- instead of the additional VM operation you can use
Universe::collect() and then using the UseFullGCForIdleCompaction flag
to schedule the correct VM operation (either full or concurrent start),
and simply always resize the heap in the concurrent cycle (in this case
in the Remark pause). There is a RFE for that out in the bug tracker,
and we've been asked about this just recently. I will split it out from
this webrev later.

That removes a lot of code.

- I added a second test case that tests UseFullGCForIdleCompaction to
the junit test.

The webrev I've uploaded is probably not complete, in fact I only just
submitted it to our test system.

At least the following work needs to be done:

- move G1DetectIdleThread out of the g1COncurrentMarkThread files into
separate ones.

- naming of the flags: I think at least GCFrequency is too generic, and
should be in ms too.

- debug messages for the idle detection to see in detail which
condition triggered and why.

- the current implementation for the lazy idle thread initialization is
probably wrong at the moment: GCFrequency that is used is manageable,
i.e. the current implementation does not consider that. If you turn it
on during runtime after it has initially been off, it does not start.
Btw, this should be tested too.

Probably there are some bugs still left, but please have a look.

Would you mind looking into the issues above?

Thanks,
  Thomas