G1: Abort concurrent at initial mark pause
Liang Mao
maoliang.ml at alibaba-inc.com
Tue Mar 3 11:14:04 UTC 2020
Hi All,
As per the previous discussion, there are several ideas to improve humongous
object handling. We've run some experiments, and canceling the concurrent
mark at the initial mark pause has proved effective in the scenario where
frequent allocation of temporary humongous objects leads to frequent
concurrent marks and high CPU usage. The sub-test scimark.fft.large in
SPECjvm2008 is exactly this case, but it is not GC sensitive, so there is
little difference in score.
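For reference, below is a minimal sketch of the kind of allocation pattern we
mean; the class name, sizes, and flags are purely illustrative and not taken
from the actual workload:

// With e.g. -XX:G1HeapRegionSize=2m, any array larger than half a region
// (1 MB) is allocated as a humongous object directly in the old generation.
// A steady stream of short-lived arrays like this can repeatedly push
// occupancy over the initiating heap occupancy threshold and start
// concurrent mark cycles even though the arrays die almost immediately.
public class TemporaryHumongousAllocation {
    public static void main(String[] args) {
        final int HUMONGOUS_BYTES = 2 * 1024 * 1024; // well above half a 2 MB region
        for (int i = 0; i < 100_000; i++) {
            byte[] tmp = new byte[HUMONGOUS_BYTES];  // humongous allocation
            tmp[i % tmp.length] = 1;                 // touch it, then let it die
        }
    }
}

Running something like this with GC logging enabled (e.g. -Xlog:gc*) should
show back-to-back concurrent cycles that do little useful work, which is the
pattern the patch targets.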
The patch is small; shall we have a bug id for it?
http://cr.openjdk.java.net/~luchsh/g1hum/humongous.webrev/
Thanks,
Liang
------------------------------------------------------------------
From: Thomas Schatzl <thomas.schatzl at oracle.com>
Send Time: 2020 Jan. 21 (Tue.) 18:20
To: "MAO, Liang" <maoliang.ml at alibaba-inc.com>; Man Cao <manc at google.com>; hotspot-gc-dev <hotspot-gc-dev at openjdk.java.net>
Subject: Re: Discussion: improve humongous objects handling for G1
Hi,
On 21.01.20 07:25, Liang Mao wrote:
> Hi Thomas,
>
> In fact we saw this issue with 8u. One issue I forgot to mention is that when
> CPU usage is quite high, nearly 100%, the concurrent mark gets very slow, so
> to-space exhaustion happens. BTW, are there any improvements on this point in
> JDK 11 or higher versions? I haven't noticed any so far.
JDK13 has some implicit increases in the thresholds to take more
humongous candidate regions. Not a lot though.
> Increasing the reserve percent could alleviate the problem but does not seem
> to be a complete solution.
It would be nicer if g1 automatically adjusted this reserve based on
actual allocation of course. ;)
Which is another option btw - there are many ways to avoid the
evacuation failure situation.
> Cancelling the concurrent mark cycle in the initial-mark pause seems a delicate
> optimization which can cover some issues if a lot of humongous regions have been
> reclaimed in this pause. It can avoid the unnecessary cm cycle and also trigger cm
> earlier if needed.
> We will take this into consideration. Thanks for the great idea :)
>
> If there is a short-lived humongous object array which also references other
> short-lived objects, the situation could be worse. If we increase
> G1HeapRegionSize, some humongous objects become normal objects, the behavior
> is more like CMS, and then everything goes fine. I don't think we have to
> disallow humongous objects from behaving as normal ones. A newly allocated
> humongous object array can probably reference objects in the young generation,
> and scanning the object array via the remset may not be better than directly
> iterating the array during evacuation because of possible prefetching. We can
> have an alternative max survivor age for humongous objects, maybe 5 or 8
If I read this paragraph correctly, you argue that keeping a large
humongous objArray in young is okay because
a) if you increase the heap region size, there is a high chance that it
would be below the threshold anyway, so you would scan it anyway;
b) scanning a humongous objArray with a few references is not much
different performance-wise from targeted scanning of the corresponding
cards in the remembered set, because of hardware.
Regarding a): Since I have yet to see logs, I can't tell what the typical
size of these arrays is (and I have not seen a "typical" humongous
object size distribution for these applications). However, region sizes
are roughly proportional to the heap size, which in turn roughly corresponds
to the hardware you need to use. I.e., with current ergonomics you likely
won't see G1 using 100 threads on a 200m heap with 32m regions.
Even then this limits such objArrays to 16M (at 32m region size), which
limits the time spent scanning the object (and if ergonomics select 32m
regions, the heap and the machine are probably quite big anyway). From
what you and Man have described, you seem to have a significant number of
humongous objects of unknown type that are much(?) larger than that.
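(Making the threshold arithmetic explicit - the snippet below is only an
illustration, using typical 64-bit HotSpot header/reference sizes with
compressed oops rather than measured values:)

// In G1 an object is humongous when it is larger than half a region, so at
// 32m regions anything above 16M goes the humongous path. The header and
// reference sizes are approximations for 64-bit HotSpot with compressed oops.
public class HumongousThreshold {
    public static void main(String[] args) {
        long regionSize  = 32L * 1024 * 1024;  // 32m regions
        long threshold   = regionSize / 2;     // humongous above ~16 MB
        long headerBytes = 16;                 // approximate Object[] header
        long refBytes    = 4;                  // compressed oops
        long maxElements = (threshold - headerBytes) / refBytes;
        System.out.printf("At %dm regions, an Object[] with more than ~%,d elements%n"
                + "(over ~%d MB) is allocated as a humongous object.%n",
                regionSize >> 20, maxElements, threshold >> 20);
    }
}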
Regarding b): that was wrong years ago when I did experiments on
it (even for the "limit age on humongous obj arrays" workaround - you can
easily go as low as a max tenuring threshold of 1 to catch almost all of
the relevant ones), and it very likely still is.
Let me do some back-of-the-envelope calculations: Assuming we have a 32M
object (random number, i.e. ~8m references), with, say, 1k actual references
(which is more than a handful), the remembered set would make you scan
only about 1.5% max (1000 * 512 bytes/card) of the object. I seriously doubt
that prefetching or some magic hardware will make that amount of additional
work disappear.
From a performance POV, with 20 GB/s of bandwidth available (which I am
not sure you will reach during GC, for whatever reasons; random number),
you are spending ~1.5ms (if I calculated correctly) of CPU time just to
find out, in the worst case, that the 32M object is completely full of nulls.
That's also the minimum amount of time you need per such object.
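(Spelling the same estimate out - object size, card count, and bandwidth are
the assumed numbers from above, and the snippet is only an illustration:)

// Back-of-the-envelope numbers from the text: a 32M objArray, ~1k dirty
// cards of 512 bytes each, and an assumed 20 GB/s of scanning bandwidth.
public class RemsetScanEstimate {
    public static void main(String[] args) {
        double objectBytes    = 32.0 * 1024 * 1024;         // 32M objArray
        double cardBytes      = 512;                        // bytes per card
        double dirtyCards     = 1000;                       // ~1k references
        double bandwidthBytes = 20.0 * 1024 * 1024 * 1024;  // 20 GB/s (assumed)

        double scannedFraction = (dirtyCards * cardBytes) / objectBytes;  // ~1.5%
        double fullScanMillis  = objectBytes / bandwidthBytes * 1000.0;   // ~1.6 ms

        System.out.printf("Remset-directed scan touches ~%.1f%% of the object%n",
                scannedFraction * 100.0);
        System.out.printf("Scanning the whole array takes ~%.2f ms of CPU time%n",
                fullScanMillis);
    }
}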
Keeping it outside of the young gen - particularly since, if it has been
allocated just recently, it won't have a lot of remembered set entries -
would likely be much cheaper than that (as mentioned, G1 has a good
measure of how long scanning a card takes, so we could use this number).
Only if G1 is going to scan it almost completely anyway (which we agree
is unlikely to be the case, as it has "just" been allocated) would
keeping it outside be disadvantageous.
Note that its allocation could still be counted against the eden
allowance in some situations. This could be seen as a way to slow down
the mutator while G1 is busy trying to complete the marking.
I am, however, not sure it helps a lot, assuming that changes to perform
eager reclaim on objArrays won't work during marking. There would need
to be a different way of enforcing such an allocation penalty.
Without more thinking and measurements I would not know when and how to
account for that, and what should happen with existing mechanisms for
absorbing allocation spikes (i.e. G1ReservePercent). I just assume that you
probably do not want both. Also something to consider.
> at most; otherwise let eager reclaim handle it. A tradeoff can be made to balance
> the pause time and the reclamation possibility of short-lived objects.
>
> So the enhanced solution can be:
> 1. Cancel the concurrent mark if it is not necessary.
> 2. Increase the reclamation possibility of short-lived humongous objects.
These are valid possibilities to improve the overall situation without
fixing actual fragmentation issues ;)
> An important reason for this issue is that Java developers readily
> point out that CMS can handle the application without a significant CPU
> usage increase (caused by concurrent mark) and ask why G1 cannot.
> Personally I believe G1 can do anything no worse than CMS :)
> This proposal aims at the throughput gap compared to CMS. If it works
> together with the barrier optimization proposed by Man and Google, imho
> the gap could be noticeably reduced.
Thanks,
Thomas