G1: Abort concurrent at initial mark pause
Liang Mao
maoliang.ml at alibaba-inc.com
Tue Mar 3 11:14:04 UTC 2020
Hi All,
As per the previous discussion, there are several ideas to improve humongous
object handling. We've run some experiments, and canceling the concurrent
mark at the initial mark pause has proved effective in the scenario where
frequent allocation of temporary humongous objects leads to frequent
concurrent marks and high CPU usage. The sub-test scimark.fft.large in
SPECjvm2008 is exactly this case, but it is not GC sensitive, so there is
little difference in score.
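For reference, below is a minimal sketch of the kind of allocation pattern we
mean; the class name, sizes, and flags are purely illustrative and not taken
from the actual workload:

// With e.g. -XX:G1HeapRegionSize=2m, any array larger than half a region
// (1 MB) is allocated as a humongous object directly in the old generation.
// A steady stream of short-lived arrays like this can repeatedly push
// occupancy over the initiating heap occupancy threshold and start
// concurrent mark cycles even though the arrays die almost immediately.
public class TemporaryHumongousAllocation {
    public static void main(String[] args) {
        final int HUMONGOUS_BYTES = 2 * 1024 * 1024; // well above half a 2 MB region
        for (int i = 0; i < 100_000; i++) {
            byte[] tmp = new byte[HUMONGOUS_BYTES];  // humongous allocation
            tmp[i % tmp.length] = 1;                 // touch it, then let it die
        }
    }
}

Running something like this with GC logging enabled (e.g. -Xlog:gc*) should
show back-to-back concurrent cycles that do little useful work, which is the
pattern the patch targets.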
The patch is small; shall we have a bug id for it?
http://cr.openjdk.java.net/~luchsh/g1hum/humongous.webrev/
Thanks,
Liang
------------------------------------------------------------------
From: Thomas Schatzl <thomas.schatzl at oracle.com>
Send Time: 2020 Jan. 21 (Tue.) 18:20
To: "MAO, Liang" <maoliang.ml at alibaba-inc.com>; Man Cao <manc at google.com>; hotspot-gc-dev <hotspot-gc-dev at openjdk.java.net>
Subject: Re: Discussion: improve humongous objects handling for G1
Hi,
On 21.01.20 07:25, Liang Mao wrote:
> Hi Thomas,
>
> In fact we saw this issue with 8u. One issue I forgot to mention is that when
> CPU usage is quite high, nearly 100%, the concurrent mark gets very slow, so
> to-space exhaustion happens. BTW, are there any improvements on this point in
> JDK 11 or higher versions? I haven't noticed any so far.
JDK13 has some implicit increases in the thresholds to take more
humongous candidate regions. Not a lot though.
> Increasing the reserve percent could alleviate the problem but does not seem
> to be a complete solution.
It would be nicer if g1 automatically adjusted this reserve based on
actual allocation of course. ;)
Which is another option btw - there are many ways to avoid the
evacuation failure situation.
> Cancelling the concurrent mark cycle in the initial-mark pause seems a delicate
> optimization which can cover some issues if a lot of humongous regions have been
> reclaimed in this pause. It can avoid the unnecessary cm cycle and also trigger cm
> earlier if needed.
> We will take this into consideration. Thanks for the great idea :)
>
> If there is a short-lived humongous object array which also references other
> short-lived objects, the situation could be worse. If we increase
> G1HeapRegionSize, some humongous objects become normal objects, the behavior
> is more like CMS, and then everything goes fine. I don't think we have to
> disallow humongous objects from behaving as normal ones. A newly allocated
> humongous object array can probably reference objects in the young generation,
> and scanning the object array via the remset may not be better than directly
> iterating the array during evacuation because of possible prefetching. We can
> have an alternative max survivor age for humongous objects, maybe 5 or 8
If I read this paragraph correctly, you argue that keeping a large
humongous objArray in young is okay because
a) if you increase the heap region size, there is a high chance that it
would be below the threshold anyway, so you would scan it anyway;
b) scanning a humongous objArray with a few references is not much
different performance-wise from targeted scanning of the corresponding
cards in the remembered set, because of hardware.
Regarding a): Since I have yet to see logs, I can't tell what the typical
size of these arrays is (and I have not seen a "typical" humongous
object size distribution for these applications). However, region sizes
are roughly proportional to the heap size, which in turn roughly corresponds
to the hardware you need to use. I.e., with current ergonomics you likely
won't see G1 using 100 threads on a 200m heap with 32m regions.
Even then this limits such objArrays to 16M (at 32m region size), which
limits the time spent scanning the object (and if ergonomics select 32m
regions, the heap and the machine are probably quite big anyway). From
what you and Man have described, you seem to have a significant number of
humongous objects of unknown type that are much(?) larger than that.
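(Making the threshold arithmetic explicit - the snippet below is only an
illustration, using typical 64-bit HotSpot header/reference sizes with
compressed oops rather than measured values:)

// In G1 an object is humongous when it is larger than half a region, so at
// 32m regions anything above 16M goes the humongous path. The header and
// reference sizes are approximations for 64-bit HotSpot with compressed oops.
public class HumongousThreshold {
    public static void main(String[] args) {
        long regionSize  = 32L * 1024 * 1024;  // 32m regions
        long threshold   = regionSize / 2;     // humongous above ~16 MB
        long headerBytes = 16;                 // approximate Object[] header
        long refBytes    = 4;                  // compressed oops
        long maxElements = (threshold - headerBytes) / refBytes;
        System.out.printf("At %dm regions, an Object[] with more than ~%,d elements%n"
                + "(over ~%d MB) is allocated as a humongous object.%n",
                regionSize >> 20, maxElements, threshold >> 20);
    }
}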
Regarding b): that was wrong years ago when I did experiments on
it (even for the "limit age on humongous obj arrays" workaround - you can
easily go as low as a max tenuring threshold of 1 to catch almost all of
the relevant ones), and it very likely still is.
Let me do some back-of-the-envelope calculations: Assuming we have a 32M
object (random number, i.e. ~8m references), with, say, 1k actual references
(which is more than a handful), the remembered set would make you scan
only about 1.5% max (1000 * 512 bytes/card) of the object. I seriously doubt
that prefetching or some magic hardware will make that amount of additional
work disappear.
From a performance POV, with 20 GB/s of bandwidth available (which I am
not sure you will reach during GC, for whatever reasons; random number),
you are spending ~1.5ms (if I calculated correctly) of CPU time just to
find out, in the worst case, that the 32M object is completely full of nulls.
That's also the minimum amount of time you need per such object.
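(Spelling the same estimate out - object size, card count, and bandwidth are
the assumed numbers from above, and the snippet is only an illustration:)

// Back-of-the-envelope numbers from the text: a 32M objArray, ~1k dirty
// cards of 512 bytes each, and an assumed 20 GB/s of scanning bandwidth.
public class RemsetScanEstimate {
    public static void main(String[] args) {
        double objectBytes    = 32.0 * 1024 * 1024;         // 32M objArray
        double cardBytes      = 512;                        // bytes per card
        double dirtyCards     = 1000;                       // ~1k references
        double bandwidthBytes = 20.0 * 1024 * 1024 * 1024;  // 20 GB/s (assumed)

        double scannedFraction = (dirtyCards * cardBytes) / objectBytes;  // ~1.5%
        double fullScanMillis  = objectBytes / bandwidthBytes * 1000.0;   // ~1.6 ms

        System.out.printf("Remset-directed scan touches ~%.1f%% of the object%n",
                scannedFraction * 100.0);
        System.out.printf("Scanning the whole array takes ~%.2f ms of CPU time%n",
                fullScanMillis);
    }
}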
Keeping it outside of the young gen - particularly since, if it has been
allocated just recently, it won't have a lot of remembered set entries -
would likely be much cheaper than that (as mentioned, G1 has a good
measure of how long scanning a card takes, so we could use this number).
Only if G1 is going to scan it almost completely anyway (which we agree
is unlikely to be the case, as it has "just" been allocated) would
keeping it outside be disadvantageous.
Note that its allocation could still be counted against the eden
allowance in some situations. This could be seen as a way to slow down
the mutator while G1 is busy trying to complete the marking.
I am, however, not sure it helps a lot, assuming that changes to perform
eager reclaim on objArrays won't work during marking. There would need
to be a different way of enforcing such an allocation penalty.
Without more thinking and measurements I would not know when and how to
account for that, and what should happen with existing mechanisms for
absorbing allocation spikes (i.e. G1ReservePercent). I just assume that you
probably do not want both. Also something to consider.
> at most; otherwise let eager reclaim handle it. A tradeoff can be made to balance
> the pause time and the reclamation possibility of short-lived objects.
>
> So the enhanced solution can be:
> 1. Cancel the concurrent mark if it is not necessary.
> 2. Increase the reclamation possibility of short-lived humongous objects.
These are valid possibilities to improve the overall situation without
fixing actual fragmentation issues ;)
> An important reason for this issue is that Java developers readily
> point out that CMS can handle the application without a significant CPU
> usage increase (caused by concurrent mark) and ask why G1 cannot.
> Personally I believe G1 can do anything no worse than CMS :)
> This proposal aims at the throughput gap compared to CMS. If it works
> together with the barrier optimization proposed by Man and Google, imho
> the gap could be noticeably reduced.
Thanks,
Thomas