[G1] Why humongous object bytes allocated after a collection are not recorded in G1Policy::_bytes_allocated_in_old_since_last_gc?

Wed May 27 12:05:56 UTC 2020

Hi,

On 25.05.20 18:39, Thomas Schatzl wrote:
> Hi Ziyi,
> 
> On 25.05.20 18:16, Luo, Ziyi wrote:
>> Hi Thomas,
>>
>>> Hi,
>>>
>>> On 23.05.20 00:30, Luo, Ziyi wrote:
>>>> Hi,
>>>>
>>>> I have a question about humongous object allocation result in 
>>>> g1CollectedHeap:
>>>> http://hg.openjdk.java.net/jdk/jdk/file/6d7c3a8bfab6/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#l874 
>>>>
>>>> http://hg.openjdk.java.net/jdk/jdk/file/6d7c3a8bfab6/src/hotspot/share/gc/g1/g1CollectedHeap.cpp#l893 
>>>>
>>>>
>>>> L874 returns the result after a successful humongous allocation. 
>>>> L893 returns
>>>> the result after a successful G1CollectForAllocation. In the former 
>>>> case, the
>>>> allocated bytes are recorded in
>>>> G1Policy::_bytes_allocated_in_old_since_last_gc in L873. But bytes 
>>>> in the
>>>> latter case are not. Is this intentional or a mistake?
>>>>
>>>> Best,
>>>> Ziyi
>>>>
>>>
>>>    the collection itself has some code to add the allocated bytes. In
>>> the first case since there is no collection, that value needs to be
>>> incremented manually.
>>>
>>> But you probably already know that, you filed JDK-8245511 :)
>>
>> Yes, this is about JDK-8245511. In IHOP, 
>> _bytes_allocated_in_old_since_last_gc
>> is used to calculate the old gen allocation rate. In some jtreg tests 
>> (e.g., gc
> 
> Afaik the only consumer for the old gen allocation rate (i.e. 
> _bytes_allocated_in_old_since_last_gc and the mutator phase time) is 
> IHOP calculation, so whatever makes it "correct" should be stored in there.
> 
> As mentioned in my comment to JDK-8245511, imho 
> _bytes_allocated_in_old_since_last_gc should contain the "surviving" 
> bytes allocated in the old gen, i.e. bytes_at_end_of_gc_X_in_old_gen - 
> bytes_at_end_of_gc_X_in_old_gen_minus_1.

To properly confuse you, here is a more comprehensive answer about the 
model behind adaptive IHOP :)

I decided to post this in this thread, not the other one as it fits 
better here. In the future, let's try to consolidate in one thread. :)

Adaptive IHOP needs to know

1) the long term allocation rate, to schedule mark start
2) short term allocation spikes (between two pauses), to avoid an OOME 
while marking or an evacuation failure during the next pause
3) how much space is needed to "survive" the first mixed gc without 
evacuation failure

These model terms are given to the IHOPControl via

   void update_allocation_info(double allocation_time_s, size_t allocated_bytes);
   void update_additional_buffer(size_t additional_buffer_bytes);

Current code provides
- the first item above via update_allocation_info(), which just treats 
humongous allocations as regular allocations.

- nothing for the second item; the current mechanism conservatively just 
includes all in-between allocations.

- the third item via the young gen size. Let me explain: to "survive" 
until the first mixed gc after completing the marking you need space for 
two parts: the young gen for that last mixed gc and any survivors 
(actually it is MAX(all-gcs-from-Remark-to-including-mixed) and the 
respective survivors, as there can be more gcs). The first part is 
supposed to be covered by passing in the recent young gen size as a very 
rough approximation; the second part is conceptually covered by the heap 
reserve (G1ReservePercent).
(I am currently preparing a fix for that because in some cases, updating 
the young gen size every gc leads to unexpected behavior; JDK-8238163)
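
To make the interplay of these terms a bit more concrete, here is a 
minimal sketch of how the two update calls could feed a mark start 
threshold. This is not the HotSpot code: SketchIHOPControl, 
average_allocation_rate(), mark_start_threshold() and their parameters 
are made up for illustration, and a plain average stands in for whatever 
predictor the real control uses.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for an IHOP control; not the HotSpot class.
class SketchIHOPControl {
  double _rate_sum = 0.0;                    // sum of sampled old gen allocation rates (bytes/s)
  std::size_t _num_samples = 0;              // number of rate samples seen so far
  std::size_t _additional_buffer_bytes = 0;  // latest buffer, e.g. a recent young gen size

public:
  // Item 1: record one sample of the old gen allocation rate.
  void update_allocation_info(double allocation_time_s, std::size_t allocated_bytes) {
    assert(allocation_time_s > 0.0);
    _rate_sum += static_cast<double>(allocated_bytes) / allocation_time_s;
    _num_samples++;
  }

  // Item 3: remember the extra space needed to get through the first mixed gc.
  void update_additional_buffer(std::size_t additional_buffer_bytes) {
    _additional_buffer_bytes = additional_buffer_bytes;
  }

  // Assumption: a plain average over all samples, purely for illustration.
  double average_allocation_rate() const {
    return _num_samples == 0 ? 0.0 : _rate_sum / static_cast<double>(_num_samples);
  }

  // Start marking early enough that the bytes expected to be allocated while
  // marking runs, plus the additional buffer, still fit below the target.
  std::size_t mark_start_threshold(std::size_t target_occupancy_bytes,
                                   double predicted_marking_time_s) const {
    std::size_t needed =
        static_cast<std::size_t>(average_allocation_rate() * predicted_marking_time_s) +
        _additional_buffer_bytes;
    return needed < target_occupancy_bytes ? target_occupancy_bytes - needed : 0;
  }
};
```

Note that the second item (spikes between two pauses) has no input of 
its own in this sketch either; it only shows up to the extent that the 
samples passed to update_allocation_info() include all in-between 
allocations.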

The current long term allocation rate is very conservative, and this is 
where JDK-8245511 is trying to improve the situation afaiu: it suggests 
lowering the long term allocation rate to the actual long term rate 
(discounting empty region eager reclamation, potentially other 
reclamation like during remark) - but in return the change imho needs to 
provide some measure of the spikiness of the allocation between gcs. 
Otherwise G1 will run into to-space exhaustion and potentially full gcs 
all the time with such loads.
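
As a rough illustration of what "actual long term rate plus some measure 
of spikiness" could look like, the sketch below derives both from 
per-gc samples. Everything in it is hypothetical (GcSample, 
OldGenAllocationModel, summarize() and the field names are not from 
HotSpot or from the patch), and clamping negative occupancy deltas to 
zero is just one possible choice.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical per-gc sample; none of these names exist in HotSpot.
struct GcSample {
  double mutator_time_s;                  // mutator time since the previous gc
  std::size_t old_bytes_at_gc_end;        // old gen occupancy at the end of this gc
  std::size_t gross_old_bytes_allocated;  // everything allocated in old gen since the
                                          // previous gc, including humongous objects
                                          // that were eagerly reclaimed again
};

struct OldGenAllocationModel {
  double long_term_rate;        // "surviving" bytes per second of mutator time
  std::size_t max_spike_bytes;  // largest gross allocation between two gcs
};

OldGenAllocationModel summarize(std::size_t initial_old_bytes,
                                const std::vector<GcSample>& samples) {
  double total_time_s = 0.0;
  std::size_t surviving_bytes = 0;
  std::size_t max_spike_bytes = 0;
  std::size_t prev_old_bytes = initial_old_bytes;

  for (const GcSample& s : samples) {
    total_time_s += s.mutator_time_s;
    // Surviving bytes: occupancy at the end of gc X minus occupancy at the end
    // of gc X-1. Assumption: a negative delta (net reclamation) counts as zero.
    if (s.old_bytes_at_gc_end > prev_old_bytes) {
      surviving_bytes += s.old_bytes_at_gc_end - prev_old_bytes;
    }
    prev_old_bytes = s.old_bytes_at_gc_end;
    // Spikiness: remember the worst gross allocation between two pauses.
    max_spike_bytes = std::max(max_spike_bytes, s.gross_old_bytes_allocated);
  }

  double rate = total_time_s > 0.0
                    ? static_cast<double>(surviving_bytes) / total_time_s
                    : 0.0;
  return OldGenAllocationModel{rate, max_spike_bytes};
}
```

The long term rate would then go where the conservative rate goes today, 
and the spike value would be the extra model term that keeps G1 from 
running out of space between two pauses.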

The test program can do without this model term because, first, the 
current policy is fairly bad (too conservative) for this case, and 
second, the test does not keep any humongous objects at all. It also 
does not care about to-space exhaustions, which are fairly fast, but 
still slower than needed.

Quick(!) testing on the reproducer with the suggested patch showed that 
around 23% of young gcs were evacuation failures (60s run, 65 young gcs 
total).
The change still is an improvement over baseline of course, which has 
>10x more GCs but no evac failures :) Maybe this could be improved 
though, in addition to the suggestions I already added to the JIRA entry.

Thanks,
   Thomas