RFC: Safe OOM during evac

Zhengyu Gu zgu at redhat.com
Fri Oct 20 12:59:02 UTC 2017


Okay. Sounds like string dedup can still update from-space oops without 
GC workers doing CAS-masking.

But I will likely remove this UR phase anyway, to eliminate the 
dependency on the 2nd bitmap.

Thanks,

-Zhengyu



On 10/20/2017 08:49 AM, Roman Kennke wrote:
> I guess we could make it so.
> Currently I'm only doing the CAS-masking trick on Java threads. We'd 
> have to do it for any thread (incl. GC workers).
> And we'd have to make this the default behaviour.
> Then yes, we could do that.
> 
> Roman
> 
>> Hi Roman,
>>
>> With this patch, we should not need "fixup_roots", right?
>>
>> Thanks,
>>
>> -Zhengyu
>>
>> On 10/20/2017 05:20 AM, Roman Kennke wrote:
>>> Am 19.10.2017 um 12:27 schrieb Roman Kennke:
>>>> Hi all,
>>>>
>>>> I want to outline the problem that we have with OOM during 
>>>> evacuation, and summarize what we have so far in order to handle OOM 
>>>> during evac correctly, and describe one way that I have in mind how 
>>>> to do it.
>>>>
>>>> The problem appears when a Java thread gets into a write-barrier 
>>>> and fails to evacuate the object because it has run out of memory 
>>>> (e.g. both GCLAB and shared evac exhausted). In this case we still 
>>>> need to ensure that we return a single object, even if it's in 
>>>> from-space, otherwise we risk inconsistency (a subsequent write may 
>>>> end up in the wrong object copy). However, there might still be 
>>>> another Java thread which succeeds in evacuating that same object at 
>>>> about the same time because it still has GCLAB left.
>>>>
>>>> Here's what we came up with so far in IRC discussions:
>>>>
>>>> - Throw OOME. This is the absolute minimum solution, and we should 
>>>> probably just do that right now until we have implemented a better 
>>>> one (and we might even use it as a fallback for solutions that are 
>>>> not 100% safe). This is better IMO than pretending we're ok and 
>>>> risking heap inconsistencies.
>>>>
>>>> - Make write-barrier slow-path/runtime-calls non-leaf calls. Then we 
>>>> could just safepoint and do a full-GC *while we're in the barrier*. 
>>>> This would be the most correct solution. Unfortunately it would 
>>>> make it very hard to optimize the write-barriers in C2, and 
>>>> the performance impact is likely not acceptable. We may try to do a 
>>>> prototype (again) and see how far Roland can take it though. The 
>>>> problem here is that we need debug info at the call sites, and C2 
>>>> maintains debug info only at certain points in the ideal graph. 
>>>> Consequently, we can move write-barriers only to such points and not 
>>>> as freely as we can do now.
>>>>
>>>> - Keep an evacuation reserve that we use only for evacuations or 
>>>> maybe even only for write-barriers or maybe even only as fallback 
>>>> for write-barriers that OOM'ed. This very likely solves it for 
>>>> 99.999.. % of the cases, but discussions on IRC have shown that it 
>>>> is very hard to come up with a 100% safe upper bound for this 
>>>> reserve size that allows us to theoretically prove that OOM during 
>>>> evac cannot ever happen. We might combine this with solution #1 
>>>> though: i.e. make it safe in all but the most extreme pathological 
>>>> cases, and throw OOME if we hit a wall. I am still not very happy 
>>>> with the prospect of failing in extremely rare cases, possibly in 
>>>> production environments under high pressure.
>>>>
>>>> - Extend the brooks pointer protocol to prevent concurrent evacs. 
>>>> Let me outline my idea here:
>>>>
>>>> If a write barrier runs out of memory, we need to prevent other 
>>>> threads from successfully evacuating 'our' object. We can do so by 
>>>> CASing an 'impossible' value into its brooks pointer: this 
>>>> guarantees that either other threads fail to install a brooks ptr, 
>>>> *OR* our CAS fails and gives us the other thread's copy (which would 
>>>> be fine too). Problem: we need 
>>>> to deal with that special value everywhere else, most importantly in 
>>>> read-barriers. The best thing I could come up with so far is to use 
>>>> $OBJECT_ADDR | 1 as blocker value, i.e. CAS-set the lowest bit in 
>>>> the self-pointing brooks ptr. This can easily be decoded in 
>>>> read-barriers (and all other relevant code) by masking out that 
>>>> lowest bit using AND ~1. Full-GC would fix the brooks ptr to normal 
>>>> value. I don't have a good feeling for what the performance impact would 
>>>> be. Something similar happens for decoding compressed oops, and that 
>>>> is commonly accepted (but is less frequent). The actual 
>>>> brooks-ptr-load probably dominates the masking and we wouldn't even 
>>>> really notice? On the upside, this makes the oom_during_evac() path 
>>>> truly non-blocking: we don't need to wait for GC workers, for other 
>>>> Java threads, for evacuation to be turned off, or any such thing 
>>>> (which also means it truly complies with being non-blocking for 
>>>> leaf-calls). I believe it's a correct solution too: 
>>>> no from-space copy can slip through it. I can imagine coming up 
>>>> with a prototype for this and making it optional (by a flag) so that 
>>>> we can measure its impact or even give the option to combine it with 
>>>> any of the other options we have (e.g. evac-reserve).
>>>>
>>>>
>>> So, I made a prototype for this and ran SPECjvm with and without it.
>>>
>>> First with a clean checkout build:
>>>
>>> https://paste.fedoraproject.org/paste/8W8tKz5WGlvaT5iR5cUbFA
>>>
>>> And this with additional masking in the read barrier:
>>>
>>> https://paste.fedoraproject.org/paste/WT4TRm25gAbdsYSZfJqt4g
>>>
>>> First some things to notice:
>>>
>>> - compiler regularly crashes with an NPE. -ShenandoahOptimizeFinals 
>>> plus Roland's recent patch for this seems to make it go away. I ran 
>>> all benchmarks with that patch and flag applied.
>>> - serial's performance pattern is totally erratic with huge variance 
>>> between 3K and 12K. We can disregard this number but need to look 
>>> into it.
>>> - XML crashes hard inside a C2 compiled method
>>>
>>> other than that, I see no significant impact of the masking read 
>>> barrier.
>>>
>>> We also might want to run some memory-reading gcbench tests. In case 
>>> anybody wants to try that, here is the patch:
>>>
>>> http://cr.openjdk.java.net/~rkennke/safe-oom-during-evac/webrev.00/ 
>>>
>>> If nobody screams stop, I'm going to add some test machinery 
>>> (+ShenandoahOOMDuringEvacALot) and additional testcases (probably 
>>> hook up to gcold and gcbasher), and then RFR/RFC the patch.
>>>
>>> Thoughts?
>>>
>>> Roman
>>>
> 
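
For readers following the CAS-masking idea quoted above, here is a minimal 
sketch of the protocol in Java. It is only a model under stated assumptions, 
not the code from the webrev: the class name, the method names, and the use 
of an AtomicLong to stand in for an object's brooks pointer word are all 
hypothetical, and addresses are assumed to be at least 2-byte aligned so 
that the lowest bit is free.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical model: an AtomicLong plays the role of one object's brooks pointer. */
final class BrooksMaskSketch {
    // Assumed blocker encoding from the thread: $OBJECT_ADDR | 1.
    static final long BLOCKED_BIT = 1L;

    /** What a masking read barrier would do: strip the lowest bit (AND ~1). */
    static long decode(long fwd) {
        return fwd & ~BLOCKED_BIT;
    }

    /**
     * Thread that still has space: try to install the to-space copy.
     * Returns the address every thread must use from now on.
     */
    static long tryEvacuate(AtomicLong brooks, long self, long copy) {
        long witness = brooks.compareAndExchange(self, copy);
        // Success: our copy wins. Failure: either another copy won,
        // or the object was blocked and stays in from-space.
        return witness == self ? copy : decode(witness);
    }

    /**
     * Thread that ran out of memory: try to block evacuation by
     * CAS-setting the lowest bit in the self-pointing brooks pointer.
     */
    static long blockOnOom(AtomicLong brooks, long self) {
        long witness = brooks.compareAndExchange(self, self | BLOCKED_BIT);
        // Success: the from-space object is now the one true copy.
        // Failure: use whatever the winner installed (a copy, or an
        // already-blocked self pointer, which decodes to self).
        return witness == self ? self : decode(witness);
    }
}
```

Whichever CAS wins, both threads end up with the same address, and neither 
waits for the other. In this model, a full GC restoring the plain 
self-pointer would simply be a store of decode(value) back into the word.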


More information about the shenandoah-dev mailing list