RFC: Safe OOM during evac

Fri Oct 20 12:04:31 UTC 2017

Hi Roman,

With this patch, we should not need "fixup_roots", right?

Thanks,

-Zhengyu

On 10/20/2017 05:20 AM, Roman Kennke wrote:
> Am 19.10.2017 um 12:27 schrieb Roman Kennke:
>> Hi all,
>>
>> I want to outline the problem that we have with OOM during evacuation, 
>> and summarize what we have so far in order to handle OOM during evac 
>> correctly, and describe one way that I have in mind how to do it.
>>
>> The problem appears when a Java thread gets into a write-barrier and 
>> fails to evacuate the object because it has run out of memory (e.g. both 
>> GCLAB and shared evac exhausted). In this case we still need to ensure 
>> that we return a single object, even if it's in from-space, otherwise we 
>> risk inconsistency (a subsequent write may end up in the wrong object 
>> copy). However, there might still be another Java thread which succeeds 
>> in evacuating that same object at about the same time because it still 
>> has GCLAB left.
>>
>> Here's what we came up with so far in IRC discussions:
>>
>> - Throw OOME. This is the absolute minimum solution, and we should 
>> probably just do that right now until we have implemented a better one 
>> (and we might even use it as a fallback for solutions that are not 100% 
>> safe). This is better IMO than pretending we're ok and risking heap 
>> inconsistencies.
>>
>> - Make write-barrier slow-path/runtime-calls non-leaf calls. Then we 
>> could just safepoint and do a full-GC *while we're in the barrier*. 
>> This would be the most correct solution. Unfortunately, it would 
>> make it very hard to optimize the write-barriers in C2, and the 
>> performance impact is likely not acceptable. We may try to do a 
>> prototype (again) and see how far Roland can take it though. The 
>> problem here is that we need debug info at the call sites, and C2 
>> maintains debug info only at certain points in the ideal graph. 
>> Consequently, we can move write-barriers only to such points and not 
>> as freely as we can do now.
>>
>> - Keep an evacuation reserve that we use only for evacuations or maybe 
>> even only for write-barriers or maybe even only as fallback for 
>> write-barriers that OOM'ed. This very likely solves it for 
>> 99.999...% of the cases, but discussions on IRC have shown that it is 
>> very hard to come up with a 100% safe upper bound for this reserve 
>> size that would allow us to prove that OOM during evac can never 
>> happen. We might combine this with solution #1 though: 
>> i.e. make it safe in all but the most extreme pathological cases, and 
>> throw OOME if we hit a wall. I am still not very happy with the 
>> prospect of failing in extremely rare cases, possibly in production 
>> environments under high pressure.
>>
>> - Extend the brooks pointer protocol to prevent concurrent evacs. Let 
>> me outline my idea here:
>>
>> If a write barrier runs into OOM, we need to prevent other threads from 
>> successfully evacuating 'our' object. We can do so by CASing an 
>> 'impossible' value into its brooks pointer: either the CAS succeeds and 
>> other threads then fail to install a brooks ptr, *OR* it fails and gives 
>> us the other thread's copy (which would be fine too). Problem: we need to 
>> deal with that special value everywhere else, most importantly in 
>> read-barriers. The best thing I could come up with so far is to use 
>> $OBJECT_ADDR | 1 as blocker value, i.e. CAS-set the lowest bit in the 
>> self-pointing brooks ptr. This can easily be decoded in read-barriers 
>> (and all other relevant code) by masking out that lowest bit using AND 
>> ~1. Full-GC would reset the brooks ptr to its normal value. I don't have a 
>> good feel for what the performance impact would be. Something similar 
>> happens for decoding compressed oops, and that is commonly accepted 
>> (but is less frequent). The actual brooks-ptr-load probably dominates 
>> the masking and we wouldn't even really notice? On the upside, this 
>> makes the oom_during_evac() path truly non-blocking: we don't need to 
>> wait for GC workers, for other Java threads, for evacuation to be 
>> turned off, or any such thing (which also means it truly complies 
>> with being non-blocking for leaf calls). I believe it's 
>> a correct solution too: no from-space copy can slip through it. I can 
>> come up with a prototype for this and make it optional (by 
>> a flag) so that we can measure its impact or even give the option to 
>> combine it with any of the other options we have (e.g. evac-reserve).
>>
>>
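
For readers following along, the blocker-bit idea quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the webrev: the Obj type and the names read_barrier, try_install_copy, oom_during_evac and reset_forwarding are hypothetical stand-ins, and the brooks pointer is modeled as a plain atomic word inside the object rather than HotSpot's real object layout.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a heap object carrying a brooks (forwarding) pointer.
// Objects are at least 8-byte aligned here, so the lowest pointer bit is free.
struct Obj {
    std::atomic<std::uintptr_t> fwd;  // points to itself until the object is evacuated
    int payload;
    explicit Obj(int p) : fwd(reinterpret_cast<std::uintptr_t>(this)), payload(p) {}
};

// The 'impossible' value is OBJECT_ADDR | 1.
constexpr std::uintptr_t kBlockedBit = 1;

// Read barrier: load the brooks pointer and mask out the blocker bit (AND ~1).
inline Obj* read_barrier(Obj* o) {
    std::uintptr_t f = o->fwd.load(std::memory_order_acquire);
    return reinterpret_cast<Obj*>(f & ~kBlockedBit);
}

// A thread that managed to allocate a copy tries to install it.
// Returns the object every thread must use from now on.
inline Obj* try_install_copy(Obj* from, Obj* copy) {
    std::uintptr_t expected = reinterpret_cast<std::uintptr_t>(from);
    if (from->fwd.compare_exchange_strong(expected,
                                          reinterpret_cast<std::uintptr_t>(copy),
                                          std::memory_order_acq_rel,
                                          std::memory_order_acquire)) {
        return copy;  // we won: our copy is the canonical one
    }
    // We lost: 'expected' now holds another thread's copy, or from|1 if a
    // thread that ran out of memory blocked evacuation. Masking handles both.
    return reinterpret_cast<Obj*>(expected & ~kBlockedBit);
}

// A thread that could not allocate a copy blocks evacuation of 'from'.
inline Obj* oom_during_evac(Obj* from) {
    std::uintptr_t self = reinterpret_cast<std::uintptr_t>(from);
    std::uintptr_t expected = self;
    if (from->fwd.compare_exchange_strong(expected, self | kBlockedBit,
                                          std::memory_order_acq_rel,
                                          std::memory_order_acquire)) {
        return from;  // blocked: everybody keeps using the from-space object
    }
    // Somebody else was faster: either a copy exists or it is already blocked.
    return reinterpret_cast<Obj*>(expected & ~kBlockedBit);
}

// What a later full GC would do in this model: restore the plain self-pointer.
inline void reset_forwarding(Obj* o) {
    o->fwd.store(reinterpret_cast<std::uintptr_t>(o), std::memory_order_release);
}
```

In this model both racing threads always end up agreeing on one object: whichever CAS lands first decides, and neither path waits on another thread.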
> So, I made a prototype for this and ran SPECjvm with and without it.
> 
> First with a clean checkout build:
> 
> https://paste.fedoraproject.org/paste/8W8tKz5WGlvaT5iR5cUbFA
> 
> And this with additional masking in the read barrier:
> 
> https://paste.fedoraproject.org/paste/WT4TRm25gAbdsYSZfJqt4g
> 
> First some things to notice:
> 
> - compiler regularly crashes with an NPE. -ShenandoahOptimizeFinals 
> plus Roland's recent patch for this seems to make it go away. I ran all 
> benchmarks with that patch and flag applied.
> - serial's performance pattern is totally erratic, with huge variance 
> between 3K and 12K. We can disregard this number, but need to look into it
> - XML crashes hard inside a C2 compiled method
> 
> Other than that, I see no significant impact of the masking read barrier.
> 
> We also might want to run some memory-reading gcbench tests. In case 
> anybody wants to try that, here is the patch:
> 
> http://cr.openjdk.java.net/~rkennke/safe-oom-during-evac/webrev.00/ 
> 
> If nobody screams stop, I'm going to add some test machinery 
> (+ShenandoahOOMDuringEvacALot) and additional testcases (probably hook 
> up to gcold and gcbasher), and then RFR/RFC the patch.
> 
> Thoughts?
> 
> Roman
>