RFR/RFC: Make OOM-during-evacuation race-free

Mon Oct 23 16:31:47 UTC 2017

On 23.10.2017 at 18:22, Aleksey Shipilev wrote:
> On 10/23/2017 05:59 PM, Roman Kennke wrote:
>> Yes I see all that. But we have found out that this is a correctness issue, and that trumps
>> performance, even if it's just a very minuscule case.
> This is the correctness issue on the cancellation path again. And we have lots of band-aids there
> already, and this is yet another band-aid.
I would disagree with that. It is a fix. It makes the OOM race problem 
disappear, and it also makes the write-barrier leaf-call issue 
disappear. But yes, I don't dispute that it's fairly intrusive and 
potentially performance-damaging. I'll try to get some gc-bench numbers.

>   What makes it different from other band-aids is that it
> touches the code we *know* is performance critical. It is a nice exercise, but a band-aid
> nevertheless. Noisy performance data may lull us into believing the performance impact is okay.
>
>> If we can come up with another solution that makes OOM-during-evac 100% correct, I'm all for it.  I'm
>> not fixed on my proposal, I just wanted to throw it out for discussion and bring something to the
>> table that we can do some performance tests with.
> This fwdptr mangling stuff is maybe our fallback plan if, say, the reservation scheme (which would
> make the whole cancellation issue go away) does not work out.
>
> It makes little sense in my mind to allocate time for fallback plans that have bad theoreticals
> before we work out and try the fix that has good theoreticals. We are still at the stage in the
> project where we don't have to rush the intrusive band-aids out. We can actually take time to
> reimplement parts of the collector solving the issue "properly".
>
> I do wonder if instead of mangling the bits, we could reserve a "shadow" uncommitted memprotected
> heap, and set the fwdptr to that? Then we can intercept the SEGVs coming to that shadow heap, and
> redirect them to proper objects. This leaves the usual codepath the same, without ANDs, and the
> failure path would experience read storms -- but why would that matter, if we are on the failure path?
That sounds very interesting too. I wonder how that redirection would 
work though. I.e. how would you get the correct oop and patch it into 
the failing code path and return...
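
For anyone following along, the "ANDs" above are the masking step that a 
mangled fwdptr would need on every read. The thread does not spell out 
the encoding, so here is only a rough sketch under my own assumptions: a 
tag in the low bit of the forwarding word, objects at least 2-byte 
aligned, and made-up names (Obj, EVAC_FAILED_BIT, resolve) that are not 
taken from the Shenandoah sources.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration only; none of these names come from Shenandoah.
// Assumes objects are at least 2-byte aligned, so bit 0 of an address is free.
constexpr std::uintptr_t EVAC_FAILED_BIT = 0x1;

struct Obj {
    std::uintptr_t fwdptr;  // address of the current copy, plus the optional tag bit
    int payload;
};

// A freshly allocated object forwards to itself.
inline void init_fwdptr(Obj* o) {
    o->fwdptr = reinterpret_cast<std::uintptr_t>(o);
}

// Tag the forwarding word to record that evacuation of this object failed.
inline void mark_evac_failed(Obj* o) {
    o->fwdptr |= EVAC_FAILED_BIT;
}

inline bool evac_failed(const Obj* o) {
    return (o->fwdptr & EVAC_FAILED_BIT) != 0;
}

// The extra AND is paid on every resolve, including the common path
// where the tag is never set. That is the cost being debated.
inline Obj* resolve(const Obj* o) {
    return reinterpret_cast<Obj*>(o->fwdptr & ~EVAC_FAILED_BIT);
}
```

With the shadow-heap idea, resolve() would stay a plain load with no 
mask, and only the failure path would pay (via the SEGV handler).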

In the meantime I'll extract the ShenandoahOOMDuringEvacALot part and 
post it for RFR; this seems useful in any case and should not be 
controversial.

Roman