RFR (XS): 8033922: G1: Back out 8033601 and go back to use the to-obj for chunked arrays.

Fri Feb 7 18:42:38 UTC 2014

Hi guys,

First, apologies for this. :-) We have done a bunch of G1 testing here 
with said fix (it's been in our repo for a few months) and we also 
tested it with production loads without any failures. But of course we 
don't have all your tests, so clearly we can't reach your level of 
coverage. FWIW, I'm glad you caught the problem since we know that we 
have to come up with a fix (at least, for our repo).

I agree with Bengt here, best approach right now is to undo the fix and 
I'll work on coming up with an alternative. I'll post some suggestions 
on the CR.

Tony

On 2/7/14, 8:24 AM, Bengt Rutisson wrote:
>
> Hi Thomas,
>
> Thanks for the quick review!
>
> Yes, there are ways to fix it, but I don't think it is worth the 
> effort since the original fix was just a fix to make G1 look more like 
> the other collectors. It was not a correctness or performance fix.
>
> Bengt
>
> On 2/7/14 1:51 PM, Thomas Schatzl wrote:
>> Hi,
>>
>> On Fri, 2014-02-07 at 13:29 +0100, Bengt Rutisson wrote:
>>> Hi all,
>>>
>>> Can I have a couple of reviews for backing out the changes for 8033601.
>>> Those changes cause verification failures in our nightly testing.
>>>
>>> http://cr.openjdk.java.net/~brutisso/8033922/webrev.00/
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8033922
>>>
>>> Background from the CR:
>>>
>>> The fix for 8033601 changed the code for chunked arrays to use the from
>>> obj length as index rather than the to object.
>>>
>>> This was nice since it made G1 look more like the other collectors.
>>>
>>> However, another difference in G1 is that it writes the forward pointer
>>> before it writes the to-object. That means that there are times when an
>>> object is forwarded, but there is no to-object image that can be 
>>> trusted.
>>>
>>> One such place is
>>>
>>> void G1ParCopyClosure<barrier, do_mark_object>
>>> ::do_oop_work(T* p)
>>>
>>> We can find that the object is forwarded, but it may not have been
>>> marked yet. So, we end up calling mark_forwarded_object() which needs
>>> the size to call _cm->grayRoot(to_obj, (size_t) from_obj->size(),
>>> _worker_id);
>>>
>>> mark_forwarded_object() has a comment that tries to describe this 
>>> situation:
>>>
>>>     // The object might be in the process of being copied by another
>>>     // worker so we cannot trust that its to-space image is
>>>     // well-formed. So we have to read its size from its from-space
>>>     // image which we know should not be changing.
>>>     _cm->grayRoot(to_obj, (size_t) from_obj->size(), _worker_id);
>>>
>>> So, just as the comment say, we can not trust the to-space image.
>>> However, after the change for 8033601 we can not trust the from-space
>>> image either.
>> I.e. the problem is that there is a small time window where another
>> thread might read a wrong length field.
>>
>>> This seems like a show-stopper for the suggested change for 8033601 so
>>> we need to back it out.
>> An alternative would be to write/copy the object header (with the length
>> field) in the to_obj before forwarding (the atomic forwarding should
>> take care of memory ordering issues).
>>
>> This seems to be too much hassle though, as it has at least one
>> disadvantage: it widens the opportunity for multiple threads working on
>> the same object (potentially more undo).
>>
>> Another alternative could be to not continue with the rest of the body
>> of G1ParCopyClosure<barrier, do_mark_object>::do_oop_work() if a thread
>> in copy_to_survivor_space() looses the race for installing the forward
>> pointer.
>>
>> So I agree with undoing this change at least for now unless somebody
>> (Tony :)) comes up with a clearly better working solution.
>>
>> Thanks,
>>    Thomas
>>
>>
>

-- 
Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis
tprintezis at twitter.com