RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning

Mikael Gerdin mikael.gerdin at oracle.com
Tue May 12 13:17:38 UTC 2015



On 2015-05-12 15:05, Aleksey Shipilev wrote:
> On 11.05.2015 16:41, Andrew Haley wrote:
>> On 05/11/2015 12:33 PM, Erik Österlund wrote:
>>> Hi Andrew,
>>>
>>>> On 11 May 2015, at 11:58, Andrew Haley <aph at redhat.com> wrote:
>>>>
>>>> On 05/11/2015 11:40 AM, Erik Österlund wrote:
>>>>
>>>>> I have heard claims like this, that such a mechanism would not
>>>>> work on RMO, but never got an explanation for why it would only
>>>>> work on TSO.  Could you please elaborate?  I studied the kernel
>>>>> sources for a bunch of architectures, and as far as I can see it
>>>>> all looks fine for RMO too.
>>>>
>>>> Dave Dice himself told me that the algorithm is not in general safe
>>>> for non-TSO.  Perhaps, though, it is safe in this particular case.  Of
>>>> course, I may be misunderstanding him.  I'm not sure of his reasoning
>>>> but perhaps we should include him in this discussion.
>>>
>>> I see. It would be interesting to hear his reasoning, because it is
>>> not clear to me.
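
(For context, the race under discussion is the classic Dekker-style
interaction between the mutator's conditional card mark and the
precleaner clearing the card. A minimal sketch, with illustrative
names and card values rather than HotSpot's actual encoding:

   #include <atomic>
   #include <cstdint>

   // Illustrative card values, not HotSpot's actual ones.
   constexpr uint8_t DIRTY = 0;
   constexpr uint8_t CLEAN = 1;

   std::atomic<int>     field{0};      // some heap reference field
   std::atomic<uint8_t> card{DIRTY};   // its card table entry

   // Mutator side with UseCondCardMark: store the reference, then
   // mark the card only if it is not already dirty.
   void mutator_store() {
     field.store(42, std::memory_order_relaxed);
     // Without a StoreLoad fence here, the load below may complete
     // while the store above still sits in the store buffer, so it
     // can observe a stale DIRTY value.
     if (card.load(std::memory_order_relaxed) != DIRTY) {
       card.store(DIRTY, std::memory_order_relaxed);
     }
   }

   // CMS precleaner: clean the card, then rescan objects on it.
   void precleaner() {
     card.store(CLEAN, std::memory_order_relaxed);
     // If the rescan runs before the mutator's field store becomes
     // visible, it misses the new reference; the mutator in turn saw
     // the stale DIRTY card and skipped re-dirtying it, so the
     // update is never revisited.
   }

Note that the problematic reordering is store-then-load, which even
TSO machines allow via the store buffer.)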
>>>
>>>> From my point of view, I can't see a strong argument for doing this
>>>> on AArch64.  StoreLoad barriers are not fantastically expensive
>>>> there, so it may not be worth going to such extremes.  The cost of
>>>> a StoreLoad barrier doesn't seem to be much more than that of the
>>>> StoreStore we have to have anyway.
>>>
>>> Yeah, regarding performance I’m not sure when removing these fences
>>> is worth it, or on what hardware.
>>
>> Your algorithm (as I understand it) trades a moderately expensive (but
>> purely local) operation for a very expensive global operation, albeit
>> with much lower frequency.  It's not clear to me how much we value
>> continuous operation versus faster operation with occasional global
>> stalls.  I suppose it must be application-dependent.
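
The "moderately expensive (but purely local)" operation Andrew refers
to would be a per-store fence in the barrier itself. Continuing the
illustrative sketch above (again, not the actual HotSpot barrier
code):

   void mutator_store_fenced() {
     field.store(42, std::memory_order_relaxed);
     // The purely local fix: a full fence (providing StoreLoad)
     // drains the store buffer before the card is examined, so a
     // concurrent preclean can no longer be missed.
     std::atomic_thread_fence(std::memory_order_seq_cst);
     if (card.load(std::memory_order_relaxed) != DIRTY) {
       card.store(DIRTY, std::memory_order_relaxed);
     }
   }

The asymmetric alternative removes this fence from every mutator
store and instead has the collector perform one expensive global
serialization before it relies on the card values.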
>
> Okay, Dice's asymmetric trick is nice. In fact, that is arguably what
> Parallel is using already: it serializes the mutator stores by stopping
> the mutators at a safepoint. Using mprotect and TLB tricks as the
> serialization actions is cute and dandy.
>
> However, I have doubts that employing a system-wide synchronization
> mechanism for a concurrent collector is a good thing when we can't
> predict and control its long-term performance. For example, we are
> basically at the mercy of the underlying OS's mprotect performance.
> There are industrial GCs that rely on OS performance (*cough*
> *cough*), and you can see what those require to guarantee performance.

Just to be clear, this type of synchronization is in fact already 
implemented in the JVM to synchronize thread states for the safepoint 
protocol, so it's not exactly new and unexplored territory.
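
For reference, the rough shape of that technique (this is not 
HotSpot's actual code and all names here are made up): each mutator 
performs a cheap store to a shared page instead of a fence, and the 
coordinating thread toggles the page protection, using the resulting 
TLB shootdown as a fence on every mutator's behalf.

   #include <sys/mman.h>
   #include <unistd.h>

   static void* serialize_page;   // hypothetical shared page

   void init_serialize_page() {
     serialize_page = mmap(nullptr, sysconf(_SC_PAGESIZE),
                           PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   }

   // Mutator side: a plain store replaces the StoreLoad fence.
   void mutator_serialization_point(int thread_index) {
     ((volatile int*)serialize_page)[thread_index % 1024] = 1;
   }

   // Coordinator side: the expensive global operation. The protection
   // change forces a TLB shootdown on all CPUs, which serializes any
   // mutator stores still in flight. (A real implementation also has
   // to handle the fault a mutator takes if it writes the page
   // mid-transition; that is omitted here.)
   void global_serialize() {
     long pagesize = sysconf(_SC_PAGESIZE);
     mprotect(serialize_page, pagesize, PROT_READ);
     mprotect(serialize_page, pagesize, PROT_READ | PROT_WRITE);
   }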

However, it's not clear to me that the code complexity involved in 
using that type of synchronization for conditional card marking in CMS 
is worth it.

>
> Also, given that the problem is specific to CMS, which arguably goes
> away in favor of G1, I would think that introducing special-case
> barriers for CMS in the mutator code is a sane interim solution.

I agree.

>
> Especially if we can backport the G1-like barrier "filtering" to the
> CMS case? If I read this thread right, Erik and Thomas concluded there
> is no clear benefit to introducing the mprotect-like mechanics with
> G1, which probably means the overheads are bearable with appropriate
> mutator-side changes.

I don't think it would be easy to implement barrier "filtering" in CMS.
Keep in mind that even before the StoreLoad was added to G1's barriers 
they were fairly heavy-weight. CMS's barriers are not; if we start to 
add conditionals and StoreLoad fences to them, the runtime overhead may 
increase by more than it did when we added the StoreLoad to G1.
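
To give a feel for the difference, a sketch continuing the example
above (illustrative only, not the actual barriers; same_region() is a
hypothetical helper standing in for G1's cross-region check):

   // A 2 MB "region" check, purely for illustration.
   bool same_region(void* a, void* b) {
     return ((uintptr_t)a >> 21) == ((uintptr_t)b >> 21);
   }

   // CMS-style post barrier today: essentially a single store.
   void cms_post_barrier(volatile uint8_t* card_addr) {
     *card_addr = DIRTY;
   }

   // A G1-style "filtered" equivalent: several branches plus a
   // StoreLoad fence before the conditional mark.
   void filtered_post_barrier(void* field_addr, void* new_val,
                              volatile uint8_t* card_addr) {
     if (new_val == nullptr) return;                // filter null stores
     if (same_region(field_addr, new_val)) return;  // filter local stores
     std::atomic_thread_fence(std::memory_order_seq_cst);
     if (*card_addr != DIRTY) {
       *card_addr = DIRTY;
     }
   }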

/Mikael

>
> Thanks,
> -Aleksey
>


