RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning
Mikael Gerdin
mikael.gerdin at oracle.com
Thu May 7 10:15:58 UTC 2015
Hi Erik,
On 2015-05-06 17:01, Erik Österlund wrote:
> Hi everyone,
>
> I just read through the discussion and thought I’d share a potential solution that I believe would solve the problem.
>
> I previously implemented something very similar for G1, to get rid of the StoreLoad fence in the barrier, which suffered from similar symptoms.
> The idea is to process cards in batches instead of one by one and to issue a global store serialization event (e.g. using mprotect on a dummy page) when cleaning. It worked pretty well, but after Thomas Schatzl ran some benchmarks we decided the gain wasn't worth the trouble for G1, since it only fences rarely, when encountering inter-regional pointers (premature optimization). But maybe here it happens more often and getting rid of the fence is more worth the trouble?
>
> Here is a proposed new algorithm candidate (small change to algorithm in bug description):
>
> mutator (exactly as before):
>
> x.a = something
> StoreStore
> if (card[@x.a] != dirty) {
>   card[@x.a] = dirty
> }
>
> preclean:
>
> for card in batched_cards {
>   if (card[@x.a] == dirty) {
>     card[@x.a] = precleaned
>   }
> }
>
> global_store_fence()
>
> for card in batched_cards {
>   read x.a
> }
>
> The global fence will incur some local overhead (quite ouchy), plus some global overhead from fencing on all remote CPUs the process is scheduled to run on (not necessarily all of them), using cross calls in the kernel to invalidate remote TLB entries (not so ouchy). By batching the cards, this "global" cost is amortized arbitrarily, so even on systems with a ridiculous number of CPUs it's probably still a good idea. It is also possible to let multiple precleaning CPUs share the same global store fence using timestamps, since it is in fact global. This guarantees scalability on many-core systems but is a bit less straightforward to implement.
>
> If you are interested in this and think it’s a good idea, I could try to patch a solution for this, but I would need some help benchmarking this in your systems so we can verify it performs the way I hope.
I think this is a good idea. The problem is asymmetric in that the CMS
thread should be fine with taking a larger local overhead, batching the
setting of cards to precleaned and then scanning the cards later.
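Roughly something like the following, just as a sketch of the shape (the card values and helper functions here are placeholders, not the actual CMS/card table code):

  #include <cstddef>

  // Sketch only: values and helpers are illustrative placeholders.
  const unsigned char dirty_card      = 0;
  const unsigned char precleaned_card = 1;

  void scan_card(size_t i);        // rescan the objects covered by card i
  void global_store_fence();       // e.g. the mprotect trick, see below

  void preclean_batch(unsigned char* cards, size_t n) {
    // Pass 1: flip every dirty card in the batch to precleaned.
    for (size_t i = 0; i < n; i++) {
      if (cards[i] == dirty_card) {
        cards[i] = precleaned_card;
      }
    }
    // One global store serialization for the whole batch, instead of a
    // StoreLoad fence per mutator card mark.
    global_store_fence();
    // Pass 2: scan the batch. A mutator that raced with pass 1 has either
    // made its reference store visible by now, or will still observe a
    // non-dirty card and re-dirty it.
    for (size_t i = 0; i < n; i++) {
      if (cards[i] == precleaned_card) {
        scan_card(i);
      }
    }
  }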
Do you know what global_store_fence() would look like on different CPU
architectures?
The VM already uses this sort of synchronization for thread state
transitions; see the references to UseMembar, os::serialize_thread_states
and os::serialize_memory. Perhaps that code can be reused somehow?
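On POSIX, the underlying trick boils down to roughly the following (a simplified sketch with placeholder names; the real VM code also takes a lock and handles the serialize page differently):

  #include <sys/mman.h>
  #include <unistd.h>

  static void* dummy_page;   // one private anonymous page, mapped at startup

  void init_store_fence() {
    dummy_page = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE),
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }

  void global_store_fence() {
    // Revoking and restoring write permission on a mapped page makes the
    // kernel cross-call the other CPUs the process runs on to invalidate
    // their TLB entries, which serializes outstanding stores on those CPUs.
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    mprotect(dummy_page, page_size, PROT_READ);
    mprotect(dummy_page, page_size, PROT_READ | PROT_WRITE);
  }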
/Mikael
>
> Thanks,
> /Erik
>
>
>> On 06 May 2015, at 14:52, Mikael Gerdin <mikael.gerdin at oracle.com> wrote:
>>
>> Hi Vitaly,
>>
>> On 2015-05-06 14:41, Vitaly Davidovich wrote:
>>> Mikael's suggestion was to make the mutator check for !clean and then mark
>>> it dirty. If it sees a stale dirty, it will write dirty again, no? Today's
>>> code has this problem because it checks for !dirty, but I thought the
>>> suggested change would prevent that.
>>
>> Unfortunately I don't think my suggestion would solve anything.
>>
>> If the conditional card mark writes dirty again when it sees a stale dirty value, it's not really solving the false sharing problem.
>>
>> The problem is not the value that the precleaner writes to the card entry; it's that the mutator may still see the old "dirty" value, because the precleaner's overwrite is not necessarily visible to the mutator thread yet.
>>
>> /Mikael
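For clarity, the problematic interleaving with today's one-by-one precleaning looks roughly like this (schematic only, placeholder names):

  //   Mutator:                          Precleaner (per card, today):
  //                                       reads card == dirty
  //                                       card = precleaned           (A)
  //     obj->field = ref                  rescans objects on the card (B)
  //     StoreStore
  //     reads card, still sees the
  //       stale "dirty" value because
  //       store (A) is not yet visible,
  //       so the conditional card mark
  //       skips re-dirtying
  //
  // If (B) does not observe obj->field = ref and the mutator never re-dirties
  // the card, the new reference is missed.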
>>
>>
>>>
>>> sent from my phone
>>> On May 6, 2015 4:53 AM, "Andrew Haley" <aph at redhat.com> wrote:
>>>
>>>> On 05/05/15 20:51, Vitaly Davidovich wrote:
>>>>> If the mutator doesn't see "clean" due to staleness, won't it just mark it
>>>>> dirty "unnecessarily" with Mikael's suggestion?
>>>>
>>>> No. The mutator may see a stale "dirty" and not write anything. At least
>>>> I haven't seen anything that will certainly prevent that from happening.
>>>>
>>>> Andrew.
>>>>
>>>>
>>>>
>