RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning

Erik Österlund erik.osterlund at lnu.se
Sat May 9 11:38:22 UTC 2015


Hi Mikael,

> On 07 May 2015, at 11:15, Mikael Gerdin <mikael.gerdin at oracle.com> wrote:
> 
> Hi Erik,
> 
> On 2015-05-06 17:01, Erik Österlund wrote:
>> Hi everyone,
>> 
>> I just read through the discussion and thought I’d share a potential solution that I believe would solve the problem.
>> 
>> Previously I implemented something very similar for G1, to get rid of the storeload fence in a barrier that suffered from similar symptoms.
>> The idea is to process cards in batches instead of one by one, and to issue a global store serialization event (e.g. using mprotect on a dummy page) when cleaning. It worked pretty well, but after Thomas Schatzl ran some benchmarks we decided the gain wasn’t worth the trouble for G1, since it fences only rarely, when encountering interregional pointers (premature optimization). But maybe here it happens more often and getting rid of the fence is more worth the trouble?
>> 
>> Here is a proposed new algorithm candidate (small change to algorithm in bug description):
>> 
>> mutator (exactly as before):
>> 
>> x.a = something
>> StoreStore
>> if (card[@x.a] != dirty) {
>>   card[@x.a] = dirty
>> }
>> 
>> preclean:
>> 
>> for card in batched_cards {
>>   if (card[@x.a] == dirty) {
>>     card[@x.a] = precleaned
>>   }
>> }
>> 
>> global_store_fence()
>> 
>> for card in batched_cards {
>>   read x.a
>> }
>> 
>> The global fence will incur some local overhead (quite ouchy) and some global overhead, fencing on all remote CPUs the process has been scheduled to run on (not necessarily all of them) via cross calls in the kernel that invalidate remote TLB entries (not so ouchy). By batching the cards, this “global” cost is amortized arbitrarily, so even on systems with a ridiculous number of CPUs it’s probably still a good idea. It is also possible to let multiple precleaning CPUs share the same global store fence using timestamps, since it is in fact global. This guarantees scalability on many-core systems but is a bit less straightforward to implement.
>> 
>> If you are interested in this and think it’s a good idea, I could try to patch a solution for this, but I would need some help benchmarking this in your systems so we can verify it performs the way I hope.
> 
> I think this is a good idea. The problem is asymmetric in that the CMS thread should be fine with taking a larger local overhead, batching the setting of cards to precleaned and then scanning the cards later.

I’m glad you like the solution. :)

> Do you know how the global_store_fence() would look on different cpu architectures?

The way I envision it is quite portable: one dummy page per thread (lazily initialized) which is first write protected and then unprotected. Unprotecting can be implemented lazily by the kernel, but protecting cannot, so we are guaranteed this will trigger a global store serialization event and flush remote store buffers. It would go something like this:

*thread->dummy_page = 0; // Make sure the page is in memory. If it is offloaded
                         // to disk after the write, that offloading will globally
                         // serialize stores too (this is paranoia against potential
                         // optimizations that skip remote store flushing when the
                         // physical page isn’t loaded into memory).
write_protect(thread->dummy_page);   // serialize writes on remote CPUs
write_unprotect(thread->dummy_page);

I do remember concerns were raised that this technique might not work on RMO machines, but I see no reason why it would not in this case: we would still emit a StoreStore as normal between the reference store and the dirty-value write, and that second store depends on the card value read from the same address. If anyone thinks this would not work on RMO machines, I’m happy to discuss it.
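For reference, the mutator side being discussed can be sketched in C11 atomics; the card values, card size, and names here are illustrative (not HotSpot's actual encodings), only the ordering matters:

```c
#include <stdatomic.h>
#include <stdint.h>

enum { CARD_CLEAN = 0, CARD_PRECLEANED = 1, CARD_DIRTY = 2 };  /* illustrative values */
enum { CARD_SHIFT = 9 };                  /* 512-byte cards, as in HotSpot's card table */

static unsigned char card_table[1024];

/* Illustrative mapping from a heap offset to its card entry. */
static unsigned char *card_for(uintptr_t heap_offset) {
  return &card_table[heap_offset >> CARD_SHIFT];
}

/* Conditional card mark: the release fence provides the StoreStore that
   orders the preceding reference store before the card-table update. */
static void post_write_barrier(uintptr_t field_offset) {
  atomic_thread_fence(memory_order_release);  /* StoreStore (release is sufficient) */
  unsigned char *card = card_for(field_offset);
  if (*card != CARD_DIRTY) {
    *card = CARD_DIRTY;
  }
}
```
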

Then it’s possible to add potentially better platform-specific versions. Windows, for example, already has a system call that does exactly that: issue a global store-serializing fence using an IPI, and nothing else.

> The VM already uses this sort of synchronization for the thread state transitions, see references to UseMemBar, os::serialize_thread_states, os::serialize_memory. Perhaps that code can be reused somehow?

Yeah I had a look at that before, but it’s used in a slightly different way. As far as I could see it deviated from my version in two ways:

1) It used one global dummy page instead of one per thread. This means either only one thread can serialize stores at a time, or some kind of timestamping + locking mechanism is needed for the single store-serializing page. I imagined one page per thread instead, but I could probably give each thread a virtual page sharing the same underlying physical memory if we are worried about the memory footprint of the technique.
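The footprint idea in point 1 could look something like the following Linux-specific sketch (memfd_create, so glibc 2.27+; all names are hypothetical): each thread maps its own virtual alias of one shared physical page, and since mprotect operates per mapping, flipping one thread's alias still forces its own TLB shootdown.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static int fence_backing_fd = -1;  /* one physical page shared by all threads */

/* Returns a fresh per-thread virtual mapping backed by the shared page.
   Note: the fd initialization is not thread-safe as written; real code
   would do it once at startup. */
static char *map_thread_fence_page(void) {
  size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
  if (fence_backing_fd == -1) {
    fence_backing_fd = memfd_create("fence_page", 0);  /* Linux-specific */
    if (fence_backing_fd == -1 ||
        ftruncate(fence_backing_fd, (off_t)pagesz) != 0)
      return NULL;
  }
  void *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                 MAP_SHARED, fence_backing_fd, 0);
  return p == MAP_FAILED ? NULL : p;
}
```

Two calls return distinct virtual addresses backed by the same physical page, so a write through one alias is visible through the other.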

2) It enqueued dummy stores to this dummy page when doing such thread transitions, instead of fencing, which I believe is too conservative and unnecessary for us here. I suspect the idea came from some discussion of what a portable guarantee for the OS to serialize stores might be, and that the conclusion must have been something like this:

“stores issued to the dummy page happen-before the write protection, in total order” <- obvious

…and therefore a store to the dummy page was squeezed in, in place of a fence, just because it was uncertain whether remote stores would be flushed if there were no remote stores targeting the memory serializing page, am I right?

Short story:
This is a bit too conservative (we don’t want an extra store that could trap in the reference write barrier, do we). I argue the extra store is unnecessary: the OS/hardware is not aware of whether there are remote CPU stores pending to the specific page, and therefore flushes all remote stores, wherever they target. Instead I would give this guarantee:

“all stores issued in the same process affinity before the write protection happen-before the write protection is observable, in total order”

Longer story:

I could verify this guarantee by reading kernel sources.

Linux:
Since the source is open, I checked the implementation for the architectures we support (arm, aarch64, x86, x86_64, ppc, sparc) in the Linux kernel sources. It will always flush /all/ remote store buffers, regardless of whether there are pending remote stores to that page, as long as there is a change to be made to the permissions of the page (and hence its TLB entries), which we guarantee by having one dummy page per thread whose permissions we flip.

BSD:
I also checked the XNU kernel sources (BSD) and it’s the same story here: cross calls using IPI/APIC, where the APIC message itself acts as a fence when received, regardless of the code to be run remotely. 

Windows:
For Windows I do not know what the kernel does, since I can’t browse the source code, but it has a system call, FlushProcessWriteBuffers, that flushes remote CPU store buffers; we can use that directly on this platform, and it is probably best suited to do the job on Windows anyway. In any case, for x86/x86_64 it’s AFAIK impossible for kernel implementors to avoid the cross call using the APIC (which by itself flushes the store buffers, independent of the TLB-flushing procedure).

General:
I can’t imagine any fancy magic OS/hardware solution that would know remote store flushing is unnecessary because there are no latent remote CPU stores to the specific page being purged. The closest thing to this I came across was Itanium (which we don’t need to support?), which has a specific instruction, ptc.ga, to purge remote TLB entries with no apparent need for a cross call. But according to the developer manual, “Global TLB purge instructions (ptc.g and ptc.ga) follow release semantics both on the local and the remote processors”, which means all stores are still flushed; according to the manuals this is true for the WC and UC write buffers too. If fancy hardware were to go to such extremes, there are solutions for that as well, but no need to cross that bridge unless such hardware and OS appears, right?

Thanks,
/Erik

> /Mikael
> 
>> 
>> Thanks,
>> /Erik
>> 
>> 
>>> On 06 May 2015, at 14:52, Mikael Gerdin <mikael.gerdin at oracle.com> wrote:
>>> 
>>> Hi Vitaly,
>>> 
>>> On 2015-05-06 14:41, Vitaly Davidovich wrote:
>>>> Mikael's suggestion was to make mutator check for !clean and then mark
>>>> dirty.  If it sees stale dirty, it will write dirty again no?  Today's code
>>>> would have this problem because it's checking for !dirty, but I thought the
>>>> suggested change would prevent that.
>>> 
>>> Unfortunately I don't think my suggestion would solve anything.
>>> 
>>> If the conditional card mark would write dirty again if it sees a stale dirty it's not really solving the false sharing problem.
>>> 
>>> The problem is not the value that the precleaner writes to the card entry, it's that the mutator may see the old "dirty" value which was overwritten as part of precleaning but not necessarily visible to the mutator thread.
>>> 
>>> /Mikael
>>> 
>>> 
>>>> 
>>>> sent from my phone
>>>> On May 6, 2015 4:53 AM, "Andrew Haley" <aph at redhat.com> wrote:
>>>> 
>>>>> On 05/05/15 20:51, Vitaly Davidovich wrote:
>>>>>> If mutator doesn't see "clean" due to staleness, won't it just mark it
>>>>>> dirty "unnecessarily" using Mikael's suggestion?
>>>>> 
>>>>> No.  The mutator may see a stale "dirty" and not write anything.  At least
>>>>> I haven't seen anything which certainly will prevent that from happening.
>>>>> 
>>>>> Andrew.
>>>>> 
>>>>> 
>>>>> 
>> 
> 



More information about the hotspot-gc-dev mailing list