RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning

Mon May 11 16:51:06 UTC 2015

Hi Andrew,

> On 11 May 2015, at 17:21, Andrew Haley <aph at redhat.com> wrote:
> 
> On 05/11/2015 05:06 PM, Vitaly Davidovich wrote:
> 
>>> Also the global operation is not purely, but “mostly” locally expensive
>>> for the thread performing the global fence. The cost on global CPUs is
>>> pretty much simply a normal fence (roughly). Of course there is always
>>> gonna be that one guy with 4000 CPUs which might be a bit awkward.
> 
> Well yes, but that guy with 4000 CPUs is precisely the target for
> UseCondCardMark.

Okay. That should still be fine, as I described, but I guess it would be a bit expensive to benchmark and fine-tune. I don’t have access to any such machines. :( If somebody does, we could find out.

> 
>>> But even then, with high enough n, shared, timestamped global
>>> fences etc, even such ridiculous scalability should be within
>>> reach.
>> 
>> Is it roughly like a normal fence for remote CPUs?
> 
> I would not think so.  Surely you'd have to interrupt every core in
> the process and do a bunch of flushes.  A TLB flush is expensive, as
> is interrupting the core itself.  I'm fairly sure there's no way to
> flush a remote core's TLB without interrupting it.
> 

Yes, but the cores are interrupted in a round-robin fashion, using e.g. the APIC on x86, not necessarily all at the same time. It’s like message passing. And the TLBs will only be purged for the range covered by the memory protection change; this is a single page that those remote CPUs don’t even have in their TLBs, and therefore no remote TLB entries will be changed.

On x86_64, for example, the APIC message itself acts as a fence; the remote core then runs the code that finds out that no TLB entries need changing, and that’s pretty much it.
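
To make the mechanism above concrete, here is a minimal sketch in C of a global fence built on a memory protection change. It is not the code under review: the names global_fence_init and global_fence are hypothetical, and it assumes a POSIX system where downgrading the protection of a touched page makes the kernel notify the other CPUs running the process (typically via inter-processor interrupts on Linux). Only the issuing thread ever touches the page, so remote cores should hold no TLB entry for it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper state; these names are illustrative only. */
static char  *fence_page;       /* private page used only to trigger the fence */
static size_t fence_page_size;

/* Map one private page that only the fencing thread ever touches. */
int global_fence_init(void) {
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0) return -1;
    fence_page_size = (size_t)ps;
    void *p = mmap(NULL, fence_page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    fence_page = p;
    return 0;
}

/* Assumption: the protection downgrade below is what prompts the kernel to
 * interrupt the other cores. Returns 0 on success, -1 on failure. */
int global_fence(void) {
    assert(fence_page != NULL);
    /* Make the page writable and dirty it, so that it is really mapped. */
    if (mprotect(fence_page, fence_page_size, PROT_READ | PROT_WRITE) != 0)
        return -1;
    *(volatile char *)fence_page = 1;
    /* Downgrade the protection of this single page. */
    if (mprotect(fence_page, fence_page_size, PROT_READ) != 0)
        return -1;
    return 0;
}
```

A caller would run global_fence_init() once at startup and then call global_fence() whenever it needs all other threads to have passed a fence. Whether this actually serializes remote cores depends on the OS and architecture, so treat it as an illustration rather than a portable guarantee.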

This is not a scalability bottleneck at all. I already know the constant costs are not problematic, because I use this technique quite a lot myself, and Thomas Schatzl was kind enough to thoroughly benchmark such a card cleaning solution for me on G1 around New Year, on a number of benchmarks and machines. The conclusion for G1 was that it didn’t matter performance-wise. Also, that constant cost can be amortized away arbitrarily by regulating how often the global fence is issued.
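
The amortization argument can be illustrated with a small single-threaded simulation in C. It assumes the cleaning side processes cards in batches of n: mark the dirty cards in the batch, issue one global fence, then scan them. The card values, the function names and the stand-in fence are all hypothetical; this is not HotSpot's card table code. With one fence per batch, the fixed fence cost per card shrinks as n grows.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative card values; not HotSpot's actual encoding. */
enum { CARD_CLEAN = 0, CARD_DIRTY = 1, CARD_PRECLEANED = 2 };

/* Stand-in for the expensive process-wide fence (hypothetical). */
static void global_fence_stub(void) {
    atomic_thread_fence(memory_order_seq_cst);
}

/* Placeholder for scanning the objects covered by one card. */
static void scan_card(size_t index) { (void)index; }

/* Processes the card table in batches of at most `batch` cards.
 * Assumes the input holds only CARD_CLEAN or CARD_DIRTY values.
 * Returns the number of cards scanned; *fences_out receives the
 * number of global fences that were issued. */
size_t preclean_batched(unsigned char *cards, size_t ncards,
                        size_t batch, size_t *fences_out) {
    assert(cards != NULL && fences_out != NULL);
    if (batch == 0) batch = 1;
    size_t scanned = 0;
    *fences_out = 0;
    for (size_t start = 0; start < ncards; ) {
        size_t end = (ncards - start < batch) ? ncards : start + batch;
        size_t marked = 0;
        /* Phase 1: mark every dirty card in the batch as precleaned. */
        for (size_t i = start; i < end; i++) {
            if (cards[i] == CARD_DIRTY) {
                cards[i] = CARD_PRECLEANED;
                marked++;
            }
        }
        if (marked > 0) {
            /* One fence covers the whole batch. */
            global_fence_stub();
            (*fences_out)++;
            /* Phase 2: scan the cards marked in phase 1. */
            for (size_t i = start; i < end; i++) {
                if (cards[i] == CARD_PRECLEANED) {
                    scan_card(i);
                    scanned++;
                }
            }
        }
        start = end;
    }
    return scanned;
}
```

For example, a table of 10 cards with 5 dirty entries spread across it needs up to 3 fences at a batch size of 4, but only 1 fence at a batch size of 10.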

Thanks,
/Erik

> Andrew.