RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning
Vitaly Davidovich
vitalyd at gmail.com
Mon May 11 16:06:48 UTC 2015
Erik,
> Also the global operation is not purely, but "mostly" locally expensive
> for the thread performing the global fence. The cost on remote CPUs is
> pretty much simply a normal fence (roughly). Of course there is always
> gonna be that one guy with 4000 CPUs which might be a bit awkward. But even
> then, with high enough n, shared, timestamped global fences etc, even such
> ridiculous scalability should be within reach.
Is it roughly like a normal fence for remote CPUs? You mentioned TLB being
invalidated on remote CPUs, which seems a bit more involved than a normal
fence.
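For concreteness, here is a rough sketch of the kind of page-protection-based
"global fence" I understand is being proposed; the helper names and the exact
kernel behaviour (that downgrading a page's protection interrupts the other
CPUs running threads of the process) are my assumptions, not the actual patch:

    #include <sys/mman.h>
    #include <atomic>
    #include <cstddef>

    // Illustrative only: a dedicated page whose protection the cleaning
    // thread flips to force remote serialization.
    static const std::size_t kPageSize = 4096;
    static void* g_fence_page = MAP_FAILED;

    bool init_global_fence() {
      g_fence_page = mmap(nullptr, kPageSize, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      return g_fence_page != MAP_FAILED;
    }

    // Issued by the card-cleaning thread once per batch of n cards, so its
    // cost is amortized; mutator threads keep a fence-free fast path.
    void issue_global_fence() {
      std::atomic_thread_fence(std::memory_order_seq_cst);  // local full fence
      // Downgrading the protection is what is assumed to trigger a TLB
      // shootdown IPI on the CPUs running this process, which acts roughly
      // like a fence there; the second call restores the page for next time.
      mprotect(g_fence_page, kPageSize, PROT_READ);
      mprotect(g_fence_page, kPageSize, PROT_READ | PROT_WRITE);
      std::atomic_thread_fence(std::memory_order_seq_cst);
    }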
I think it's an interesting approach, although I wonder if it's worth the
trouble given that G1 is aiming to replace CMS in the not-too-distant
future?
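For context, here is a rough sketch of the mutator-side conditional card mark
the thread below is about, with the StoreLoad fence that the global-fence
scheme would let us drop; the names and card values are illustrative, not the
actual HotSpot barrier code:

    #include <atomic>
    #include <cstdint>

    typedef std::uint8_t CardValue;
    static const CardValue kDirtyCard = 0;  // illustrative dirty-card value

    // Post-write barrier for "obj->field = ref", run by the mutator right
    // after the reference store. With a conditional card mark the card is
    // read first; the full fence orders the reference store before that
    // read, so a concurrent precleaner that cleans the card and rescans the
    // object cannot miss the new reference.
    void post_ref_write_barrier(std::atomic<CardValue>* card,
                                bool use_cond_card_mark) {
      if (use_cond_card_mark) {
        std::atomic_thread_fence(std::memory_order_seq_cst);   // StoreLoad
        if (card->load(std::memory_order_relaxed) == kDirtyCard) {
          return;  // already dirty: skip the store and the card-table traffic
        }
      }
      // Release ordering supplies the StoreStore that keeps the reference
      // store visible no later than the card dirtying.
      card->store(kDirtyCard, std::memory_order_release);
    }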
On Mon, May 11, 2015 at 11:59 AM, Erik Österlund <erik.osterlund at lnu.se>
wrote:
> Hi Andrew,
>
> On 11 May 2015, at 14:41, Andrew Haley <aph at redhat.com> wrote:
>
> On 05/11/2015 12:33 PM, Erik Österlund wrote:
>
> Hi Andrew,
>
> On 11 May 2015, at 11:58, Andrew Haley <aph at redhat.com> wrote:
>
> On 05/11/2015 11:40 AM, Erik Österlund wrote:
>
> I have heard statements like this, that such a mechanism would not work
> on RMO, but I never got an explanation of why it would only work on
> TSO. Could you please elaborate? I studied the kernel sources for a
> bunch of architectures, and as far as I can see it all looks good for
> RMO too.
>
>
> Dave Dice himself told me that the algorithm is not in general safe
> for non-TSO. Perhaps, though, it is safe in this particular case. Of
> course, I may be misunderstanding him. I'm not sure of his reasoning
> but perhaps we should include him in this discussion.
>
>
> I see. It would be interesting to hear his reasoning, because it is
> not clear to me.
>
> From my point of view, I can't see a strong argument for doing this on
> AArch64. StoreLoad barriers are not fantastically expensive there so
> it may not be worth going to such extremes. The cost of a StoreLoad
> barrier doesn't seem to be so much more than the StoreStore that we
> have to have anyway.
>
>
> Yeah, regarding performance I’m not sure when it’s worth removing these
> fences, or on what hardware.
>
>
> Your algorithm (as I understand it) trades a moderately expensive (but
> purely local) operation for a very expensive global operation, albeit
> with much lower frequency. It's not clear to me how much we value
> continuous operation versus faster operation with occasional global
> stalls. I suppose it must be application-dependent.
>
>
> From my perspective the idea is to move the synchronization overhead
> from a place where it cannot be amortized away (memory accesses) to a code
> path where it can be pretty much arbitrarily amortized away (batched
> cleaning). We couldn’t fence once every n memory accesses, but we certainly
> can issue a global fence once every n cards (batched), where we can pick a
> suitable n at which the related synchronization overheads seem to vanish.
>
> Also the global operation is not purely, but "mostly" locally expensive
> for the thread performing the global fence. The cost on remote CPUs is
> pretty much simply a normal fence (roughly). Of course there is always
> gonna be that one guy with 4000 CPUs which might be a bit awkward. But even
> then, with high enough n, shared, timestamped global fences etc, even such
> ridiculous scalability should be within reach.
>
> BTW do we normally have some kind of reasonable scalability window we
> optimize for, and don’t care as much about optimizing for that potential
> one guy? ;)
>
>
> In this case though, if it makes us any happier, I think we could
> probably get rid of the storestore barrier too:
>
> The latent reference store is forced to serialize anyway once the
> dirty card value write is observable and about to be cleaned. So the
> potential consistency violation, where the card looks dirty but the
> cleaning thread reads a stale reference value, could not happen with
> my approach even without storestore hardware protection. I didn’t
> give it too much thought, but off the top of my head I can’t see any
> problems. If we want to get rid of the storestore too, I can give it
> some more thought.
>
>
> That is very interesting.
>
>
> Indeed! :)
>
>
> But you know much better than me if these fences are problematic or
> not. :)
>
>
> Not really. AArch64 is an architecture not an implementation, and is
> designed to be implemented using a wide range of techniques. Instead
> of having very complex cores, some designers seem to have decided it
> makes sense to have many of them on a die. It may well be, though,
> that some implementers will adopt an x86-like highly-superscalar
> architecture with a great deal of speculative execution. I can only
> predict the past... My approach with this project has been to do
> things in the most straightforward way rather than trying to optimize
> for whatever implementations I happen to have available.
>
>
> I see your point of view: you don’t want to be that dependent on the
> hardware, and elected to go with a straightforward synchronization solution
> for this reason. This makes sense. But since we are dealing with an
> optimization feature here (UseCondCardMark), I believe a less
> straightforward solution actually makes us less dependent on such hardware
> details. Because it is an optimization, the highest possible performance is
> probably expected and even important, and that performance becomes very
> tightly dependent on the cost of fencing, which probably varies a lot
> between hardware vendors.
>
> Conversely, the possibly less straightforward synchronization solution
> dodges this bullet by simply not fencing and arbitrarily amortizing away
> the related synchronization costs until they vanish. :)
>
> Thanks,
> /Erik
>
> Andrew.
>
>
>