RFR: 8079315: UseCondCardMark broken in conjunction with CMS precleaning
Vitaly Davidovich
vitalyd at gmail.com
Tue May 12 17:58:43 UTC 2015
Erik,
I tend to agree with you that this seems like a good solution to the
current problem at hand, irrespective of when/if G1 fully supplants CMS.
Given that a similar mechanism is already used for safepointing, I don't
think this introduces a completely new construct that nobody has seen in
Hotspot before. However, this is obviously not my decision to make :).
Given that you have the DaCapo benchmarks set up, have you tried benching
Andrew's StoreLoad proposal? It would be interesting to see if anything's
revealed there.
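As I understand it, Andrew's proposal is the usual Dekker-style pairing
between the conditional card mark and CMS precleaning, roughly this shape
(an illustrative C++ sketch with made-up names, not actual HotSpot code):

  #include <atomic>
  #include <cstdint>

  // One reference field and the card covering it (sketch).
  std::atomic<void*>   field{nullptr};
  std::atomic<uint8_t> card{0};

  constexpr uint8_t dirty_card = 0;
  constexpr uint8_t clean_card = 1;

  // Mutator: reference store followed by a conditional card mark.
  void mutator_store(void* new_ref) {
    field.store(new_ref, std::memory_order_relaxed);         // (1) store ref
    std::atomic_thread_fence(std::memory_order_seq_cst);     // proposed StoreLoad
    if (card.load(std::memory_order_relaxed) != dirty_card)  // (2) load card
      card.store(dirty_card, std::memory_order_relaxed);
  }

  // CMS precleaner: clean the card, then rescan the covered object.
  void preclean_card() {
    card.store(clean_card, std::memory_order_relaxed);       // (1') clean card
    std::atomic_thread_fence(std::memory_order_seq_cst);     // matching StoreLoad
    void* ref = field.load(std::memory_order_relaxed);       // (2') rescan
    (void)ref; // ... trace through ref ...
  }

Without the fences, (2) can complete before (1) becomes visible (and
likewise (2') before (1')), so the mutator can skip the mark while the
precleaner misses the new reference -- which, as I understand the bug,
is exactly the 8079315 failure mode.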
On Tue, May 12, 2015 at 1:23 PM, Erik Österlund <erik.osterlund at lnu.se>
wrote:
> Hi Mikael and Andrew,
>
> Unless I missed something, I don’t think this introduces that much code
> complexity.
> Of course I agree that G1 will make fixes in CMS somewhat wasted effort
> in the long run.
> However, until then it would be good if CMS still works. And a few lines
> of shared code (a handful for the actual GC) seems, to me, both less
> painful from an engineering point of view and better performing than
> going through all the mutator code paths that need changing (interpreter,
> C1, C2, potentially for many architectures).
>
> Out of curiosity I patched the thing and my fix can be found here:
> http://cr.openjdk.java.net/~eosterlund/8079315/webrev.v1/
>
> Fortunately, it looks like CMS is already batching cards pretty well for
> me, so the change turned out to be very small. I added logging to see how
> often the global fence is triggered, and it is very rare, so I feel quite
> convinced it won’t impact performance negatively even on “that guy’s”
> machine with a terrible OS implementation.
>
> I benchmarked it using the DaCapo benchmarks locally on my computer
> (MacBook, x86_64, BSD) and there were no traces of any performance
> artefacts or regressions.
>
> If anyone happens to have a larger machine than my macbook, it would be
> interesting to take it for a spin. ;)
>
> Disclaimer: I haven’t poked around a lot in CMS in the past, so I hope I
> didn’t miss any important card value transitions!
>
> Thanks,
> /Erik
>
> On 12 May 2015, at 14:17, Mikael Gerdin <mikael.gerdin at oracle.com> wrote:
>
>
>
> On 2015-05-12 15:05, Aleksey Shipilev wrote:
>
> On 11.05.2015 16:41, Andrew Haley wrote:
>
> On 05/11/2015 12:33 PM, Erik Österlund wrote:
>
> Hi Andrew,
>
> On 11 May 2015, at 11:58, Andrew Haley <aph at redhat.com> wrote:
>
> On 05/11/2015 11:40 AM, Erik Österlund wrote:
>
> I have heard statements like this, that such a mechanism would not work
> on RMO, but never got an explanation of why it would only work on
> TSO. Could you please elaborate? I studied some kernel sources for
> a bunch of architectures and kernels, and as far as I can see it is
> all good for RMO too.
>
>
> Dave Dice himself told me that the algorithm is not in general safe
> for non-TSO. Perhaps, though, it is safe in this particular case. Of
> course, I may be misunderstanding him. I'm not sure of his reasoning
> but perhaps we should include him in this discussion.
>
>
> I see. It would be interesting to hear his reasoning, because it is
> not clear to me.
>
> From my point of view, I can't see a strong argument for doing this on
> AArch64. StoreLoad barriers are not fantastically expensive there so
> it may not be worth going to such extremes. The cost of a StoreLoad
> barrier doesn't seem to be so much more than the StoreStore that we
> have to have anyway.
>
>
> Yeah, regarding performance, I’m not sure when it’s worth removing these
> fences, and on what hardware.
>
>
> Your algorithm (as I understand it) trades a moderately expensive (but
> purely local) operation for a very expensive global operation, albeit
> with much lower frequency. It's not clear to me how much we value
> continuous operation versus faster operation with occasional global
> stalls. I suppose it must be application-dependent.
>
>
> Okay, Dice's asymmetric trick is nice. In fact, that is arguably what
> Parallel is using already: it serializes the mutator stores by stopping
> the mutators at a safepoint. Using mprotect and TLB tricks as the
> serialization action is cute and dandy.
>
> However, I have doubts that employing a system-wide synchronization
> mechanism for a concurrent collector is a good thing when we can't
> predict and control its long-term performance. For example, we are
> basically putting ourselves at the mercy of the underlying OS's mprotect
> performance. There are industrial GCs that rely on OS performance
> (*cough* *cough*); you can see what those require to guarantee
> performance.
>
>
> Just to be clear, this type of synchronization is in fact already
> implemented in the JVM to synchronize thread states for the safepoint
> protocol, so it's not exactly new and unexplored territory.
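>
> For anyone unfamiliar with it, that existing mechanism is roughly this
> (a heavily simplified sketch of the -XX:-UseMembar pseudo-membar idea,
> not the actual HotSpot code):
>
>   #include <sys/mman.h>
>   #include <atomic>
>
>   // Assume serialize_page was mmap'ed R/W at VM startup.
>   static void*  serialize_page;
>   static size_t page_size = 4096;   // really sysconf(_SC_PAGESIZE)
>
>   std::atomic<int> thread_state;    // per-thread in reality
>
>   // Mutator side: a state transition is two plain stores, no fence.
>   void transition(int new_state) {
>     thread_state.store(new_state, std::memory_order_relaxed);
>     *static_cast<volatile int*>(serialize_page) = 1;  // touch the page
>   }
>
>   // VM thread: toggling the page protection forces the in-flight page
>   // touches to complete or fault, after which the preceding state
>   // stores are guaranteed to be visible.
>   void serialize_thread_states() {
>     mprotect(serialize_page, page_size, PROT_READ);
>     mprotect(serialize_page, page_size, PROT_READ | PROT_WRITE);
>     // now safe to examine every thread's state
>   }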
>
> However, it's not clear to me that the code complexity involved in using
> that type of synchronization for conditional card marking in CMS is worth
> it.
>
>
> Also, given the problem is specific to CMS, which arguably goes away in
> favor of G1, I would think introducing special-case-for-CMS barriers in
> the mutator code is a sane interim solution.
>
>
> I agree.
>
>
> Especially if we can backport the G1-like barrier "filtering" to the CMS
> case? If I read this thread right, Erik and Thomas concluded there is no
> clear benefit to introducing the mprotect-like mechanics in G1, which
> probably means the overheads are bearable with appropriate mutator-side
> changes.
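>
> For reference, the G1 post-barrier "filtering" has roughly this shape
> (a simplified sketch; the helpers are assumed, and the real code lives
> in G1's barrier set and the interpreter/C1/C2 backends):
>
>   #include <cstdint>
>
>   using jbyte = int8_t;
>   constexpr jbyte dirty_card    = 0;
>   constexpr jbyte g1_young_card = 2;    // illustrative value
>
>   // Assumed helpers for the sketch:
>   bool same_region(void* p, void* q);   // (p ^ q) >> region_shift == 0
>   volatile jbyte* card_for(void* p);    // card table lookup
>   void storeload_fence();               // e.g. lock addl / dmb
>   void enqueue(volatile jbyte* card);   // dirty card queue for refinement
>
>   void g1_post_barrier(void* field, void* new_val) {
>     if (same_region(field, new_val)) return;  // filter same-region stores
>     if (new_val == nullptr) return;           // filter null stores
>     volatile jbyte* card = card_for(field);
>     if (*card == g1_young_card) return;       // filter young-gen targets
>     storeload_fence();                        // the heavyweight part
>     if (*card != dirty_card) {
>       *card = dirty_card;
>       enqueue(card);
>     }
>   }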
>
>
> I don't think it would be easy to implement barrier "filtering" in CMS.
> Keep in mind that even before the StoreLoad was added to G1's barriers
> they were fairly heavy-weight. CMS's barriers are not; if we start to add
> conditionals and StoreLoad barriers to them, the runtime overhead may
> increase more than it did when we added the StoreLoad to G1.
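>
> For contrast, CMS's current post-barrier is essentially one unconditional
> byte store (sketch):
>
>   #include <cstdint>
>
>   // Classic card table post-barrier: a single unconditional byte store.
>   inline void cms_post_barrier(volatile int8_t* card_table,
>                                void* field, unsigned card_shift) {
>     card_table[(uintptr_t)field >> card_shift] = 0;  // 0 == dirty_card
>   }
>
> so any added check or fence is a large relative cost there.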
>
> /Mikael
>
>
> Thanks,
> -Aleksey
>
>
>