G1 question: concurrent cleaning of dirty cards
Hi all,

we have a question about the interaction between G1 post barriers and the refinement thread's concurrent dirty card cleaning.

The case in which the G1 post barrier sees a clean card is obviously not problematic, because it will add an entry to a dirty card queue. However, in the case in which the Java thread (mutator thread) sees the card already dirtied, it won't enqueue the card again. This is safe only as long as its stored oop (1) is seen and processed (3') by the parallel refinement after it has cleaned the card (1'):

Java Thread (mutator)              Refinement Thread
                                   (G1RemSet::concurrentRefineOneCard_impl calls
                                    oops_on_card_seq_iterate_careful)
(1) store(oop)
    (StoreLoad required here?)
(2) load(card==dirty)
                                   (1') store(card==clean)
                                   (2') StoreLoad barrier
                                   (3') load(oop)

So the refinement thread seems to rely on getting the oop which was written BEFORE the (2) load(card==dirty) was observed. We wonder how this ordering is guaranteed? There are no StoreLoad barriers in the Java thread's path. (StoreLoad ordering needs explicit barriers even on TSO platforms.)

Kind regards,
Martin
Hi Martin,

An enqueued card lets the refinement threads know that the oops spanned by that card need to be walked, but we're only interested in the latest contents of the fields in those oops. IOW the oop in (3') doesn't need to be the oop stored in (1). If there's a subsequent store (3) to the same location then we want the load at (3') to see the latest contents. For example, suppose we have:

x.f = a;
x.f = b;

If the application thread sees the card spanning x.f as dirty at the second store then we won't enqueue the card after the second store. As long as the refinement thread sees 'b' when the card is 'refined' then we're OK, since we no longer need to add an entry to the RSet for the region containing a - we do need an entry in the RSet for the region containing b.

If the application thread sees the card as clean at the second store before the refinement thread loads x.f, we have just needlessly enqueued the card again.

It is only if the application thread sees the card as dirty but the refinement thread reads 'a' that there could be a problem. We have a missing RSet entry for 'b'.

JohnC

On 5/17/2013 1:29 AM, Doerr, Martin wrote:
Hi John,

thank you very much for your comments. Your last line explains exactly what we are concerned about. Does anybody plan to prevent this situation? I don't want to propose adding StoreLoad barriers in all G1 post barriers because I'd expect an undesired performance impact. Would it be feasible to rescan all cards which have been dirtied (at least once) during the next stop-the-world phase? Maybe somebody has a better idea.

Kind regards,
Martin

From: John Cuthbertson [mailto:john.cuthbertson@oracle.com]
Sent: Donnerstag, 23. Mai 2013 02:29
To: Doerr, Martin
Cc: hotspot-gc-dev@openjdk.java.net; Mikael Gerdin; Braun, Matthias
Subject: Re: G1 question: concurrent cleaning of dirty cards
The cards that are stored in the buffers are not available for concurrent processing right when they are enqueued. Instead they are passed to the processing threads when the buffer fills up. This passing of the buffer involves signaling a condition variable (like pthread_cond_signal(), literally), which has a write barrier for sure, and that would guarantee that the cards in the buffer, the contents of the card table, and the contents of the object are "in sync".

The only place in the generated code where there has to be a store-store barrier (for non-TSO architectures) is between the actual field store and the dirtying of the card.

Does this make sense?

igor

On May 23, 2013, at 6:12 AM, "Doerr, Martin" <martin.doerr@sap.com> wrote:
Oh, I re-read your letter - yup, there seems to be a problem. Have you observed that in practice?

igor

On Jun 26, 2013, at 9:27 PM, Igor Veresov <iggy.veresov@gmail.com> wrote:
Hi Igor,

we have seen crashes while testing our hotspot 23 based SAPJVM with G1. However, there's no evidence that these crashes are caused by this problem. We basically found it by reading the code.

Best regards,
Martin

From: Igor Veresov [mailto:iggy.veresov@gmail.com]
Sent: Donnerstag, 27. Juni 2013 08:15
To: Doerr, Martin
Cc: John Cuthbertson; hotspot-gc-dev@openjdk.java.net; Braun, Matthias
Subject: Re: G1 question: concurrent cleaning of dirty cards
Yeah, unless I'm forgetting something, this seems very fundamental. The probability of this happening is probably greatly reduced by the card cache; nevertheless it seems possible. The only solution that comes to mind is to do periodic safepoints to grab the already filled buffers and clean the corresponding entries in the card table. The processing of the grabbed buffers may of course be done concurrently.

But there doesn't seem to be an easy way to ensure the ordering between the original store to the object and the cleaning store to the card table. The barrier that flushes the store to the cardtable doesn't in any way force the original store to the object from the other thread to happen before it. So the failure case would be this:

Mutator thread:
- store to object
- load from cardtable
- compare the cardtable byte (it is dirty)
- bail from barrier

Refinement thread:
- clear the card

If clearing of the card occurs after the mutator loads the byte from the cardtable, the mutator won't enqueue the card, which is sort of the intended behavior. But there is no guarantee that the refinement thread would see the result of "store to object", in which case the information will be lost.

igor

On Jun 27, 2013, at 1:27 AM, "Doerr, Martin" <martin.doerr@sap.com> wrote:
Hi Igor,

we didn't find an easy and feasible way to ensure the ordering, either. Grabbing the buffers and cleaning the cards at safepoints might be the best solution.

Maybe removing the barrier that flushes the store to the cardtable makes the problem more likely to occur. I guess the purpose of the barrier was exactly to avoid this problem (which would work perfectly if the post barriers had StoreLoad barriers, too).

Best regards,
Martin

From: Igor Veresov [mailto:iggy.veresov@gmail.com]
Sent: Donnerstag, 27. Juni 2013 16:13
To: Doerr, Martin
Cc: John Cuthbertson; hotspot-gc-dev@openjdk.java.net; Braun, Matthias
Subject: Re: G1 question: concurrent cleaning of dirty cards
On Jun 28, 2013, at 7:08 AM, "Doerr, Martin" <martin.doerr@sap.com> wrote:
Hi Igor,
we didn’t find an easy and feasible way to ensure the ordering, either. Grabbing the buffers and cleaning the cards at safepoints might be the best solution.
Would anybody from the G1 team like to think about that?
Maybe removing the barrier that flushes the store to the cardtable makes the problem more likely to occur. I guess the purpose of the barrier was exactly to avoid this problem (which should be working perfectly if the post barriers had StoreLoad barriers, too).
Yeah, but like you noted, that would have a horrific effect on performance. So it's probably best to bunch the work up, to at least eliminate the need for extra work when, say, you're looping and storing to a limited working set (G1 uses the cardtable basically for that purpose). The safepoint approach will likely require more memory for buffers, and the load will be spiky; and if a collection were to happen right after we grabbed the buffers, the collector would have to process all of them, which is not going to work well for predictability. But nothing better comes to mind at this point. Btw, there are already periodic safepoints to do biased locking revocations, so maybe it would make sense to piggyback on those. igor
Best regards, Martin
From: Igor Veresov [mailto:iggy.veresov@gmail.com] Sent: Donnerstag, 27. Juni 2013 16:13 To: Doerr, Martin Cc: John Cuthbertson; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Yea, unless I'm forgetting something, this seems very fundamental. The probability of this happening is probably greatly reduced by the card cache, nevertheless it seems possible. The only solution that comes to mind is do periodic safepoint to grab the already filled buffers and clean the corresponding entries in the card table. The processing of the grabbed buffers may of course be done concurrently.
But there doesn't seem to be an easy way to ensure the ordering between the original store to the object and the cleaning store to the card table. The barrier that flushes the store to the cardtable doesn't in any way enforce the original store to the object from the other thread to happen before that. So the failure case would be this:
Mutator thread:
- store to object
- load from cardtable
- compare the cardtable byte (it is dirty)
- bail from barrier
Refinement thread:
- clear the card
If clearing of the card occurs after the mutator loads the byte from the cardtable, the mutator won't enqueue the card, which is sort of the intended behavior. But there is no guarantee that the refinement thread would see the result of "store to object", in which case the information will be lost.
igor
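Igor's failure case can be replayed deterministically. The sketch below (plain Python with made-up names, not HotSpot code) models the mutator's not-yet-visible field store as a store buffer that drains late, which is exactly what the missing StoreLoad barrier permits:

```python
# Deterministic replay of the bad interleaving: the field store (1) still
# sits in the mutator's store buffer when the refinement thread runs
# (1')-(3'). All state here is illustrative.

DIRTY, CLEAN = "dirty", "clean"

def replay_bad_interleaving():
    field_in_memory = "a"      # globally visible value of x.f
    store_buffer = None        # mutator's pending store, not yet visible
    card = DIRTY               # card already dirtied by the first store
    queue = []                 # dirty card queue
    rset = {"a"}               # RSet entries, keyed by stored value's region

    # Mutator: (1) store(oop) -- lands only in the store buffer.
    store_buffer = "b"
    # Mutator: (2) load(card) -- sees dirty, so it does NOT re-enqueue.
    if card == CLEAN:
        queue.append("card")

    # Refinement: (1') store(card==clean), (2') StoreLoad, (3') load(oop).
    card = CLEAN
    seen = field_in_memory     # store buffer not drained yet -> sees 'a'
    rset.add(seen)

    # The mutator's store finally becomes visible; nobody re-scans the card.
    field_in_memory = store_buffer
    return queue, rset
```

Running it shows the lost update: the card was never re-enqueued and the RSet never learns about 'b'.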
On Jun 27, 2013, at 1:27 AM, "Doerr, Martin" <martin.doerr@sap.com> wrote:
Hi Igor,
we have seen crashes while testing our hotspot 23 based SAPJVM with G1. However, there’s no evidence that these crashes are caused by this problem. We basically found it by reading code.
Best regards, Martin
From: Igor Veresov [mailto:iggy.veresov@gmail.com] Sent: Donnerstag, 27. Juni 2013 08:15 To: Doerr, Martin Cc: John Cuthbertson; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Oh, re-read your letter, yup, there seems to be a problem. Have you observed that in practice?
igor
On Jun 26, 2013, at 9:27 PM, Igor Veresov <iggy.veresov@gmail.com> wrote:
The cards that are stored in the buffers are not available for concurrent processing right when they are enqueued. Instead they are passed to the processing threads when the buffer fills up. This passing of the buffer involves signaling of a condition (like pthread_cond_signal(), literally) that has a write barrier for sure, which would guarantee that the cards in the buffer, and contents of the card table, and the contents of the object are "in sync".
The only place in the generated code where there has to be a store-store barrier (for non-TSO architectures) is between the actual field store and the dirtying of the card.
Does this make sense?
igor
On May 23, 2013, at 6:12 AM, "Doerr, Martin" <martin.doerr@sap.com> wrote:
Hi John,
thank you very much for your comments. Your last line explains exactly what we are concerned about. Does anybody plan to prevent this situation? I don’t want to propose adding StoreLoad barriers in all G1 post barriers because I’d expect an undesired performance impact. Would it be feasible to rescan all cards which have been dirtied (at least once) during the next stop-the-world phase? Maybe somebody has a better idea.
Kind regards, Martin
From: John Cuthbertson [mailto:john.cuthbertson@oracle.com] Sent: Donnerstag, 23. Mai 2013 02:29 To: Doerr, Martin Cc: hotspot-gc-dev@openjdk.java.net; Mikael Gerdin; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Hi Martin,
An enqueued card lets the refinement threads know that the oops spanned by that card need to be walked, but we're only interested in the latest contents of the fields in those oops. IOW the oop in (3') doesn't need to be the oop stored in (1). If there's a subsequent store (3) to the same location then we want the load at (3') to see the latest contents. For example suppose we have:
x.f = a; x.f = b;
If the application thread sees the card spanning x.f is dirty at the second store then we won't enqueue the card after the second store. As long as the refinement thread sees 'b' when the card is 'refined' then we're OK since we no longer need to add an entry into the RSet for the region containing a - we do need an entry in the RSet for the region containing b.
If the application thread sees the card as clean at the second store before the refinement thread loads x.f we have just needlessly enqueued the card again.
It is only if the application thread sees the card as dirty but the refinement thread reads 'a' then there could be a problem. We have a missing RSet entry for 'b'.
JohnC
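John's three cases can be captured in a small decision table (a sketch with illustrative names; the strings stand for the card state the mutator observes at the second store and the value the refinement thread later loads):

```python
def outcome(card_seen, value_seen):
    """card_seen: card state the application thread sees at the second
    store (x.f = b). value_seen: value the refinement thread loads when
    it processes the card."""
    if card_seen == "clean":
        # The mutator re-enqueues the card: redundant work, but correct.
        return "ok: needlessly enqueued again"
    # card_seen == "dirty": the mutator does not enqueue after the second store.
    if value_seen == "b":
        return "ok: RSet gets an entry for b's region"
    # Refinement read the stale value: the entry for b's region is lost.
    return "bug: missing RSet entry for b"
```

Only the dirty-card/stale-load combination is a correctness problem; the other three are at worst redundant work.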
On 5/17/2013 1:29 AM, Doerr, Martin wrote: Hi all,
we have a question about the interaction between G1 post barriers and the refinement thread's concurrent dirty card cleaning. The case in which the G1 post barrier sees a clean card is obviously not problematic, because it will add an entry in a dirty card queue. However, in case in which the Java thread (mutator thread) sees the card already dirtied, it won’t enqueue the card again. Which is safe as long as its stored oop (1) is seen and processed (3’) by the parallel refinement after having cleaned the card (1’):
Java Thread (mutator)            Refinement Thread
                                 (G1RemSet::concurrentRefineOneCard_impl calls
                                  oops_on_card_seq_iterate_careful)
(1) store(oop)
    ( StoreLoad required here ?)
(2) load(card==dirty)
                                 (1’) store(card==clean)
                                 (2’) StoreLoad barrier
                                 (3’) load(oop)
So the refinement thread seems to rely on getting the oop which was written BEFORE the (2) load(card==dirty) was observed. We wonder how this ordering is guaranteed? There are no StoreLoad barriers in the Java Thread's path. (StoreLoad ordering needs explicit barriers even on TSO platforms.)
Kind regards, Martin
Hi Igor,

On 6/28/2013 9:47 AM, Igor Veresov wrote:
On Jun 28, 2013, at 7:08 AM, "Doerr, Martin" <martin.doerr@sap.com <mailto:martin.doerr@sap.com>> wrote:
Hi Igor, we didn’t find an easy and feasible way to ensure the ordering, either. Grabbing the buffers and cleaning the cards at safepoints might be the best solution.
Would anybody from the G1 team like to think about that?
I've been thinking about this issue on and off for the last few weeks when I get the time. I mentioned it to Vladimir a couple of times to get his input.
Maybe removing the barrier that flushes the store to the cardtable makes the problem more likely to occur. I guess the purpose of the barrier was exactly to avoid this problem (which should be working perfectly if the post barriers had StoreLoad barriers, too).
Yeah, but like you noted that would have a horrific effect on performance. So, it's probably best to bunch the work up to at least eliminate the need of extra work when, say, you're looping and storing to a limited working set (G1 uses the cardtable basically for that purpose). The safepoint approach will likely require more memory for buffers and the load will be spiky, and if the collection were to happen right after we grabbed the buffers the collector will have to process all of them which is not going to work well for predictability. But nothing better comes to mind at this point. Btw, there are already periodic safepoints to do bias locking revocations, so may be it would make sense to piggyback on that.
Piggybacking on all the other safepoint operations might work if they happen frequently enough, but I don't know if that's the case. And as you say, even then there will be times where we haven't had a safepoint for a while and will have a ton of buffers to process at the start of the pause. It might be worth adding a suitable memory barrier to the G1 post write barrier and evaluating the throughput hit.

JohnC
The impact on the next collection, however, can be bounded. Say, if you make it have a safepoint to reap the buffers when the number of buffers reaches $n$, that alone would put a cap on the potential pause incurred during the collection. The card cache currently has the same effect, sort of, right?

igor

On Jun 28, 2013, at 12:26 PM, John Cuthbertson <john.cuthbertson@oracle.com> wrote:
Hi Igor,
On 6/28/2013 9:47 AM, Igor Veresov wrote:
On Jun 28, 2013, at 7:08 AM, "Doerr, Martin" <martin.doerr@sap.com> wrote:
Hi Igor,
we didn’t find an easy and feasible way to ensure the ordering, either. Grabbing the buffers and cleaning the cards at safepoints might be the best solution.
Would anybody from the G1 team like to think about that?
I've been thinking about this issue on an off for the last few weeks when I get the time. I mentioned it to Vladimir a couple of times to get his input.
Maybe removing the barrier that flushes the store to the cardtable makes the problem more likely to occur. I guess the purpose of the barrier was exactly to avoid this problem (which should be working perfectly if the post barriers had StoreLoad barriers, too).
Yeah, but like you noted that would have a horrific effect on performance. So, it's probably best to bunch the work up to at least eliminate the need of extra work when, say, you're looping and storing to a limited working set (G1 uses the cardtable basically for that purpose). The safepoint approach will likely require more memory for buffers and the load will be spiky, and if the collection were to happen right after we grabbed the buffers the collector will have to process all of them which is not going to work well for predictability. But nothing better comes to mind at this point. Btw, there are already periodic safepoints to do bias locking revocations, so may be it would make sense to piggyback on that.
Piggy backing on all the other safepoint operations might work if they happen frequently enough but I don't know if that 's the case. And as you, even then there will be times where we haven't had a safepoint for a while and will have a ton of buffers to process at the start of the pause.
It might be worth adding a suitable memory barrier to the G1 post write barrier and evaluating the throughput hit.
JohnC
Hi Igor,

Yeah, G1 has that facility right now. In fact you added it. :) When the number of completed buffers is below the green zone upper limit, none of the refinement threads are refining buffers. That is, the green zone upper limit is the number of buffers that we expect to be able to process during the GC without going over some percentage of the pause time (I think the default is 10%). When the number of buffers grows above the green zone upper limit, the refinement threads start processing the buffers in a stepped manner.

So during the safepoint we would process N - green-zone-upper-limit completed buffers. In fact we could have a watcher task that monitors the number of completed buffers and triggers a safepoint when the number of completed buffers becomes sufficiently high - say above the yellow zone upper limit.

That does away with the whole notion of concurrent refinement but will remove a lot of the nasty complicated code that gets executed by the mutators or refinement threads.

My main concern is that we would be potentially increasing the number and duration of non-GC safepoints, which cause issues with latency-sensitive apps. For those workloads that only care about 90% of the transactions this approach would probably be fine. We would need to evaluate the performance of each approach.

The card cache delays the processing of cards that have been dirtied multiple times - so it does act kind of like a buffer, reducing the potential for this issue.

JohnC

On 6/28/2013 12:47 PM, Igor Veresov wrote:
The impact on the next collection however can be bounded. Say, if you make it have a safepoint to reap the buffers when the number of buffer reaches $n$, that alone would put a cap on the potential pause incurred during the collection. The card cache currently has the same effect, sort of, right?
igor
On Jun 28, 2013, at 12:26 PM, John Cuthbertson <john.cuthbertson@oracle.com <mailto:john.cuthbertson@oracle.com>> wrote:
Hi Igor,
On 6/28/2013 9:47 AM, Igor Veresov wrote:
On Jun 28, 2013, at 7:08 AM, "Doerr, Martin" <martin.doerr@sap.com <mailto:martin.doerr@sap.com>> wrote:
Hi Igor, we didn’t find an easy and feasible way to ensure the ordering, either. Grabbing the buffers and cleaning the cards at safepoints might be the best solution.
Would anybody from the G1 team like to think about that?
I've been thinking about this issue on an off for the last few weeks when I get the time. I mentioned it to Vladimir a couple of times to get his input.
Maybe removing the barrier that flushes the store to the cardtable makes the problem more likely to occur. I guess the purpose of the barrier was exactly to avoid this problem (which should be working perfectly if the post barriers had StoreLoad barriers, too).
Yeah, but like you noted that would have a horrific effect on performance. So, it's probably best to bunch the work up to at least eliminate the need of extra work when, say, you're looping and storing to a limited working set (G1 uses the cardtable basically for that purpose). The safepoint approach will likely require more memory for buffers and the load will be spiky, and if the collection were to happen right after we grabbed the buffers the collector will have to process all of them which is not going to work well for predictability. But nothing better comes to mind at this point. Btw, there are already periodic safepoints to do bias locking revocations, so may be it would make sense to piggyback on that.
Piggy backing on all the other safepoint operations might work if they happen frequently enough but I don't know if that 's the case. And as you, even then there will be times where we haven't had a safepoint for a while and will have a ton of buffers to process at the start of the pause.
It might be worth adding a suitable memory barrier to the G1 post write barrier and evaluating the throughput hit.
JohnC
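John's green/yellow zone stepping and the proposed safepoint reaping might be sketched roughly as follows (illustrative names and thresholds, not the actual G1 flags or code):

```python
def active_refinement_threads(n_buffers, green, yellow, n_threads):
    """How many refinement threads are refining, given the completed-buffer
    count. Below the green zone upper limit, everything is left for the
    next GC pause; above it, threads wake up one step at a time."""
    if n_buffers <= green:
        return 0
    step = max(1, (yellow - green) // n_threads)
    return min(n_threads, (n_buffers - green + step - 1) // step)

def buffers_to_reap_at_safepoint(n_buffers, green):
    # Process everything beyond what the pause budget is expected to absorb.
    return max(0, n_buffers - green)
```

A watcher task, as John suggests, would call something like `buffers_to_reap_at_safepoint` once the count passes the yellow zone upper limit and trigger a safepoint.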
The mutator processing doesn't solve it. The card clearing event is still asynchronous with respect to possible mutations in other threads. While one mutator thread is processing buffers and clearing cards the other can sneak in and do the store to the same object that will go unnoticed. So I'm afraid it's either a store-load barrier, or we need to stop all mutator threads to prevent this race, or worse..

igor

On Jun 28, 2013, at 1:53 PM, John Cuthbertson <john.cuthbertson@oracle.com> wrote:
Hi Igor,
Yeah G1 has that facility right now. In fact you added it. :) When the number of completed buffers is below the green zone upper limit, none of the refinement threads are refining buffers. That is the green zone upper limit is number of buffers that we expect to be able to process during the GC without it going over some percentage of the pause time (I think the default is 10%). When the number of buffers grows above the green zone upper limit, the refinement threads start processing the buffers in stepped manner.
So during the safepoint we would process N - green-zone-upper-limit completed buffers. In fact we could have a watcher task that monitors the number of completed buffers and triggers a safepoint when the number of completed buffers becomes sufficiently high - say above the yellow-zone upper limit.
That does away with the whole notion of concurrent refinement but will remove a lot of the nasty complicated code that gets executed by the mutators or refinement threads.
My main concern is that the we would be potentially increasing the number and duration of non-GC safepoints which cause issues with latency sensitive apps. For those workloads that only care about 90% of the transactions this approach would probably be fine.
We would need to evaluate the performance of each approach.
The card cache delays the processing of cards that have been dirtied multiple times - so it does act kind of like a buffer reducing the potential for this issue.
JohnC
On 6/28/2013 12:47 PM, Igor Veresov wrote:
The impact on the next collection however can be bounded. Say, if you make it have a safepoint to reap the buffers when the number of buffer reaches $n$, that alone would put a cap on the potential pause incurred during the collection. The card cache currently has the same effect, sort of, right?
igor
On Jun 28, 2013, at 12:26 PM, John Cuthbertson <john.cuthbertson@oracle.com> wrote:
Hi Igor,
On 6/28/2013 9:47 AM, Igor Veresov wrote:
On Jun 28, 2013, at 7:08 AM, "Doerr, Martin" <martin.doerr@sap.com> wrote:
Hi Igor,
we didn’t find an easy and feasible way to ensure the ordering, either. Grabbing the buffers and cleaning the cards at safepoints might be the best solution.
Would anybody from the G1 team like to think about that?
I've been thinking about this issue on an off for the last few weeks when I get the time. I mentioned it to Vladimir a couple of times to get his input.
Maybe removing the barrier that flushes the store to the cardtable makes the problem more likely to occur. I guess the purpose of the barrier was exactly to avoid this problem (which should be working perfectly if the post barriers had StoreLoad barriers, too).
Yeah, but like you noted that would have a horrific effect on performance. So, it's probably best to bunch the work up to at least eliminate the need of extra work when, say, you're looping and storing to a limited working set (G1 uses the cardtable basically for that purpose). The safepoint approach will likely require more memory for buffers and the load will be spiky, and if the collection were to happen right after we grabbed the buffers the collector will have to process all of them which is not going to work well for predictability. But nothing better comes to mind at this point. Btw, there are already periodic safepoints to do bias locking revocations, so may be it would make sense to piggyback on that.
Piggy backing on all the other safepoint operations might work if they happen frequently enough but I don't know if that 's the case. And as you, even then there will be times where we haven't had a safepoint for a while and will have a ton of buffers to process at the start of the pause.
It might be worth adding a suitable memory barrier to the G1 post write barrier and evaluating the throughput hit.
JohnC
Hi Igor.

You misunderstood me. I meant that if we use safepoints to refine cards, all of the code that currently supports refinement by mutators can be removed. That's all.

JohnC

On 6/28/2013 4:02 PM, Igor Veresov wrote:
The mutator processing doesn't solve it. The card clearing event is still asynchronous with respect to possible mutations in other threads. While one mutator thread is processing buffers and clearing cards the other can sneak in and do the store to the same object that will go unnoticed. So I'm afraid it's either a store-load barrier, or we need to stop all mutator threads to prevent this race, or worse..
igor
On Jun 28, 2013, at 1:53 PM, John Cuthbertson <john.cuthbertson@oracle.com <mailto:john.cuthbertson@oracle.com>> wrote:
Hi Igor,
Yeah G1 has that facility right now. In fact you added it. :) When the number of completed buffers is below the green zone upper limit, none of the refinement threads are refining buffers. That is the green zone upper limit is number of buffers that we expect to be able to process during the GC without it going over some percentage of the pause time (I think the default is 10%). When the number of buffers grows above the green zone upper limit, the refinement threads start processing the buffers in stepped manner.
So during the safepoint we would process N - green-zone-upper-limit completed buffers. In fact we could have a watcher task that monitors the number of completed buffers and triggers a safepoint when the number of completed buffers becomes sufficiently high - say above the yellow-zone upper limit.
That does away with the whole notion of concurrent refinement but will remove a lot of the nasty complicated code that gets executed by the mutators or refinement threads.
My main concern is that the we would be potentially increasing the number and duration of non-GC safepoints which cause issues with latency sensitive apps. For those workloads that only care about 90% of the transactions this approach would probably be fine.
We would need to evaluate the performance of each approach.
The card cache delays the processing of cards that have been dirtied multiple times - so it does act kind of like a buffer reducing the potential for this issue.
JohnC
On 6/28/2013 12:47 PM, Igor Veresov wrote:
The impact on the next collection however can be bounded. Say, if you make it have a safepoint to reap the buffers when the number of buffer reaches $n$, that alone would put a cap on the potential pause incurred during the collection. The card cache currently has the same effect, sort of, right?
igor
On Jun 28, 2013, at 12:26 PM, John Cuthbertson <john.cuthbertson@oracle.com <mailto:john.cuthbertson@oracle.com>> wrote:
Hi Igor,
On 6/28/2013 9:47 AM, Igor Veresov wrote:
On Jun 28, 2013, at 7:08 AM, "Doerr, Martin" <martin.doerr@sap.com <mailto:martin.doerr@sap.com>> wrote:
Hi Igor, we didn’t find an easy and feasible way to ensure the ordering, either. Grabbing the buffers and cleaning the cards at safepoints might be the best solution.
Would anybody from the G1 team like to think about that?
I've been thinking about this issue on an off for the last few weeks when I get the time. I mentioned it to Vladimir a couple of times to get his input.
Maybe removing the barrier that flushes the store to the cardtable makes the problem more likely to occur. I guess the purpose of the barrier was exactly to avoid this problem (which should be working perfectly if the post barriers had StoreLoad barriers, too).
Yeah, but like you noted that would have a horrific effect on performance. So, it's probably best to bunch the work up to at least eliminate the need of extra work when, say, you're looping and storing to a limited working set (G1 uses the cardtable basically for that purpose). The safepoint approach will likely require more memory for buffers and the load will be spiky, and if the collection were to happen right after we grabbed the buffers the collector will have to process all of them which is not going to work well for predictability. But nothing better comes to mind at this point. Btw, there are already periodic safepoints to do bias locking revocations, so may be it would make sense to piggyback on that.
Piggy backing on all the other safepoint operations might work if they happen frequently enough but I don't know if that 's the case. And as you, even then there will be times where we haven't had a safepoint for a while and will have a ton of buffers to process at the start of the pause.
It might be worth adding a suitable memory barrier to the G1 post write barrier and evaluating the throughput hit.
JohnC
Oh, yes, that wouldn't work as it is with the safepoint scheme. On the other hand the mutators still may do the processing part, only by picking a buffer from the already snapshotted queue, not the current one. The original plan was to use mutator processing as a throttling mechanism. Maybe it's still useful? I don't know..

igor

On Jun 28, 2013, at 4:06 PM, John Cuthbertson <john.cuthbertson@oracle.com> wrote:
Hi Igor.
You misunderstood me. I meant that if we use safepoints to refine cards all of the code that currently supports refinement by mutators can be removed. That' all.
JohnC
On 6/28/2013 4:02 PM, Igor Veresov wrote:
The mutator processing doesn't solve it. The card clearing event is still asynchronous with respect to possible mutations in other threads. While one mutator thread is processing buffers and clearing cards the other can sneak in and do the store to the same object that will go unnoticed. So I'm afraid it's either a store-load barrier, or we need to stop all mutator threads to prevent this race, or worse..
igor
On Jun 28, 2013, at 1:53 PM, John Cuthbertson <john.cuthbertson@oracle.com> wrote:
Hi Igor,
Yeah G1 has that facility right now. In fact you added it. :) When the number of completed buffers is below the green zone upper limit, none of the refinement threads are refining buffers. That is the green zone upper limit is number of buffers that we expect to be able to process during the GC without it going over some percentage of the pause time (I think the default is 10%). When the number of buffers grows above the green zone upper limit, the refinement threads start processing the buffers in stepped manner.
So during the safepoint we would process N - green-zone-upper-limit completed buffers. In fact we could have a watcher task that monitors the number of completed buffers and triggers a safepoint when the number of completed buffers becomes sufficiently high - say above the yellow-zone upper limit.
That does away with the whole notion of concurrent refinement but will remove a lot of the nasty complicated code that gets executed by the mutators or refinement threads.
My main concern is that the we would be potentially increasing the number and duration of non-GC safepoints which cause issues with latency sensitive apps. For those workloads that only care about 90% of the transactions this approach would probably be fine.
We would need to evaluate the performance of each approach.
The card cache delays the processing of cards that have been dirtied multiple times - so it does act kind of like a buffer reducing the potential for this issue.
JohnC
On 6/28/2013 12:47 PM, Igor Veresov wrote:
The impact on the next collection however can be bounded. Say, if you make it have a safepoint to reap the buffers when the number of buffer reaches $n$, that alone would put a cap on the potential pause incurred during the collection. The card cache currently has the same effect, sort of, right?
igor
On Jun 28, 2013, at 12:26 PM, John Cuthbertson <john.cuthbertson@oracle.com> wrote:
Hi Igor,
On 6/28/2013 9:47 AM, Igor Veresov wrote:
On Jun 28, 2013, at 7:08 AM, "Doerr, Martin" <martin.doerr@sap.com> wrote:
Hi Igor,

we didn’t find an easy and feasible way to ensure the ordering, either. Grabbing the buffers and cleaning the cards at safepoints might be the best solution.
Would anybody from the G1 team like to think about that?
I've been thinking about this issue on an off for the last few weeks when I get the time. I mentioned it to Vladimir a couple of times to get his input.
Maybe removing the barrier that flushes the store to the cardtable makes the problem more likely to occur. I guess the purpose of the barrier was exactly to avoid this problem (which should be working perfectly if the post barriers had StoreLoad barriers, too).
Yeah, but like you noted that would have a horrific effect on performance. So, it's probably best to bunch the work up to at least eliminate the need of extra work when, say, you're looping and storing to a limited working set (G1 uses the cardtable basically for that purpose). The safepoint approach will likely require more memory for buffers and the load will be spiky, and if the collection were to happen right after we grabbed the buffers the collector will have to process all of them which is not going to work well for predictability. But nothing better comes to mind at this point. Btw, there are already periodic safepoints to do bias locking revocations, so may be it would make sense to piggyback on that.
Piggy backing on all the other safepoint operations might work if they happen frequently enough but I don't know if that 's the case. And as you, even then there will be times where we haven't had a safepoint for a while and will have a ton of buffers to process at the start of the pause.
It might be worth adding a suitable memory barrier to the G1 post write barrier and evaluating the throughput hit.
JohnC
Hi,

trying to revive that somewhat dying thread with some suggestions...

On Fri, 2013-06-28 at 16:02 -0700, Igor Veresov wrote:
The mutator processing doesn't solve it. The card clearing event is still asynchronous with respect to possible mutations in other threads. While one mutator thread is processing buffers and clearing cards the other can sneak in and do the store to the same object that will go unnoticed. So I'm afraid it's either a store-load barrier, or we need to stop all mutator threads to prevent this race, or worse..
One option to reduce the overhead of the store-load barrier is to only execute it if it is needed; actually a large part of the memory accesses are to the young gen. These accesses are going to be filtered out by the existing mechanism anyway, are always dirty, and never reset to clean.

An (e.g. per-region) auxiliary table could be used that indicates whether, for a particular region, we will actually need the card mark and the StoreLoad barrier or not. Outside of safepoints, entries in that table are only ever marked dirty, never reset to clean. This could be done without synchronization, I think, as in the worst case a thread will see from the card table that the corresponding region's cards are dirty (i.e. will be filtered anyway).

The impact of the additional cost in the barrier might be offset to some degree by the cache bandwidth saved by not accessing the card table (and avoiding the StoreLoad barrier for most accesses). The per-region table should be small (a byte per region would be sufficient).

Actually one could run tests where the actual card table lookup is completely disabled and mutations in the areas not covered by this table are always handled. If this area is sufficiently small, this could be an option.
On Jun 28, 2013, at 1:53 PM, John Cuthbertson <john.cuthbertson@oracle.com> wrote:
Hi Igor,
Yeah G1 has that facility right now. In fact you added it. :) When the number of completed buffers is below the green zone upper limit, none of the refinement threads are refining buffers. That is the green zone upper limit is number of buffers that we expect to be able to process during the GC without it going over some percentage of the pause time (I think the default is 10%). When the number of buffers grows above the green zone upper limit, the refinement threads start processing the buffers in stepped manner.
So during the safepoint we would process N - green-zone-upper-limit completed buffers. In fact we could have a watcher task that monitors the number of completed buffers and triggers a safepoint when the number of completed buffers becomes sufficiently high - say above the yellow-zone upper limit.
That does away with the whole notion of concurrent refinement but will remove a lot of the nasty complicated code that gets executed by the mutators or refinement threads.
I think it is possible to only reset the card table at the safepoint; the buffers that were filled before taking the snapshot can still be processed concurrently afterwards. (That is also Igor's suggestion from the other email I think). That may be somewhat expensive for very large heaps; but as you mention that effort could be limited by only cleaning the cards that have a completed buffer entry.
My main concern is that the we would be potentially increasing the number and duration of non-GC safepoints which cause issues with latency sensitive apps. For those workloads that only care about 90% of the transactions this approach would probably be fine.
We would need to evaluate the performance of each approach.
Hth, Thomas
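Thomas's per-region filter could look something like this in the post barrier (a Python sketch with made-up names; the real barrier is generated assembly, and the fence placement shown is only where one would be required under his scheme):

```python
DIRTY, CLEAN = "dirty", "clean"

def post_barrier(needs_barrier, card_table, region, card, queue):
    """needs_barrier: the per-region byte table Thomas proposes; entries
    are only ever set to True outside safepoints, never cleared."""
    if not needs_barrier[region]:
        return "filtered"        # e.g. young region: cards always dirty
    if card_table[card] == DIRTY:
        return "already-dirty"   # existing card-table filtering, no enqueue
    # Only on this remaining path would the StoreLoad fence be needed
    # before the card-table check, per the proposal.
    card_table[card] = DIRTY
    queue.append(card)
    return "enqueued"
```

The point of the table is that the common young-gen path takes the first branch and skips both the card-table access and the fence.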
I think I tried something like that a while ago (an additional filtering technique similar to what you're proposing). The problem is that even if the table entry is in cache, the additional check bloats the size of the already huge barrier code. But it doesn't mean you can't try again; maybe it's an adequate price to pay now for correctness. Although, the cardtable-based filtering is still going to be there for the old gen, right? So you'll need a StoreLoad barrier for it to work.

The alternative approach that I outlined before doesn't need any barrier modification, although it would require a bunch of runtime changes. It would work as follows:

- You execute normally, producing the buffers with modified cards. But the buffers produced will not be available to the conc refinement threads; they're just left to accumulate for a while.
- When you reach a certain number of buffers (say, in the "yellow zone"), you have a special safepoint during which you grab all the buffers and put them on another queue that is accessible to the conc refinement threads.
- You iterate over the cards in the buffers that you just grabbed and clean them (which solves the original problem); this can also be done very fast in parallel.
- After execution resumes, the conc refinement threads start processing buffers (from that second queue), using the existing card caching, which would become more important. Mutators can also participate (as they do now) if the number of buffers in the second queue is in the "red zone".

Maybe both approaches should be tried and evaluated..?

igor

On Jul 17, 2013, at 4:20 AM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
Hi,
trying to revive that somewhat dying thread with some suggestions...
On Fri, 2013-06-28 at 16:02 -0700, Igor Veresov wrote:
The mutator processing doesn't solve it. The card clearing event is still asynchronous with respect to possible mutations in other threads. While one mutator thread is processing buffers and clearing cards the other can sneak in and do the store to the same object that will go unnoticed. So I'm afraid it's either a store-load barrier, or we need to stop all mutator threads to prevent this race, or worse..
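The interleaving described above can be replayed deterministically by modeling the mutator's store buffer explicitly. This is an illustrative sketch only (the `Shared`/`replay` names are made up, not HotSpot code) of why, without a StoreLoad fence, the refinement thread can clean the card and still read the stale field value, leaving the second store unrecorded:

```cpp
#include <cstdint>

// Deterministic replay of the problematic interleaving (x.f = a; x.f = b;)
// with the mutator's store buffer modeled explicitly. This is a model of
// the hardware effect, not HotSpot code.

enum : std::uint8_t { kClean = 0, kDirty = 1 };

struct Shared {
  char field = 'a';            // x.f, last globally visible value
  std::uint8_t card = kDirty;  // dirtied and enqueued by the first store
};

// Returns true if the remembered set stays correct: either the mutator
// re-enqueued the card, or refinement observed 'b'.
bool replay(bool mutator_has_storeload_fence) {
  Shared m;
  bool enqueued_again = false;

  // Mutator: x.f = b; the store parks in its store buffer first.
  char buffered_val = 'b';
  bool store_buffered = true;

  if (mutator_has_storeload_fence) {
    m.field = buffered_val;  // the fence drains the store buffer...
    store_buffered = false;  // ...before the card load below
  }

  // Mutator: loads the card, still sees "dirty", so it does not enqueue.
  if (m.card != kDirty) enqueued_again = true;

  // Refinement thread: cleans the card, (fence), then reads the field.
  m.card = kClean;
  char refined_value = m.field;  // stale 'a' if the store is still buffered

  // The store buffer drains afterwards; the card stays clean and nothing
  // re-enqueues it, so 'b' is missed.
  if (store_buffered) m.field = buffered_val;

  return enqueued_again || refined_value == 'b';
}
```

Without the fence the function reports a missing remembered-set entry; with it, refinement sees 'b' even though the card was never re-enqueued.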
One option to reduce the overhead of the store-load barrier is to only execute it if it is needed; actually a large part of the memory accesses are to the young gen. These accesses are going to be filtered out by the existing mechanism anyway, are always dirty, and never reset to clean.
An (e.g. per-region) auxiliary table could be used that indicates whether, for a particular region, we will actually need the card mark and the StoreLoad barrier or not.
Outside of safepoints, entries to that table are only ever marked dirty, never reset to clean. This could be done without synchronization I think, as in the worst case a thread will see from the card table that the corresponding regions' cards are dirty (i.e. will be filtered anyway).
The impact of the additional cost in the barrier might be offset by the cache bandwidth saved by not accessing the card table to some degree (and avoiding the StoreLoad barrier for most accesses). The per-region table should be small (a byte per region would be sufficient).
Actually one could run tests where the actual card table lookup is completely disabled and mutations in the areas not covered by this table are just always handled. If this area is sufficiently small, this could be an option.
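The per-region auxiliary table could be sketched roughly as below; all names are illustrative (not HotSpot code), and the one-byte-per-region layout and set-only update rule follow the description above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the per-region auxiliary table: one byte per region, only ever
// set -- never cleared -- outside of safepoints, telling the post barrier
// whether the card mark plus StoreLoad fence is needed for stores into that
// region at all.

enum : std::uint8_t { kFenceNotNeeded = 0, kFenceNeeded = 1 };

struct RegionFilter {
  std::vector<std::uint8_t> flags;

  explicit RegionFilter(std::size_t regions)
      : flags(regions, kFenceNotNeeded) {}

  // Called when refinement may concurrently clean cards covering this
  // region from now on.
  void mark_fence_needed(std::size_t region) {
    flags[region] = kFenceNeeded;
  }

  // Racy reads are harmless: a stale kFenceNeeded at worst makes the
  // barrier do unnecessary work, matching the worst-case argument above.
  bool fence_needed(std::size_t region) const {
    return flags[region] == kFenceNeeded;
  }
};
```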
On Jun 28, 2013, at 1:53 PM, John Cuthbertson <john.cuthbertson@oracle.com> wrote:
Hi Igor,
Yeah, G1 has that facility right now. In fact you added it. :) When the number of completed buffers is below the green zone upper limit, none of the refinement threads are refining buffers. That is, the green zone upper limit is the number of buffers that we expect to be able to process during the GC without it going over some percentage of the pause time (I think the default is 10%). When the number of buffers grows above the green zone upper limit, the refinement threads start processing the buffers in a stepped manner.
So during the safepoint we would process N - green-zone-upper-limit completed buffers. In fact we could have a watcher task that monitors the number of completed buffers and triggers a safepoint when the number of completed buffers becomes sufficiently high - say above the yellow-zone upper limit.
That does away with the whole notion of concurrent refinement but will remove a lot of the nasty complicated code that gets executed by the mutators or refinement threads.
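The stepped activation policy described above could be sketched as follows; the green-zone limit, per-thread step, and thread count are illustrative parameters, not the actual HotSpot tuning values:

```cpp
#include <algorithm>

// How many refinement threads should be active for a given backlog of
// completed buffers: none at or below the green zone upper limit, then one
// more thread for each further 'step' buffers, capped at the thread count.
int active_refinement_threads(int completed_buffers, int green_limit,
                              int step, int max_threads) {
  if (completed_buffers <= green_limit) return 0;
  int needed = (completed_buffers - green_limit + step - 1) / step;
  return std::min(needed, max_threads);
}
```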
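Igor's two-queue checkpointing scheme could be sketched as below. All class and member names are illustrative (not HotSpot code), and the safepoint is modeled simply as a method called while no mutator runs:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Sketch of the two-queue checkpointing scheme. Mutators retire dirty-card
// buffers onto a 'pending' list that refinement threads never look at; a
// special safepoint publishes the pending buffers to a 'ready' queue and
// cleans their cards while all mutators are stopped, so no StoreLoad fence
// is needed in the barrier.

using Card = std::uint32_t;
enum : std::uint8_t { kClean = 0, kDirty = 1 };

struct CheckpointingQueues {
  std::vector<std::uint8_t> card_table;
  std::deque<std::vector<Card>> pending;  // invisible to refinement threads
  std::deque<std::vector<Card>> ready;    // refinement threads drain this

  explicit CheckpointingQueues(std::size_t cards)
      : card_table(cards, kClean) {}

  // Mutator post-barrier path: dirty the card and record it; already-dirty
  // cards are filtered, exactly as in the current barrier.
  void mutator_mark(Card c, std::vector<Card>& local_buf) {
    if (card_table[c] != kDirty) {
      card_table[c] = kDirty;
      local_buf.push_back(c);
    }
  }

  void retire_buffer(std::vector<Card> buf) {
    pending.push_back(std::move(buf));
  }

  // Run at the special safepoint: publish buffers and clean their cards.
  // Cleaning is race-free here because no mutator is executing.
  void checkpoint() {
    while (!pending.empty()) {
      for (Card c : pending.front()) card_table[c] = kClean;
      ready.push_back(std::move(pending.front()));
      pending.pop_front();
    }
  }
};
```

Any store after the checkpoint finds its card clean again, so it re-dirties and re-buffers it, which is exactly what closes the original race.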
Hi,
thanks for sharing your ideas. Queuing up buffers and releasing them for refinement at a safepoint sounds like a really good solution.
I believe that improvement of the G1 barriers is already planned; e.g. there's RFE 6816756. And it should be possible to port the C2 compiler CMS barrier elision (see GraphKit::write_barrier_post) to G1 as well. This could reduce the frequency of overflowing buffers.
(I'm not against experimenting with additional filtering code, but this seems to be kind of infamous at the moment, because people are already concerned about large barrier code.)
May I ask for a bug id or something which allows tracking of this issue? Hopefully, it can be addressed during development of hotspot 25.
Best regards,
Martin
-----Original Message-----
From: hotspot-gc-dev-bounces@openjdk.java.net [mailto:hotspot-gc-dev-bounces@openjdk.java.net] On Behalf Of Igor Veresov
Sent: Donnerstag, 18. Juli 2013 21:36
To: Thomas Schatzl
Cc: hotspot-gc-dev@openjdk.java.net; Braun, Matthias
Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin, On 2013-09-06 18:54, Doerr, Martin wrote:
Hi,
thanks for sharing your ideas. Queuing up buffers and releasing them for refinement at a safepoint sounds like a really good solution.
I believe that the improvement of G1 barriers is already planned. E.g. there's RFE 6816756. And it should be possible to port the C2 compiler CMS barrier elision (see GraphKit::write_barrier_post) for G1 as well. This could reduce the frequency of overflowing buffers.
(I'm not against experimenting with additional filtering code, but this seems to be kind of infamous at the moment, because people are already concerned about large barrier code.)
I have prototyped both a version of the filtering barrier and the "special safepoint" variant. The filtering barrier takes a few % of performance on jbb2013 and my prototype of the "special safepoint/checkpointing" has horrible (-60%) performance on jbb2013.
The checkpointing change needs a lot more work on tweaking the limits and policies for triggering the safepoint and checkpointing the buffers. I basically just wanted to get it to work without crashing and see a ballpark performance number.
I don't have a special preference for any of the possible solutions, but I'm not sure if I have the time to get the checkpointing variant into shape for JDK 8 Zero Bug Bounce, which is Oct 24th [1]. One possible approach would be to do the filtering change now and work on the checkpointing variant as a future task (or in parallel by someone else).
Webrevs (caution, wear safety glasses! The code is _not_ pretty):
http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/checkpointing/webrev/
http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/filtering/webrev/
May I ask for a bug id or something which allows tracking of this issue? Hopefully, it can be addressed during development of hotspot 25.
I am currently working on this issue under bug id 8014555. Unfortunately that bug's description contains internal information and is therefore not visible on bugs.sun.com. On the other hand, most of the information in the bug consists of analysis of the crashes and not any discussion about the actual memory ordering problem. In fact, I've not been able to prove that the crashes in the bug are caused by this problem, but if I run the test with any of my attempted fixes the crash does not happen.
/Mikael
[1] http://openjdk.java.net/projects/jdk8/milestones
Hi Mikael,
thanks for this information. We are glad that you're working on this issue, and we appreciate both of your proposals. I was hoping we could avoid memory barriers in the fast paths, but we can live with them as long as the performance penalty and the additional code size are not too bad. I like the card-table-based filtering of young objects.
Just an additional comment on this filtering technique: the membar does not need to be executed if we do mark&enqueue. If the case in which the barrier encounters clean cards occurs often, we could skip the membar. Here's a SPARC example:
  __ cmp_and_br_short(O2, CardTableModRefBS::g1_young_card_val(), Assembler::equal, Assembler::pt, young_card);
  __ cmp_and_br_short(O2, CardTableModRefBS::dirty_card_val(), Assembler::notEqual, Assembler::pt, not_already_dirty);
  __ membar(Assembler::Membar_mask_bits(Assembler::StoreLoad));
  ... reload
I guess it won't fix the performance penalty; I just wanted to share this idea with you.
Hopefully the checkpointing approach will perform better in the long term.
Best regards,
Martin
-----Original Message-----
From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com]
Sent: Montag, 9. September 2013 10:45
To: Doerr, Martin
Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias
Subject: Re: G1 question: concurrent cleaning of dirty cards
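For readers without the SPARC assembly at hand, the control flow Martin sketches might look like this in C++. This is an illustrative rendering only, not HotSpot's actual barrier; the card values and the seq_cst fence standing in for the StoreLoad membar are assumptions:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the membar-skipping filter: young cards are filtered outright,
// clean cards are marked and enqueued without any fence, and only the
// already-dirty path pays the StoreLoad fence plus a reload, since that is
// the path racing with concurrent card cleaning.

enum : std::uint8_t { kCleanCard = 0, kDirtyCard = 1, kYoungCard = 2 };

// Returns true if the card must be enqueued on a dirty card queue.
bool g1_post_barrier_mark(volatile std::uint8_t* card) {
  std::uint8_t v = *card;
  if (v == kYoungCard) return false;  // young gen: never refined, skip
  if (v != kDirtyCard) {              // clean: mark & enqueue, fence-free
    *card = kDirtyCard;
    return true;
  }
  // Already dirty: order the preceding oop store before re-checking the
  // card, then reload.
  std::atomic_thread_fence(std::memory_order_seq_cst);
  if (*card != kDirtyCard) {          // refinement cleaned it meanwhile
    *card = kDirtyCard;
    return true;
  }
  return false;                       // still dirty: already enqueued
}
```

This matches Mikael's follow-up concern: the fence-free clean path adds one more conditional branch in front of the membar, and the reload after the fence needs its own conditional as well.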
Martin, On 2013-09-09 12:35, Doerr, Martin wrote:
Hi Mikael,
thanks for this information. We are glad that you're working on this issue.
And we appreciate both of your proposals. I was hoping we could avoid memory barriers in the fast paths, but we can live with it as long as the performance penalty and the additional code size are not too bad. I like the card table based filtering of young objects.
Just an additional comment on this filtering technique: The membar does not need to get executed if we do mark&enqueue. If the case in which the barrier encounters clean cards occurs often, we could skip the membar. Here's a SPARC example:
  __ cmp_and_br_short(O2, CardTableModRefBS::g1_young_card_val(), Assembler::equal, Assembler::pt, young_card);
  __ cmp_and_br_short(O2, CardTableModRefBS::dirty_card_val(), Assembler::notEqual, Assembler::pt, not_already_dirty);
  __ membar(Assembler::Membar_mask_bits(Assembler::StoreLoad));
  ... reload
I guess it won't fix the performance penalty. I just wanted to share this idea with you.
Right, if the card value is clean_card_val we don't need to take the membar. On the other hand this adds another conditional branch before the membar in the barrier, should we then take another conditional branch depending on the reloaded value? I'm already stretching my abilities in poking around in the code generation parts of the VM but I could probably do some performance runs if you want to provide a patch to add the additional conditionals. I don't know if the trade-off is worth it or not.
Hopefully the checkpointing approach will perform better in the long term.
I agree, it would be nice to slim down the barriers instead of inflating them further. /Mikael
Best regards, Martin
-----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Montag, 9. September 2013 10:45 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin,
On 2013-09-06 18:54, Doerr, Martin wrote:
Hi,
thanks for sharing your ideas. Queuing up buffers and releasing them for refinement at a safepoint sounds like a really good solution.
I believe that the improvement of G1 barriers is already planned. E.g. there's RFE 6816756. And it should be possible to port the C2 compiler CMS barrier elision (see GraphKit::write_barrier_post) for G1 as well. This could reduce the frequency of overflowing buffers.
(I'm not against experimenting with additional filtering code, but this seems to be kind of infamous at the moment, because people are already concerned about large barrier code.)
I have prototyped both a version of the filtering barrier and the "special safepoint" variant.
The filtering barrier takes a few % of performance on jbb2013 and my prototype of the "special safepoint/checkpointing" has horrible (-60%) performance on jbb2013.
The checkpointing change needs a lot more work on tweaking the limits and policies for triggering the safepoint and checkpointing the buffers. I basically just wanted to get it to work without crashing and see a ballpark performance number.
I don't have a special preference for any of the possible solutions, but I'm not sure if I have the time to get the checkpointing variant into shape for JDK 8 Zero bug bounce, which is Oct 24th[1].
One possible approach would be to do the filtering change now and work on the checkpointing variant as a future task (or in parallel by someone else).
Webrevs (caution, wear safety glasses! The code is _not_ pretty): http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/checkpointing/webrev/ http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/filtering/webrev/
May I ask for a bug id or something which allows tracking of this issue? Hopefully, it can be addressed during development of hotspot 25.
I am currently working on this issue under bug id 8014555. Unfortunately that bug's description contains internal information and is therefore not visible on bugs.sun.com. On the other hand, most of the information in the bug consists of analysis of the crashes and not any discussion about the actual memory ordering problem. In fact, I've not been able to prove that the cause for the crashes in the bug are due to this problem, but if I run the test with any of my attempted fixes the crash does not happen.
/Mikael
[1] http://openjdk.java.net/projects/jdk8/milestones
Best regards, Martin
-----Original Message----- From: hotspot-gc-dev-bounces@openjdk.java.net [mailto:hotspot-gc-dev-bounces@openjdk.java.net] On Behalf Of Igor Veresov Sent: Donnerstag, 18. Juli 2013 21:36 To: Thomas Schatzl Cc: hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
I think I tried something like that a while a ago (additional filtering technique similar to what you're proposing). The problem is that even if the table entry is in cache the additional check bloat the size of already huge barrier code. But it doesn't mean you can't try again, may be it's a adequate price to pay now for correctness. Although, the cardtable-based filtering is still going to be there for the old gen, right? So you'll need a StoreLoad barrier for it to work.
The alternative approach that I outlined before doesn't need any barrier modification, although it would require a bunch of runtime changes. It would work as follows:
- You execute normally, producing the buffers with modified cards. But the buffers produced will not be available to the conc refinement threads; they're just allowed to accumulate for a while.
- When you reach a certain number of buffers (say, in the "yellow zone"), you have a special safepoint during which you grab all the buffers and put them on another queue that is accessible to the conc refinement threads.
- You iterate over the cards in the buffers that you just grabbed and clean them (which solves the original problem); this can also be done very fast in parallel.
- After execution resumes, the conc refinement threads start processing buffers (from that second queue), using the existing card caching, which would become more important. Mutators can also participate (as they do now) if the number of buffers in the second queue is in the "red zone".
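The two-queue scheme above can be sketched in a few lines of C++. This is only an illustration of the idea, not HotSpot code; all names, the buffer capacity, and the card encoding (0 = clean, 1 = dirty) are invented:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical sketch of the checkpointing scheme described above.
struct CheckpointedQueues {
  std::vector<uint8_t> card_table;            // 0 = clean, 1 = dirty
  std::deque<std::vector<size_t>> pending;    // filled by mutators, invisible to refinement
  std::deque<std::vector<size_t>> refinable;  // handed over at the checkpoint safepoint
  size_t buf_cap;
  size_t yellow_zone;

  CheckpointedQueues(size_t cards, size_t cap, size_t yellow)
      : card_table(cards, 0), buf_cap(cap), yellow_zone(yellow) {}

  // Mutator post-barrier slow path: dirty the card and enqueue it once.
  void enqueue(size_t card) {
    if (card_table[card] != 0) return;  // already dirty: no re-enqueue
    card_table[card] = 1;
    if (pending.empty() || pending.back().size() == buf_cap)
      pending.emplace_back();
    pending.back().push_back(card);
  }

  bool needs_checkpoint() const { return pending.size() >= yellow_zone; }

  // At the safepoint: clean the grabbed cards while all mutators are
  // stopped (so there is no concurrent clean-vs-dirty race), then
  // publish the buffers to the refinement threads.
  void checkpoint() {
    for (auto& buf : pending) {
      for (size_t card : buf) card_table[card] = 0;
      refinable.push_back(std::move(buf));
    }
    pending.clear();
  }
};
```

Because the cleaning happens with all mutators stopped, no mutator can observe a card mid-clean, which is the point of the whole exercise.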
Maybe both approaches should be tried and evaluated?
igor
On Jul 17, 2013, at 4:20 AM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
Hi,
trying to revive that somewhat dying thread with some suggestions...
On Fri, 2013-06-28 at 16:02 -0700, Igor Veresov wrote:
The mutator processing doesn't solve it. The card clearing event is still asynchronous with respect to possible mutations in other threads. While one mutator thread is processing buffers and clearing cards, another can sneak in and do a store to the same object that will go unnoticed. So I'm afraid it's either a store-load barrier, or we need to stop all mutator threads to prevent this race, or worse..
One option to reduce the overhead of the store-load barrier is to only execute it when it is needed; actually a large part of the memory accesses are to the young gen. These accesses are filtered out by the existing mechanism anyway: their cards are always dirty and are never reset to clean.
An (e.g. per-region) auxiliary table could be used that indicates whether, for a particular region, we actually need the card mark and the StoreLoad barrier or not.
Outside of safepoints, entries to that table are only ever marked dirty, never reset to clean. This could be done without synchronization I think, as in the worst case a thread will see from the card table that the corresponding regions' cards are dirty (i.e. will be filtered anyway).
The impact of the additional cost in the barrier might be offset by the cache bandwidth saved by not accessing the card table to some degree (and avoiding the StoreLoad barrier for most accesses). The per-region table should be small (a byte per region would be sufficient).
Actually one could run tests where the card table lookup is completely disabled and mutations in the areas not covered by this table are simply always handled. If this area is sufficiently small, this could be an option.
On Jun 28, 2013, at 1:53 PM, John Cuthbertson <john.cuthbertson@oracle.com> wrote:
Hi Igor,
Yeah G1 has that facility right now. In fact you added it. :) When the number of completed buffers is below the green zone upper limit, none of the refinement threads are refining buffers. That is, the green zone upper limit is the number of buffers that we expect to be able to process during the GC without it going over some percentage of the pause time (I think the default is 10%). When the number of buffers grows above the green zone upper limit, the refinement threads start processing the buffers in a stepped manner.
So during the safepoint we would process N - green-zone-upper-limit completed buffers. In fact we could have a watcher task that monitors the number of completed buffers and triggers a safepoint when the number of completed buffers becomes sufficiently high - say above the yellow-zone upper limit.
That does away with the whole notion of concurrent refinement but will remove a lot of the nasty complicated code that gets executed by the mutators or refinement threads.
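The zone arithmetic described above is trivial but worth stating precisely. A hedged restatement (function names invented; the real HotSpot policy code is more involved):

```cpp
#include <cassert>
#include <cstddef>

// At the safepoint we process everything above the green zone upper
// limit, leaving up to green_upper buffers for the GC pause itself.
size_t buffers_to_process_at_safepoint(size_t completed, size_t green_upper) {
  return completed > green_upper ? completed - green_upper : 0;
}

// The watcher task triggers a safepoint once the backlog of completed
// buffers exceeds the yellow zone upper limit.
bool watcher_should_trigger_safepoint(size_t completed, size_t yellow_upper) {
  return completed > yellow_upper;
}
```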
I think it is possible to only reset the card table at the safepoint; the buffers that were filled before taking the snapshot can still be processed concurrently afterwards.
(That is also Igor's suggestion from the other email I think).
That may be somewhat expensive for very large heaps; but as you mention that effort could be limited by only cleaning the cards that have a completed buffer entry.
My main concern is that we would potentially be increasing the number and duration of non-GC safepoints, which cause issues with latency-sensitive apps. For those workloads that only care about 90% of the transactions this approach would probably be fine.
We would need to evaluate the performance of each approach.
Hth, Thomas
Hi Mikael,

for performance measurements, only the graphKit part should be relevant. So you can try the code below, if you like.

We definitely need the reload and second comparison, because omitting the card marking is only safe if the card which has been loaded after the MemBarVolatile is clean. I guess the additional branch leads to more branch prediction misses, and it probably depends on the benchmark and processor whether it pays off or not.

Best regards,
Martin

__ if_then(card_val, BoolTest::ne, young_card); {
  // Omitting g1_mark_card is only allowed if sequentially consistent version of card is clean.
  Node* not_already_dirty = __ make_label(1);
  __ if_then(card_val, BoolTest::ne, dirty_card); {
    __ goto_(not_already_dirty);
  } __ end_if();
  sync_kit(ideal);
  insert_mem_bar(Op_MemBarVolatile, oop_store);
  __ sync_kit(this);
  card_val = __ load(__ ctrl(), card_adr, TypeInt::INT, T_BYTE, Compile::AliasIdxRaw);
  __ if_then(card_val, BoolTest::ne, dirty_card); {
    __ bind(not_already_dirty);
    g1_mark_card(ideal, card_adr, oop_store, alias_idx, index, index_adr, buffer, tf);
  } __ end_if();
} __ end_if();

-----Original Message-----
From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com]
Sent: Montag, 9. September 2013 16:32
To: Doerr, Martin
Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias
Subject: Re: G1 question: concurrent cleaning of dirty cards

Martin,

On 2013-09-09 12:35, Doerr, Martin wrote:
Hi Mikael,
thanks for this information. We are glad that you're working on this issue.
And we appreciate both of your proposals. I was hoping we could avoid memory barriers in the fast paths, but we can live with it as long as the performance penalty and the additional code size are not too bad. I like the card table based filtering of young objects.
Just an additional comment on this filtering technique: the membar does not need to get executed if we do mark&enqueue. If the case in which the barrier encounters clean cards occurs often, we could skip the membar. Here's a SPARC example:

__ cmp_and_br_short(O2, CardTableModRefBS::g1_young_card_val(), Assembler::equal, Assembler::pt, young_card);
__ cmp_and_br_short(O2, CardTableModRefBS::dirty_card_val(), Assembler::notEqual, Assembler::pt, not_already_dirty);
__ membar(Assembler::Membar_mask_bits(Assembler::StoreLoad));
... reload

I guess it won't fix the performance penalty. I just wanted to share this idea with you.
Right, if the card value is clean_card_val we don't need to take the membar. On the other hand, this adds another conditional branch before the membar in the barrier; should we then take another conditional branch depending on the reloaded value? I'm already stretching my abilities in poking around in the code generation parts of the VM, but I could probably do some performance runs if you want to provide a patch to add the additional conditionals. I don't know if the trade-off is worth it or not.
Hopefully the checkpointing approach will perform better in the long term.
I agree, it would be nice to slim down the barriers instead of inflating them further. /Mikael
Best regards, Martin
-----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Montag, 9. September 2013 10:45 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin,
On 2013-09-06 18:54, Doerr, Martin wrote:
Hi,
thanks for sharing your ideas. Queuing up buffers and releasing them for refinement at a safepoint sounds like a really good solution.
I believe that the improvement of G1 barriers is already planned. E.g. there's RFE 6816756. And it should be possible to port the C2 compiler CMS barrier elision (see GraphKit::write_barrier_post) for G1 as well. This could reduce the frequency of overflowing buffers.
(I'm not against experimenting with additional filtering code, but this seems to be kind of infamous at the moment, because people are already concerned about large barrier code.)
I have prototyped both a version of the filtering barrier and the "special safepoint" variant.
The filtering barrier costs a few % of performance on jbb2013, and my prototype of the "special safepoint/checkpointing" variant has horrible (-60%) performance on jbb2013.
The checkpointing change needs a lot more work on tweaking the limits and policies for triggering the safepoint and checkpointing the buffers. I basically just wanted to get it to work without crashing and see a ballpark performance number.
Martin, On 2013-09-10 16:30, Doerr, Martin wrote:
Hi Mikael,
for performance measurements, only the graphKit part should be relevant. So you can try the code below, if you like.
Thanks.
We definitely need the reload and second comparison, because omitting the card marking is only safe if the card which has been loaded after the MemBarVolatile is clean. I guess the additional branch leads to more branch prediction misses and it probably depends on the benchmark and processor if it pays off or not.
Agreed. I'll try it just out of curiosity. I have a few runs going so it'll probably be a few days before I get the results. /Mikael
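For readers not fluent in GraphKit, the reload-and-recheck control flow of the quoted barrier can be restated in plain C++. The card values and names below are invented for illustration (the actual HotSpot encodings differ); the logic matches the GraphKit snippet, with marking omitted only when the reloaded card is still dirty after the fence:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative card values, not the real HotSpot constants.
enum : uint8_t { kClean = 0, kDirty = 1, kYoung = 2 };

// Returns true if the post barrier must mark the card and enqueue it.
bool post_barrier_needs_mark(std::atomic<uint8_t>& card) {
  uint8_t v = card.load(std::memory_order_relaxed);
  if (v == kYoung) return false;  // young cards are never refined
  if (v != kDirty) return true;   // clean: mark & enqueue, no membar needed
  // Card looked dirty: StoreLoad fence, then reload. Marking may be
  // omitted only if the card is *still* dirty afterwards; a concurrent
  // refinement thread may have cleaned it in between.
  std::atomic_thread_fence(std::memory_order_seq_cst);
  return card.load(std::memory_order_relaxed) != kDirty;
}
```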
Hi Mikael,

great. Thanks for trying.

Btw.: The comment below should state "if ... card is dirty".

Martin

-----Original Message-----
From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com]
Sent: Dienstag, 10. September 2013 16:42
To: Doerr, Martin
Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias
Subject: Re: G1 question: concurrent cleaning of dirty cards

Martin,

On 2013-09-10 16:30, Doerr, Martin wrote:
Hi Mikael,
for performance measurements, only the graphKit part should be relevant. So you can try the code below, if you like.
Thanks.
We definitely need the reload and second comparison, because omitting the card marking is only safe if the card which has been loaded after the MemBarVolatile is clean. I guess the additional branch leads to more branch prediction misses and it probably depends on the benchmark and processor if it pays off or not.
Agreed. I'll try it just out of curiosity. I have a few runs going so it'll probably be a few days before I get the results. /Mikael
Best regards, Martin
__ if_then(card_val, BoolTest::ne, young_card); {
// Omitting g1_mark_card is only allowed if sequentially consistent version of card is clean. Node* not_already_dirty = __ make_label(1); __ if_then(card_val, BoolTest::ne, dirty_card); { __ goto_(not_already_dirty); } __ end_if();
sync_kit(ideal); insert_mem_bar(Op_MemBarVolatile, oop_store); __ sync_kit(this);
card_val = __ load(__ ctrl(), card_adr, TypeInt::INT, T_BYTE, Compile::AliasIdxRaw); __ if_then(card_val, BoolTest::ne, dirty_card); { __ bind(not_already_dirty); g1_mark_card(ideal, card_adr, oop_store, alias_idx, index, index_adr, buffer, tf); } __ end_if(); } __ end_if();
-----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Montag, 9. September 2013 16:32 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin,
On 2013-09-09 12:35, Doerr, Martin wrote:
Hi Mikael,
thanks for this information. We are glad that you're working on this issue.
And we appreciate both of your proposals. I was hoping we could avoid memory barriers in the fast paths, but we can live with it as long as the performance penalty and the additional code size are not too bad. I like the card table based filtering of young objects.
Just an additional comment on this filtering technique: The membar does not need to get executed if we do mark&enqueue. If the case in which the barrier encounters clean cards occurs often, we could skip the membar. Here's a SPARC example: __ cmp_and_br_short(O2, CardTableModRefBS::g1_young_card_val(), Assembler::equal, Assembler::pt, young_card); __ cmp_and_br_short(O2, CardTableModRefBS::dirty_card_val(), Assembler::notEqual, Assembler::pt, not_already_dirty); __ membar(Assembler::Membar_mask_bits(Assembler::StoreLoad)); ... reload I guess it won't fix the performance penalty. I just wanted to share this idea with you.
Right, if the card value is clean_card_val we don't need to take the membar. On the other hand this adds another conditional branch before the membar in the barrier, should we then take another conditional branch depending on the reloaded value? I'm already stretching my abilities in poking around in the code generation parts of the VM but I could probably do some performance runs if you want to provide a patch to add the additional conditionals.
I don't know if the trade-off is worth it or not.
Hopefully the checkpointing approach will perform better in the long term.
I agree, it would be nice to slim down the barriers instead of inflating them further.
/Mikael
Best regards, Martin
-----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Montag, 9. September 2013 10:45 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin,
On 2013-09-06 18:54, Doerr, Martin wrote:
Hi,
thanks for sharing your ideas. Queuing up buffers and releasing them for refinement at a safepoint sounds like a really good solution.
I believe that the improvement of G1 barriers is already planned. E.g. there's RFE 6816756. And it should be possible to port the C2 compiler CMS barrier elision (see GraphKit::write_barrier_post) for G1 as well. This could reduce the frequency of overflowing buffers.
(I'm not against experimenting with additional filtering code, but this seems to be kind of infamous at the moment, because people are already concerned about large barrier code.)
I have prototyped both a version of the filtering barrier and the "special safepoint" variant.
The filtering barrier costs a few % of performance on jbb2013, and my prototype of the "special safepoint/checkpointing" has horrible (-60%) performance on jbb2013.
The checkpointing change needs a lot more work on tweaking the limits and policies for triggering the safepoint and checkpointing the buffers. I basically just wanted to get it to work without crashing and see a ballpark performance number.
I don't have a special preference for any of the possible solutions, but I'm not sure if I have the time to get the checkpointing variant into shape for JDK 8 Zero bug bounce, which is Oct 24th[1].
One possible approach would be to do the filtering change now and work on the checkpointing variant as a future task (or in parallel by someone else).
Webrevs (caution, wear safety glasses! The code is _not_ pretty): http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/checkpointing/webrev/ http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/filtering/webrev/
May I ask for a bug id or something which allows tracking of this issue? Hopefully, it can be addressed during development of hotspot 25.
I am currently working on this issue under bug id 8014555. Unfortunately that bug's description contains internal information and is therefore not visible on bugs.sun.com. On the other hand, most of the information in the bug consists of analysis of the crashes and not any discussion about the actual memory ordering problem. In fact, I've not been able to prove that the cause for the crashes in the bug are due to this problem, but if I run the test with any of my attempted fixes the crash does not happen.
/Mikael
[1] http://openjdk.java.net/projects/jdk8/milestones
Best regards, Martin
-----Original Message----- From: hotspot-gc-dev-bounces@openjdk.java.net [mailto:hotspot-gc-dev-bounces@openjdk.java.net] On Behalf Of Igor Veresov Sent: Donnerstag, 18. Juli 2013 21:36 To: Thomas Schatzl Cc: hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
I think I tried something like that a while ago (an additional filtering technique similar to what you're proposing). The problem is that even if the table entry is in cache, the additional check bloats the size of the already huge barrier code. But it doesn't mean you can't try again; maybe it's an adequate price to pay now for correctness. Although, the cardtable-based filtering is still going to be there for the old gen, right? So you'll need a StoreLoad barrier for it to work.
The alternative approach that I outlined before doesn't need any barrier modification, although it would require a bunch of runtime changes. It would work as follows:
- You execute normally, producing the buffers with modified cards. But the buffers produced will not be available to the conc refinement threads; they're just left to accumulate for a while.
- When you reach a certain number of buffers (say, in the "yellow zone"), you have a special safepoint during which you grab all the buffers and put them on another queue that is accessible to the conc refinement threads.
- You iterate over the cards in the buffers that you just grabbed and clean them (which solves the original problem); this can also be done very fast in parallel.
- After the execution resumes, the conc refinement threads start processing buffers (from that second queue), using the existing card caching, which would become more important. Mutators can also participate (as they do now) if the number of buffers in the second queue is in the "red zone".
Maybe both approaches should be tried and evaluated?
igor
On Jul 17, 2013, at 4:20 AM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
Hi,
trying to revive that somewhat dying thread with some suggestions...
On Fri, 2013-06-28 at 16:02 -0700, Igor Veresov wrote:
The mutator processing doesn't solve it. The card clearing event is still asynchronous with respect to possible mutations in other threads. While one mutator thread is processing buffers and clearing cards, another can sneak in and do a store to the same object that will go unnoticed. So I'm afraid it's either a store-load barrier, or we need to stop all mutator threads to prevent this race, or worse...
One option to reduce the overhead of the store-load barrier is to only execute it if it is needed; actually a large part of the memory accesses are to the young gen. These accesses are going to be filtered out by the existing mechanism anyway, are always dirty, and never reset to clean.
An (e.g. per-region) auxiliary table could be used that indicates whether, for a particular region, we will actually need the card mark and the StoreLoad barrier or not.
Outside of safepoints, entries to that table are only ever marked dirty, never reset to clean. This could be done without synchronization I think, as in the worst case a thread will see from the card table that the corresponding regions' cards are dirty (i.e. will be filtered anyway).
The impact of the additional cost in the barrier might be offset by the cache bandwidth saved by not accessing the card table to some degree (and avoiding the StoreLoad barrier for most accesses). The per-region table should be small (a byte per region would be sufficient).
Actually one could run tests where the actual card table lookup is completely disabled and mutations in the areas not covered by this table are just always handled. If this area is sufficiently small, this could be an option.
On Jun 28, 2013, at 1:53 PM, John Cuthbertson <john.cuthbertson@oracle.com> wrote:
Hi Igor,
Yeah G1 has that facility right now. In fact you added it. :) When the number of completed buffers is below the green zone upper limit, none of the refinement threads are refining buffers. That is, the green zone upper limit is the number of buffers that we expect to be able to process during the GC without going over some percentage of the pause time (I think the default is 10%). When the number of buffers grows above the green zone upper limit, the refinement threads start processing the buffers in a stepped manner.
So during the safepoint we would process N - green-zone-upper-limit completed buffers. In fact we could have a watcher task that monitors the number of completed buffers and triggers a safepoint when the number of completed buffers becomes sufficiently high - say above the yellow-zone upper limit.
That does away with the whole notion of concurrent refinement but will remove a lot of the nasty complicated code that gets executed by the mutators or refinement threads.
I think it is possible to only reset the card table at the safepoint; the buffers that were filled before taking the snapshot can still be processed concurrently afterwards.
(That is also Igor's suggestion from the other email I think).
That may be somewhat expensive for very large heaps; but as you mention that effort could be limited by only cleaning the cards that have a completed buffer entry.
My main concern is that we would potentially be increasing the number and duration of non-GC safepoints, which cause issues with latency-sensitive apps. For those workloads that only care about 90% of the transactions this approach would probably be fine.
We would need to evaluate the performance of each approach.
Hth, Thomas
Martin,

I've got some measurement numbers now (on a dual Opteron 6278 running Solaris 11):

==============================================================================
g1fix-b43_baseline/:
Benchmark         Samples      Mean    Stdev  Geomean  Weight
specjbb2013            20   8140.05   204.79
HbIrMaxAttempted       20  23667.00   764.74
HbIrSettled            20  19918.60    98.97
criticalJOPS           20   8140.05   204.79
maxJOPS                20  17831.70   287.18
==============================================================================
g1fix-b43_filtering:
Benchmark         Samples      Mean    Stdev  %Diff      P  Significant
specjbb2013            20   8021.80   206.89  -1.45  0.077  *
HbIrMaxAttempted       20  22479.90  1902.28   5.02  0.016  *
HbIrSettled            20  19633.00   440.43   1.43  0.010  *
criticalJOPS           20   8021.80   206.89  -1.45  0.077  *
maxJOPS                20  17539.20   228.41  -1.64  0.001  Yes
==============================================================================
g1fix-b43_filtering_sap/:
Benchmark         Samples      Mean    Stdev  %Diff      P  Significant
specjbb2013            20   8025.25   241.57  -1.41  0.113  *
HbIrMaxAttempted       20  22848.75  1757.94   3.46  0.067  *
HbIrSettled            20  19640.65   481.17   1.40  0.020  *
criticalJOPS           20   8025.25   241.57  -1.41  0.113  *
maxJOPS                20  17609.85   208.58  -1.24  0.008  Yes
==============================================================================
* - Not Significant: A non-zero %Diff for the mean could be noise. If the %Diff is 0, an actual difference may still exist. In either case, more samples would be needed to detect an actual difference in sample means.

The last one is with your suggested addition. The difference is relatively small so I don't know if it's worth the additional complexity of another conditional branch.

/Mikael

On 09/10/2013 04:46 PM, Doerr, Martin wrote:
Hi Mikael,
great. Thanks for trying.
Btw.: The comment below should state "if ... card is dirty".
Martin
-----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Dienstag, 10. September 2013 16:42 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin,
On 2013-09-10 16:30, Doerr, Martin wrote:
Hi Mikael,
for performance measurements, only the graphKit part should be relevant. So you can try the code below, if you like. Thanks.
We definitely need the reload and second comparison, because omitting the card marking is only safe if the card which has been loaded after the MemBarVolatile is clean. I guess the additional branch leads to more branch prediction misses, and it probably depends on the benchmark and processor whether it pays off or not.

Agreed. I'll try it just out of curiosity. I have a few runs going so it'll probably be a few days before I get the results.
/Mikael
Best regards, Martin
__ if_then(card_val, BoolTest::ne, young_card); {
    // Omitting g1_mark_card is only allowed if sequentially consistent version of card is clean.
    Node* not_already_dirty = __ make_label(1);
    __ if_then(card_val, BoolTest::ne, dirty_card); {
      __ goto_(not_already_dirty);
    } __ end_if();

    sync_kit(ideal);
    insert_mem_bar(Op_MemBarVolatile, oop_store);
    __ sync_kit(this);

    card_val = __ load(__ ctrl(), card_adr, TypeInt::INT, T_BYTE, Compile::AliasIdxRaw);
    __ if_then(card_val, BoolTest::ne, dirty_card); {
      __ bind(not_already_dirty);
      g1_mark_card(ideal, card_adr, oop_store, alias_idx, index, index_adr, buffer, tf);
    } __ end_if();
  } __ end_if();
Hi Mikael, thanks for posting the results. I don't have a strong opinion on whether my proposal should make it into HS25 or not. Most important is that the issue gets resolved for now. Is your webrev http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/filtering/webrev/ the final version which is planned to get into JDK 8 Zero bug bounce? I think it's good. Best regards, Martin -----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Montag, 23. September 2013 17:08 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards Martin, I've got some measurements numbers now (on a dual Opteron 6278 running Solaris 11): ============================================================================== g1fix-b43_baseline/: Benchmark Samples Mean Stdev Geomean Weight specjbb2013 20 8140.05 204.79 HbIrMaxAttempted 20 23667.00 764.74 HbIrSettled 20 19918.60 98.97 criticalJOPS 20 8140.05 204.79 maxJOPS 20 17831.70 287.18 ============================================================================== g1fix-b43_filtering: Benchmark Samples Mean Stdev %Diff P Significant specjbb2013 20 8021.80 206.89 -1.45 0.077 * HbIrMaxAttempted 20 22479.90 1902.28 5.02 0.016 * HbIrSettled 20 19633.00 440.43 1.43 0.010 * criticalJOPS 20 8021.80 206.89 -1.45 0.077 * maxJOPS 20 17539.20 228.41 -1.64 0.001 Yes ============================================================================== g1fix-b43_filtering_sap/: Benchmark Samples Mean Stdev %Diff P Significant specjbb2013 20 8025.25 241.57 -1.41 0.113 * HbIrMaxAttempted 20 22848.75 1757.94 3.46 0.067 * HbIrSettled 20 19640.65 481.17 1.40 0.020 * criticalJOPS 20 8025.25 241.57 -1.41 0.113 * maxJOPS 20 17609.85 208.58 -1.24 0.008 Yes ============================================================================== * - Not Significant: A non-zero %Diff for the mean could be noise. 
If the %Diff is 0, an actual difference may still exist. In either case, more samples would be needed to detect an actual difference in sample means. The last one is with your suggested addition. The difference is relatively small so I don't know if it's worth the additional complexity of another conditional branch. /Mikael On 09/10/2013 04:46 PM, Doerr, Martin wrote:
Hi Mikael,
great. Thanks for trying.
Btw.: The comment below should state "if ... card is dirty".
Martin
-----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Dienstag, 10. September 2013 16:42 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin,
On 2013-09-10 16:30, Doerr, Martin wrote:
Hi Mikael,
for performance measurements, only the graphKit part should be relevant. So you can try the code below, if you like. Thanks.
We definitely need the reload and second comparison, because omitting the card marking is only safe if the card which has been loaded after the MemBarVolatile is clean. I guess the additional branch leads to more branch prediction misses and it probably depends on the benchmark and processor if it pays off or not. Agreed. I'll try it just out of curiosity. I have a few runs going so it'll probably be a few days before I get the results.
/Mikael
Best regards, Martin
__ if_then(card_val, BoolTest::ne, young_card); {
// Omitting g1_mark_card is only allowed if sequentially consistent version of card is clean. Node* not_already_dirty = __ make_label(1); __ if_then(card_val, BoolTest::ne, dirty_card); { __ goto_(not_already_dirty); } __ end_if();
sync_kit(ideal); insert_mem_bar(Op_MemBarVolatile, oop_store); __ sync_kit(this);
card_val = __ load(__ ctrl(), card_adr, TypeInt::INT, T_BYTE, Compile::AliasIdxRaw); __ if_then(card_val, BoolTest::ne, dirty_card); { __ bind(not_already_dirty); g1_mark_card(ideal, card_adr, oop_store, alias_idx, index, index_adr, buffer, tf); } __ end_if(); } __ end_if();
-----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Montag, 9. September 2013 16:32 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin,
On 2013-09-09 12:35, Doerr, Martin wrote:
Hi Mikael,
thanks for this information. We are glad that you're working on this issue.
And we appreciate both of your proposals. I was hoping we could avoid memory barriers in the fast paths, but we can live with it as long as the performance penalty and the additional code size are not too bad. I like the card table based filtering of young objects.
Just an additional comment on this filtering technique: The membar does not need to get executed if we do mark&enqueue. If the case in which the barrier encounters clean cards occurs often, we could skip the membar. Here's a SPARC example: __ cmp_and_br_short(O2, CardTableModRefBS::g1_young_card_val(), Assembler::equal, Assembler::pt, young_card); __ cmp_and_br_short(O2, CardTableModRefBS::dirty_card_val(), Assembler::notEqual, Assembler::pt, not_already_dirty); __ membar(Assembler::Membar_mask_bits(Assembler::StoreLoad)); ... reload I guess it won't fix the performance penalty. I just wanted to share this idea with you. Right, if the card value is clean_card_val we don't need to take the membar. On the other hand this adds another conditional branch before the membar in the barrier, should we then take another conditional branch depending on the reloaded value? I'm already stretching my abilities in poking around in the code generation parts of the VM but I could probably do some performance runs if you want to provide a patch to add the additional conditionals.
I don't know if the trade-off is worth it or not.
Hopefully the checkpointing approach will perform better in the long term. I agree, it would be nice to slim down the barriers instead of inflating them further.
/Mikael
Best regards, Martin
-----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Montag, 9. September 2013 10:45 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin,
On 2013-09-06 18:54, Doerr, Martin wrote:
Hi,
thanks for sharing your ideas. Queuing up buffers and releasing them for refinement at a safepoint sounds like a really good solution.
I believe that the improvement of G1 barriers is already planned. E.g. there's RFE 6816756. And it should be possible to port the C2 compiler CMS barrier elision (see GraphKit::write_barrier_post) for G1 as well. This could reduce the frequency of overflowing buffers.
(I'm not against experimenting with additional filtering code, but this seems to be kind of infamous at the moment, because people are already concerned about large barrier code.) I have prototyped both a version of the filtering barrier and the "special safepoint" variant.
The filtering barrier takes a few % of performance on jbb2013 and my prototype of the "special safepoint/checkpointing" has horrible (-60%) performance on jbb2013.
The checkpointing change needs a lot more work on tweaking the limits and policies for triggering the safepoint and checkpointing the buffers. I basically just wanted to get it to work without crashing and see a ballpark performance number.
I don't have a special preference for any of the possible solutions, but I'm not sure if I have the time to get the checkpointing variant into shape for JDK 8 Zero bug bounce, which is Oct 24th[1].
One possible approach would be to do the filtering change now and work on the checkpointing variant as a future task (or in parallel by someone else).
Webrevs (caution, wear safety glasses! The code is _not_ pretty): http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/checkpointing/webrev/ http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/filtering/webrev/
May I ask for a bug id or something which allows tracking of this issue? Hopefully, it can be addressed during development of hotspot 25. I am currently working on this issue under bug id 8014555. Unfortunately that bug's description contains internal information and is therefore not visible on bugs.sun.com. On the other hand, most of the information in the bug consists of analysis of the crashes and not any discussion about the actual memory ordering problem. In fact, I've not been able to prove that the cause for the crashes in the bug are due to this problem, but if I run the test with any of my attempted fixes the crash does not happen.
/Mikael
[1] http://openjdk.java.net/projects/jdk8/milestones
Best regards, Martin
-----Original Message----- From: hotspot-gc-dev-bounces@openjdk.java.net [mailto:hotspot-gc-dev-bounces@openjdk.java.net] On Behalf Of Igor Veresov Sent: Donnerstag, 18. Juli 2013 21:36 To: Thomas Schatzl Cc: hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
I think I tried something like that a while a ago (additional filtering technique similar to what you're proposing). The problem is that even if the table entry is in cache the additional check bloat the size of already huge barrier code. But it doesn't mean you can't try again, may be it's a adequate price to pay now for correctness. Although, the cardtable-based filtering is still going to be there for the old gen, right? So you'll need a StoreLoad barrier for it to work.
The alternative approach that I outlined before doesn't need any barrier modification, although would require a bunch of runtime changes. It would work as follows: - you execute normally, producing the buffers with modified cards. But the buffers produced will not be available to the conc refinement threads, they're just let accumulate for a while.. - when you reach a certain number of buffers (say, in the "yellow zone"), you have a special safepoint during which you grab all the buffers and put them on another queue, that is accessible to the conc refinement threads. - you iterate over the cards in the buffers that you just grabbed and clean them (which solves the original problem); also can be done very fast in parallel. - after the execution resumes the conc refinement threads start processing buffers (from that second queue), using the existing card caching which would become more important. Mutators can also participate (as they do now) if the number of the buffers in the second queue would be in the "red zone".
May be both approaches should be tried and evaluated..?
igor
On Jul 17, 2013, at 4:20 AM, Thomas Schatzl <thomas.schatzl@oracle.com> wrote:
Hi,
trying to revive that somewhat dying thread with some suggestions...
On Fri, 2013-06-28 at 16:02 -0700, Igor Veresov wrote:
> The mutator processing doesn't solve it. The card clearing event is
> still asynchronous with respect to possible mutations in other
> threads. While one mutator thread is processing buffers and clearing
> cards the other can sneak in and do the store to the same object that
> will go unnoticed. So I'm afraid it's either a store-load barrier, or
> we need to stop all mutator threads to prevent this race, or worse..

One option to reduce the overhead of the store-load barrier is to only execute it when it is needed; actually, a large part of the memory accesses go to the young gen. These accesses are going to be filtered out by the existing mechanism anyway: their cards are always dirty and never reset to clean.
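For reference, the ordering requirement being discussed can be sketched with C++11 atomics (a toy model, not HotSpot code; the `field`/`card` variables and values are invented stand-ins for an oop field and its card table entry). Without the seq_cst fences, the mutator's card load could be reordered before its field store, allowing the refiner to clean the card yet read the stale field value:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

std::atomic<uintptr_t> field{0};  // stands in for x.f
std::atomic<uint8_t>   card{1};   // 1 = dirty, 0 = clean

// Returns true if the card must be enqueued for refinement.
bool mutator_store(uintptr_t new_val) {
  field.store(new_val, std::memory_order_relaxed);      // (1) store oop
  std::atomic_thread_fence(std::memory_order_seq_cst);  // the StoreLoad in question
  if (card.load(std::memory_order_relaxed) == 1) {      // (2) card already dirty?
    return false;  // dirty: refiner is guaranteed to see our store, no enqueue
  }
  card.store(1, std::memory_order_relaxed);
  return true;     // clean: mark dirty and enqueue
}

uintptr_t refiner_process() {
  card.store(0, std::memory_order_relaxed);             // (1') clean the card
  std::atomic_thread_fence(std::memory_order_seq_cst);  // (2') StoreLoad
  return field.load(std::memory_order_relaxed);         // (3') read the field
}
```

With both fences in place, the seq_cst ordering guarantees that if the mutator observed "dirty" at (2), the refiner's later load at (3') sees the store from (1).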
An (e.g. per-region) auxiliary table could be used that indicates that for a particular region we will actually need the card mark and the storeload barrier or not.
Outside of safepoints, entries to that table are only ever marked dirty, never reset to clean. This could be done without synchronization I think, as in the worst case a thread will see from the card table that the corresponding regions' cards are dirty (i.e. will be filtered anyway).
The impact of the additional cost in the barrier might be offset by the cache bandwidth saved by not accessing the card table to some degree (and avoiding the StoreLoad barrier for most accesses). The per-region table should be small (a byte per region would be sufficient).
Actually, one could run tests where the actual card table lookup is completely disabled and mutations in the areas not covered by this table are simply always handled. If this area is sufficiently small, this could be an option.
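A minimal sketch of such a per-region filter table, as an editor's illustration of the proposal (the type name, one-byte-per-region layout, and region size are all assumptions):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RegionFilterTable {
  std::vector<uint8_t> needs_barrier;  // one byte per heap region
  size_t region_shift;                 // log2(region size in bytes)

  RegionFilterTable(size_t regions, size_t shift)
      : needs_barrier(regions, 0), region_shift(shift) {}

  size_t region_index(uintptr_t addr) const { return addr >> region_shift; }

  // Barrier fast path: if the region is not flagged, skip both the card
  // mark and the StoreLoad fence entirely.
  bool must_take_slow_path(uintptr_t field_addr) const {
    return needs_barrier[region_index(field_addr)] != 0;
  }

  // Outside safepoints entries only ever go 0 -> 1, never back, so racy
  // unsynchronized updates are benign: a thread seeing a stale 0 just
  // treats the store like a filtered (young-gen) one.
  void flag_region(size_t idx) { needs_barrier[idx] = 1; }
};
```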
On Jun 28, 2013, at 1:53 PM, John Cuthbertson <john.cuthbertson@oracle.com> wrote:
> Hi Igor,
>
> Yeah G1 has that facility right now. In fact you added it. :) When
> the number of completed buffers is below the green zone upper limit,
> none of the refinement threads are refining buffers. That is, the
> green zone upper limit is the number of buffers that we expect to be
> able to process during the GC without going over some percentage of
> the pause time (I think the default is 10%). When the number of
> buffers grows above the green zone upper limit, the refinement
> threads start processing the buffers in a stepped manner.
>
> So during the safepoint we would process N - green-zone-upper-limit
> completed buffers. In fact we could have a watcher task that
> monitors the number of completed buffers and triggers a safepoint
> when the number of completed buffers becomes sufficiently high - say
> above the yellow-zone upper limit.
>
> That does away with the whole notion of concurrent refinement but
> will remove a lot of the nasty complicated code that gets executed
> by the mutators or refinement threads.

I think it is possible to only reset the card table at the safepoint; the buffers that were filled before taking the snapshot can still be processed concurrently afterwards.
(That is also Igor's suggestion from the other email I think).
That may be somewhat expensive for very large heaps; but as you mention that effort could be limited by only cleaning the cards that have a completed buffer entry.
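The zone policy John describes can be condensed into a small decision function. This is an editor's sketch with invented names and thresholds, not the actual ConcurrentG1Refine logic:

```cpp
#include <cassert>
#include <cstddef>

enum class RefineAction { Idle, RefineStepped, MutatorAssist };

// Below the green upper limit refinement threads stay idle (the leftover
// buffers are cheap enough to process during the GC pause); above it they
// activate in steps; above the red limit mutators must help out themselves.
RefineAction refine_policy(size_t completed_buffers, size_t green, size_t red) {
  if (completed_buffers <= green) return RefineAction::Idle;
  if (completed_buffers <= red)   return RefineAction::RefineStepped;
  return RefineAction::MutatorAssist;
}

// The watcher task's trigger: request the special safepoint once the
// number of completed buffers exceeds the yellow-zone upper limit.
bool should_trigger_checkpoint(size_t completed_buffers, size_t yellow) {
  return completed_buffers > yellow;
}
```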
> My main concern is that we would be potentially increasing the
> number and duration of non-GC safepoints, which cause issues with
> latency-sensitive apps. For those workloads that only care about 90%
> of the transactions this approach would probably be fine.
>
> We would need to evaluate the performance of each approach.

Hth,
Thomas
Martin,

On 09/23/2013 07:11 PM, Doerr, Martin wrote:
Hi Mikael,
thanks for posting the results. I don't have a strong opinion on whether my proposal should make it into HS25 or not. Most important is that the issue gets resolved for now.
Is your webrev http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/filtering/webrev/ the final version which is planned to get into JDK 8 Zero bug bounce? I think it's good.
Thanks. I'd like to do a cleanup of the CardTable stuff first, moving some G1-specific code to G1's CardTable class before this change. I don't like the G1-specific code leaking into cardTableModRefBS.cpp.

/Mikael
Best regards, Martin
-----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Montag, 23. September 2013 17:08 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin,
I've got some measurement numbers now (on a dual Opteron 6278 running Solaris 11):
==============================================================================
g1fix-b43_baseline/:
Benchmark          Samples      Mean    Stdev  Geomean  Weight
specjbb2013             20   8140.05   204.79
HbIrMaxAttempted        20  23667.00   764.74
HbIrSettled             20  19918.60    98.97
criticalJOPS            20   8140.05   204.79
maxJOPS                 20  17831.70   287.18
==============================================================================
g1fix-b43_filtering:
Benchmark          Samples      Mean    Stdev  %Diff      P  Significant
specjbb2013             20   8021.80   206.89  -1.45  0.077  *
HbIrMaxAttempted        20  22479.90  1902.28   5.02  0.016  *
HbIrSettled             20  19633.00   440.43   1.43  0.010  *
criticalJOPS            20   8021.80   206.89  -1.45  0.077  *
maxJOPS                 20  17539.20   228.41  -1.64  0.001  Yes
==============================================================================
g1fix-b43_filtering_sap/:
Benchmark          Samples      Mean    Stdev  %Diff      P  Significant
specjbb2013             20   8025.25   241.57  -1.41  0.113  *
HbIrMaxAttempted        20  22848.75  1757.94   3.46  0.067  *
HbIrSettled             20  19640.65   481.17   1.40  0.020  *
criticalJOPS            20   8025.25   241.57  -1.41  0.113  *
maxJOPS                 20  17609.85   208.58  -1.24  0.008  Yes
==============================================================================
* - Not Significant: A non-zero %Diff for the mean could be noise. If the %Diff is 0, an actual difference may still exist. In either case, more samples would be needed to detect an actual difference in sample means.
The last one is with your suggested addition. The difference is relatively small, so I don't know if it's worth the additional complexity of another conditional branch.
/Mikael
On 09/10/2013 04:46 PM, Doerr, Martin wrote:
Hi Mikael,
great. Thanks for trying.
Btw.: The comment below should state "if ... card is dirty".
Martin
-----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Dienstag, 10. September 2013 16:42 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin,
On 2013-09-10 16:30, Doerr, Martin wrote:
Hi Mikael,
for performance measurements, only the graphKit part should be relevant. So you can try the code below, if you like. Thanks.
We definitely need the reload and the second comparison, because omitting the card marking is only safe if the card that has been loaded after the MemBarVolatile is dirty. I guess the additional branch leads to more branch-prediction misses, and it probably depends on the benchmark and processor whether it pays off or not.

Agreed. I'll try it just out of curiosity. I have a few runs going, so it'll probably be a few days before I get the results.
/Mikael
Best regards, Martin
  __ if_then(card_val, BoolTest::ne, young_card); {
    // Omitting g1_mark_card is only allowed if sequentially consistent version of card is clean.
    Node* not_already_dirty = __ make_label(1);
    __ if_then(card_val, BoolTest::ne, dirty_card); {
      __ goto_(not_already_dirty);
    } __ end_if();

    sync_kit(ideal);
    insert_mem_bar(Op_MemBarVolatile, oop_store);
    __ sync_kit(this);

    card_val = __ load(__ ctrl(), card_adr, TypeInt::INT, T_BYTE, Compile::AliasIdxRaw);
    __ if_then(card_val, BoolTest::ne, dirty_card); {
      __ bind(not_already_dirty);
      g1_mark_card(ideal, card_adr, oop_store, alias_idx, index, index_adr, buffer, tf);
    } __ end_if();
  } __ end_if();
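For readers less familiar with IdealKit, the control flow above can be rendered as plain C++ (an editor's sketch; the card values and the bool return standing in for mark-and-enqueue are invented). The key point is that the StoreLoad fence and the re-check run only on the path where the first load saw a dirty card:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint8_t kCleanCard = 0;
constexpr uint8_t kDirtyCard = 1;
constexpr uint8_t kYoungCard = 2;

// Returns true if the card was marked dirty and must be enqueued.
bool post_barrier(std::atomic<uint8_t>& card) {
  uint8_t v = card.load(std::memory_order_relaxed);
  if (v == kYoungCard) return false;  // young-gen store: fully filtered
  if (v != kDirtyCard) {              // clean: mark and enqueue, no fence needed
    card.store(kDirtyCard, std::memory_order_relaxed);
    return true;
  }
  // The first load saw a dirty card, which a refinement thread may be
  // cleaning concurrently. Skipping the enqueue is only safe if the card
  // is still dirty after a StoreLoad fence, so reload and check again.
  std::atomic_thread_fence(std::memory_order_seq_cst);
  if (card.load(std::memory_order_relaxed) != kDirtyCard) {
    card.store(kDirtyCard, std::memory_order_relaxed);
    return true;
  }
  return false;
}
```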
-----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Montag, 9. September 2013 16:32 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin,
On 2013-09-09 12:35, Doerr, Martin wrote:
Hi Mikael,
thanks for this information. We are glad that you're working on this issue.
And we appreciate both of your proposals. I was hoping we could avoid memory barriers in the fast paths, but we can live with it as long as the performance penalty and the additional code size are not too bad. I like the card table based filtering of young objects.
Just an additional comment on this filtering technique: the membar does not need to be executed if we do mark & enqueue. If the case in which the barrier encounters clean cards occurs often, we could skip the membar. Here's a SPARC example:

  __ cmp_and_br_short(O2, CardTableModRefBS::g1_young_card_val(), Assembler::equal, Assembler::pt, young_card);
  __ cmp_and_br_short(O2, CardTableModRefBS::dirty_card_val(), Assembler::notEqual, Assembler::pt, not_already_dirty);
  __ membar(Assembler::Membar_mask_bits(Assembler::StoreLoad));
  ... reload

I guess it won't fix the performance penalty. I just wanted to share this idea with you.

Right, if the card value is clean_card_val we don't need to take the membar. On the other hand, this adds another conditional branch before the membar in the barrier; should we then take another conditional branch depending on the reloaded value? I'm already stretching my abilities in poking around in the code-generation parts of the VM, but I could probably do some performance runs if you want to provide a patch to add the additional conditionals.
I don't know if the trade-off is worth it or not.
Hopefully the checkpointing approach will perform better in the long term.

I agree, it would be nice to slim down the barriers instead of inflating them further.
/Mikael
Best regards, Martin
-----Original Message----- From: Mikael Gerdin [mailto:mikael.gerdin@oracle.com] Sent: Montag, 9. September 2013 10:45 To: Doerr, Martin Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev@openjdk.java.net; Braun, Matthias Subject: Re: G1 question: concurrent cleaning of dirty cards
Martin,
On 2013-09-06 18:54, Doerr, Martin wrote:
Hi,
thanks for sharing your ideas. Queuing up buffers and releasing them for refinement at a safepoint sounds like a really good solution.
I believe that improving the G1 barriers is already planned, e.g. in RFE 6816756. And it should be possible to port the C2 compiler's CMS barrier elision (see GraphKit::write_barrier_post) to G1 as well. This could reduce the frequency of overflowing buffers.
(I'm not against experimenting with additional filtering code, but this seems to be kind of infamous at the moment, because people are already concerned about large barrier code.)

I have prototyped both a version of the filtering barrier and the "special safepoint" variant.
The filtering barrier takes a few % of performance on jbb2013 and my prototype of the "special safepoint/checkpointing" has horrible (-60%) performance on jbb2013.
The checkpointing change needs a lot more work on tweaking the limits and policies for triggering the safepoint and checkpointing the buffers. I basically just wanted to get it to work without crashing and see a ballpark performance number.
I don't have a special preference for any of the possible solutions, but I'm not sure if I have the time to get the checkpointing variant into shape for JDK 8 Zero bug bounce, which is Oct 24th[1].
One possible approach would be to do the filtering change now and work on the checkpointing variant as a future task (or in parallel by someone else).
Webrevs (caution, wear safety glasses! The code is _not_ pretty): http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/checkpointing/webrev/ http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/filtering/webrev/
May I ask for a bug id or something which allows tracking of this issue? Hopefully, it can be addressed during the development of HotSpot 25.

I am currently working on this issue under bug id 8014555. Unfortunately that bug's description contains internal information and is therefore not visible on bugs.sun.com. On the other hand, most of the information in the bug consists of analysis of the crashes and not any discussion about the actual memory ordering problem. In fact, I've not been able to prove that the cause of the crashes in the bug is this problem, but if I run the test with any of my attempted fixes the crash does not happen.
/Mikael
[1] http://openjdk.java.net/projects/jdk8/milestones
Best regards, Martin