advice & review requested for 6896297, race condition in rmid causing JCK failure

newer
hg: jdk7/tl/jdk: 6896297: (rmi)...

Stuart Marks

7 Apr 2011 7 Apr '11

10:19 p.m.

Hi Core-Libs developers, I'd like to solicit some advice and discussion about this bug and a potential fix I'm cooking for it. Here is the bug report; it contains details about the problem and my analysis of it: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6896297 and here's a webrev of the fix I'm working on: http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.0/ Briefly, the problem is incorrect synchronization of groupTable, a HashMap field of an Activation object. The code mostly locks groupTable around any access to it. However, no such locking is done during serialization. If the groupTable is modified while it's being serialized, ConcurrentModificationException occurs. The obvious fix would be to use ConcurrentHashMap instead of Hashmap and to remove the external locking entirely. Unfortunately this will change the serialized representation of the Activation object, which I'm not sure is acceptable. Assuming that we can't change the serialized represenation, the alternative approach would be to make sure that locking is done properly during serialization. This is fairly easy to do by locking groupTable in a writeObject() method. Unfortunately, this introduces a deadlock. This deadlock occurs because, with this modification, there are now paths through the code that take locks in the opposite order. Specifically, the addLogRecord() method locks the log object and then (via serialization and the newly added writeObject() method) locks groupTable. However, the unregisterGroup() method locks groupTable and calls GroupEntry.unregisterGroup() which logs the event, which takes a lock on the log. After some analysis, I've determined that the call to GroupEntry.unregisterGroup() can be moved outside the synchronization of groupTable. This removes the ordering problem. With these fixes in place (the state of the webrev above) I can get several hundred successful test runs with neither ConcurrentModificationException nor any deadlocks. Of course, that doesn't mean the change is correct. :-) Questions: 1. Is there a requirement that the serialized form of Activation remain unchanged? If we can change it, we might as well just use ConcurrentHashMap instead of HashMap. 2. Is my lock ordering analysis correct? I've pored over the code, but not really being familiar with any RMI concepts, I'm not sure I have it right. I've written this analysis into a big comment I've added to the code. 3. There is also a potential concurrency issue with idTable, which is used similarly to groupTable. I haven't seen a test failure from it though. It seems sensible to add a lock for it in Activation.writeObject() though. I don't particularly like nesting the locks of idTable and groupTable, but locking them separately would require serializing the Activation object field by field instead of simply using defaultWriteObject(). Is this a reasonable approach? Thanks for any advice or comments. s'marks

Show replies by date

David Holmes

7 Apr 7 Apr

11:19 p.m.

New subject: advice & review requested for 6896297, race condition in rmid causing JCK failure

Hi Stuart, I can't answer your specific questions as I'm not familiar with the RMI infrastructure either. Taking the lock in writeObject and moving group.unregister out of the sync block to avoid deadlock seems reasonable. The main question to ask with such a move is whether the temporary inconsistency in state is something that can be observed and cause a problem. The lock ordering analysis seems reasonable. I do note that the startupLock protocol seems suspicious as there is a race with detection of the lock value. I assume init itself can never be invoked by more than one thread ;-) It may be that the object can only become accessible concurrently at some time during init (ie it gets published) in which case the protocol will work fine - but you really need a full understanding of the lifecycle of the objects to determine that. That aside, using ConcurrentHashMap would simplify the solution to the current problem. If you are concerned about the serialized form then you could define writeObject and readObject such that they convert between CHM and plain HM using the serialized-fields API. Cheers, David Stuart Marks said the following on 04/08/11 08:19:

...

Hi Core-Libs developers,

I'd like to solicit some advice and discussion about this bug and a potential fix I'm cooking for it. Here is the bug report; it contains details about the problem and my analysis of it:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6896297

and here's a webrev of the fix I'm working on:

http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.0/

Briefly, the problem is incorrect synchronization of groupTable, a HashMap field of an Activation object. The code mostly locks groupTable around any access to it. However, no such locking is done during serialization. If the groupTable is modified while it's being serialized, ConcurrentModificationException occurs.

The obvious fix would be to use ConcurrentHashMap instead of Hashmap and to remove the external locking entirely. Unfortunately this will change the serialized representation of the Activation object, which I'm not sure is acceptable.

Assuming that we can't change the serialized represenation, the alternative approach would be to make sure that locking is done properly during serialization. This is fairly easy to do by locking groupTable in a writeObject() method. Unfortunately, this introduces a deadlock.

This deadlock occurs because, with this modification, there are now paths through the code that take locks in the opposite order. Specifically, the addLogRecord() method locks the log object and then (via serialization and the newly added writeObject() method) locks groupTable. However, the unregisterGroup() method locks groupTable and calls GroupEntry.unregisterGroup() which logs the event, which takes a lock on the log.

After some analysis, I've determined that the call to GroupEntry.unregisterGroup() can be moved outside the synchronization of groupTable. This removes the ordering problem.

With these fixes in place (the state of the webrev above) I can get several hundred successful test runs with neither ConcurrentModificationException nor any deadlocks.

Of course, that doesn't mean the change is correct. :-)

Questions:

1. Is there a requirement that the serialized form of Activation remain unchanged? If we can change it, we might as well just use ConcurrentHashMap instead of HashMap.

2. Is my lock ordering analysis correct? I've pored over the code, but not really being familiar with any RMI concepts, I'm not sure I have it right. I've written this analysis into a big comment I've added to the code.

3. There is also a potential concurrency issue with idTable, which is used similarly to groupTable. I haven't seen a test failure from it though. It seems sensible to add a lock for it in Activation.writeObject() though. I don't particularly like nesting the locks of idTable and groupTable, but locking them separately would require serializing the Activation object field by field instead of simply using defaultWriteObject(). Is this a reasonable approach?

Thanks for any advice or comments.

s'marks

Stuart Marks

12 Apr 12 Apr

9:42 p.m.

New subject: advice & review requested for 6896297, race condition in rmid causing JCK failure

Hi David, Thanks for your notes and analysis. Good point about the temporary inconsistency between groupEntry and groupTable. With the change, there's a brief moment where a GroupEntry in the registered state can be observed to be absent from the groupTable, which wasn't possible before. I don't think this is a problem though. I can't find any places where operations on a GroupEntry need to look in the groupTable. (If they did exist, they'd have run into the same deadlock that I ran into in making this change.) Most paths through the code seem to first fetch an entry from groupTable and then operate on the entry. Somebody could remove and unregister the entry from groupTable between the fetch and the operation on the entry, but that can already occur in absence of this change. In general the style of this code seems to be to lock individual data structures piecemeal, so I wouldn't be surprised if there are already cases where they can be inconsistent temporarily. For example, it seems like there ought to be some relationship between the members of idTable and groupTable, yet they are locked and updated independently through quite different code paths. This is dangerously close to a line of reasoning that goes like "it's already inconsistent, so making it a little more inconsistent doesn't hurt" which I detest. However, I don't have any better ideas at the moment. Also note that switching from HM to CHM (regardless of what we do with the serialized form) has the same issue. Doing this will protect the maps themselves -- and avoid ConcurrentModificationException, the original cause of this problem -- but it won't preserve any consistency between the maps. Thanks for mentioning startupLock. It seemed suspicious to me too, but I think addressing those issues is out of scope for this bugfix. I'll talk about serialization more in my forthcoming reply to Alan. s'marks On 4/7/11 4:19 PM, David Holmes wrote:

...

Hi Stuart,

I can't answer your specific questions as I'm not familiar with the RMI infrastructure either. Taking the lock in writeObject and moving group.unregister out of the sync block to avoid deadlock seems reasonable. The main question to ask with such a move is whether the temporary inconsistency in state is something that can be observed and cause a problem.

The lock ordering analysis seems reasonable.

I do note that the startupLock protocol seems suspicious as there is a race with detection of the lock value. I assume init itself can never be invoked by more than one thread ;-) It may be that the object can only become accessible concurrently at some time during init (ie it gets published) in which case the protocol will work fine - but you really need a full understanding of the lifecycle of the objects to determine that.

That aside, using ConcurrentHashMap would simplify the solution to the current problem. If you are concerned about the serialized form then you could define writeObject and readObject such that they convert between CHM and plain HM using the serialized-fields API.

Cheers, David

Stuart Marks said the following on 04/08/11 08:19:

...
Hi Core-Libs developers,

I'd like to solicit some advice and discussion about this bug and a potential fix I'm cooking for it. Here is the bug report; it contains details about the problem and my analysis of it:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6896297

and here's a webrev of the fix I'm working on:

http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.0/

Briefly, the problem is incorrect synchronization of groupTable, a HashMap field of an Activation object. The code mostly locks groupTable around any access to it. However, no such locking is done during serialization. If the groupTable is modified while it's being serialized, ConcurrentModificationException occurs.

The obvious fix would be to use ConcurrentHashMap instead of Hashmap and to remove the external locking entirely. Unfortunately this will change the serialized representation of the Activation object, which I'm not sure is acceptable.

Assuming that we can't change the serialized represenation, the alternative approach would be to make sure that locking is done properly during serialization. This is fairly easy to do by locking groupTable in a writeObject() method. Unfortunately, this introduces a deadlock.

This deadlock occurs because, with this modification, there are now paths through the code that take locks in the opposite order. Specifically, the addLogRecord() method locks the log object and then (via serialization and the newly added writeObject() method) locks groupTable. However, the unregisterGroup() method locks groupTable and calls GroupEntry.unregisterGroup() which logs the event, which takes a lock on the log.

After some analysis, I've determined that the call to GroupEntry.unregisterGroup() can be moved outside the synchronization of groupTable. This removes the ordering problem.

With these fixes in place (the state of the webrev above) I can get several hundred successful test runs with neither ConcurrentModificationException nor any deadlocks.

Of course, that doesn't mean the change is correct. :-)

Questions:

1. Is there a requirement that the serialized form of Activation remain unchanged? If we can change it, we might as well just use ConcurrentHashMap instead of HashMap.

2. Is my lock ordering analysis correct? I've pored over the code, but not really being familiar with any RMI concepts, I'm not sure I have it right. I've written this analysis into a big comment I've added to the code.

3. There is also a potential concurrency issue with idTable, which is used similarly to groupTable. I haven't seen a test failure from it though. It seems sensible to add a lock for it in Activation.writeObject() though. I don't particularly like nesting the locks of idTable and groupTable, but locking them separately would require serializing the Activation object field by field instead of simply using defaultWriteObject(). Is this a reasonable approach?

Thanks for any advice or comments.

s'marks

Alan Bateman

11 Apr 11 Apr

2:46 p.m.

New subject: advice & review requested for 6896297, race condition in rmid causing JCK failure

Stuart Marks wrote:

...

Hi Core-Libs developers,

I'd like to solicit some advice and discussion about this bug and a potential fix I'm cooking for it. Here is the bug report; it contains details about the problem and my analysis of it:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6896297

and here's a webrev of the fix I'm working on:

http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.0/ I haven't worked on RMI and so I don't have a good insight into the implementation. I think (and Peter Jones can correct me) that changing the serialized form would just inhibit recovery of rmid from an existing log file. As David said, you could switch to CHM and implement readObject/writeObject to serialize/deserialize in the current form. That said, I looked your webrev and the changes look reasonable to me. I don't see anything obviously wrong with moving the unregisterGroup out of the synchronized block.

-Alan.

Stuart Marks

12 Apr 12 Apr

10:08 p.m.

New subject: advice & review requested for 6896297, race condition in rmid causing JCK failure

On 4/11/11 7:46 AM, Alan Bateman wrote:

...

Stuart Marks wrote:

...
Hi Core-Libs developers,

I'd like to solicit some advice and discussion about this bug and a potential fix I'm cooking for it. Here is the bug report; it contains details about the problem and my analysis of it:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6896297

and here's a webrev of the fix I'm working on:

http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.0/

...

I haven't worked on RMI and so I don't have a good insight into the implementation. I think (and Peter Jones can correct me) that changing the serialized form would just inhibit recovery of rmid from an existing log file. As David said, you could switch to CHM and implement readObject/writeObject to serialize/deserialize in the current form. That said, I looked your webrev and the changes look reasonable to me. I don't see anything obviously wrong with moving the unregisterGroup out of the synchronized block.

I didn't see anything obviously wrong either... my hope (or fear) was that somebody would point out something non-obviously wrong. :-) In a face-to-face conversation, Joe Darcy pointed out to me that serialization compatibility is over the types of the fields declared in the class, not the actual serial data in the object stream. The groupTable and idTable fields are both Maps, so we could change the implementation from HashMap to ConcurrentHashMap and it would be compatible with any implementation that has CHM, i.e. 1.5 or later. (It would probably be wise to have readObject reconstitute the tables as CHM in case they encountered HM in the serialized form.) The alternatives seem to be as follows: 1. Convert groupTable and idTable to ConcurrentHashMap and remove external synchronization, and otherwise make minimal changes to serialization code. This puts a CHM into the object stream, which should be compatible with anything using JDK 1.5 or later as noted above. 2. Convert groupTable and idTable to ConcurrentHashMap and remove external synchronization, and provide readObject/writeObject implementations to place ordinary HashMaps (as opposed to CHMs) in the object stream. This serialized form is more compatible, but it requires more serialization code to be developed, tested, and maintained. 3. Adjust the locking strategy to prevent ConcurrentModificationException but otherwise make no changes to serialization or data structures. This is what I've posted in the webrev. The key issue seems to be whether we want to preserve compatibility of the serialized form. If we care about compatibility of the serialized form, #3 is probably the safest. If somebody comes along and tells us that we don't have to worry about serial compatibility of this stuff at all, then #1 is probably the best since it simplifies the code a lot. In the absence of any further information, I'm inclined to go with #3, since I've already developed and tested it and it's basically ready to go. Any further thoughts, anybody? s'marks

Alan Bateman

13 Apr 13 Apr

2:54 p.m.

New subject: advice & review requested for 6896297, race condition in rmid causing JCK failure

Stuart Marks wrote:

...

:

The key issue seems to be whether we want to preserve compatibility of the serialized form. If we care about compatibility of the serialized form, #3 is probably the safest. If somebody comes along and tells us that we don't have to worry about serial compatibility of this stuff at all, then #1 is probably the best since it simplifies the code a lot. I think (and Peter can say for sure) that the compatibility issue is really only for the case that rmid needs to recover from an existing log file. Without more insight then I tend to agree that #3 is the best for now.

-Alan.

Peter Jones

14 Apr 14 Apr

6:14 a.m.

New subject: advice & review requested for 6896297, race condition in rmid causing JCK failure

Stuart, A couple of comments so far: I don't think that there would be a serialization compatibility problem with changing Activation.groupTable to have a ConcurrentHashMap instead of a HashMap. The field's declared type is Map, so the deserialization process would be able to assign either class to the field, and the code doesn't otherwise care about the Map implementation class. An rmid's state created with CHM could not be recoverable with JDK 1.4, of course, but I don't think that that's a concern. The fix would not help an rmid whose state gets recovered from an earlier version created with HM, unless a replacement (HM->CHM) was also done upon deserialization. More troubling I think is the higher-level issue suggested by this ConcurrentModificationException. Thread B (in the 6896297 Evaluation) is attempting to write a "snapshot" of the entire persistent state of the Activation instance, which presumably should be consistent with future "updates" written to the log, while Thread A is concurrently attempting to mutate that state (before writing an update for its mutation). It seems highly doubtful that potentially including an uncertain amount of this concurrent mutation in the written snapshot is safe. I might expect a coarser lock to be used across all such mutations, together with their associated log writes (at which point groupTable wouldn't need to be concurrent). The approach at your webrev below seems OK to address the immediate ConcurrentModificationException, but (unless I'm missing something) the higher-level issue with snapshot correctness described above would remain. Cheers, -- Peter P.S. This seems somewhat related to http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4945726 On Apr 7, 2011, at 6:19 PM, Stuart Marks wrote:

...

Hi Core-Libs developers,

I'd like to solicit some advice and discussion about this bug and a potential fix I'm cooking for it. Here is the bug report; it contains details about the problem and my analysis of it:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6896297

and here's a webrev of the fix I'm working on:

http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.0/

Briefly, the problem is incorrect synchronization of groupTable, a HashMap field of an Activation object. The code mostly locks groupTable around any access to it. However, no such locking is done during serialization. If the groupTable is modified while it's being serialized, ConcurrentModificationException occurs.

The obvious fix would be to use ConcurrentHashMap instead of Hashmap and to remove the external locking entirely. Unfortunately this will change the serialized representation of the Activation object, which I'm not sure is acceptable.

Assuming that we can't change the serialized represenation, the alternative approach would be to make sure that locking is done properly during serialization. This is fairly easy to do by locking groupTable in a writeObject() method. Unfortunately, this introduces a deadlock.

This deadlock occurs because, with this modification, there are now paths through the code that take locks in the opposite order. Specifically, the addLogRecord() method locks the log object and then (via serialization and the newly added writeObject() method) locks groupTable. However, the unregisterGroup() method locks groupTable and calls GroupEntry.unregisterGroup() which logs the event, which takes a lock on the log.

After some analysis, I've determined that the call to GroupEntry.unregisterGroup() can be moved outside the synchronization of groupTable. This removes the ordering problem.

With these fixes in place (the state of the webrev above) I can get several hundred successful test runs with neither ConcurrentModificationException nor any deadlocks.

Of course, that doesn't mean the change is correct. :-)

Questions:

1. Is there a requirement that the serialized form of Activation remain unchanged? If we can change it, we might as well just use ConcurrentHashMap instead of HashMap.

2. Is my lock ordering analysis correct? I've pored over the code, but not really being familiar with any RMI concepts, I'm not sure I have it right. I've written this analysis into a big comment I've added to the code.

3. There is also a potential concurrency issue with idTable, which is used similarly to groupTable. I haven't seen a test failure from it though. It seems sensible to add a lock for it in Activation.writeObject() though. I don't particularly like nesting the locks of idTable and groupTable, but locking them separately would require serializing the Activation object field by field instead of simply using defaultWriteObject(). Is this a reasonable approach?

Thanks for any advice or comments.

s'marks

Stuart Marks

11:01 p.m.

New subject: advice & review requested for 6896297, race condition in rmid causing JCK failure

Hi Peter, Thanks for your comments on this issue. I understand from Alan that you have some expertise in RMI; it's most welcome here. It's good to hear that serialization compatibility isn't that big of a deal. This makes things easier. It would seem reasonable to investigate replacing HashMap with ConcurrentHashMap, which would remove a fair bit of the external locking, and reduce or eliminate the need for a complex locking hierarchy along with a big comment that describes it. This will also make it easier to retrofit a coarser locking strategy if necessary; see below. Converting instances of HM to CHM if necessary upon deserialization is probably a good idea and isn't too difficult. The larger issue of snapshot consistency is indeed troubling. I'd agree that the ConcurrentModificationException is a symptom of a larger problem and that making it go away (either with a different locking strategy or using CHM) is mostly merely suppressing the symptom without addressing the larger issue. The immediate need, though, is to fix the JCK failure, so I suspect what I'll need to do is to push a fix for the CME and handle the resolution of any larger issue separately. I'll investigate an alternative fix that uses CHM instead of modified external locking, and I'll post an updated webrev. s'marks On 4/13/11 11:14 PM, Peter Jones wrote:

...

Stuart,

A couple of comments so far:

I don't think that there would be a serialization compatibility problem with changing Activation.groupTable to have a ConcurrentHashMap instead of a HashMap. The field's declared type is Map, so the deserialization process would be able to assign either class to the field, and the code doesn't otherwise care about the Map implementation class. An rmid's state created with CHM could not be recoverable with JDK 1.4, of course, but I don't think that that's a concern. The fix would not help an rmid whose state gets recovered from an earlier version created with HM, unless a replacement (HM->CHM) was also done upon deserialization.

More troubling I think is the higher-level issue suggested by this ConcurrentModificationException. Thread B (in the 6896297 Evaluation) is attempting to write a "snapshot" of the entire persistent state of the Activation instance, which presumably should be consistent with future "updates" written to the log, while Thread A is concurrently attempting to mutate that state (before writing an update for its mutation). It seems highly doubtful that potentially including an uncertain amount of this concurrent mutation in the written snapshot is safe. I might expect a coarser lock to be used across all such mutations, together with their associated log writes (at which point groupTable wouldn't need to be concurrent).

The approach at your webrev below seems OK to address the immediate ConcurrentModificationException, but (unless I'm missing something) the higher-level issue with snapshot correctness described above would remain.

Cheers,

-- Peter

P.S. This seems somewhat related to http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4945726

On Apr 7, 2011, at 6:19 PM, Stuart Marks wrote:

...
Hi Core-Libs developers,

I'd like to solicit some advice and discussion about this bug and a potential fix I'm cooking for it. Here is the bug report; it contains details about the problem and my analysis of it:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6896297

and here's a webrev of the fix I'm working on:

http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.0/

Briefly, the problem is incorrect synchronization of groupTable, a HashMap field of an Activation object. The code mostly locks groupTable around any access to it. However, no such locking is done during serialization. If the groupTable is modified while it's being serialized, ConcurrentModificationException occurs.

The obvious fix would be to use ConcurrentHashMap instead of Hashmap and to remove the external locking entirely. Unfortunately this will change the serialized representation of the Activation object, which I'm not sure is acceptable.

Assuming that we can't change the serialized represenation, the alternative approach would be to make sure that locking is done properly during serialization. This is fairly easy to do by locking groupTable in a writeObject() method. Unfortunately, this introduces a deadlock.

This deadlock occurs because, with this modification, there are now paths through the code that take locks in the opposite order. Specifically, the addLogRecord() method locks the log object and then (via serialization and the newly added writeObject() method) locks groupTable. However, the unregisterGroup() method locks groupTable and calls GroupEntry.unregisterGroup() which logs the event, which takes a lock on the log.

After some analysis, I've determined that the call to GroupEntry.unregisterGroup() can be moved outside the synchronization of groupTable. This removes the ordering problem.

With these fixes in place (the state of the webrev above) I can get several hundred successful test runs with neither ConcurrentModificationException nor any deadlocks.

Of course, that doesn't mean the change is correct. :-)

Questions:

1. Is there a requirement that the serialized form of Activation remain unchanged? If we can change it, we might as well just use ConcurrentHashMap instead of HashMap.

2. Is my lock ordering analysis correct? I've pored over the code, but not really being familiar with any RMI concepts, I'm not sure I have it right. I've written this analysis into a big comment I've added to the code.

3. There is also a potential concurrency issue with idTable, which is used similarly to groupTable. I haven't seen a test failure from it though. It seems sensible to add a lock for it in Activation.writeObject() though. I don't particularly like nesting the locks of idTable and groupTable, but locking them separately would require serializing the Activation object field by field instead of simply using defaultWriteObject(). Is this a reasonable approach?

Thanks for any advice or comments.

s'marks

Stuart Marks

19 Apr 19 Apr

10:35 p.m.

New subject: advice & review requested for 6896297, race condition in rmid causing JCK failure

Hi all, Please review an updated webrev for this bug: http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.1/ Using ConcurrentHashMap is much nicer in many ways, and it seems to resolve the JCK failures at least as well as the previous fix did. This does nothing to remedy the issue of snapshot consistency, which seems to be covered by 4945726, which we'll postpone (again) to a future release. s'marks On 4/14/11 4:01 PM, Stuart Marks wrote:

...

Hi Peter,

Thanks for your comments on this issue. I understand from Alan that you have some expertise in RMI; it's most welcome here.

It's good to hear that serialization compatibility isn't that big of a deal. This makes things easier. It would seem reasonable to investigate replacing HashMap with ConcurrentHashMap, which would remove a fair bit of the external locking, and reduce or eliminate the need for a complex locking hierarchy along with a big comment that describes it. This will also make it easier to retrofit a coarser locking strategy if necessary; see below.

Converting instances of HM to CHM if necessary upon deserialization is probably a good idea and isn't too difficult.

The larger issue of snapshot consistency is indeed troubling. I'd agree that the ConcurrentModificationException is a symptom of a larger problem and that making it go away (either with a different locking strategy or using CHM) is mostly merely suppressing the symptom without addressing the larger issue. The immediate need, though, is to fix the JCK failure, so I suspect what I'll need to do is to push a fix for the CME and handle the resolution of any larger issue separately.

I'll investigate an alternative fix that uses CHM instead of modified external locking, and I'll post an updated webrev.

s'marks

On 4/13/11 11:14 PM, Peter Jones wrote:

...
Stuart,

A couple of comments so far:

I don't think that there would be a serialization compatibility problem with changing Activation.groupTable to have a ConcurrentHashMap instead of a HashMap. The field's declared type is Map, so the deserialization process would be able to assign either class to the field, and the code doesn't otherwise care about the Map implementation class. An rmid's state created with CHM could not be recoverable with JDK 1.4, of course, but I don't think that that's a concern. The fix would not help an rmid whose state gets recovered from an earlier version created with HM, unless a replacement (HM->CHM) was also done upon deserialization.

More troubling I think is the higher-level issue suggested by this ConcurrentModificationException. Thread B (in the 6896297 Evaluation) is attempting to write a "snapshot" of the entire persistent state of the Activation instance, which presumably should be consistent with future "updates" written to the log, while Thread A is concurrently attempting to mutate that state (before writing an update for its mutation). It seems highly doubtful that potentially including an uncertain amount of this concurrent mutation in the written snapshot is safe. I might expect a coarser lock to be used across all such mutations, together with their associated log writes (at which point groupTable wouldn't need to be concurrent).

The approach at your webrev below seems OK to address the immediate ConcurrentModificationException, but (unless I'm missing something) the higher-level issue with snapshot correctness described above would remain.

Cheers,

-- Peter

P.S. This seems somewhat related to http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4945726

On Apr 7, 2011, at 6:19 PM, Stuart Marks wrote:

...
Hi Core-Libs developers,

I'd like to solicit some advice and discussion about this bug and a potential fix I'm cooking for it. Here is the bug report; it contains details about the problem and my analysis of it:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6896297

and here's a webrev of the fix I'm working on:

http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.0/

Briefly, the problem is incorrect synchronization of groupTable, a HashMap field of an Activation object. The code mostly locks groupTable around any access to it. However, no such locking is done during serialization. If the groupTable is modified while it's being serialized, ConcurrentModificationException occurs.

The obvious fix would be to use ConcurrentHashMap instead of Hashmap and to remove the external locking entirely. Unfortunately this will change the serialized representation of the Activation object, which I'm not sure is acceptable.

Assuming that we can't change the serialized represenation, the alternative approach would be to make sure that locking is done properly during serialization. This is fairly easy to do by locking groupTable in a writeObject() method. Unfortunately, this introduces a deadlock.

This deadlock occurs because, with this modification, there are now paths through the code that take locks in the opposite order. Specifically, the addLogRecord() method locks the log object and then (via serialization and the newly added writeObject() method) locks groupTable. However, the unregisterGroup() method locks groupTable and calls GroupEntry.unregisterGroup() which logs the event, which takes a lock on the log.

After some analysis, I've determined that the call to GroupEntry.unregisterGroup() can be moved outside the synchronization of groupTable. This removes the ordering problem.

With these fixes in place (the state of the webrev above) I can get several hundred successful test runs with neither ConcurrentModificationException nor any deadlocks.

Of course, that doesn't mean the change is correct. :-)

Questions:

1. Is there a requirement that the serialized form of Activation remain unchanged? If we can change it, we might as well just use ConcurrentHashMap instead of HashMap.

2. Is my lock ordering analysis correct? I've pored over the code, but not really being familiar with any RMI concepts, I'm not sure I have it right. I've written this analysis into a big comment I've added to the code.

3. There is also a potential concurrency issue with idTable, which is used similarly to groupTable. I haven't seen a test failure from it though. It seems sensible to add a lock for it in Activation.writeObject() though. I don't particularly like nesting the locks of idTable and groupTable, but locking them separately would require serializing the Activation object field by field instead of simply using defaultWriteObject(). Is this a reasonable approach?

Thanks for any advice or comments.

s'marks

Alan Bateman

20 Apr 20 Apr

4 p.m.

New subject: advice & review requested for 6896297, race condition in rmid causing JCK failure

Stuart Marks wrote:

...

Hi all,

Please review an updated webrev for this bug:

http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.1/

Using ConcurrentHashMap is much nicer in many ways, and it seems to resolve the JCK failures at least as well as the previous fix did. This does nothing to remedy the issue of snapshot consistency, which seems to be covered by 4945726, which we'll postpone (again) to a future release.

This is nicer and apologies for raising the concern about serialization compatibility - I incorrectly thought that these tables were declared as HashMaps. At L242 you might want to include a comment to remind a future maintainer why it's a zero length array rather than using size(). -Alan.

Stuart Marks

5:50 p.m.

New subject: advice & review requested for 6896297, race condition in rmid causing JCK failure

On 4/20/11 9:00 AM, Alan Bateman wrote:

...

Stuart Marks wrote:

...
Hi all,

Please review an updated webrev for this bug:

http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.1/

Using ConcurrentHashMap is much nicer in many ways, and it seems to resolve the JCK failures at least as well as the previous fix did. This does nothing to remedy the issue of snapshot consistency, which seems to be covered by 4945726, which we'll postpone (again) to a future release.

This is nicer and apologies for raising the concern about serialization compatibility - I incorrectly thought that these tables were declared as HashMaps.

At L242 you might want to include a comment to remind a future maintainer why it's a zero length array rather than using size().

Yes, good idea; I had puzzled over this a bit myself and indeed I had gotten it wrong the first time. I'll add a comment and push the fix. s'marks

Peter Jones

6:37 p.m.

New subject: advice & review requested for 6896297, race condition in rmid causing JCK failure

Hi Stuart, On Apr 19, 2011, at 6:35 PM, Stuart Marks wrote:

...

Please review an updated webrev for this bug:

http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.1/

Using ConcurrentHashMap is much nicer in many ways, and it seems to resolve the JCK failures at least as well as the previous fix did.

These changes look good to me.

...

This does nothing to remedy the issue of snapshot consistency, which seems to be covered by 4945726, which we'll postpone (again) to a future release.

...

On 4/14/11 4:01 PM, Stuart Marks wrote:

...
The larger issue of snapshot consistency is indeed troubling. I'd agree that the ConcurrentModificationException is a symptom of a larger problem and that making it go away (either with a different locking strategy or using CHM) is mostly merely suppressing the symptom without addressing the larger issue. The immediate need, though, is to fix the JCK failure, so I suspect what I'll need to do is to push a fix for the CME and handle the resolution of any larger issue separately.

Understood, makes sense. Cheers, -- Peter

5437

Age (days ago)

5450

Last active (days ago)

List overview

Download

11 comments

4 participants

participants (4)

Alan Bateman
David Holmes
Peter Jones
Stuart Marks

advice & review requested for 6896297, race condition in rmid causing JCK failure

tags

participants (4)