advice & review requested for 6896297, race condition in rmid causing JCK failure
Stuart Marks
stuart.marks at oracle.com
Tue Apr 12 21:42:23 UTC 2011
Hi David,
Thanks for your notes and analysis. Good point about the temporary
inconsistency between groupEntry and groupTable. With the change, there's a
brief moment where a GroupEntry in the registered state can be observed to be
absent from the groupTable, which wasn't possible before. I don't think this is
a problem though. I can't find any places where operations on a GroupEntry need
to look in the groupTable. (If they did exist, they'd have run into the same
deadlock that I ran into in making this change.) Most paths through the code
seem to first fetch an entry from groupTable and then operate on the entry.
Somebody could remove and unregister the entry from groupTable between the
fetch and the operation on the entry, but that can already occur in absence of
this change.
In general the style of this code seems to be to lock individual data
structures piecemeal, so I wouldn't be surprised if there are already cases
where they can be inconsistent temporarily. For example, it seems like there
ought to be some relationship between the members of idTable and groupTable,
yet they are locked and updated independently through quite different code
paths. This is dangerously close to a line of reasoning that goes like "it's
already inconsistent, so making it a little more inconsistent doesn't hurt"
which I detest. However, I don't have any better ideas at the moment.
Also note that switching from HM to CHM (regardless of what we do with the
serialized form) has the same issue. Doing this will protect the maps
themselves -- and avoid ConcurrentModificationException, the original cause of
this problem -- but it won't preserve any consistency between the maps.
Thanks for mentioning startupLock. It seemed suspicious to me too, but I think
addressing those issues is out of scope for this bugfix.
I'll talk about serialization more in my forthcoming reply to Alan.
s'marks
On 4/7/11 4:19 PM, David Holmes wrote:
> Hi Stuart,
>
> I can't answer your specific questions as I'm not familiar with the RMI
> infrastructure either. Taking the lock in writeObject and moving
> group.unregister out of the sync block to avoid deadlock seems reasonable. The
> main question to ask with such a move is whether the temporary inconsistency in
> state is something that can be observed and cause a problem.
>
> The lock ordering analysis seems reasonable.
>
> I do note that the startupLock protocol seems suspicious as there is a race
> with detection of the lock value. I assume init itself can never be invoked by
> more than one thread ;-) It may be that the object can only become accessible
> concurrently at some time during init (ie it gets published) in which case the
> protocol will work fine - but you really need a full understanding of the
> lifecycle of the objects to determine that.
>
> That aside, using ConcurrentHashMap would simplify the solution to the current
> problem. If you are concerned about the serialized form then you could define
> writeObject and readObject such that they convert between CHM and plain HM
> using the serialized-fields API.
>
> Cheers,
> David
>
> Stuart Marks said the following on 04/08/11 08:19:
>> Hi Core-Libs developers,
>>
>> I'd like to solicit some advice and discussion about this bug and a potential
>> fix I'm cooking for it. Here is the bug report; it contains details about the
>> problem and my analysis of it:
>>
>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6896297
>>
>> and here's a webrev of the fix I'm working on:
>>
>> http://cr.openjdk.java.net/~smarks/reviews/6896297/webrev.0/
>>
>> Briefly, the problem is incorrect synchronization of groupTable, a HashMap
>> field of an Activation object. The code mostly locks groupTable around any
>> access to it. However, no such locking is done during serialization. If the
>> groupTable is modified while it's being serialized,
>> ConcurrentModificationException occurs.
>>
>> The obvious fix would be to use ConcurrentHashMap instead of Hashmap and to
>> remove the external locking entirely. Unfortunately this will change the
>> serialized representation of the Activation object, which I'm not sure is
>> acceptable.
>>
>> Assuming that we can't change the serialized represenation, the alternative
>> approach would be to make sure that locking is done properly during
>> serialization. This is fairly easy to do by locking groupTable in a
>> writeObject() method. Unfortunately, this introduces a deadlock.
>>
>> This deadlock occurs because, with this modification, there are now paths
>> through the code that take locks in the opposite order. Specifically, the
>> addLogRecord() method locks the log object and then (via serialization and
>> the newly added writeObject() method) locks groupTable. However, the
>> unregisterGroup() method locks groupTable and calls
>> GroupEntry.unregisterGroup() which logs the event, which takes a lock on the
>> log.
>>
>> After some analysis, I've determined that the call to
>> GroupEntry.unregisterGroup() can be moved outside the synchronization of
>> groupTable. This removes the ordering problem.
>>
>> With these fixes in place (the state of the webrev above) I can get several
>> hundred successful test runs with neither ConcurrentModificationException nor
>> any deadlocks.
>>
>> Of course, that doesn't mean the change is correct. :-)
>>
>> Questions:
>>
>> 1. Is there a requirement that the serialized form of Activation remain
>> unchanged? If we can change it, we might as well just use ConcurrentHashMap
>> instead of HashMap.
>>
>> 2. Is my lock ordering analysis correct? I've pored over the code, but not
>> really being familiar with any RMI concepts, I'm not sure I have it right.
>> I've written this analysis into a big comment I've added to the code.
>>
>> 3. There is also a potential concurrency issue with idTable, which is used
>> similarly to groupTable. I haven't seen a test failure from it though. It
>> seems sensible to add a lock for it in Activation.writeObject() though. I
>> don't particularly like nesting the locks of idTable and groupTable, but
>> locking them separately would require serializing the Activation object field
>> by field instead of simply using defaultWriteObject(). Is this a reasonable
>> approach?
>>
>> Thanks for any advice or comments.
>>
>> s'marks
>>
More information about the core-libs-dev
mailing list