RFR [15] 6394757: rev2: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes

Stuart Marks stuart.marks at oracle.com
Tue May 5 00:25:32 UTC 2020



On 5/1/20 10:41 PM, Jason Mehrens wrote:
> 1. I assume you are using "c instanceof List" instead of "!(c instanceof Set)" to correctly handle IdentitityHashMap.values()?  The instanceof List seems like safe choice but it is too bad we can still fool that check by wrapping List as an unmodifiableCollection.  If splitIterator().characteristics() could tell us that the collection used identity comparisons I think we would be able to determine if it was safe to swap the ordering in the general case as we could check for IDENTITY, SORTED, and comparator equality.

I'm using "instance List", not for the reason of IdentityHashMap.values() 
specifically (though that's a good example), but mainly to try to be minimal. 
While I think getting the semantics right takes priority, the change does impact 
performance, so I want to reintroduce the performance heuristic in the safest 
manner possible. Checking for !Set seems dangerous, as there might be any number 
of Collections out there that aren't Sets and that aren't consistent with 
equals. Checking for instanceof List seemed like the safest and most minimal 
thing to do.

In fact, I'm not even sure how safe List is! It's certainly possible for someone 
to have a List that isn't consistent with equals. Such a thing might violate the 
List contract, but that hasn't stopped people from implementing such things 
(outside the JDK).

I also toyed around with various additional tests for when it would be 
profitable to switch iteration to the smaller collection. This could be applied 
to a variety of consistent-with-equals set implementations in the JDK. The 
benefits of swapping the iteration is more modest in these cases compared to 
List, however. Avoiding repeated List.contains() helps avoid quasi-quadratic 
(O(M*N)) performance. Swapping iteration order of sets gets us only the smaller 
of O(M) vs O(N), which is still linear.

Also, as you noted, this heuristic is easily defeated by things like the 
collection wrappers.

> 2. Should code applied to HashSet.removeAll also be applied to HashMap.keySet().removeAll and HashMap.entrySet().removeAll?  Collections::newSetFromMap will end up having different behavior if it is not consistently applied.

I think the *results* will be consistent, but the *performance* might not be 
consistent.

But these cases could result in pathological performance if removeAll(list) is 
called, so I'm a bit concerned about them. If we agree that "instanceof List" is 
a reasonable heuristic, then I don't have any objection in principle to adding 
it to these locations as well. But I want to be careful about sprinkling ad hoc 
policies like this around the JDK.

I note that the erroneous size-based optimization was copied into, and therefore 
needs to be removed from ConcurrentSkipListSet (JDK-8218945) and the set views 
of CHM (JDK-8219069). I'd more concerned about getting these cleaned up in the 
short term.

> 3. Not to derail this thread but do think we need a new JIRA ticket to address Collections.disjoint?  Seems like it has similar sins of calling size and using "!(c2 instanceof Set)" but I don't want to muddy the waters by covering any solutions to that method in this thread.

Yeah I'm not sure what to do about Collections.disjoint().

Note that there are some statements in bug reports (possibly even made by me!) 
that assert that Collections.disjoint should be consistent with Set.removeAll. I 
don't think that's the case. As discussed elsewhere, removeAll() needs to be 
consistent about relying on the membership semantics of the argument collection.

As Collections.disjoint currently stands, it has the big "Care must be 
exercised" disclaimer in its spec, and it goes to some length to make a bunch of 
tests and vary the iteration accordingly. The programmer can get a speedup using 
disjoint() compared to copying a set and then calling retainAll(), provided 
sufficient care is taken. Maybe that's an acceptable tradeoff.

If you have any ideas about how disjoint()'s spec or implementation could be 
improved, you could file a JIRA issue, or I could file one if you sent me the info.

s'marks


More information about the core-libs-dev mailing list