RFR [15] 6394757: rev2: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes

Sat May 9 01:46:44 UTC 2020

Stuart,

I assume Deque/Queue would suffer the same issue as List.  It would be nice to some how ensure ArrayDeque.contains is not called as far as the heuristic goes.  Looks like PriorityQueue uses equals for contains but that is not to say there are not other queue implementations in the wild that do something different.

The more I think about it in general, it seems like your task would be easier if you could declare:
1. AbstractCollection.removeAll should assume this.contains is O(N) and iterate over this and check contains on arg.
2. AbstractSet.removeAll should iterate over argument and assume that this.contains/remove is at least O(log (N)) and assume argument.contains is O(N).
3. Concrete implementations of removeAll know the cost of their contains method.  If it is known to be O(N) then walk this otherwise walk the arg.
4. Include an example in removeAll that says if membership semantics differ between sets use: source.removeIf(e -> removals.contains(e)); or removals.forEach(e -> source.remove(e)); instead.  If they don't differ then use removeAll.

Given this has been around since JDK 1.3 would it be safe to assume that code that is mixing collections with different membership semantics is already working around this issue and therefore the risk is a bit lower in changing the spec?  Lots of concrete implementations already override removeAll anyway.

Jason

________________________________________
From: core-libs-dev <core-libs-dev-bounces at openjdk.java.net> on behalf of Stuart Marks <stuart.marks at oracle.com>
Sent: Monday, May 4, 2020 7:25 PM
To: Jason Mehrens
Cc: core-libs-dev
Subject: Re: RFR [15] 6394757: rev2: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes

On 5/1/20 10:41 PM, Jason Mehrens wrote:
> 1. I assume you are using "c instanceof List" instead of "!(c instanceof Set)" to correctly handle IdentitityHashMap.values()?  The instanceof List seems like safe choice but it is too bad we can still fool that check by wrapping List as an unmodifiableCollection.  If splitIterator().characteristics() could tell us that the collection used identity comparisons I think we would be able to determine if it was safe to swap the ordering in the general case as we could check for IDENTITY, SORTED, and comparator equality.

I'm using "instance List", not for the reason of IdentityHashMap.values()
specifically (though that's a good example), but mainly to try to be minimal.
While I think getting the semantics right takes priority, the change does impact
performance, so I want to reintroduce the performance heuristic in the safest
manner possible. Checking for !Set seems dangerous, as there might be any number
of Collections out there that aren't Sets and that aren't consistent with
equals. Checking for instanceof List seemed like the safest and most minimal
thing to do.

In fact, I'm not even sure how safe List is! It's certainly possible for someone
to have a List that isn't consistent with equals. Such a thing might violate the
List contract, but that hasn't stopped people from implementing such things
(outside the JDK).

I also toyed around with various additional tests for when it would be
profitable to switch iteration to the smaller collection. This could be applied
to a variety of consistent-with-equals set implementations in the JDK. The
benefits of swapping the iteration is more modest in these cases compared to
List, however. Avoiding repeated List.contains() helps avoid quasi-quadratic
(O(M*N)) performance. Swapping iteration order of sets gets us only the smaller
of O(M) vs O(N), which is still linear.

Also, as you noted, this heuristic is easily defeated by things like the
collection wrappers.

> 2. Should code applied to HashSet.removeAll also be applied to HashMap.keySet().removeAll and HashMap.entrySet().removeAll?  Collections::newSetFromMap will end up having different behavior if it is not consistently applied.

I think the *results* will be consistent, but the *performance* might not be
consistent.

But these cases could result in pathological performance if removeAll(list) is
called, so I'm a bit concerned about them. If we agree that "instanceof List" is
a reasonable heuristic, then I don't have any objection in principle to adding
it to these locations as well. But I want to be careful about sprinkling ad hoc
policies like this around the JDK.

I note that the erroneous size-based optimization was copied into, and therefore
needs to be removed from ConcurrentSkipListSet (JDK-8218945) and the set views
of CHM (JDK-8219069). I'd more concerned about getting these cleaned up in the
short term.

> 3. Not to derail this thread but do think we need a new JIRA ticket to address Collections.disjoint?  Seems like it has similar sins of calling size and using "!(c2 instanceof Set)" but I don't want to muddy the waters by covering any solutions to that method in this thread.

Yeah I'm not sure what to do about Collections.disjoint().

Note that there are some statements in bug reports (possibly even made by me!)
that assert that Collections.disjoint should be consistent with Set.removeAll. I
don't think that's the case. As discussed elsewhere, removeAll() needs to be
consistent about relying on the membership semantics of the argument collection.

As Collections.disjoint currently stands, it has the big "Care must be
exercised" disclaimer in its spec, and it goes to some length to make a bunch of
tests and vary the iteration accordingly. The programmer can get a speedup using
disjoint() compared to copying a set and then calling retainAll(), provided
sufficient care is taken. Maybe that's an acceptable tradeoff.

If you have any ideas about how disjoint()'s spec or implementation could be
improved, you could file a JIRA issue, or I could file one if you sent me the info.

s'marks