RFR [15] 6394757: rev2: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes

Fri May 1 20:01:53 UTC 2020

Hi all,

I'm taking another swing at this one. I've taken a minimal approach compared to 
my previous proposals.

Briefly, AbstractSet.removeAll switches from iterating this collection and 
calling contains() on its argument, to iterating the argument and calling 
this.contains(), if the argument collection is smaller than this collection. 
This attempt at optimization is incorrect, because this collection and the 
argument collection might have different semantics for contains().

There is a confounding performance problem, which is that if the argument is a 
List, contains() is generally linear, which can result in pathologically slow 
(O(m*n)) performance. Because of the iteration-switching above, this performance 
problem can appear and disappear based on the relative sizes, which is startling.

The fix for the semantic problem is simply to remove the attempted optimization 
from AbstractSet. This comports with the specification of Collection.removeAll 
and Set.removeAll, which imply that contains() is called on the argument 
collection. This allows sets to inherit the implementation in 
AbstractCollection.removeAll. In addition, this ensures that removeAll is now 
always the complement of retainAll (as it should be), which it sometimes was not 
when the optimization was in place.

IdentityHashMap's keySet and entrySet views were broken by this optimization, so 
they had to override removeAll with copies of the implementation from 
AbstractCollection. Since they can now inherit from AbstractCollection, these 
overrides have been removed.

This leaves a potential performance problem with removeAll when the argument is 
a List. To mitigate this, HashSet.removeAll switches to iterating the argument 
if the argument is a List. This is safe, as both HashSet and List use the same 
semantics for contains() (at least, as far as I know).

I've opted not to pursue size-based optimizations at this time, since they 
provide only incremental benefit, whereas the pathological performance problem 
mentioned above is the primary issue. Also, it's actually quite difficult to 
determine when it's safe to switch iteration.

Finally, I've added performance notes to the specifications of containsAll(), 
removeAll(), and retainAll(), warning about potential performance problems that 
can occur with repeated calls to contains() made by these methods.

Bug:

     https://bugs.openjdk.java.net/browse/JDK-6394757

Webrev:

     http://cr.openjdk.java.net/~smarks/reviews/6394757/webrev.2/

Previous discussions:

     http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-July/007125.html

     http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058140.html

     http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058378.html

     http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-May/060007.html

     http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-May/060147.html

Thanks,

s'marks