RFR [15] 6394757: rev2: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes

Mon May 4 23:19:15 UTC 2020

Hi Jens,

This is a situation where intuition can easily lead one astray. Indeed, I 
initially made the same assumption [1] that x.removeAll(y) should essentially 
call x.remove() on each element of y, as if y::forEach(x::remove). This would 
imply that y is iterated and that x's membership semantics would be used. 
However, after some further discussion, research, and analysis, I changed my 
mind. [2] This is somewhat counterintuitive, but there are good reasons for 
going the opposite direction.

First, consider the specification of Collection.removeAll:

     Removes all of this collection's elements that are also contained in
     the specified collection[...].

This is somewhat oddly worded, even stilted. Why doesn't it say something simple 
like "removes each of the elements of the specified collection from this 
collection"?

The answer is that they mean different things. Consider:

     var list = new ArrayList<>(List.of("a", "a"))
     list.removeAll(List.of("a"))
     list ==> []

The word "all" is used in two senses simultaneously here: it removes from this 
collection "all" occurrences of "all" the elements of the specified collection.

The implication here is that *this* collection is iterated, and an element is 
removed if it's contained in the specified collection. That implies contains() 
is called on the specified collection, and that uses the specified collection's 
membership semantics. AbstractCollection.removeAll is implemented exactly this way.

Given that Set.removeAll overrides Collection.removeAll, they should have 
compatible semantics to the greatest extent possible. (And yes, there are 
various existing violations of Liskov subsitutability in the collections 
framework, but changing this would introduce a new violation as well as a 
massive incompatibility.)

You might say, the example above has duplicates in *this* collection, which 
turns out to be a list. But this is different, because now we're talking about 
sets, which can't have duplicates, so the alternative sense of removeAll doesn't 
matter.

It still matters, because the notion of "duplicate" depends on whose membership 
semantics are in use. Consider:

     Collection<String> list = new ArrayList<>(List.of("a", "A"))
     Collection<String> hash = new HashSet<>(list)
     var tree = new TreeSet<>(String.CASE_INSENSITIVE_ORDER)
     tree.add("A")
     list.removeAll(tree)
     list ==> []
     hash.removeAll(tree)
     hash ==> []

Even though neither the List nor the HashSet appears to have duplicates, they 
both do when using the TreeSet's semantics. It would be terribly inconsistent to 
have AbstractCollection.removeAll and AbstractSet.removeAll differ in which 
collection's semantics they depend upon.

Finally, there is retainAll. It may be that removeAll is used more than 
retainAll, but their specifications and semantics are inextricably linked. Their 
specs are:

     removeAll: Removes all of this collection's elements that are also
     contained in the specified collection

     retainAll: Retains only the elements in this collection that are
     contained in the specified collection

The specifications are inverses of each other, so the result of removeAll should 
be the complement of the result of retainAll, and both rely on the contains() 
semantics of the specified collection.

**

It's perhaps unfortunate that Set.removeAll seems "backwards", but there are 
good reasons it's this way. It needs to stay this way in order to remain 
compatible with Collections.removeAll and to preserve its relationship with 
retainAll.

s'marks

[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058172.html

[2] http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058540.html

On 5/2/20 11:26 AM, Jens Lideström wrote:
> I will chip in once more on this topic. (I did so once before [1], perhaps
> in a much too wordy manner.)
> 
> The proposed changes in AbstractSet.removeAll will be inherited by
> TreeSet. A TreeSet might use a comparator as a custom equality test.
> 
> Because of this it seems backwards if TreeSet.removeAll where to use the
> equality semantics of the argument instead of the ones of the receiver,
> which will be the case with the proposed changes. It seems much more
> intuitive and regular to instead use the equality semantics of the receiver.
> 
> The same thing applies to other collections with custom equality tests.
> 
> It seems to be surprising and error prone to users if TreeSet where to
> behave like this:
> 
> var s = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
> s.addAll(Set.of("A", "B"));
> s.remove("a"); // Removes "A" as expected
> s.removeAll(List.of("b")); // Ops, doesn't remove "B"!
> 
> Many other common methods use the equality semantics of the receiver:
> containsAll, equals, remove, contains. To let TreeSet.removeAll be
> inconsistent with all these methods seems worse than it being inconsistent
> with the less commonly used TreeSet.retainAll.
> 
> The conclusion is that I think it would be better if TreeSet or
> AbstractSet gets an implementation of removeAll which iterates over the
> argument collection.
> 
> Best regards,
> Jens Lideström
> 
> [1]:
> https://mail.openjdk.java.net/pipermail/core-libs-dev/2019-July/061581.html
> 
> On 2020-05-01 22:01, Stuart Marks wrote:
>> Hi all,
>>
>> I'm taking another swing at this one. I've taken a minimal approach
>> compared to my previous proposals.
>>
>> Briefly, AbstractSet.removeAll switches from iterating this collection and
>> calling contains() on its argument, to iterating the argument and calling
>> this.contains(), if the argument collection is smaller than this
>> collection. This attempt at optimization is incorrect, because this
>> collection and the argument collection might have different semantics for
>> contains().
>>
>> There is a confounding performance problem, which is that if the argument
>> is a List, contains() is generally linear, which can result in
>> pathologically slow (O(m*n)) performance. Because of the
>> iteration-switching above, this performance problem can appear and
>> disappear based on the relative sizes, which is startling.
>>
>> The fix for the semantic problem is simply to remove the attempted
>> optimization from AbstractSet. This comports with the specification of
>> Collection.removeAll and Set.removeAll, which imply that contains() is
>> called on the argument collection. This allows sets to inherit the
>> implementation in AbstractCollection.removeAll. In addition, this ensures
>> that removeAll is now always the complement of retainAll (as it should
>> be), which it sometimes was not when the optimization was in place.
>>
>> IdentityHashMap's keySet and entrySet views were broken by this
>> optimization, so they had to override removeAll with copies of the
>> implementation from AbstractCollection. Since they can now inherit from
>> AbstractCollection, these overrides have been removed.
>>
>> This leaves a potential performance problem with removeAll when the
>> argument is a List. To mitigate this, HashSet.removeAll switches to
>> iterating the argument if the argument is a List. This is safe, as both
>> HashSet and List use the same semantics for contains() (at least, as far
>> as I know).
>>
>> I've opted not to pursue size-based optimizations at this time, since they
>> provide only incremental benefit, whereas the pathological performance
>> problem mentioned above is the primary issue. Also, it's actually quite
>> difficult to determine when it's safe to switch iteration.
>>
>> Finally, I've added performance notes to the specifications of
>> containsAll(), removeAll(), and retainAll(), warning about potential
>> performance problems that can occur with repeated calls to contains() made
>> by these methods.
>>
>> Bug:
>>
>>      https://bugs.openjdk.java.net/browse/JDK-6394757
>>
>> Webrev:
>>
>>      http://cr.openjdk.java.net/~smarks/reviews/6394757/webrev.2/
>>
>> Previous discussions:
>>
>>     
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-July/007125.html
>>
>>     
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058140.html
>>
>>     
>> http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058378.html
>>
>>
>>      http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-May/060007.html
>>
>>      http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-May/060147.html
>>
>> Thanks,
>>
>> s'marks
>>