RFR [15] 6394757: rev2: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes

Jason Mehrens jason_mehrens at hotmail.com
Fri May 15 01:06:07 UTC 2020


I think you have it covered. Thanks for the feedback it for sure helped me understand the approach.

So on a lighthearted note I did figure out a way (tongue in cheek) to test if Collection::contains is expensive. You could even figure out if the membership semantics of each collection are the same without even creating new APIs.  Anyway, don't take this example seriously and enjoy!
          
   @Deprecated
     public static boolean removeAllTester(Collection<?> dis, Collection<?> c) {
        Objects.requireNonNull(c);
        boolean modified = false;
        
        class Ref {
            int count;
            Ref() {
            }
            
            public boolean equals(Object o) {
                count++;
                return super.equals(o);
            }
            
            public int hashCode() {
                count++;
                return super.hashCode();
            }
        }
        
        boolean expensiveContains;
        try {
            Ref test = new Ref();
            c.contains(test);
            //Identity == zero and greater than 10 is a poor bin length
            //Obviously this fails if the collection::contains performs 'e.equals(o)'
            expensiveContains = test.count >= 10;
        } catch (RuntimeException usesCompare) {
            expensiveContains = false;
        }
        
        if (expensiveContains) {
            for (Object e : c)
                modified |= dis.remove(e);
        } else {
            for (Iterator<?> i = dis.iterator(); i.hasNext(); ) {
                if (c.contains(i.next())) {
                    i.remove();
                    modified = true;
                }
            }
        }
        return modified;
    }

public static void main(String[] args) {
        int sourceSize = 300000;
        int removalsSize =  300000;
        
        Set<Integer> source = new HashSet<Integer>();
        Collection<Integer> removals = new ArrayList<Integer>();
        
        for (int i = 0; i < sourceSize; i++)
            source.add(i);
        for (int i = 1; i <= removalsSize; i++)
           removals.add(-i);
        
        long start = System.currentTimeMillis();
        boolean modified;
        
        //modified = source.removeAll(removals);
        modified = removeAllTester(source, removals);
        
        //removals.forEach(e -> source.remove(e));
        //source.removeIf(e -> removals.contains(e));
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + "ms " + modified);
    }


Jason
________________________________________
From: Stuart Marks <stuart.marks at oracle.com>
Sent: Tuesday, May 12, 2020 3:34 PM
To: Jason Mehrens
Cc: core-libs-dev
Subject: Re: RFR [15] 6394757: rev2: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes



On 5/8/20 6:46 PM, Jason Mehrens wrote:
> I assume Deque/Queue would suffer the same issue as List.  It would be nice to some how ensure ArrayDeque.contains is not called as far as the heuristic goes.  Looks like PriorityQueue uses equals for contains but that is not to say there are not other queue implementations in the wild that do something different.

The difficulty here is how far to take the heuristic. The goal here is not to
guarantee that behavior can never be O(M*N) but to catch what seem to be the
most common cases. I think the most common case is where the argument is a List,
but there are lots of possibilities that we'd inevitably miss.

> The more I think about it in general, it seems like your task would be easier if you could declare:
> 1. AbstractCollection.removeAll should assume this.contains is O(N) and iterate over this and check contains on arg.
> 2. AbstractSet.removeAll should iterate over argument and assume that this.contains/remove is at least O(log (N)) and assume argument.contains is O(N).
> 3. Concrete implementations of removeAll know the cost of their contains method.  If it is known to be O(N) then walk this otherwise walk the arg.
> 4. Include an example in removeAll that says if membership semantics differ between sets use: source.removeIf(e -> removals.contains(e)); or removals.forEach(e -> source.remove(e)); instead.  If they don't differ then use removeAll.

This sort of makes sense from a performance standpoint, but it doesn't address
the semantic issues I raised in my reply the other day to Jens Lideström.
Specifically, we don't want Set.removeAll or AbstractSet.removeAll to disagree
with Collection.removeAll and AbstractCollection.removeAll. We also want
removeAll() to be the complement of retainAll().

> Given this has been around since JDK 1.3 would it be safe to assume that code that is mixing collections with different membership semantics is already working around this issue and therefore the risk is a bit lower in changing the spec?  Lots of concrete implementations already override removeAll anyway.

I'm not sure that's true. People keep running into either the performance
problem or the semantic problem. Sometimes people live with bugs for a long time
and can't figure it out. "Yeah, removeAll sometimes gives this weird result. I
don't know why. Maybe I just don't understand Java."

s'marks


More information about the core-libs-dev mailing list