RFR (M): JDK-6394757: rev1: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes

Wed May 22 16:02:58 UTC 2019

For the record, the performance change that concerns me involves the removal of a small number of elements from a large HashSet.

Let’s say I want to remove three elements from a large HashSet.

I could write

target.remove(e1);
target.remove(e2);
target.remove(e3);

or, if the elements were in a collection

target.removeAll(c);

The first case takes advantage of hashing. The parameter elements are hashed and only the elements in the matching bucket are compared (using equals). That makes this an O(1) operation in terms of the size of the target set, assuming it is properly loaded.

One might naively think that the second case performs similarly. In the current implementation it does. In the changed implementation, *all* of the elements of the target set must be examined and tested for being members of the parameter collection. The operation is now O(n), where n is the size of the target set.

This is a huge difference. I’m not sure it makes sense to say that we need to gather data to see how much of a problem this is. Anyone who uses large sets in this manner has had no reason to complain. They will complain only after they start using the new implementation.

  Alan

> On May 20, 2019, at 4:53 PM, Stuart Marks <stuart.marks at oracle.com> wrote:
> 
> On 5/16/19 11:48 PM, Peter Levart wrote:
>> As Stuart says, optimizations mustn't change semantics. So is there a possibly narrower optimization, that doesn't change the semantics?
>> Here's a sketch of one:
>> - create a marker interface (could be JDK-internal) and mark all conforming Set implementations with it
>> - use AbstractSet.removeAll optimization only when both collections implement the marker interface
> 
> Hi Peter,
> 
> It's certainly reasonable for there to be some concern over the potential performance loss. I don't particularly like the fact that even as we're improving the semantics, that some programs will suffer a slowdown, even those that aren't actually affected by the semantic issues.
> 
> However, I'm not convinced that we really understand the performance issue. When this was brought up in January, [1] the claim was that this wasted thousands of CPU hours. I'm skeptical. To back this up, in a followup message [2] the original poster linked to a Stack Overflow question [3] that turns out to have been derived from an old article by Jon Skeet [4]. To be sure, this is an interesting technical discussion, but it doesn't count as evidence that this is an actual performance problem.
> 
> If you look at the example closely, it turns out that it calls removeAll() and passes an argument a List with 300,000 elements. If you're doing operations on large lists, maybe you shouldn't be surprised to see a slowdown. Maybe instead you should have been surprised (in the iteration-switched case) that it ran reasonably fast at all.
> 
> Nonetheless it remains a possibility that some code that worked properly and performed reasonably well will suffer a slowdown. One way -- and I stress that this is only one way -- that performance might be improved is to switch the collection upon which iteration is done. The linked bug reports have a bunch of suggestions for doing this, including checking to see whether the argument is a List. To these I'd add the stipulation that, the iteration can be switched if it can be determined that it is *safe* to do so; and the safety depends on whether the two collections use compatible membership semantics.
> 
> Collections that use equals() are compatible with each other; keysets of IdentityHashMap are compatible with each other; SortedSets that use identical comparison methods are compatible with each other. But note, SortedSets that use a comparisons methods that are consistent with equals are also compatible with collections that use equals(). And it seems likely to be undecidable whether two different (!=) comparison methods are in fact compatible.
> 
> In any case it would interesting to try to chew on this notion of compatible membership semantics for a while. It might be difficult to come up with a complete and correct test. It's probably easier to come up with an approximation, like ConcurrentSkipListMap's equals() method. [5] I'm not sure such a discussion would be fruitful, though, until we're sure we understand how much of a performance problem this actually is.
> 
> s'marks
> 
> 
> 
> 
> [1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-January/058140.html
> 
> [2] http://mail.openjdk.java.net/pipermail/core-libs-dev/2019-February/058378.html
> 
> [3] https://stackoverflow.com/questions/28671903/the-hashsett-removeall-method-is-surprisingly-slow
> 
> [4] https://codeblog.jonskeet.uk/2010/07/29/there-s-a-hole-in-my-abstraction-dear-liza-dear-liza/
> 
> [5] http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/java.base/share/classes/java/util/concurrent/ConcurrentSkipListMap.java#l1706
>