RFR [15] 6394757: rev2: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes

Wed May 6 21:08:48 UTC 2020

On 5/4/20 6:44 PM, Alan Snyder wrote:
> What problem are you trying to solve?
> 
> The Collections framework explicitly does not support custom membership 
> semantics. If you think it should, file an RFE and create a JEP. It’s not a 
> small change and tinkering is not the way to get there.

There are already three different kinds of sets in the JDK that support 
different membership semantics: sets derived from IdentityHashMap, ordinary sets 
like HashSet, and the SortedSet/NavigableSet family. Arguably the last already 
supports custom membership semantics, as it's possible for callers to provide 
their own comparators. I'm trying to fix semantic bugs in the way various 
collection operations handle situations with mixed membership semantics, and 
secondarily, to avoid pathological performance problems that have arisen.

> If your present concern is performance surprises, be aware that your new 
> proposal has the same problem as the old one. Although it removes some 
> data-dependent performance surprises, it adds a potential JDK-dependent 
> performance surprise. The data-dependent performance issues can be detected by 
> testing, but no amount of testing can alert a developer to the possibility that 
> an unexpected implementation change in a future JDK might cause a big 
> performance hit.
> 
> You may have mis-remembered the performance problem that I am concerned about. 
> It involves using removeAll to remove a relatively small number of elements from 
> a large, hash based collection. The performance hit is the need to iterate over 
> the entire collection and test every element regardless of the number of 
> elements being removed. Although the performance problem may be exacerbated when 
> the argument is a List, the problem exists for any collection that is much 
> smaller than the target collection.

You're conflating two different parts of the performance issue.

This is illustrated in an article that Jon Skeet posted back in 2010, [1] which 
is linked from JDK-6982173. Briefly, Skeet observed that a removeAll using a 
List of 300,000 elements could take nearly 3 minutes, whereas iterating a 
HashSet of 1,000,000 elements would take only 38ms.

(These numbers are from 2010, and hardware is certainly different today, and 
these aren't rigorous benchmarks. However, an informal benchmark that shows the 
difference between 3 minutes and 38ms is a pretty clear demonstration of a 
performance problem.)

Taking 3 minutes for this kind of operation is clearly pathological behavior, 
which is what I'm trying to address. Although it seems impossible to prevent it 
from ever happening, putting some special cases for handling List into places 
such as HashSet.removeAll would seem to cover the most of the common cases.

It's true that if you're removing a small set from a large set, iterating the 
"wrong" set might take 38ms instead of a much smaller time (probably 
microseconds). This would indeed be a performance regression. (It might also be 
an improvement in correctness, if the sets have different membership contracts.)

The fact is that there are performance regressions from one JDK release to the 
next. Sometimes they're introduced by accident, and we try to address these 
where possible. Sometimes they're introduced intentionally, as part of various 
tradeoffs. That's what's going on here. I'm improving the correctness of the 
system and avoiding pathological performance problems, while introducing a 
performance regression that seems modest relative to the pathological 
performance issue that's being mitigated.

s'marks

[1] 
https://codeblog.jonskeet.uk/2010/07/29/there-s-a-hole-in-my-abstraction-dear-liza-dear-liza/

> 
>    Alan
> 
> 
> 
> 
>> On May 4, 2020, at 5:25 PM, Stuart Marks <stuart.marks at oracle.com 
>> <mailto:stuart.marks at oracle.com>> wrote:
>>
>>
>>
>> On 5/1/20 10:41 PM, Jason Mehrens wrote:
>>> 1. I assume you are using "c instanceof List" instead of "!(c instanceof 
>>> Set)" to correctly handle IdentitityHashMap.values()?  The instanceof List 
>>> seems like safe choice but it is too bad we can still fool that check by 
>>> wrapping List as an unmodifiableCollection.  If 
>>> splitIterator().characteristics() could tell us that the collection used 
>>> identity comparisons I think we would be able to determine if it was safe to 
>>> swap the ordering in the general case as we could check for IDENTITY, SORTED, 
>>> and comparator equality.
>>
>> I'm using "instance List", not for the reason of IdentityHashMap.values() 
>> specifically (though that's a good example), but mainly to try to be minimal. 
>> While I think getting the semantics right takes priority, the change does 
>> impact performance, so I want to reintroduce the performance heuristic in the 
>> safest manner possible. Checking for !Set seems dangerous, as there might be 
>> any number of Collections out there that aren't Sets and that aren't 
>> consistent with equals. Checking for instanceof List seemed like the safest 
>> and most minimal thing to do.
>>
>> In fact, I'm not even sure how safe List is! It's certainly possible for 
>> someone to have a List that isn't consistent with equals. Such a thing might 
>> violate the List contract, but that hasn't stopped people from implementing 
>> such things (outside the JDK).
>>
>> I also toyed around with various additional tests for when it would be 
>> profitable to switch iteration to the smaller collection. This could be 
>> applied to a variety of consistent-with-equals set implementations in the JDK. 
>> The benefits of swapping the iteration is more modest in these cases compared 
>> to List, however. Avoiding repeated List.contains() helps avoid 
>> quasi-quadratic (O(M*N)) performance. Swapping iteration order of sets gets us 
>> only the smaller of O(M) vs O(N), which is still linear.
>>
>> Also, as you noted, this heuristic is easily defeated by things like the 
>> collection wrappers.
>>
>>> 2. Should code applied to HashSet.removeAll also be applied to 
>>> HashMap.keySet().removeAll and HashMap.entrySet().removeAll? 
>>>  Collections::newSetFromMap will end up having different behavior if it is 
>>> not consistently applied.
>>
>> I think the *results* will be consistent, but the *performance* might not be 
>> consistent.
>>
>> But these cases could result in pathological performance if removeAll(list) is 
>> called, so I'm a bit concerned about them. If we agree that "instanceof List" 
>> is a reasonable heuristic, then I don't have any objection in principle to 
>> adding it to these locations as well. But I want to be careful about 
>> sprinkling ad hoc policies like this around the JDK.
>>
>> I note that the erroneous size-based optimization was copied into, and 
>> therefore needs to be removed from ConcurrentSkipListSet (JDK-8218945) and the 
>> set views of CHM (JDK-8219069). I'd more concerned about getting these cleaned 
>> up in the short term.
>>
>>> 3. Not to derail this thread but do think we need a new JIRA ticket to 
>>> address Collections.disjoint?  Seems like it has similar sins of calling size 
>>> and using "!(c2 instanceof Set)" but I don't want to muddy the waters by 
>>> covering any solutions to that method in this thread.
>>
>> Yeah I'm not sure what to do about Collections.disjoint().
>>
>> Note that there are some statements in bug reports (possibly even made by me!) 
>> that assert that Collections.disjoint should be consistent with Set.removeAll. 
>> I don't think that's the case. As discussed elsewhere, removeAll() needs to be 
>> consistent about relying on the membership semantics of the argument collection.
>>
>> As Collections.disjoint currently stands, it has the big "Care must be 
>> exercised" disclaimer in its spec, and it goes to some length to make a bunch 
>> of tests and vary the iteration accordingly. The programmer can get a speedup 
>> using disjoint() compared to copying a set and then calling retainAll(), 
>> provided sufficient care is taken. Maybe that's an acceptable tradeoff.
>>
>> If you have any ideas about how disjoint()'s spec or implementation could be 
>> improved, you could file a JIRA issue, or I could file one if you sent me the 
>> info.
>>
>> s'marks
>>
>