RFR [15] 6394757: rev2: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes

Alan Snyder javalists at cbfiddle.com
Thu May 7 00:05:17 UTC 2020



> On May 6, 2020, at 2:08 PM, Stuart Marks <stuart.marks at oracle.com> wrote:
> 
> 
> 
> On 5/4/20 6:44 PM, Alan Snyder wrote:
>> What problem are you trying to solve?
>> The Collections framework explicitly does not support custom membership semantics. If you think it should, file an RFE and create a JEP. It’s not a small change and tinkering is not the way to get there.
> 
> There are already three different kinds of sets in the JDK that support different membership semantics: sets derived from IdentityHashMap, ordinary sets like HashSet, and the SortedSet/NavigableSet family. Arguably the last already supports custom membership semantics, as it's possible for callers to provide their own comparators. I'm trying to fix semantic bugs in the way various collection operations handle situations with mixed membership semantics, and secondarily, to avoid pathological performance problems that have arisen.
> 

Let me clarify. I was referring to the lack of support in the framework itself, which restricts membership semantics:
Many methods in Collections Framework interfaces are defined in terms of the equals <https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/lang/Object.html#equals(java.lang.Object)> method. For example, the specification for the contains(Object o) <https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/util/Collection.html#contains(java.lang.Object)> method says: "returns true if and only if this collection contains at least one element e such that (o==null ? e==null : o.equals(e))." This specification should not be construed to imply that invoking Collection.contains with a non-null argument o will cause o.equals(e) to be invoked for any element e. Implementations are free to implement optimizations whereby the equals invocation is avoided, for example, by first comparing the hash codes of the two elements. (The Object.hashCode() <https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/lang/Object.html#hashCode()> specification guarantees that two objects with unequal hash codes cannot be equal.) More generally, implementations of the various Collections Framework interfaces are free to take advantage of the specified behavior of underlying Object <https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/lang/Object.html> methods wherever the implementor deems it appropriate.
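
As a concrete illustration of that latitude, consider a deliberately nonconforming class of my own invention: HashSet can answer contains() without ever invoking equals(), because the hash codes already disagree.

    import java.util.HashSet;
    import java.util.Set;

    // Deliberately nonconforming: "equal" objects do NOT share a hash code,
    // violating the Object.hashCode() contract.
    final class Sloppy {
        @Override public boolean equals(Object o) { return o instanceof Sloppy; }
        // hashCode() intentionally not overridden; it stays identity-based.
    }

    public class HashShortcut {
        public static void main(String[] args) {
            Set<Sloppy> set = new HashSet<>();
            set.add(new Sloppy());
            // Almost certainly prints false: HashSet probes by hash code first,
            // so the "equal" element is never compared with equals() at all.
            System.out.println(set.contains(new Sloppy()));
        }
    }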

Although the framework provides implementation classes that MAY be used to create non-conforming collections, these classes are explicitly called out as such. For example:
This class is not a general-purpose Map implementation! While this class implements the Map interface, it intentionally violates Map's general contract, which mandates the use of the equals method when comparing objects. This class is designed for use only in the rare cases wherein reference-equality semantics are required.

And in many places, the implications of non-conformance are also called out:
While the object returned by this method implements the Collection interface, it does not obey Collection's general contract. Like its backing map, the collection returned by this method defines element equality as reference-equality rather than object-equality. This affects the behavior of its contains, remove and containsAll methods.

The phrase “affects the behavior” is open-ended. It means all bets are off.
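
A minimal sketch of that divergence (my own example, not taken from the docs):

    import java.util.IdentityHashMap;
    import java.util.Map;
    import java.util.Set;

    public class IdentityViews {
        public static void main(String[] args) {
            Map<String, String> map = new IdentityHashMap<>();
            String key = new String("k");
            map.put(key, "v");

            Set<String> keys = map.keySet();
            System.out.println(keys.contains(key));              // true: same reference
            System.out.println(keys.contains(new String("k")));  // false, though equals-equal
        }
    }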

(I am glad to see these statements in the specification, but I would not assume that all possible behavioral consequences are documented.)

Since these classes CAN also be used to create conforming collections, there is justification for including them in the generic type hierarchy, but such inclusion does not imply conformance for all instances of these classes. The type system is inadequate to fully distinguish conforming collections from non-conforming ones.

My point is that (fully) supporting custom membership semantics changes a fundamental property of the original design and should be approached as a redesign. It is not a matter of fixing a series of issues one at a time as you discover them. Calling these issues semantic bugs when there is no violation of the specification is, in my opinion, leading you down the wrong path.

What would (full) support for custom membership semantics look like? What restrictions would you place on the contains() method? What axioms would you specify for the generic operations? Is substitutability tossed out the window? What does it mean for an object to be a member of a set (according to contains) but not be returned by the iterator or included in the size? What should Set.copyOf() do if its argument is a non-conforming collection? etc. etc.
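
To make the iterator question concrete: something close to it is already expressible today with a standard comparator (a sketch; the class name is mine):

    import java.util.Set;
    import java.util.TreeSet;

    public class ComparatorMembership {
        public static void main(String[] args) {
            Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
            set.add("Hello");

            System.out.println(set.contains("HELLO")); // true: a "member" per the comparator
            System.out.println(set.size());            // 1
            for (String s : set)
                System.out.println(s);                 // prints only "Hello"
        }
    }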


>> If your present concern is performance surprises, be aware that your new proposal has the same problem as the old one. Although it removes some data-dependent performance surprises, it adds a potential JDK-dependent performance surprise. The data-dependent performance issues can be detected by testing, but no amount of testing can alert a developer to the possibility that an unexpected implementation change in a future JDK might cause a big performance hit.
>> You may have misremembered the performance problem that I am concerned about. It involves using removeAll to remove a relatively small number of elements from a large, hash-based collection. The performance hit is the need to iterate over the entire collection and test every element, regardless of the number of elements being removed. Although the performance problem may be exacerbated when the argument is a List, the problem exists for any collection that is much smaller than the target collection.
> 
> You're conflating two different parts of the performance issue.
> 
> This is illustrated in an article that Jon Skeet posted back in 2010 [1], which is linked from JDK-6982173. Briefly, Skeet observed that a removeAll using a List of 300,000 elements could take nearly 3 minutes, whereas iterating a HashSet of 1,000,000 elements would take only 38ms.
> 
> (These numbers are from 2010, hardware is certainly different today, and these aren't rigorous benchmarks. However, an informal benchmark that shows the difference between 3 minutes and 38ms is a pretty clear demonstration of a performance problem.)
> 
> Taking 3 minutes for this kind of operation is clearly pathological behavior, which is what I'm trying to address. Although it seems impossible to prevent it from ever happening, putting some special cases for handling List into places such as HashSet.removeAll would seem to cover most of the common cases.
> 
> It's true that if you're removing a small set from a large set, iterating the "wrong" set might take 38ms instead of a much smaller time (probably microseconds). This would indeed be a performance regression. (It might also be an improvement in correctness, if the sets have different membership contracts.)
> 
> The fact is that there are performance regressions from one JDK release to the next. Sometimes they're introduced by accident, and we try to address these where possible. Sometimes they're introduced intentionally, as part of various tradeoffs. That's what's going on here. I'm improving the correctness of the system and avoiding pathological performance problems, while introducing a performance regression that seems modest relative to the pathological performance issue that's being mitigated.
> 
> s'marks
> 
> [1] https://codeblog.jonskeet.uk/2010/07/29/there-s-a-hole-in-my-abstraction-dear-liza-dear-liza/
> 
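
(For anyone following along: the pathology described above is easy to reproduce against AbstractSet.removeAll as it stands today. A minimal sketch of my own; the class name and sizes are merely illustrative, and the slow path can take minutes on typical hardware:)

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class RemoveAllPathology {
        public static void main(String[] args) {
            Set<Integer> set = new HashSet<>();
            List<Integer> list = new ArrayList<>();
            for (int i = 0; i < 300_000; i++) {
                set.add(i);
                list.add(i);
            }
            long start = System.nanoTime();
            // size() is not greater than list.size(), so AbstractSet.removeAll
            // iterates the set and calls list.contains(e) for each element:
            // roughly O(set * list) comparisons in the worst case.
            set.removeAll(list);
            System.out.printf("removeAll took %d ms%n",
                              (System.nanoTime() - start) / 1_000_000);
        }
    }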

I understand that trade-offs are necessary, and if I understand correctly, the CSR process should review intentional regressions. Do you agree?

  Alan



> 
> 
>>   Alan
>>> On May 4, 2020, at 5:25 PM, Stuart Marks <stuart.marks at oracle.com> wrote:
>>> 
>>> 
>>> 
>>> On 5/1/20 10:41 PM, Jason Mehrens wrote:
>>>> 1. I assume you are using "c instanceof List" instead of "!(c instanceof Set)" to correctly handle IdentityHashMap.values()?  The instanceof List check seems like a safe choice, but it is too bad we can still fool it by wrapping a List as an unmodifiableCollection.  If spliterator().characteristics() could tell us that the collection used identity comparisons, I think we would be able to determine whether it was safe to swap the ordering in the general case, as we could check for IDENTITY, SORTED, and comparator equality.
>>> 
>>> I'm using "instanceof List", not for the reason of IdentityHashMap.values() specifically (though that's a good example), but mainly to try to be minimal. While I think getting the semantics right takes priority, the change does impact performance, so I want to reintroduce the performance heuristic in the safest manner possible. Checking for !Set seems dangerous, as there might be any number of Collections out there that aren't Sets and aren't consistent with equals. Checking for instanceof List seemed like the safest and most minimal thing to do.
>>> 
>>> In fact, I'm not even sure how safe List is! It's certainly possible for someone to have a List that isn't consistent with equals. Such a thing might violate the List contract, but that hasn't stopped people from implementing such things (outside the JDK).
>>> 
>>> I also toyed around with various additional tests for when it would be profitable to switch iteration to the smaller collection. This could be applied to a variety of consistent-with-equals set implementations in the JDK. The benefits of swapping the iteration are more modest in these cases than for List, however. Avoiding repeated List.contains() helps avoid quasi-quadratic (O(M*N)) performance, whereas swapping the iteration order of two sets gets us only the smaller of O(M) vs O(N), which is still linear.
>>> 
>>> Also, as you noted, this heuristic is easily defeated by things like the collection wrappers.
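
[Aside for readers skimming the archive: the sketch below is my reading of the heuristic under discussion, written as a free-standing method. It is not the actual webrev; it only shows where the instanceof List test would sit.]

    import java.util.Collection;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Objects;
    import java.util.Set;

    public class RemoveAllSketch {
        // Default: honor the argument's membership semantics by iterating the
        // set and calling c.contains(e). Only when c is a List (presumed
        // consistent with equals, and with a linear contains()) iterate c
        // instead, avoiding the O(M*N) repeated-List.contains() pattern.
        static <E> boolean removeAll(Set<E> set, Collection<?> c) {
            Objects.requireNonNull(c);
            boolean modified = false;
            if (c instanceof List) {
                for (Object e : c)
                    modified |= set.remove(e);
            } else {
                for (Iterator<E> it = set.iterator(); it.hasNext(); ) {
                    if (c.contains(it.next())) {
                        it.remove();
                        modified = true;
                    }
                }
            }
            return modified;
        }
    }
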
>>> 
>>>> 2. Should code applied to HashSet.removeAll also be applied to HashMap.keySet().removeAll and HashMap.entrySet().removeAll?  Collections::newSetFromMap will end up having different behavior if it is not consistently applied.
>>> 
>>> I think the *results* will be consistent, but the *performance* might not be consistent.
>>> 
>>> But these cases could result in pathological performance if removeAll(list) is called, so I'm a bit concerned about them. If we agree that "instanceof List" is a reasonable heuristic, then I don't have any objection in principle to adding it to these locations as well. But I want to be careful about sprinkling ad hoc policies like this around the JDK.
>>> 
>>> I note that the erroneous size-based optimization was copied into, and therefore needs to be removed from, ConcurrentSkipListSet (JDK-8218945) and the set views of CHM (JDK-8219069). I'm more concerned about getting these cleaned up in the short term.
>>> 
>>>> 3. Not to derail this thread, but do you think we need a new JIRA ticket to address Collections.disjoint?  Seems like it has similar sins of calling size() and using "!(c2 instanceof Set)", but I don't want to muddy the waters by covering any solutions to that method in this thread.
>>> 
>>> Yeah I'm not sure what to do about Collections.disjoint().
>>> 
>>> Note that there are some statements in bug reports (possibly even made by me!) that assert that Collections.disjoint should be consistent with Set.removeAll. I don't think that's the case. As discussed elsewhere, removeAll() needs to be consistent about relying on the membership semantics of the argument collection.
>>> 
>>> As Collections.disjoint currently stands, it has the big "Care must be exercised" disclaimer in its spec, and it goes to some length to make a bunch of tests and vary the iteration accordingly. The programmer can get a speedup using disjoint() compared to copying a set and then calling retainAll(), provided sufficient care is taken. Maybe that's an acceptable tradeoff.
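
[A footnote on the kind of care required: with mixed membership semantics, even the argument order can change disjoint()'s answer. A sketch of my own; the exact outputs are implementation-dependent, which is precisely the point.]

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.TreeSet;

    public class DisjointCare {
        public static void main(String[] args) {
            Set<String> ci = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
            ci.add("hello");
            Set<String> cs = new HashSet<>();
            cs.add("HELLO");

            // ci.contains("HELLO") is true, but cs.contains("hello") is false,
            // so the result depends on which collection's contains() is consulted.
            System.out.println(Collections.disjoint(ci, cs));
            System.out.println(Collections.disjoint(cs, ci));
        }
    }
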
>>> 
>>> If you have any ideas about how disjoint()'s spec or implementation could be improved, you could file a JIRA issue, or I could file one if you sent me the info.
>>> 
>>> s'marks
>>> 
> 


