RFR [15] 6394757: rev2: AbstractSet.removeAll semantics are surprisingly dependent on relative sizes

Sat May 9 02:32:48 UTC 2020

I believe I can agree with you on a number of points.

	• The specification is confusing.
	• The problem areas involve the “membership contract”.
	• There may have been some intent initially to support multiple membership semantics (this was new to me).
	• There are inconsistencies in the specification (but we may disagree on where the inconsistencies are).

You say:

    It's also somewhat odd to say that some collection implementations are conforming whereas others aren't.

Yes, it is odd, but I would say that the design is odd, and it is odd by choice.

Here’s the situation:

You have an abstract class or interface A and a concrete class C. The specification of class C depends upon certain parameters, such as the implementation of an abstract method or the choice of a type parameter or the choice of a constructor parameter. Some uses of C produce a specification that is consistent with the specification of A, but others produce specifications that are inconsistent with the specification of A.

As the designer of A and C, you have to make a choice:

	• C does not extend or implement A. This choice makes the question of conformance moot. Even if a particular instance of C conforms to the specification of A, the type system prevents it from being used as an A. (I’m ignoring the extra complexity that arises if A is an interface and someone creates a subclass of C that implements A.)
	• Define C to extend or implement A. This choice allows any instance of C to be used as an A, even the ones that do not conform to the specification of A.

The first choice avoids all of the problems we have been discussing, but it rules out cases that have no problems with non-conformance. Presumably on the grounds of greater utility, the designers of the Collection framework (or their successors) chose the second option for several concrete classes (the ones we have been discussing). Recognizing the issue of non-conformance, they chose to document it (often using bold type). Documentation is the best they could do given that it is not possible in general to detect improper use of non-conforming objects either at compile time or at run time.

If you believe that the possibility of non-conforming instances is inherently inconsistent and inconsistency must be fixed, then the only options are to remove the inheritance link or to weaken A so that C always conforms to A. I’m not sure that either option is practical in this case.

You quote the specification:

    The behavior of a sorted set is well-defined even if its ordering is inconsistent with equals; it just fails to obey the general contract of the Set interface.

This is exactly the case above, with A=Set and C=TreeSet.

(Actually, as the specification of TreeSet does not change the specifications of the basic methods in any significant way, it would more accurate to say that the TreeSet implementation does not conform to the TreeSet specification. That is certainly inconsistent, but it could be fixed by changing the specification of TreeSet, as you implied.)

You say:

    The issue of membership semantics is part of the original design.

I agree, but only to the point of identifying inconsistent membership semantics as a source of non-conformance. What is not part of the original design is allowing a broader range of membership semantics to be conformant, which I assume is what you want.

Another point of disagreement involves the specification of membership, especially in the case of Set, which I believe is where the problems arise.

I’m not completely sure what you are thinking, but it sounds like you believe that membership is defined by the implementation of the contains() method. I believe that is incorrect (i.e., not implied by the current specification).

When I read the specification of Set, the only way I can make sense of it is that the specification of Set is defined in terms of mathematical sets.

Consider the “more formal” definition of contains():

    More formally, returns true if and only if this set contains an element e such that Objects.equals(o, e).

Clearly, it would make no sense to define the contains() method in terms of itself. But it does make sense to say that the current state of a Set is a mathematical set of instances such that no pair of instances returns true from e1.equals(e2). I believe that is what the specification is trying to say, however imperfectly.

So, rather that *defining* set membership, the contains() method *reveals* it.

Other methods also reveal set membership, such as the iterator, or compute values that depend upon set membership, such as size() and hashCode(). They are defined in terms of “the elements of the set”, i.e., the mathematical set of instances that satisfies the equals rule.

The add() method specification also depends upon set membership, and it has the interesting property that it is defined to add at most one element to the (mathematical) set of instances.

The non-conforming sets are non-conforming in these basic operations: add(), size(), iterator(), as well as operations derived from those basic operations, such as hashCode().

Consider the classic example, a TreeSet of Strings that is case-insensitive. If I add one string “hello” to an empty TreeSet, how many elements does it contain? The specification of add() says one. The size() method returns one. The iterator returns one element. But the contains() method returns true for multiple non-equal objects. Oops.

In other words, a case-insensitive TreeSet does not conform to the specification of Collection or Set (or even TreeSet, but that could be fixed).

What I conclude from this exercise:

	• As set membership is not defined by the implementation of the contains() method, the problematic removeAll() implementation is not broken just because it does not call contains() on the argument. Using iteration, hashCode() and equals() is a valid alternative, according to the current specification of Collection and Set.
	• The problems of non-conforming collections are larger than you think. It is *not* the case that all of the basic axioms still apply.
	• This history of the design is not required to understand how a comparator that is inconsistent with equals() leads to non-conformance.
	• If you are proposing to define set membership programmatically in terms of the implementation of the contains() method, that would be a very large change indeed. I would want to evaluate the complete solution before evaluating individual changes.

In case it is not obvious, there are workarounds that any developer can use to avoid creating non-conforming collections.

To create a conforming case-insensitive TreeSet of Strings, for example, define a wrapper class for Strings that implements case-insensitive equals and hashCode.

To create a conforming IdentityHashMap of Strings, for example, define a wrapper class for Strings that canonicalizes them at least to the point that no two equal instances ever co-exist.

  Alan

> On May 8, 2020, at 1:49 PM, Stuart Marks <stuart.marks at oracle.com> wrote:
> 
> 
> 
> On 5/6/20 5:05 PM, Alan Snyder wrote:
>> Let me clarify. I was referring to the non-support by the framework, which restricts membership semantics:
>> ...[specification text regarding equals()]...
> 
> OK, thanks for the clarification.
> 
>> Although the framework provides implementation classes that MAY be used to create nonconforming collections, these classes are called out as potentially non-conforming. For example:
> 
> [IdentityHashMap]
> 
>> *This class is /not/ a general-purpose |Map| implementation! While this class implements the |Map| interface, it intentionally violates |Map's| general contract, which mandates the use of the |equals| method when comparing objects. This class is designed for use only in the rare cases wherein reference-equality semantics are required.*
>> And in many places, the implications of non-conformance are also called out:
>> *While the object returned by this method implements the |Collection| interface, it does /not/ obey |Collection's| general contract. Like its backing map, the collection returned by this method defines element equality as reference-equality rather than object-equality. This affects the behavior of its |contains|, |remove| and |containsAll| methods.*
>> The phrase “affects the behavior” is open ended. It means all bets are off.
> 
> No, all bets are not off.
> 
> "All bets are off" isn't an unreasonable interpretation of the current wording of the specification, but in fact the specification is incredibly poorly worded and confusing. Plus there are several out-and-out bugs in the spec.
> 
> It's also somewhat odd to say that some collection implementations are conforming whereas others aren't. Implementations can have bugs that render them non-conformant to a specification. In this case, though, parts of the specification contradict each other. You could pick part A of the spec and say that part B is "non-conformant" but that really doesn't make sense if you consider the totality of the specification. The fact is that the spec as a whole is self-inconsistent. I intend to fix that.
> 
>> My point is that (fully) supporting custom membership semantics changes a fundamental property of the original design and should be approached as a redesign. It is not a matter of fixing a series of issues one at a time, as you discover them. Calling these issues semantic bugs when there is no violation of the specification is leading you down the wrong path, in my opinion.
> 
> The issue of membership semantics is part of the original design.
> 
> If you look at the history, the collections framework was added in JDK 1.2, and the JDK 1.2 specification includes both Set and SortedSet. There is the wording in the SortedSet specification that is very similar to what exists today: [edited for brevity]
> 
>    Note that the ordering maintained by a sorted set must be consistent with
>    equals if the sorted set is to correctly implement the Set interface.
>    This is so because the Set interface is defined in terms of the equals
>    operation, but a sorted set performs all element comparisons using its
>    compareTo (or compare) method. The behavior of a sorted set is well-defined
>    even if its ordering is inconsistent with equals; it just fails to obey the
>    general contract of the Set interface.
> 
> The original designers (including Josh Bloch, and probably others) clearly thought about these issues enough to put that wording in the specification. I don't think they thought through the full implications of this, or at least they didn't write it down, hence the vague wording above.
> 
> In addition, the issue was forgotten, and over time, bugs were introduced in the system that made things worse. For example, the "optimization" that is the root cause of the bug we are discussing (JDK-6394757) was introduced in JDK 1.3. In the comments on this bug report [1] Bloch is quoted as saying
> 
>    The spec clearly states that removeAll "Removes all this collection's
>    elements that are also contained in the specified collection." So the
>    current behavior is not just unpredictable, it's broken. It worked until
>    1.3, and was broken in 1.3-1.5. Sigh...
> 
>    (And indeed, AbstractCollection.removeAll does precisely what the
>    Collection.removeAll spec demands)
> 
> [1] https://bugs.openjdk.java.net/browse/JDK-6394757?focusedCommentId=12242803&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12242803
> 
> Another bug, a spec bug this time, was introduced later. (I haven't searched the history to find out exactly when.) The TreeSet.add() method specification says this: [2]
> 
>    Adds the specified element to this set if it is not already present. More
>    formally, adds the specified element e to this set if the set contains no
>    element e2 such that Objects.equals(e, e2). If this set already contains the
>    element, the call leaves the set unchanged and returns false.
> 
> [2] https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/util/TreeSet.html#add(E)
> 
> This is simply wrong, as TreeSet membership is determined by the comparison method, not by equals().
> 
>> What would (full) support for custom membership semantics look like? What restrictions would you place on the contains() method? What axioms would you specify for the generic operations? Is substitutability tossed out the window? What does it mean for an object to be a member of a set (according to contains) but not be returned by the iterator or included in the size? What should Set.copyOf() do if its argument is a non-conforming collection? etc. etc.
> 
> I think it's fair to raise all of these questions if "all bets are off" but indeed all bets are not off. I think the root of the problem is in the statement from the specification,
> 
>    The behavior of a sorted set is well-defined even if its ordering is
>    inconsistent with equals; it just fails to obey the general contract
>    of the Set interface.
> 
> This statement is essentially impossible to understand without looking at the history, various bugs (such as this one), the design intent of the collections framework, etc. What it really means is the following.
> 
> A SortedSet being "well-defined" answers most of your questions above. The obvious axioms all still apply. If you add an element and add() returns true, then contains() on that element should return true, and iterating the set should at some point return that element, etc.
> 
> The "general contract" phrase that appears in several places doesn't mean "the totality of the Set interface" but instead it means specifically what we've been calling the "membership contract" of the set. That is,
> 
>    Sets contain no pair of elements e1 and e2 such that e1 $ e2.
> 
> Where $ is:
> 
>    compare(e1, e2) == 0   for SortedSet
>    ==                     for sets built from IdentityHashMap
>    equals()               for Sets excluding the above
> 
> Once it's clear that "general contract" really means membership contract, many things become clearer. Most bets in fact are still on. Issues arise only when there are cases where there are two sets at play, and the question of which set's membership contract is in use needs to be specified. For copy constructors, Set.copyOf, and addAll(), it's clear that elements are added according to the receiver's membership contract. Other methods require a bit more analysis:
> 
>    containsAll()
>    equals()
>    removeAll()
>    retainAll()
> 
> The containsAll() and equals() methods both use the membership contract of the receiver, not the argument. Unfortunately, the equals() specification says,
> 
>    Returns true if the specified object is also a set, the two sets have the
>    same size, and every member of the specified set is contained in this set
>    (or equivalently, every member of this set is contained in the specified
>    set).
> 
> As should be clear from this discussion, the "equivalently" clause is incorrect -- another spec bug.
> 
> Finally, removeAll() and retainAll() use the membership semantics of the argument. I've discussed this elsewhere in this thread.
> 
> To summarize, fixing this bug is part of a larger effort to clean up various ambiguities and bugs in the specifications, as well as bugs in implementations. The fact that I'm proceeding incrementally shouldn't be interpreted to mean that I'm fixing semantic issues one by one as I come across them.
> 
>> I understand that trade-offs are necessary, and if I understand correctly, the CSR process should review intentional regressions. Do you agree?
> 
> Yes, I will definitely file a CSR for this, to include specification changes as well as the behavioral and performance tradeoffs.
> 
> s'marks
>