RFR JDK-8225339 Optimize HashMap.keySet()/HashMap.values()/HashSet toArray() methods
Stuart Marks
stuart.marks at oracle.com
Wed Jun 5 23:48:35 UTC 2019
On 6/5/19 7:27 AM, Claes Redestad wrote:
> On 2019-06-05 15:49, Tagir Valeev wrote:
>> In particular it's never mentioned in HashSet, HashMap or Collection spec that
>> toArray implements a fail-fast behavior: this is said only about the
>> iterator() method.
>
> that's not entirely true, since the @implSpec for toArray on
> AbstractCollection states it is equivalent to iterating over the
> collection[1].
>
> So since the implementations you're changing inherit their current
> implementation from AbstractCollection and the iterators of HashMap is
> specified to be fail-fast, then that behavior can be argued to be
> specified also for the affected toArray methods.
>
> I'm not fundamentally objecting to the behavior change, but I do think
> it needs careful review and a CSR (or at least plenty of reviewers
> agreeing that one isn't needed).
OK, let's slow down here a bit.
The fail-fast, CME-throwing behavior is primarily of interest when iteration is
external, that is, it's driven by the application making a succession of calls
on the Iterator. The main concern arises when the application makes other
modifications to the collection being iterated, between calls to the Iterator.
There is always the possibility of modification by some other thread, and this
is mentioned obliquely in the spec near "fail-fast", but given memory visibility
issues it's pretty much impossible to make any concrete statements about what
happens if modifications are made by another thread.
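For anyone who wants the concrete case in front of them, this is the externally-driven situation where fail-fast matters; a small self-contained example (class name and values are mine, just for illustration):

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    public class FailFastDemo {
        public static void main(String[] args) {
            Set<String> set = new HashSet<>(Set.of("a", "b", "c"));
            Iterator<String> it = set.iterator();
            it.next();
            set.add("d");   // structural modification between Iterator calls...
            it.next();      // ...so this throws ConcurrentModificationException
        }
    }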
In the case of these methods, the iteration occurs entirely within the context
of a single method call. There's no possibility of the application making a
concurrent modification on the same thread. So, while the fail-fast behavior is
being changed, strictly speaking, I consider it "de minimis", as it's pretty
difficult for an application to observe this change in behavior.
Regarding the @implSpec, that applies only to the implementation in
AbstractCollection, and @implSpec is not inherited. Again, it is a change in
behavior since this method is being changed from being inherited to being
overridden, but it's not a specification issue.
In any case I don't think the concurrent modification behavior change is an
issue to be concerned about.
**
Now, regarding visible changes in behavior: it's actually quite easy to observe this change, at least from a subclass of HashSet. A HashSet subclass could override the iterator() method and detect that it is called when toArray() is called. With this change, toArray() is overridden, so iterator() would no longer be called in this circumstance.
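To make that concrete, here's a sketch of the kind of subclass that would notice (the class name and the printing are mine, purely for illustration):

    import java.util.HashSet;
    import java.util.Iterator;

    public class SpyHashSet<E> extends HashSet<E> {
        boolean iteratorCalled;

        @Override
        public Iterator<E> iterator() {
            iteratorCalled = true;
            return super.iterator();
        }

        public static void main(String[] args) {
            SpyHashSet<String> s = new SpyHashSet<>();
            s.add("x");
            s.toArray();
            // true with the inherited AbstractCollection.toArray(), which
            // iterates via iterator(); false once HashSet overrides toArray()
            // and bypasses iterator() entirely.
            System.out.println(s.iteratorCalled);
        }
    }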
This kind of change is generally allowable, but we have had complaints about the pattern of self-calls changing from release to release. This isn't a reason NOT to make the change, but it does change visible behavior, and that potentially affects actual code. (Code that I'd consider poorly written, but still.)
I talked this over with Joe Darcy (CSR chair) and we felt that it would be prudent to file a CSR request to document the behavior change.
**
Some comments on the code.
Overall I think the changes are going in the right direction. It's amazing that
after all this time, there are still cases in core collections that are
inheriting the slow Abstract* implementations.
In HashMap, I think the factoring between keysToArray() and prepareArray() can be improved. The prepareArray() method itself is sensible, in that it implements the weird toArray(T[]) semantics: it allocates a new array if the given one is too short, puts a 'null' at the right place if the given array is too long, and returns the possibly reallocated array.
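For readers without the webrev handy, the semantics being factored out are roughly these (a sketch, not the patch itself; the standalone signature with an explicit size parameter is mine):

    import java.lang.reflect.Array;

    // toArray(T[]) contract: too short -> allocate a new array of the same
    // component type; too long -> put a null just past the last element.
    @SuppressWarnings("unchecked")
    static <T> T[] prepareArray(T[] a, int size) {
        if (a.length < size)
            return (T[]) Array.newInstance(a.getClass().getComponentType(), size);
        if (a.length > size)
            a[size] = null;
        return a;
    }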
The first thing that keysToArray() does is to call prepareArray() and use its
return value. What concerns me is that the no-arg toArray() method does this:
return keysToArray(new Object[size]);
so the array is created with exactly the right size; yet the first thing
keysToArray() does is to call prepareArray(), which checks the size again and
determines that nothing need be done.
It would be a small optimization to shift things around so that keysToArray() takes an array known to be "prepared" already, removing its internal call to prepareArray(). The caller would then be responsible for calling prepareArray() if necessary.
Then we'd have this:
953 public Object[] toArray() { return keysToArray(new Object[size]); }
954 public <T> T[] toArray(T[] a) { return keysToArray(prepareArray(a)); }
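The bulk of the work would then live in something shaped roughly like this (a sketch against HashMap's existing Node/table/size internals, not the actual webrev code):

    // keysToArray() assumes its argument is already "prepared" (long enough,
    // null-terminated if longer than size), so the no-arg toArray() above can
    // pass a freshly sized array straight in without a redundant size check.
    <T> T[] keysToArray(T[] a) {
        Object[] r = a;
        Node<K,V>[] tab;
        int idx = 0;
        if (size > 0 && (tab = table) != null) {
            for (Node<K,V> e : tab) {
                for (; e != null; e = e.next) {
                    r[idx++] = e.key;
                }
            }
        }
        return a;
    }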
I'm not terribly concerned about avoiding an extra method call. But other code
in this file, for example like the following:
930 Node<K,V>[] tab;
931 int idx = 0;
932 if (size > 0 && (tab = table) != null) {
is written with an embedded assignment to 'tab', probably in order to avoid a useless load of the 'table' field if the size is zero. (This style occurs in several places.) So, if this code is already micro-optimizing at this level, let's not waste that by calling extra methods unnecessarily. (I don't know whether this will impact the benchmarks, though.)
Aside from performance, refactoring keysToArray/prepareArray this way makes more
sense to me anyway.
Similar adjustments to the call sites within HashSet and LinkedHashMap would
need to be done.
I'd recommend adding doc comments to HashMap's prepareArray() and keysToArray(), since they are called or overridden from outside this file. It needn't be a full-on spec comment, just enough to make it clear that other things depend on these internal interfaces.
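Something brief along these lines would be enough (wording is mine, only to show the intended level of detail):

    /**
     * Prepares the given array for a toArray(T[]) call: reallocates it if it
     * is shorter than size, null-terminates it if it is longer, and returns
     * the (possibly new) array. Also used from HashSet and LinkedHashMap, so
     * keep its contract stable.
     */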
Thanks,
s'marks