Excessive copying of Collection.toArray()
Stuart Marks
stuart.marks at oracle.com
Thu Sep 4 03:17:29 UTC 2025
Hi Shaun,
You got pretty far with your analysis; congratulations! It would indeed be nice to
avoid excess copying. The crux of the matter is, as you concluded, how one can know
whether to trust the caller. As you surmised the "only trust ArrayList" policy is
merely a stopgap.
You already proposed and knocked down the "obvious solution" which is to trust JDK
classes (which are loaded by the boot classloader). Similar simple solutions, such
as an allowlist with `instanceof` checks, can be defeated by subclassing. OK, maybe
use getClass() instead. Now we have an allowlist that loads all the classes that it
mentions, possibly unnecessarily. Hmmm.
Using a private interface of some sort is an interesting technique to explore, but
one needs to be careful about subclassing. (Actually it would probably be a public
interface in jdk.internal.util or nearby, but not exported.)
And of course we'd want to avoid having random logic sprinkled all around the JDK,
so the "safeToArray" logic and implementation would have to be factored so that it's
nicely centralized.
But this would still leave non-JDK classes out in the cold. An arrangement that
could work would be to have some construct that allows the creation of and writing
to an array, but which precludes the possibility of retaining a reference to the
array (or perhaps merely preventing writes to it after a certain point). One could
imagine a library construct that implements this. Of course it would require classes
to opt into this arrangement. Ordinary implementors of toArray() would still suffer
an extra copy.
But maybe this is overkill, and the best is the enemy of the good. Perhaps we could
add a short allowlist that contains just the most frequently-used JDK classes. This
wouldn't be exhaustive, but it could help the situation by covering more of the
common cases.
I was going to file an enhancement request for this, but there already is one:
https://bugs.openjdk.org/browse/JDK-8257475
I filed it several years ago. :blush:
s'marks
On 9/2/25 11:19 AM, Shaun Spiller wrote:
> Hello!
>
> Please consider the following real utility method I have:
>
> /**
> * Returns an immutable list via {@link List#of} that is the
> * sorted content of the collection.
> */
> public static <E> @NonNull List<E> sortedImmutableList(
> @NonNull Collection<E> col,
> @Nullable Comparator<? super E> comparator) {
> ArrayList<E> list = new ArrayList<>(col);
> list.sort(comparator);
> return List.copyOf(list);
> }
>
> Quiz question: Assume this method is invoked with a large HashSet as
> its collection argument. How many ARRAY copies of the collection are
> created by this method? Clearly there must be at least one for the
> result, held internally in the immutable list. Any others?
>
>
> (Scroll for the answer.)
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> .
>
> I was surprised to realize the answer is four:
>
> 1. ArrayList calls toArray() on the HashSet.
> 2. ArrayList defensively copies the array to assign to its internal
> elementData field, because it doesn't trust HashSet.
> 3. List.copyOf calls toArray() on the ArrayList.
> 4. List.of defensively copies the array to assign to its internal
> elements field, because it doesn't trust ArrayList.
>
> It is quite possible to rewrite my particular utility method with a
> raw array instead of ArrayList, to eliminate 2 array copies, but this
> requires an icky unchecked cast, and by rights is the kind of thing
> that shouldn't be necessary anyway.
>
> -----
>
> In general, this kind of excessive defensive copying is going on ALL THE TIME.
>
> The underlying nuisance is that a subclass of Collection could
> potentially override toArray() to keep a reference to the returned
> array and make mischief with it. (Also, for ArrayList, it would be a
> problem if the returned array were a subclass of Object[].)
>
> I did a search thru the JDK and found a slew of cases where code calls
> Collection.toArray() and then has to worry about defensively copying
> it -- or trying to avoid defensively copying it:
>
> - java.util.List.copyOf(Collection) -- avoids copying an existing
> List.of/copyOf list, but for any other collection it calls toArray()
> and then defensively copies it again
>
> - java.util.ArrayList(Collection) -- constructor has a special check
> `if (collection.getClass() == ArrayList.class)`, in which case it
> skips the defensive copy and trusts the result of toArray(), but it
> does not trust any other collection
>
> - java.util.Vector(Collection) -- constructor has the same
> optimization trick as ArrayList, where it trusts ArrayList, but
> creates a defensive copy for any other collection. Vector doesn't even
> trust itself!
>
> - java.util.PriorityQueue.initElementsFromCollection(Collection) --
> has optimization for ArrayList, but no other collection
>
> - java.util.concurrent.PriorityBlockingQueue(Collection) -- has
> optimization for ArrayList, but no other collection
>
> - java.util.concurrent.CopyOnWriteArrayList(Collection) -- has
> optimizations for ArrayList and CopyOnWriteArrayList, but no other
> collection.
>
> - java.util.concurrent.CopyOnWriteArrayList.addAllAbsent(Collection)
> -- has optimization for ArrayList but no other collection.
>
> - java.util.stream.Collectors.toUnmodifiableList() -- this avoids an
> excess copy, but does so in a very complicated way
>
> - sun.awt.util.IdentityArrayList(Collection) -- this appears to be an
> old copy-paste of ArrayList except with identity indexOf; like
> pre-2020 ArrayList, it blindly trusts that `col.toArray()` returns a
> new array, although based on this class's extremely limited use there
> is no manifestable bug.
>
> - java.lang.invoke.MethodHandles.dropArguments/dropArgumentsToMatch --
> these call `list.toArray()` and then clone the array unconditionally
>
> - java.time.zone.ZoneRules.of(...) -- calls `lastRules.toArray()` and
> then clones the array unconditionally
>
> -----
>
> I would suggest that there should be an internal utility method for
> use in all these cases, something like:
>
> public static Object[] safeToArray(Collection<?> c) {
> Object[] a = c.toArray();
> if (!isTrusted(c)) {
> a = Arrays.copyOf(a, a.length, Object[].class);
> }
> return a;
> }
>
> The delicate issue then is: what is the correct design of "isTrusted"?
> The check must be fast, and not spoofable by classes outside the JDK.
>
> My first thought was to detect that the class is loaded by the
> bootstrap class loader:
>
> static boolean isTrusted(Collection<?> c) {
> return c.getClass().getClassLoader() == null;
> }
>
> However, I now realize this is unwise because it would shift the
> burden of trust into **every** Collection class in the JDK, instead of
> into the classes that actually care about the result. For example,
> Collections.UnmodifiableCollection.toArray() simply returns the result
> of its internal `col.toArray()`, which immediately carves a hole the
> size of a bus thru the check.
>
> Perhaps better would be for classes to explicitly opt-in and say "Yes,
> I promise that my toArray() method is safe". Would a package-private
> marker interface work for this?
>
> I am not sure.
>
> Anyway, the point is that classes within the JDK should not need to
> create so many defensive copies of each other.
>
> Note that this is becoming more of an issue now that List.of is
> becoming a popular "go-to" list for many situations where ArrayList
> would traditionally have been used. The "only trust ArrayList" pattern
> was never ideal and is starting to age.
>
> This is my first post here so I apologize for oversights in formatting
> or etiquette.
> Thanks for reading.
More information about the core-libs-dev
mailing list