The non-deterministic iteration order of Immutable Collections

Sun Mar 26 09:38:07 UTC 2023

I think Map#of and friends would be more useful and less error prone if they where to return collections that have a fixed iteration order, where the order is defined by the insertion order when the map is created.

### My experience with jdeps

I encountered this as a problem when I tried to generate a module dependency graph using jdeps. I had my classpath wrong and got an error message. The error message contained a list of modules or jar-files (I don't remember the details, unfortunately).

The relevant part of this story for collection iteration order is this: Each time I re-run the jdeps command I got a different list of modules in the error message.

It took me a long time and a lot of confusion before I realised that it was actually not a different list of modules, but instead the same list, in different order.

I never came around to verify it, but a likely explanation for this confusing behaviour is that jdeps used Map#ofEntries internally, and printed the entries in the error message, which gave a new random order for each program execution.

Map#of probably works fine for the main functionality of jdeps, which probably is to use the map for lookups, but for error message printouts the result was confusing.

### Map encounter order in general

I think a large part of all maps that are used in Java applications end up being iterated over in some situations: Either they are displayed in a GUI, or serialised and sent over a network connection, or, as in my example, being displayed in an error message.

In neither of these cases is it a strait-out bug to use a random iteration order, but in all of them it is likely bad and confusing to do so.

Because of this I think that general purpose maps should by default preserve the insertion order of the entries.

Not preserving insertion order should be considered to be a niche use case, used only to optimize memory footprint when the users opt in to it because they decide that it is acceptable.

### Other collection frameworks

The Guava immutable map factories create implementations that preserve the iteration order:

https://github.com/google/guava/wiki/ImmutableCollectionsExplained#how

https://guava.dev/releases/23.0/api/docs/com/google/common/collect/ImmutableMap.Builder.html

Because of the iteration order issue of Map#of I have continued to used these.

### Memory cost to store insertion order

The Guava RegularImmutableMap uses a separate array of references to map entries to define the iteration order. This implementation gives an extra memory overhead per entry of one reference.

The size of a map using the standard library immutable map implementation, MapN, is approximately the following. (MapN uses a memory efficient implementation which avoids the need to allocate entry objects.)

Size per entry, measured in the number of references that are used:

1 - Reference to key
1 - Reference to value
1 - Hash table extra space at 66 % load

1 - Key object header
1 - Approximate size contents of a small key object
1 - Value object header
1 - Approximate size contents of a small value object

Total: 7

The addition of one extra reference for tracking the insertion order gives a memory increase of 1/7 = 14 % in this scenario. To me this sounds like a good trade-off for a map with better behaviour.

Best regards,
Jens Lideström
Java and Mumps programming enthusiast

On 2023-03-25 01:16, Stuart Marks wrote:
> 
> 
> On 3/24/23 10:18 AM, Chris Hegarty wrote:
>> I know that this has come up a number of times before, but I cannot seem to find anything directly relevant to my use-case, or recent.
>>
>> The iteration order of immutable Sets and Maps (created by the `of` factories) is clearly unspecified - good. Code should not depend upon this iteration order, and if so that code is broken. I completely agree and am aligned with this principle.
>>
>> The Elasticsearch code base heavily uses these collection implementations, and their iterators. On occasion we run into issues, where our code (because of a bug or a race), inadvertently has a dependency on the iteration order. To be clear, this is a bug in our code, and we want to reproduce and fix it.  The problem we are encountering is that the iteration order is not determinable or reproducible, which makes it extremely challenging to reproduce the bug even when we have reproducible testcase to hand. ( whereas, for example, it is common practice in other parts of the code to log a seed used for randomization )
>>
>> We very much like using the immutable collections, and we definitely don't want to pay the price of sorting things in the code, since we don't want to, and should not, depend upon the iteration order. The only concern is with reproducibility when we run into (our own) bugs.
>>
>> I don't (yet) want to be prescriptive in any potential solution. And I know that this has been discussed before. I mostly just want to start a conversation and see how much traction it gets.
> 
> Hi Chris,
> 
> Yes, this has come up before, but it's been mostly theoretical. That is, people worry about this when they hear of the idea of randomized iteration order, but I've never heard any followup. This is in fact the first time I've heard of an actual case where this is a real problem. So thanks for bringing it up.
> 
> (And unfortunately, it's notoriously difficult to make code truly iteration-order dependent. We've had our own history of problems with this in the JDK. I'd be interested in hearing from you at some point about the exact pathology of how this occurred.)
> 
> There's currently no debugging or experimental interface that will let one print or set the random seed that's in use. The obvious approach of using an agent to hack away at runtime doesn't work, because the unmodifiable collections implementations are used very early in startup and they're usually loaded before the agent runs. There are limitations on what an agent can do when redefining an already-loaded class, for example, removing the 'final' modifier from fields isn't supported. (Well I suppose one can always use Unsafe.)
> 
> Here's another approach that might work for you.
> 
> 1. Write a little program that extracts the class bytes of ImmutableCollection.class from the runtime image, and use something like ASM or ClassFile to make the class public and to make the REVERSE and SALT32L fields public and non-final.
> 
> 2. Launch a VM using the --patch-module option to use this class instead of the built-in one.
> 
> 3. Write an agent, or some debugging library that's called at the right time, to use simple reflective access to get or set those fields as desired.
> 
> This is a bit fiddly but it might be easier than rebuilding and deploying a custom JDK.
> 
> Setting (formerly) final fields after the VM is initialized is often somewhat risky. In this case these particular fields are used only during iteration, and they don't actually change the layout of any data structures. So setting them to some desired value should apply to all subsequent iterations, even of existing data structures.
> 
> I'll think about better ways to do this in the product. The best approach isn't obvious. The typical way of doing things like this using system properties is tricky, as it depends on order of class initialization at startup (and you know how fragile that is).
> 
> s'marks