races on flat values

Dan Heidinga heidinga at redhat.com
Wed Aug 10 14:38:09 UTC 2022


Thanks for writing this up, John.

This matches my thinking overall.  Looking for confirmation on one
point below related to synchronized blocks & non-atomic values.

On Wed, Jul 13, 2022 at 11:32 PM John Rose <john.r.rose at oracle.com> wrote:
>
> So, let’s talk about the Java Memory Model and consequent JVM support for flattenable types.
>
> (Racing on flats is either really cool, as on the Bonneville Salt Flats, or else really dumb, if the flats are your flat tires. Either image will do here. Races are a generally bad idea, but sometimes they are how the cool kids get the best performance, just barely avoiding application-defined failure modes.)
>
> When a value is atomic, and/or when a reference type is used (not the C.val companion), I think there does not need to be any impact on the JMM. It is always enough to say that a load or store of the value behaves as if (two very important words: “as if”) the value were separately buffered in the heap, and accessed only via a safely published pointer.
>
> Consider a racing read of a composite (multi-field) value from a variable V of type C.val, that has also been set from two or more racing writes V=X, V=Y. The “as if” rule implies that the read will see either X with its fields X.*F* or Y with its fields Y.*F*, because X and Y behave “as if” they are references to separate buffered instances of class C. (Either of the writes could be a buffered snapshot of aboriginal value C.default.) This is all straight out of the JMM’s repertoire of scenarios.
>
> The JVM can try to pull off various shenanigans under the covers, as long as the user observes behavior “as if” the values were buffered in the heap. This is a good starting point for implementors, and it takes us quickly into special cases where the structure of the .*F* fields is simple enough. If they all pack in an atomically readable/writable memory unit (64 or 128 bits typically), then the JVM implementor can choose to quietly maintain the illusion of heap buffering, while storing a composite like X.*F* by packing the .*F* in a single memory unit, but omitting any physical reference for X. And then of course it doesn’t matter whether X existed in the first place or not, so don’t allocate it either.
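
To make that single-memory-unit trick concrete, here is a rough
stand-in in today's Java - the class and the hi/lo encoding are my
own, not HotSpot's - showing a two-int composite kept in one
atomically accessed 64-bit word, with no heap buffer and no reference:

import java.util.concurrent.atomic.AtomicLong;

// Two 32-bit fields packed into one 64-bit word, so every load and
// store of the composite is a single atomic memory operation.
final class PackedPair {
    private final AtomicLong bits = new AtomicLong();  // hi 32: f1, lo 32: f2

    void store(int f1, int f2) {                       // one atomic 64-bit store
        bits.set(((long) f1 << 32) | (f2 & 0xFFFF_FFFFL));
    }

    int[] load() {                                     // one atomic 64-bit load:
        long b = bits.get();                           // f1/f2 can never be torn
        return new int[] { (int) (b >>> 32), (int) b };
    }
}
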
>
> There’s more where that came from. An example would be an optimistic technique which stores both X and some of the .*F* in 128 bits, plus the rest of the .*F* in more bits; some sort of pointer encoding would say whether X must be followed (to a different heap block) or else (as the optimist hopes) all the .*F* can be read, perhaps helped by a fast seq-lock or some other exclusion of races. I’m just getting started here, but it gets very very tricky and performance can stall out quickly.
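
The seq-lock John mentions can be sketched with StampedLock, whose
optimistic-read mode is essentially a seq-lock. This illustrates the
technique only, not the JVM's internal mechanism:

import java.util.concurrent.locks.StampedLock;

// Optimistic, seq-lock-style read of a composite too wide to load
// atomically: read with no stores on the fast path, then validate
// that no writer intervened; fall back to a real read lock on failure.
final class SeqLockedPair {
    private final StampedLock lock = new StampedLock();
    private long a, b;   // together wider than one atomic memory unit

    void write(long newA, long newB) {
        long stamp = lock.writeLock();
        try { a = newA; b = newB; }
        finally { lock.unlockWrite(stamp); }
    }

    long[] read() {
        long stamp = lock.tryOptimisticRead();
        long x = a, y = b;               // racy reads, validated below
        if (!lock.validate(stamp)) {     // a write raced us: retry under the lock
            stamp = lock.readLock();
            try { x = a; y = b; }
            finally { lock.unlockRead(stamp); }
        }
        return new long[] { x, y };
    }
}
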
>
> As has been mentioned already, nullable reference types can potentially be flattened as well as regular value types, and this would require encoding an extra “null channel” in addition to the .*F* fields of the value being pointed to. This could be a byte, or a bit, or less than a bit if there is “slack” in any of the other .*F* fields. So a tri-state OptionalBoolean could still be just one byte, but an OptionalLong is cursed with the need to find that 65th bit. There’s a lot to say about potential implementations under that benign “as if” rule. With respect to JMM, the implementor has to ensure that if a default value C.default is never stored into a C.ref flat container, no possible race can observe that value. This is a special hazard, depending on the order in which the null channel is read and written, since an all zero flat C.ref might race to a C.default value if the null channel were set to the not-null state, before any .*F* components were stored. The safest way to stay within bounds is to use what Brian calls “half-flat” formats that fit in 128 bits (or whatever is the natural unit for a platform) and load and store those formats with atomic instructions. This was already true for the C.val flats discussed above, but it becomes that much harder when you have the null channel along as a hitchhiker.
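
The write-side discipline for that null-channel hazard is simple to
state: store the .*F* fields first, then publish by setting the null
channel, with release/acquire ordering. A minimal sketch in today's
Java - the field names and the one-byte null channel are my
assumptions, not the draft design:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Flat nullable container: a racing reader that sees "present" must
// also see the field store, so it can never observe C.default.
final class FlatNullableLong {
    private byte nullChannel;   // 0 = null, 1 = present (assumed encoding)
    private long f;             // the lone .F sub-variable (the "65th bit" case)

    private static final VarHandle NULL_CH;
    static {
        try {
            NULL_CH = MethodHandles.lookup()
                .findVarHandle(FlatNullableLong.class, "nullChannel", byte.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void store(long value) {
        f = value;                            // 1) write the field(s) first...
        NULL_CH.setRelease(this, (byte) 1);   // 2) ...then flip the null channel
    }

    Long load() {
        if ((byte) NULL_CH.getAcquire(this) == 0) return null;
        return f;                             // ordered after the acquire read
    }
}
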
>
> Those were the preliminaries; now we come to non-atomic value types of the form C.val. When such a variable is flattened fully (this is Brian’s “full flat” option), new kinds of races can create torn values. A full account of these races requires new rules within the JMM, describing what loads and stores (and other events) look like when they involve the new value types.
>
> I think we can say that a variable (field or array element) of a non-atomic value type (but no other type) decomposes into independent sub-variables, one for each field. This has to happen recursively, for fields that are themselves Q-types. Each sub-variable is an independent variable of the JMM.
>
> Maybe that is enough to build up various useful and interesting events and relations (read, write, happens-before, etc.), or maybe not. This is regardless of whether the JVM actually flattens; again it is all “as if”, but this time with more races.
>
> So, suppose a read (getfield or aaload) grabs the .*F* field values from some variable V, into which were previously stored both X (including, by reference, all of X.*F*) and, racing with X, also Y (including, by reference, Y.*F*). Let’s call these variables V.*F*. I think we should then break the single read and both writes into one read and two writes per field (of C).
>
> I think the JMM can ignore the references X and Y and just track the individual read and write events. Having null out of the picture helps us forget the pointers as well.
>
> From the POV of the write, a thread decided on a whole bundle of field values and stored them, one by one, into the separate .*F* variables.
>
> The more exotic events of the JMM can, I think, simply be distributed from V to the sub-variables V.*F* in a regular way, just as we distributed the plain reads and writes.
>
> This seems workable. Let’s test it by considering a hybrid scenario where the class C includes a slowly-varying field and a rapidly-varying one. Maybe an array cursor:
>
> value record C(Object[] array, int index) {
>   public non-atomic companion type C.val;
>   public boolean hasNext() { return index < array.length; }
>   public Object get() { return array[index]; }
>   public C next() { check(); return new C(array, index+1); }
>   private void check() { if (!hasNext())  throw new Error(); }
> }
>
> (Such a type is reasonably safe to use even with a public value type: The failure modes are comparable to those you get if you race an iterator for an ArrayList. One might even expect its null-capable reference type to flatten nicely in 64 or 128 bits. But that’s beside my point here.)
>
> Suppose I have a full-flat container of C.val and I have two racing writes; the first set up the variable from scratch, and the second changes the index but not the array (say, by calling next).
>
> static final Object[] ARR = {22, 33, 55, 77};
> static C.val V = new C(ARR, 0);
> void T1() { V = V.next(); }
> void T2() { System.out.println(V); }
> // T1 and T2 execute concurrently
>
> The effect I would like is for a racing read to receive either index value, but always the same array value, as long as all racing writes have contributed the same array value. I think this is true for the example code above. What do you think?
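
That's my expectation as well. Since the C.val syntax isn't runnable
today, here's a plain-Java stand-in that models the two sub-variables
as independent plain fields (per the decomposition above) and checks
exactly that property - the class shape and names are mine:

public class FlatRaceDemo {
    static final Object[] ARR = {22, 33, 55, 77};
    static Object[] v_array = ARR;   // models sub-variable V.*array
    static int v_index = 0;          // models sub-variable V.*index

    public static void main(String[] args) throws Exception {
        Thread writer = new Thread(() -> {   // models T1: V = V.next()
            for (int n = 0; n < 1_000_000; n++) {
                v_array = v_array;           // every write contributes the same array
                v_index = (v_index + 1) % ARR.length;
            }
        });
        Thread reader = new Thread(() -> {   // models T2's racing read
            while (writer.isAlive()) {
                Object[] a = v_array;        // any index may pair with this read...
                int i = v_index;
                if (a != ARR || i < 0 || i >= a.length)  // ...but never a foreign array
                    throw new AssertionError("torn read: i=" + i);
            }
        });
        writer.start();
        reader.start();
        writer.join();
        reader.join();
        System.out.println("racing reads saw varying indexes, one array");
    }
}
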
>
> The reason I want this effect is I want to enable an optimization like this:
>
> void T1optimized() {
>   if (false)  V = V.next();  //original code version
>   if (false)  V = new C(V.array, V.index+1); //inlining
>   if (false)  V = V __WithSubVariables {  //inlining
>     array = V.array;
>     index = V.index+1;
>   }
>   boolean MAYBE = false;
>   if (MAYBE)  {  //unbundling sub-variables
>     V.*array = V.*array; //useless store, kill it if you can
>     V.*index = V.*index+1;
>   } else {
>     V.*index += 1;  //32-bit memory update
>   }
> }
>
> To get to the pleasing end result, I think the JIT has to work through the intermediate phases, and have permission to stop at any of them at any time. So I want the JIT to have the option (at its own whim) to either make a 32-bit memory update of just the V.*index sub-variable, or else a larger update to both sub-variables (the if (MAYBE) block in the example).
>
> This will have no effect in the simple scenario described above. But in more complex ones, where the V.*array sub-variable is changing as well, the JMM will allow arbitrarily strange mismatches between fields, such as a really obsolete array and the latest index (into a different array). This could happen if a racing composite write stalled just before writing V.*array, waited until just about now to write its (by now very stale) array value, and then stalled permanently before writing the associated V.*index value. Meanwhile more normal threads are writing more or less coherent array/index pairs, but suddenly a racing read can pick up a recent index and the very old array component.
>
> I guess this is all true whether or not the JIT makes that final optimization step of nullifying a useless write. So (to finish with this laborious example) I guess the JIT has all it needs to optimize the processing of non-atomic full-flat values, without straying from the JMM (which is very permissive in the presence of races).
>
> One thing to observe in passing is that when a method like C::next runs, it has a value this which is on the stack, not in heap. This means that there can be no races on the fields of this during the execution of any method. So as the body of any C method executes, the fields this.array and this.index cannot change. This is as one would expect from final fields. But it’s true even if the original copy of this is being concurrently trashed by racing writes. This means the JIT cannot treat race-prone heap containers as spilled copies of this, to be reloaded at leisure.

Agreed.  A field (V.*F) must be "sticky" after having been read and so
we can't re-read from the heap containers.
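
In other words, the inlined body behaves as if each sub-variable were
copied to a local on first use. Continuing the plain-fields stand-in
from above (a hypothetical lowering, not generated code):

// Lowering of an inlined V.get() over the flat container modeled as
// plain statics v_array/v_index: each sub-variable is loaded from the
// heap at most once, then stays "sticky" in locals.
static Object inlinedGet() {
    Object[] a = FlatRaceDemo.v_array;  // single load of the array sub-variable
    int i = FlatRaceDemo.v_index;       // single load of the index sub-variable
    return a[i];                        // no re-reads of either sub-variable
}
// An unsafe lowering would reload v_array after reading v_index; a
// racing write between the two loads could pair a brand-new array with
// an index computed against the old one.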

> It has to pick up any and all fields that it might need just once per field needed in an inlined method call. It might kill dead stores, of course. If the class of this is atomic, it must use an atomic to pick up this.*F* (or the parts needed) at one time; otherwise it can pick up the needed parts of this.*F* as needed, but at most once per part.

I think this is right but I'm still slightly uncomfortable with
picking up the "needed parts of this.*F* as needed" as it can greatly
expand the race window in user code in ways that aren't obvious from
the source code.  Since this only applies to racy programs, the
correct answer for the programmer - regardless of the size of the
window - is to properly synchronize the access.

In this model, when do the reads of fields actually occur for
synchronized blocks?  In the following code, all C.val fields used
after the block must be read during the synchronized block, correct?

static C.val sharedVal = ....;

C.val myVal;
synchronized(someLock) {
   myVal = sharedVal;
}
... myVal.array ...
... myVal.index ...

Assuming all the subsequent field reads through myVal are guaranteed
to have been privatized by the end of the synchronized block, then I
think the model makes sense and sounds right to me.
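
Spelled out with the sub-variable decomposition, the lowering I'm
assuming looks like this (hypothetical sub-variable names):

// "myVal = sharedVal" decomposes into one read per sub-variable, and
// all of those reads must land inside the critical section:
Object[] myVal_array;
int myVal_index;
synchronized (someLock) {
    myVal_array = sharedVal_array;   // read of sharedVal.*array
    myVal_index = sharedVal_index;   // read of sharedVal.*index
}
// ... later uses touch only the privatized copies, never sharedVal ...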

@Tobi, have the rest of the OpenJ9 JIT developers weighed in on this model yet?

>
> Does the partial write technique sketched here work for atomic flat values? I wish it would but I think it doesn’t, in general. Suppose that C above (the cursor class) were atomic, as is surely more typical. If I update the V.*index sub-variable by itself, I have to make sure that the 32-bit update is atomic with respect to the neighboring V.*array variable. If the hardware allows me to mix 32-bit and 64-bit atomics on the same word, well and good; I can do the narrow update. But it probably won’t work very often, and perhaps the hardware would have trouble sorting out the conflicting update sizes.
>
> A special case of this is setting a half-flat C.ref variable to null. This should allow a narrow store (say just one byte to the null channel), leaving the other bytes as garbage to be dealt with later. (The GC can come along later and zero them out, kind of like with weak reference processing, but more certain and eager.) Doing this requires care in ordering the null checks. If the JIT sees the null channel set to the “null=yes” state (probably a zero bit or byte) then the JIT needs to cover its eyes and ignore any other bits it picked up in the same atomic read, because they might be non-zero garbage marooned by a partial write to the null channel. Since writing a zero byte to memory is naturally atomic, the hardware might tolerate null-channel writes mixed in with full 64-bit and 128-bit reads.
>
> An optimistic narrow-word null check might work on the read side. If I read the null channel using a single-byte read, and observe the “null=yes” state, I don’t need to read anything else. But, if I observe “null=no” using the narrow read, and do the full-width atomic read of 64 or 128 bits, I need to check the null channel again, in the full read, since a null might have come into memory between my two memory operations. This is uncomfortably like the “test twice” anti-pattern, but I think it actually works. Whether it is profitable is anybody’s guess. I put this in as another example of VM shenanigans behind the “as if” rules.
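
For the archive, here's the shape of that test-twice read over a
64-bit container, with bit 0 standing in for the null channel and the
upper 32 bits carrying the payload. The encoding is mine, and the
first get() stands in for the cheaper single-byte load a real
implementation would use:

import java.util.concurrent.atomic.AtomicLong;

final class OptimisticNullRead {
    final AtomicLong word = new AtomicLong();   // all zero: starts out null

    Integer read() {
        long narrow = word.get();           // 1) narrow null-channel check
        if ((narrow & 1) == 0) return null; //    null: read nothing else
        long full = word.get();             // 2) full-width atomic load
        if ((full & 1) == 0) return null;   // 3) re-check: null may have landed in between
        return (int) (full >>> 32);         //    only now trust the payload bits
    }

    void write(Integer v) {                 // full-width atomic store
        word.set(v == null ? 0L : ((long) v << 32) | 1L);
    }
}
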
>
> Partial reads from flat atomic variables are probably a good idea in general. (That is, as long as they don’t interfere with hardware’s graceful execution of the atomic write instructions that populate the variables in the first place.) If the cursor C is atomic, and I write V.index() (again V is of type C), the JIT doesn’t need to load the V.*array sub-variable, just the V.*index sub-variable. No atomicity failures can be observed even if V has racing writes, since you can believe any V.*array value you like came with your sample of V.*index. But methods which work on two or more fields must pick up all of the sub-variable values (for those fields) in an atomic operation. So V.hasNext(), which looks at both array.length and index, needs to pick up the bundle V.*{index,array} in a coherent manner, using an atomic 64-bit or 128-bit load.
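
One more sketch in the same packed-long style (my encoding again: the
cursor's bound in the high half, index in the low half). A one-field
method may decode just its own field from any load, but a two-field
method must take both fields from the same atomic sample:

import java.util.concurrent.atomic.AtomicLong;

final class PackedCursor {
    private final AtomicLong bits;          // hi 32: limit, lo 32: index

    PackedCursor(int limit, int index) {
        bits = new AtomicLong(((long) limit << 32) | (index & 0xFFFF_FFFFL));
    }

    int index() {                           // partial read: no tearing observable,
        return (int) bits.get();            // any limit "could have" accompanied it
    }

    boolean hasNext() {                     // two-field method: one coherent sample
        long b = bits.get();                // single atomic 64-bit load
        return (int) b < (int) (b >>> 32);  // index < limit, from the same sample
    }
}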


