races on flat values
John Rose
john.r.rose at oracle.com
Thu Jul 14 03:31:46 UTC 2022
So, let’s talk about the Java Memory Model and consequent JVM support
for flattenable types.
(Racing on flats is either really cool, as on the Bonneville Salt Flats,
or else really dumb, if the flats are your flat tires. Either image
will do here. Races are a generally bad idea, but sometimes they are
how the cool kids get the best performance, just barely avoiding
application-defined failure modes.)
When a value is atomic, and/or when a reference type is used (not the
`C.val` companion), I think there does not need to be any impact on the
JMM. It is always enough to say that a load or store of the value
behaves as if (two very important words: “as if”) the value were
separately buffered in the heap, and accessed only via a safely
published pointer.
Consider a racing read of a composite (multi-field) value from a
variable V of type `C.val` that has also been set by two or more
racing writes `V=X`, `V=Y`. The “as if” rule implies that the read
will see either X with its fields `X.*F*` or Y with its fields `Y.*F*`,
because X and Y behave “as if” they are references to separate
buffered instances of class C. (Either of the writes could be a
buffered snapshot of the aboriginal value `C.default`.) This is all
straight out of the JMM’s repertoire of scenarios.
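As a concrete, low-tech illustration of that “as if” rule, here is a
minimal sketch in plain Java (my own, with hypothetical names): because
the racing writes are just racing reference stores to immutable
instances, a reader observes all of X’s fields or all of Y’s fields,
never a mixture.
```
// Plain-Java analogue of the "as if buffered" rule (illustration only):
// racing writes of immutable instances are racing pointer stores, so a
// reader always sees a complete set of fields from a single write.
class AsIfBuffered {
    record C(int a, int b) { }        // stands in for the multi-field value class C

    static C V = new C(0, 0);         // shared variable, written with no synchronization

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> V = new C(1, 1));     // racing write of X
        Thread t2 = new Thread(() -> V = new C(2, 2));     // racing write of Y
        Thread t3 = new Thread(() -> {
            C v = V;                  // racing read: a single reference load
            if (v.a() != v.b())       // the fields travel together with the reference
                throw new AssertionError("mixed fields from X and Y: impossible here");
        });
        t1.start(); t2.start(); t3.start();
        t1.join();  t2.join();  t3.join();
    }
}
```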
The JVM can try to pull off various shenanigans under the covers, as
long as the user observes behavior “as if” the values were buffered
in the heap. This is a good starting point for implementors, and it
takes us quickly into special cases where the structure of the `.*F*`
fields is simple enough. If they all pack in an atomically
readable/writable memory unit (64 or 128 bits typically), then the JVM
implementor can choose to quietly maintain the illusion of heap
buffering, while storing a composite like `X.*F*` by packing the `.*F*`
in a single memory unit, but omitting any physical reference for X. And
then of course it doesn’t matter whether X existed in the first place
or not, so don’t allocate it either.
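Here is a hedged sketch of that packing trick in today’s Java, with all
names my own: two 32-bit fields share a single `long` held in an
`AtomicLong`, so every composite load and store is one atomic memory
operation, and there is no buffered instance and no pointer to chase.
```
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the "pack the fields, drop the pointer" idea (illustration only):
// a two-int composite stored as one 64-bit word, read and written atomically,
// preserving the "as if buffered in the heap" behavior with no allocation.
class PackedPair {
    private final AtomicLong word = new AtomicLong();   // holds both fields

    static long pack(int a, int b) { return ((long) a << 32) | (b & 0xFFFF_FFFFL); }
    static int  unpackA(long w)    { return (int) (w >>> 32); }
    static int  unpackB(long w)    { return (int) w; }

    void set(int a, int b) { word.set(pack(a, b)); }    // one atomic 64-bit store

    int[] get() {                                       // one atomic 64-bit load
        long w = word.get();
        return new int[] { unpackA(w), unpackB(w) };    // both fields from the same write
    }
}
```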
There’s more where that came from. An example would be an optimistic
technique which stores both X and some of the `.*F*` in 128 bits, plus
the rest of the `.*F*` in more bits; some sort of pointer encoding would
say whether X must be followed (to a different heap block) or else (as
the optimist hopes) all the `.*F*` can be read, perhaps helped by a
fast seq-lock or some other exclusion of races. I’m just getting
started here, but it gets very very tricky and performance can stall out
quickly.
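For concreteness, here is a minimal textbook-style seq-lock sketch
(mine, not the post’s): the writer bumps a sequence number to odd before
touching an oversized payload and back to even afterward, and readers
retry whenever they saw an odd value or the payload changed under them.
```
import java.lang.invoke.VarHandle;

// Minimal seqlock sketch (illustration of the "fast seq-lock" alluded to above).
// The two int fields stand in for a payload too large for one atomic unit.
class SeqLockedPair {
    private volatile int seq = 0;   // even = stable, odd = write in progress
    private int a, b;               // the oversized "flat" payload

    synchronized void write(int newA, int newB) {   // writers exclude each other
        seq = seq + 1;                  // odd: readers will retry
        VarHandle.storeStoreFence();    // payload stores stay after the odd store
        a = newA;
        b = newB;
        VarHandle.storeStoreFence();    // payload stores complete before the even store
        seq = seq + 1;                  // even again: payload is stable
    }

    int[] read() {
        while (true) {
            int s = seq;                             // volatile (acquire) read
            if ((s & 1) != 0) { Thread.onSpinWait(); continue; }   // write in progress
            int ra = a, rb = b;                      // speculative payload reads
            VarHandle.loadLoadFence();               // payload reads before the re-check
            if (seq == s) return new int[] { ra, rb };   // unchanged: coherent snapshot
        }
    }
}
```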
As has been mentioned already, nullable reference types can potentially
be flattened as well as regular value types, and this would require
encoding an extra “null channel” in addition to the `.*F*` fields of
the value being pointed to. This could be a byte, or a bit, or less
than a bit if there is “slack” in any of the other `.*F*` fields.
So a tri-state `OptionalBoolean` could still be just one byte, but an
`OptionalLong` is cursed with the need to find that 65th bit. There’s
a lot to say about potential implementations under that benign “as
if” rule. With respect to the JMM, the implementor has to ensure that if
the default value `C.default` is never stored into a `C.ref` flat
container, no possible race can observe that value. This is a special
hazard, depending on the order in which the null channel is read and
written: a racing read of an all-zero flat `C.ref` could observe
`C.default` if the null channel were set to the not-null state before any
`.*F*` components were stored. The safest way to stay within bounds is
to use what Brian calls “half-flat” formats that fit in 128 bits (or
whatever is the natural unit for a platform) and load and store those
formats with atomic instructions. This was already true for the `C.val`
flats discussed above, but it becomes that much harder when you have the
null channel along as a hitchhiker.
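As a toy model of a half-flat nullable container (all names mine, not
JVM code), here is a nullable 32-bit payload whose null channel shares
one 64-bit word with it; both channels are always loaded and stored
together in a single atomic operation, so the hazard above cannot arise.
```
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a "half-flat" nullable container (illustration only): a 32-bit
// payload plus a null channel share one 64-bit word, loaded and stored with a
// single atomic operation, so no race can see "not-null but still default".
class NullableInt {
    private static final long NULL_WORD = 0L;          // null channel zero means "null"
    private final AtomicLong word = new AtomicLong(NULL_WORD);

    void set(Integer v) {
        // Pack the null channel (bit 32) and the payload (low 32 bits) and store them together.
        word.set(v == null ? NULL_WORD : (1L << 32) | (v & 0xFFFF_FFFFL));
    }

    Integer get() {
        long w = word.get();                            // one atomic load of both channels
        return (w >>> 32) == 0 ? null : (int) w;        // consult the null channel first
    }
}
```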
Those were the preliminaries; now we come to non-atomic value types of
the form `C.val`. When such a variable is flattened fully (this is
Brian’s “full flat” option), new kinds of races can create torn
values. A full account of these races requires new rules within the
JMM, describing what loads and stores (and other events) look like when
they involve the new value types.
I think we can say that a variable (field or array element) of a
non-atomic value type (but no other type) decomposes into independent
*sub-variables*, one for each field. This has to happen recursively,
for fields that are themselves Q-types. Each sub-variable is an
independent variable of the JMM.
Maybe that is enough to build up various useful and interesting events
and relations (read, write, happens-before, etc.), or maybe not. This
is regardless of whether the JVM actually flattens; again it is all
“as if”, but this time with more races.
So, suppose a read (`getfield` or `aaload`) grabs the `.*F*` field
values from some variable V, into which were previously stored both X
(including, by reference, all of `X.*F*`) and, racing with X, also Y
(including, by reference, `Y.*F*`). Let’s call these per-field slots of V
the variables `V.*F*`.
I think we should then break the single read and both writes into one
read and two writes *per field* (of C).
I think the JMM can ignore the references X and Y and just track the
individual read and write events. Having null out of the picture helps
us forget the pointers as well.
From the POV of the write, a thread decided on a whole bundle of field
values and stored them, one by one, into the separate `.*F*` variables.
The more exotic events of the JMM can, I think, simply be distributed
from V to the sub-variables `V.*F*` in a regular way, just as we
distributed the plain reads and writes.
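To make the decomposition concrete, here it is spelled out by hand in
plain Java (an illustration of mine, with hypothetical field names): the
flat container is just its fields laid side by side, and each field is
an independent JMM variable with its own read and write events.
```
// Hand-spelled picture of the decomposition (illustration only): a container
// of a non-atomic C.val behaves like its fields laid side by side, each one
// an independent JMM variable.
class FlatHolder {
    int f1;     // sub-variable V.*f1
    long f2;    // sub-variable V.*f2

    // The composite write "V = X" decomposes into one write per sub-variable.
    void writeComposite(int x1, long x2) {
        f1 = x1;    // write event on V.*f1
        f2 = x2;    // write event on V.*f2, independently ordered
    }

    // The composite read decomposes the same way; under racing writes the two
    // values need not come from the same composite write (a "torn" value).
    long[] readComposite() {
        return new long[] { f1, f2 };   // read event on V.*f1, then on V.*f2
    }
}
```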
This seems workable. Let’s test it by considering a hybrid scenario
where the class C includes a slowly-varying field and a rapidly-varying
one. Maybe an array cursor:
```
value record C(Object[] array, int index) {
    public non-atomic companion type C.val;  // speculative syntax: declares the non-atomic companion
    public boolean hasNext() { return index < array.length; }
    public Object get() { return array[index]; }
    public C next() { check(); return new C(array, index+1); }
    private void check() { if (!hasNext()) throw new Error(); }
}
```
(Such a type is reasonably safe to use even with a public value type:
The failure modes are comparable to those you get if you race an
iterator for an `ArrayList`. One might even expect its null-capable
reference type to flatten nicely in 64 or 128 bits. But that’s beside
my point here.)
Suppose I have a full-flat container of `C.val` and I have two racing
writes; the first sets up the variable from scratch, and the second
changes the index but not the array (say, by calling `next`).
```
static final Object[] ARR = {22, 33, 55, 77};
static C.val V = new C(ARR, 0);
void T1() { V = V.next(); }
void T2() { System.out.println(V); }
// T1 and T2 execute concurrently
```
The effect I would like is for a racing read to receive either index
value, but always the same array value, as long as all racing writes
have contributed the same array value. I think this is true for the
example code above. What do you think?
The reason I want this effect is that I want to enable an optimization like
this:
```
void T1optimized() {
    if (false) V = V.next();                   //original code version
    if (false) V = new C(V.array, V.index+1);  //inlining
    if (false) V = V __WithSubVariables {      //inlining
        array = V.array;
        index = V.index+1;
    }
    boolean MAYBE = false;
    if (MAYBE) {                  //unbundling sub-variables
        V.*array = V.*array;      //useless store, kill it if you can
        V.*index = V.*index+1;
    } else {
        V.*index += 1;            //32-bit memory update
    }
}
```
To get to the pleasing end result, I think the JIT has to work through
the intermediate phases, and have permission to stop at any of them at
any time. So I want the JIT to have the option (at its own whim) to
either make a 32-bit memory update of just the V.*index sub-variable, or
else a larger update to *both* sub-variables (the `if (MAYBE)` block in
the example).
This will have no effect in the simple scenario described above. But in
more complex ones, where the `V.*array` sub-variable is changing as
well, the JMM will allow arbitrarily strange mismatches between fields,
such as a really obsolete array and the latest index (into a different
array). This could happen if a racing composite write stalled just
before writing `V.*array`, waited until just about now to write its (by
now really old) array value, and then stalled permanently before writing
the associated `V.*index` value. Meanwhile, more normal threads are writing more or
less coherent array/index pairs, but suddenly a racing read can pick up
a recent index and the very old array component.
I guess this is all true whether or not the JIT makes that final
optimization step of nullifying a useless write. So (to finish with
this laborious example) I guess the JIT has all it needs to optimize the
processing of non-atomic full-flat values, without straying from the JMM
(which is very permissive in the presence of races).
One thing to observe in passing is that when a method like `C::next`
runs, it has a value `this` which is on the stack, not in heap. This
means that there can be no races on the fields of `this` during the
execution of any method. So as the body of any C method executes, the
fields `this.array` and `this.index` cannot change. This is as one would
expect from final fields. But it’s true even if the original copy of
`this` is being concurrently trashed by racing writes. This means the
JIT cannot treat race-prone heap containers as spilled copies of `this`,
to be reloaded at leisure. It has to pick up each needed field just once
per inlined method call. It might kill dead stores, of course. If the
class of `this` is atomic, it must use an atomic load to pick up
`this.*F*` (or the parts needed) all at one time; otherwise it can pick
up the needed parts of `this.*F*` as needed, but at most once per part.
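Here is a hedged sketch (mine, with hypothetical names) of that shape
for an inlined `V = V.next()`: each needed field of the stack value
`this` is copied into a local exactly once, and every later use goes
through the locals.
```
// Sketch of the "load each needed field at most once" rule (illustration only):
// when V = V.next() is inlined against a race-prone flat container, the fields
// of the stack value `this` are read once into locals and never re-read.
class InlinedCursorCall {
    static Object[] sharedArray;   // sub-variable V.*array (racing writes possible)
    static int sharedIndex;        // sub-variable V.*index (racing writes possible)

    static void callNextInlined() {
        // `this` for the inlined next() call: one read per needed field, no more.
        Object[] a = sharedArray;
        int i = sharedIndex;
        // Inlined check()/hasNext(): must use the locals; re-reading sharedArray
        // or sharedIndex here could observe values from a different racing write.
        if (!(i < a.length)) throw new Error();
        // Inlined constructor stores for the enclosing V = ... assignment.
        sharedArray = a;           // useless store; the JIT may kill it
        sharedIndex = i + 1;
    }
}
```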
Does the partial write technique sketched here work for atomic flat
values? I wish it would but I think it doesn’t, in general. Suppose
that C above (the cursor class) were atomic, as is surely more typical.
If I update the `V.*index` sub-variable by itself, I have to make sure
that the 32-bit update is atomic with respect to the neighboring
`V.*array` variable. If the hardware allows me to mix 32-bit and 64-bit
atomics on the same word, well and good; I can do the narrow update.
But it probably won’t work very often, and perhaps the hardware would
have trouble sorting out the conflicting update sizes.
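One conservative fallback when the sizes cannot be mixed, which is my
own sketch and not something proposed above, is a full-width
compare-and-swap loop that rewrites the whole bundle just to advance the
index half:
```
import java.util.concurrent.atomic.AtomicLong;

// Fallback sketch (illustration only): if narrow and full-width atomics cannot
// safely mix on the same word, update the index component with a full-width CAS
// that leaves the other component exactly as it was read.
class AtomicIndexUpdate {
    private final AtomicLong word = new AtomicLong();   // {other component, index} bundle

    void incrementIndexOnly() {
        long prev, next;
        do {
            prev = word.get();                           // read the whole bundle
            int index = (int) prev;                      // low 32 bits: V.*index
            next = (prev & 0xFFFF_FFFF_0000_0000L)       // keep the other half intact
                 | ((index + 1) & 0xFFFF_FFFFL);
        } while (!word.compareAndSet(prev, next));       // full-width atomic update
    }
}
```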
A special case of this is setting a half-flat `C.ref` variable to null.
This should allow a narrow store (say just one byte to the null
channel), leaving the other bytes as garbage to be dealt with later.
(The GC can come along later and zero them out, kind of like with weak
reference processing, but more certain and eager.) Doing this requires
care in ordering the null checks. If the JIT sees the null channel set
to the “null=yes” state (probably a zero bit or byte) then the JIT
needs to cover its eyes and ignore any other bits it picked up in the
same atomic read, because they might be non-zero garbage marooned by a
partial write to the null channel. Since writing a zero byte to memory
is naturally atomic, the hardware might tolerate null-channel writes
mixed in with full 64-bit and 128-bit reads.
An optimistic narrow-word null check might work on the read side. If I
read the null channel using a single-byte read, and observe the
“null=yes” state, I don’t need to read anything else. But, if I
observe “null=no” using the narrow read, and do the full-width
atomic read of 64 or 128 bits, I need to check the null channel again,
in the full read, since a null might have come into memory between my
two memory operations. This is uncomfortably like the “test twice”
anti-pattern, but I think it actually works. Whether it is profitable
is anybody’s guess. I put this in as another example of VM
shenanigans behind the “as if” rules.
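Here is a toy simulation of that test-twice read (names mine, and the
“narrow” load is only simulated by masking a full load): the null
channel occupies the low byte of a single 64-bit word, the first check
consults only that byte, and the full-width read re-checks the channel
before trusting the payload.
```
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the optimistic "test twice" null check (illustration only).
// The flat container is one 64-bit word: low byte = null channel, high 32 bits
// = payload. A real JVM could issue a genuine one-byte load for the first check.
class TestTwiceNull {
    private final AtomicLong slot = new AtomicLong();   // all-zero means null

    void set(Integer v) {
        slot.set(v == null ? 0L : 1L | ((v & 0xFFFF_FFFFL) << 32));  // full-width store
    }

    Integer get() {
        long narrow = slot.get() & 0xFF;     // "narrow" read: only the null channel matters
        if (narrow == 0) return null;        // null=yes: no need to read anything else
        long full = slot.get();              // full-width atomic read of channel + payload
        if ((full & 0xFF) == 0) return null; // re-check: a null may have landed in between
        return (int) (full >>> 32);
    }
}
```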
Partial reads from flat atomic variables are probably a good idea in
general. (That is, as long as they don’t interfere with hardware’s
graceful execution of the atomic write instructions that populate the
variables in the first place.) If the cursor C is atomic, and I write
`V.index()` (again V is of type C), the JIT doesn’t need to load the
`V.*array` sub-variable, just the `V.*index` sub-variable. No atomicity
failures can be observed even if V has racing writes, since you can
believe any `V.*array` value you like came with your sample of
`V.*index`. But methods which work on two or more fields must pick up
all of the sub-variable values (for those fields) in an atomic
operation. So `V.hasNext()`, which looks at both `array.length` and
`index`, needs to pick up the bundle `V.*{index,array}` in a coherent
manner, using an atomic 64-bit or 128-bit load.
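A final toy model, again with names of my own and with the array
reference stood in for by its length (which is all `hasNext` consults):
when the two sub-variables are packed into one atomically loadable word,
a method that uses only the index can take it from any single load,
while `hasNext` must derive both components from the same load.
```
import java.util.concurrent.atomic.AtomicLong;

// Sketch of partial vs. bundled reads of an atomic flat cursor (illustration
// only): the array length sits in the high 32 bits and the index in the low 32.
class AtomicCursorWord {
    private final AtomicLong word = new AtomicLong();   // {length, index} bundle

    void set(int length, int index) {
        word.set(((long) length << 32) | (index & 0xFFFF_FFFFL));   // atomic composite store
    }

    int index() {
        return (int) word.get();             // partial read: any matching length would do
    }

    boolean hasNext() {
        long w = word.get();                 // one atomic load of the whole bundle
        int length = (int) (w >>> 32);
        int index  = (int) w;
        return index < length;               // both components from the same write
    }
}
```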