races on flat values
John Rose
john.r.rose at oracle.com
Thu Jul 14 03:31:46 UTC 2022
So, let’s talk about the Java Memory Model and consequent JVM support
for flattenable types.
(Racing on flats is either really cool, as on the Bonneville Salt Flats,
or else really dumb, if the flats are your flat tires. Either image
will do here. Races are a generally bad idea, but sometimes they are
how the cool kids get the best performance, just barely avoiding
application-defined failure modes.)
When a value is atomic, and/or when a reference type is used (not the
`C.val` companion), I think there does not need to be any impact on the
JMM. It is always enough to say that a load or store of the value
behaves as if (two very important words: “as if”) the value were
separately buffered in the heap, and accessed only via a safely
published pointer.
Consider a racing read of a composite (multi-field) value from a
variable V of type `C.val` that has also been set by two or more
racing writes `V=X`, `V=Y`. The “as if” rule implies that the read
will see either X with its fields `X.*F*` or Y with its fields `Y.*F*`,
because X and Y behave “as if” they are references to separate
buffered instances of class C. (Either of the writes could be a
buffered snapshot of the aboriginal value `C.default`.) This is all
straight out of the JMM’s repertoire of scenarios.
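As a concrete, low-tech illustration of that “as if” rule, here is a
minimal sketch in plain Java (my own, with hypothetical names): because
the racing writes are just racing reference stores to immutable
instances, a reader observes all of X’s fields or all of Y’s fields,
never a mixture.
```
// Plain-Java analogue of the "as if buffered" rule (illustration only):
// racing writes of immutable instances are racing pointer stores, so a
// reader always sees a complete set of fields from a single write.
class AsIfBuffered {
    record C(int a, int b) { }        // stands in for the multi-field value class C

    static C V = new C(0, 0);         // shared variable, written with no synchronization

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> V = new C(1, 1));     // racing write of X
        Thread t2 = new Thread(() -> V = new C(2, 2));     // racing write of Y
        Thread t3 = new Thread(() -> {
            C v = V;                  // racing read: a single reference load
            if (v.a() != v.b())       // the fields travel together with the reference
                throw new AssertionError("mixed fields from X and Y: impossible here");
        });
        t1.start(); t2.start(); t3.start();
        t1.join();  t2.join();  t3.join();
    }
}
```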
The JVM can try to pull off various shenanigans under the covers, as
long as the user observes behavior “as if” the values were buffered
in the heap. This is a good starting point for implementors, and it
takes us quickly into special cases where the structure of the `.*F*`
fields is simple enough. If they all pack in an atomically
readable/writable memory unit (64 or 128 bits typically), then the JVM
implementor can choose to quietly maintain the illusion of heap
buffering, while storing a composite like `X.*F*` by packing the `.*F*`
in a single memory unit, but omitting any physical reference for X. And
then of course it doesn’t matter whether X existed in the first place
or not, so don’t allocate it either.
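Here is a hedged sketch of that packing trick in today’s Java, with all
names my own: two 32-bit fields share a single `long` held in an
`AtomicLong`, so every composite load and store is one atomic memory
operation, and there is no buffered instance and no pointer to chase.
```
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the "pack the fields, drop the pointer" idea (illustration only):
// a two-int composite stored as one 64-bit word, read and written atomically,
// preserving the "as if buffered in the heap" behavior with no allocation.
class PackedPair {
    private final AtomicLong word = new AtomicLong();   // holds both fields

    static long pack(int a, int b) { return ((long) a << 32) | (b & 0xFFFF_FFFFL); }
    static int  unpackA(long w)    { return (int) (w >>> 32); }
    static int  unpackB(long w)    { return (int) w; }

    void set(int a, int b) { word.set(pack(a, b)); }    // one atomic 64-bit store

    int[] get() {                                       // one atomic 64-bit load
        long w = word.get();
        return new int[] { unpackA(w), unpackB(w) };    // both fields from the same write
    }
}
```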
There’s more where that came from. An example would be an optimistic
technique which stores both X and some of the `.*F*` in 128 bits, plus
the rest of the `.*F*` in more bits; some sort of pointer encoding would
say whether X must be followed (to a different heap block) or else (as
the optimist hopes) all the `.*F*` can be read, perhaps helped by a
fast seq-lock or some other exclusion of races. I’m just getting
started here, but it gets very very tricky and performance can stall out
quickly.
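For concreteness, here is a minimal textbook-style seq-lock sketch
(mine, not the post’s): the writer bumps a sequence number to odd before
touching an oversized payload and back to even afterward, and readers
retry whenever they saw an odd value or the payload changed under them.
```
import java.lang.invoke.VarHandle;

// Minimal seqlock sketch (illustration of the "fast seq-lock" alluded to above).
// The two int fields stand in for a payload too large for one atomic unit.
class SeqLockedPair {
    private volatile int seq = 0;   // even = stable, odd = write in progress
    private int a, b;               // the oversized "flat" payload

    synchronized void write(int newA, int newB) {   // writers exclude each other
        seq = seq + 1;                  // odd: readers will retry
        VarHandle.storeStoreFence();    // payload stores stay after the odd store
        a = newA;
        b = newB;
        VarHandle.storeStoreFence();    // payload stores complete before the even store
        seq = seq + 1;                  // even again: payload is stable
    }

    int[] read() {
        while (true) {
            int s = seq;                             // volatile (acquire) read
            if ((s & 1) != 0) { Thread.onSpinWait(); continue; }   // write in progress
            int ra = a, rb = b;                      // speculative payload reads
            VarHandle.loadLoadFence();               // payload reads before the re-check
            if (seq == s) return new int[] { ra, rb };   // unchanged: coherent snapshot
        }
    }
}
```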
As has been mentioned already, nullable reference types can potentially
be flattened as well as regular value types, and this would require
encoding an extra “null channel” in addition to the `.*F*` fields of
the value being pointed to. This could be a byte, or a bit, or less
than a bit if there is “slack” in any of the other `.*F*` fields.
So a tri-state `OptionalBoolean` could still be just one byte, but an
`OptionalLong` is cursed with the need to find that 65th bit. There’s
a lot to say about potential implementations under that benign “as
if” rule. With respect to the JMM, the implementor has to ensure that if
the default value `C.default` is never stored into a `C.ref` flat
container, no possible race can observe that value. This is a special
hazard, depending on the order in which the null channel is read and
written: a racing read of an all-zero flat `C.ref` could observe
`C.default` if the null channel were set to the not-null state before any
`.*F*` components were stored. The safest way to stay within bounds is
to use what Brian calls “half-flat” formats that fit in 128 bits (or
whatever is the natural unit for a platform) and load and store those
formats with atomic instructions. This was already true for the `C.val`
flats discussed above, but it becomes that much harder when you have the
null channel along as a hitchhiker.
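As a toy model of a half-flat nullable container (all names mine, not
JVM code), here is a nullable 32-bit payload whose null channel shares
one 64-bit word with it; both channels are always loaded and stored
together in a single atomic operation, so the hazard above cannot arise.
```
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a "half-flat" nullable container (illustration only): a 32-bit
// payload plus a null channel share one 64-bit word, loaded and stored with a
// single atomic operation, so no race can see "not-null but still default".
class NullableInt {
    private static final long NULL_WORD = 0L;          // null channel zero means "null"
    private final AtomicLong word = new AtomicLong(NULL_WORD);

    void set(Integer v) {
        // Pack the null channel (bit 32) and the payload (low 32 bits) and store them together.
        word.set(v == null ? NULL_WORD : (1L << 32) | (v & 0xFFFF_FFFFL));
    }

    Integer get() {
        long w = word.get();                            // one atomic load of both channels
        return (w >>> 32) == 0 ? null : (int) w;        // consult the null channel first
    }
}
```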
Those were the preliminaries; now we come to non-atomic value types of
the form `C.val`. When such a variable is flattened fully (this is
Brian’s “full flat” option), new kinds of races can create torn
values. A full account of these races requires new rules within the
JMM, describing what loads and stores (and other events) look like when
they involve the new value types.
I think we can say that a variable (field or array element) of a
non-atomic value type (but no other type) decomposes into independent
*sub-variables*, one for each field. This has to happen recursively,
for fields that are themselves Q-types. Each sub-variable is an
independent variable of the JMM.
Maybe that is enough to build up various useful and interesting events
and relations (read, write, happens-before, etc.), or maybe not. This
is regardless of whether the JVM actually flattens; again it is all
“as if”, but this time with more races.
So, suppose a read (`getfield` or `aaload`) grabs the `.*F*` field
values from some variable V, into which were previously stored both X
(including, by reference, all of `X.*F*`) and, racing with X, also Y
(including, by reference, `Y.*F*`). Let’s call these per-field slots of V
the variables `V.*F*`.
I think we should then break the single read and both writes into one
read and two writes *per field* (of C).
I think the JMM can ignore the references X and Y and just track the
individual read and write events. Having null out of the picture helps
us forget the pointers as well.
From the POV of the write, a thread decided on a whole bundle of field
values and stored them, one by one, into the separate `.*F*` variables.
The more exotic events of the JMM can, I think, simply be distributed
from V to the sub-variables `V.*F*` in a regular way, just as we
distributed the plain reads and writes.
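To make the decomposition concrete, here it is spelled out by hand in
plain Java (an illustration of mine, with hypothetical field names): the
flat container is just its fields laid side by side, and each field is
an independent JMM variable with its own read and write events.
```
// Hand-spelled picture of the decomposition (illustration only): a container
// of a non-atomic C.val behaves like its fields laid side by side, each one
// an independent JMM variable.
class FlatHolder {
    int f1;     // sub-variable V.*f1
    long f2;    // sub-variable V.*f2

    // The composite write "V = X" decomposes into one write per sub-variable.
    void writeComposite(int x1, long x2) {
        f1 = x1;    // write event on V.*f1
        f2 = x2;    // write event on V.*f2, independently ordered
    }

    // The composite read decomposes the same way; under racing writes the two
    // values need not come from the same composite write (a "torn" value).
    long[] readComposite() {
        return new long[] { f1, f2 };   // read event on V.*f1, then on V.*f2
    }
}
```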
This seems workable. Let’s test it by considering a hybrid scenario
where the class C includes a slowly-varying field and a rapidly-varying
one. Maybe an array cursor:
```
value record C(Object[] array, int index) {
    public non-atomic companion type C.val;  // speculative syntax: declares the non-atomic companion
    public boolean hasNext() { return index < array.length; }
    public Object get() { return array[index]; }
    public C next() { check(); return new C(array, index+1); }
    private void check() { if (!hasNext()) throw new Error(); }
}
```
(Such a type is reasonably safe to use even with a public value type:
The failure modes are comparable to those you get if you race an
iterator for an `ArrayList`. One might even expect its null-capable
reference type to flatten nicely in 64 or 128 bits. But that’s beside
my point here.)
Suppose I have a full-flat container of `C.val` and I have two racing
writes; the first sets up the variable from scratch, and the second
changes the index but not the array (say, by calling `next`).
```
static final Object[] ARR = {22, 33, 55, 77};
static C.val V = new C(ARR, 0);
void T1() { V = V.next(); }
void T2() { System.out.println(V); }
// T1 and T2 execute concurrently
```
The effect I would like is for a racing read to receive either index
value, but always the same array value, as long as all racing writes
have contributed the same array value. I think this is true for the
example code above. What do you think?
The reason I want this effect is that I want to enable an optimization like
this:
```
void T1optimized() {
    if (false) V = V.next();                   //original code version
    if (false) V = new C(V.array, V.index+1);  //inlining
    if (false) V = V __WithSubVariables {      //inlining
        array = V.array;
        index = V.index+1;
    }
    boolean MAYBE = false;
    if (MAYBE) {                  //unbundling sub-variables
        V.*array = V.*array;      //useless store, kill it if you can
        V.*index = V.*index+1;
    } else {
        V.*index += 1;            //32-bit memory update
    }
}
```
To get to the pleasing end result, I think the JIT has to work through
the intermediate phases, and have permission to stop at any of them at
any time. So I want the JIT to have the option (at its own whim) to
either make a 32-bit memory update of just the V.*index sub-variable, or
else a larger update to *both* sub-variables (the `if (MAYBE)` block in
the example).
This will have no effect in the simple scenario described above. But in
more complex ones, where the `V.*array` sub-variable is changing as
well, the JMM will allow arbitrarily strange mismatches between fields,
such as a really obsolete array and the latest index (into a different
array). This could happen if a racing composite write stalled just
before writing `V.*array`, waited until just about now to write its (by
now really old) array value, and then stalled permanently before writing
the associated `V.*index` value. Meanwhile, more normal threads are writing more or
less coherent array/index pairs, but suddenly a racing read can pick up
a recent index and the very old array component.
I guess this is all true whether or not the JIT makes that final
optimization step of nullifying a useless write. So (to finish with
this laborious example) I guess the JIT has all it needs to optimize the
processing of non-atomic full-flat values, without straying from the JMM
(which is very permissive in the presence of races).
One thing to observe in passing is that when a method like `C::next`
runs, it has a value `this` which is on the stack, not in heap. This
means that there can be no races on the fields of `this` during the
execution of any method. So as the body of any C method executes, the
fields `this.array` and `this.index` cannot change. This is as one would
expect from final fields. But it’s true even if the original copy of
`this` is being concurrently trashed by racing writes. This means the
JIT cannot treat race-prone heap containers as spilled copies of `this`,
to be reloaded at leisure. It has to pick up each needed field just once
per inlined method call. It might kill dead stores, of course. If the
class of `this` is atomic, it must use an atomic load to pick up
`this.*F*` (or the parts needed) all at one time; otherwise it can pick
up the needed parts of `this.*F*` as needed, but at most once per part.
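Here is a hedged sketch (mine, with hypothetical names) of that shape
for an inlined `V = V.next()`: each needed field of the stack value
`this` is copied into a local exactly once, and every later use goes
through the locals.
```
// Sketch of the "load each needed field at most once" rule (illustration only):
// when V = V.next() is inlined against a race-prone flat container, the fields
// of the stack value `this` are read once into locals and never re-read.
class InlinedCursorCall {
    static Object[] sharedArray;   // sub-variable V.*array (racing writes possible)
    static int sharedIndex;        // sub-variable V.*index (racing writes possible)

    static void callNextInlined() {
        // `this` for the inlined next() call: one read per needed field, no more.
        Object[] a = sharedArray;
        int i = sharedIndex;
        // Inlined check()/hasNext(): must use the locals; re-reading sharedArray
        // or sharedIndex here could observe values from a different racing write.
        if (!(i < a.length)) throw new Error();
        // Inlined constructor stores for the enclosing V = ... assignment.
        sharedArray = a;           // useless store; the JIT may kill it
        sharedIndex = i + 1;
    }
}
```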
Does the partial write technique sketched here work for atomic flat
values? I wish it would but I think it doesn’t, in general. Suppose
that C above (the cursor class) were atomic, as is surely more typical.
If I update the `V.*index` sub-variable by itself, I have to make sure
that the 32-bit update is atomic with respect to the neighboring
`V.*array` variable. If the hardware allows me to mix 32-bit and 64-bit
atomics on the same word, well and good; I can do the narrow update.
But it probably won’t work very often, and perhaps the hardware would
have trouble sorting out the conflicting update sizes.
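One conservative fallback when the sizes cannot be mixed, which is my
own sketch and not something proposed above, is a full-width
compare-and-swap loop that rewrites the whole bundle just to advance the
index half:
```
import java.util.concurrent.atomic.AtomicLong;

// Fallback sketch (illustration only): if narrow and full-width atomics cannot
// safely mix on the same word, update the index component with a full-width CAS
// that leaves the other component exactly as it was read.
class AtomicIndexUpdate {
    private final AtomicLong word = new AtomicLong();   // {other component, index} bundle

    void incrementIndexOnly() {
        long prev, next;
        do {
            prev = word.get();                           // read the whole bundle
            int index = (int) prev;                      // low 32 bits: V.*index
            next = (prev & 0xFFFF_FFFF_0000_0000L)       // keep the other half intact
                 | ((index + 1) & 0xFFFF_FFFFL);
        } while (!word.compareAndSet(prev, next));       // full-width atomic update
    }
}
```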
A special case of this is setting a half-flat `C.ref` variable to null.
This should allow a narrow store (say just one byte to the null
channel), leaving the other bytes as garbage to be dealt with later.
(The GC can come along later and zero them out, kind of like with weak
reference processing, but more certain and eager.) Doing this requires
care in ordering the null checks. If the JIT sees the null channel set
to the “null=yes” state (probably a zero bit or byte) then the JIT
needs to cover its eyes and ignore any other bits it picked up in the
same atomic read, because they might be non-zero garbage marooned by a
partial write to the null channel. Since writing a zero byte to memory
is naturally atomic, the hardware might tolerate null-channel writes
mixed in with full 64-bit and 128-bit reads.
An optimistic narrow-word null check might work on the read side. If I
read the null channel using a single-byte read, and observe the
“null=yes” state, I don’t need to read anything else. But, if I
observe “null=no” using the narrow read, and do the full-width
atomic read of 64 or 128 bits, I need to check the null channel again,
in the full read, since a null might have come into memory between my
two memory operations. This is uncomfortably like the “test twice”
anti-pattern, but I think it actually works. Whether it is profitable
is anybody’s guess. I put this in as another example of VM
shenanigans behind the “as if” rules.
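Here is a toy simulation of that test-twice read (names mine, and the
“narrow” load is only simulated by masking a full load): the null
channel occupies the low byte of a single 64-bit word, the first check
consults only that byte, and the full-width read re-checks the channel
before trusting the payload.
```
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the optimistic "test twice" null check (illustration only).
// The flat container is one 64-bit word: low byte = null channel, high 32 bits
// = payload. A real JVM could issue a genuine one-byte load for the first check.
class TestTwiceNull {
    private final AtomicLong slot = new AtomicLong();   // all-zero means null

    void set(Integer v) {
        slot.set(v == null ? 0L : 1L | ((v & 0xFFFF_FFFFL) << 32));  // full-width store
    }

    Integer get() {
        long narrow = slot.get() & 0xFF;     // "narrow" read: only the null channel matters
        if (narrow == 0) return null;        // null=yes: no need to read anything else
        long full = slot.get();              // full-width atomic read of channel + payload
        if ((full & 0xFF) == 0) return null; // re-check: a null may have landed in between
        return (int) (full >>> 32);
    }
}
```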
Partial reads from flat atomic variables are probably a good idea in
general. (That is, as long as they don’t interfere with hardware’s
graceful execution of the atomic write instructions that populate the
variables in the first place.) If the cursor C is atomic, and I write
`V.index()` (again V is of type C), the JIT doesn’t need to load the
`V.*array` sub-variable, just the `V.*index` sub-variable. No atomicity
failures can be observed even if V has racing writes, since you can
believe any `V.*array` value you like came with your sample of
`V.*index`. But methods which work on two or more fields must pick up
all of the sub-variable values (for those fields) in an atomic
operation. So `V.hasNext()`, which looks at both `array.length` and
`index`, needs to pick up the bundle `V.*{index,array}` in a coherent
manner, using an atomic 64-bit or 128-bit load.
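A final toy model, again with names of my own and with the array
reference stood in for by its length (which is all `hasNext` consults):
when the two sub-variables are packed into one atomically loadable word,
a method that uses only the index can take it from any single load,
while `hasNext` must derive both components from the same load.
```
import java.util.concurrent.atomic.AtomicLong;

// Sketch of partial vs. bundled reads of an atomic flat cursor (illustration
// only): the array length sits in the high 32 bits and the index in the low 32.
class AtomicCursorWord {
    private final AtomicLong word = new AtomicLong();   // {length, index} bundle

    void set(int length, int index) {
        word.set(((long) length << 32) | (index & 0xFFFF_FFFFL));   // atomic composite store
    }

    int index() {
        return (int) word.get();             // partial read: any matching length would do
    }

    boolean hasNext() {
        long w = word.get();                 // one atomic load of the whole bundle
        int length = (int) (w >>> 32);
        int index  = (int) w;
        return index < length;               // both components from the same write
    }
}
```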