[External] : RE: Question on the inline type flattening decision

John Rose john.r.rose at oracle.com
Tue Jul 11 18:30:59 UTC 2023


On 10 Jul 2023, at 8:09, Frederic Parain wrote:

> Hi Xiaohong,
>
>
> Field flattening has two major side effects: atomicity and size.

Yes!  Well put.

Here’s some more fine print:

Atomicity of a value class will be something that its class
declaration can opt out of.  For a class that is non-atomic,
(I think) both final and non-final instance fields of its
flattenable type (the null-excluding type, aka the “val
type” or Q-type) can use the same policy.

For a value class which is atomic (and that is the default),
it will not be possible (until the day we have efficient HTM)
to flatten fields of that type, if they are mutable.

(There’s even more fine print for nullable reference types
from value classes, if the VM ever tries to inline nullable
types, but that’s way in the future and will not be user visible.)

>
> Final fields are not subject to atomicity issues because they are immutable after their initialization.

So the current policy treats final instance fields as flattenable,
which means it treats them as immutable.  This is 99.99% correct.

There is a technical debt here concerning what sorts of indeterminate
behavior are allowed in the 0.01% case, where (a) the constructor
for the object containing the flattenable field allows “this” to
escape and another thread picks it up, and (b) the other thread
makes a racing read of the flattenable field at just the wrong
moment.  Here’s the debt:  Either we do not flatten the field
(at least when we know, statically, that this bad thing can happen)
or else we somehow delay the racing read until a safe moment (by
means of a mutex protocol of some sort), or (yet again) we somehow
detune the JVMS to allow atomic value classes to race, if their
containers are so rude as to allow concurrent reads through escaped
“this” pointers.  I think the most practical option is the first,
which means, sadly, the 99.99% correct policy for final fields
needs reconsideration.  But maybe I’ve missed some fortunate aspect
in the current policy, that allows it to avoid the 0.01% error.
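To make the 0.01% case concrete, here is a plain-Java sketch (an ordinary class stands in for a value class, and a reference field stands in for a flattened one; names are illustrative). The constructor lets “this” escape before writing `range`, so another thread that picks up `leaked` can read `range` at just the wrong moment.  If `range` were flattened and stored non-atomically, that racing read could observe a torn value:

```java
// Illustration only: Range plays the role of a multi-word value type,
// and Holder.range plays the role of a flattened final field.
class Range {
    final long lo, hi;
    Range(long lo, long hi) { this.lo = lo; this.hi = hi; }
}

class Holder {
    static volatile Holder leaked;      // racy publication point
    final Range range;                  // imagine this field flattened

    Holder(long lo, long hi) {
        leaked = this;                  // (a) "this" escapes mid-construction
        this.range = new Range(lo, hi); // written only after the escape
        // (b) a thread reading leaked.range between the two statements
        // makes a racing read; with non-atomic flattening it could see
        // half of the payload written and half still zero.
    }
}
```

With heap buffering (the status quo for this case), the racing reader sees either null or a fully initialized `Range` reference, never a torn payload.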

(It’s corner cases like this that make JVM design exceedingly
difficult.  Most language specs. and runtimes don’t bother
to track all the details to this level, but Java does.)

>
> Both final and non-final fields have an impact on the object size, and potentially on cache behavior.

This is true for instance fields, because there are an indefinite
number of instances of them.  It is not really true for static fields,
and that’s a distinction that can have an effect on flattening policy.
The existing policy never flattens static fields; all the code quoted
in this thread is for non-static fields.

<digression topic=“flattening static fields”>

There would be zero benefit, and some harm, to flattening
static final fields.  The harm is to startup time, when that is
dominated by interpreter performance. The JIT doesn’t care either
way; it’s a compile-time constant.

Non-final statics are also very different from non-final instance
fields, so it is reasonable to use a different policy for them
as well.  Since statics are inherently shared across threads, maybe
the atomicity issue is more strongly felt; maybe.  Or maybe that’s
why we have “volatile”, to mark fields where we really care about
that.  The current policy makes all static reads and writes fully
race-free, at the cost of heap-buffering each stored value.
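The heap-buffering behavior of the current policy can be sketched in plain Java (an illustration, not the VM mechanism): each store allocates a fresh immutable box, and the static itself is just a reference, so every read and write is a single atomic reference access.

```java
// Boxed stands in for an immutable heap buffer holding a value's payload.
final class Boxed {
    final long lo, hi;
    Boxed(long lo, long hi) { this.lo = lo; this.hi = hi; }
}

// BufferedStatic models a mutable static of value type under the current
// policy: reads and writes of the reference are atomic, and final-field
// semantics guarantee each Boxed is seen fully initialized, so readers
// always observe a consistent (if possibly stale) value.
final class BufferedStatic {
    private static Boxed value = new Boxed(0, 0);
    static Boxed get() { return value; }   // one atomic reference read
    static void set(long lo, long hi) {
        value = new Boxed(lo, hi);         // fresh heap buffer per store
    }
}
```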

In any case, it is good that the flattening policy code in the
Valhalla VM has separate branches for static and non-static fields.

But, I have sometimes wondered if it would be a good idea to have
the VM buffer flattened static non-finals secretly in length-one
arrays, and tell the getstatic and putstatic opcodes to go look
there for their payloads.  It would be a little wasteful, but
not much.  The array references would be rooted immutably in
the class mirror object, just as if the fields were plain
references.  Unlike other heap buffers, a length-one array
creates a mutable variable for a value.  But it would make
such non-final statics be much more racy.  Maybe it’s better
to tilt to the side of non-raciness, which is what the current
policy does.
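A toy model of the length-one-array idea, in plain Java (a `long[]` stands in for a flat value payload; this is a sketch of the concept, not VM code):

```java
// The class mirror would hold an immutable reference to a one-element
// array, and the array slot plays the role of the mutable flattened
// static.  Reads and writes of the slot are plain, hence racy.
final class StaticBuffer {
    // rooted immutably, as if from the Class mirror:
    private static final long[] PAYLOAD = new long[1];

    static long getstatic()        { return PAYLOAD[0]; }  // getstatic analogue
    static void putstatic(long v)  { PAYLOAD[0] = v; }     // putstatic analogue
}
```

For a multi-word payload the array element updates would not be atomic, which is exactly the extra raciness described above.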

None of these musings should be taken as a call to consider
flattening statics directly in the normal container for
static fields, which is the Class mirror object for the
class declaring the statics.  This is a wild and tricky
tactic, which would probably become unmanageable (for
several reasons) if we tried to wedge flattened fields
into the poor Class mirror.  Class mirrors are weird
enough already.

</digression>

> Bigger objects are less likely to fit in data caches, and bigger distances between fields would require
> more cache lines and more cache misses to read them. This issue is not significant when accessing fields
> of a single object, but it can become dominant when accessing fields of objects stored in a flat array.

It’s an interesting tradeoff:  Indirections almost certainly depart
to new cache lines, while if you pile up enough size in flattened
variables, then you start departing the cache line just to get to
the other side of a single object.  (HW prefetchers often favor
contiguous block accesses, which keeps flattening favorable for longer.)

Also, and semi-independently, memory traffic correlates with cache
line traffic.  So if your workload is very flat and cache-line-local,
but it loads a bunch of useless bits in every flat object, those useless
bits will have a similar effect to (prefetchable) cache line departures.
This can happen even if all the objects fit in one cache line, if the
alternative was to have two objects fit in each cache line, in a different
organization of the data.  The enabling condition there is that an object
might have “hot” and “cold” fields, in which case flattening the “cold”
fields will incur a tax (in data cache traffic) on access to the “hot”
fields, because loading a cold field you don’t need into a full data
cache will displace some other object’s hot field that you do need.

The way I like to think about this latter effect is to envision, on
the one hand, everything flattened as much as possible, with no pointers
or headers loaded into the cache, but maybe with some “garbage” bits
mixed into the flat data.  (“Garbage” bits are, for example, bits which
are zero 99% of the time.)  And on the other hand, everything indirected
through pointers, which means every non-garbage data reference has to
thread through a pointer and jump past a header (loading those items
into the cache, AND making a non-prefetchable load), but also enjoying
the freedom from loading flattened garbage data.  You can flatten too
much, if what’s flattened into your containers has low entropy, and
eventually it can cause enough data cache traffic that you wish for
your pointers back, so you can refrain from loading cold garbage
at the other end of some of those pointers.
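A back-of-envelope model of the hot/cold tax, assuming 64-byte cache lines and densely packed objects (the sizes are illustrative, not measurements):

```java
final class CacheModel {
    static final int LINE = 64;   // assumed cache line size in bytes

    // cache lines touched when scanning n objects of objBytes each,
    // packed back to back in memory
    static long linesTouched(int objBytes, int n) {
        return ((long) objBytes * n + LINE - 1) / LINE;
    }
}
```

For an object with a 16-byte hot part and a 48-byte cold part: fully flattened it is 64 bytes, so a hot-field scan over 1000 objects touches 1000 lines; with the cold part indirected behind an 8-byte pointer, the hot part plus pointer is 24 bytes, so the same scan touches only 375 lines (paying the pointer chase only when the cold part is actually needed).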

>
> So, in theory, the flattening test should be:
>
>   if (!((!fieldinfo.access_flags().is_final() && (too_atomic_to_flatten || too_volatile_to_flatten))
>         || too_big_to_flatten)) {
>
> Atomicity constraints are considered only for non-final fields, and size constraints are considered
> for all fields.
>
>
> That being said, we have always been more aggressive in the flattening of final fields because it was
> beneficial to C2.

Yes.  There’s a debt to pay here, though.  It might be that we end
up rethinking this policy, regarding non-static final fields, for
the two cases of declared-atomic (the default) and declared-non-atomic
(the racy power-user option).
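For readability, the quoted condition can be transcribed into a standalone predicate (plain Java, names following the HotSpot snippet above; an illustration of the quoted logic, not the VM source):

```java
final class FlattenPolicy {
    // Atomicity (and volatility) constraints apply only to non-final
    // fields; the size constraint applies to all fields.
    static boolean shouldFlatten(boolean isFinal,
                                 boolean tooAtomicToFlatten,
                                 boolean tooVolatileToFlatten,
                                 boolean tooBigToFlatten) {
        return !((!isFinal && (tooAtomicToFlatten || tooVolatileToFlatten))
                 || tooBigToFlatten);
    }
}
```

So a final field that is “too atomic” still flattens under this test, which is precisely the aggressive treatment of final fields that the racy-constructor corner case calls into question.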

— John



More information about the valhalla-dev mailing list