Infinispan server issue - putting it all together

John Rose john.r.rose at oracle.com
Sat Oct 5 16:51:42 UTC 2024


Here is the promised followup on the art of Balrog taming.

On 5 Oct 2024, at 8:22, John Rose wrote:

> The NPE often occurs some distance away from the root cause.
> (That is, the init-loop and the static reference that pulls
> out a null from a static that “obviously cannot be null”.)
> The null often wanders some distance away before somebody
> trips over it.  We have some technical debt here, in that we
> have no good way to pinpoint the cause of such an NPE, if
> the null has “wandered away” before it trips the NPE.
>
> In a followup mail I wish to address the tricky problem of
> diagnosing bootstrap loops, which (as we know) can manifest
> in the field as mysterious NPE errors.  Leyden has a fresh
> requirement for tools to detect and tame the Balrogs before
> they can rampage.

So, I think we need a couple of new tools.  Luckily they
are cheap and have (to some extent) already been prototyped.
Even better, we will need some of the new tools for Valhalla
when it ships, so we get extra points for anticipating
Valhalla needs.  (And no, this is not just a sneaky attempt
to get Valhalla checkins done early.)

First tool:  Detection of non-strict usage of static fields,
especially static finals.  I think we need a mechanism in the
JVM that tracks static field initialization, on a per-field
basis, and detects many anomalous conditions, chiefly
read-before-first-write, maybe also write-after-first-write.
It does not need to work on all kinds of static field
references (var-handles, unsafe, reflection) in order to
be immediately useful, since most static fields (not all)
in the JDK are written and read the old-fashioned way,
using putstatic and getstatic bytecodes.
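
For concreteness, here is a minimal sketch of the kind of
read-before-first-write this tool would catch.  The classes
Config and Helper are hypothetical, invented for illustration;
they are not JDK classes.  Both fields are static finals that
are written and read with plain putstatic and getstatic.

```java
// Hypothetical demo classes; none of these names come from the JDK.
class InitLoopDemo {
    static class Config {
        // Step 1: Config.<clinit> starts here and triggers Helper.<clinit>.
        static final Helper HELPER = new Helper();
        // Step 3: NAME is written only after Helper has finished initializing.
        // String.valueOf(...) keeps NAME from being a compile-time constant.
        static final String NAME = String.valueOf("config");
    }

    static class Helper {
        // Step 2: Config is already being initialized by this thread, so the
        // getstatic does not wait and reads the default value, null.
        static final String SEEN = Config.NAME;
    }

    /** Enters the loop through Config and reports what Helper observed. */
    static String whatHelperSaw() {
        String name = Config.NAME;   // first touch: initializes Config, then Helper
        return Helper.SEEN;
    }

    public static void main(String[] args) {
        System.out.println("Helper.SEEN = " + whatHelperSaw());   // null
        System.out.println("Config.NAME = " + Config.NAME);       // config
    }
}
```

Nothing throws here.  The null sits in Helper.SEEN until some
later code dereferences it, which is the "wandering" problem.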

I have a prototype of this.  It works like this:

A. When putstatic is linked, there is an extra handshake.
B. The handshake updates metadata to say that a put occurred.
C. A similar handshake (common code) for getstatic happens too.
D. That handshake checks the metadata to ask if a putstatic happened.
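
Steps A-D can be modeled in a few lines of Java.  This is only
a sketch of the bookkeeping, not the prototype itself (which is
C++ inside HotSpot); the class name, method names, and the use
of a string as the field key are all assumptions made for the
example.

```java
import java.util.HashSet;
import java.util.Set;

// A Java model of the bookkeeping only; all names are hypothetical.
class StaticFieldTracker {
    /** Fields for which a put handshake has been seen at least once. */
    private final Set<String> written = new HashSet<>();

    /**
     * Steps A and B: the put handshake records that a put occurred.
     * Returns true if this is a write-after-first-write.
     */
    synchronized boolean onPut(String field) {
        return !written.add(field);
    }

    /**
     * Steps C and D: the get handshake consults the same metadata.
     * Returns true if this is a read-before-first-write.
     */
    synchronized boolean onGet(String field) {
        return !written.contains(field);
    }

    public static void main(String[] args) {
        StaticFieldTracker t = new StaticFieldTracker();
        System.out.println("early read flagged:  " + t.onGet("Config.NAME"));
        System.out.println("first write flagged: " + t.onPut("Config.NAME"));
        System.out.println("later read flagged:  " + t.onGet("Config.NAME"));
    }
}
```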

Here is where I put the handshake, in the interpreter
upcall which links a getstatic or putstatic bytecode:
https://github.com/rose00/jdk/blob/5962ba2f154b9da1eda3df3e6130ce929ed9de65/src/hotspot/share/interpreter/linkResolver.cpp#L1054

This is part of a bundle of changes for tracking class
initialization actions, specifically for “training mode”
in the premain work.

https://github.com/openjdk/jdk/compare/master...rose00:jdk:trace-init

It would have to be factored away from training mode,
and (instead) gated to some other predicate, such as
“are we fully bootstrapped yet into a production run?”
If not, the overhead of the extra handshakes is acceptable
and profitable.

I put the bookkeeping in a private branch that also tracks
other initialization events.  Teasing out the initialization
checks requires a simpler bookkeeping.  Luckily, the
InstanceKlass::field_status accessor can do this.
We would need a new flag for the new state; call it
something like “detected_tracked_field_set”, meaning
“we are tracking this field for whatever reason,
and we noticed it has been set at least once”.
That’s a building block for Valhalla strictness
checks and also for our init-checks.

(There is a possible optimization here, where the
VM looks first at the actual field value, and if it
is not default, it doesn’t bother to look for the
field status flag in the klass.  This is appealingly
simple, but it still requires the field status flag,
just for the case where a putstatic explicitly stores
the default value.  And that requires more wiring of
the putstatic bytecode, to detect whether the value
being stored is null/zero/false.  I will call that
a nice idea for later, not for now.  If you just
gate all the checks on the metadata status bit,
then you can perform them in the linker logic.)
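
The value-first shortcut, for a reference-typed static, would
amount to something like the following sketch.  The names are
hypothetical, and a real version would live in the VM, not in
Java code.

```java
// Hypothetical model of the "look at the value first" shortcut.
class ValueFirstCheck {
    /** True if a read of the field should be treated as read-after-write. */
    static boolean looksInitialized(Object currentValue, boolean everSetFlag) {
        if (currentValue != null) {
            return true;       // a non-default value proves some write happened
        }
        // A default value is ambiguous: either nothing was written yet, or
        // the default was stored explicitly.  Only the status flag can tell.
        return everSetFlag;
    }

    public static void main(String[] args) {
        System.out.println(looksInitialized("x", false));
        System.out.println(looksInitialized(null, false));
        System.out.println(looksInitialized(null, true));
    }
}
```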

Because the handshake does not cover reflection, var handles,
etc., it is not a full implementation of the tracking
necessary for Valhalla strict statics (in the strict sense).
But it is more than enough to pin down places where statics
are getting read before written, because of init-loops.

It will also get false positives, which will need some
ad hoc filter logic to get the signal we need from this tool.

I suggest tying it permanently to an -Xlog tag (static+init?)
for general use.  But more certainly there should be a mode
(enabled separately) which performs the bookkeeping just
during the Leyden assembly phase, and during VM startup,
which throws a VMError if there is any read-before-write
condition detected.

To cope with false positives, some statics will need to be
excluded from the error check, perhaps all non-finals
(but non-finals may deserve these checks as well).
If only a few statics need exclusion, an annotation
such as @DefaultInitOK (like @Stable, JDK only) can be
created.  Later on maybe it is a general external tool
as well, but our problem now is specifically with booting
up the First Hundred Classes of the JDK, not anything
more general.
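
As a sketch of what the exclusion filter might look like: the
annotation below is the hypothetical @DefaultInitOK from the
paragraph above, declared locally so the example compiles.  The
filter class, its finalsOnly switch, and the Sample fields are
assumptions for illustration.  A real check would read the
annotation from VM metadata rather than through reflection.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical marker: this static may legitimately be read as default.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface DefaultInitOK {}

class InitCheckFilter {
    /** Decides whether a field takes part in the read-before-write check. */
    static boolean shouldCheck(Field f, boolean finalsOnly) {
        int mods = f.getModifiers();
        if (!Modifier.isStatic(mods)) return false;                   // statics only
        if (f.isAnnotationPresent(DefaultInitOK.class)) return false; // opted out
        if (finalsOnly && !Modifier.isFinal(mods)) return false;      // policy knob
        return true;
    }

    /** Convenience overload that looks the field up by name. */
    static boolean shouldCheck(Class<?> holder, String name, boolean finalsOnly) {
        try {
            return shouldCheck(holder.getDeclaredField(name), finalsOnly);
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Hypothetical fields to exercise the filter.
    static class Sample {
        static final Object CHECKED = new Object();
        @DefaultInitOK static Object lazilyFilled;
        static int counter;
    }

    public static void main(String[] args) {
        for (String n : new String[] {"CHECKED", "lazilyFilled", "counter"}) {
            System.out.println(n + " checked (finals only): "
                    + shouldCheck(Sample.class, n, true));
        }
    }
}
```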

We can put off doing the above init-check feature as long
as we are willing to debug NPEs the slow way.

And there are other tools, which cut across the field
init check tool, which might be good enough to start with,
although they will demand more cooperation from JDK
maintainers.  I’m speaking of annotations on classes
(or if not annotations, side-lists in a file somewhere),
which detail our expectations about initialization order.

To my mind, the most interesting, and also restrictive,
class annotation would be @LeafInitialize.  This annotation
demands that the annotated class C, when initialized, must
not trigger any recursive initializations.  Furthermore,
if the class C “touches” another class D, that class D must
be fully initialized, not partially initialized.
That is, whatever other D is “touched” may not be
(a) uninitialized, (b) being initialized in another
thread, or (c ) being initialized in the current
thread — in which case C is being “touched” by
D, directly or indirectly.

(The penalty is a log message, or, in some VM
configurations, a VM exit or a VMError throw.)
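
The rule can be written down as a small decision table.  The
following is a Java model only, with hypothetical names; the
real check would consult the VM's own per-class initialization
state and would run at the point where C "touches" D.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the @LeafInitialize touch rule.
class LeafInitModel {
    enum State {
        UNINITIALIZED, IN_PROGRESS_OTHER_THREAD,
        IN_PROGRESS_THIS_THREAD, FULLY_INITIALIZED
    }

    private final Map<String, State> states = new HashMap<>();

    void set(String className, State state) {
        states.put(className, state);
    }

    /** Returns null if leaf class C may touch D, else a diagnostic message. */
    String checkTouch(String c, String d) {
        if (c.equals(d)) return null;            // C touching itself is fine
        State s = states.getOrDefault(d, State.UNINITIALIZED);
        switch (s) {
            case FULLY_INITIALIZED:
                return null;
            case UNINITIALIZED:                  // case (a)
                return c + " would recursively initialize " + d;
            case IN_PROGRESS_OTHER_THREAD:       // case (b)
                return c + " would wait for another thread initializing " + d;
            default:                             // case (c)
                return "init loop: " + d + " is mid-initialization on this"
                        + " thread and has reached " + c;
        }
    }

    public static void main(String[] args) {
        LeafInitModel m = new LeafInitModel();
        m.set("java.lang.Object", State.FULLY_INITIALIZED);
        m.set("D", State.IN_PROGRESS_THIS_THREAD);
        System.out.println(m.checkTouch("C", "java.lang.Object"));
        System.out.println(m.checkTouch("C", "D"));
        System.out.println(m.checkTouch("C", "NeverSeen"));
    }
}
```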

It’s a simple idea which has deep consequences.

For starters, it means that if you have a bootstrap
loop, and you have marked any one class in that loop
as @LeafInitialize, you will get an immediate diagnosis.

That means that the normal source of NPEs due to
bad static init order is excluded, if at least one
of the culprit classes has been marked @LeafInitialize.

In addition, it means that any objects created by the
initialization logic of C are either C objects,
or else objects of some class D which has already
been fully initialized.  This means that the
transitive closure of those C objects is “safe”
with respect to class initialization if you load
it out of the AOT cache, as long as an equivalent
initialization order has been observed when booting
up the AOT cache.

If we go ahead and mark classes related to AOT holders
as @LeafInitialize, we can demonstrate more clearly
that the “stuff” those classes are surfacing from
the AOT cache is properly initialized.  We will need
to tinker with it to get the exact code shapes
we need.

There is a big price to pay when using @LeafInitialize:
fixing all the violations it detects.
For example, if you init MethodType, you will (surprisingly)
init MethodHandle as well, which will then recursively
init MethodType.  That shows a place where a cycle must
be cut to preserve the annotation validity.  The cycle
in this case is MethodType => per-MT invoker cache =>
MethodHandle => MethodHandle.type instance. (And maybe
there is a more graceful way to deal with this cycle.
The @LeafInitialize annotation is a harsh cleanser.)
But sprinkling around @LeafInitialize at least will
help us surface the problems quickly.

Suppose MethodType is marked @LeafInitialize, and
we fix the MethodHandle loop, but now we find that
MethodType wants to init a completely innocuous
class, like HashMap or StringBuilder, and (because
of Murphy’s Law) this is the very first time that
innocuous class is required.  What is the price
to pay for keeping @LeafInitialize?

There are two possibilities:  First, tinker with
a global list somewhere (the VM has one) to ask
that the innocuous class be handled first.
Second (and better IMO) add a component to
@LeafInitialize called LeafInitialize::requires,
an array of class names.  You can see the rest
coming:  When the VM processes a class annotated
with a requires list, it initializes those other
classes before entering the <clinit> method of
the annotated class.  It is as if somehow the
required classes were initialized concurrently
by an unspecified process, and also somewhat
like the case where <clinit> methods of supers
(and in Valhalla, flat fields) are executed
before the subclass (or Valhalla field holder).
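
Here is a rough user-level emulation of the requires idea.
The annotation is declared locally as a stand-in for the
proposed one, and the helper method plays the role of the VM;
both are assumptions for illustration, as is the LOG list used
to make the order visible.  The only real API used to force
initialization is Class.forName(name, true, loader).

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

// Local stand-in for the proposed annotation; not a JDK type.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface LeafInitialize {
    String[] requires() default {};
}

class RequiresDemo {
    /** Records the order of events, for the demo only. */
    static final List<String> LOG = new ArrayList<>();

    @LeafInitialize(requires = {"java.util.HashMap", "java.lang.StringBuilder"})
    static class Leaf {
        static { LOG.add("Leaf.<clinit>"); }
    }

    /** Plays the VM's role: init the required classes, then the class itself. */
    static void initialize(Class<?> c) {
        try {
            ClassLoader loader = c.getClassLoader();
            LeafInitialize ann = c.getAnnotation(LeafInitialize.class);
            if (ann != null) {
                for (String name : ann.requires()) {
                    Class.forName(name, true, loader);   // true = initialize
                    LOG.add("required " + name);
                }
            }
            Class.forName(c.getName(), true, loader);    // now run <clinit> of c
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        initialize(Leaf.class);   // the class literal alone does not initialize Leaf
        System.out.println(LOG);
    }
}
```

In the VM the required classes would be initialized just
before <clinit> is entered, so no separate helper call would
be needed.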

I conjecture that we can hammer down many of
our First Hundred Classes into the @LeafInitialize
bucket.  If that is true, we will have an
orderly and maintainable initialization order
for them.

A third possible tool is another annotation,
@StartupInitialize.  If a class is marked
@StartupInitialize, then we throw an error
if somebody tries to initialize it too early.
When is that?  Well, to begin with, if the
assembly phase trips over a @StartupInitialize
class, it’s an error.  We mark a class with
@StartupInitialize if we know that it does
something (say with system properties) that
must be deferred until the production run.
Just initializing this class, in the assembly
phase, is declared to be a fatal error.
If a class has some parts we need for the AP,
we factor those out (say, into a nested
class in the startup class).  Or, we factor
the dirty parts (which read environmental
state) from the @StartupInitialize class,
into a smaller dirty class (so the dirt
is denser there) and move the annotation
accordingly, so that the original class
is now clean.  It is fine if the original
class makes method calls to the dirty class,
as long as those method calls do not happen
in the assembly phase.  (As you can see,
that would trigger init of the dirty class,
as soon as that call happened.  But if the
AP calls only clean methods, all is well.)
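
The second refactoring (move the dirt into a smaller class)
might look like this sketch.  Service, Env, and the INIT_LOG
list are hypothetical names; the proposed @StartupInitialize
annotation is shown only as a comment, since it does not exist.

```java
import java.util.ArrayList;
import java.util.List;

class StartupSplitDemo {
    /** Records which static initializers have run, for the demo only. */
    static final List<String> INIT_LOG = new ArrayList<>();

    /** Clean class: fine to initialize in the assembly phase. */
    static class Service {
        static { INIT_LOG.add("Service"); }

        static int cleanTwice(int x) {      // reads no environmental state
            return 2 * x;
        }

        static String dirtyHome() {         // first call initializes Env
            return Env.HOME;
        }
    }

    /** Small dirty class; the @StartupInitialize marker would move here. */
    static class Env {
        static { INIT_LOG.add("Env"); }
        static final String HOME = System.getProperty("user.home", "unknown");
    }

    static List<String> demo() {
        Service.cleanTwice(21);             // assembly-phase style use
        INIT_LOG.add("clean call returned");
        Service.dirtyHome();                // production-run style use
        INIT_LOG.add("dirty call returned");
        return INIT_LOG;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

On a first run, "Env" shows up in the log only after the clean
call has returned, which is the property the assembly phase
depends on.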

A fourth tool might be a complementary
annotation @AOTInitialize which means,
“please, pick me, pick me!”  The VM could
do useful checks on such classes, such as
ensuring that, while an @AOTInitialize
class is being initialized, there are no
“touches” (direct or indirect initializations)
of any classes marked @StartupInitialize.
Note that @LeafInitialize does this also,
more indiscriminately.  But @AOTInitialize
might have a role in shaping Leyden AOT
cache policy, so maybe it pulls its weight.

All of the above tools are of a common form:

A. Turn on the tool (add an annotation or whatever).
B. Build your JDK and boot it up.
C. Watch for errors.
D. Refactor/reannotate to fix the errors.
E. Repeat until we get clean boot.

Also:

F. Add appropriate sequence checks into pre-integration testing.
G. Fix regressions as needed (goto A).
H. Watch closely for regressions in the field.
I. Turn on the checking mode in the field and get the report.
J. Fix those regressions also (goto A).

The point is to put the human into the loop, and
not just at the point where a wild NPE crops up.
We need to demonstrate that we are correctly booting
our classes, especially when Leyden perturbs the order.

Much of the configuration logic used to build the
AOT cache is automatic.  But Java classes (like all
Turing machines) are too tricky and shifty to completely
characterize, by an automatic process.  The above
tools are, perhaps, necessary to add some more human
oversight into the task of shaping AOT bootstrapping.
Not that humans can master Turing machines either,
but the buck stops with us, so we have to try.

Volunteers, please?  The sooner we start tinkering
with rules and checkers the sooner we can get better
insight into init-order, even when perturbed.  And
I hope I demonstrated in my previous note that this
is not just a Leyden problem; it is a low-level
background cost of Java’s wonderful dynamism.

I’d prioritize as:  LeafInit, StartupInit, statics, AOTInit.

— John


More information about the leyden-dev mailing list