optimizing acmp in L-World

Wed Feb 7 13:40:21 UTC 2018

Hi,

Very nice writeup, thanks John!

I've started experimenting with C2 support/optimizations for the new acmp. For reference, here's the interpreter patch
I'm using to verify correctness of my tests and the C2 implementation:

@@ -2313,8 +2314,19 @@
 void TemplateTable::if_acmp(Condition cc) {
   transition(atos, vtos);
   // assume branch is more often taken than not (loops use backward branches)
-  Label not_taken;
+  Label not_taken, is_null;
   __ pop_ptr(rdx);
+
+  __ testptr(rdx, rdx);
+  __ jcc(Assembler::zero, is_null);
+  assert(Universe::oop_metadata_odd_mask() == 1, "should be");
+  __ movl(rbx, Address(rdx, oopDesc::klass_offset_in_bytes()));
+  __ andptr(rbx, Universe::oop_metadata_odd_mask());
+  __ shlptr(rbx, 63);
+  __ sarptr(rbx, 63);
+  __ orptr(rdx, rbx);
+
+  __ bind(is_null);
   __ cmpoop(rdx, rax);
   __ jcc(j_not(cc), not

I'm shifting the is_value bit to the position of the sign bit and then do an arithmetic right shift to get all 1s in
case the operand is a value type (or all 0s otherwise). The result is used to perturb the original operand. There might
be a more efficient solution but I just leave this here in case someone wants to experiment as well.

Best regards,
Tobias


On 29.01.2018 21:42, John Rose wrote:
> Tobias, Roland, and I had a chat in Santa Clara last week
> about optimizing the acmp operation in the JVM.  Here are
> my notes on it.
> 
> Background:  The acmp operation, also known as reference
> comparison, is overloaded in L-World to handle values as well,
> with a specific semantics of returning false if either operand is
> a value (similar to NaN behavior for fcmp).  The rationale for
> this is that acmp is reference comparison, even in L-World,
> and since values have no references, they can never be
> equal as references.
> 
> (Alternatively, it is *as if* both operands were converted
> *to references* before the acmp operation, which means
> that either operand, if it is a value, *would be* boxed to a
> reference value, which then, being a new reference, is
> automatically unequal to any other reference.  This tricky
> account may be appealing in some circumstances.)
> 
> OK, so how do we make it fast?  Ideally, we would like
> to strength-reduce the enhanced acmp instruction down
> to the original acmp instruction, which after all is just a
> single instruction (such as cmpq) on all platforms.
> 
> The first thing to notice, before going into details, is
> that in L-world all Q-values are buffered in the interpreter.
> This means that, at least for interpreter-generated values,
> all values can be treated as physical pointers, even
> if they are logically Q-values.
> 
> (Crib sheet:  In L-world, all kinds of values except primitives
> are formally carried by L-values, under L-descriptors.  True
> references can be referred to as R-values of R-type, reference
> type, or object class.  New value types are referred to as Q-values
> of Q-type, value type, or value class.  When you don't know what
> you have, you can say it's a U-type, which just like a Q-type or
> R-type is carried by an L-value under an L-descriptor.)
> 
> This means that the old acmp semantics are almost correct,
> in many cases.  If you have two L-values and do a pointer
> comparison on their physical pointers, and if the pointers
> differ, you are done.  If the physical pointers compare equal,
> then there is one more thing you have to do:  Make sure that
> the L-value carried by both physical pointers is in fact a
> reference, an R-value.
> 
> So the logic looks like this, and the new semantics are in
> the second "if" statement:
> 
>   bool new_acmp(ptr x, ptr y) {
>     if (!old_acmp(x, y))  return false;  // x != y => false
>     if (is_value_ptr(x))  return false;  // x.is_value => false
>     return true;
>   }
>   bool old_acmp(ptr x, ptr y) {
>     return x == y;  // cmpq etc.
>   }
> 
> Note that since acmp is symmetric, the JVM has the option
> to silently test either x or y for "is_value_ptr" if the two physical
> pointers compare equal.
> 
> This leads immediately to the first set of optimizations, which
> probably are enough to get us where we want to go.  (There
> is a second set we can add later.)  All the JVM has to do is
> prove or detect that one or the other physical pointer either
> *is certainly* a Q-value, or *is certainly not* a Q-value.
> 
> If either physical pointer is certainly a Q-value, then the
> new_acmp instruction is a constant false.  If either physical
> pointer is certainly not a Q-value, then the new_acmp
> instruction strength reduces to the old_acmp instruction.
> 
> The strength reduction to false can be done in any place
> where the type of either operand is known to be a Q-type.
> The interpreter doesn't track types, and the acmp instruction
> isn't resolved so it can't be quickened to a type context,
> but the JIT has lots of type information.  Thus, if either
> operand of acmp is statically known (or accurately speculated)
> to be a value type, then the acmp instruction constant folds
> to false.
> 
> Likewise, if either operand is known to be an R-type
> (true reference) the acmp may be strength reduced to
> a simple reference comparison.  This also is likely to
> happen in the JIT.  It also happens in the interpreter
> in the acmp_null instruction, where (obviously) the
> second operand is a reference (i.e., null).
> 
> (Note that the type Object is a U-type in L-world, so
> Object-to-Object comparisons such as are found in
> generic collection code do not benefit from this
> optimization.  More later on that.)  
> 
> The JIT can also track, in its profile, whether various
> instructions "see" only Q-values or only R-values.
> (The classic HotSpot already tracks "sightings" of
> null for similar purposes.)  In that case, speculative
> narrowings (only Q-types or only R-types) can be
> used to short circuit new_acmp down to old_acmp.
> 
> This is useful because the speculation can sometimes
> be verified out of a loop body, where the actual acmp
> instruction is executed in a loop.  Remember that the
> JIT can choose which operand of acmp to test for
> value-ness; if it chooses a loop invariant, and the
> invariant is always Q or always R, then the acmp
> inside the loop folds up, no matter what the dynamic
> value of the other operand.
> 
> For that matter, even if speculation fails, loop
> unswitching can get to a similar result, at the cost
> of two copies of the loop.
> 
> So, in many cases, contextual information can allow
> the JVM to fold up new_acmp to old_acmp or constant
> false.
> 
> In cases such folding fails, there is as simple trick
> which can make things easier.  Recall that folding
> fails when *both* operands are statically U-types,
> *neither* operand folding to a Q-type or R-type.
> 
> In that case, the JVM must pick one operand or
> the other (it doesn't matter which) and perform
> the is_value_ptr operation on it, in such a way
> that the result of that operand affects the final
> result in the correct way.  A naive implementation
> of new_acmp adds an extra test-and-branch
> to fold in the is_value_ptr test, but a clever might
> be able to avoid it.
> 
> Suppose there is a branch free technique for
> implementing is_value_ptr, such as extracting
> a bit from the physical pointer to the U-value,
> or (after a null check) extracting a bit from the
> physical pointer of the class metadata pointer
> in the value's header, or (after two indirections)
> extracting a bit from the structure of the class
> metadata.  (Mr. Simm's prototype puts such
> a bit in the metadata pointer, in the object
> header.)
> 
> If such a bit can be obtained in a branch-free
> manner, then it can be used to perturb the result
> of the old_acmp in a useful manner.  For example:
> 
>   bool new_acmp(ptr x, ptr y) {
>     if (x == null)  return y == null;
>     int bit = (x->class_in_header & Q_TYPE_BIT);
>     ptr x1 = x | swar_broadcast_bit(bit != 0);
>     return old_acmp(x1, y);
>   }
>   int swar_broadcast_bit(bool z) {
>     return z ? -1 : 0;  // use ALU ops
>   }
> 
> The idea is to extract a bit which signals the presence
> of a Q-type (in either operand) and use it to "mess up"
> the equality comparison, using and, or, or xor as needed.
> 
> This perturbation technique has two costs:  First,
> it requires a null check (unless the Q-bit is in the
> actual physical reference).  Second, it requires
> an ALU operation or two to spread the perturbation.
> But these costs will probably be negligible, except
> (perhaps) in very tight loops in generic code.
> 
> Generic code loops performing frequent acmp operations
> on untyped (Object) operands will need to perform
> extra null checks and/or value/reference detections
> if they are not already present (by luck) in the loop
> context.  In this case, loop unswitching may be useful.
> 
> There is one more contextual operation which may be
> helpful in such cases, if all else fails (and loop unswitching
> is not desirable).  Honestly, I haven't seen any real-world
> code yet that needs it, but it is comforting to have as a
> fall-back.  The idea is this:  If the acmp is contextually
> followed by a call to Object.equals, in the usual Legacy
> Idiom For Equality (L.I.F.E.), then there is one more trick
> we can play.  We'd like to force the new_acmp down to
> an old_acmp in these cases.  Can we?  Here's the choice:
> 
>   bool dutiful_life(ptr x, ptr y) {
>     if (new_acmp(x, y))  return true;
>     return x->Object_equals(y);
>   }
>   bool clever_life(ptr x, ptr y) {
>     if (old_acmp(x, y))  return true;
>     return x->Object_equals(y);
>   }
> 
> Note that the only semantic difference between old_acmp
> and new_acmp arises when *both* operands are the *same*
> value.  (Work out the cases:  In every other case, the simple
> pointer comparison detects the difference and produces the
> correct "false" answer.)  Thus, in the L.I.F.E., the only case
> where old_acmp gives the wrong answer produces an
> immediate answer of "true", instead of following up with
> a call to x.equals(x) on the Q-value.
> 
> To put it in a nutshell, the only difference between dutiful_life
> and clever_life is that the former always calls Object::equals
> if either operand is a value, and the the latter calls
> Object::equals in a subset of those cases (when the
> physical pointers differ).  When the physical pointers
> do *not* differ, there is no call to Object::equals.  But
> clever_life always delivers the same answer as dutiful_life.
> 
> So is there a problem?  Not much, although it requires us to treat
> Object::equals as an intrinsic with some predictable semantics.
> The crucial inside that allows us to adopt clever_life is that,
> if you call x.equals(x) on a non-null x, the result must always
> be true.  Now, this is not enforced by the JVM, but it is clearly
> documented in the API specification for Object::equals, and
> lots of stuff would be already broken if that were ever false.
> 
> Thus, if we are willing to rule out implementations of Object::equals
> for which a value can ever compare *not equal* to itself, then
> we can substitute clever_life for dutiful_life.  We also have to
> allow one more thing:  That we can silently drop calls to equals
> (at least in the context of L.I.F.E.) when the JVM can prove that
> the contract of equals allows the JVM to predict the outcome.
> This means that if you put side effects (like print statements)
> into an equals method on a value type, you might not see the
> side effects, after the JVM has optimized things.  This corner
> case may require further thought in order to straighten it out.
> 
> Note that L.I.F.E. is frequently found in generic code, where
> instead of operating on Object values it operates on type
> parameter types.  (The JVM just sees Object, after erasure.)
> In that case, if the JVM were to specialize the generic code
> separately for some types (such as value types), along the
> lines of the template class proposals, then the L.I.F.E.
> code would fold up in the JIT using the rules given at the
> top of this note.  The generic expression "a == b", where
> a or b was a value of a specialized type parameter T,
> would fold up to false in every specialization where T was
> a Q-type.  Likewise, as noted above, if L.I.F.E. is applied
> to a true R-type (or specialized to such a type), then any
> new_acmp on that type can be short-circuited to the faster
> old_acmp.
> 
> All in all, it appears that the cost of the new acmp semantics
> can, for all practical purposes, be pushed down into the noise.
> 
> — John
>