optimizing acmp in L-World

Wed Feb 7 13:44:24 UTC 2018

Sorry, I've sent wrong version of the patch. Here's the correct one (without the unnecessary 'andptr'):

@@ -2313,8 +2313,18 @@
 void TemplateTable::if_acmp(Condition cc) {
   transition(atos, vtos);
   // assume branch is more often taken than not (loops use backward branches)
-  Label not_taken;
+  Label not_taken, is_null;
   __ pop_ptr(rdx);
+
+  assert(Universe::oop_metadata_odd_mask() == 1, "should be");
+  __ testptr(rdx, rdx);
+  __ jcc(Assembler::zero, is_null);
+  __ movl(rbx, Address(rdx, oopDesc::klass_offset_in_bytes()));
+  __ shlptr(rbx, 63);
+  __ sarptr(rbx, 63);
+  __ orptr(rdx, rbx);
+
+  __ bind(is_null);
   __ cmpoop(rdx, rax);
   __ jcc(j_not(cc), not_taken);
   branch(false, false);

Best regards,
Tobias

On 07.02.2018 14:40, Tobias Hartmann wrote:
> Hi,
> 
> Very nice writeup, thanks John!
> 
> I've started experimenting with C2 support/optimizations for the new acmp. For reference, here's the interpreter patch
> I'm using to verify correctness of my tests and the C2 implementation:
> 
> @@ -2313,8 +2314,19 @@
>  void TemplateTable::if_acmp(Condition cc) {
>    transition(atos, vtos);
>    // assume branch is more often taken than not (loops use backward branches)
> -  Label not_taken;
> +  Label not_taken, is_null;
>    __ pop_ptr(rdx);
> +
> +  __ testptr(rdx, rdx);
> +  __ jcc(Assembler::zero, is_null);
> +  assert(Universe::oop_metadata_odd_mask() == 1, "should be");
> +  __ movl(rbx, Address(rdx, oopDesc::klass_offset_in_bytes()));
> +  __ andptr(rbx, Universe::oop_metadata_odd_mask());
> +  __ shlptr(rbx, 63);
> +  __ sarptr(rbx, 63);
> +  __ orptr(rdx, rbx);
> +
> +  __ bind(is_null);
>    __ cmpoop(rdx, rax);
>    __ jcc(j_not(cc), not
> 
> I'm shifting the is_value bit to the position of the sign bit and then do an arithmetic right shift to get all 1s in
> case the operand is a value type (or all 0s otherwise). The result is used to perturb the original operand. There might
> be a more efficient solution but I just leave this here in case someone wants to experiment as well.
> 
> Best regards,
> Tobias
> 
> 
> On 29.01.2018 21:42, John Rose wrote:
>> Tobias, Roland, and I had a chat in Santa Clara last week
>> about optimizing the acmp operation in the JVM.  Here are
>> my notes on it.
>>
>> Background:  The acmp operation, also known as reference
>> comparison, is overloaded in L-World to handle values as well,
>> with a specific semantics of returning false if either operand is
>> a value (similar to NaN behavior for fcmp).  The rationale for
>> this is that acmp is reference comparison, even in L-World,
>> and since values have no references, they can never be
>> equal as references.
>>
>> (Alternatively, it is *as if* both operands were converted
>> *to references* before the acmp operation, which means
>> that either operand, if it is a value, *would be* boxed to a
>> reference value, which then, being a new reference, is
>> automatically unequal to any other reference.  This tricky
>> account may be appealing in some circumstances.)
>>
>> OK, so how do we make it fast?  Ideally, we would like
>> to strength-reduce the enhanced acmp instruction down
>> to the original acmp instruction, which after all is just a
>> single instruction (such as cmpq) on all platforms.
>>
>> The first thing to notice, before going into details, is
>> that in L-world all Q-values are buffered in the interpreter.
>> This means that, at least for interpreter-generated values,
>> all values can be treated as physical pointers, even
>> if they are logically Q-values.
>>
>> (Crib sheet:  In L-world, all kinds of values except primitives
>> are formally carried by L-values, under L-descriptors.  True
>> references can be referred to as R-values of R-type, reference
>> type, or object class.  New value types are referred to as Q-values
>> of Q-type, value type, or value class.  When you don't know what
>> you have, you can say it's a U-type, which just like a Q-type or
>> R-type is carried by an L-value under an L-descriptor.)
>>
>> This means that the old acmp semantics are almost correct,
>> in many cases.  If you have two L-values and do a pointer
>> comparison on their physical pointers, and if the pointers
>> differ, you are done.  If the physical pointers compare equal,
>> then there is one more thing you have to do:  Make sure that
>> the L-value carried by both physical pointers is in fact a
>> reference, an R-value.
>>
>> So the logic looks like this, and the new semantics are in
>> the second "if" statement:
>>
>>   bool new_acmp(ptr x, ptr y) {
>>     if (!old_acmp(x, y))  return false;  // x != y => false
>>     if (is_value_ptr(x))  return false;  // x.is_value => false
>>     return true;
>>   }
>>   bool old_acmp(ptr x, ptr y) {
>>     return x == y;  // cmpq etc.
>>   }
>>
>> Note that since acmp is symmetric, the JVM has the option
>> to silently test either x or y for "is_value_ptr" if the two physical
>> pointers compare equal.
>>
>> This leads immediately to the first set of optimizations, which
>> probably are enough to get us where we want to go.  (There
>> is a second set we can add later.)  All the JVM has to do is
>> prove or detect that one or the other physical pointer either
>> *is certainly* a Q-value, or *is certainly not* a Q-value.
>>
>> If either physical pointer is certainly a Q-value, then the
>> new_acmp instruction is a constant false.  If either physical
>> pointer is certainly not a Q-value, then the new_acmp
>> instruction strength reduces to the old_acmp instruction.
>>
>> The strength reduction to false can be done in any place
>> where the type of either operand is known to be a Q-type.
>> The interpreter doesn't track types, and the acmp instruction
>> isn't resolved so it can't be quickened to a type context,
>> but the JIT has lots of type information.  Thus, if either
>> operand of acmp is statically known (or accurately speculated)
>> to be a value type, then the acmp instruction constant folds
>> to false.
>>
>> Likewise, if either operand is known to be an R-type
>> (true reference) the acmp may be strength reduced to
>> a simple reference comparison.  This also is likely to
>> happen in the JIT.  It also happens in the interpreter
>> in the acmp_null instruction, where (obviously) the
>> second operand is a reference (i.e., null).
>>
>> (Note that the type Object is a U-type in L-world, so
>> Object-to-Object comparisons such as are found in
>> generic collection code do not benefit from this
>> optimization.  More later on that.)  
>>
>> The JIT can also track, in its profile, whether various
>> instructions "see" only Q-values or only R-values.
>> (The classic HotSpot already tracks "sightings" of
>> null for similar purposes.)  In that case, speculative
>> narrowings (only Q-types or only R-types) can be
>> used to short circuit new_acmp down to old_acmp.
>>
>> This is useful because the speculation can sometimes
>> be verified out of a loop body, where the actual acmp
>> instruction is executed in a loop.  Remember that the
>> JIT can choose which operand of acmp to test for
>> value-ness; if it chooses a loop invariant, and the
>> invariant is always Q or always R, then the acmp
>> inside the loop folds up, no matter what the dynamic
>> value of the other operand.
>>
>> For that matter, even if speculation fails, loop
>> unswitching can get to a similar result, at the cost
>> of two copies of the loop.
>>
>> So, in many cases, contextual information can allow
>> the JVM to fold up new_acmp to old_acmp or constant
>> false.
>>
>> In cases such folding fails, there is as simple trick
>> which can make things easier.  Recall that folding
>> fails when *both* operands are statically U-types,
>> *neither* operand folding to a Q-type or R-type.
>>
>> In that case, the JVM must pick one operand or
>> the other (it doesn't matter which) and perform
>> the is_value_ptr operation on it, in such a way
>> that the result of that operand affects the final
>> result in the correct way.  A naive implementation
>> of new_acmp adds an extra test-and-branch
>> to fold in the is_value_ptr test, but a clever might
>> be able to avoid it.
>>
>> Suppose there is a branch free technique for
>> implementing is_value_ptr, such as extracting
>> a bit from the physical pointer to the U-value,
>> or (after a null check) extracting a bit from the
>> physical pointer of the class metadata pointer
>> in the value's header, or (after two indirections)
>> extracting a bit from the structure of the class
>> metadata.  (Mr. Simm's prototype puts such
>> a bit in the metadata pointer, in the object
>> header.)
>>
>> If such a bit can be obtained in a branch-free
>> manner, then it can be used to perturb the result
>> of the old_acmp in a useful manner.  For example:
>>
>>   bool new_acmp(ptr x, ptr y) {
>>     if (x == null)  return y == null;
>>     int bit = (x->class_in_header & Q_TYPE_BIT);
>>     ptr x1 = x | swar_broadcast_bit(bit != 0);
>>     return old_acmp(x1, y);
>>   }
>>   int swar_broadcast_bit(bool z) {
>>     return z ? -1 : 0;  // use ALU ops
>>   }
>>
>> The idea is to extract a bit which signals the presence
>> of a Q-type (in either operand) and use it to "mess up"
>> the equality comparison, using and, or, or xor as needed.
>>
>> This perturbation technique has two costs:  First,
>> it requires a null check (unless the Q-bit is in the
>> actual physical reference).  Second, it requires
>> an ALU operation or two to spread the perturbation.
>> But these costs will probably be negligible, except
>> (perhaps) in very tight loops in generic code.
>>
>> Generic code loops performing frequent acmp operations
>> on untyped (Object) operands will need to perform
>> extra null checks and/or value/reference detections
>> if they are not already present (by luck) in the loop
>> context.  In this case, loop unswitching may be useful.
>>
>> There is one more contextual operation which may be
>> helpful in such cases, if all else fails (and loop unswitching
>> is not desirable).  Honestly, I haven't seen any real-world
>> code yet that needs it, but it is comforting to have as a
>> fall-back.  The idea is this:  If the acmp is contextually
>> followed by a call to Object.equals, in the usual Legacy
>> Idiom For Equality (L.I.F.E.), then there is one more trick
>> we can play.  We'd like to force the new_acmp down to
>> an old_acmp in these cases.  Can we?  Here's the choice:
>>
>>   bool dutiful_life(ptr x, ptr y) {
>>     if (new_acmp(x, y))  return true;
>>     return x->Object_equals(y);
>>   }
>>   bool clever_life(ptr x, ptr y) {
>>     if (old_acmp(x, y))  return true;
>>     return x->Object_equals(y);
>>   }
>>
>> Note that the only semantic difference between old_acmp
>> and new_acmp arises when *both* operands are the *same*
>> value.  (Work out the cases:  In every other case, the simple
>> pointer comparison detects the difference and produces the
>> correct "false" answer.)  Thus, in the L.I.F.E., the only case
>> where old_acmp gives the wrong answer produces an
>> immediate answer of "true", instead of following up with
>> a call to x.equals(x) on the Q-value.
>>
>> To put it in a nutshell, the only difference between dutiful_life
>> and clever_life is that the former always calls Object::equals
>> if either operand is a value, and the the latter calls
>> Object::equals in a subset of those cases (when the
>> physical pointers differ).  When the physical pointers
>> do *not* differ, there is no call to Object::equals.  But
>> clever_life always delivers the same answer as dutiful_life.
>>
>> So is there a problem?  Not much, although it requires us to treat
>> Object::equals as an intrinsic with some predictable semantics.
>> The crucial inside that allows us to adopt clever_life is that,
>> if you call x.equals(x) on a non-null x, the result must always
>> be true.  Now, this is not enforced by the JVM, but it is clearly
>> documented in the API specification for Object::equals, and
>> lots of stuff would be already broken if that were ever false.
>>
>> Thus, if we are willing to rule out implementations of Object::equals
>> for which a value can ever compare *not equal* to itself, then
>> we can substitute clever_life for dutiful_life.  We also have to
>> allow one more thing:  That we can silently drop calls to equals
>> (at least in the context of L.I.F.E.) when the JVM can prove that
>> the contract of equals allows the JVM to predict the outcome.
>> This means that if you put side effects (like print statements)
>> into an equals method on a value type, you might not see the
>> side effects, after the JVM has optimized things.  This corner
>> case may require further thought in order to straighten it out.
>>
>> Note that L.I.F.E. is frequently found in generic code, where
>> instead of operating on Object values it operates on type
>> parameter types.  (The JVM just sees Object, after erasure.)
>> In that case, if the JVM were to specialize the generic code
>> separately for some types (such as value types), along the
>> lines of the template class proposals, then the L.I.F.E.
>> code would fold up in the JIT using the rules given at the
>> top of this note.  The generic expression "a == b", where
>> a or b was a value of a specialized type parameter T,
>> would fold up to false in every specialization where T was
>> a Q-type.  Likewise, as noted above, if L.I.F.E. is applied
>> to a true R-type (or specialized to such a type), then any
>> new_acmp on that type can be short-circuited to the faster
>> old_acmp.
>>
>> All in all, it appears that the cost of the new acmp semantics
>> can, for all practical purposes, be pushed down into the noise.
>>
>> — John
>>