optimizing acmp in L-World

Tobias Hartmann tobias.hartmann at oracle.com
Wed Feb 7 16:27:00 UTC 2018


Knowing that ObjectAlignmentInBytes is guaranteed to be >= 2, we can do even better:

  __ testptr(rdx, rdx);
  __ jcc(Assembler::zero, is_null);
  __ movl(rbx, Address(rdx, oopDesc::klass_offset_in_bytes()));
  __ andptr(rbx, Universe::oop_metadata_odd_mask());
  __ orptr(rdx, rbx);

Best regards,
Tobias

On 07.02.2018 14:44, Tobias Hartmann wrote:
> Sorry, I've sent wrong version of the patch. Here's the correct one (without the unnecessary 'andptr'):
> 
> @@ -2313,8 +2313,18 @@
>  void TemplateTable::if_acmp(Condition cc) {
>    transition(atos, vtos);
>    // assume branch is more often taken than not (loops use backward branches)
> -  Label not_taken;
> +  Label not_taken, is_null;
>    __ pop_ptr(rdx);
> +
> +  assert(Universe::oop_metadata_odd_mask() == 1, "should be");
> +  __ testptr(rdx, rdx);
> +  __ jcc(Assembler::zero, is_null);
> +  __ movl(rbx, Address(rdx, oopDesc::klass_offset_in_bytes()));
> +  __ shlptr(rbx, 63);
> +  __ sarptr(rbx, 63);
> +  __ orptr(rdx, rbx);
> +
> +  __ bind(is_null);
>    __ cmpoop(rdx, rax);
>    __ jcc(j_not(cc), not_taken);
>    branch(false, false);
> 
> Best regards,
> Tobias
> 
> On 07.02.2018 14:40, Tobias Hartmann wrote:
>> Hi,
>>
>> Very nice writeup, thanks John!
>>
>> I've started experimenting with C2 support/optimizations for the new acmp. For reference, here's the interpreter patch
>> I'm using to verify correctness of my tests and the C2 implementation:
>>
>> @@ -2313,8 +2314,19 @@
>>  void TemplateTable::if_acmp(Condition cc) {
>>    transition(atos, vtos);
>>    // assume branch is more often taken than not (loops use backward branches)
>> -  Label not_taken;
>> +  Label not_taken, is_null;
>>    __ pop_ptr(rdx);
>> +
>> +  __ testptr(rdx, rdx);
>> +  __ jcc(Assembler::zero, is_null);
>> +  assert(Universe::oop_metadata_odd_mask() == 1, "should be");
>> +  __ movl(rbx, Address(rdx, oopDesc::klass_offset_in_bytes()));
>> +  __ andptr(rbx, Universe::oop_metadata_odd_mask());
>> +  __ shlptr(rbx, 63);
>> +  __ sarptr(rbx, 63);
>> +  __ orptr(rdx, rbx);
>> +
>> +  __ bind(is_null);
>>    __ cmpoop(rdx, rax);
>>    __ jcc(j_not(cc), not
>>
>> I'm shifting the is_value bit to the position of the sign bit and then do an arithmetic right shift to get all 1s in
>> case the operand is a value type (or all 0s otherwise). The result is used to perturb the original operand. There might
>> be a more efficient solution but I just leave this here in case someone wants to experiment as well.
>>
>> Best regards,
>> Tobias
>>
>>
>> On 29.01.2018 21:42, John Rose wrote:
>>> Tobias, Roland, and I had a chat in Santa Clara last week
>>> about optimizing the acmp operation in the JVM.  Here are
>>> my notes on it.
>>>
>>> Background:  The acmp operation, also known as reference
>>> comparison, is overloaded in L-World to handle values as well,
>>> with a specific semantics of returning false if either operand is
>>> a value (similar to NaN behavior for fcmp).  The rationale for
>>> this is that acmp is reference comparison, even in L-World,
>>> and since values have no references, they can never be
>>> equal as references.
>>>
>>> (Alternatively, it is *as if* both operands were converted
>>> *to references* before the acmp operation, which means
>>> that either operand, if it is a value, *would be* boxed to a
>>> reference value, which then, being a new reference, is
>>> automatically unequal to any other reference.  This tricky
>>> account may be appealing in some circumstances.)
>>>
>>> OK, so how do we make it fast?  Ideally, we would like
>>> to strength-reduce the enhanced acmp instruction down
>>> to the original acmp instruction, which after all is just a
>>> single instruction (such as cmpq) on all platforms.
>>>
>>> The first thing to notice, before going into details, is
>>> that in L-world all Q-values are buffered in the interpreter.
>>> This means that, at least for interpreter-generated values,
>>> all values can be treated as physical pointers, even
>>> if they are logically Q-values.
>>>
>>> (Crib sheet:  In L-world, all kinds of values except primitives
>>> are formally carried by L-values, under L-descriptors.  True
>>> references can be referred to as R-values of R-type, reference
>>> type, or object class.  New value types are referred to as Q-values
>>> of Q-type, value type, or value class.  When you don't know what
>>> you have, you can say it's a U-type, which just like a Q-type or
>>> R-type is carried by an L-value under an L-descriptor.)
>>>
>>> This means that the old acmp semantics are almost correct,
>>> in many cases.  If you have two L-values and do a pointer
>>> comparison on their physical pointers, and if the pointers
>>> differ, you are done.  If the physical pointers compare equal,
>>> then there is one more thing you have to do:  Make sure that
>>> the L-value carried by both physical pointers is in fact a
>>> reference, an R-value.
>>>
>>> So the logic looks like this, and the new semantics are in
>>> the second "if" statement:
>>>
>>>   bool new_acmp(ptr x, ptr y) {
>>>     if (!old_acmp(x, y))  return false;  // x != y => false
>>>     if (is_value_ptr(x))  return false;  // x.is_value => false
>>>     return true;
>>>   }
>>>   bool old_acmp(ptr x, ptr y) {
>>>     return x == y;  // cmpq etc.
>>>   }
>>>
>>> Note that since acmp is symmetric, the JVM has the option
>>> to silently test either x or y for "is_value_ptr" if the two physical
>>> pointers compare equal.
>>>
>>> This leads immediately to the first set of optimizations, which
>>> probably are enough to get us where we want to go.  (There
>>> is a second set we can add later.)  All the JVM has to do is
>>> prove or detect that one or the other physical pointer either
>>> *is certainly* a Q-value, or *is certainly not* a Q-value.
>>>
>>> If either physical pointer is certainly a Q-value, then the
>>> new_acmp instruction is a constant false.  If either physical
>>> pointer is certainly not a Q-value, then the new_acmp
>>> instruction strength reduces to the old_acmp instruction.
>>>
>>> The strength reduction to false can be done in any place
>>> where the type of either operand is known to be a Q-type.
>>> The interpreter doesn't track types, and the acmp instruction
>>> isn't resolved so it can't be quickened to a type context,
>>> but the JIT has lots of type information.  Thus, if either
>>> operand of acmp is statically known (or accurately speculated)
>>> to be a value type, then the acmp instruction constant folds
>>> to false.
>>>
>>> Likewise, if either operand is known to be an R-type
>>> (true reference) the acmp may be strength reduced to
>>> a simple reference comparison.  This also is likely to
>>> happen in the JIT.  It also happens in the interpreter
>>> in the acmp_null instruction, where (obviously) the
>>> second operand is a reference (i.e., null).
>>>
>>> (Note that the type Object is a U-type in L-world, so
>>> Object-to-Object comparisons such as are found in
>>> generic collection code do not benefit from this
>>> optimization.  More later on that.)  
>>>
>>> The JIT can also track, in its profile, whether various
>>> instructions "see" only Q-values or only R-values.
>>> (The classic HotSpot already tracks "sightings" of
>>> null for similar purposes.)  In that case, speculative
>>> narrowings (only Q-types or only R-types) can be
>>> used to short circuit new_acmp down to old_acmp.
>>>
>>> This is useful because the speculation can sometimes
>>> be verified out of a loop body, where the actual acmp
>>> instruction is executed in a loop.  Remember that the
>>> JIT can choose which operand of acmp to test for
>>> value-ness; if it chooses a loop invariant, and the
>>> invariant is always Q or always R, then the acmp
>>> inside the loop folds up, no matter what the dynamic
>>> value of the other operand.
>>>
>>> For that matter, even if speculation fails, loop
>>> unswitching can get to a similar result, at the cost
>>> of two copies of the loop.
>>>
>>> So, in many cases, contextual information can allow
>>> the JVM to fold up new_acmp to old_acmp or constant
>>> false.
>>>
>>> In cases such folding fails, there is as simple trick
>>> which can make things easier.  Recall that folding
>>> fails when *both* operands are statically U-types,
>>> *neither* operand folding to a Q-type or R-type.
>>>
>>> In that case, the JVM must pick one operand or
>>> the other (it doesn't matter which) and perform
>>> the is_value_ptr operation on it, in such a way
>>> that the result of that operand affects the final
>>> result in the correct way.  A naive implementation
>>> of new_acmp adds an extra test-and-branch
>>> to fold in the is_value_ptr test, but a clever might
>>> be able to avoid it.
>>>
>>> Suppose there is a branch free technique for
>>> implementing is_value_ptr, such as extracting
>>> a bit from the physical pointer to the U-value,
>>> or (after a null check) extracting a bit from the
>>> physical pointer of the class metadata pointer
>>> in the value's header, or (after two indirections)
>>> extracting a bit from the structure of the class
>>> metadata.  (Mr. Simm's prototype puts such
>>> a bit in the metadata pointer, in the object
>>> header.)
>>>
>>> If such a bit can be obtained in a branch-free
>>> manner, then it can be used to perturb the result
>>> of the old_acmp in a useful manner.  For example:
>>>
>>>   bool new_acmp(ptr x, ptr y) {
>>>     if (x == null)  return y == null;
>>>     int bit = (x->class_in_header & Q_TYPE_BIT);
>>>     ptr x1 = x | swar_broadcast_bit(bit != 0);
>>>     return old_acmp(x1, y);
>>>   }
>>>   int swar_broadcast_bit(bool z) {
>>>     return z ? -1 : 0;  // use ALU ops
>>>   }
>>>
>>> The idea is to extract a bit which signals the presence
>>> of a Q-type (in either operand) and use it to "mess up"
>>> the equality comparison, using and, or, or xor as needed.
>>>
>>> This perturbation technique has two costs:  First,
>>> it requires a null check (unless the Q-bit is in the
>>> actual physical reference).  Second, it requires
>>> an ALU operation or two to spread the perturbation.
>>> But these costs will probably be negligible, except
>>> (perhaps) in very tight loops in generic code.
>>>
>>> Generic code loops performing frequent acmp operations
>>> on untyped (Object) operands will need to perform
>>> extra null checks and/or value/reference detections
>>> if they are not already present (by luck) in the loop
>>> context.  In this case, loop unswitching may be useful.
>>>
>>> There is one more contextual operation which may be
>>> helpful in such cases, if all else fails (and loop unswitching
>>> is not desirable).  Honestly, I haven't seen any real-world
>>> code yet that needs it, but it is comforting to have as a
>>> fall-back.  The idea is this:  If the acmp is contextually
>>> followed by a call to Object.equals, in the usual Legacy
>>> Idiom For Equality (L.I.F.E.), then there is one more trick
>>> we can play.  We'd like to force the new_acmp down to
>>> an old_acmp in these cases.  Can we?  Here's the choice:
>>>
>>>   bool dutiful_life(ptr x, ptr y) {
>>>     if (new_acmp(x, y))  return true;
>>>     return x->Object_equals(y);
>>>   }
>>>   bool clever_life(ptr x, ptr y) {
>>>     if (old_acmp(x, y))  return true;
>>>     return x->Object_equals(y);
>>>   }
>>>
>>> Note that the only semantic difference between old_acmp
>>> and new_acmp arises when *both* operands are the *same*
>>> value.  (Work out the cases:  In every other case, the simple
>>> pointer comparison detects the difference and produces the
>>> correct "false" answer.)  Thus, in the L.I.F.E., the only case
>>> where old_acmp gives the wrong answer produces an
>>> immediate answer of "true", instead of following up with
>>> a call to x.equals(x) on the Q-value.
>>>
>>> To put it in a nutshell, the only difference between dutiful_life
>>> and clever_life is that the former always calls Object::equals
>>> if either operand is a value, and the the latter calls
>>> Object::equals in a subset of those cases (when the
>>> physical pointers differ).  When the physical pointers
>>> do *not* differ, there is no call to Object::equals.  But
>>> clever_life always delivers the same answer as dutiful_life.
>>>
>>> So is there a problem?  Not much, although it requires us to treat
>>> Object::equals as an intrinsic with some predictable semantics.
>>> The crucial inside that allows us to adopt clever_life is that,
>>> if you call x.equals(x) on a non-null x, the result must always
>>> be true.  Now, this is not enforced by the JVM, but it is clearly
>>> documented in the API specification for Object::equals, and
>>> lots of stuff would be already broken if that were ever false.
>>>
>>> Thus, if we are willing to rule out implementations of Object::equals
>>> for which a value can ever compare *not equal* to itself, then
>>> we can substitute clever_life for dutiful_life.  We also have to
>>> allow one more thing:  That we can silently drop calls to equals
>>> (at least in the context of L.I.F.E.) when the JVM can prove that
>>> the contract of equals allows the JVM to predict the outcome.
>>> This means that if you put side effects (like print statements)
>>> into an equals method on a value type, you might not see the
>>> side effects, after the JVM has optimized things.  This corner
>>> case may require further thought in order to straighten it out.
>>>
>>> Note that L.I.F.E. is frequently found in generic code, where
>>> instead of operating on Object values it operates on type
>>> parameter types.  (The JVM just sees Object, after erasure.)
>>> In that case, if the JVM were to specialize the generic code
>>> separately for some types (such as value types), along the
>>> lines of the template class proposals, then the L.I.F.E.
>>> code would fold up in the JIT using the rules given at the
>>> top of this note.  The generic expression "a == b", where
>>> a or b was a value of a specialized type parameter T,
>>> would fold up to false in every specialization where T was
>>> a Q-type.  Likewise, as noted above, if L.I.F.E. is applied
>>> to a true R-type (or specialized to such a type), then any
>>> new_acmp on that type can be short-circuited to the faster
>>> old_acmp.
>>>
>>> All in all, it appears that the cost of the new acmp semantics
>>> can, for all practical purposes, be pushed down into the noise.
>>>
>>> — John
>>>



More information about the valhalla-dev mailing list