optimizing acmp in L-World
Tobias Hartmann
tobias.hartmann at oracle.com
Wed Feb 7 13:44:24 UTC 2018
Sorry, I've sent wrong version of the patch. Here's the correct one (without the unnecessary 'andptr'):
@@ -2313,8 +2313,18 @@
void TemplateTable::if_acmp(Condition cc) {
transition(atos, vtos);
// assume branch is more often taken than not (loops use backward branches)
- Label not_taken;
+ Label not_taken, is_null;
__ pop_ptr(rdx);
+
+ assert(Universe::oop_metadata_odd_mask() == 1, "should be");
+ __ testptr(rdx, rdx);
+ __ jcc(Assembler::zero, is_null);
+ __ movl(rbx, Address(rdx, oopDesc::klass_offset_in_bytes()));
+ __ shlptr(rbx, 63);
+ __ sarptr(rbx, 63);
+ __ orptr(rdx, rbx);
+
+ __ bind(is_null);
__ cmpoop(rdx, rax);
__ jcc(j_not(cc), not_taken);
branch(false, false);
Best regards,
Tobias
On 07.02.2018 14:40, Tobias Hartmann wrote:
> Hi,
>
> Very nice writeup, thanks John!
>
> I've started experimenting with C2 support/optimizations for the new acmp. For reference, here's the interpreter patch
> I'm using to verify correctness of my tests and the C2 implementation:
>
> @@ -2313,8 +2314,19 @@
> void TemplateTable::if_acmp(Condition cc) {
> transition(atos, vtos);
> // assume branch is more often taken than not (loops use backward branches)
> - Label not_taken;
> + Label not_taken, is_null;
> __ pop_ptr(rdx);
> +
> + __ testptr(rdx, rdx);
> + __ jcc(Assembler::zero, is_null);
> + assert(Universe::oop_metadata_odd_mask() == 1, "should be");
> + __ movl(rbx, Address(rdx, oopDesc::klass_offset_in_bytes()));
> + __ andptr(rbx, Universe::oop_metadata_odd_mask());
> + __ shlptr(rbx, 63);
> + __ sarptr(rbx, 63);
> + __ orptr(rdx, rbx);
> +
> + __ bind(is_null);
> __ cmpoop(rdx, rax);
> __ jcc(j_not(cc), not
>
> I'm shifting the is_value bit to the position of the sign bit and then do an arithmetic right shift to get all 1s in
> case the operand is a value type (or all 0s otherwise). The result is used to perturb the original operand. There might
> be a more efficient solution but I just leave this here in case someone wants to experiment as well.
>
> Best regards,
> Tobias
>
>
> On 29.01.2018 21:42, John Rose wrote:
>> Tobias, Roland, and I had a chat in Santa Clara last week
>> about optimizing the acmp operation in the JVM. Here are
>> my notes on it.
>>
>> Background: The acmp operation, also known as reference
>> comparison, is overloaded in L-World to handle values as well,
>> with a specific semantics of returning false if either operand is
>> a value (similar to NaN behavior for fcmp). The rationale for
>> this is that acmp is reference comparison, even in L-World,
>> and since values have no references, they can never be
>> equal as references.
>>
>> (Alternatively, it is *as if* both operands were converted
>> *to references* before the acmp operation, which means
>> that either operand, if it is a value, *would be* boxed to a
>> reference value, which then, being a new reference, is
>> automatically unequal to any other reference. This tricky
>> account may be appealing in some circumstances.)
>>
>> OK, so how do we make it fast? Ideally, we would like
>> to strength-reduce the enhanced acmp instruction down
>> to the original acmp instruction, which after all is just a
>> single instruction (such as cmpq) on all platforms.
>>
>> The first thing to notice, before going into details, is
>> that in L-world all Q-values are buffered in the interpreter.
>> This means that, at least for interpreter-generated values,
>> all values can be treated as physical pointers, even
>> if they are logically Q-values.
>>
>> (Crib sheet: In L-world, all kinds of values except primitives
>> are formally carried by L-values, under L-descriptors. True
>> references can be referred to as R-values of R-type, reference
>> type, or object class. New value types are referred to as Q-values
>> of Q-type, value type, or value class. When you don't know what
>> you have, you can say it's a U-type, which just like a Q-type or
>> R-type is carried by an L-value under an L-descriptor.)
>>
>> This means that the old acmp semantics are almost correct,
>> in many cases. If you have two L-values and do a pointer
>> comparison on their physical pointers, and if the pointers
>> differ, you are done. If the physical pointers compare equal,
>> then there is one more thing you have to do: Make sure that
>> the L-value carried by both physical pointers is in fact a
>> reference, an R-value.
>>
>> So the logic looks like this, and the new semantics are in
>> the second "if" statement:
>>
>> bool new_acmp(ptr x, ptr y) {
>> if (!old_acmp(x, y)) return false; // x != y => false
>> if (is_value_ptr(x)) return false; // x.is_value => false
>> return true;
>> }
>> bool old_acmp(ptr x, ptr y) {
>> return x == y; // cmpq etc.
>> }
>>
>> Note that since acmp is symmetric, the JVM has the option
>> to silently test either x or y for "is_value_ptr" if the two physical
>> pointers compare equal.
>>
>> This leads immediately to the first set of optimizations, which
>> probably are enough to get us where we want to go. (There
>> is a second set we can add later.) All the JVM has to do is
>> prove or detect that one or the other physical pointer either
>> *is certainly* a Q-value, or *is certainly not* a Q-value.
>>
>> If either physical pointer is certainly a Q-value, then the
>> new_acmp instruction is a constant false. If either physical
>> pointer is certainly not a Q-value, then the new_acmp
>> instruction strength reduces to the old_acmp instruction.
>>
>> The strength reduction to false can be done in any place
>> where the type of either operand is known to be a Q-type.
>> The interpreter doesn't track types, and the acmp instruction
>> isn't resolved so it can't be quickened to a type context,
>> but the JIT has lots of type information. Thus, if either
>> operand of acmp is statically known (or accurately speculated)
>> to be a value type, then the acmp instruction constant folds
>> to false.
>>
>> Likewise, if either operand is known to be an R-type
>> (true reference) the acmp may be strength reduced to
>> a simple reference comparison. This also is likely to
>> happen in the JIT. It also happens in the interpreter
>> in the acmp_null instruction, where (obviously) the
>> second operand is a reference (i.e., null).
>>
>> (Note that the type Object is a U-type in L-world, so
>> Object-to-Object comparisons such as are found in
>> generic collection code do not benefit from this
>> optimization. More later on that.)
>>
>> The JIT can also track, in its profile, whether various
>> instructions "see" only Q-values or only R-values.
>> (The classic HotSpot already tracks "sightings" of
>> null for similar purposes.) In that case, speculative
>> narrowings (only Q-types or only R-types) can be
>> used to short circuit new_acmp down to old_acmp.
>>
>> This is useful because the speculation can sometimes
>> be verified out of a loop body, where the actual acmp
>> instruction is executed in a loop. Remember that the
>> JIT can choose which operand of acmp to test for
>> value-ness; if it chooses a loop invariant, and the
>> invariant is always Q or always R, then the acmp
>> inside the loop folds up, no matter what the dynamic
>> value of the other operand.
>>
>> For that matter, even if speculation fails, loop
>> unswitching can get to a similar result, at the cost
>> of two copies of the loop.
>>
>> So, in many cases, contextual information can allow
>> the JVM to fold up new_acmp to old_acmp or constant
>> false.
>>
>> In cases such folding fails, there is as simple trick
>> which can make things easier. Recall that folding
>> fails when *both* operands are statically U-types,
>> *neither* operand folding to a Q-type or R-type.
>>
>> In that case, the JVM must pick one operand or
>> the other (it doesn't matter which) and perform
>> the is_value_ptr operation on it, in such a way
>> that the result of that operand affects the final
>> result in the correct way. A naive implementation
>> of new_acmp adds an extra test-and-branch
>> to fold in the is_value_ptr test, but a clever might
>> be able to avoid it.
>>
>> Suppose there is a branch free technique for
>> implementing is_value_ptr, such as extracting
>> a bit from the physical pointer to the U-value,
>> or (after a null check) extracting a bit from the
>> physical pointer of the class metadata pointer
>> in the value's header, or (after two indirections)
>> extracting a bit from the structure of the class
>> metadata. (Mr. Simm's prototype puts such
>> a bit in the metadata pointer, in the object
>> header.)
>>
>> If such a bit can be obtained in a branch-free
>> manner, then it can be used to perturb the result
>> of the old_acmp in a useful manner. For example:
>>
>> bool new_acmp(ptr x, ptr y) {
>> if (x == null) return y == null;
>> int bit = (x->class_in_header & Q_TYPE_BIT);
>> ptr x1 = x | swar_broadcast_bit(bit != 0);
>> return old_acmp(x1, y);
>> }
>> int swar_broadcast_bit(bool z) {
>> return z ? -1 : 0; // use ALU ops
>> }
>>
>> The idea is to extract a bit which signals the presence
>> of a Q-type (in either operand) and use it to "mess up"
>> the equality comparison, using and, or, or xor as needed.
>>
>> This perturbation technique has two costs: First,
>> it requires a null check (unless the Q-bit is in the
>> actual physical reference). Second, it requires
>> an ALU operation or two to spread the perturbation.
>> But these costs will probably be negligible, except
>> (perhaps) in very tight loops in generic code.
>>
>> Generic code loops performing frequent acmp operations
>> on untyped (Object) operands will need to perform
>> extra null checks and/or value/reference detections
>> if they are not already present (by luck) in the loop
>> context. In this case, loop unswitching may be useful.
>>
>> There is one more contextual operation which may be
>> helpful in such cases, if all else fails (and loop unswitching
>> is not desirable). Honestly, I haven't seen any real-world
>> code yet that needs it, but it is comforting to have as a
>> fall-back. The idea is this: If the acmp is contextually
>> followed by a call to Object.equals, in the usual Legacy
>> Idiom For Equality (L.I.F.E.), then there is one more trick
>> we can play. We'd like to force the new_acmp down to
>> an old_acmp in these cases. Can we? Here's the choice:
>>
>> bool dutiful_life(ptr x, ptr y) {
>> if (new_acmp(x, y)) return true;
>> return x->Object_equals(y);
>> }
>> bool clever_life(ptr x, ptr y) {
>> if (old_acmp(x, y)) return true;
>> return x->Object_equals(y);
>> }
>>
>> Note that the only semantic difference between old_acmp
>> and new_acmp arises when *both* operands are the *same*
>> value. (Work out the cases: In every other case, the simple
>> pointer comparison detects the difference and produces the
>> correct "false" answer.) Thus, in the L.I.F.E., the only case
>> where old_acmp gives the wrong answer produces an
>> immediate answer of "true", instead of following up with
>> a call to x.equals(x) on the Q-value.
>>
>> To put it in a nutshell, the only difference between dutiful_life
>> and clever_life is that the former always calls Object::equals
>> if either operand is a value, and the the latter calls
>> Object::equals in a subset of those cases (when the
>> physical pointers differ). When the physical pointers
>> do *not* differ, there is no call to Object::equals. But
>> clever_life always delivers the same answer as dutiful_life.
>>
>> So is there a problem? Not much, although it requires us to treat
>> Object::equals as an intrinsic with some predictable semantics.
>> The crucial inside that allows us to adopt clever_life is that,
>> if you call x.equals(x) on a non-null x, the result must always
>> be true. Now, this is not enforced by the JVM, but it is clearly
>> documented in the API specification for Object::equals, and
>> lots of stuff would be already broken if that were ever false.
>>
>> Thus, if we are willing to rule out implementations of Object::equals
>> for which a value can ever compare *not equal* to itself, then
>> we can substitute clever_life for dutiful_life. We also have to
>> allow one more thing: That we can silently drop calls to equals
>> (at least in the context of L.I.F.E.) when the JVM can prove that
>> the contract of equals allows the JVM to predict the outcome.
>> This means that if you put side effects (like print statements)
>> into an equals method on a value type, you might not see the
>> side effects, after the JVM has optimized things. This corner
>> case may require further thought in order to straighten it out.
>>
>> Note that L.I.F.E. is frequently found in generic code, where
>> instead of operating on Object values it operates on type
>> parameter types. (The JVM just sees Object, after erasure.)
>> In that case, if the JVM were to specialize the generic code
>> separately for some types (such as value types), along the
>> lines of the template class proposals, then the L.I.F.E.
>> code would fold up in the JIT using the rules given at the
>> top of this note. The generic expression "a == b", where
>> a or b was a value of a specialized type parameter T,
>> would fold up to false in every specialization where T was
>> a Q-type. Likewise, as noted above, if L.I.F.E. is applied
>> to a true R-type (or specialized to such a type), then any
>> new_acmp on that type can be short-circuited to the faster
>> old_acmp.
>>
>> All in all, it appears that the cost of the new acmp semantics
>> can, for all practical purposes, be pushed down into the noise.
>>
>> — John
>>
More information about the valhalla-dev
mailing list