optimizing acmp in L-World
Tobias Hartmann
tobias.hartmann at oracle.com
Wed Feb 7 16:27:00 UTC 2018
Since ObjectAlignmentInBytes is guaranteed to be >= 2, the low bit of any oop is always zero, so we can OR the
is_value bit from the klass word directly into the operand and do even better:
__ testptr(rdx, rdx);
__ jcc(Assembler::zero, is_null);
__ movl(rbx, Address(rdx, oopDesc::klass_offset_in_bytes()));
__ andptr(rbx, Universe::oop_metadata_odd_mask());
__ orptr(rdx, rbx);
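
In rough pseudocode (just a sketch; Q_TYPE_BIT and the helper name are illustrative, not the actual runtime API),
the sequence above amounts to:

  const uintptr_t Q_TYPE_BIT = 1;  // assumed equal to Universe::oop_metadata_odd_mask()

  // Oops are aligned to at least 2 bytes, so their low bit is always zero.
  // Setting that bit (only when the klass word marks a value type) yields a
  // perturbed operand that can never compare equal to any real oop.
  uintptr_t perturb_if_value(uintptr_t oop, uint32_t klass_word) {
    return oop | (klass_word & Q_TYPE_BIT);
  }
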
Best regards,
Tobias
On 07.02.2018 14:44, Tobias Hartmann wrote:
> Sorry, I've sent the wrong version of the patch. Here's the correct one (without the unnecessary 'andptr'):
>
> @@ -2313,8 +2313,18 @@
> void TemplateTable::if_acmp(Condition cc) {
> transition(atos, vtos);
> // assume branch is more often taken than not (loops use backward branches)
> - Label not_taken;
> + Label not_taken, is_null;
> __ pop_ptr(rdx);
> +
> + assert(Universe::oop_metadata_odd_mask() == 1, "should be");
> + __ testptr(rdx, rdx);
> + __ jcc(Assembler::zero, is_null);
> + __ movl(rbx, Address(rdx, oopDesc::klass_offset_in_bytes()));
> + __ shlptr(rbx, 63);
> + __ sarptr(rbx, 63);
> + __ orptr(rdx, rbx);
> +
> + __ bind(is_null);
> __ cmpoop(rdx, rax);
> __ jcc(j_not(cc), not_taken);
> branch(false, false);
>
> Best regards,
> Tobias
>
> On 07.02.2018 14:40, Tobias Hartmann wrote:
>> Hi,
>>
>> Very nice writeup, thanks John!
>>
>> I've started experimenting with C2 support/optimizations for the new acmp. For reference, here's the interpreter patch
>> I'm using to verify correctness of my tests and the C2 implementation:
>>
>> @@ -2313,8 +2314,19 @@
>> void TemplateTable::if_acmp(Condition cc) {
>> transition(atos, vtos);
>> // assume branch is more often taken than not (loops use backward branches)
>> - Label not_taken;
>> + Label not_taken, is_null;
>> __ pop_ptr(rdx);
>> +
>> + __ testptr(rdx, rdx);
>> + __ jcc(Assembler::zero, is_null);
>> + assert(Universe::oop_metadata_odd_mask() == 1, "should be");
>> + __ movl(rbx, Address(rdx, oopDesc::klass_offset_in_bytes()));
>> + __ andptr(rbx, Universe::oop_metadata_odd_mask());
>> + __ shlptr(rbx, 63);
>> + __ sarptr(rbx, 63);
>> + __ orptr(rdx, rbx);
>> +
>> + __ bind(is_null);
>> __ cmpoop(rdx, rax);
>> __ jcc(j_not(cc), not_taken);
>>
>> I'm shifting the is_value bit to the position of the sign bit and then doing an arithmetic right shift to get all 1s in
>> case the operand is a value type (or all 0s otherwise). The result is used to perturb the original operand. There might
>> be a more efficient solution, but I'll just leave this here in case someone wants to experiment as well.
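>>
>> In C-like pseudocode (again just a sketch; the helper name is illustrative), the shl/sar pair broadcasts the
>> is_value bit into a full-width mask:
>>
>>   // Broadcast the low (is_value) bit of the klass word to all 64 bits and OR it
>>   // into the operand: a value operand becomes all 1s and can never compare equal
>>   // to a real oop, while a reference operand is left unchanged.
>>   uintptr_t perturb_with_mask(uintptr_t oop, uint32_t klass_word) {
>>     uintptr_t mask = -(uintptr_t)(klass_word & 1);  // all 1s or all 0s, like shl 63 / sar 63
>>     return oop | mask;
>>   }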
>>
>> Best regards,
>> Tobias
>>
>>
>> On 29.01.2018 21:42, John Rose wrote:
>>> Tobias, Roland, and I had a chat in Santa Clara last week
>>> about optimizing the acmp operation in the JVM. Here are
>>> my notes on it.
>>>
>>> Background: The acmp operation, also known as reference
>>> comparison, is overloaded in L-World to handle values as well,
>>> with a specific semantics of returning false if either operand is
>>> a value (similar to NaN behavior for fcmp). The rationale for
>>> this is that acmp is reference comparison, even in L-World,
>>> and since values have no references, they can never be
>>> equal as references.
>>>
>>> (Alternatively, it is *as if* both operands were converted
>>> *to references* before the acmp operation, which means
>>> that either operand, if it is a value, *would be* boxed to a
>>> reference value, which then, being a new reference, is
>>> automatically unequal to any other reference. This tricky
>>> account may be appealing in some circumstances.)
>>>
>>> OK, so how do we make it fast? Ideally, we would like
>>> to strength-reduce the enhanced acmp instruction down
>>> to the original acmp instruction, which after all is just a
>>> single instruction (such as cmpq) on all platforms.
>>>
>>> The first thing to notice, before going into details, is
>>> that in L-world all Q-values are buffered in the interpreter.
>>> This means that, at least for interpreter-generated values,
>>> all values can be treated as physical pointers, even
>>> if they are logically Q-values.
>>>
>>> (Crib sheet: In L-world, all kinds of values except primitives
>>> are formally carried by L-values, under L-descriptors. True
>>> references can be referred to as R-values of R-type, reference
>>> type, or object class. New value types are referred to as Q-values
>>> of Q-type, value type, or value class. When you don't know what
>>> you have, you can say it's a U-type, which just like a Q-type or
>>> R-type is carried by an L-value under an L-descriptor.)
>>>
>>> This means that the old acmp semantics are almost correct,
>>> in many cases. If you have two L-values and do a pointer
>>> comparison on their physical pointers, and if the pointers
>>> differ, you are done. If the physical pointers compare equal,
>>> then there is one more thing you have to do: Make sure that
>>> the L-value carried by both physical pointers is in fact a
>>> reference, an R-value.
>>>
>>> So the logic looks like this, and the new semantics are in
>>> the second "if" statement:
>>>
>>> bool new_acmp(ptr x, ptr y) {
>>>   if (!old_acmp(x, y))  return false;  // x != y => false
>>>   if (is_value_ptr(x))  return false;  // x.is_value => false
>>>   return true;
>>> }
>>> bool old_acmp(ptr x, ptr y) {
>>>   return x == y;  // cmpq etc.
>>> }
>>>
>>> Note that since acmp is symmetric, the JVM has the option
>>> to silently test either x or y for "is_value_ptr" if the two physical
>>> pointers compare equal.
>>>
>>> This leads immediately to the first set of optimizations, which
>>> probably are enough to get us where we want to go. (There
>>> is a second set we can add later.) All the JVM has to do is
>>> prove or detect that one or the other physical pointer either
>>> *is certainly* a Q-value, or *is certainly not* a Q-value.
>>>
>>> If either physical pointer is certainly a Q-value, then the
>>> new_acmp instruction is a constant false. If either physical
>>> pointer is certainly not a Q-value, then the new_acmp
>>> instruction strength reduces to the old_acmp instruction.
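>>>
>>> As a sketch (illustrative only, not actual JIT code), the folding decision
>>> amounts to:
>>>
>>>   enum Fold { FOLD_FALSE, FOLD_OLD_ACMP, NO_FOLD };
>>>
>>>   // x_is_q / y_is_q: operand statically known (or speculated) to be a Q-type;
>>>   // x_is_r / y_is_r: operand statically known to be a true reference (R-type).
>>>   Fold fold_new_acmp(bool x_is_q, bool x_is_r, bool y_is_q, bool y_is_r) {
>>>     if (x_is_q || y_is_q)  return FOLD_FALSE;     // value operand => constant false
>>>     if (x_is_r || y_is_r)  return FOLD_OLD_ACMP;  // reference operand => plain pointer compare
>>>     return NO_FOLD;                               // both U-types => runtime check needed
>>>   }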
>>>
>>> The strength reduction to false can be done in any place
>>> where the type of either operand is known to be a Q-type.
>>> The interpreter doesn't track types, and the acmp instruction
>>> isn't resolved so it can't be quickened to a type context,
>>> but the JIT has lots of type information. Thus, if either
>>> operand of acmp is statically known (or accurately speculated)
>>> to be a value type, then the acmp instruction constant folds
>>> to false.
>>>
>>> Likewise, if either operand is known to be an R-type
>>> (true reference) the acmp may be strength reduced to
>>> a simple reference comparison. This also is likely to
>>> happen in the JIT. It also happens in the interpreter
>>> in the acmp_null instruction, where (obviously) the
>>> second operand is a reference (i.e., null).
>>>
>>> (Note that the type Object is a U-type in L-world, so
>>> Object-to-Object comparisons such as are found in
>>> generic collection code do not benefit from this
>>> optimization. More later on that.)
>>>
>>> The JIT can also track, in its profile, whether various
>>> instructions "see" only Q-values or only R-values.
>>> (The classic HotSpot already tracks "sightings" of
>>> null for similar purposes.) In that case, speculative
>>> narrowings (only Q-types or only R-types) can be
>>> used to short-circuit new_acmp down to old_acmp.
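>>>
>>> For instance (a sketch only; is_value_ptr and deoptimize stand in for the real
>>> runtime checks and the uncommon-trap machinery), a speculative narrowing to
>>> R-types might compile to:
>>>
>>>   // Fast path compiled under the speculation "this operand is never a value".
>>>   // If the guard ever fails, we deoptimize and recompile with the full new_acmp.
>>>   bool acmp_speculating_r_type(ptr x, ptr y) {
>>>     if (is_value_ptr(x))  deoptimize();  // uncommon trap: speculation failed
>>>     return old_acmp(x, y);               // plain pointer compare on the fast path
>>>   }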
>>>
>>> This is useful because the speculation can sometimes
>>> be verified out of a loop body, where the actual acmp
>>> instruction is executed in a loop. Remember that the
>>> JIT can choose which operand of acmp to test for
>>> value-ness; if it chooses a loop invariant, and the
>>> invariant is always Q or always R, then the acmp
>>> inside the loop folds up, no matter what the dynamic
>>> value of the other operand.
>>>
>>> For that matter, even if speculation fails, loop
>>> unswitching can get to a similar result, at the cost
>>> of two copies of the loop.
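>>>
>>> Sketched on the pseudocode above (again only illustrative), an unswitched loop
>>> over a loop-invariant operand 'key' might look like:
>>>
>>>   // The value-ness test on the invariant operand is hoisted out of the loop;
>>>   // each copy of the loop then sees either a constant-false acmp or old_acmp.
>>>   int index_of_same(ptr* a, int len, ptr key) {
>>>     if (key == null || !is_value_ptr(key)) {
>>>       for (int i = 0; i < len; i++)
>>>         if (old_acmp(a[i], key))  return i;  // new_acmp reduced to old_acmp
>>>     } else {
>>>       // key is a value: new_acmp is constant false, so this copy of the loop folds away
>>>     }
>>>     return -1;
>>>   }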
>>>
>>> So, in many cases, contextual information can allow
>>> the JVM to fold up new_acmp to old_acmp or constant
>>> false.
>>>
>>> In cases where such folding fails, there is a simple trick
>>> which can make things easier. Recall that folding
>>> fails when *both* operands are statically U-types,
>>> *neither* operand folding to a Q-type or R-type.
>>>
>>> In that case, the JVM must pick one operand or
>>> the other (it doesn't matter which) and perform
>>> the is_value_ptr operation on it, in such a way
>>> that the result of that test affects the final
>>> result in the correct way. A naive implementation
>>> of new_acmp adds an extra test-and-branch
>>> to fold in the is_value_ptr test, but a clever one might
>>> be able to avoid it.
>>>
>>> Suppose there is a branch-free technique for
>>> implementing is_value_ptr, such as extracting
>>> a bit from the physical pointer to the U-value,
>>> or (after a null check) extracting a bit from the
>>> class metadata pointer in the value's
>>> header, or (after two indirections)
>>> extracting a bit from the structure of the class
>>> metadata. (Mr. Simms' prototype puts such
>>> a bit in the metadata pointer, in the object
>>> header.)
>>>
>>> If such a bit can be obtained in a branch-free
>>> manner, then it can be used to perturb the result
>>> of the old_acmp in a useful manner. For example:
>>>
>>> bool new_acmp(ptr x, ptr y) {
>>>   if (x == null)  return y == null;
>>>   int bit = (x->class_in_header & Q_TYPE_BIT);
>>>   ptr x1 = x | swar_broadcast_bit(bit != 0);
>>>   return old_acmp(x1, y);
>>> }
>>> int swar_broadcast_bit(bool z) {
>>>   return z ? -1 : 0;  // use ALU ops
>>> }
>>>
>>> The idea is to extract a bit which signals the presence
>>> of a Q-type (in either operand) and use it to "mess up"
>>> the equality comparison, using and, or, or xor as needed.
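>>>
>>> For example (illustrative only), an xor-based variant of the same perturbation,
>>> which flips every bit of a value operand instead of setting them all:
>>>
>>>   bool new_acmp_xor(ptr x, ptr y) {
>>>     if (x == null)  return y == null;
>>>     int bit = (x->class_in_header & Q_TYPE_BIT);
>>>     // a complemented aligned pointer can never match a real oop or null
>>>     ptr x1 = x ^ swar_broadcast_bit(bit != 0);
>>>     return old_acmp(x1, y);
>>>   }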
>>>
>>> This perturbation technique has two costs: First,
>>> it requires a null check (unless the Q-bit is in the
>>> actual physical reference). Second, it requires
>>> an ALU operation or two to spread the perturbation.
>>> But these costs will probably be negligible, except
>>> (perhaps) in very tight loops in generic code.
>>>
>>> Generic code loops performing frequent acmp operations
>>> on untyped (Object) operands will need to perform
>>> extra null checks and/or value/reference detections
>>> if they are not already present (by luck) in the loop
>>> context. In this case, loop unswitching may be useful.
>>>
>>> There is one more contextual operation which may be
>>> helpful in such cases, if all else fails (and loop unswitching
>>> is not desirable). Honestly, I haven't seen any real-world
>>> code yet that needs it, but it is comforting to have as a
>>> fall-back. The idea is this: If the acmp is contextually
>>> followed by a call to Object.equals, in the usual Legacy
>>> Idiom For Equality (L.I.F.E.), then there is one more trick
>>> we can play. We'd like to force the new_acmp down to
>>> an old_acmp in these cases. Can we? Here's the choice:
>>>
>>> bool dutiful_life(ptr x, ptr y) {
>>>   if (new_acmp(x, y))  return true;
>>>   return x->Object_equals(y);
>>> }
>>> bool clever_life(ptr x, ptr y) {
>>>   if (old_acmp(x, y))  return true;
>>>   return x->Object_equals(y);
>>> }
>>>
>>> Note that the only semantic difference between old_acmp
>>> and new_acmp arises when *both* operands are the *same*
>>> value. (Work out the cases: In every other case, the simple
>>> pointer comparison detects the difference and produces the
>>> correct "false" answer.) Thus, in the L.I.F.E., the only case
>>> where old_acmp gives the wrong answer produces an
>>> immediate answer of "true", instead of following up with
>>> a call to x.equals(x) on the Q-value.
>>>
>>> To put it in a nutshell, the only difference between dutiful_life
>>> and clever_life is that the former always calls Object::equals
>>> if either operand is a value, and the latter calls
>>> Object::equals in a subset of those cases (when the
>>> physical pointers differ). When the physical pointers
>>> do *not* differ, there is no call to Object::equals. But
>>> clever_life always delivers the same answer as dutiful_life.
>>>
>>> So is there a problem? Not much, although it requires us to treat
>>> Object::equals as an intrinsic with some predictable semantics.
>>> The crucial insight that allows us to adopt clever_life is that,
>>> if you call x.equals(x) on a non-null x, the result must always
>>> be true. Now, this is not enforced by the JVM, but it is clearly
>>> documented in the API specification for Object::equals, and
>>> lots of stuff would be already broken if that were ever false.
>>>
>>> Thus, if we are willing to rule out implementations of Object::equals
>>> for which a value can ever compare *not equal* to itself, then
>>> we can substitute clever_life for dutiful_life. We also have to
>>> allow one more thing: That we can silently drop calls to equals
>>> (at least in the context of L.I.F.E.) when the JVM can prove that
>>> the contract of equals allows the JVM to predict the outcome.
>>> This means that if you put side effects (like print statements)
>>> into an equals method on a value type, you might not see the
>>> side effects, after the JVM has optimized things. This corner
>>> case may require further thought in order to straighten it out.
>>>
>>> Note that L.I.F.E. is frequently found in generic code, where
>>> instead of operating on Object values it operates on type
>>> parameter types. (The JVM just sees Object, after erasure.)
>>> In that case, if the JVM were to specialize the generic code
>>> separately for some types (such as value types), along the
>>> lines of the template class proposals, then the L.I.F.E.
>>> code would fold up in the JIT using the rules given at the
>>> top of this note. The generic expression "a == b", where
>>> a or b was a value of a specialized type parameter T,
>>> would fold up to false in every specialization where T was
>>> a Q-type. Likewise, as noted above, if L.I.F.E. is applied
>>> to a true R-type (or specialized to such a type), then any
>>> new_acmp on that type can be short-circuited to the faster
>>> old_acmp.
>>>
>>> All in all, it appears that the cost of the new acmp semantics
>>> can, for all practical purposes, be pushed down into the noise.
>>>
>>> — John
>>>