[vector] binary operations with a scalar
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Mon Feb 19 15:18:10 UTC 2018
Paul,
> Here is an initial step towards this focusing on the add operation for IntVector:
>
> http://cr.openjdk.java.net/~psandoz/panama/bin-op-with-scalar/webrev/
>
> I don’t wanna proceed further until we agree on the direction/optimization strategy.
>
> If we want to elide explicit broadcasts (hidden in the implementation) we would need to push this down into the intrinsics. This would require 6 additional intrinsic methods on VectorIntrinsics.
I like the proposed API methods. The ability to hide broadcasts behind
the scenes definitely improves the API.

Regarding the implementation, from the JVM/JIT perspective, IMO there's
no compelling reason to intrinsify those methods. As you noted, the new
methods can be implemented by composing existing operations (broadcast +
add/sub/...), and the JVM has to do the same if it intrinsifies them.

The only difference between composing and intrinsifying is what happens
when intrinsification fails. I share your desire to provide an optimized
implementation for the non-intrinsified case, but I don't think it
justifies intrinsification.

What do you think about the following alternative?

    @Override
    @ForceInline
    public Int256Vector add(int o) {
        Int256Vector v = Int256Species.broadcastImpl(o);
        return (Int256Vector) VectorIntrinsics.binaryOp(
            VECTOR_OP_ADD, Int256Vector.class, int.class, LENGTH,
            this, v,
            (v1, __) -> v1.bOp(o, (i, a, b) -> (int)(a + b)));
    }

    static final class Int256Species extends ... {
        ...
        @ForceInline
        static Int256Vector broadcastImpl(int e) {
            return VectorIntrinsics.broadcastCoerced(
                Int256Vector.class, int.class, LENGTH,
                e,
                ((long bits) -> SPECIES.op(i -> (int)bits)));
        }

        @Override
        @ForceInline
        public Int256Vector broadcast(int e) {
            return broadcastImpl(e);
        }
        ...
    }

All vector shape information is available, so both operations can be
reliably intrinsified: e.g., there is no need to dispatch through the
vector species to broadcast the scalar.

If VectorIntrinsics.binaryOp can't be intrinsified, the default
implementation is specialized and doesn't use the broadcasted vector, so
C2 should be able to get rid of the broadcast operation most of the time.
(The only case I see where that's not possible is when there's a
safepoint in between, which IMO shouldn't be a problem in practice.)

Another peculiarity is that the default implementation is now
represented by a capturing lambda, which has to be allocated on every
invocation. But if proper inlining happens, C2 should eliminate the
allocation as well.
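
To make the shape of the non-intrinsified path concrete, here is a rough
pure-Java model. Everything in it is a stand-in invented for
illustration: IntVec, LaneBinOp, VectorBinOp and this binaryOp are
hypothetical and are not the real Int256Vector or VectorIntrinsics code.
The model always takes the fallback, which is roughly what happens when
intrinsification fails. It only shows that the fallback lambda captures
the scalar and ignores the broadcasted vector; it says nothing about
what C2 actually does with it.

```java
import java.util.Arrays;

public class ScalarAddSketch {
    /** Hypothetical per-lane function, modelled on the (i, a, b) lambda above. */
    interface LaneBinOp { int apply(int lane, int a, int b); }

    /** Hypothetical vector-level fallback that receives both operands. */
    interface VectorBinOp { IntVec apply(IntVec v1, IntVec v2); }

    /** Toy vector backed by an int[]; NOT the real Int256Vector. */
    static final class IntVec {
        private final int[] lanes;

        private IntVec(int[] lanes) { this.lanes = lanes; }

        static IntVec of(int... values) { return new IntVec(values.clone()); }

        /** Stand-in for broadcastImpl: every lane is set to e. */
        static IntVec broadcast(int length, int e) {
            int[] r = new int[length];
            Arrays.fill(r, e);
            return new IntVec(r);
        }

        /** Lane-wise op against a scalar; no second vector is needed. */
        IntVec bOp(int o, LaneBinOp f) {
            int[] r = new int[lanes.length];
            for (int i = 0; i < r.length; i++) r[i] = f.apply(i, lanes[i], o);
            return new IntVec(r);
        }

        /** Lane-wise op against another vector. */
        IntVec bOp(IntVec o, LaneBinOp f) {
            int[] r = new int[lanes.length];
            for (int i = 0; i < r.length; i++) r[i] = f.apply(i, lanes[i], o.lanes[i]);
            return new IntVec(r);
        }

        /** Stand-in for the intrinsic entry point; here it always falls back. */
        static IntVec binaryOp(IntVec a, IntVec b, VectorBinOp fallback) {
            return fallback.apply(a, b);
        }

        /** Scalar add: the fallback captures o and never reads the broadcast. */
        IntVec add(int o) {
            IntVec v = broadcast(lanes.length, o);
            return binaryOp(this, v, (v1, unused) -> v1.bOp(o, (i, a, b) -> a + b));
        }

        /** Plain composition, for comparison: the fallback reads the broadcast. */
        IntVec addViaBroadcast(int o) {
            IntVec v = broadcast(lanes.length, o);
            return binaryOp(this, v, (v1, v2) -> v1.bOp(v2, (i, a, b) -> a + b));
        }

        int[] toArray() { return lanes.clone(); }
    }

    public static void main(String[] args) {
        IntVec x = IntVec.of(1, 2, 3, 4);
        System.out.println(Arrays.toString(x.add(10).toArray()));
        System.out.println(Arrays.toString(x.addViaBroadcast(10).toArray()));
    }
}
```

In add(int), the broadcast result v is dead in the fallback path, which
is the property that would let a compiler drop it. In addViaBroadcast it
is live, so the broadcast has to be materialized.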
Best regards,
Vladimir Ivanov
> The current approach may not be optimal performance-wise leveraging existing intrinsics and is not optimal from the pure Java perspective either.
>
> Thanks,
> Paul.
>