[vector] reducing the cast implementation

Tue Jun 5 20:15:31 UTC 2018

Looks good - thanks for trying it out! I believe the test in the loop is a range check. My guess is that if you run your tests with the "-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0" flag, you should see just the vector load, convert, and store, without the extra test. Anyway, thanks again!
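
As a rough illustration of the kind of range check being discussed, here is a minimal sketch in plain Java. The class name, the method name, the default mode value of 2, and the reading of 0 as "skip the explicit check" are all assumptions made for this example; it is not the incubator's actual code.

```java
// Hypothetical sketch: OobCheckSketch, checkIndex, and the default mode value
// are assumptions for illustration, not the real jdk.incubator.vector code.
final class OobCheckSketch {

    // Read once at class initialization; 0 is assumed to mean "no explicit check".
    static final int OOB_CHECK =
            Integer.getInteger("jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK", 2);

    // Verifies that lanes [index, index + lanes) fit inside an array of
    // arrayLength elements. The mode is a parameter so both paths can be tried.
    static int checkIndex(int index, int arrayLength, int lanes, int mode) {
        if (mode == 0) {
            return index; // check disabled: the caller is trusted
        }
        if (index < 0 || index > arrayLength - lanes) {
            throw new ArrayIndexOutOfBoundsException(
                    "index " + index + " with " + lanes + " lanes, length " + arrayLength);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println("mode from system property: " + OOB_CHECK);
        System.out.println(checkIndex(1020, 1024, 4, OOB_CHECK));
    }
}
```

With a nonzero mode, a check like this would plausibly compile to a compare-and-branch pair in the loop body; with mode 0 the branch has nothing to guard and could be dropped.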

--Razvan

From: Paul Sandoz [mailto:paul.sandoz at oracle.com]
Sent: Tuesday, June 05, 2018 11:52 AM
To: Lupusoru, Razvan A <razvan.a.lupusoru at intel.com>
Cc: panama-dev at openjdk.java.net
Subject: Re: [vector] reducing the cast implementation

Hi,

I wrote a simple benchmark (see below) and analyzed the generated code. I limited the testing mostly to shapes that are hardware supported on my laptop.

For these simple tests I can reduce the cast down to the following with no change in generated code:

@Override
@ForceInline
@SuppressWarnings("unchecked")
public <F, T extends Shape> Float128Vector cast(Vector<F, T> o) {
    if (o.length() != LENGTH)
        throw new IllegalArgumentException("Vector length this species length differ");

    return VectorIntrinsics.cast(
        (Class<Vector<F, T>>) o.getClass(), o.elementType(), LENGTH,
        float.class, LENGTH, o,
        (v, t) -> (Float128Vector) super.cast(v)
    );
}

An example of generated code for int to float conversion (with loop unrolling switched off) is:

 0.21%  ↗    0x0000000107510ab1: mov    %edx,%r11d
22.26%  │ ↗  0x0000000107510ab4: vmovdqu 0x10(%r8,%r11,4),%xmm0
 8.91%  │ │  0x0000000107510abb: vcvtdq2ps %xmm0,%xmm0
33.35%  │ │  0x0000000107510abf: cmp    %r9d,%r11d
        │ │  0x0000000107510ac2: jae    0x0000000107510b7c
 0.16%  │ │  0x0000000107510ac8: vmovdqu %xmm0,0x10(%rcx,%r11,4)
21.42%  │ │  0x0000000107510acf: add    $0x4,%edx
 9.81%  │ │  0x0000000107510ad2: cmp    %ebx,%edx
        ╰ │  0x0000000107510ad4: jl     0x0000000107510ab1
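
For readers who do not read x86 assembly, the loop above can be approximated by the scalar Java below. This is only a sketch: the class and method names are made up for illustration, and the inner loop stands in for what a single load, vcvtdq2ps, and store do for four 32-bit lanes at once.

```java
// Hypothetical scalar equivalent of the vectorized int-to-float loop.
// ScalarCastSketch and LANES are illustrative names, not part of the Vector API.
final class ScalarCastSketch {

    // A 128-bit vector holds four 32-bit elements.
    static final int LANES = 4;

    static float[] castIntFloat(int[] a, float[] rf) {
        // Each outer iteration corresponds to one pass through the assembly loop:
        // the index advances by the lane count ("add $0x4,%edx").
        // Any tail shorter than LANES is left untouched in this sketch.
        for (int i = 0; i + LANES <= a.length; i += LANES) {
            // The vector code converts all four lanes in one instruction;
            // here each lane is converted individually.
            for (int lane = 0; lane < LANES; lane++) {
                rf[i + lane] = (float) a[i + lane];
            }
        }
        return rf;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, -5, 6, 7, 8};
        float[] rf = castIntFloat(a, new float[a.length]);
        System.out.println(java.util.Arrays.toString(rf));
    }
}
```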

It appears we don’t need the explicit if/else for the element type if o is type profiled.

I can clean this up further by adjusting the VectorIntrinsics signature and the generics. Further, I think we can remove the capturing lambda by placing the super cast implementation in a static method.
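
To make the capturing-lambda point concrete, here is a small self-contained sketch in plain Java. All names in it (Fallback, BaseCaster, CastFallbackSketch, castDefault) are hypothetical stand-ins, and arrays are used in place of vectors; it only shows the general shape of the change, not the actual Vector API code.

```java
// Hypothetical stand-ins: none of these types exist in the Vector API.
interface Fallback {
    float[] apply(int[] v, Class<?> elementType);
}

class BaseCaster {
    // Stand-in for the generic (non-intrinsic) cast in a superclass.
    float[] cast(int[] v) {
        return CastFallbackSketch.castDefault(v, float.class);
    }
}

final class CastFallbackSketch extends BaseCaster {

    // Capturing form: the lambda calls super.cast, so it has to capture `this`.
    Fallback capturingFallback() {
        return (v, t) -> super.cast(v);
    }

    // Non-capturing form: the logic lives in a static method, so the method
    // reference needs no receiver and can be held in a constant.
    static float[] castDefault(int[] v, Class<?> elementType) {
        float[] r = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = (float) v[i];
        }
        return r;
    }

    static final Fallback STATIC_FALLBACK = CastFallbackSketch::castDefault;

    public static void main(String[] args) {
        int[] v = {1, 2, 3, 4};
        float[] viaLambda = new CastFallbackSketch().capturingFallback().apply(v, float.class);
        float[] viaStatic = STATIC_FALLBACK.apply(v, float.class);
        System.out.println(java.util.Arrays.equals(viaLambda, viaStatic));
    }
}
```

Both forms compute the same result; the difference is only that the static variant does not depend on an enclosing instance.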

Paul.

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(2)
public class CastTest {

    static final IntVector.IntSpecies<Shapes.S128Bit> INT_SPECIES =
            IntVector.species(Shapes.S_128_BIT);

    static final FloatVector.FloatSpecies<Shapes.S128Bit> FLOAT_SPECIES =
            FloatVector.species(Shapes.S_128_BIT);

    static final ShortVector.ShortSpecies<Shapes.S64Bit> SHORT_SPECIES =
            ShortVector.species(Shapes.S_64_BIT);

    static final LongVector.LongSpecies<Shapes.S256Bit> LONG_SPECIES =
            LongVector.species(Shapes.S_256_BIT);

    @Param({"1024"})
    private int size;

    private int[] a;
    private int[] ri;
    private float[] rf;
    private short[] rs;
    private long[] rl;

    @Setup
    public void setUp() {
        a = new int[size];
        ri = new int[size];
        rf = new float[size];
        rs = new short[size];
        rl = new long[size];
        for (int i = 0; i < size; i++) {
            a[i] = 1;
        }
    }

    @Benchmark
    public int[] castIntInt() {
        for (int i = 0; i < a.length; i += INT_SPECIES.length()) {
            IntVector<Shapes.S128Bit> av = INT_SPECIES.fromArray(a, i);
            INT_SPECIES.cast(av).intoArray(ri, i);
        }
        return ri;
    }

    @Benchmark
    public float[] castIntFloat() {
        for (int i = 0; i < a.length; i += INT_SPECIES.length()) {
            IntVector<Shapes.S128Bit> av = INT_SPECIES.fromArray(a, i);
            FLOAT_SPECIES.cast(av).intoArray(rf, i);
        }
        return rf;
    }

    @Benchmark
    public short[] castIntShort() {
        for (int i = 0; i < a.length; i += INT_SPECIES.length()) {
            IntVector<Shapes.S128Bit> av = INT_SPECIES.fromArray(a, i);
            SHORT_SPECIES.cast(av).intoArray(rs, i);
        }
        return rs;
    }

    @Benchmark
    public long[] castIntLong() {
        for (int i = 0; i < a.length; i += INT_SPECIES.length()) {
            IntVector<Shapes.S128Bit> av = INT_SPECIES.fromArray(a, i);
            LONG_SPECIES.cast(av).intoArray(rl, i);
        }
        return rl;
    }
}

On Jun 4, 2018, at 4:14 PM, Paul Sandoz <Paul.Sandoz at oracle.com> wrote:

On Jun 4, 2018, at 3:59 PM, Lupusoru, Razvan A <razvan.a.lupusoru at intel.com> wrote:

Hey Paul,

I am not sure just from looking at it, but I believe it should work. Hotspot already inlines o.bitSize(), and this is based on the type profile. Thus, technically, the cast is not needed, since Hotspot should know by that point what type "o" is. The only part I am unsure about is whether the call to o.getClass() gets inlined so that Hotspot intrinsification resolves the class to a "constant oop". Would you be able to write a simple cast microbenchmark and see if the generated code still looks good? If yes, then you can go ahead with your change.

OK, I can write a microbenchmark to check.

—

Separately, should we simplify the cast intrinsic itself? That would likely require a split of the shared code for _VectorReinterpret and _VectorCast, which may be a good thing in terms of clarity.

Thanks,
Paul.