Vector API using MemorySegment <was> Re: Separation between MemorySegment and MemoryScope
Paul Sandoz
paul.sandoz at oracle.com
Tue Mar 30 18:38:35 UTC 2021
I looked a little more closely at loading from ByteBuffer and byte[] compared to, say, int[].
While there might be some bounds check issues for ByteBuffer, I think the real problem is that the addressing calculations are not optimized. Often this is all interrelated. Loading a Vector from a ByteBuffer uses Unsafe to access the buffer's base and address [1].
Loading from byte[] is much improved; however, unrolling does not occur. Loading from int[] results in unrolling.
If I use a VarHandle to load int values from a ByteBuffer, all is good from the perspective of bounds checks and addressing.
The odd thing is that a Vector load from a ByteBuffer does much the same as VarHandle access in its use of Unsafe to access the buffer's state. It may be that, for Vector, the adjustments to the loop bound and step using the species size are confusing C2. Or perhaps something more specific about the vector access intrinsics needs to be adjusted. This needs more investigation.
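To make the loop-bound point concrete, here is a minimal sketch in plain Java (no jdk.incubator.vector dependency) of the two loop shapes being compared: a byte-offset loop whose bound and step are scaled by the vector size in bytes, and an element-index loop bounded by a loopBound-style value. The class and method names are hypothetical, and the lane count is passed as a plain int standing in for SPECIES.length(), assuming 4-byte int lanes. Both shapes visit the same number of vectors; only the index arithmetic C2 has to reason about differs.

```java
// Hypothetical sketch: compares the trip counts of the two loop shapes.
final class LoopBoundSketch {

    // Byte-offset shape, as in the buffer() and bytes() benchmarks:
    //   for (i = 0; i <= limitBytes - vectorBytes; i += vectorBytes)
    static int tripsByByteOffset(int limitBytes, int lanes) {
        int vectorBytes = lanes << 2;            // assumes 4 bytes per int lane
        int bound = limitBytes - vectorBytes;
        int trips = 0;
        for (int i = 0; i <= bound; i += vectorBytes) {
            trips++;
        }
        return trips;
    }

    // Element-index shape, as in the ints() benchmark:
    //   for (i = 0; i < loopBound(length); i += lanes)
    static int tripsByElementIndex(int lengthInts, int lanes) {
        // Largest multiple of the lane count that is <= lengthInts.
        int loopBound = lengthInts - (lengthInts % lanes);
        int trips = 0;
        for (int i = 0; i < loopBound; i += lanes) {
            trips++;
        }
        return trips;
    }

    public static void main(String[] args) {
        int lanes = 8;                            // e.g. a 256-bit species of ints
        int lengthInts = 4096;
        System.out.println(tripsByByteOffset(lengthInts * 4, lanes));
        System.out.println(tripsByElementIndex(lengthInts, lanes));
    }
}
```

The sketch says nothing about what C2 actually does with either shape; it only shows that the two loops are equivalent in the work they describe.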
Benchmark here [2].
Paul.
[1]
    abstract
    IntVector fromByteBuffer0(ByteBuffer bb, int offset);
    @ForceInline
    final
    IntVector fromByteBuffer0Template(ByteBuffer bb, int offset) {
        IntSpecies vsp = vspecies();
        return VectorSupport.load(
            vsp.vectorType(), vsp.elementType(), vsp.laneCount(),
            bufferBase(bb), bufferAddress(bb, offset),
            bb, offset, vsp,
            (buf, off, s) -> {
                ByteBuffer wb = wrapper(buf, NATIVE_ENDIAN);
                return s.ldOp(wb, off,
                        (wb_, o, i) -> wb_.getInt(o + i * 4));
            });
    }
[2]
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(value = 1, jvmArgsPrepend = {"--add-modules=jdk.incubator.vector,jdk.incubator.foreign", "-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2"})
public class TestBufferLoadStore {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

    @Param("4096")
    int size;

    int[] ia;
    byte[] ba;
    ByteBuffer bb;
    MemorySegment ms;

    @Setup
    public void init() {
        // Round size up to a multiple of the lane count (a no-op for 4096).
        size += (SPECIES.length() - size % SPECIES.length()) % SPECIES.length();
        SplittableRandom sr = new SplittableRandom();
        ia = new int[size];
        ba = new byte[size * 4];
        bb = ByteBuffer.allocateDirect(size * 4);
        ms = MemorySegment.allocateNative(size * 4);
        for (int i = 0; i < bb.capacity(); i += 4) {
            bb.putInt(i, sr.nextInt());
        }
    }

    @Benchmark
    public int buffer() {
        IntVector sum = IntVector.zero(SPECIES);
        int bound = bb.limit() - (SPECIES.length() << 2);
        for (int i = 0; i <= bound; i += SPECIES.length() << 2) {
            IntVector va = IntVector.fromByteBuffer(SPECIES, bb, i, ByteOrder.nativeOrder());
            sum = sum.add(va);
        }
        return sum.reduceLanes(VectorOperators.ADD);
    }

    @Benchmark
    public int bytes() {
        IntVector sum = IntVector.zero(SPECIES);
        int bound = ba.length - (SPECIES.length() << 2);
        for (int i = 0; i <= bound; i += SPECIES.length() << 2) {
            IntVector va = IntVector.fromByteArray(SPECIES, ba, i, ByteOrder.nativeOrder());
            sum = sum.add(va);
        }
        return sum.reduceLanes(VectorOperators.ADD);
    }

    @Benchmark
    public int ints() {
        IntVector sum = IntVector.zero(SPECIES);
        for (int i = 0; i < SPECIES.loopBound(ia.length); i += SPECIES.length()) {
            IntVector va = IntVector.fromArray(SPECIES, ia, i);
            sum = sum.add(va);
        }
        return sum.reduceLanes(VectorOperators.ADD);
    }

    static final VarHandle H = MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    @Benchmark
    public int varHandleBuffer() {
        int bound = bb.limit();
        int sum = 0;
        for (int i = 0; i < bound; i += 4) {
            int v = (int) H.get(bb, i);
            sum += v;
        }
        return sum;
    }

    @Benchmark
    public int memorySegment() {
        int bound = (int) ms.byteSize();
        int sum = 0;
        for (int i = 0; i < bound; i += 4) {
            int v = MemoryAccess.getIntAtOffset(ms, i);
            sum += v;
        }
        return sum;
    }
}
> On Mar 29, 2021, at 3:38 PM, forax at univ-mlv.fr wrote:
>
> ----- Original Message -----
>> From: "Paul Sandoz" <paul.sandoz at oracle.com>
>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>
>> Cc: "Remi Forax" <forax at univ-mlv.fr>, "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
>> Sent: Monday, 29 March 2021 21:37:33
>> Subject: Vector API using MemorySegment <was> Re: Separation between MemorySegment and MemoryScope
>
>> Yeah, sounds messy, would prefer to enhance the Vector API to use MemorySegment
>> when Panama exits incubation (and/or preview).
>>
>> Regarding bounds checks: Remi, you may have noticed we have a system property to
>> disable them [1] for vector load/stores from arrays and buffers, but this is
>> really just for performance investigation.
>
> Yes, I've noticed that. I don't care about most of the bounds checks, only the ones in tight loops close to vectorized operations, because those bounds checks show up in the profiler and cause a 2x to 3x slowdown when the vectorized loop is very simple (moving/copying primitive values around).
>
>> I consider it a bug if HotSpot could
>> but does not strength reduce ‘em and remove ‘em from a loop body (although C2
>> does have some limitations that are hard to overcome). I would be reluctant to
>> add a supported feature to MemoryScope to remove bounds checks on access.
>
> So you can log a bug: bounds checks are removed by C2 when using *Vector.fromArray but not when using *Vector.fromByteBuffer.
> C2 does not try to prove that an index is always within [0 .. buffer.capacity) (see [1]).
>
> Also, having the ByteBuffer be a constant does not lead to better code either:
> the generated assembly still uses a register to hold the address of the underlying buffer instead of using the address directly, even though it is a constant.
>
> I can see both of those optimizations on primitive arrays but not on ByteBuffer, which is kind of a shame because the only way to currently access a MemorySegment from the Vector API is through a ByteBuffer.
>
>>
>> Paul.
>
> Rémi
>
> [1] https://github.com/forax/tomahawk/blob/master/src/main/java/com/github/forax/tomahawk/perf/VecOpPerfTest.java#L105
>
>>
>> static final int VECTOR_ACCESS_OOB_CHECK =
>>         Integer.getInteger("jdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK", 2);
>> …
>>
>> @ForceInline
>> static int checkFromIndexSize(int ix, int vlen, int length) {
>>     switch (VectorIntrinsics.VECTOR_ACCESS_OOB_CHECK) {
>>         case 0: return ix; // no range check
>>         case 1: return Objects.checkFromIndexSize(ix, vlen, length);
>>         case 2: return Objects.checkIndex(ix, length - (vlen - 1));
>>         default: throw new InternalError();
>>     }
>> }
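
As a side note on the quoted checkFromIndexSize, modes 1 and 2 express the same range condition in two forms: mode 1 checks that [ix, ix + vlen) fits in [0, length), while mode 2 folds the vector length into the bound and performs a single index check. The following is a minimal sketch, with hypothetical class and method names, that compares the two java.util.Objects calls by brute force over small inputs. It assumes vlen >= 1 and length >= 0 and makes no claim about how C2 compiles either form.

```java
import java.util.Objects;

// Hypothetical sketch: checks that OOB modes 1 and 2 accept the same indices.
final class OobCheckSketch {

    // Mode 1: explicit (index, size, length) range check.
    static boolean acceptsMode1(int ix, int vlen, int length) {
        try {
            Objects.checkFromIndexSize(ix, vlen, length);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    // Mode 2: single index check against a bound reduced by (vlen - 1).
    static boolean acceptsMode2(int ix, int vlen, int length) {
        try {
            Objects.checkIndex(ix, length - (vlen - 1));
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    // Brute-force comparison for 0 <= length <= maxLength and 1 <= vlen <= maxVlen,
    // including indices slightly outside the valid range on both sides.
    static boolean modesAgree(int maxLength, int maxVlen) {
        for (int length = 0; length <= maxLength; length++) {
            for (int vlen = 1; vlen <= maxVlen; vlen++) {
                for (int ix = -2; ix <= length + 2; ix++) {
                    if (acceptsMode1(ix, vlen, length) != acceptsMode2(ix, vlen, length)) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(modesAgree(64, 16));
    }
}
```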