Allocation caused by vector rebracketing twice
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Mon Jan 21 19:27:10 UTC 2019
I think what you are seeing is a consequence of missing support for
LongVector.addAll() on pre-AVX512 CPUs [1]. In your last case (a single
rebracket), ByteVector.addAll() is called instead.
You can verify that by looking at -XX:+PrintInlining output:
3106 166 Benchmark::vectorBitCount (54 bytes)
...
@ 40 jdk.incubator.vector.Long256Vector::addAll (19 bytes) force inline by annotation
@ 10 java.lang.invoke.Invokers$Holder::linkToTargetMethod (8 bytes) force inline by annotation
@ 4 java.lang.invoke.LambdaForm$MH/0x00000008011cb440::invoke (8 bytes) force inline by annotation
@ 15 jdk.incubator.vector.VectorIntrinsics::reductionCoerced (16 bytes) failed to inline (intrinsic)
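
If the log is long, a small filter makes the failed entries easier to spot. This is only a minimal sketch: it assumes each entry sits on one line in the form "@ <bci> <class>::<method> (<n> bytes) <message>" (long entries may appear wrapped in this mail), and the class and method names InliningLogScan and failedToInline are hypothetical helpers, not part of any JDK tool.

```java
import java.util.ArrayList;
import java.util.List;

public class InliningLogScan {
    // Returns the methods that the log reports as "failed to inline".
    // Assumes one entry per line: "@ <bci> <class>::<method> (<n> bytes) <message>".
    static List<String> failedToInline(String log) {
        List<String> failed = new ArrayList<>();
        for (String line : log.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("@") || !trimmed.contains("failed to inline")) {
                continue;
            }
            String[] parts = trimmed.split("\\s+");
            // parts[0] is "@", parts[1] is the bci, parts[2] is the method.
            if (parts.length > 2) {
                failed.add(parts[2]);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        String log = String.join("\n",
            "@ 40 jdk.incubator.vector.Long256Vector::addAll (19 bytes) force inline by annotation",
            "@ 15 jdk.incubator.vector.VectorIntrinsics::reductionCoerced (16 bytes) failed to inline (intrinsic)");
        // Prints the single failed entry, VectorIntrinsics::reductionCoerced.
        System.out.println(failedToInline(log));
    }
}
```
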
Best regards,
Vladimir Ivanov
[1]
http://hg.openjdk.java.net/panama/dev/file/a8516a4be714/src/hotspot/cpu/x86/x86.ad#l1421
On 19/01/2019 11:28, Richard Startin wrote:
> It's great to see that there is now a shiftR method because this was the missing link to make vector bit counts possible. With great excitement, I tried this at 54348:a8516a4be714 but it doesn't work very well yet.
>
>
> @Benchmark
> public int vectorBitCount() {
>   int bitCount = 0;
>   var lookupPos = YMM_BYTE.fromArray(LOOKUP_POS, 0);
>   var lookupNeg = YMM_BYTE.fromArray(LOOKUP_NEG, 0);
>   var lowMask = YMM_BYTE.broadcast((byte)0x0F);
>   for (int i = 0; i < data.length; i += 4) {
>     var bytes = (ByteVector)YMM_LONG.fromArray(data, i).rebracket(YMM_BYTE);
>     bitCount += (int)((LongVector)lookupPos.rearrange(bytes.and(lowMask).toShuffle())
>         .add(lookupNeg.rearrange(bytes.shiftR(4).and(lowMask).toShuffle()))
>         .rebracket(YMM_LONG)).addAll();
>   }
>   return bitCount;
> }
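
For reference, the nibble-lookup idea behind this benchmark can be sketched in scalar code: split each byte into its low and high 4 bits and look up the bit count of each half. This is only an illustration under assumptions. The contents of LOOKUP_POS and LOOKUP_NEG are not shown in this thread, so the sketch uses a single hypothetical table, NIBBLE_BITS, and the class name NibblePopCount is made up.

```java
public class NibblePopCount {
    // Hypothetical table: NIBBLE_BITS[n] is the number of set bits in the 4-bit value n.
    static final byte[] NIBBLE_BITS = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

    static int bitCount(long[] data) {
        int bitCount = 0;
        for (long word : data) {
            // Walk the eight bytes of each long.
            for (int shift = 0; shift < 64; shift += 8) {
                int b = (int) (word >>> shift) & 0xFF;
                // Low nibble plus high nibble, mirroring and(lowMask) and shiftR(4).and(lowMask).
                bitCount += NIBBLE_BITS[b & 0x0F] + NIBBLE_BITS[b >>> 4];
            }
        }
        return bitCount;
    }

    public static void main(String[] args) {
        long[] data = {0L, -1L, 0x0123456789ABCDEFL};
        int expected = 0;
        for (long word : data) {
            expected += Long.bitCount(word);
        }
        // Compare against the JDK's own bit count as a sanity check.
        System.out.println(bitCount(data) == expected);
    }
}
```

A scalar version like this can serve as a correctness check for the vectorized loop, independent of whether the reduction is intrinsified.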
>
>
> JMH -prof gc shows high allocation rates:
>
>
> Iteration 1: 0.008 ops/us
> ·gc.alloc.rate: 1112.997 MB/sec
> ·gc.alloc.rate.norm: 219264.051 B/op
> ·gc.churn.G1_Eden_Space: 1042.296 MB/sec
> ·gc.churn.G1_Eden_Space.norm: 205335.753 B/op
> ·gc.churn.G1_Old_Gen: 0.001 MB/sec
> ·gc.churn.G1_Old_Gen.norm: 0.258 B/op
> ·gc.count: 6.000 counts
> ·gc.time: 6.000 ms
>
> Stripping the code down until allocation became negligible, I found that the smallest reproducer rebrackets the vector and then performs the reverse rebracket:
>
> @Benchmark
> public int vectorBitCount() {
>   int bitCount = 0;
>   for (int i = 0; i < data.length; i += 4) {
>     bitCount += (int)((LongVector)YMM_LONG.fromArray(data, i).rebracket(YMM_BYTE).rebracket(YMM_LONG)).addAll();
>   }
>   return bitCount;
> }
>
> -prof gc:
>
> Iteration 1: 0.310 ops/us
> ·gc.alloc.rate: 3223.539 MB/sec
> ·gc.alloc.rate.norm: 16384.001 B/op
> ·gc.churn.G1_Eden_Space: 3084.887 MB/sec
> ·gc.churn.G1_Eden_Space.norm: 15679.287 B/op
> ·gc.churn.G1_Old_Gen: 0.005 MB/sec
> ·gc.churn.G1_Old_Gen.norm: 0.026 B/op
> ·gc.count: 8.000 counts
> ·gc.time: 9.000 ms
>
> Without reversing the rebracket, I see negligible allocation:
>
> @Benchmark
> public int vectorBitCount() {
>   int bitCount = 0;
>   for (int i = 0; i < data.length; i += 4) {
>     bitCount += (int)((ByteVector)YMM_LONG.fromArray(data, i).rebracket(YMM_BYTE)).addAll();
>   }
>   return bitCount;
> }
>
> Iteration 1: 1.456 ops/us
> ·gc.alloc.rate: ≈ 10⁻⁴ MB/sec
> ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
> ·gc.count: ≈ 0 counts
>
> Thanks,
> Richard
>
>
>