Allocation caused by vector rebracketing twice

Mon Jan 21 19:27:10 UTC 2019

I think what you are seeing is a consequence of missing support for 
LongVector.addAll() on pre-AVX512 CPUs [1]. In your last case (a single 
rebracket), ByteVector.addAll() is called instead, so that limitation 
is not hit.

You can verify that by looking at -XX:+PrintInlining output:
    3106  166      Benchmark::vectorBitCount (54 bytes)
...
     @ 40   jdk.incubator.vector.Long256Vector::addAll (19 bytes)   force inline by annotation
       @ 10   java.lang.invoke.Invokers$Holder::linkToTargetMethod (8 bytes)   force inline by annotation
         @ 4   java.lang.invoke.LambdaForm$MH/0x00000008011cb440::invoke (8 bytes)   force inline by annotation
       @ 15   jdk.incubator.vector.VectorIntrinsics::reductionCoerced (16 bytes)   failed to inline (intrinsic)
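
If the log is long, the entries of interest can be filtered out
mechanically. Below is a minimal sketch, assuming the output was
captured with one entry per line, as the JVM prints it. The class and
method names are hypothetical and are not part of the JDK or JMH. Note
that PrintInlining is a diagnostic option, so it also needs
-XX:+UnlockDiagnosticVMOptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of the JDK or JMH.
class InlineFailureFilter {

    // Keeps only the entries where the JIT reports that a call was not
    // inlined, e.g. "... failed to inline (intrinsic)".
    static List<String> failedToInline(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains("failed to inline"))
                .map(String::trim)
                .collect(Collectors.toList());
    }

    // Reads PrintInlining output from stdin and prints the failed entries.
    public static void main(String[] args) throws Exception {
        try (BufferedReader in =
                 new BufferedReader(new InputStreamReader(System.in))) {
            List<String> lines = in.lines().collect(Collectors.toList());
            failedToInline(lines).forEach(System.out::println);
        }
    }
}
```

Saved as InlineFailureFilter.java, it could be run on JDK 11 or later
with "java InlineFailureFilter.java", with the benchmark's output piped
into it. For the log above it should leave only the reductionCoerced
entry.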

Best regards,
Vladimir Ivanov

[1] http://hg.openjdk.java.net/panama/dev/file/a8516a4be714/src/hotspot/cpu/x86/x86.ad#l1421

On 19/01/2019 11:28, Richard Startin wrote:
> It's great to see that there is now a shiftR method because this was the missing link to make vector bit counts possible. With great excitement, I tried this at 54348:a8516a4be714 but it doesn't work very well yet.
> 
> 
>    @Benchmark
>    public int vectorBitCount() {
>      int bitCount = 0;
>      var lookupPos = YMM_BYTE.fromArray(LOOKUP_POS, 0);
>      var lookupNeg = YMM_BYTE.fromArray(LOOKUP_NEG, 0);
>      var lowMask = YMM_BYTE.broadcast((byte)0x0F);
>      for (int i = 0; i < data.length; i+= 4) {
>        var bytes = (ByteVector)YMM_LONG.fromArray(data, i).rebracket(YMM_BYTE);
>        bitCount += (int)((LongVector)lookupPos.rearrange(bytes.and(lowMask).toShuffle())
>                .add(lookupNeg.rearrange(bytes.shiftR(4).and(lowMask).toShuffle()))
>                .rebracket(YMM_LONG)).addAll();
>      }
>      return bitCount;
>    }
> 
> 
> JMH -prof gc shows high allocation rates:
> 
> 
> Iteration   1: 0.008 ops/us
>                   ·gc.alloc.rate:               1112.997 MB/sec
>                   ·gc.alloc.rate.norm:          219264.051 B/op
>                   ·gc.churn.G1_Eden_Space:      1042.296 MB/sec
>                   ·gc.churn.G1_Eden_Space.norm: 205335.753 B/op
>                   ·gc.churn.G1_Old_Gen:         0.001 MB/sec
>                   ·gc.churn.G1_Old_Gen.norm:    0.258 B/op
>                   ·gc.count:                    6.000 counts
>                   ·gc.time:                     6.000 ms
> 
> Stripping the code down until negligible allocation rates are observed, the smallest reproducer is where the vector is rebracketed and then the reverse rebracket is performed:
> 
>    @Benchmark
>    public int vectorBitCount() {
>      int bitCount = 0;
>      for (int i = 0; i < data.length; i+= 4) {
>        bitCount += (int)((LongVector)YMM_LONG.fromArray(data, i).rebracket(YMM_BYTE).rebracket(YMM_LONG)).addAll();
>      }
>      return bitCount;
>    }
> 
> -prof gc:
> 
> Iteration   1: 0.310 ops/us
>                   ·gc.alloc.rate:               3223.539 MB/sec
>                   ·gc.alloc.rate.norm:          16384.001 B/op
>                   ·gc.churn.G1_Eden_Space:      3084.887 MB/sec
>                   ·gc.churn.G1_Eden_Space.norm: 15679.287 B/op
>                   ·gc.churn.G1_Old_Gen:         0.005 MB/sec
>                   ·gc.churn.G1_Old_Gen.norm:    0.026 B/op
>                   ·gc.count:                    8.000 counts
>                   ·gc.time:                     9.000 ms
> 
> Without reversing the rebracket, I see negligible allocation:
> 
>    @Benchmark
>    public int vectorBitCount() {
>      int bitCount = 0;
>      for (int i = 0; i < data.length; i+= 4) {
>        bitCount += (int)((ByteVector)YMM_LONG.fromArray(data, i).rebracket(YMM_BYTE)).addAll();
>      }
>      return bitCount;
>    }
> 
> Iteration   1: 1.456 ops/us
>                   ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
>                   ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
>                   ·gc.count:           ≈ 0 counts
> 
> Thanks,
> Richard
> 
> 
>