Allocation caused by vector rebracketing twice

Richard Startin richard at openkappa.co.uk
Sat Jan 19 19:28:40 UTC 2019


It's great to see that there is now a shiftR method, because this was the missing link needed to make vector bit counts possible. With great excitement, I tried this at 54348:a8516a4be714, but it doesn't work very well yet.


  @Benchmark
  public int vectorBitCount() {
    int bitCount = 0;
    var lookupPos = YMM_BYTE.fromArray(LOOKUP_POS, 0);
    var lookupNeg = YMM_BYTE.fromArray(LOOKUP_NEG, 0);
    var lowMask = YMM_BYTE.broadcast((byte)0x0F);
    for (int i = 0; i < data.length; i+= 4) {
      var bytes = (ByteVector)YMM_LONG.fromArray(data, i).rebracket(YMM_BYTE);
      bitCount += (int)((LongVector)lookupPos.rearrange(bytes.and(lowMask).toShuffle())
              .add(lookupNeg.rearrange(bytes.shiftR(4).and(lowMask).toShuffle()))
              .rebracket(YMM_LONG)).addAll();
    }
    return bitCount;
  }
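
For reference, here is a scalar sketch of the nibble-lookup bit count that the vector code above is reaching for. It is only an illustration: the contents of LOOKUP_POS and LOOKUP_NEG are not shown in this message, so the single table NIBBLE_BITS below is an assumption, as are the class and method names.

```java
final class NibbleBitCount {
    // Assumed table (not taken from the benchmark): NIBBLE_BITS[n] is the
    // number of set bits in the 4-bit value n.
    static final byte[] NIBBLE_BITS = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

    // Counts set bits by splitting every byte into its low and high nibble
    // and adding the two table entries.
    static int bitCount(long[] data) {
        int total = 0;
        for (long word : data) {
            for (int shift = 0; shift < 64; shift += 8) {
                int b = (int) (word >>> shift) & 0xFF;  // one byte of the word
                total += NIBBLE_BITS[b & 0x0F]          // low nibble
                       + NIBBLE_BITS[b >>> 4];          // high nibble
            }
        }
        return total;
    }
}
```

The result should agree with summing Long.bitCount over the same array, which makes it a convenient check for any vectorised variant.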


JMH -prof gc shows high allocation rates:


Iteration   1: 0.008 ops/us
                 ·gc.alloc.rate:               1112.997 MB/sec
                 ·gc.alloc.rate.norm:          219264.051 B/op
                 ·gc.churn.G1_Eden_Space:      1042.296 MB/sec
                 ·gc.churn.G1_Eden_Space.norm: 205335.753 B/op
                 ·gc.churn.G1_Old_Gen:         0.001 MB/sec
                 ·gc.churn.G1_Old_Gen.norm:    0.258 B/op
                 ·gc.count:                    6.000 counts
                 ·gc.time:                     6.000 ms

I stripped the code down until the allocation rate became negligible. The smallest reproducer that still allocates is the one where the vector is rebracketed and then rebracketed back to its original type:

  @Benchmark
  public int vectorBitCount() {
    int bitCount = 0;
    for (int i = 0; i < data.length; i+= 4) {
      bitCount += (int)((LongVector)YMM_LONG.fromArray(data, i).rebracket(YMM_BYTE).rebracket(YMM_LONG)).addAll();
    }
    return bitCount;
  }

-prof gc:

Iteration   1: 0.310 ops/us
                 ·gc.alloc.rate:               3223.539 MB/sec
                 ·gc.alloc.rate.norm:          16384.001 B/op
                 ·gc.churn.G1_Eden_Space:      3084.887 MB/sec
                 ·gc.churn.G1_Eden_Space.norm: 15679.287 B/op
                 ·gc.churn.G1_Old_Gen:         0.005 MB/sec
                 ·gc.churn.G1_Old_Gen.norm:    0.026 B/op
                 ·gc.count:                    8.000 counts
                 ·gc.time:                     9.000 ms
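
To spell out what the round trip is expected to compute, here is a scalar model, assuming rebracket is a pure bitwise reinterpretation of the 256-bit vector. The class and helper names are hypothetical, and the little-endian byte order is an assumption. The model says nothing about how the JIT treats the real vectors, and it allocates freely itself; it only shows that the two rebrackets should cancel out semantically.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ReinterpretRoundTrip {
    // Models YMM_LONG -> YMM_BYTE: view 4 longs as 32 bytes (little-endian assumed).
    static byte[] longsToBytes(long[] src, int offset) {
        ByteBuffer buf = ByteBuffer.allocate(32).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 4; i++) {
            buf.putLong(src[offset + i]);
        }
        return buf.array();
    }

    // Models YMM_BYTE -> YMM_LONG: view the same 32 bytes as 4 longs again.
    static long[] bytesToLongs(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        long[] out = new long[4];
        for (int i = 0; i < 4; i++) {
            out[i] = buf.getLong();
        }
        return out;
    }

    // Scalar counterpart of the reproducer loop: reinterpret, reinterpret back,
    // then add up all lanes. Trailing elements beyond a multiple of 4 are skipped.
    static long sumAfterRoundTrip(long[] data) {
        long sum = 0;
        for (int i = 0; i + 4 <= data.length; i += 4) {
            for (long lane : bytesToLongs(longsToBytes(data, i))) {
                sum += lane;
            }
        }
        return sum;
    }
}
```

Under that assumption the round trip is the identity on the lanes, so the sum equals the plain sum of the input longs.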

Without reversing the rebracket, I see negligible allocation:

  @Benchmark
  public int vectorBitCount() {
    int bitCount = 0;
    for (int i = 0; i < data.length; i+= 4) {
      bitCount += (int)((ByteVector)YMM_LONG.fromArray(data, i).rebracket(YMM_BYTE)).addAll();
    }
    return bitCount;
  }

Iteration   1: 1.456 ops/us
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Thanks,
Richard





More information about the panama-dev mailing list