Allocation caused by vector rebracketing twice

Mon Jan 21 19:29:16 UTC 2019

Hi Richard,

Thanks a lot for your feedback. I analyzed this further, and it looks like the high allocation rate occurs because the Long addAll reduction intrinsic is not in place for AVX < 3, so the vector gets boxed in order to call the Java implementation. I will send out a patch shortly which should fix this.

Best Regards,
Sandhya

-----Original Message-----
From: panama-dev [mailto:panama-dev-bounces at openjdk.java.net] On Behalf Of Richard Startin
Sent: Saturday, January 19, 2019 11:29 AM
To: panama-dev at openjdk.java.net
Subject: Allocation caused by vector rebracketing twice

It's great to see that there is now a shiftR method because this was the missing link to make vector bit counts possible. With great excitement, I tried this at 54348:a8516a4be714 but it doesn't work very well yet.

  @Benchmark
  public int vectorBitCount() {
    int bitCount = 0;
    var lookupPos = YMM_BYTE.fromArray(LOOKUP_POS, 0);
    var lookupNeg = YMM_BYTE.fromArray(LOOKUP_NEG, 0);
    var lowMask = YMM_BYTE.broadcast((byte)0x0F);
    for (int i = 0; i < data.length; i+= 4) {
      var bytes = (ByteVector)YMM_LONG.fromArray(data, i).rebracket(YMM_BYTE);
      bitCount += (int)((LongVector)lookupPos.rearrange(bytes.and(lowMask).toShuffle())
              .add(lookupNeg.rearrange(bytes.shiftR(4).and(lowMask).toShuffle()))
              .rebracket(YMM_LONG)).addAll();
    }
    return bitCount;
  }
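
For readers unfamiliar with the technique, the benchmark above looks like the usual nibble-lookup bit count: each byte is split into its low and high 4-bit halves, each half indexes a 16-entry table of bit counts, and the two results are added. In the vector version, rearrange with a shuffle appears to play the role of the table lookup. Below is a scalar sketch of the same idea, checked against Long.bitCount. The class name and the NIBBLE_COUNTS table are hypothetical; the contents of LOOKUP_POS and LOOKUP_NEG are not shown in this message, so this is not a transcription of them.

```java
final class NibbleBitCount {
    // Hypothetical table: NIBBLE_COUNTS[n] is the number of set bits in the 4-bit value n.
    private static final byte[] NIBBLE_COUNTS = {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
    };

    // Counts the set bits of one long, one byte at a time, using two table lookups per byte.
    static int bitCount(long word) {
        int count = 0;
        for (int shift = 0; shift < Long.SIZE; shift += Byte.SIZE) {
            int b = (int) (word >>> shift) & 0xFF;   // isolate one byte as 0..255
            count += NIBBLE_COUNTS[b & 0x0F]         // low nibble
                   + NIBBLE_COUNTS[b >>> 4];         // high nibble
        }
        return count;
    }

    // Sums the bit counts over an array, mirroring what the benchmark accumulates.
    static int bitCount(long[] data) {
        int total = 0;
        for (long word : data) {
            total += bitCount(word);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] data = {0L, -1L, 0x0F0F0F0F0F0F0F0FL, 1L << 63};
        int expected = 0;
        for (long word : data) {
            expected += Long.bitCount(word);
        }
        // Compare the table-driven count with the JDK's own implementation.
        System.out.println(bitCount(data) == expected);
    }
}
```

A scalar reference like this is mainly useful for validating the vectorized result; it says nothing about the allocation behaviour discussed here.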

JMH -prof gc shows high allocation rates:

Iteration   1: 0.008 ops/us
                 ·gc.alloc.rate:               1112.997 MB/sec
                 ·gc.alloc.rate.norm:          219264.051 B/op
                 ·gc.churn.G1_Eden_Space:      1042.296 MB/sec
                 ·gc.churn.G1_Eden_Space.norm: 205335.753 B/op
                 ·gc.churn.G1_Old_Gen:         0.001 MB/sec
                 ·gc.churn.G1_Old_Gen.norm:    0.258 B/op
                 ·gc.count:                    6.000 counts
                 ·gc.time:                     6.000 ms

I stripped the code down until the allocation became negligible. The smallest reproducer I found is one where the vector is rebracketed and then rebracketed back to its original type:

  @Benchmark
  public int vectorBitCount() {
    int bitCount = 0;
    for (int i = 0; i < data.length; i+= 4) {
      bitCount += (int)((LongVector)YMM_LONG.fromArray(data, i).rebracket(YMM_BYTE).rebracket(YMM_LONG)).addAll();
    }
    return bitCount;
  }

-prof gc:

Iteration   1: 0.310 ops/us
                 ·gc.alloc.rate:               3223.539 MB/sec
                 ·gc.alloc.rate.norm:          16384.001 B/op
                 ·gc.churn.G1_Eden_Space:      3084.887 MB/sec
                 ·gc.churn.G1_Eden_Space.norm: 15679.287 B/op
                 ·gc.churn.G1_Old_Gen:         0.005 MB/sec
                 ·gc.churn.G1_Old_Gen.norm:    0.026 B/op
                 ·gc.count:                    8.000 counts
                 ·gc.time:                     9.000 ms

Without reversing the rebracket, I see negligible allocation:

  @Benchmark
  public int vectorBitCount() {
    int bitCount = 0;
    for (int i = 0; i < data.length; i+= 4) {
      bitCount += (int)((ByteVector)YMM_LONG.fromArray(data, i).rebracket(YMM_BYTE)).addAll();
    }
    return bitCount;
  }

Iteration   1: 1.456 ops/us
                 ·gc.alloc.rate:      ≈ 10⁻⁴ MB/sec
                 ·gc.alloc.rate.norm: ≈ 10⁻⁴ B/op
                 ·gc.count:           ≈ 0 counts

Thanks,
Richard