RFR: 8310843: Reimplement ByteArray and ByteArrayLittleEndian with Unsafe [v10]
Uwe Schindler
uschindler at openjdk.org
Fri Jul 21 09:46:43 UTC 2023
On Thu, 20 Jul 2023 21:43:49 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:
>>> By the way, I ran `LoopOverNonConstantHeap` on the 3700x platform, and the performance of ByteBuffer was also poor:
>>
>> I finally see it.
>>
>>
>> Benchmark                           (polluteProfile)  Mode  Cnt  Score   Error  Units
>> LoopOverNonConstantHeap.BB_get      false             avgt   30  1.801 ± 0.020  ns/op
>> LoopOverNonConstantHeap.unsafe_get  false             avgt   30  0.567 ± 0.007  ns/op
>>
>>
>> It seems that, between updating JMH and rebuilding the JDK from scratch, *something* did the trick.
>>
>> I already knew that random access on a BB is slower than Unsafe (as there's an extra check), whereas looped access is just as fast (as C2 is good at hoisting the checks outside the loop, as shown in the benchmark). Note also that we are in the nanosecond realm, so each instruction here counts.
>>
>> Is there any benchmark for DataInput/Output stream that can be used? I mean, it would be interesting to understand how these numbers translate when running the stuff that is built on top.
>
>> Is there any benchmark for DataInput/Output stream that can be used? I mean, it would be interesting to understand how these numbers translate when running the stuff that is built on top.
>
> I've tried to run the benchmark in test/micro/java/io/DataInputStream.java. This is the baseline:
>
>
> Benchmark                     Mode  Cnt  Score   Error  Units
> DataInputStreamTest.readChar  avgt   20  7.583 ± 0.026  us/op
> DataInputStreamTest.readInt   avgt   20  3.804 ± 0.045  us/op
>
>
> And this is with a patch similar to the one I shared above, to use ByteBuffer internally:
>
>
> Benchmark                     Mode  Cnt  Score   Error  Units
> DataInputStreamTest.readChar  avgt   20  7.594 ± 0.106  us/op
> DataInputStreamTest.readInt   avgt   20  3.795 ± 0.030  us/op
>
>
> There does not seem to be any extra overhead. That said, access occurs in a counted loop, and in these cases we know buffer/segment access is optimized quite well.
>
> I believe the question here is: do we have benchmarks that are representative of the kind of gain that would be introduced by micro-optimizing ByteArray? It can be quite tricky to estimate real benefits from synthetic benchmarks on the ByteArray class, especially when fetching a single element outside of a loop, as those are not representative of how clients will use this. I note that the original benchmark made by Per used a loop with two iterations to assess the cost of the ByteArray operations:
>
> http://minborgsjavapot.blogspot.com/2023/01/java-21-performance-improvements.html
>
> If I change the benchmark to do 2 iterations, I see this:
>
>
> Benchmark                      Mode  Cnt       Score       Error   Units
> ByteArray.readByte            thrpt    5  704199.172 ± 34101.508  ops/ms
> ByteArray.readByteFromBuffer  thrpt    5  474321.828 ±  6588.471  ops/ms
> ByteArray.readInt             thrpt    5  662411.181 ±  4470.951  ops/ms
> ByteArray.readIntFromBuffer   thrpt    5  496900.429 ±  3705.737  ops/ms
> ByteArray.readLong            thrpt    5  665138.063 ±  5944.814  ops/ms
> ByteArray.readLongFromBuffer  thrpt    5  517781.548 ± 27106.331  ops/ms
>
> The more the iterations, the less the cost (and you don't need many iterations to break even). This probably explains why the DataInputStream benchmark doesn't change - there are 1024 iterations in there.
>
> I guess all this is to say that excessively focusing on microbenchmarks of a simple class such as ByteArray in conditions that are likely unrealistic (e.g. a single access) is IMHO the wrong way to look at things...
Just some feedback about this discussion:
- I agree that the DataInput/OutputStreams should maybe use `ByteBuffer` directly, as they already do buffering, so the patch above looks fine. In my project Apache Lucene (which has many performance-critical methods like this), we have already implemented ByteBuffer-based access like this for all IO-stream-based classes (we call them `DataInput`/`DataOutput`). I don't know why you have seen differences when using a `ByteBuffer` held in a final field of the class. That pattern is common, used in most frameworks out there (Netty, ...), and is bullet proof (unless there's a bug in the optimizer, which has sometimes happened in the past).
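The pattern described above, a stream-like reader holding a long-lived `ByteBuffer` view over its internal byte array, can be sketched roughly like this (class and method names are mine for illustration, not from the actual patch):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of a buffered reader that keeps a final ByteBuffer view over its
// internal byte[] buffer, so multi-byte reads go through ByteBuffer instead
// of manual bit-shifting (big-endian, matching DataInput semantics).
final class BufferedDataReader {
    private final ByteBuffer view;
    private int pos;

    BufferedDataReader(byte[] buf) {
        this.view = ByteBuffer.wrap(buf).order(ByteOrder.BIG_ENDIAN);
    }

    int readInt() {
        int v = view.getInt(pos); // absolute get: no position churn on the buffer
        pos += Integer.BYTES;
        return v;
    }

    long readLong() {
        long v = view.getLong(pos);
        pos += Long.BYTES;
        return v;
    }
}
```

Because the `ByteBuffer` sits in a final field, C2 can usually treat its backing array and bounds as loop invariants, which is why the benchmark numbers above show no extra overhead.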
- We noticed that wrapping a byte array in a ByteBuffer on each access causes a lot of overhead and GC activity when used in hot loops. In addition, we have seen cases where it is no longer optimized (not sure why). @mcimadamore: Do you remember the similar discussions about `MemorySegment` slices and copying them around between heap and foreign memory? Maybe inside the JDK you can do better by using `@ForceInline`. Our code can't do this, so we try to avoid creating instances of classes in such low-level code.
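The allocation concern above can be made concrete with a small sketch (names are mine, not from the PR): the first method allocates a fresh `ByteBuffer` on every call, which becomes garbage immediately; the second creates the view once and reuses it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Contrast between wrap-per-call (allocates per read; relies on escape
// analysis to stay cheap, which does not always kick in) and a reused view.
final class WrapCost {
    // Anti-pattern in hot loops: a new ByteBuffer instance per access.
    static int readIntWrapping(byte[] b, int off) {
        return ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getInt(off);
    }

    // Better: build the view once per array and reuse it for all reads.
    private final ByteBuffer view;

    WrapCost(byte[] b) {
        this.view = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN);
    }

    int readInt(int off) {
        return view.getInt(off);
    }
}
```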
- The original VarHandle approach is now used in Lucene's code in all places (the idea to use VarHandles for this class was basically suggested by me a while back). We often have byte arrays and can't wrap them in a ByteBuffer on each call (because it's not always inlined). For code outside the JDK this looks like the best approach for fast access to short/int/long values at specific (not necessarily aligned) positions. We have seen LZ4 compression get much faster after changing the code from manually assembling longs/floats from bytes as in the reference code. With ByteBuffer it often got slower (depending on how it was called; I think because we can't use `@ForceInline` in code outside the JDK).
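The VarHandle pattern referred to above looks roughly like this: a static final `byteArrayViewVarHandle` gives fast, possibly unaligned primitive reads straight from a `byte[]` without wrapping anything per call (the class name here is mine for illustration).

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Static final VarHandles are constant-folded by the JIT, so these reads
// compile down to a bounds check plus a single (unaligned) memory access.
final class LittleEndianReads {
    private static final VarHandle INT =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
    private static final VarHandle LONG =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    static int readInt(byte[] b, int off) {
        return (int) INT.get(b, off); // off need not be 4-byte aligned
    }

    static long readLong(byte[] b, int off) {
        return (long) LONG.get(b, off); // off need not be 8-byte aligned
    }
}
```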
Generally: A class like this is very nice and also very much needed in code outside the JDK. A lot of code, such as encoding/decoding network bytes or compression algorithms, follows the pattern of reading primitive types from byte arrays. The wrapping overhead looks bad in code and also causes long startup times and sometimes even OOMs (if used from multiple threads hammering the byte array accessors). Also, you do not want to write an LZ4 decompressor that uses ByteBuffer as its only source of data... :-(
So have you thought of making these low-level classes public, so we outside users no longer need to deal with VarHandles?
Maybe `java.util.ByteArrays` with solely static methods. The internal implementation of such a useful basic utility class could definitely use `Unsafe` internally, so I would leave that discussion out here. If you use Unsafe there are no surprises! Personally I have no problem with the current implementation in this PR! I would just put the little- and big-endian implementations in the same class and move it to java.util (this is just my comment about this, coming from a library that does this low-level stuff all the time).
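A minimal sketch of what such a hypothetical `java.util.ByteArrays` utility could look like from the caller's side (this class does not exist in the JDK; both endiannesses live in one class, and it is implemented here with VarHandles since Unsafe is not accessible outside the JDK):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical all-static utility, as proposed above: one class, both byte
// orders, no wrapper objects at the call site.
final class ByteArrays {
    private static final VarHandle INT_LE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
    private static final VarHandle INT_BE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);

    private ByteArrays() {} // static methods only

    static int getIntLE(byte[] b, int off)  { return (int) INT_LE.get(b, off); }
    static int getIntBE(byte[] b, int off)  { return (int) INT_BE.get(b, off); }
    static void setIntLE(byte[] b, int off, int v) { INT_LE.set(b, off, v); }
    static void setIntBE(byte[] b, int off, int v) { INT_BE.set(b, off, v); }
}
```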
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/14636#discussion_r1270479763
More information about the core-libs-dev mailing list