RFR: 8310843: Reimplement ByteArray and ByteArrayLittleEndian with Unsafe [v10]
Uwe Schindler
uschindler at openjdk.org
Fri Jul 21 09:46:43 UTC 2023
On Thu, 20 Jul 2023 21:43:49 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:
>>> By the way, I ran `LoopOverNonConstantHeap` on the 3700x platform, and the performance of ByteBuffer was also poor:
>>
>> I finally see it.
>>
>>
>> Benchmark                           (polluteProfile)  Mode  Cnt  Score   Error  Units
>> LoopOverNonConstantHeap.BB_get      false             avgt   30  1.801 ± 0.020  ns/op
>> LoopOverNonConstantHeap.unsafe_get  false             avgt   30  0.567 ± 0.007  ns/op
>>
>>
>> It seems that, between updating JMH and rebuilding the JDK from scratch, *something* did the trick.
>>
>> I already knew that random access on a BB is slower than Unsafe (as there's an extra check), whereas looped access is just as fast (as C2 is good at hoisting the checks outside the loop, as shown in the benchmark). Note also that we are in the nanosecond realm, so each instruction here counts.
>>
>> Is there any benchmark for DataInput/Output stream that can be used? I mean, it would be interesting to understand how these numbers translate when running the stuff that is built on top.
>
>> Is there any benchmark for DataInput/Output stream that can be used? I mean, it would be interesting to understand how these numbers translate when running the stuff that is built on top.
>
> I've tried to run the benchmark in test/micro/java/io/DataInputStream.java. This is the baseline:
>
>
> Benchmark                     Mode  Cnt  Score   Error  Units
> DataInputStreamTest.readChar  avgt   20  7.583 ± 0.026  us/op
> DataInputStreamTest.readInt   avgt   20  3.804 ± 0.045  us/op
>
>
> And this is with a patch similar to the one I shared above, to use ByteBuffer internally:
>
>
> Benchmark                     Mode  Cnt  Score   Error  Units
> DataInputStreamTest.readChar  avgt   20  7.594 ± 0.106  us/op
> DataInputStreamTest.readInt   avgt   20  3.795 ± 0.030  us/op
>
>
> There does not seem to be any extra overhead. That said, access occurs in a counted loop, and in these cases we know buffer/segment access is optimized quite well.
>
> I believe the question here is: do we have benchmarks that are representative of the kind of gain that would be introduced by micro-optimizing ByteArray? It can be quite tricky to estimate real benefits from synthetic benchmarks on the ByteArray class, especially when fetching a single element outside of a loop, as those are not representative of how clients will use this. I note that the original benchmark made by Per used a loop with two iterations to assess the cost of the ByteArray operations:
>
> http://minborgsjavapot.blogspot.com/2023/01/java-21-performance-improvements.html
>
> If I change the benchmark to do 2 iterations, I see this:
>
>
> Benchmark                      Mode  Cnt       Score       Error   Units
> ByteArray.readByte            thrpt    5  704199.172 ± 34101.508  ops/ms
> ByteArray.readByteFromBuffer  thrpt    5  474321.828 ±  6588.471  ops/ms
> ByteArray.readInt             thrpt    5  662411.181 ±  4470.951  ops/ms
> ByteArray.readIntFromBuffer   thrpt    5  496900.429 ±  3705.737  ops/ms
> ByteArray.readLong            thrpt    5  665138.063 ±  5944.814  ops/ms
> ByteArray.readLongFromBuffer  thrpt    5  517781.548 ± 27106.331  ops/ms
>
> The more the iterations, the less the cost (and you don't need many iterations to break even). This probably explains why the DataInputStream benchmark doesn't change - there are 1024 iterations in there.
>
> I guess all this is to say that excessively focusing on microbenchmarks of a simple class such as ByteArray in conditions that are likely unrealistic (e.g. a single access) is IMHO the wrong way to look at things...
Just some feedback about this discussion:
- I agree that the DataInput/OutputStreams should maybe use `ByteBuffer` directly, as they already do buffering, so the patch above looks fine. In my project Apache Lucene (which has many performance-critical methods like this), we have already implemented ByteBuffer-based access like this for all IO-stream-based classes (we call them `DataInput`/`DataOutput`). I don't know why you have seen differences when using a `ByteBuffer` held in a final field of the class. That pattern is common, used in most frameworks out there (Netty, ...), and is bullet proof (unless there's a bug in the optimizer, which has sometimes happened in the past).
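The pattern described above, a stream-like reader holding a long-lived `ByteBuffer` view over its internal byte array, can be sketched roughly like this (class and method names are mine for illustration, not from the actual patch):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of a buffered reader that keeps a final ByteBuffer view over its
// internal byte[] buffer, so multi-byte reads go through ByteBuffer instead
// of manual bit-shifting (big-endian, matching DataInput semantics).
final class BufferedDataReader {
    private final ByteBuffer view;
    private int pos;

    BufferedDataReader(byte[] buf) {
        this.view = ByteBuffer.wrap(buf).order(ByteOrder.BIG_ENDIAN);
    }

    int readInt() {
        int v = view.getInt(pos); // absolute get: no position churn on the buffer
        pos += Integer.BYTES;
        return v;
    }

    long readLong() {
        long v = view.getLong(pos);
        pos += Long.BYTES;
        return v;
    }
}
```

Because the `ByteBuffer` sits in a final field, C2 can usually treat its backing array and bounds as loop invariants, which is why the benchmark numbers above show no extra overhead.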
- We noticed that wrapping a byte array in a ByteBuffer on each access causes a lot of overhead and GC activity when used in hot loops. In addition, we have seen cases where it is no longer optimized (not sure why). @mcimadamore: Do you remember the similar discussions about `MemorySegment` slices and copying them around between heap and foreign memory? Maybe inside the JDK you can do better by using `@ForceInline`. Our code can't do this, so we try to avoid creating instances of classes in such low-level code.
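The allocation concern above can be made concrete with a small sketch (names are mine, not from the PR): the first method allocates a fresh `ByteBuffer` on every call, which becomes garbage immediately; the second creates the view once and reuses it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Contrast between wrap-per-call (allocates per read; relies on escape
// analysis to stay cheap, which does not always kick in) and a reused view.
final class WrapCost {
    // Anti-pattern in hot loops: a new ByteBuffer instance per access.
    static int readIntWrapping(byte[] b, int off) {
        return ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getInt(off);
    }

    // Better: build the view once per array and reuse it for all reads.
    private final ByteBuffer view;

    WrapCost(byte[] b) {
        this.view = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN);
    }

    int readInt(int off) {
        return view.getInt(off);
    }
}
```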
- The original VarHandle approach is now used in Lucene's code in all places (the idea to use VarHandles for this class was basically suggested by me a while back). We often have byte arrays and can't wrap them in a ByteBuffer on each call (because it's not always inlined). For code outside the JDK this looks like the best approach for fast access to short/int/long values at specific (not necessarily aligned) positions. We have seen LZ4 compression get much faster after changing the code from manually assembling longs/floats from bytes as in the reference code. With ByteBuffer it often got slower (depending on how it was called; I think because we can't use `@ForceInline` in code outside the JDK).
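The VarHandle pattern referred to above looks roughly like this: a static final `byteArrayViewVarHandle` gives fast, possibly unaligned primitive reads straight from a `byte[]` without wrapping anything per call (the class name here is mine for illustration).

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Static final VarHandles are constant-folded by the JIT, so these reads
// compile down to a bounds check plus a single (unaligned) memory access.
final class LittleEndianReads {
    private static final VarHandle INT =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
    private static final VarHandle LONG =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    static int readInt(byte[] b, int off) {
        return (int) INT.get(b, off); // off need not be 4-byte aligned
    }

    static long readLong(byte[] b, int off) {
        return (long) LONG.get(b, off); // off need not be 8-byte aligned
    }
}
```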
Generally: A class like this is very nice and also very much needed in code outside the JDK. A lot of code, such as encoding/decoding network bytes or compression algorithms, follows the pattern of reading primitive types from byte arrays. The wrapping overhead looks bad in code and also causes long startup times and sometimes even OOMs (if used from multiple threads hammering the byte array accessors). Also, you do not want to write an LZ4 decompressor that uses ByteBuffer as its only source of data... :-(
So have you thought of making these low-level classes public, so we outside users no longer need to deal with VarHandles?
Maybe `java.util.ByteArrays` with solely static methods. The internal implementation of such a useful basic utility class could definitely use `Unsafe` internally, so I would leave that discussion out here. If you use Unsafe there are no surprises! Personally I have no problem with the current implementation in this PR! I would just put the little- and big-endian implementations in the same class and move it to java.util (this is just my comment about this, coming from a library that does this low-level stuff all the time).
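A minimal sketch of what such a hypothetical `java.util.ByteArrays` utility could look like from the caller's side (this class does not exist in the JDK; both endiannesses live in one class, and it is implemented here with VarHandles since Unsafe is not accessible outside the JDK):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical all-static utility, as proposed above: one class, both byte
// orders, no wrapper objects at the call site.
final class ByteArrays {
    private static final VarHandle INT_LE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
    private static final VarHandle INT_BE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);

    private ByteArrays() {} // static methods only

    static int getIntLE(byte[] b, int off)  { return (int) INT_LE.get(b, off); }
    static int getIntBE(byte[] b, int off)  { return (int) INT_BE.get(b, off); }
    static void setIntLE(byte[] b, int off, int v) { INT_LE.set(b, off, v); }
    static void setIntBE(byte[] b, int off, int v) { INT_BE.set(b, off, v); }
}
```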
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/14636#discussion_r1270479763
More information about the core-libs-dev mailing list