RFR: 8310843: Reimplement ByteArray and ByteArrayLittleEndian with Unsafe [v10]

Fri Jul 21 15:16:46 UTC 2023

On Fri, 21 Jul 2023 09:43:43 GMT, Uwe Schindler <uschindler at openjdk.org> wrote:

> 
> So have you thought of making this low-level classes public so we outside users no longer need to deal with VarHandles?
> 
I believe this is beyond the scope of this PR.

As for what we do in the JDK, I can see a few options:

1. We keep things as they are in current mainline.
2. We keep changes in this PR.
3. We rewrite most uses of ByteArray in java.io to use ByteBuffer and remove ByteArray.
4. We remove ByteArray and provide some static helper function to generate an unsafe offset from an array.

I agree with @uschindler that wrapping stuff in ByteBuffer "on the fly" might be problematic for code that is not inlined, so I don't think we should do that.
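For readers following along, the two public-API idioms being compared look roughly like the sketch below. The class and method names are hypothetical and chosen for illustration only; this is not code from the PR, and it says nothing about how either variant performs.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not code from the PR.
final class ByteAccessSketch {
    // A static VarHandle that views a byte[] as big-endian ints.
    private static final VarHandle INT_BE =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);

    // Idiom 1: static VarHandle, no per-call wrapper object.
    static int getIntVarHandle(byte[] array, int offset) {
        return (int) INT_BE.get(array, offset);
    }

    // Idiom 2: wrap the array "on the fly" for a single access.
    // A freshly wrapped ByteBuffer is big-endian by default.
    static int getIntWrapped(byte[] array, int offset) {
        return ByteBuffer.wrap(array).getInt(offset);
    }
}
```

Both methods return the same value for the same input. The concern raised above is that the second form allocates a wrapper on every call, which is only free if the JIT inlines the call and eliminates the allocation.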

I have to admit that I'm a little unclear as to what the goal of this PR is. Initially, it started as an "improve startup" effort, which then morphed into a "let's make ByteArray more usable" effort, even for other clients (like the classfile API or Long::toString). I'm unsure about the latter use cases, because (a) Long/Integer are core classes and should probably use Unsafe directly where needed, and (b) for the classfile API, ByteBuffer seems a good candidate on paper (of course there is the unknown of how well byte buffer access will optimize in the classfile API code, but if there's more than one access on the same buffer, we should be more than ok).

I'd like to add some more words of caution against the synthetic benchmarks that we tried above. These benchmarks are quite peculiar, for at least two reasons:

* we only ever access one element
* the accessed offset is always zero

No general API can equal Unsafe under this set of conditions. When playing with the benchmark I realized that every little thing mattered (we're really measuring the number of instructions emitted by C2). For instance, when access occurs through a byte buffer, the underlying array and limit have to be fetched from their fields, which has a cost. The fact that ByteBuffer has a hierarchy has an even bigger cost (as C2 has to make sure you are really invoking HeapByteBuffer). The mutable endianness state in byte buffer also adds to the noise. All of the above is what ends up in a big fat "2x slower" label.

That said, all these "factors" are only relevant because we're looking at a _single_ buffer operation. In fact, all such costs can easily be amortized as soon as there is more than one access, or as soon as you start accessing offsets that are not known statically (unlike in the benchmark).
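To make the amortization point concrete, here is a minimal sketch of the kind of code where a single wrap is followed by several accesses at data-dependent offsets. The record layout (a 4-byte big-endian length followed by that many payload bytes) and all names are assumptions made up for this example; they do not come from the PR or from the benchmark.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; the record layout is an assumption.
final class AmortizedReadSketch {
    // Walks records laid out as [int length][length payload bytes]...
    // and returns the sum of all payload lengths.
    static int sumRecordLengths(byte[] data) {
        ByteBuffer bb = ByteBuffer.wrap(data); // wrapped once, used for every read
        int offset = 0;
        int total = 0;
        while (offset + Integer.BYTES <= bb.limit()) {
            int len = bb.getInt(offset); // offset depends on earlier data
            if (len < 0) {
                throw new IllegalArgumentException("negative record length at " + offset);
            }
            total += len;
            offset += Integer.BYTES + len; // skip header and payload
        }
        return total;
    }
}
```

Here the cost of creating the buffer and loading its fields is paid once per call rather than once per access, and no offset is a compile-time constant, which is the opposite of the benchmark's setup.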

So, there's the question of which code idiom leads to the absolute fastest code (and I agree that Unsafe + static wrappers seems the best here). And then there's the question of "but, what do we need to get the performance numbers/startup behavior we want". I feel the important question is the second, but we keep arguing about the first.

And, to assess that second question, we need to understand better what the goals are (which, so far, seem a bit fuzzy).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14636#discussion_r1270795253