[foreign-memaccess] layout constants and endianness - take two
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Jul 18 12:10:28 UTC 2019
Quoting from an email sent by Brian a few days ago on this list:
> Let's just bear in mind that we can think of at least three categories
> of users, each of which will have different ideas about endianness:
>
> - Traditional FFI users. These folks are calling native methods, and
> dealing with native layouts. Some of the time, these folks don't care
> about endianness, but sometimes they do -- such as if they are laying
> out a network packet. They want the ability to control endianness,
> but probably want simple defaults.
>
> - "Pure" Java off-heap users. These folks are using off-heap memory
> strictly to sidestep the GC and runtime; the data is never leaving the
> machine. They aggressively don't care about endianness, and will hate
> you if you make them think about it.
>
> - Java interop users. These folks are doing things like protobuf,
> which by definition is leaving the machine. These guys need explicit
> control over endianness all the time.
>
> There might be others too! I guess my plea here is: don't sacrifice
> the "Pure" java users for the sake of the others. There needs to be an
> easy way to say "Java primitive int please, with all that entails",
> and not get wrapped up in bit sizes, order, etc.
This is a good characterization of who the users are. Now, when we move
on to consider which set of layout constants we should provide, I think
there are three kinds of them (one per user category):
1) explicit sized constants - such as INT8, INT16, INT32, FLOAT32,
FLOAT64... - these are useful for message protocols
2) Java-like constants such as JAVA_INT, JAVA_FLOAT, JAVA_CHAR... -
these are useful for pure off-heap
3) ABI-dependent constants, such as C_INT, C_FLOAT ... - these are
useful for FFI users
I think it's fair to say that, no matter what we do, a hypothetical
class listing all relevant layout constants should list all the possible
combinations; e.g. for a Java int, there should be at least two
constants, JAVA_INT_BE and JAVA_INT_LE, in case users want to reach
for them.
The big question is, what does an endianness-less constant mean - e.g.
something like a plain JAVA_INT? Here are some options:
A) Nothing - endianness is always explicit, we simply do NOT provide any
constant that has no endianness suffix in it.
A2) Like A, but provide the constants in 'bundles', so that you can
import them separately. E.g. instead of having JAVA_INT_BE, let's have
BE.JAVA_INT instead (and the user can static import BE.*)
B) A default endianness value takes precedence over others - e.g. big
endian, as in ByteBuffer API - that is, JAVA_INT == JAVA_INT_BE
C) We allow layouts to be created w/o endianness - meaning that, upon
use, clients will have to force desired endianness - e.g.
JAVA_INT.order(ByteOrder.BIG_ENDIAN)
D) We use machine endianness for all constants that do not have explicit
endianness suffix
The simplest option is, of course (A). This option side-steps the
'default' problem entirely. It's actually even not that bad in the sense
that, if a user really doesn't care about endianness because he's
serializing and deserializing from same machine, in a way, whichever
endianness he picks he'll be fine. But for constants like the ones in
(3) which are used with FFI, such an approach would be too taxing - if I
want to call a native function, of course I want the same endianness as
the machine I'm running on?
A2 improves on that, by offering a one-shot move to import all the
constants you want at the top of the file. FFI users will pick the one
that corresponds to the platform they want to work with; pure off-heap
users don't care, so they will just 'pick one', but they'll do so only
once. For message protocol users it's a bit more complex, in that they
can't just static import both sets, as there will be conflicts - so they
will have to explicitly use qualified names such as BE.JAVA_INT and
LE.JAVA_INT. Not much worse than JAVA_INT_BE and JAVA_INT_LE, anyway.
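To make the A2 'bundle' idea a bit more concrete, here is a minimal
sketch. It is not the actual API: Layout and BundleSketch are
hypothetical stand-ins, and only the BE/LE bundle names and JAVA_INT
come from the proposal above. The point is simply that both bundles
expose the same constant names, so a client picks one via a static
import, or uses qualified names when it needs both.

```java
import java.nio.ByteOrder;

// Hypothetical sketch of option A2; none of these types exist in the real API.
final class BundleSketch {
    // Stand-in for a value layout: a bit size plus a byte order.
    static final class Layout {
        final int bitSize;
        final ByteOrder order;

        Layout(int bitSize, ByteOrder order) {
            this.bitSize = bitSize;
            this.order = order;
        }
    }

    // Big-endian bundle; a client would static-import its members.
    static final class BE {
        private BE() {}
        static final Layout JAVA_INT = new Layout(32, ByteOrder.BIG_ENDIAN);
        static final Layout JAVA_LONG = new Layout(64, ByteOrder.BIG_ENDIAN);
    }

    // Little-endian bundle exposing the same constant names.
    static final class LE {
        private LE() {}
        static final Layout JAVA_INT = new Layout(32, ByteOrder.LITTLE_ENDIAN);
        static final Layout JAVA_LONG = new Layout(64, ByteOrder.LITTLE_ENDIAN);
    }

    public static void main(String[] args) {
        // A message protocol client needing both orders uses qualified names.
        System.out.println(BE.JAVA_INT.order);
        System.out.println(LE.JAVA_INT.order);
    }
}
```

Importing both bundles with a wildcard static import would make the
simple name JAVA_INT ambiguous at the use site, which is the conflict
mentioned above.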
(B) kind of builds on the idea that, if you remain on the same machine,
you don't really care about endianness, so whatever default is picked is
gonna work fine. Plus, if the default is the same as the one used by
ByteBuffer, there's less impedance mismatch when going from
MemorySegments to ByteBuffers. But this still does nothing for our FFI
users who have to be endianness explicit nearly all the time (since the
vast majority of platforms are little endian).
(C) was initially co-proposed by me/John [1] in a discussion revolving
around the foreign branch. The idea is that we have a third endianness
state - NO_ENDIAN and, if the user tries to use a NO_ENDIAN layout e.g.
to produce a memory access VarHandle, he will be met with an error
because there's missing endianness info. I'd say a solution like this,
while elegant, looks like overkill for pure off-heap cases (as we
already stated, endianness is NOT relevant there). And it's also not
great for FFI users who will have to go through a lot of goop to
instantiate the layout with the correct endianness. So, all things
considered, while more elegant, this has the same usability issues that
(A) has.
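For reference, a rough sketch of how (C) could behave. This is an
assumption-laden illustration, not the design discussed in [1]:
IntLayout and NoEndianSketch are hypothetical, the NO_ENDIAN state is
modelled as an empty Optional, and the accessor is built with the
existing MethodHandles.byteBufferViewVarHandle rather than a memory
access VarHandle.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Optional;

// Hypothetical sketch of option C; not the actual layout API.
final class NoEndianSketch {
    // 32-bit int layout whose byte order may be absent (the NO_ENDIAN state).
    static final class IntLayout {
        private final Optional<ByteOrder> order;

        private IntLayout(Optional<ByteOrder> order) {
            this.order = order;
        }

        // Returns a copy of this layout with the byte order pinned.
        IntLayout order(ByteOrder o) {
            return new IntLayout(Optional.of(o));
        }

        // Deriving an accessor fails unless an order has been forced first.
        VarHandle varHandle() {
            ByteOrder o = order.orElseThrow(
                () -> new IllegalStateException("layout has no endianness"));
            return MethodHandles.byteBufferViewVarHandle(int[].class, o);
        }
    }

    // The endianness-less constant: usable only after calling order(...).
    static final IntLayout JAVA_INT = new IntLayout(Optional.empty());

    // Writes and reads back a value through the layout's accessor.
    static int roundTrip(IntLayout layout, int value) {
        ByteBuffer bb = ByteBuffer.allocate(4);
        VarHandle vh = layout.varHandle();
        vh.set(bb, 0, value);
        return (int) vh.get(bb, 0);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(JAVA_INT.order(ByteOrder.BIG_ENDIAN), 42));
        try {
            JAVA_INT.varHandle();
        } catch (IllegalStateException e) {
            System.out.println("error: " + e.getMessage());
        }
    }
}
```

The extra order(...) call at every use site is the 'goop' referred to
above.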
(D) is based on the idea that, if you don't care about endianness, well,
you don't - so whatever we pick will be fine (including native
endianness). So constants/users in bucket (2) will be totally fine with
this. On the other hand, message protocol heavy clients will need to be
explicit anyway, and will probably prefer the endianness explicit
constants to the implicit ones (e.g. they will use JAVA_INT_BE, not
JAVA_INT, to adhere to the protocol more explicitly). And, 99.99% of
the time, when doing FFI, you really just do want native endianness.
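A small sketch of what (D) amounts to, again with hypothetical names
(IntLayout and NativeDefaultSketch are stand-ins, not the real API):
the suffix-less constant is just an alias for whichever explicit
constant matches the machine the code is running on.

```java
import java.nio.ByteOrder;

// Hypothetical sketch of option D; not the actual layout API.
final class NativeDefaultSketch {
    // Stand-in for a 32-bit int layout carrying only its byte order.
    static final class IntLayout {
        final ByteOrder order;

        IntLayout(ByteOrder order) {
            this.order = order;
        }
    }

    // Explicit constants remain available for protocol-style clients.
    static final IntLayout JAVA_INT_BE = new IntLayout(ByteOrder.BIG_ENDIAN);
    static final IntLayout JAVA_INT_LE = new IntLayout(ByteOrder.LITTLE_ENDIAN);

    // Option D: the plain constant is resolved when the class is initialized,
    // on the running machine, so its value is not known at compile time.
    static final IntLayout JAVA_INT =
        ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN ? JAVA_INT_BE : JAVA_INT_LE;

    public static void main(String[] args) {
        System.out.println("JAVA_INT order here: " + JAVA_INT.order);
    }
}
```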
So, looking at the scoreboard, it seems that D and A2 are the only
solutions that have some chance to cater to all the various use cases.
When it comes to D, precise, endianness-ful constants are still there
for people who want/need to reach for them, but handy defaults are also
provided. On the other hand, D is itself not perfect, and it has some
pain points:
* it will bite when interfacing with ByteBuffers, which are BE by
default (yes, I've been a victim of this when writing a test for the
memory access API)
* the same source code won't mean the same thing on all platforms; some
differences can be poked at if the code 'reflectively' looks at the
endianness property of a layout
* if we have native order-dependent constants, Constable/folding support
kind of goes out of the window, or is made more complex by the fact that
endianness at compile-time might be different from that at run-time
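The ByteBuffer pain point can be reproduced with plain java.nio, no
memory access API needed. The sketch below (the class and method names
are made up for illustration) writes an int in native order and reads
the same bytes back through a buffer that keeps ByteBuffer's initial
order, which is always BIG_ENDIAN.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: shows the native-order vs. ByteBuffer-default mismatch.
final class NativeOrderPitfall {
    // Writes value in the platform's native order, then reads the same bytes
    // through a fresh ByteBuffer that keeps its initial (big-endian) order.
    static int writeNativeReadDefault(int value) {
        byte[] bytes = new byte[4];
        ByteBuffer.wrap(bytes).order(ByteOrder.nativeOrder()).putInt(0, value);
        return ByteBuffer.wrap(bytes).getInt(0);
    }

    public static void main(String[] args) {
        int written = 0x01020304;
        int read = writeNativeReadDefault(written);
        System.out.println("native order: " + ByteOrder.nativeOrder());
        System.out.println("written 0x" + Integer.toHexString(written)
            + ", read 0x" + Integer.toHexString(read));
    }
}
```

On a little-endian machine the value read back is the byte-swapped
one (Integer.reverseBytes of what was written); on a big-endian
machine the two agree, which is exactly the 'same source, different
meaning' problem in the second bullet.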
A2 is of course free from all these issues - since it basically
side-steps the question of setting a default, but in a clever way which
can be worked around by using the right set of static imports. Of course
this would mean that endianness-agnostic users will still have to
make an endianness-dependent choice in their imports - but given this is
a one-off, maybe that's not so bad.
Of course, we don't have to use a single solution for all the constants
in buckets 1-3, so we could do something like this:
* use D (or, better, A2) for FFI-related constants - this will give FFI
users the set of constants they want - either implicitly or via explicit
import (a la A2)
* use B for constants in (2) - pure off-heap users don't care much about
endianness, and, in this case, compatibility with ByteBuffer is more
important (since most of the use cases in this space use BB at the moment)
* use A for constants in (1) - after all, message protocol users care
about endianness anyway
So I guess there are some questions here:
* how worried are we about the problems with D listed above?
* how odd would it be to apply different endianness decisions to
constants in different buckets?
* is A2 'good enough' - if we only did that, will people be happy with
adding an import at the top of the file to choose the polarity they want?
Comments welcome
Maurizio
[1] -
https://mail.openjdk.java.net/pipermail/panama-dev/2019-February/004147.html