From atobia at ca.ibm.com  Thu Dec 10 16:43:50 2015
From: atobia at ca.ibm.com (Tobi Ajila)
Date: Thu, 10 Dec 2015 11:43:50 -0500
Subject: Panama Updates
Message-ID: <201512101644.tBAGi2W0019606@d03av04.boulder.ibm.com>

Based on discussions at JavaOne, it sounds like you (Oracle) have some
interesting results / updates on Panama. Can you share your progress on
this list?

In terms of what we've been doing, our initial Layout runtime prototype
is now available on github, along with some design docs. Have you had a
chance to take a look at this?

Link to prototype: https://github.com/J9Java/panama-layout-prototype
Link to docs: http://j9java.github.io/panama-docs

Regards,
--Tobi

From john.r.rose at oracle.com  Sat Dec 19 03:35:07 2015
From: john.r.rose at oracle.com (John Rose)
Date: Fri, 18 Dec 2015 19:35:07 -0800
Subject: Panama Updates
In-Reply-To: <201512101644.tBAGi2W0019606@d03av04.boulder.ibm.com>
References: <201512101644.tBAGi2W0019606@d03av04.boulder.ibm.com>
Message-ID: <853E90D2-D21C-4345-A488-C5F28C9C2A53@oracle.com>

Hi Tobi.

On Dec 10, 2015, at 8:43 AM, Tobi Ajila wrote:
>
> Based on discussions at JavaOne, it sounds like you (Oracle) have some
> interesting results / updates on Panama. Can you share your progress on
> this list?
>
> In terms of what we've been doing, our initial Layout runtime prototype
> is now available on github, along with some design docs. Have you had a
> chance to take a look at this?
>
> Link to prototype: https://github.com/J9Java/panama-layout-prototype
> Link to docs: http://j9java.github.io/panama-docs
>
> Regards,
> --Tobi

Thanks for the links to your works-in-progress. I've downloaded your
prototype code and verified that it passes tests.

For us, it is all implementation and not much spec. so far, which is one
reason why we've been quiet on this channel. (We wish we could share
impl. code with you, but we understand it's not practical at present.)
We have been experimenting furiously with the groveller (called
"jextract") and also with native function binding. Some of our work has
already been posted to the Panama project repository in OpenJDK. There
is more to come, probably early next month, as we polish work done to
get ready for JavaOne.

We are able to extract, bind, and run some of the canonical examples:
getpid, printf (varargs), qsort (callbacks). We can also run jextract
using its own output for extracting bindings for the libclang API.

The key spec. content related to jextract, of course, is defining the
annotations and utility types present in the extracted JAR, along with
rules for mapping native "source types" and "source names" to valid
Java "carrier types" and "carrier names", plus rules about the
structure of the interfaces and methods that embody the extracted API.
A year-old prospectus of this spec. is here:
  http://cr.openjdk.java.net/~jrose/panama/metadata.html
This needs to be updated, of course, to reflect what jextract currently
puts out, and where our current thinking is leading us.

I have posted a couple of diagrams sketching the tool chain we are
looking at:
  http://cr.openjdk.java.net/~jrose/panama/Metadata-Flow-User-View.pdf
  http://cr.openjdk.java.net/~jrose/panama/Metadata-Flow-Possibilities.pdf
The second one is much larger than the first, and is a range of
possibilities, rather than a road map.

In your prototype you have @LayoutDesc to annotate carrier types with
their carrier names and types (or those of their components). This is
very similar. For data structures, jextract stuffs data about the
original source types and source names into annotations on the
extracted "carrier APIs". For functions, jextract generates interface
methods whose descriptors have the corresponding carrier types, and
adds annotations if necessary. It's all dumped into the JAR produced by
the groveller. Henry Jen has been doing this work.
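To make that shape concrete, here is a minimal sketch of what such an
extracted "carrier API" might look like. The annotation and interface
names (@NativeType, UnistdLib) are invented for illustration; they are
not the actual jextract output format.

```java
// Illustrative only: annotation and interface names are invented here
// to show the shape of an extracted "carrier API", not real jextract
// output.
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

public class CarrierSketch {
    // Records the native "source type" behind a Java carrier type.
    @Retention(RetentionPolicy.RUNTIME)
    public @interface NativeType { String value(); }

    // What a groveller might emit for a header declaring getpid().
    public interface UnistdLib {
        @NativeType("pid_t")
        int getpid();
    }

    // A binder-side consumer reads the annotation back at runtime.
    public static String sourceTypeOf(String method) throws Exception {
        return UnistdLib.class.getMethod(method)
                              .getAnnotation(NativeType.class).value();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sourceTypeOf("getpid"));  // prints pid_t
    }
}
```

The point of the sketch is only that the source-to-carrier mapping
survives in the JAR as ordinary runtime-visible metadata, where a
binder can read it back reflectively.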
The Panama runtime loads these interfaces, reads the annotations, and
binds private implementations to them using Unsafe, etc., by spinning
bytecodes for hidden classes.

Vladimir Ivanov wrote JVM binding logic which uses a "linkToNative"
primitive that wraps a method handle around a named entry point and an
associated native type descriptor. A year-old prospectus of this binder
is here:
  http://cr.openjdk.java.net/~jrose/panama/native-call-primitive.html

Recently, Mikael Vidstedt has been experimenting with lightweight
mechanisms for specifying the bindings of function call arguments and
return values, and binding them quickly at runtime. He has a way of
flexibly mapping an ABI to a wide set of machine types, including
multi-lane vectors (on X86). This notation, which is a little bit like
a layout description, might possibly appear in groveller output,
although it is likely that it can remain internal, since the function
signatures are not ambiguous. Mikael is working on a presentation about
Panama status, and he allowed me to post a version of his work in
progress:
  http://cr.openjdk.java.net/~jrose/panama/panama-status-2015-1216.pdf

Vladimir is experimenting with handling types larger than 64 bits from
Java. The basic building block is a type called Long2 (or Long4 or
Long8) which carries a tuple of longs, without assigning them any
semantics. These types allow us to work with APIs that work on vector
types, including APIs like compiler intrinsic functions.

I had an intense lunchtime conversation with Dan Heidinga about the
LDL. I will post more thoughts about this separately.

Best wishes,
--
John

From john.r.rose at oracle.com  Sat Dec 19 05:56:08 2015
From: john.r.rose at oracle.com (John Rose)
Date: Fri, 18 Dec 2015 21:56:08 -0800
Subject: considering the LDL
In-Reply-To: <201512101644.tBAGi2W0019606@d03av04.boulder.ibm.com>
References: <201512101644.tBAGi2W0019606@d03av04.boulder.ibm.com>
Message-ID: <235A9F11-939E-422E-AFDE-3151E6A148B8@oracle.com>

On Dec 10, 2015, at 8:43 AM, Tobi Ajila wrote:
>
> Link to docs: http://j9java.github.io/panama-docs

My basic approach to LDL is to try to describe its design as simply and
abstractly as possible, with a view towards figuring out the minimum
number of interlocking concepts. One thing we can't do is try to
emulate DWARF or some other full-service language description scheme.
This is a weakness in the current jextract output, where we currently
output C types in ASCII form.

So, should language-specific names or types be in the LDL? Probably
not, but (as you found) perhaps we need some sort of naming if only to
provide attachment points to other metadata elsewhere. What exactly are
the names in an LD? Must they map to some source language, or are they
abstract link anchors to connect to other schemas? (If they are just
anchors, can we drop the names and use just small integers?) Do we need
to allow expression of "complex number" (I hope not!) or is it enough
to allow a struct of two scalars (possibly labeled)? Do we even need
the standard distinction between signed, unsigned, and float, and if
so, why? Should layout types be drawn from C's primitives, or the set
of operand types in some ISA, or something else? And what is the
highest-leverage place to include concepts like endian, volatile,
atomic, immutable, alignment, explicit offsets, bit-slices, and memory
blocking factors? Do we need separate concepts for on-heap and off-heap
or Java and native storage (I hope not)?
Can we design APIs for layouts that make a minimum possible commitment
to the design of the bookkeeping that happens around memory accesses?
By that I mean I like the clean minimal feel of what you've done, and
hope to (perhaps) shrink and sharpen it even more, by evaluating the
above concepts and attempting to make it even more minimal. Or if not,
to clearly identify the "extras" we want and be able to explain exactly
why.

For example, if a typical ISA makes an important distinction between
floating loads and integer loads, that might be enough reason to make
that distinction in layout types also. But the counter-argument is that
the LDL could ignore the distinction (working just with bit blocks) and
let the user decide what sorts of loads to issue, based on richer type
information outside the LDL. A similar argument can be made even for
endian-ness, although that seems even more extreme than ignoring
float-ness. In some sense, it's up to the end user to decide which kind
of endian load to issue, or, equivalently, whether to use a
native-endian load and issue a byte-swap operation in the CPU.

Here's a stab at starting from next to nothing: I think the absolute
bare-metal minimum for an LDL starts by postulating "memory",
"registers", and a "load/store unit" which moves bits between them.
The memory is an effectively endless sequence of bits, grouped into
bytes (octets), and (optionally) into larger chunks, as power-of-two
multiples of octets with the usual alignment rules. The registers are
an effectively unlimited set of fixed-sized variables, in a variety of
sizes. Sizes will often be power-of-two, but I think we want virtual
registers of any size, so we can talk about block copying without
explicit direct memory-to-memory operations. (Though I'd rather not,
perhaps registers also come in classes, of integer, float, vector,
maybe even pointer. But arguably worrying about register classes is a
job for the user, and the user's register allocator.)
The LSU takes commands to move data between the two places. It is more
complicated than it looks, since loads and stores can differ according
to size, alignment, endian-ness, atomicity, reorderability, and
compatibility with register sizes or classes. (Though I'd rather not,
perhaps we allow a register to be incompletely filled by a load, or
incompletely used by a store. Then there is the question of which
register bits are ignored, or filled by zero or extended sign bits.)
Every LSU command operates relative to a contiguous range of memory
bytes located by an "effective address".

All that seems both necessary and reasonably minimal to me. I am *not*
trying to be hyper-general just for the sake of it; I don't care today
about memories with 9-bit or 1-bit bytes. The exercise of generality is
a way of simplifying the design space, and factoring out separable
concerns.

Adding in the LDL, we have a set of descriptions which help us operate
the Mem/Reg/LSU machine according to meaningful "data access
disciplines" and "data layouts" (as your draft points out). OK, so any
particular formula of the LDL will describe a "layout description". I
will call this a "Layout" for short.

(Your prototype uses the term "Layout" for the binding of a layout
description to a particular effective address. I prefer to call that
binding a "Reference" or "Location". I don't have a better term. But
I'm pretty sure that "layout" must be a term for something abstract
that you work with before you load or store your first byte. My house
*has* a layout which it may share with other houses, and which existed
before the house was built; my house is not itself a layout.)

What does a Layout do? When combined with an EA (effective address) it
allows us to issue LSU commands to move stored data around in a safe
and sane manner. To do this, it must specify one or more
correspondences between bits in memory and bits in registers (before
either the memory or register is chosen).
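The Mem/Reg/LSU picture above fits in a few lines of Java, as a sketch:
memory is a byte array, a "register" is just a long, and each load
command carries effective address, size, byte order, and
sign/zero-extension attributes. All names here are invented for the
sketch.

```java
// Minimal model of the Mem/Reg/LSU machine: memory is a byte array, a
// "register" is a long, and the LSU load command is parameterized by
// effective address, size, byte order, and extension mode.
public class LsuSketch {
    static long load(byte[] mem, int ea, int size,
                     boolean bigEndian, boolean signExtend) {
        long r = 0;
        for (int i = 0; i < size; i++) {
            // Walk the bytes in the order dictated by the endian attribute.
            int b = mem[ea + (bigEndian ? i : size - 1 - i)] & 0xFF;
            r = (r << 8) | b;
        }
        if (signExtend && size < 8) {
            int shift = 64 - 8 * size;
            r = (r << shift) >> shift;  // replicate the sign bit upward
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] mem = { (byte) 0xFF, 0x01 };
        // Big-endian, zero-extended: 0xFF01 = 65281.
        System.out.println(load(mem, 0, 2, true, false));
        // Little-endian, sign-extended: 0x01FF = 511.
        System.out.println(load(mem, 0, 2, false, true));
    }
}
```

Note how the "which bits are ignored or extended" question from the
parenthetical above shows up directly as the signExtend parameter.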
Since we assumed that LSU commands operate on contiguous memory ranges,
it follows that a layout will determine one or more such ranges. The
embodiments of these ranges in a Layout are called "containers" in your
draft. That's a fine term. The point is that every container is
separately addressable via LSU commands. Odd-sized containers can be
loaded and stored into odd-sized (virtual) registers, as if via block
copy to unaliased memory. Some favored sizes and alignments get better
LSU functionality, such as atomicity.

(As a nice-to-have we could also consider breaking down data at the
bit-slice level. Since this is not a natural memory operation, it must
be layered on as an operation on registers. Because some native data
formats are defined at the bit level, perhaps we want to induce a
numbering of bits on Memory, and emulate a limited set of LSU commands
on those bit slices. I'm skeptical at present whether we need this.)

Given a Layout, a code generator could create a full set of tiny
subroutines (or IR snippets) which execute the full repertoire of LSU
commands for each element of the layout. The subroutines would be named
according to commands and labels of elements within the Layout.

So, let's look at the concepts I mentioned at the top of this message.

== endian-ness, aka byte order

I like the fact that your document makes byte order explicit, rather
than deferring to platform byte order. The LDL should make that
explicit, just like it makes sizes explicit. The LDL must not express
different layouts on machines with different byte order or word sizes.
We want to use the LDL for storage and transmission on heterogeneous
systems. (Preaching to the Java choir here.)

Let's assume that the LSU can load in either byte order (scalars, I
mean, not arbitrary blocks). We have to assume this because otherwise
we can't have a single universal LSU model. This means that each LSU
command (load or store) has a byte order attribute.
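The byte order attribute is easy to demonstrate with plain ByteBuffer:
the same four memory bytes reassemble into different scalars under the
two orders.

```java
// The same four bytes, read back under the two byte-order attributes.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianSketch {
    // A scalar load with an explicit byte-order attribute.
    static int read(byte[] mem, ByteOrder order) {
        return ByteBuffer.wrap(mem).order(order).getInt();
    }

    public static void main(String[] args) {
        byte[] mem = { 0x12, 0x34, 0x56, 0x78 };
        System.out.printf("%08x%n", read(mem, ByteOrder.BIG_ENDIAN));     // 12345678
        System.out.printf("%08x%n", read(mem, ByteOrder.LITTLE_ENDIAN));  // 78563412
    }
}
```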
What would happen if the LDL ignores byte order? Then the LDL cannot
express a contract between network endpoints about how to reassemble a
multi-byte scalar. That seems bad. So the LDL needs this attribute (as
your draft points out). Typical designs don't mix byte order, so it is
plausible to specify byte order only at the beginning of an LDL
formula, and then wherever it changes (if it does).

What about bit order? I hope we don't need to put this in the LDL, but
the above argument would seem to apply to bits also. Bit-streaming
formats tend to make arbitrary choices about bit order just like CPUs
are arbitrary about byte order. If we add the "nice-to-have" feature of
bit slices, I think we will have to add bit-order notation also.

== size

A layout has a size in bytes. If it consists of a single element
(traditional scalar value or other block of bytes), its size is the
size of that element. If it is composite, the size is derived from the
composite; see below.

== alignment

Many machines (but not x86) care about alignment when they load scalar
types. Typically, there are separate CPU instructions for aligned and
unaligned access. The LDL must give code generators enough information
to choose between efficiency and generality. Thus, an LDL formula needs
to express its expectations about alignment. (We assumed above that
memory does clump together above the byte level; this is where we use
the assumption.)

What's an expectation? It will be a power of two 2^G, of course. For
generality, the expectation could also include a modular offset H in
[0..2^G-1], for an exotic layout with a small hanging prefix. Given an
alignment (2^G,0) or (2^G,H), a code generator will be able to generate
both forms of access, and be able to test (if necessary) the alignment
of proposed EAs. Given several layouts and a composition rule (union,
struct, array, etc.) alignments can inform offset generation and an
overall alignment for a combined layout. This is what C structs do.
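The (2^G, H) expectation above reduces to a one-line mask test; here is
a sketch (the method name is invented):

```java
// An alignment expectation (2^G, H): the effective address must be
// congruent to H modulo 2^G. H = 0 is the common case.
public class AlignSketch {
    static boolean satisfies(long ea, int g, long h) {
        return (ea & ((1L << g) - 1)) == h;
    }

    public static void main(String[] args) {
        System.out.println(satisfies(16, 3, 0));  // 8-byte aligned: true
        System.out.println(satisfies(12, 3, 0));  // false: 12 mod 8 == 4
        System.out.println(satisfies(17, 3, 1));  // hanging prefix of 1: true
    }
}
```

This is the test a code generator would emit when it must choose at
runtime between the aligned and unaligned forms of access.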
That leads us to ...

== composition

Layouts are really boring until you start combining them. What is the
simplest possible combination rule? Concatenation, obviously, then
replicated concatenation, then overlaying at a common address. Thus we
get struct, union, array. A layout combination operator would take two
layouts and create a composite one, assigning new offsets for each
element in its new position in the composite. When combined with
alignment restrictions, we start to see a demand for padding. The rules
get suspiciously close to complex and arbitrary.

I'd like to propose a simpler, less structured composition operator, a
"nest" (taking a cue from "nested layout" in your draft). A "nest" is a
sequence of one or more layouts, each associated with a non-negative
offset. The offset of the first is zero. The first layout is probably
not composite; it acts as the parent of the whole nest. The other
layouts can be called children or nestmates. Crucially, the byte spans
of all children must be wholly contained in the span of the parent.

If all the offsets in a nest are zero, we have a union. If all the
child offsets are increasing from zero, and the children are
non-overlapping with no padding, we have a packed struct. The nest
becomes more interesting if you add more constraints on formation. For
example, the alignment of the parent must satisfy the alignment of all
the children, in the obvious way. Imposing this constraint will force
C-like structs to "grow" padding, if their children have alignment
constraints. Or it will force the children to lose their constraints.
There's no one right answer, so it has to be an option.

I like the fact that there are no auto-padding rules in your draft. But
in order to allow code generators to trust alignments, I think we need
rules for composition that preserve alignment information in the way
sketched above. Basically, if you are putting together child layouts
that have alignment, you need to get the offsets right.
If you make a mistake, the LDL constraints will tell you it is wrong,
but they won't fix it up for you behind your back. Besides alignment
checks, it is important that no child include bytes outside the span of
the parent. Beyond that, there's no logical reason to prevent children
from overlapping, or parents from containing unused padding.

== names

What gets named in an LDL formula? If the formula is simple (not
composite) there's no need for a name. In the case of a nest, a code
generator must be able to respond to requests to access the whole nest
(the parent), or any of the children. That's where names come in. But
small integers are just as good, if you don't mind keeping count. And
machines don't mind. What you end up with are little path expressions
like (1,2,1,1).

But I do think the LDL needs a pretty-printed version, to be used for
documentation or LDL processing tools, in which names can be inserted
(as if they were annotations) and later on translated to path
expressions. What I'm suggesting is that when jextract emits LDL
formulas into struct APIs, there is no need for it to mention names, as
long as there is a way to derive the path expression for each API
point. A tool which prints out a human-friendly version of jextract
output could re-insert the names, by drawing them from the other
classfile metadata.

(I'm always suspicious of designs which require the repetition of
information already derivable from some other source. What goes wrong,
eventually, is the repeated bits go out of sync, and then we get
surprises.)

== types

Are registers just bundles of bits and bytes, or do they have kinds? If
they don't have kinds, then the LDL can remain silent about kinds also,
since memory blocks don't have kinds either. I'd prefer it this way;
leave it to the register allocator and code generator to pick register
kinds. As noted before, we don't want to sign up to create an LTS
(little type system) in the LDL. That's a job for other notations like
DWARF or C itself.
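The nest formation rules and the small-integer path expressions from
the sections above can be sketched together in a few lines. The class
names here are invented for the sketch; a nest is checked for
containment and alignment compatibility, and a path of child indices
resolves to an offset from the base EA.

```java
// Sketch of a "nest": children at fixed offsets inside a parent span,
// addressed by small-integer paths instead of names.
import java.util.List;

public class NestSketch {
    record Nest(long size, long align, long offset, List<Nest> children) {
        static Nest leaf(long size, long align, long offset) {
            return new Nest(size, align, offset, List.of());
        }
    }

    // Formation rules: containment, parent alignment satisfies children,
    // and each child's offset respects its own alignment.
    static boolean wellFormed(Nest parent) {
        for (Nest c : parent.children()) {
            if (c.offset() + c.size() > parent.size()) return false;
            if (parent.align() % c.align() != 0) return false;
            if (c.offset() % c.align() != 0) return false;
            if (!wellFormed(c)) return false;
        }
        return true;
    }

    // A path expression like (1,1) resolves to an offset by summing
    // child offsets along the way.
    static long resolve(Nest n, int... path) {
        long off = n.offset();
        for (int i : path) {
            n = n.children().get(i);
            off += n.offset();
        }
        return off;
    }

    public static void main(String[] args) {
        // struct { int32 a; struct { int16 b; int16 c; } inner; }
        Nest inner = new Nest(4, 2, 4,
                List.of(Nest.leaf(2, 2, 0), Nest.leaf(2, 2, 2)));
        Nest root = new Nest(8, 4, 0,
                List.of(Nest.leaf(4, 4, 0), inner));
        System.out.println(wellFormed(root));     // true
        System.out.println(resolve(root, 1, 1));  // offset of inner.c: 6
    }
}
```

Note that a bad offset is rejected, not silently padded, matching the
"no auto-padding, but the constraints tell you" position above.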
== atomicity

Crucially, every LSU supports atomic reads and writes for some sizes,
usually 1, 2, 4, and 8 bytes. A code generator will sometimes need to
issue an atomic load or store (for a permitted size). Just as with
alignment, a code generator has to be able to prove that the size and
EA are compatible with atomicity. I think compatibility always amounts
to a restriction on size and alignment, so the above LDL attributes are
enough to ensure correct atomic commands.

== immutability

You might wonder if some sort of "const" attribute would make sense in
an LDL. I think not, for similar reasons as atomicity: It's a decision
in the user's hands which LSU commands to issue.

== arrays, repetition

An array is almost a struct, except you don't always know how many
elements it has. As soon as the number of elements is known, you can
reduce it to a nest of array elements. (More crudely, you can always
turn it into a block with only a size and alignment.) Is there a need
for any more mechanism? I would like to think of the "array" operator
in the LDL as a sort of macro for a variable nest.

The hard part here is deciding how to deal with late-bound parts of the
layout. So far, every layout and nested layout can be statically bound
to a size and alignment, but arrays spoil that. Perhaps the answer is
to give the LDL a replication operator, and allow an LDL formula to
evaluate to a "variable-sized layout", with one or more size parameters
to be supplied later. A code generator would just arrange to take the
extra runtime parameters. Do we do the C-like thing and require a
variable-length child layout to always occur at the end? Well,
actually, the "nest" abstraction doesn't care about this. But it does
require that the parent be provably large enough to include the
children.
If we add arrays to nests and stir vigorously, we get formulas like
these from pseudo-C:

  struct TwoArrays {
    int n1, n2;
    double a1[n1];
    byte a2[n2];
  };

There are two problems here: The offset of a2 is a complex expression,
and the size of the whole is another expression. In a parameterized LDL
formula, TwoArrays can be like this:

  template<$n1, $n2>
  struct TwoArrays {
    int n1, n2;
    double a1[$n1];
    byte a2[$n2];
  };

A code generator would need to be passed $n1 and $n2. (Extra cleverness
would allow it to fetch them from n1 and n2, optionally.) More
generally, all the numeric parts of an LDL formula could be either
constants or variables (where the set of variables is declared at the
beginning of the formula, like a lambda or a template). This seems
overly general; I'm not sure what is the right reduced-strength design.

In the end, it's clear we need to support a modest amount of runtime
binding for offsets (of array elements), but we can't sign up to prove
properties like "non-overlapping", or "tightly packed" or even
"properly aligned", if a runtime-sized array has gotten in somewhere.
Probably at some point you have to say to the static code generator
"trust me, you won't run off the end", or maybe "here are some numbers
you can use to check my math". Then efficient math-checking becomes the
JIT's problem; that's how Java does it.

== optional data

What about data that may or may not be present under the control of a
separate bitmask? That is a common trick for aggressive packing. It
spoils C-like structs but we need not fear it. It can be modeled as a
nest of arrays like TwoArrays, where each array is of runtime length 0
or 1.

== variable-length scalars

If we have arrays, can we get variable-length ints? There the bits that
control length are right there in the adjacent byte(s), and in an
arbitrary fashion. It's not clear how to merge this into an LDL. In
fact, recall that the basic assumption above is that memory is
random-access.
All you need to do is take a base address, add some offsets (perhaps
computed at runtime) and fetch your data. Variable-sized or parsed
values belong to a streaming model of memory, where you read one thing
first, make a decision, and then read the next thing. So, probably
there is an LDL-like thing adapted to streaming, but we don't need to
design it immediately.

== non-contiguous data

Does it ever make sense to load a layout element from separated bytes?
Probably not, although I can imagine places where you'd want to express
something like this. You can simulate this under layouts by allowing
memory itself to be abstracted; the LSU would operate on a non-standard
memory that is composed of overlays, in such a way that the
non-contiguous bytes become briefly contiguous, long enough for the LSU
to load via a layout. I'd say this is not a feature.

== chunking, vectorization

It would be helpful, on today's machines, if a code generator could
create vectorized operations on memory data. That is, the LSU would
have "vector mode" or "streaming" commands that work on vectors or
streams of Layouts. Getting vectorized loops right basically requires
testing size and alignment at runtime, and picking the right loop
version. The LDL already supports size and alignment queries, so
perhaps nothing more is required. One possibility: The LDL might
include hints to allow code generators to speculate on particularly
favorable sizes or alignments. So the "bigarray(N,L)" operator would be
identical to "array(N,L)" but with an extra hint that N is likely to be
large enough to vectorize profitably. Yuck, but you get the general
idea.

Similar to vectorization is cache line sparing. The creation or access
of in-memory data might sometimes want to attempt to touch as few cache
lines as possible. Again, size and alignment speculations (or
preferences or hints) in the LDL might assist code generators with
this.
But in the end this feels like names and types: Nice to have,
sometimes, but not strictly needed. Maybe the correct requirement to
deduce here is that the LDL be "friendly" to occasional injections of
annotated data, whether names, types, or speculation advice.

Well, that was long. Merry Christmas, Happy Holidays, enjoy your break!

-- John