RFR: 8264774: Implementation of Foreign Function and Memory API (Incubator)

Tue Apr 27 18:01:41 UTC 2021

On Mon, 26 Apr 2021 17:10:13 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

> This PR contains the API and implementation changes for JEP-412 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment.
> 
> [1] - https://openjdk.java.net/jeps/412

Here we list the main changes introduced in this PR. As usual, a big thank you to all who helped along the way: @ChrisHegarty, @iwanowww, @JornVernee, @PaulSandoz and @sundararajana.

### Managing memory segments: `ResourceScope`

This PR introduces a new abstraction (first discussed [here](https://inside.java/2021/01/25/memory-access-pulling-all-the-threads/)), namely `ResourceScope`, which is used to manage the lifecycle of resources associated with off-heap memory (such as `MemorySegment`, `VaList`, etc.). This is probably the biggest change in the API, as `MemorySegment` is no longer `AutoCloseable`: it instead features a *scope accessor* which can be used to access the memory segment's `ResourceScope`; the `ResourceScope` is the new `AutoCloseable`. In other words, code like this:

    try (MemorySegment segment = MemorySegment.allocateNative(100)) {
       ...
    }

now becomes this:

    try (ResourceScope scope = ResourceScope.newConfinedScope()) {
       MemorySegment segment = MemorySegment.allocateNative(100, scope);
       ...
    }

While simple cases where only one segment is allocated become a little more verbose, this new API idiom obviously scales much better when multiple segments are created with the same lifecycle. Another important fact, which is captured by the name of the `ResourceScope` factory in the above snippet, is that segments no longer feature dynamic ownership changes. These were cool, but ultimately too expensive to support in the shared case. Instead, the API now requires clients to make a choice upfront (confined, shared or *implicit* - where the latter means GC-managed, like direct buffers).

Implementation-wise, `ResourceScope` is implemented by a bunch of internal classes: `ResourceScopeImpl`, `ConfinedScope` and `SharedScope`. A resource scope implementation has a so-called *resource list*, which can also be shared or confined. This is where cleanup actions are added; the resource list can be attached to a `Cleaner` to get implicit deallocation. There is a new test, `TestResourceScope`, to stress test the behavior of resource scopes, as well as a couple of microbenchmarks to assess the cost of creating/closing scopes (`ResourceScopeClose`) and acquiring/releasing them (`BulkMismatchAcquire`).
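To make the resource-list idea concrete, here is a minimal plain-Java sketch of a confined scope that collects cleanup actions and runs them on close. It is a model only: the class name `ConfinedScopeSketch`, its methods, and the choice to run actions in reverse registration order are assumptions made for illustration, not the code in `ResourceScopeImpl`.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model (not the JDK implementation): a thread-confined scope
// that owns a list of cleanup actions and runs them exactly once on close.
final class ConfinedScopeSketch implements AutoCloseable {
    private final Thread owner = Thread.currentThread();
    private final Deque<Runnable> resourceList = new ArrayDeque<>();
    private boolean alive = true;

    // Registers a cleanup action; only the owner thread may use the scope.
    void addCloseAction(Runnable action) {
        checkValidState();
        resourceList.push(action);
    }

    boolean isAlive() {
        return alive;
    }

    private void checkValidState() {
        if (Thread.currentThread() != owner) {
            throw new IllegalStateException("Attempted access outside owning thread");
        }
        if (!alive) {
            throw new IllegalStateException("Already closed");
        }
    }

    @Override
    public void close() {
        checkValidState();
        alive = false;
        // Assumption of this sketch: most recently added action runs first.
        while (!resourceList.isEmpty()) {
            resourceList.pop().run();
        }
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        try (ConfinedScopeSketch scope = new ConfinedScopeSketch()) {
            // Two "segments" sharing one lifecycle.
            scope.addCloseAction(() -> log.append("free A;"));
            scope.addCloseAction(() -> log.append("free B;"));
        }
        System.out.println(log);
    }
}
```

Every resource registered with the scope is released by the single `close()` at the end of the `try` block, which is the property that lets one scope manage many segments.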

### IO operations on shared buffer views

In the previous iteration of the Memory Access API we introduced the concept of *shared* segments. Shared segments are as easy to use as confined ones, and they are as fast. One problem with shared segments was that it wasn't clear how to support IO operations on byte buffers derived from such segments: since the segment memory could be released at any time, there was simply no way to guarantee that a shared segment could not be closed in the middle of a (possibly async) IO operation.

In this iteration, shared segments are just segments backed by a *shared resource scope*. The new API introduces a way to manage the new complexity, in the form of two methods, `ResourceScope::acquire` and `ResourceScope::release`, which can be used to *acquire* and *release* a resource scope, respectively. When a resource scope is in the acquired state, it cannot be closed (you can think of it as some slightly better and asymmetric form of an atomic reference counter).
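The "cannot be closed while acquired" rule can be modeled with a single atomic counter. The sketch below is a simplification written for illustration: the class `AcquirableScopeSketch` and its boolean-returning `tryAcquire`/`tryClose` methods are hypothetical and do not mirror the signatures of `ResourceScope::acquire` and `ResourceScope::release`, nor the actual shared scope implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of acquire/release: a scope can only be closed
// when no acquire is outstanding.
final class AcquirableScopeSketch {
    private static final int CLOSED = -1;
    // >= 0: number of outstanding acquires; CLOSED: the scope is closed.
    private final AtomicInteger state = new AtomicInteger(0);

    // Returns true if the scope was acquired, false if it is already closed.
    boolean tryAcquire() {
        while (true) {
            int s = state.get();
            if (s == CLOSED) {
                return false;
            }
            if (state.compareAndSet(s, s + 1)) {
                return true;
            }
        }
    }

    // Undoes one successful tryAcquire.
    void release() {
        while (true) {
            int s = state.get();
            if (s <= 0) {
                throw new IllegalStateException("Scope is not acquired");
            }
            if (state.compareAndSet(s, s - 1)) {
                return;
            }
        }
    }

    // Closing succeeds only when nobody currently holds the scope.
    boolean tryClose() {
        return state.compareAndSet(0, CLOSED);
    }

    public static void main(String[] args) {
        AcquirableScopeSketch scope = new AcquirableScopeSketch();
        scope.tryAcquire();                     // e.g. an IO operation starts
        System.out.println(scope.tryClose());   // close is rejected while acquired
        scope.release();                        // the IO operation completes
        System.out.println(scope.tryClose());   // now close goes through
    }
}
```

An IO routine would bracket its use of the buffer with an acquire and a release, so a concurrent close either happens before the operation starts or is refused until it finishes.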

This means we are now finally in a position to add support for IO operations on all byte buffers, including those derived from shared segments. A big thank you to @ChrisHegarty, who led the effort here. More information on this work is included in his [writeup](https://inside.java/2021/04/21/fma-and-nio-channels/).

Most of the implementation for this feature occurs in the internal NIO packages; a new method on `Buffer` has been added to facilitate acquiring from NIO code - most of the logic associated with acquiring is in the `IOUtil` class. @ChrisHegarty has added many good tests for scoped IO operations under the `foreign/channels` folder, to check for all possible buffer/scope flavors.

### Allocating at speed: `SegmentAllocator`

Another abstraction introduced in this JEP is that of `SegmentAllocator`. A segment allocator is a functional interface which can be used to tell other APIs (and, crucially, the `CLinker` API) how segments should be allocated, if the need arises. For instance, think about some code which turns a Java string into a C string. Such code will invariably:

1. allocate a memory segment off heap
2. bulk copy (where possible) the content of the Java string into the off-heap segment
3. add a NULL terminator

So, in (1) such a string conversion routine needs to allocate a new off-heap segment; how is that done? Is that a call to malloc? Or something else? In the previous iteration of the Foreign Linker API, the method `CLinker::toCString` had two overloads: a simple version, only taking a Java string parameter; and a more advanced version taking a `NativeScope`. A `NativeScope` was, at its core, a custom segment allocator - but the allocation scheme was fixed in `NativeScope` as that class always acted as an arena-style allocator.
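The three conversion steps, and the question of who decides how the memory is allocated, can be sketched with plain NIO buffers. This is an illustration only: `CStringSketch`, the use of `ByteBuffer` in place of `MemorySegment`, the `IntFunction<ByteBuffer>` allocator parameter, and the UTF-8 encoding are all assumptions of the sketch, not what `CLinker::toCString` does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.function.IntFunction;

// Hypothetical sketch: Java string to NUL-terminated byte sequence, with the
// allocation strategy supplied by the caller.
final class CStringSketch {
    static ByteBuffer toCString(String s, IntFunction<ByteBuffer> allocator) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // encoding is an assumption
        ByteBuffer buf = allocator.apply(bytes.length + 1); // step 1: allocate
        buf.put(bytes);                                     // step 2: bulk copy
        buf.put((byte) 0);                                  // step 3: NUL terminator
        buf.flip();                                         // make the content readable from index 0
        return buf;
    }

    public static void main(String[] args) {
        // Same routine, two allocation strategies chosen by the caller.
        ByteBuffer offHeap = toCString("hello", ByteBuffer::allocateDirect);
        ByteBuffer onHeap = toCString("hello", ByteBuffer::allocate);
        System.out.println(offHeap.isDirect() + " " + onHeap.isDirect());
    }
}
```

Passing the allocator as a function keeps the conversion routine independent of any particular allocation scheme, which is the gap a fixed arena-style `NativeScope` left open.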

`SegmentAllocator` is like `NativeScope` in spirit, in that it helps programs allocate segments - but it does so in a more general way than `NativeScope`, since a `SegmentAllocator` is not tied to any specific allocation strategy: in fact the strategy is left to be defined by the user. As before, `SegmentAllocator` does provide some common factories, e.g. to create an arena allocator similar to `NativeScope` - but the `CLinker` API is now free to work with _any_ implementation of the `SegmentAllocator` interface. This generalization is crucial, given that, when operating with off-heap memory, allocation performance is often the bottleneck.

Not only is `SegmentAllocator` accepted by all methods in the `CLinker` API which need to allocate memory; even the behavior of downcall method handles can now be affected by segment allocators. When linking a native function which returns a struct by value, the `CLinker` API will in fact need to dynamically allocate a segment to hold the result. In such cases, the method handle generated by `CLinker` will now accept an additional *prefix* parameter of type `SegmentAllocator` which tells `CLinker` *how* memory should be allocated for the result value. For instance, clients can now tell `CLinker` to return structs by value in *heap* segments, by using a `SegmentAllocator` which allocates memory on the heap; this might be useful if the segment is quickly discarded after use.

There's not much implementation for `SegmentAllocator` as most of it is defined in terms of `default` methods in the interface itself. However we do have implementation classes for the arena allocation scheme (`ArenaAllocator.java`). We support confined allocation and shared allocation. The shared allocation achieves lock-free allocation by using a `ThreadLocal` of confined arena allocators. `TestSegmentAllocators` is the test which checks most of the arena allocation flavors.
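As a rough picture of the arena scheme, the following plain-Java sketch bump-allocates slices out of one pre-allocated block and uses a `ThreadLocal` so that each thread works on its own confined arena. Everything here is hypothetical: `ArenaSketch`, the fixed 1024-byte per-thread block, and the lack of alignment handling or block growth are simplifications, and the code is not taken from `ArenaAllocator.java`.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of arena (bump) allocation over a single direct buffer.
final class ArenaSketch {
    private final ByteBuffer block;
    private int offset = 0;

    ArenaSketch(int capacity) {
        block = ByteBuffer.allocateDirect(capacity);
    }

    // Confined allocation: hands out the next `size` bytes of the block.
    // No alignment and no growth; both are omitted in this sketch.
    ByteBuffer allocate(int size) {
        if (size < 0 || size > block.capacity() - offset) {
            throw new IllegalStateException("Arena exhausted");
        }
        ByteBuffer view = block.duplicate();
        view.position(offset);
        view.limit(offset + size);
        offset += size;
        return view.slice();
    }

    // "Shared" allocation without locks: each thread lazily gets a private arena.
    private static final ThreadLocal<ArenaSketch> PER_THREAD =
            ThreadLocal.withInitial(() -> new ArenaSketch(1024));

    static ByteBuffer allocateShared(int size) {
        return PER_THREAD.get().allocate(size);
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            ByteBuffer b = allocateShared(16);
            System.out.println(Thread.currentThread().getName() + " got " + b.capacity() + " bytes");
        };
        Thread t1 = new Thread(task, "worker-1");
        Thread t2 = new Thread(task, "worker-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```

Because a thread only ever touches its own arena, the bump pointer needs no synchronization; the trade-off is that memory is reserved per thread.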

### `MemoryAddress` as scoped entities

A natural consequence of introducing the `ResourceScope` abstraction is that now not only are `MemorySegment` instances associated with a scope, but instances of `MemoryAddress` can be too. This means extra safety, because passing addresses which are associated with a closed scope to a native function will throw an exception. As before, it is possible to have memory addresses which the runtime knows nothing about (those returned by native calls, or those created via `MemoryAddress::ofLong`); these addresses are simply associated with the so-called *global scope* - meaning that they are not actively managed by the user and are considered to be "always alive" by the runtime (as before).

Implementation-wise, you will now see that `MemoryAddressImpl` is no longer a pair of `Object`/`long`. It is now a pair of `MemorySegment`/`long`. The `MemorySegment`, if present, tells us which segment this address has been obtained from (and hence which scope is associated with the address). If null, it means that the address has no segment, and therefore is associated with the global scope. The `long` part acts as an offset into the segment (if the segment is non-null), or as an absolute address. A new test, `SafeFunctionAccessTest`, attempts to call native functions with (closed) scoped addresses to see if exceptions are thrown.
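The segment/offset pairing can be modeled in a few lines of plain Java. The sketch below is not `MemoryAddressImpl`: the class `ScopedAddressSketch`, its nested `Scope` and `Segment` types, and the method names are invented for illustration, and the liveness check stands in for whatever the linker does before a native call.

```java
// Hypothetical model: an address is a (segment, offset) pair; a null segment
// means "absolute address, associated with the global scope".
final class ScopedAddressSketch {
    static final class Scope {
        private volatile boolean alive = true;
        boolean isAlive() { return alive; }
        void close() { alive = false; }
    }

    // Never closed in this sketch, so its addresses are "always alive".
    static final Scope GLOBAL = new Scope();

    static final class Segment {
        final long base;
        final Scope scope;
        Segment(long base, Scope scope) {
            this.base = base;
            this.scope = scope;
        }
        // An address derived from a segment remembers where it came from.
        ScopedAddressSketch address(long offset) {
            return new ScopedAddressSketch(this, offset);
        }
    }

    private final Segment segment; // null: not derived from any segment
    private final long offset;     // offset into segment, or absolute address

    private ScopedAddressSketch(Segment segment, long offset) {
        this.segment = segment;
        this.offset = offset;
    }

    static ScopedAddressSketch ofLong(long raw) {
        return new ScopedAddressSketch(null, raw);
    }

    Scope scope() {
        return segment == null ? GLOBAL : segment.scope;
    }

    // Resolves to a raw value, refusing addresses whose scope has been closed.
    long toRawLongValue() {
        if (!scope().isAlive()) {
            throw new IllegalStateException("Scope is closed");
        }
        return segment == null ? offset : segment.base + offset;
    }

    public static void main(String[] args) {
        Scope scope = new Scope();
        ScopedAddressSketch addr = new Segment(0x1000L, scope).address(8);
        System.out.println(Long.toHexString(addr.toRawLongValue()));
        scope.close();
        try {
            addr.toRawLongValue();
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```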

### *Virtual* downcall method handles

There are cases where the address of a downcall handle cannot be specified when a downcall method handle is linked, but can only be known subsequently, by doing more native calls. To better support these use cases, `CLinker` now provides a factory for downcall method handles which does *not* require any function entry point. Instead, such an entry point will be provided *dynamically*, via an additional prefix parameter (of type `MemoryAddress`). Many thanks to @JornVernee, who implemented this improvement.

The implementation changes for this range from tweaking the Java ABI support (to make sure that the prefix argument is handled as expected), to low-level hotspot changes to parameterize the generated compiled stub to use the address (dynamic) parameter. Note that regular downcall method handles (the ones that are bound to an address) are now simply obtained by getting a "virtual" method handle, and inserting a `MemoryAddress` coordinate in the first argument position. `TestVirtualCalls` has been written explicitly to test dynamic passing of address parameters but, in reality, all existing downcall tests are stressing the new implementation logic (since, as said before, the old logic is expressed as an adaptation of the new virtual method handles). The benchmark we used to test downcall performance, `CallOverhead`, has now been split into two: `CallOverheadConstant` vs. `CallOverheadVirtual`.
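The "bound handle = virtual handle + inserted first argument" relationship can be shown with the standard `java.lang.invoke` API alone. In this sketch a `long` stands in for the `MemoryAddress` prefix parameter and a small lookup table stands in for native functions; `VirtualHandleSketch`, `dispatch`, and the addresses `0x10`/`0x20` are made up for the example, and no native code is involved.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch: a "virtual" handle takes the entry point as its first
// argument; a "bound" handle is derived from it with insertArguments.
final class VirtualHandleSketch {
    // Pretend native functions, keyed by made-up entry point addresses.
    private static final Map<Long, IntUnaryOperator> FUNCTIONS =
            Map.of(0x10L, x -> x + 1, 0x20L, x -> x * 2);

    static int dispatch(long entryPoint, int arg) {
        return FUNCTIONS.get(entryPoint).applyAsInt(arg);
    }

    // Type (long, int) -> int: the entry point is supplied on every call.
    static MethodHandle virtualHandle() {
        try {
            return MethodHandles.lookup().findStatic(
                    VirtualHandleSketch.class, "dispatch",
                    MethodType.methodType(int.class, long.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Type (int) -> int: the entry point is fixed at position 0.
    static MethodHandle boundHandle(long entryPoint) {
        return MethodHandles.insertArguments(virtualHandle(), 0, entryPoint);
    }

    static int callVirtual(long entryPoint, int arg) {
        try {
            return (int) virtualHandle().invokeExact(entryPoint, arg);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    static int callBound(long entryPoint, int arg) {
        try {
            return (int) boundHandle(entryPoint).invokeExact(arg);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(callVirtual(0x20L, 21)); // entry point chosen per call
        System.out.println(callBound(0x10L, 21));   // entry point fixed up front
    }
}
```

Since the bound form is just an adaptation of the virtual one, any test that exercises bound handles also exercises the virtual code path.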

### Optimized upcall support

The previous iteration of the Foreign Linker API supported intrinsification of downcall handles, which allows calls to downcall method handles to perform as fast as a regular JNI call. The dual case, calling Java code from native code (upcalls), was left unoptimized. In this iteration, @JornVernee has added intrinsics support for upcalls as well as downcalls, based on some prior work from @iwanowww. As with downcalls, a lot of the adaptation now happens in Java code, before we jump into the target method handle. As for the code which calls such a target handle, changes have been made so that the native code can jump to the optimized entry point (if one exists) for such a method handle more directly. The performance improvements with this new approach are rather nice, with `CLinker` upcalls being 3x-4x faster compared with regular upcalls via JNI.

Again, here we have changes in the guts of the Java ABI support, as we needed to adjust the method handle specialization logic to be able to work in two directions (both from Java to native and from native to Java). On the Hotspot front, the optimization changes are in `universalUpcallHandler_x86_64.cpp`.

### Accessing restricted methods

It is still the case that some of the methods in the API are "restricted" and access to these methods is disabled by default. In previous iterations, access to such methods was granted by setting a JDK read-only runtime property: `-Dforeign.restricted=permit`. In this iteration we have refined the story for accessing restricted methods (thanks @sundararajana), by introducing a new experimental command line option in the Java launcher, namely `--enable-native-access=<module list>`. This option accepts a list of modules (separated by commas), where a module name can also be `ALL-UNNAMED` (for the unnamed module). Adding this command line flag to the launcher has the effect of allowing access to restricted methods to a given set of modules (the list of modules specified in the command line option). Access to restricted methods from any other module not in the list is disallowed and will result in an `IllegalAccessException`.

When implementing this flag we considered two options: adding some resolution-time checks in the JVM (e.g. in `linkResolver`); or using `@CallerSensitive` methods. In the end we opted for the latter given that `@CallerSensitive` methods are generally well understood and optimized, and the general feeling was that inventing another form of callsite-dependent check might have been unnecessarily risky, given that the same checks can be implemented in Java using `@CallerSensitive`. We plan (not in 17) to add `javadoc` support by means of an annotation (like we do for preview API methods) so that the text that is currently copied and pasted in all restricted methods can be inferred automagically by javadoc.
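The shape of such a caller-dependent check can be approximated with public APIs. In the sketch below, `StackWalker.getCallerClass()` is used as a stand-in for the JDK-internal `@CallerSensitive` machinery, and the enabled-module set is passed in explicitly rather than read from the launcher; `NativeAccessSketch` and its methods are hypothetical and do not reproduce the JDK's actual check.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of a module-based, caller-sensitive access check.
final class NativeAccessSketch {
    // Parses a comma-separated module list such as "my.module,ALL-UNNAMED".
    static Set<String> parseModuleList(String optionValue) {
        return Arrays.stream(optionValue.split(","))
                .map(String::trim)
                .filter(name -> !name.isEmpty())
                .collect(Collectors.toSet());
    }

    // Unnamed modules are matched against the special ALL-UNNAMED entry.
    static boolean isEnabled(Module caller, Set<String> enabledModules) {
        String name = caller.isNamed() ? caller.getName() : "ALL-UNNAMED";
        return enabledModules.contains(name);
    }

    // A "restricted" method: it inspects the module of whoever called it.
    static void restrictedOperation(Set<String> enabledModules) throws IllegalAccessException {
        Class<?> caller = StackWalker
                .getInstance(StackWalker.Option.RETAIN_CLASS_REFERENCE)
                .getCallerClass();
        if (!isEnabled(caller.getModule(), enabledModules)) {
            throw new IllegalAccessException("Illegal native access from " + caller.getModule());
        }
    }

    public static void main(String[] args) {
        try {
            // Succeeds when this class is loaded from the class path (unnamed module).
            restrictedOperation(parseModuleList("ALL-UNNAMED"));
            System.out.println("access granted");
            // The caller's module is not in this list, so the check fails.
            restrictedOperation(parseModuleList("some.other.module"));
        } catch (IllegalAccessException e) {
            System.out.println("access denied: " + e.getMessage());
        }
    }
}
```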

### GitHub testing status

Most platforms build and tests pass. There are a bunch of *additional* Linux platforms which do not yet work correctly:

* Zero
* arm
* ppc
* s390

The first two can be addressed easily by stubbing out a few functions (I'll do that shortly). The last two are harder, as this patch moves some static functions (e.g. `long_move`, `float_move`) up to `SharedRuntime`; unfortunately, while most platforms use the same signatures for these functions, on ppc and s390 that's not the case: functions with the same names but incompatible signatures are defined there, leading to build issues. We will try to tweak the code around this, to make sure that these platforms remain buildable.

Javadoc: http://cr.openjdk.java.net/~mcimadamore/JEP-412/v1/javadoc/jdk/incubator/foreign/package-summary.html
Specdiff: http://cr.openjdk.java.net/~mcimadamore/JEP-412/v1/specdiff/overview-summary.html

-------------

PR: https://git.openjdk.java.net/jdk/pull/3699