RFR 8243491: Implementation of Foreign-Memory Access API (Second Incubator)
Hi,

Looks good. I have seen almost all of this in reviews on panama-dev, hence the lack of substantial comments here.

I suspect we are not going to need the drop-argument VH combinator; dropping coordinates feels a little suspicious to me, but I can see why it's there for completeness.

Paul.
On Apr 23, 2020, at 1:33 PM, Maurizio Cimadamore <maurizio.cimadamore@oracle.com> wrote:
Hi, the time has come for another round of foreign memory access API incubation (see JEP 383 [3]). This iteration aims at polishing some of the rough edges of the API, and adds some of the functionality that developers have been asking for during the first round of incubation. The revised API tightens the thread-confinement constraints (by removing the MemorySegment::acquire method) and instead provides more targeted support for parallel computation via a segment spliterator.

The API also adds a way to create a custom native segment; this is, essentially, an unsafe API point, very similar in spirit to the JNI NewDirectByteBuffer function [1]. By using this bit of API, power users will be able to add support, via MemorySegment, for *their own memory sources* (e.g. think of a custom allocator written in C/C++). For now, this API point is marked as "restricted", and a special read-only JDK property will have to be set on the command line for calls to this method to succeed. We are aware there's no precedent for something like this in the Java SE API - but if Project Panama is to remain true to its ultimate goal of replacing bits of JNI code with (low-level) Java code, stuff like this has to be *possible*. We anticipate that, at some point, this property will become a true launcher flag, and that the foreign restricted machinery will be integrated more neatly into the module system.
A list of the API, implementation and test changes is provided below. If you have any questions, or need more detailed explanations, I (and the rest of the Panama team) will be happy to point at existing discussions, and/or to provide the feedback required.
Thanks
Maurizio
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary...
CSR:
https://bugs.openjdk.java.net/browse/JDK-8243496
API changes
===========
* MemorySegment
  - drop support for the acquire() method - in its place you can now obtain a spliterator from a segment, which supports divide-and-conquer
  - revamped support for views - e.g. isReadOnly - segments now have access modes
  - added API to do serial confinement hand-off (MemorySegment::withOwnerThread)
  - added unsafe factory to construct a native segment out of an existing address; this API is "restricted" and only available if the program is executed with the -Dforeign.unsafe=permit flag
  - MemorySegment::mapFromPath now returns a MappedMemorySegment

* MappedMemorySegment
  - small sub-interface which provides extra capabilities for mapped segments (load(), unload() and force())

* MemoryAddress
  - added distinction between *checked* and *unchecked* addresses; *unchecked* addresses do not have a segment, so they cannot be dereferenced
  - added NULL memory address (it's an unchecked address)
  - added factory to construct a MemoryAddress from a long value (the result is also an unchecked address)
  - added API point to get the raw address value (where possible - e.g. if this is not an address pointing into a heap segment)

* MemoryLayout
  - added support for layout "attributes" - e.g. to store metadata inside a MemoryLayout
  - added MemoryLayout::isPadding predicate
  - added helper functions to SequenceLayout to reshape/flatten sequence layouts (a la NDArray [4])

* MemoryHandles
  - added support for general VarHandle combinators (similar to MH combinators)
  - added a combinator to turn a long VH into a MemoryAddress VH (the resulting MemoryAddress is also *unchecked* and cannot be dereferenced)
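The divide-and-conquer idiom that the segment spliterator enables can be previewed with plain NIO buffers on any recent JDK (the incubator API itself requires --add-modules jdk.incubator.foreign). This is only an illustrative sketch of the same shape of computation, not the incubator API: it parallel-sums a direct buffer by handing disjoint fixed-size chunks to different threads, much like splitting a segment by its sequence layout.

```java
import java.nio.ByteBuffer;
import java.util.stream.IntStream;

public class ChunkedSum {
    public static void main(String[] args) {
        // fill a 1 KiB direct buffer with ones
        ByteBuffer buf = ByteBuffer.allocateDirect(1024);
        for (int i = 0; i < buf.capacity(); i++) {
            buf.put(i, (byte) 1);
        }
        int chunk = 256; // each split processes a disjoint 256-byte slice
        long sum = IntStream.range(0, buf.capacity() / chunk)
                .parallel()
                .mapToLong(c -> {
                    long s = 0;
                    for (int j = c * chunk; j < (c + 1) * chunk; j++) {
                        s += buf.get(j); // absolute get: no shared position state
                    }
                    return s;
                })
                .sum();
        System.out.println(sum);
    }
}
```

With the segment spliterator, the splitting logic and the thread-confinement hand-off are handled by the API instead of by manual index arithmetic.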
Implementation changes
======================
* add support for VarHandle combinators (e.g. IndirectVH)
The idea here is simple: a VarHandle can almost be thought of as a set of method handles (one for each access mode supported by the var handle) that are lazily linked. This gives us a relatively simple idea upon which to build support for custom var handle adapters: we can create a VarHandle by passing an existing var handle and specifying the set of adaptations that should be applied to the method handle for a given access mode in the original var handle. The result is a new VarHandle which might support a different carrier type and more, or fewer, coordinate types. Adding this support was relatively easy - it only required one bit of low-level surgery on the lambda forms generated for adapted var handles (this is required so that the "right" var handle receiver can be used for dispatching the access mode call).
All the new adapters in the MemoryHandles API (which are really defined inside VarHandles) are just a bunch of MH adapters that are stitched together into a brand new VH. The only caveat is that we could have a checked exception mismatch: the VarHandle API methods are specified not to throw any checked exception, whereas method handles can throw any throwable. This means that, potentially, calling get() on an adapted VarHandle could result in a checked exception being thrown. To solve this gnarly issue, we decided to scan all the filter functions passed to the VH combinators and look for direct method handles which throw checked exceptions. If such MHs are found (they can be deeply nested, since the MHs can be adapted on their own), adaptation of the target VH fails fast.
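For readers on a current JDK: these VH combinators later graduated into java.lang.invoke.MethodHandles (JDK 16+), so the idea can be sketched against the standard API. Here an int[] element var handle is adapted to a String carrier; the filters (parseInt/toString) pass the checked-exception scan described above because NumberFormatException is unchecked.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.VarHandle;

public class FilterValueDemo {
    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        VarHandle intElem = MethodHandles.arrayElementVarHandle(int[].class);
        // String -> int, applied when storing through the adapted handle
        MethodHandle toTarget = lookup.findStatic(Integer.class, "parseInt",
                MethodType.methodType(int.class, String.class));
        // int -> String, applied when loading through the adapted handle
        MethodHandle fromTarget = lookup.findStatic(Integer.class, "toString",
                MethodType.methodType(String.class, int.class));
        VarHandle strElem = MethodHandles.filterValue(intElem, toTarget, fromTarget);

        int[] data = new int[4];
        strElem.set(data, 0, "42");                        // parses "42", stores the int 42
        System.out.println(data[0]);                       // the underlying array saw an int
        System.out.println((String) strElem.get(data, 0)); // the adapted carrier is String
    }
}
```

If either filter were a direct method handle declaring a checked exception, `filterValue` would reject the adaptation up front, exactly as the fail-fast scan above requires.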
* More ByteBuffer implementation changes
Some more changes to ByteBuffer support were necessary here. First, we have added support for retrieval of "mapped" properties associated with a ByteBuffer (e.g. the file descriptor, etc.). This is crucial if we want to be able to turn an existing byte buffer into the "right kind" of memory segment.
Conversely, we also have to allow creation of mapped byte buffers given existing parameters - which is needed when going from (mapped) segment to a buffer. These two pieces together allow us to go from segment to buffer and back w/o losing any information about the underlying memory mapping (which was an issue in the previous implementation).
Lastly, to support the new MappedMemorySegment abstraction, all the memory-mapping support functionality has been moved into a common helper class so that MappedMemorySegmentImpl can reuse it (e.g. for MappedMemorySegment::force).
* Rewritten memory segment hierarchy
The old implementation had a monomorphic memory segment class. In this round we aimed at splitting the various implementation classes so that we have a class for heap segments (HeapMemorySegmentImpl), one for native segments (NativeMemorySegmentImpl) and one for memory mapped segments (MappedMemorySegmentImpl, which extends NativeMemorySegmentImpl). Not much to see here - although one important point is that, by doing this, we have been able to speed things up quite a bit, since now e.g. native/mapped segments are _guaranteed_ to have a null "base". We have also done a few tricks to make sure that the "base" accessor for heap segments is sharply typed and NPE-checked, which allows C2 to speculate more and hoist. With these changes _all_ segment types have comparable performance and hoisting guarantees (unlike in the old implementation).
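A hedged paraphrase of that design (the Impl class names come from the webrev; the bodies here are invented for illustration): giving each segment kind its own final class means the base() accessor is either constant-null or a non-null, sharply typed array, both of which the JIT can exploit.

```java
import java.util.Objects;

// illustration only: simplified stand-ins for the real *SegmentImpl classes
abstract class Segment {
    abstract Object base(); // null for off-heap memory, the backing array for heap memory
}

final class NativeSegment extends Segment {
    @Override
    Object base() { return null; } // guaranteed null: the address alone locates the memory
}

final class HeapSegment extends Segment {
    private final byte[] array;
    HeapSegment(byte[] array) { this.array = array; }
    @Override
    Object base() {
        // sharply typed (byte[], not Object) and explicitly null-checked,
        // which helps C2 speculate and hoist the access out of loops
        return Objects.requireNonNull(array);
    }
}

public class SegmentHierarchy {
    public static void main(String[] args) {
        System.out.println(new NativeSegment().base() == null);
        System.out.println(new HeapSegment(new byte[8]).base() != null);
    }
}
```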
* Add workarounds in MemoryAddressProxy and AbstractMemorySegmentImpl to special-case "small segments" so that the VM can apply bounds check elimination
This is another important piece, which allows us to get very good performance out of indexed memory access var handles; as you might know, the JIT compiler has trouble optimizing loops where the loop variable is a long [2]. To make up for that, in this round we add an optimization which allows the API to detect whether a segment is *small* or *large*. For small segments, the API realizes that there's no need to perform long computation (e.g. to perform bounds checks, or offset additions), so it falls back to integer logic, which in turn allows bounds check elimination.
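The gist of the small-segment trick can be shown outside the JDK (the method and names below are hypothetical, not the actual Impl code): when the length fits in an int, offsets and bounds checks can be done entirely in int arithmetic, which the JIT can eliminate in counted loops.

```java
public class SmallSegmentSum {
    // hypothetical access routine: pick int arithmetic when the length allows it
    static long sum(byte[] data, long length) {
        if (length == (int) length) {      // "small" segment: fall back to int logic
            long s = 0;
            int n = (int) length;
            for (int i = 0; i < n; i++) {  // int-indexed counted loop: BCE-friendly
                s += data[i];
            }
            return s;
        } else {                           // "large" segment: must keep long logic
            long s = 0;
            for (long i = 0; i < length; i++) {
                s += data[(int) i];        // placeholder; real code uses long offsets
            }
            return s;
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[100];
        java.util.Arrays.fill(data, (byte) 2);
        System.out.println(sum(data, data.length));
    }
}
```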
* renaming of the various var handle classes to conform to "memory access var handle" terminology
This is mostly stylistic, nothing to see here.
Test changes
============
In addition to the tests for the new API changes, we've also added some stress tests for var handle combinators - e.g. there's a flag that can be enabled which turns on some "dummy" var handle adaptations on all var handles created by the runtime. We've used this flag on existing tests to make sure that things work as expected.
To sanity test the new memory segment spliterator, we have wired the new segment spliterator with the existing spliterator test harness.
We have also added several micro benchmarks for the memory segment API (and made some changes to the build script so that native libraries would be handled correctly).
[1] - https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newd...
[2] - https://bugs.openjdk.java.net/browse/JDK-8223051
[3] - https://openjdk.java.net/jeps/383
[4] - https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#nump...
On 24/04/2020 01:35, Paul Sandoz wrote:
Hi,
Looks good. I have seen almost all of this in reviews on panama-dev, hence the lack of substantial comments here.
I suspect we are not going to need the drop-argument VH combinator; dropping coordinates feels a little suspicious to me, but I can see why it's there for completeness.
Thanks Paul.

Re. drop coordinates, note that we're actually using it in MemoryHandles.withStride, so that we can insert a dummy coordinate that is always discarded in case the stride is zero. We could do without it as well, of course, but sometimes it's hard to get a sense of how these pieces might be joined in practice.

Maurizio
Paul.
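For context, the drop combinator being discussed later surfaced as MethodHandles.dropCoordinates in java.lang.invoke (JDK 16+); a minimal sketch of the "dummy coordinate that is always discarded" idea:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class DropCoordinatesDemo {
    public static void main(String[] args) {
        // coordinates of an array-element handle: (int[] array, int index)
        VarHandle elem = MethodHandles.arrayElementVarHandle(int[].class);
        // add a dummy long coordinate at position 1; its value is always discarded
        VarHandle padded = MethodHandles.dropCoordinates(elem, 1, long.class);
        // coordinates are now: (int[] array, long dummy, int index)
        int[] data = {10, 20, 30};
        System.out.println((int) padded.get(data, 999L, 2)); // the 999L is ignored
    }
}
```

This is the shape withStride can exploit: the adapted handle keeps a uniform coordinate list even when a given coordinate contributes nothing to the final offset.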
On Apr 23, 2020, at 5:45 PM, Maurizio Cimadamore <maurizio.cimadamore@oracle.com> wrote:
On 24/04/2020 01:35, Paul Sandoz wrote:
Hi,
Looks good. I have seen almost all of this in reviews on panama-dev hence the lack of substantial comments here.
I suspect we are not gonna need the drop argument VH combinator, dropping coordinates feels a little suspicious to me, but I can see why it's there for completeness.
Thanks Paul.
Re. drop coordinates, note that we're actually using it in MemoryHandles.withStride, so that we can insert a dummy coordinate that is always discarded in case the stride is zero. We could do without it as well, of course, but sometimes it's hard to have a sense of how these pieces might be joined in practice.
Ah, ok I can see its value now when dropping a coordinate that would otherwise result in some redundant calculation.

However, looking at this more closely:

  * @param bytesStride the stride, in bytes, by which to multiply the coordinate value. Must be greater than zero.

It implies that a zero value should be disallowed, contrary to the implementation. I am wondering if your intent was to support a signed stride value? In that case it could be argued that a zero stride is misleading, if supported, since any value passed for the coordinate X has no effect. But I can also see the other side from a position of uniformity, and then why not support negative strides, which I think the implementation does support.

Paul.
Maurizio
Paul.
On Apr 23, 2020, at 1:33 PM, Maurizio Cimadamore <maurizio.cimadamore@oracle.com> wrote:
Hi, time has come for another round of foreign memory access API incubation (see JEP 383 [3]). This iteration aims at polishing some of the rough edges of the API, and adds some of the functionalities that developers have been asking for during this first round of incubation. The revised API tightens the thread-confinement constraints (by removing the MemorySegment::acquire method) and instead provides more targeted support for parallel computation via a segment spliterator. The API also adds a way to create a custom native segment; this is, essentially, an unsafe API point, very similar in spirit to the JNI NewDirectByteBuffer functionality [1]. By using this bit of API, power-users will be able to add support, via MemorySegment, to *their own memory sources* (e.g. think of a custom allocator written in C/C++). For now, this API point is called off as "restricted" and a special read-only JDK property will have to be set on the command line for calls to this method to succeed. We are aware there's no precedent for something like this in the Java SE API - but if Project Panama is to remain true about its ultimate goal of replacing bits of JNI code with (low level) Java code, stuff like this has to be *possible*. We anticipate that, at some point, this property will become a true launcher flag, and that the foreign restricted machinery will be integrated more neatly into the module system.
A list of the API, implementation and test changes is provided below. If you have any questions, or need more detailed explanations, I (and the rest of the Panama team) will be happy to point at existing discussions, and/or to provide the feedback required.
Thanks Maurizio
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary...
CSR:
https://bugs.openjdk.java.net/browse/JDK-8243496
API changes ===========
* MemorySegment - drop support for acquire() method - in its place now you can obtain a spliterator from a segment, which supports divide-and-conquer - revamped support for views - e.g. isReadOnly - now segments have access modes - added API to do serial confinement hand-off (MemorySegment::withOwnerThread) - added unsafe factory to construct a native segment out of an existing address; this API is "restricted" and only available if the program is executed using the -Dforeign.unsafe=permit flag. - the MemorySegment::mapFromPath now returns a MappedMemorySegment * MappedMemorySegment - small sub-interface which provides extra capabilities for mapped segments (load(), unload() and force()) * MemoryAddress - added distinction between *checked* and *unchecked* addresses; *unchecked* addresses do not have a segment, so they cannot be dereferenced - added NULL memory address (it's an unchecked address) - added factory to construct MemoryAddress from long value (result is also an unchecked address) - added API point to get raw address value (where possible - e.g. if this is not an address pointing to a heap segment) * MemoryLayout - Added support for layout "attributes" - e.g. store metadata inside MemoryLayouts - Added MemoryLayout::isPadding predicate - Added helper function to SequenceLayout to rehape/flatten sequence layouts (a la NDArray [4]) * MemoryHandles - add support for general VarHandle combinators (similar to MH combinators) - add a combinator to turn a long-VH into a MemoryAddress VH (the resulting MemoryAddress is also *unchecked* and cannot be dereferenced)
Implementation changes
======================
* add support for VarHandle combinators (e.g. IndirectVH)
The idea here is simple: a VarHandle can almost be thought of as a set of method handles (one for each access mode supported by the var handle) that are lazily linked. This gives us a relatively simple foundation on which to build support for custom var handle adapters: we can create a VarHandle from an existing var handle by specifying the set of adaptations to apply to the method handle underlying each access mode. The result is a new VarHandle which might support a different carrier type, and more (or fewer) coordinate types. Adding this support was relatively easy; it only required one piece of low-level surgery on the lambda forms generated for adapted var handles (this is required so that the "right" var handle receiver is used when dispatching the access mode call).
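The "set of lazily linked method handles" view can be illustrated with the stable java.lang.invoke API alone (this is not the new combinator code, just a sketch of the underlying idea): each access mode of a VarHandle can be extracted as a MethodHandle and then adapted with ordinary MH combinators.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.VarHandle;

// Sketch: extract the method handle behind one access mode of a VarHandle
// and adapt it with a plain MH combinator. The new VH combinators in this
// patch effectively do this for every access mode at once.
public class VarHandleAsMethodHandles {
    static final int[] DATA = { 1, 2, 3 };

    // Read DATA[index] through the GET access mode of an array VarHandle,
    // after adapting the underlying method handle to negate the result.
    public static int getNegated(int index) {
        try {
            VarHandle vh = MethodHandles.arrayElementVarHandle(int[].class);
            MethodHandle getter = vh.toMethodHandle(VarHandle.AccessMode.GET);
            MethodHandle negate = MethodHandles.lookup().findStatic(
                    Math.class, "negateExact",
                    MethodType.methodType(int.class, int.class));
            MethodHandle adapted = MethodHandles.filterReturnValue(getter, negate);
            return (int) adapted.invoke(DATA, index);
        } catch (Throwable t) {
            throw new AssertionError(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(getNegated(1)); // DATA[1] == 2, negated: -2
    }
}
```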
All the new adapters in the MemoryHandles API (which are actually defined inside VarHandles) are just a bunch of MH adapters stitched together into a brand new VH. The only caveat is a possible checked-exception mismatch: the VarHandle API methods are specified not to throw any checked exception, whereas method handles can throw any throwable. This means that calling get() on an adapted VarHandle could, potentially, result in a checked exception being thrown. To solve this gnarly issue, we scan all the filter functions passed to the VH combinators, looking for direct method handles which throw checked exceptions. If such MHs are found (they can be deeply nested, since the MHs can be adapted on their own), adaptation of the target VH fails fast.
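The fail-fast scan can be sketched roughly as follows, using only stable APIs (the real code also recurses into adapted handles; the helper names here are illustrative): a direct method handle can be cracked and inspected for declared checked exceptions.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;

// Sketch of the checked-exception scan described above: crack a direct
// method handle and look at the exceptions the underlying method declares.
public class CheckedExceptionScan {

    // Returns true if the direct method handle declares a checked exception.
    // (Non-direct handles are treated as clean here; a real scan would recurse.)
    static boolean throwsChecked(MethodHandles.Lookup lookup, MethodHandle mh) {
        try {
            Method m = lookup.revealDirect(mh).reflectAs(Method.class, lookup);
            for (Class<?> ex : m.getExceptionTypes()) {
                if (!RuntimeException.class.isAssignableFrom(ex)
                        && !Error.class.isAssignableFrom(ex)) {
                    return true;
                }
            }
        } catch (IllegalArgumentException e) {
            // not a direct method handle; the real implementation walks deeper
        }
        return false;
    }

    public static boolean sleepThrowsChecked() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodHandle sleep = lookup.findStatic(Thread.class, "sleep",
                    MethodType.methodType(void.class, long.class));
            return throwsChecked(lookup, sleep); // declares InterruptedException
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    public static boolean absThrowsChecked() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodHandle abs = lookup.findStatic(Math.class, "abs",
                    MethodType.methodType(int.class, int.class));
            return throwsChecked(lookup, abs);   // declares nothing
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sleepThrowsChecked()); // true
        System.out.println(absThrowsChecked());   // false
    }
}
```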
* More ByteBuffer implementation changes
Some more changes to ByteBuffer support were necessary here. First, we have added support for retrieval of "mapped" properties associated with a ByteBuffer (e.g. the file descriptor, etc.). This is crucial if we want to be able to turn an existing byte buffer into the "right kind" of memory segment.
Conversely, we also had to allow creation of mapped byte buffers from existing parameters, which is needed when going from a (mapped) segment to a buffer. These two pieces together allow us to go from segment to buffer and back without losing any information about the underlying memory mapping (which was an issue in the previous implementation).
Lastly, to support the new MappedMemorySegment abstraction, all the memory-mapping support functionality has been moved into a common helper class, so that MappedMemorySegmentImpl can reuse it (e.g. for MappedMemorySegment::force).
* Rewritten memory segment hierarchy
The old implementation had a monomorphic memory segment class. In this round we split the implementation into separate classes: one for heap segments (HeapMemorySegmentImpl), one for native segments (NativeMemorySegmentImpl) and one for memory-mapped segments (MappedMemorySegmentImpl, which extends NativeMemorySegmentImpl). Not much to see here, although one important point is that this split improved performance quite a bit, since now e.g. native/mapped segments are _guaranteed_ to have a null "base". We have also applied a few tricks to make sure that the "base" accessor for heap segments is sharply typed and NPE-checked, which allows C2 to speculate more and hoist. With these changes, _all_ segment types have comparable performance and hoisting guarantees (unlike in the old implementation).
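The shape of the split can be sketched like this (heavily simplified and with illustrative names; the real classes live in the JDK internals and carry liveness/ownership state as well):

```java
// Illustrative sketch of the segment hierarchy described above.
public class SegmentHierarchySketch {

    public static abstract class AbstractSegment {
        public abstract Object base();  // backing array, or null if off-heap
        public abstract long min();     // base offset, or native address
    }

    public static final class HeapSegment extends AbstractSegment {
        private final byte[] base;      // sharply typed, never null (NPE-checked)
        public HeapSegment(byte[] base) {
            this.base = java.util.Objects.requireNonNull(base);
        }
        @Override public Object base() { return base; }
        @Override public long min() { return 0; }
    }

    public static class NativeSegment extends AbstractSegment {
        private final long address;
        public NativeSegment(long address) { this.address = address; }
        @Override public Object base() { return null; }  // guaranteed null base
        @Override public long min() { return address; }
    }

    // mapped segments are native segments with extra capabilities (force, load, ...)
    public static final class MappedSegment extends NativeSegment {
        public MappedSegment(long address) { super(address); }
    }

    public static void main(String[] args) {
        System.out.println(new HeapSegment(new byte[4]).base() != null); // true
        System.out.println(new MappedSegment(0x1000).base() == null);    // true
    }
}
```

The point of the sharp typing is that the JIT sees a `byte[]`-typed, provably non-null base on the heap path and a constant null base on the native path, instead of a polymorphic `Object` field.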
* Add workarounds in MemoryAddressProxy and AbstractMemorySegmentImpl to special-case "small segments", so that the VM can apply bounds check elimination
This is another important piece, which allows us to get very good performance out of indexed memory access var handles. As you might know, the JIT compiler has trouble optimizing loops where the loop variable is a long [2]. To make up for that, in this round we add an optimization which allows the API to detect whether a segment is *small* or *large*. For small segments, the API realizes that there is no need for long arithmetic (e.g. to perform bounds checks, or offset additions), so it falls back to integer logic, which in turn enables bounds check elimination.
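A hedged sketch of this small/large special-casing (the names are illustrative, not the actual JDK internals): if the segment size fits in an int, offsets and bounds checks can be done with int arithmetic, which C2 optimizes (and eliminates in counted loops) much better than long arithmetic.

```java
// Sketch: route bounds checks through an int-only path for small segments.
public class SmallSegmentCheck {

    static void checkBounds(long size, long offset, long length) {
        if (size < Integer.MAX_VALUE                 // "small" segment
                && ((offset | length) >>> 31) == 0) { // offset/length fit in an int
            checkBoundsSmall((int) size, (int) offset, (int) length);
        } else if (length < 0 || offset < 0 || offset > size - length) {
            throw new IndexOutOfBoundsException(offset + " out of bounds for " + size);
        }
    }

    // int-only variant: the form the JIT can optimize via bounds check elimination
    static void checkBoundsSmall(int size, int offset, int length) {
        if (offset > size - length) {
            throw new IndexOutOfBoundsException(offset + " out of bounds for " + size);
        }
    }

    // convenience wrapper for demonstration
    public static boolean inBounds(long size, long offset, long length) {
        try {
            checkBounds(size, offset, length);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(inBounds(100, 96, 4)); // true: last 4 bytes
        System.out.println(inBounds(100, 97, 4)); // false: runs past the end
    }
}
```

The unsigned-shift guard ensures the int path is only taken when the cast is lossless: a negative or too-large offset/length falls through to the long path, which rejects it.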
* renaming of the various var handle classes to conform to "memory access var handle" terminology
This is mostly stylistic, nothing to see here.
Test changes
============
In addition to the tests for the new API changes, we've also added some stress tests for var handle combinators - e.g. there's a flag that can be enabled which turns on some "dummy" var handle adaptations on all var handles created by the runtime. We've used this flag on existing tests to make sure that things work as expected.
To sanity test the new memory segment spliterator, we have wired the new segment spliterator with the existing spliterator test harness.
We have also added several micro benchmarks for the memory segment API (and made some changes to the build script so that native libraries would be handled correctly).
[1] - https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newd...
[2] - https://bugs.openjdk.java.net/browse/JDK-8223051
[3] - https://openjdk.java.net/jeps/383
[4] - https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#nump...
On 24/04/2020 17:38, Paul Sandoz wrote:
On Apr 23, 2020, at 5:45 PM, Maurizio Cimadamore <maurizio.cimadamore@oracle.com> wrote:
On 24/04/2020 01:35, Paul Sandoz wrote:
Hi,
Looks good. I have seen almost all of this in reviews on panama-dev hence the lack of substantial comments here.
I suspect we are not gonna need the drop argument VH combinator, dropping coordinates feels a little suspicious to me, but I can see why it's there for completeness. Thanks Paul.
Re. drop coordinates, note that we're actually using it in MemoryHandles.withStride, so that we can insert a dummy coordinate that is always discarded in case the stride is zero. We could do without it as well, of course, but sometimes it's hard to have a sense of how these pieces might be joined in practice.
Ah, ok I can see its value now when dropping a coordinate that would otherwise result in some redundant calculation. However, looking at this more closely:
* @param bytesStride the stride, in bytes, by which to multiply the coordinate value. Must be greater than zero.
It implies that a zero value should be disallowed, contrary to the implementation. I am wondering if your intent was to support a signed stride value?
In this case it could be argued that a zero stride is misleading, if supported, since any value passed for the coordinate X has no effect. But I can also see the other side, from a position of uniformity; and then why not support negative strides, which I think the implementation does support.
That comment is bogus. In the previous iteration the stride had to be positive; now we have removed all restrictions, so that you can go forwards and backwards. We added zero for completeness, but the javadoc does not reflect that (I mean, the @throws does, but not the main method description). I'll rectify that. Thanks Maurizio
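For readers following along, the MH-level analogue of the dropCoordinates use discussed above can be shown with the standard MethodHandles.dropArguments combinator: when a stride is zero, the index coordinate has no effect, so the combinator can accept it and simply discard it (the class and method names below are illustrative only).

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;

// Sketch: insert a dummy coordinate that is always discarded, the same
// trick withStride uses (via dropCoordinates) when the stride is zero.
public class DropCoordinateSketch {

    public static int constantGet(long ignoredIndex) {
        try {
            MethodHandle get42 = MethodHandles.constant(int.class, 42);
            // Add a dummy long coordinate at position 0; its value is discarded,
            // just like an index multiplied by a zero stride.
            MethodHandle withDummy =
                    MethodHandles.dropArguments(get42, 0, long.class);
            return (int) withDummy.invoke(ignoredIndex);
        } catch (Throwable t) {
            throw new AssertionError(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(constantGet(123L)); // 42, regardless of the index
    }
}
```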
Paul.
Maurizio
Paul.
On Apr 23, 2020, at 1:33 PM, Maurizio Cimadamore <maurizio.cimadamore@oracle.com> wrote:
Another iteration, which addresses the following issues:

* wrong copyright headers in certain tests
* issue with TestNative when running in debug mode, caused by mismatched malloc/os::free (contributed by Jorn)
* clarify javadoc of MemoryHandles::withStride
* improved implementation of MemoryAccessVarHandleGenerator to use hidden classes rather than Unsafe.dAC (contributed by Mandy)

Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev
Delta webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev_delta/
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/specdiff/overview-summary...

Cheers
Maurizio

On 23/04/2020 21:33, Maurizio Cimadamore wrote:
Hi Maurizio, Mandy,

In https://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev/src/java.base/sha..., using a condy inside a static init makes me sad; using a late-loading constant to eagerly load it in the static init seems unnatural. I don't think you need to store all the values into static fields: you can directly do an ldc + aaload with the right index, right where you need it. ASM will generate only one condy constant pool entry for all the ldcs (so once one of the ldcs is executed, the class data will be available to the other ldcs), and the JIT will fold the access to the array, given that the array is a constant.

regards,
Rémi

----- Original Mail -----
From: "Maurizio Cimadamore" <maurizio.cimadamore@oracle.com> To: "core-libs-dev" <core-libs-dev@openjdk.java.net> Sent: Monday, 27 April 2020 14:13:55 Subject: Re: RFR 8243491: Implementation of Foreign-Memory Access API (Second Incubator)
Another iteration, which addresses the following issues:
* wrong copyright headers in certain tests * issue with TestNative when running in debug mode caused by mismatched malloc/os::free (contributed by Jorn) * clarify javadoc of MemoryHandles::withStride * improved implementation of MemoryAccessVarHandleGenerator to use hidden classes rather than Unsafe.dAC (contributed by Mandy)
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev
Delta webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev_delta/
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/specdiff/overview-summary...
Cheers Maurizio
On 23/04/2020 21:33, Maurizio Cimadamore wrote:
Hi, time has come for another round of foreign memory access API incubation (see JEP 383 [3]). This iteration aims at polishing some of the rough edges of the API, and adds some of the functionalities that developers have been asking for during this first round of incubation. The revised API tightens the thread-confinement constraints (by removing the MemorySegment::acquire method) and instead provides more targeted support for parallel computation via a segment spliterator. The API also adds a way to create a custom native segment; this is, essentially, an unsafe API point, very similar in spirit to the JNI NewDirectByteBuffer functionality [1]. By using this bit of API, power-users will be able to add support, via MemorySegment, to *their own memory sources* (e.g. think of a custom allocator written in C/C++). For now, this API point is called off as "restricted" and a special read-only JDK property will have to be set on the command line for calls to this method to succeed. We are aware there's no precedent for something like this in the Java SE API - but if Project Panama is to remain true about its ultimate goal of replacing bits of JNI code with (low level) Java code, stuff like this has to be *possible*. We anticipate that, at some point, this property will become a true launcher flag, and that the foreign restricted machinery will be integrated more neatly into the module system.
A list of the API, implementation and test changes is provided below. If you have any questions, or need more detailed explanations, I (and the rest of the Panama team) will be happy to point at existing discussions, and/or to provide the feedback required.
Thanks Maurizio
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary...
CSR:
https://bugs.openjdk.java.net/browse/JDK-8243496
API changes ===========
* MemorySegment - drop support for acquire() method - in its place now you can obtain a spliterator from a segment, which supports divide-and-conquer - revamped support for views - e.g. isReadOnly - now segments have access modes - added API to do serial confinement hand-off (MemorySegment::withOwnerThread) - added unsafe factory to construct a native segment out of an existing address; this API is "restricted" and only available if the program is executed using the -Dforeign.unsafe=permit flag. - the MemorySegment::mapFromPath now returns a MappedMemorySegment * MappedMemorySegment - small sub-interface which provides extra capabilities for mapped segments (load(), unload() and force()) * MemoryAddress - added distinction between *checked* and *unchecked* addresses; *unchecked* addresses do not have a segment, so they cannot be dereferenced - added NULL memory address (it's an unchecked address) - added factory to construct MemoryAddress from long value (result is also an unchecked address) - added API point to get raw address value (where possible - e.g. if this is not an address pointing to a heap segment) * MemoryLayout - Added support for layout "attributes" - e.g. store metadata inside MemoryLayouts - Added MemoryLayout::isPadding predicate - Added helper function to SequenceLayout to rehape/flatten sequence layouts (a la NDArray [4]) * MemoryHandles - add support for general VarHandle combinators (similar to MH combinators) - add a combinator to turn a long-VH into a MemoryAddress VH (the resulting MemoryAddress is also *unchecked* and cannot be dereferenced)
Implementation changes ======================
* add support for VarHandle combinators (e.g. IndirectVH)
The idea here is simple: a VarHandle can almost be thought of as a set of method handles (one for each access mode supported by the var handle) that are lazily linked. This gives us a relatively simple idea upon which to build support for custom var handle adapters: we could create a VarHandle by passing an existing var handle and also specify the set of adaptations that should be applied to the method handle for a given access mode in the original var handle. The result is a new VarHandle which might support a different carrier type and more, or less coordinate types. Adding this support was relatively easy - and it only required one low-level surgery of the lambda forms generated for adapted var handle (this is required so that the "right" var handle receiver can be used for dispatching the access mode call).
All the new adapters in the MemoryHandles API (which are really defined inside VarHandles) are just a bunch of MH adapters that are stitched together into a brand-new VH. The only caveat is that we could have a checked-exception mismatch: the VarHandle API methods are specified not to throw any checked exception, whereas method handles can throw any throwable. This means that, potentially, calling get() on an adapted VarHandle could result in a checked exception being thrown; to solve this gnarly issue, we decided to scan all the filter functions passed to the VH combinators and look for direct method handles which throw checked exceptions. If such MHs are found (they can be deeply nested, since the MHs can be adapted on their own), adaptation of the target VH fails fast.
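The fail-fast behavior survives in the standardized java.lang.invoke.MethodHandles.filterValue combinator (JDK 16+), so it can be sketched with that form - class and helper names below are made up; the point is that a filter whose direct method handle declares a checked exception is rejected at adaptation time with IllegalArgumentException, not at get()/set() time:

```java
import java.io.IOException;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.VarHandle;

public class CheckedFilterRejected {
    // A filter that declares a checked exception - VarHandle access modes are
    // specified not to throw checked exceptions, so this filter is unusable.
    static int parseOrThrow(String s) throws IOException {
        if (s.isEmpty()) throw new IOException("empty");
        return Integer.parseInt(s);
    }

    static String toStr(int i) {
        return Integer.toString(i);
    }

    static boolean adaptationFailsFast() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            VarHandle intElem = MethodHandles.arrayElementVarHandle(int[].class);
            MethodHandle toTarget = lookup.findStatic(CheckedFilterRejected.class,
                    "parseOrThrow", MethodType.methodType(int.class, String.class));
            MethodHandle fromTarget = lookup.findStatic(CheckedFilterRejected.class,
                    "toStr", MethodType.methodType(String.class, int.class));
            MethodHandles.filterValue(intElem, toTarget, fromTarget);
            return false; // adaptation unexpectedly succeeded
        } catch (IllegalArgumentException expected) {
            return true;  // rejected when the combinator is applied
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(adaptationFailsFast());
    }
}
```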
* More ByteBuffer implementation changes
Some more changes to ByteBuffer support were necessary here. First, we have added support for retrieval of "mapped" properties associated with a ByteBuffer (e.g. the file descriptor, etc.). This is crucial if we want to be able to turn an existing byte buffer into the "right kind" of memory segment.
Conversely, we also had to allow creation of mapped byte buffers from existing parameters - which is needed when going from a (mapped) segment to a buffer. These two pieces together allow us to go from segment to buffer and back without losing any information about the underlying memory mapping (which was an issue in the previous implementation).
Lastly, to support the new MappedMemorySegment abstraction, all the memory-mapping support functionality has been moved into a common helper class, so that MappedMemorySegmentImpl can reuse it (e.g. for MappedMemorySegment::force).
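For context, the operations MappedMemorySegment exposes mirror the long-standing MappedByteBuffer ones; this sketch (class name made up) shows the java.nio analogue of what MappedMemorySegment::force provides - writing through a mapping and flushing the dirty pages to the file:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedDemo {
    // Map a small temp file, write through the mapping, and force the dirty
    // pages to storage - the java.nio counterpart of MappedMemorySegment::force
    // (MappedMemorySegment::load similarly mirrors MappedByteBuffer::load).
    static byte readBack() {
        try {
            Path tmp = Files.createTempFile("mapped-demo", ".bin");
            try (FileChannel ch = FileChannel.open(tmp,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // READ_WRITE mapping extends the (empty) file to 8 bytes
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 8);
                buf.put(0, (byte) 42);   // write through the mapping
                buf.force();             // flush the mapped region to the file
                return buf.get(0);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readBack());
    }
}
```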
* Rewritten memory segment hierarchy
The old implementation had a monomorphic memory segment class. In this round we split the implementation into several classes: one for heap segments (HeapMemorySegmentImpl), one for native segments (NativeMemorySegmentImpl) and one for memory-mapped segments (MappedMemorySegmentImpl, which extends NativeMemorySegmentImpl). Not much to see here - although one important point is that, by doing this, we have been able to speed things up quite a bit, since now e.g. native/mapped segments are _guaranteed_ to have a null "base". We have also used a few tricks to make sure that the "base" accessor for heap segments is sharply typed and NPE-checked, which allows C2 to speculate more and hoist. With these changes _all_ segment types have comparable performance and hoisting guarantees (unlike in the old implementation).
* Add workarounds in MemoryAddressProxy and AbstractMemorySegmentImpl to special-case "small segments", so that the VM can apply bounds-check elimination
This is another important piece, which allows us to get very good performance out of indexed memory access var handles; as you might know, the JIT compiler has trouble optimizing loops where the loop variable is a long [2]. To make up for that, in this round we add an optimization which allows the API to detect whether a segment is *small* or *large*. For small segments, the API realizes that there is no need to perform long arithmetic (e.g. for bounds checks, or offset additions), so it falls back to integer logic, which in turn allows bounds-check elimination.
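The small/large split can be sketched as follows - this is an illustrative reconstruction, not the JDK's actual code, and the names are made up: when every quantity fits in an int, the bounds arithmetic can be done with int operations, which C2 optimizes (and can eliminate in counted loops) far better than the equivalent long arithmetic:

```java
public class SmallSegmentCheck {
    // Illustrative reconstruction of the small-segment fast path: a segment
    // whose length fits in an int can do all its bounds arithmetic in int,
    // enabling bounds-check elimination that long arithmetic defeats [2].
    static void checkBounds(long offset, long size, long length) {
        if (length < Integer.MAX_VALUE && offset == (int) offset && size == (int) size) {
            // "small segment" fast path: pure int logic
            int o = (int) offset, s = (int) size, l = (int) length;
            if (o < 0 || s < 0 || o > l - s) {
                throw new IndexOutOfBoundsException(
                        offset + " out of bounds for length " + length);
            }
        } else {
            // "large segment" slow path: full long arithmetic
            if (offset < 0 || size < 0 || offset > length - size) {
                throw new IndexOutOfBoundsException(
                        offset + " out of bounds for length " + length);
            }
        }
    }

    public static void main(String[] args) {
        checkBounds(96, 4, 100);  // last valid 4-byte access
        try {
            checkBounds(97, 4, 100);
            throw new AssertionError("should have thrown");
        } catch (IndexOutOfBoundsException expected) {
            System.out.println("out-of-bounds access rejected");
        }
    }
}
```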
* renaming of the various var handle classes to conform to "memory access var handle" terminology
This is mostly stylistic, nothing to see here.
Test changes ============
In addition to the tests for the new API changes, we've also added some stress tests for var handle combinators - e.g. there's a flag that can be enabled which turns on some "dummy" var handle adaptations on all var handles created by the runtime. We've used this flag on existing tests to make sure that things work as expected.
To sanity test the new memory segment spliterator, we have wired the new segment spliterator with the existing spliterator test harness.
We have also added several micro benchmarks for the memory segment API (and made some changes to the build script so that native libraries would be handled correctly).
[1] - https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newd... [2] - https://bugs.openjdk.java.net/browse/JDK-8223051 [3] - https://openjdk.java.net/jeps/383 [4] - https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#nump...
On 4/27/20 2:42 PM, Remi Forax wrote:
Hi Maurizio, Mandy,
In https://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev/src/java.base/sha..., using a condy inside a static initializer makes me sad; using a lazily-loaded constant only to load it eagerly in the static initializer seems unnatural.
This is an interim solution due to JDK-8243492, which causes a VM crash if the constant is loaded lazily.
I don't think you need to store all the values into static fields; you can directly do an ldc + aaload with the right index right where you need it.
I think this is what you are thinking, as reported in JDK-8243492: http://cr.openjdk.java.net/~mchung/jdk15/webrevs/8239578/webrev.01-accessor/
Mandy
ASM will generate only one condy constant-pool entry for all the ldcs (so once one of the ldcs is executed, the class data will be available to the other ldcs), and the JIT will fold the access to the array, given that the array is a constant.
regards, Rémi
----- Original Message -----
From: "Maurizio Cimadamore" <maurizio.cimadamore@oracle.com>
To: "core-libs-dev" <core-libs-dev@openjdk.java.net>
Sent: Monday 27 April 2020 14:13:55
Subject: Re: RFR 8243491: Implementation of Foreign-Memory Access API (Second Incubator)

Another iteration, which addresses the following issues:
* wrong copyright headers in certain tests
* issue with TestNative when running in debug mode, caused by mismatched malloc/os::free (contributed by Jorn)
* clarified javadoc of MemoryHandles::withStride
* improved implementation of MemoryAccessVarHandleGenerator to use hidden classes rather than Unsafe.dAC (contributed by Mandy)
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev
Delta webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev_delta/
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/specdiff/overview-summary...
Cheers Maurizio
On 23/04/2020 21:33, Maurizio Cimadamore wrote:
From: "mandy chung" <mandy.chung@oracle.com>
To: "Remi Forax" <forax@univ-mlv.fr>, "Maurizio Cimadamore" <maurizio.cimadamore@oracle.com>
Cc: "core-libs-dev" <core-libs-dev@openjdk.java.net>
Sent: Tuesday 28 April 2020 01:17:05
Subject: Re: RFR 8243491: Implementation of Foreign-Memory Access API (Second Incubator)
On 4/27/20 2:42 PM, Remi Forax wrote:
Hi Maurizio, Mandy,
In https://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev/src/java.base/sha..., using a condy inside a static initializer makes me sad; using a lazily-loaded constant only to load it eagerly in the static initializer seems unnatural.
This is an interim solution due to JDK-8243492, which causes a VM crash if the constant is loaded lazily. Interesting - it's because the BSM is an instance method and not a static method.
I don't think you need to store all the values into static fields, you can directly do a ldc + aaload with the right index right where you need it,
I think this is what you are thinking, as reported in JDK-8243492: http://cr.openjdk.java.net/~mchung/jdk15/webrevs/8239578/webrev.01-accessor/
if the accessors are declared ACC_STATIC, yes!
Mandy

Rémi
On 4/28/20 12:58 AM, forax@univ-mlv.fr wrote:
I don't think you need to store all the values into static fields, you can directly do a ldc + aaload with the right index right where you need it,
I think this is what you are thinking as reported in JDK-8243492: http://cr.openjdk.java.net/~mchung/jdk15/webrevs/8239578/webrev.01-accessor/
if the accessors are declared ACC_STATIC, yes !
Thanks for catching this; this way JDK-8243492 will not be hit.
Here is the revised patch: http://cr.openjdk.java.net/~mchung/jdk15/webrevs/8239578/webrev.02/
Maurizio - do you mind re-merging MemoryAccessVarHandleGenerator.java with webrev.02?
thanks
Mandy
On 28/04/2020 19:09, Mandy Chung wrote:
Maurizio - do you mind re-merging MemoryAccessVarHandleGenerator.java with webrev.02?
I'll take care of that
Maurizio
Not going down this road, sorry :-)

I've added the changes (see attached patch), and all benchmarks are several orders of magnitude slower. I think this is mostly caused by the fact that the addOffset/multiplyOffset handles are no longer cached in static constants.

While I understand that there might be better ways to generate the code, I'd strongly prefer to leave the code as per the last iteration. I can't honestly see how having 3-4 static fields in a synthetic VarHandle class is going to hurt (but I can see how much it hurts not to have those fields).

Cheers
Maurizio
Hi,

I think the perf problem might be caused by the fact that, while the array is now a constant, its elements are not (the array is mutable, after all). For fields you can fix this by using @Stable, but not for CP entries :)

I think what could work is: rather than ldc'ing the array and then looking up the values with 'normal' Java code, we could have another dynamic constant that does the array lookup as well. Then the resolved value is stored in a separate CP slot as a true constant. We would probably want a bootstrap method in ConstantBootstraps that can do an arbitrary array lookup, given an array and an index.

Given the amount of work, I'd say this is definitely something that should be saved for another time, also since there doesn't appear to be a major payoff for doing it at the moment.

Jorn

On 29/04/2020 03:13, Maurizio Cimadamore wrote:
----- Mail original -----
De: "Jorn Vernee" <jorn.vernee@oracle.com> À: "Maurizio Cimadamore" <maurizio.cimadamore@oracle.com>, "mandy chung" <mandy.chung@oracle.com>, "Remi Forax" <forax@univ-mlv.fr> Cc: "core-libs-dev" <core-libs-dev@openjdk.java.net> Envoyé: Mercredi 29 Avril 2020 22:09:47 Objet: Re: RFR 8243491: Implementation of Foreign-Memory Access API (Second Incubator)
Hi,
Hi Jorn,
I think the problem with perf might be caused by the fact that while the array is now a constant, the elements are not (the array is mutable after all). For fields you can fix this by using @Stable, but not for CP entries :)
I think you're right,
I think what could work is: rather than ldc'ing the array, and then looking up the values with 'normal' Java code, we could have another dynamic constant that does the array lookup as well. Then the resolved value is stored in a separate CP slot as a true constant. We probably want to have a bootstrap method in ConstantBootstraps that can do an arbitrary array lookup, given an array and an index.
Given the amount of work, I'd say it's definitely something that should be saved for another time, especially since there doesn't appear to be a major payoff for doing it at the moment.
I think the alternative is to have a small class instead of an array, with all the fields marked as @Stable; this will also avoid the cast, because the fields will have the right type.
Jorn
Rémi
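Jorn's dynamic-constant idea could be sketched as a bootstrap method like the one below (the name `arrayElement` and its shape are illustrative assumptions, not an existing ConstantBootstraps API); the VM would invoke it once when the Constant_Dynamic entry is resolved, caching the result in its own CP slot as a true constant:

```java
import java.lang.invoke.MethodHandles;

public class ArrayElementBootstrap {
    // Hypothetical bootstrap for a dynamic constant: the array and the index
    // arrive as static bootstrap arguments; the returned element is then
    // stored by the VM in the resolved constant pool slot.
    public static Object arrayElement(MethodHandles.Lookup lookup, String name,
                                      Class<?> type, Object[] array, int index) {
        return type.cast(array[index]);
    }

    public static void main(String[] args) {
        Object[] constants = { "addOffset", "multiplyOffset" };
        // Simulate what an ldc of such a dynamic constant would resolve to
        Object resolved = arrayElement(MethodHandles.lookup(), "elem",
                                       String.class, constants, 1);
        System.out.println(resolved); // prints "multiplyOffset"
    }
}
```

The `type.cast` call also gives the early type check that a @Stable-fields holder class would get for free, per Rémi's suggestion.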
On 4/29/20 2:30 PM, forax@univ-mlv.fr wrote:
I think the problem with perf might be caused by the fact that while the array is now a constant, the elements are not (the array is mutable after all). For fields you can fix this by using @Stable, but not for CP entries :)
I think you're right,
Ah, I missed that!
I think what could work is; rather than ldc'ing the array, and then looking up the values with 'normal' Java code, we could have another dynamic constant that does the the array lookup as well. Then the resolved value is stored in a separate CP slot as a true constant. We probably want to have a bootstrap method in ConstantBootstraps that can do an arbitrary array lookup given an array and an index for that.
FYI, I'm exploring `classDataAt`, to get a specific element/entry from the class data of an immutable List or Map. [1] http://cr.openjdk.java.net/~mchung/jdk15/webrevs/8239578/webrev.00/src/java....
I agree to keep the code per last iteration. We can always improve this in the future with performance measurement. Mandy
Hi Mandy, I've taken a look at the code; I think it's better to have two methods, one for List and one for Map, to avoid having a bootstrap argument (classDataType) and to keep the code more straightforward. Rémi
Hi Remi, Thanks for the feedback. We can take this off this review thread and roll it into JDK-8230501. Mandy On 4/30/20 11:03 AM, forax@univ-mlv.fr wrote:
----- Mail original -----
De: "Maurizio Cimadamore" <maurizio.cimadamore@oracle.com> À: "mandy chung" <mandy.chung@oracle.com>, "Remi Forax" <forax@univ-mlv.fr> Cc: "core-libs-dev" <core-libs-dev@openjdk.java.net> Envoyé: Mercredi 29 Avril 2020 03:13:35 Objet: Re: RFR 8243491: Implementation of Foreign-Memory Access API (Second Incubator)
I somehow missed that email. OK, let's go back to the blackboard :) Rémi
De: "mandy chung" <mandy.chung@oracle.com> À: "Remi Forax" <forax@univ-mlv.fr> Cc: "Maurizio Cimadamore" <maurizio.cimadamore@oracle.com>, "core-libs-dev" <core-libs-dev@openjdk.java.net> Envoyé: Mardi 28 Avril 2020 20:09:07 Objet: Re: RFR 8243491: Implementation of Foreign-Memory Access API (Second Incubator)
Here is the revised patch: http://cr.openjdk.java.net/~mchung/jdk15/webrevs/8239578/webrev.02/
Looks good to me!
Maurizio - do you mind remerge MemoryAccessVarHandleGenerator.java with webrev.02?
thanks Mandy cheers, Rémi
Latest iteration - the following issues were addressed:

* fix a bug with alignment of native segments triggering spurious failures (contributed by Jorn)
* fix javadoc and tests for access modes (contributed by Chris)
* added benchmarks for Stream::findAny using segment spliterator (contributed by Peter)
* addressed CSR comments

Webrev: http://cr.openjdk.java.net/~mcimadamore/8243491_v3/webrev
Delta webrev: http://cr.openjdk.java.net/~mcimadamore/8243491_v3/webrev_delta/
Javadoc: http://cr.openjdk.java.net/~mcimadamore/8243491_v3/javadoc
Specdiff: http://cr.openjdk.java.net/~mcimadamore/8243491_v3/specdiff/overview-summary...

Cheers Maurizio

On 27/04/2020 13:13, Maurizio Cimadamore wrote:
Another iteration, which addresses the following issues:
* wrong copyright headers in certain tests
* issue with TestNative when running in debug mode caused by mismatched malloc/os::free (contributed by Jorn)
* clarify javadoc of MemoryHandles::withStride
* improved implementation of MemoryAccessVarHandleGenerator to use hidden classes rather than Unsafe.dAC (contributed by Mandy)
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev
Delta webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev_delta/
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/specdiff/overview-summary...
Cheers Maurizio
On 23/04/2020 21:33, Maurizio Cimadamore wrote:
Hi, time has come for another round of foreign memory access API incubation (see JEP 383 [3]). This iteration aims at polishing some of the rough edges of the API, and adds some of the functionality that developers have been asking for during this first round of incubation.

The revised API tightens the thread-confinement constraints (by removing the MemorySegment::acquire method) and instead provides more targeted support for parallel computation via a segment spliterator. The API also adds a way to create a custom native segment; this is, essentially, an unsafe API point, very similar in spirit to the JNI NewDirectByteBuffer functionality [1]. By using this bit of API, power users will be able to add support, via MemorySegment, for *their own memory sources* (e.g. think of a custom allocator written in C/C++). For now, this API point is called out as "restricted", and a special read-only JDK property will have to be set on the command line for calls to this method to succeed. We are aware there's no precedent for something like this in the Java SE API - but if Project Panama is to remain true to its ultimate goal of replacing bits of JNI code with (low level) Java code, stuff like this has to be *possible*. We anticipate that, at some point, this property will become a true launcher flag, and that the foreign restricted machinery will be integrated more neatly into the module system.
A list of the API, implementation and test changes is provided below. If you have any questions, or need more detailed explanations, I (and the rest of the Panama team) will be happy to point at existing discussions, and/or to provide the feedback required.
Thanks Maurizio
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary...
CSR:
https://bugs.openjdk.java.net/browse/JDK-8243496
API changes ===========
* MemorySegment
  - drop support for acquire() method; in its place you can now obtain a spliterator from a segment, which supports divide-and-conquer
  - revamped support for views (e.g. isReadOnly); segments now have access modes
  - added API to do serial confinement hand-off (MemorySegment::withOwnerThread)
  - added unsafe factory to construct a native segment out of an existing address; this API is "restricted" and only available if the program is executed using the -Dforeign.unsafe=permit flag
  - MemorySegment::mapFromPath now returns a MappedMemorySegment
* MappedMemorySegment
  - small sub-interface which provides extra capabilities for mapped segments (load(), unload() and force())
* MemoryAddress
  - added distinction between *checked* and *unchecked* addresses; *unchecked* addresses do not have a segment, so they cannot be dereferenced
  - added NULL memory address (an unchecked address)
  - added factory to construct a MemoryAddress from a long value (the result is also an unchecked address)
  - added API point to get the raw address value (where possible, e.g. if this is not an address pointing to a heap segment)
* MemoryLayout
  - added support for layout "attributes", e.g. to store metadata inside MemoryLayouts
  - added MemoryLayout::isPadding predicate
  - added helper function to SequenceLayout to reshape/flatten sequence layouts (a la NDArray [4])
* MemoryHandles
  - added support for general VarHandle combinators (similar to MH combinators)
  - added a combinator to turn a long VH into a MemoryAddress VH (the resulting MemoryAddress is also *unchecked* and cannot be dereferenced)
Implementation changes ======================
* add support for VarHandle combinators (e.g. IndirectVH)
The idea here is simple: a VarHandle can almost be thought of as a set of method handles (one for each access mode supported by the var handle) that are lazily linked. This gives us a relatively simple idea upon which to build support for custom var handle adapters: we can create a VarHandle by passing an existing var handle, and specify the set of adaptations that should be applied to the method handle for a given access mode of the original var handle. The result is a new VarHandle which might support a different carrier type and more, or fewer, coordinate types. Adding this support was relatively easy, and it only required one bit of low-level surgery on the lambda forms generated for adapted var handles (this is required so that the "right" var handle receiver can be used for dispatching the access mode call).
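The "set of lazily linked method handles" view can be seen directly through the long-standing VarHandle::toMethodHandle API; this small sketch (class name illustrative) shows how each access mode corresponds to a method handle - adapting those per-mode handles and re-wrapping them is essentially what a VarHandle combinator does:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class VarHandleAsMethodHandles {
    public static void main(String[] args) {
        try {
            VarHandle vh = MethodHandles.arrayElementVarHandle(int[].class);

            // One method handle per access mode; a combinator would adapt
            // these (filter arguments, change carrier type, ...) and bundle
            // the adapted handles back into a new VarHandle.
            MethodHandle getter = vh.toMethodHandle(VarHandle.AccessMode.GET);
            MethodHandle setter = vh.toMethodHandle(VarHandle.AccessMode.SET);

            int[] arr = new int[4];
            setter.invoke(arr, 2, 42);
            int v = (int) getter.invoke(arr, 2);
            System.out.println(v); // prints 42
        } catch (Throwable t) {
            throw new AssertionError(t);
        }
    }
}
```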
All the new adapters in the MemoryHandles API (which are really defined inside VarHandles) are really just a bunch of MH adapters that are stitched together into a brand new VH. The only caveat is that we could have a checked exception mismatch: the VarHandle API methods are specified not to throw any checked exceptions, whereas method handles can throw any throwable. This means that, potentially, calling get() on an adapted VarHandle could result in a checked exception being thrown. To solve this gnarly issue, we decided to scan all the filter functions passed to the VH combinators and look for direct method handles which throw checked exceptions. If such MHs are found (they can be deeply nested, since the MHs can be adapted on their own), adaptation of the target VH fails fast.
* More ByteBuffer implementation changes
Some more changes to ByteBuffer support were necessary here. First, we have added support for retrieval of "mapped" properties associated with a ByteBuffer (e.g. the file descriptor, etc.). This is crucial if we want to be able to turn an existing byte buffer into the "right kind" of memory segment.
Conversely, we also have to allow creation of mapped byte buffers given existing parameters - which is needed when going from (mapped) segment to a buffer. These two pieces together allow us to go from segment to buffer and back w/o losing any information about the underlying memory mapping (which was an issue in the previous implementation).
Lastly, to support the new MappedMemorySegment abstraction, all the memory mapped supporting functionalities have been moved into a common helper class so that MappedMemorySegmentImpl can reuse that (e.g. for MappedMemorySegment::force).
* Rewritten memory segment hierarchy
The old implementation had a monomorphic memory segment class. In this round we aimed at splitting the various implementation classes, so that we have a class for heap segments (HeapMemorySegmentImpl), one for native segments (NativeMemorySegmentImpl) and one for memory mapped segments (MappedMemorySegmentImpl, which extends NativeMemorySegmentImpl). Not much to see here - although one important point is that, by doing this, we have been able to speed up performance quite a bit, since now e.g. native/mapped segments are _guaranteed_ to have a null "base". We have also done a few tricks to make sure that the "base" accessor for heap segments is sharply typed and also NPE-checked, which allows C2 to speculate more and hoist. With these changes _all_ segment types have comparable performance and hoisting guarantees (unlike in the old implementation).
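A toy mirror of the hierarchy described above (simplified names and members, purely illustrative of the "guaranteed null base for native segments, sharply typed base for heap segments" point):

```java
import java.util.Objects;

public class SegmentHierarchySketch {
    abstract static class AbstractSegmentImpl {
        // heap segments return the backing array; native/mapped segments return null
        abstract Object base();
    }

    static class NativeSegmentImpl extends AbstractSegmentImpl {
        @Override
        Object base() { return null; } // guaranteed null: C2 can constant-fold this
    }

    static class MappedSegmentImpl extends NativeSegmentImpl { }

    static class HeapSegmentImpl extends AbstractSegmentImpl {
        private final byte[] array;
        HeapSegmentImpl(byte[] array) { this.array = Objects.requireNonNull(array); }
        @Override
        byte[] base() { return array; } // sharply typed (byte[], not Object), NPE-checked
    }

    public static void main(String[] args) {
        AbstractSegmentImpl heap = new HeapSegmentImpl(new byte[16]);
        AbstractSegmentImpl nat  = new NativeSegmentImpl();
        System.out.println(heap.base() != null && nat.base() == null); // prints true
    }
}
```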
* Add workarounds in MemoryAddressProxy and AbstractMemorySegmentImpl to special-case "small segments", so that the VM can apply bound check elimination
This is another important piece, which allows us to get very good performance out of indexed memory access var handles; as you might know, the JIT compiler has trouble optimizing loops where the loop variable is a long [2]. To make up for that, in this round we add an optimization which allows the API to detect whether a segment is *small* or *large*. For small segments, the API realizes that there is no need to perform long computations (e.g. for bound checks, or offset additions), so it falls back to integer logic, which in turn allows bound check elimination.
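A toy version of the small/large distinction (hypothetical helper, not the actual MemoryAddressProxy code) might look like this:

```java
public class SmallSegmentCheck {
    static final long SMALL_LIMIT = Integer.MAX_VALUE;

    // Toy bound check. For "small" segments every quantity fits in an int,
    // so the check (and loop induction variables in user code) can stay in
    // int arithmetic, which the JIT optimizes far better than long arithmetic.
    public static void checkBounds(long segmentSize, long offset, long accessSize) {
        if (segmentSize < SMALL_LIMIT && offset >= 0 && accessSize >= 0
                && offset <= SMALL_LIMIT && accessSize <= SMALL_LIMIT) {
            // small segment: pure int logic, amenable to bound check elimination
            int size = (int) segmentSize;
            int off  = (int) offset;
            int len  = (int) accessSize;
            if (off > size - len) throw new IndexOutOfBoundsException();
        } else {
            // large segment (or out-of-range inputs): fall back to long logic
            if (offset < 0 || accessSize < 0 || offset > segmentSize - accessSize) {
                throw new IndexOutOfBoundsException();
            }
        }
    }

    public static void main(String[] args) {
        checkBounds(100, 96, 4);      // in bounds: ok
        try {
            checkBounds(100, 97, 4);  // one past the end: must throw
            throw new AssertionError("expected IndexOutOfBoundsException");
        } catch (IndexOutOfBoundsException expected) { }
        System.out.println("ok"); // prints "ok"
    }
}
```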
* renaming of the various var handle classes to conform to "memory access var handle" terminology
This is mostly stylistic, nothing to see here.
Tests changes =============
In addition to the tests for the new API changes, we've also added some stress tests for var handle combinators - e.g. there's a flag that can be enabled which turns on some "dummy" var handle adaptations on all var handles created by the runtime. We've used this flag on existing tests to make sure that things work as expected.
To sanity-test the new memory segment spliterator, we have wired it into the existing spliterator test harness.
We have also added several micro benchmarks for the memory segment API (and made some changes to the build script so that native libraries would be handled correctly).
[1] - https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newd... [2] - https://bugs.openjdk.java.net/browse/JDK-8223051 [3] - https://openjdk.java.net/jeps/383 [4] - https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#nump...
Another iteration which addresses the latest CSR comments (the CSR has now been approved):

* make MemorySegment::withAccessModes/hasAccessMode throw IllegalArgumentException in cases where the provided mask is invalid (this required a test tweak)
* sprinkled a couple of references to the JLS in the package javadoc, as per CSR suggestions
* fixed the ParallelSum::findAny_bulk benchmarks, which were (erroneously) not testing all the elements in the segment

Webrev: http://cr.openjdk.java.net/~mcimadamore/8243491_v4/webrev
Delta webrev: http://cr.openjdk.java.net/~mcimadamore/8243491_v4/webrev_delta/

Cheers Maurizio

On 01/05/2020 12:06, Maurizio Cimadamore wrote:
Latest iteration - the follow issues were addressed:
* fix a bug with alignment of native segments triggering spurious failures (contributed by Jorn) * fix javadoc and tests for access modes (contributed by Chris) * added benchmarks for Stream::findAny using segment spliterator (contributed by Peter) * addressed CSR comments
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v3/webrev
Delta webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v3/webrev_delta/
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v3/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v3/specdiff/overview-summary...
Cheers Maurizio
On 27/04/2020 13:13, Maurizio Cimadamore wrote:
Another iteration, which addresses the following issues:
* wrong copyright headers in certain tests * issue with TestNative when running in debug mode caused by mismatched malloc/os::free (contributed by Jorn) * clarify javadoc of MemoryHandles::withStride * improved implementation of MemoryAccessVarHandleGenerator to use hidden classes rather than Unsafe.dAC (contributed by Mandy)
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev
Delta webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev_delta/
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/specdiff/overview-summary...
Cheers Maurizio
On 23/04/2020 21:33, Maurizio Cimadamore wrote:
Hi, time has come for another round of foreign memory access API incubation (see JEP 383 [3]). This iteration aims at polishing some of the rough edges of the API, and adds some of the functionalities that developers have been asking for during this first round of incubation. The revised API tightens the thread-confinement constraints (by removing the MemorySegment::acquire method) and instead provides more targeted support for parallel computation via a segment spliterator. The API also adds a way to create a custom native segment; this is, essentially, an unsafe API point, very similar in spirit to the JNI NewDirectByteBuffer functionality [1]. By using this bit of API, power-users will be able to add support, via MemorySegment, to *their own memory sources* (e.g. think of a custom allocator written in C/C++). For now, this API point is called off as "restricted" and a special read-only JDK property will have to be set on the command line for calls to this method to succeed. We are aware there's no precedent for something like this in the Java SE API - but if Project Panama is to remain true about its ultimate goal of replacing bits of JNI code with (low level) Java code, stuff like this has to be *possible*. We anticipate that, at some point, this property will become a true launcher flag, and that the foreign restricted machinery will be integrated more neatly into the module system.
A list of the API, implementation and test changes is provided below. If you have any questions, or need more detailed explanations, I (and the rest of the Panama team) will be happy to point at existing discussions, and/or to provide the feedback required.
Thanks Maurizio
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary...
CSR:
https://bugs.openjdk.java.net/browse/JDK-8243496
API changes ===========
* MemorySegment - drop support for acquire() method - in its place now you can obtain a spliterator from a segment, which supports divide-and-conquer - revamped support for views - e.g. isReadOnly - now segments have access modes - added API to do serial confinement hand-off (MemorySegment::withOwnerThread) - added unsafe factory to construct a native segment out of an existing address; this API is "restricted" and only available if the program is executed using the -Dforeign.unsafe=permit flag. - the MemorySegment::mapFromPath now returns a MappedMemorySegment * MappedMemorySegment - small sub-interface which provides extra capabilities for mapped segments (load(), unload() and force()) * MemoryAddress - added distinction between *checked* and *unchecked* addresses; *unchecked* addresses do not have a segment, so they cannot be dereferenced - added NULL memory address (it's an unchecked address) - added factory to construct MemoryAddress from long value (result is also an unchecked address) - added API point to get raw address value (where possible - e.g. if this is not an address pointing to a heap segment) * MemoryLayout - Added support for layout "attributes" - e.g. store metadata inside MemoryLayouts - Added MemoryLayout::isPadding predicate - Added helper function to SequenceLayout to rehape/flatten sequence layouts (a la NDArray [4]) * MemoryHandles - add support for general VarHandle combinators (similar to MH combinators) - add a combinator to turn a long-VH into a MemoryAddress VH (the resulting MemoryAddress is also *unchecked* and cannot be dereferenced)
Implementation changes ======================
* add support for VarHandle combinators (e.g. IndirectVH)
The idea here is simple: a VarHandle can almost be thought of as a set of method handles (one for each access mode supported by the var handle) that are lazily linked. This gives us a relatively simple idea upon which to build support for custom var handle adapters: we could create a VarHandle by passing an existing var handle and also specify the set of adaptations that should be applied to the method handle for a given access mode in the original var handle. The result is a new VarHandle which might support a different carrier type and more, or less coordinate types. Adding this support was relatively easy - and it only required one low-level surgery of the lambda forms generated for adapted var handle (this is required so that the "right" var handle receiver can be used for dispatching the access mode call).
All the new adapters in the MemoryHandles API (which are really defined inside VarHandles) are really just a bunch of MH adapters that are stitched together into a brand new VH. The only caveat is that, we could have a checked exception mismatch: the VarHandle API methods are specified not to throw any checked exception, whereas method handles can throw any throwable. This means that, potentially, calling get() on an adapted VarHandle could result in a checked exception being thrown; to solve this gnarly issue, we decided to scan all the filter functions passed to the VH combinators and look for direct method handles which throw checked exceptions. If such MHs are found (these can be deeply nested, since the MHs can be adapted on their own), adaptation of the target VH fails fast.
* More ByteBuffer implementation changes
Some more changes to ByteBuffer support were necessary here. First, we have added support for retrieval of "mapped" properties associated with a ByteBuffer (e.g. the file descriptor, etc.). This is crucial if we want to be able to turn an existing byte buffer into the "right kind" of memory segment.
Conversely, we also have to allow creation of mapped byte buffers given existing parameters - which is needed when going from (mapped) segment to a buffer. These two pieces together allow us to go from segment to buffer and back w/o losing any information about the underlying memory mapping (which was an issue in the previous implementation).
Lastly, to support the new MappedMemorySegment abstraction, all the memory mapped supporting functionalities have been moved into a common helper class so that MappedMemorySegmentImpl can reuse that (e.g. for MappedMemorySegment::force).
* Rewritten memory segment hierarchy
The old implementation had a monomorphic memory segment class. In this round we aimed at splitting the various implementation classes so that we have a class for heap segments (HeapMemorySegmentImpl), one for native segments (NativeMemorySegmentImpl) and one for memory mapped segments (MappedMemorySegmentImpl, which extends from NativeMemorySegmentImpl). Not much to see here - although one important point is that, by doing this, we have been able to speed up performances quite a bit, since now e.g. native/mapped segments are _guaranteed_ to have a null "base". We have also done few tricks to make sure that the "base" accessor for heap segment is sharply typed and also NPE checked, which allows C2 to speculate more and hoist. With these changes _all_ segment types have comparable performances and hoisting guarantees (unlike in the old implementation).
* Add workarounds in MemoryAddressProxy and AbstractMemorySegmentImpl to special-case "small segments", so that the VM can apply bounds check elimination
This is another important piece, which allows us to get very good performance out of indexed memory access var handles; as you might know, the JIT compiler has trouble optimizing loops where the loop variable is a long [2]. To make up for that, in this round we add an optimization which allows the API to detect whether a segment is *small* or *large*. For small segments, the API realizes that there is no need to perform long computation (e.g. for bound checks, or offset additions), so it falls back to integer logic, which in turn allows bounds check elimination.
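The small/large distinction can be sketched as follows (hypothetical class, not the actual MemoryAddressProxy code): when the segment size fits in an int, the whole bound check runs in int arithmetic, which C2 handles much better inside loops.

```java
public class BoundsCheck {

    static boolean inBounds(long size, long offset, long length) {
        if (size < Integer.MAX_VALUE) {  // "small" segment: int fast path
            return offset >= 0 && length >= 0
                    && offset <= Integer.MAX_VALUE && length <= Integer.MAX_VALUE
                    && inBoundsSmall((int) size, (int) offset, (int) length);
        }
        // "large" segment: fall back to long arithmetic
        return offset >= 0 && length >= 0 && offset <= size - length;
    }

    // Pure int logic: amenable to bounds check elimination by the JIT.
    static boolean inBoundsSmall(int size, int offset, int length) {
        return offset <= size - length;
    }

    static void checkBounds(long size, long offset, long length) {
        if (!inBounds(size, offset, length)) {
            throw new IndexOutOfBoundsException(
                    "offset=" + offset + ", length=" + length + ", size=" + size);
        }
    }
}
```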
* renaming of the various var handle classes to conform to "memory access var handle" terminology
This is mostly stylistic, nothing to see here.
Test changes =============
In addition to the tests for the new API changes, we've also added some stress tests for var handle combinators - e.g. there's a flag that can be enabled which turns on some "dummy" var handle adaptations on all var handles created by the runtime. We've used this flag on existing tests to make sure that things work as expected.
To sanity test the new memory segment spliterator, we have wired the new segment spliterator with the existing spliterator test harness.
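As an analogy (this uses a plain array spliterator, not the segment spliterator from the webrev), the divide-and-conquer wiring being exercised looks roughly like:

```java
import java.util.Arrays;
import java.util.stream.StreamSupport;

public class SpliteratorDemo {

    // Split-and-sum over an array spliterator: the same divide-and-conquer
    // shape that the segment spliterator enables for parallel pipelines.
    static long parallelSum() {
        int[] data = new int[1_000];
        Arrays.fill(data, 1);
        // parallel = true lets the framework call trySplit() repeatedly
        return StreamSupport.intStream(Arrays.spliterator(data), true).sum();
    }
}
```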
We have also added several micro benchmarks for the memory segment API (and made some changes to the build script so that native libraries would be handled correctly).
[1] - https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newd... [2] - https://bugs.openjdk.java.net/browse/JDK-8223051 [3] - https://openjdk.java.net/jeps/383 [4] - https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#nump...
On 13 May 2020, at 13:12, Maurizio Cimadamore <maurizio.cimadamore@oracle.com> wrote:
Another iteration which addresses the latest CSR comments (the CSR has now been approved):
* make MemorySegment::withAccessModes/hasAccessMode throw IllegalArgumentException in cases where the provided mask is invalid (this required a test tweak)
* sprinkled a couple of references to the JLS in the package javadoc, as per CSR suggestions
* fixed the ParallelSum::findAny_bulk benchmarks, which were (erroneously) not testing all the elements in the segment
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v4/webrev
Delta webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v4/webrev_delta/
This latest version looks good to me. Reviewed. -Chris.
Another very small iteration which fixes a gap in the javadoc specification for MemorySegment::toArray (thanks Chris!)
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v5/webrev
Delta webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v5/webrev_delta/
Cheers Maurizio
On 13/05/2020 13:12, Maurizio Cimadamore wrote:
Another iteration which addresses the latest CSR comments (the CSR has now been approved):
* make MemorySegment::withAccessModes/hasAccessMode throw IllegalArgumentException in cases where the provided mask is invalid (this required a test tweak) * sprinkled a couple of references to the JLS in the package javadoc, as per CSR suggestions * Fixed the ParallelSum::findAny_bulk benchmarks, which were (erroneously) not testing all the elements in the segment
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v4/webrev
Delta webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v4/webrev_delta/
Cheers Maurizio
On 01/05/2020 12:06, Maurizio Cimadamore wrote:
Latest iteration - the following issues were addressed:
* fix a bug with alignment of native segments triggering spurious failures (contributed by Jorn)
* fix javadoc and tests for access modes (contributed by Chris)
* added benchmarks for Stream::findAny using the segment spliterator (contributed by Peter)
* addressed CSR comments
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v3/webrev
Delta webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v3/webrev_delta/
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v3/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v3/specdiff/overview-summary...
Cheers Maurizio
On 27/04/2020 13:13, Maurizio Cimadamore wrote:
Another iteration, which addresses the following issues:
* wrong copyright headers in certain tests
* issue with TestNative when running in debug mode caused by mismatched malloc/os::free (contributed by Jorn)
* clarify javadoc of MemoryHandles::withStride
* improved implementation of MemoryAccessVarHandleGenerator to use hidden classes rather than Unsafe.dAC (contributed by Mandy)
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev
Delta webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/webrev_delta/
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v2/specdiff/overview-summary...
Cheers Maurizio
On 23/04/2020 21:33, Maurizio Cimadamore wrote:
+1 Paul.
On May 20, 2020, at 7:01 AM, Maurizio Cimadamore <maurizio.cimadamore@oracle.com> wrote:
Still LGTM. -Chris.
I've just integrated this - thanks to all the contributors and reviewers for the help! Cheers Maurizio
Hi Maurizio,
I'm checking out the thread-confinement in the parallel stream case. I see that Spliterator.trySplit() is calling AbstractMemorySegmentImpl's:
    private AbstractMemorySegmentImpl asSliceNoCheck(long offset, long newSize) {
        return dup(offset, newSize, mask, owner, scope);
    }
...so here the "owner" of the slice is still the same as that of the parent segment. But then later, in tryAdvance or forEachRemaining, the segment is acquired/closed for each element of the stream (in the case of tryAdvance) or for the whole chunk to the end of the spliterator (in the case of forEachRemaining). So some pipelines will be more optimal than others.
So I'm thinking: would it be possible to "lazily" acquire the scope just once in tryAdvance and then reuse it until the end? Unfortunately, Spliterator does not have a close() method to be called when the pipeline is done with it. Perhaps it could be added to the API? This is not the first time I have wished Spliterator had a close method. I had a similar problem when trying to create a Spliterator with a database backend: when using the JDBC API, a separate transaction (Connection) is typically required for each thread of execution, since several frameworks bind it to a ThreadLocal. WDYT?
Regards, Peter
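The release-once idea discussed above can be approximated today by hanging the release off the stream rather than the spliterator, since BaseStream does have a close hook. A minimal sketch (hypothetical demo class; the "scope" flag is a stand-in for a segment acquire, not actual API):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.IntStream;

public class StreamCloseDemo {

    // Stand-in for an acquired segment scope.
    static final AtomicBoolean SCOPE_OPEN = new AtomicBoolean(true);

    // Acquire once, release when the whole pipeline is closed - the
    // Stream::onClose hook is the closest thing to the missing
    // Spliterator::close discussed above.
    static long sumAndRelease() {
        try (IntStream s = IntStream.range(0, 10)
                .onClose(() -> SCOPE_OPEN.set(false))) {  // release the "scope"
            return s.sum();
        }
    }
}
```

The limitation, of course, is that nothing forces callers to use try-with-resources on the stream, which is exactly why a close() on Spliterator itself would be stronger.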
Hi, time has come for another round of foreign memory access API incubation (see JEP 383 [3]). This iteration aims at polishing some of the rough edges of the API, and adds some of the functionalities that developers have been asking for during this first round of incubation. The revised API tightens the thread-confinement constraints (by removing the MemorySegment::acquire method) and instead provides more targeted support for parallel computation via a segment spliterator. The API also adds a way to create a custom native segment; this is, essentially, an unsafe API point, very similar in spirit to the JNI NewDirectByteBuffer functionality [1]. By using this bit of API, power-users will be able to add support, via MemorySegment, to *their own memory sources* (e.g. think of a custom allocator written in C/C++). For now, this API point is called off as "restricted" and a special read-only JDK property will have to be set on the command line for calls to this method to succeed. We are aware there's no precedent for something like this in the Java SE API - but if Project Panama is to remain true about its ultimate goal of replacing bits of JNI code with (low level) Java code, stuff like this has to be *possible*. We anticipate that, at some point, this property will become a true launcher flag, and that the foreign restricted machinery will be integrated more neatly into the module system.
A list of the API, implementation and test changes is provided below. If you have any questions, or need more detailed explanations, I (and the rest of the Panama team) will be happy to point at existing discussions, and/or to provide the feedback required.
Thanks Maurizio
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary...
CSR:
https://bugs.openjdk.java.net/browse/JDK-8243496
API changes ===========
* MemorySegment - drop support for acquire() method - in its place now you can obtain a spliterator from a segment, which supports divide-and-conquer - revamped support for views - e.g. isReadOnly - now segments have access modes - added API to do serial confinement hand-off (MemorySegment::withOwnerThread) - added unsafe factory to construct a native segment out of an existing address; this API is "restricted" and only available if the program is executed using the -Dforeign.unsafe=permit flag. - the MemorySegment::mapFromPath now returns a MappedMemorySegment * MappedMemorySegment - small sub-interface which provides extra capabilities for mapped segments (load(), unload() and force()) * MemoryAddress - added distinction between *checked* and *unchecked* addresses; *unchecked* addresses do not have a segment, so they cannot be dereferenced - added NULL memory address (it's an unchecked address) - added factory to construct MemoryAddress from long value (result is also an unchecked address) - added API point to get raw address value (where possible - e.g. if this is not an address pointing to a heap segment) * MemoryLayout - Added support for layout "attributes" - e.g. store metadata inside MemoryLayouts - Added MemoryLayout::isPadding predicate - Added helper function to SequenceLayout to rehape/flatten sequence layouts (a la NDArray [4]) * MemoryHandles - add support for general VarHandle combinators (similar to MH combinators) - add a combinator to turn a long-VH into a MemoryAddress VH (the resulting MemoryAddress is also *unchecked* and cannot be dereferenced)
Implementation changes ======================
* add support for VarHandle combinators (e.g. IndirectVH)
The idea here is simple: a VarHandle can almost be thought of as a set of method handles (one for each access mode supported by the var handle) that are lazily linked. This gives us a relatively simple idea upon which to build support for custom var handle adapters: we could create a VarHandle by passing an existing var handle and also specify the set of adaptations that should be applied to the method handle for a given access mode in the original var handle. The result is a new VarHandle which might support a different carrier type and more, or less coordinate types. Adding this support was relatively easy - and it only required one low-level surgery of the lambda forms generated for adapted var handle (this is required so that the "right" var handle receiver can be used for dispatching the access mode call).
All the new adapters in the MemoryHandles API (which are really defined inside VarHandles) are really just a bunch of MH adapters that are stitched together into a brand new VH. The only caveat is that, we could have a checked exception mismatch: the VarHandle API methods are specified not to throw any checked exception, whereas method handles can throw any throwable. This means that, potentially, calling get() on an adapted VarHandle could result in a checked exception being thrown; to solve this gnarly issue, we decided to scan all the filter functions passed to the VH combinators and look for direct method handles which throw checked exceptions. If such MHs are found (these can be deeply nested, since the MHs can be adapted on their own), adaptation of the target VH fails fast.
* More ByteBuffer implementation changes
Some more changes to ByteBuffer support were necessary here. First, we have added support for retrieval of "mapped" properties associated with a ByteBuffer (e.g. the file descriptor, etc.). This is crucial if we want to be able to turn an existing byte buffer into the "right kind" of memory segment.
Conversely, we also have to allow creation of mapped byte buffers given existing parameters - which is needed when going from (mapped) segment to a buffer. These two pieces together allow us to go from segment to buffer and back w/o losing any information about the underlying memory mapping (which was an issue in the previous implementation).
Lastly, to support the new MappedMemorySegment abstraction, all the memory-mapping support functionality has been moved into a common helper class, so that MappedMemorySegmentImpl can reuse it (e.g. for MappedMemorySegment::force).
* Rewritten memory segment hierarchy
The old implementation had a monomorphic memory segment class. In this round we split the implementation into several classes: one for heap segments (HeapMemorySegmentImpl), one for native segments (NativeMemorySegmentImpl) and one for memory-mapped segments (MappedMemorySegmentImpl, which extends NativeMemorySegmentImpl). Not much to see here - although one important point is that, by doing this, we have been able to speed things up quite a bit, since now e.g. native/mapped segments are _guaranteed_ to have a null "base". We have also done a few tricks to make sure that the "base" accessor for heap segments is sharply typed and NPE-checked, which allows C2 to speculate more and hoist. With these changes _all_ segment types have comparable performance and hoisting guarantees (unlike in the old implementation).
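The "sharply typed, NPE-checked base accessor" trick can be sketched like this (hypothetical, simplified class names - not the actual JDK hierarchy):

```java
import java.util.Objects;

public class SegmentHierarchyDemo {
    public static void main(String[] args) {
        SegmentSketch heap = new HeapSegmentSketch(new byte[16]);
        SegmentSketch nat = new NativeSegmentSketch();
        System.out.println(heap.base().getClass().getSimpleName()); // byte[]
        System.out.println(nat.base()); // null
    }
}

abstract class SegmentSketch {
    abstract Object base();
}

final class HeapSegmentSketch extends SegmentSketch {
    private final byte[] base;
    HeapSegmentSketch(byte[] base) { this.base = base; }
    @Override
    byte[] base() { // covariant return: "sharply typed" base accessor
        return Objects.requireNonNull(base); // explicit NPE check aids C2 speculation
    }
}

final class NativeSegmentSketch extends SegmentSketch {
    @Override
    Object base() { return null; } // native/mapped segments are guaranteed base-less
}
```

With one final class per segment kind, the JIT sees a monomorphic, null-free accessor at each use site, which is what enables the hoisting mentioned above.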
* Add workarounds in MemoryAddressProxy and AbstractMemorySegmentImpl to special-case "small segments", so that the VM can apply bound check elimination
This is another important piece, which allows us to get very good performance out of indexed memory access var handles; as you might know, the JIT compiler has trouble optimizing loops where the loop variable is a long [2]. To make up for that, in this round we added an optimization which allows the API to detect whether a segment is *small* or *large*. For small segments, the API realizes that there's no need to perform long computations (e.g. for bound checks, or offset additions), so it falls back to integer logic, which in turn enables bound check elimination.
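The small/large distinction boils down to something like the following sketch (hypothetical method names, not the actual patch): the same bound check is expressed once in int arithmetic and once in long arithmetic, and only the int form is friendly to C2's bound check elimination inside counted loops:

```java
public class BoundsSketch {
    // int-only arithmetic: amenable to bound check elimination in counted loops
    static void checkSmall(int offset, int length, int size) {
        if (offset < 0 || length < 0 || offset > size - length)
            throw new IndexOutOfBoundsException();
    }

    // same check in long arithmetic: much harder for the JIT to optimize away
    static void checkLarge(long offset, long length, long size) {
        if (offset < 0 || length < 0 || offset > size - length)
            throw new IndexOutOfBoundsException();
    }

    public static void main(String[] args) {
        checkSmall(4, 4, 16);        // in bounds: 4 + 4 <= 16
        try {
            checkLarge(12, 8, 16);   // out of bounds: 12 + 8 > 16
            throw new AssertionError("expected IndexOutOfBoundsException");
        } catch (IndexOutOfBoundsException expected) {
            System.out.println("out of bounds as expected");
        }
    }
}
```

A segment whose byte size fits in an int can route every access through the int-based check; only truly large segments pay for the long arithmetic.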
* Renaming of the various var handle classes to conform to the "memory access var handle" terminology
This is mostly stylistic, nothing to see here.
Test changes ============
In addition to the tests for the new API changes, we've also added some stress tests for var handle combinators - e.g. there's a flag that can be enabled which turns on some "dummy" var handle adaptations on all var handles created by the runtime. We've used this flag on existing tests to make sure that things work as expected.
To sanity test the new memory segment spliterator, we have wired the new segment spliterator with the existing spliterator test harness.
We have also added several micro benchmarks for the memory segment API (and made some changes to the build script so that native libraries would be handled correctly).
[1] - https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newd... [2] - https://bugs.openjdk.java.net/browse/JDK-8223051 [3] - https://openjdk.java.net/jeps/383 [4] - https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#nump...
Hi,

The problem with the current implementation of MemoryScope is that if a child scope is frequently acquired and closed (which increments and then decrements the parent scope counter atomically using CAS), and that is performed from multiple concurrent threads, contention might become prohibitive. And I think that is precisely what happens when a parallel pipeline is such that it might short-circuit the stream:

    final boolean forEachWithCancel(Spliterator<P_OUT> spliterator, Sink<P_OUT> sink) {
        boolean cancelled;
        do {
        } while (!(cancelled = sink.cancellationRequested()) && spliterator.tryAdvance(sink));
        return cancelled;
    }

1st, spliterators are created by trySplit (all of them inherit the same MemoryScope) and then FJPool threads are busy concurrently executing the above method, which calls tryAdvance for each element of the particular spliterator, which does the following:

    public boolean tryAdvance(Consumer<? super MemorySegment> action) {
        Objects.requireNonNull(action);
        if (currentIndex < elemCount) {
            AbstractMemorySegmentImpl acquired = segment.acquire();
            try {
                action.accept(acquired.asSliceNoCheck(currentIndex * elementSize, elementSize));
            } finally {
                acquired.closeNoCheck();
                currentIndex++;
                if (currentIndex == elemCount) {
                    segment = null;
                }
            }
            return true;
        } else {
            return false;
        }
    }

... an acquire/close at each call. If the Stream is played to the end (i.e. it can't short-circuit), then forEachRemaining is used, which performs just one acquire/close for the whole remaining spliterator. So for short-circuiting streams it might be important to have a MemoryScope that is scalable.
Here's one such attempt using a pair of scalable counters (just one pair per root memory scope):

    import java.util.concurrent.atomic.LongAdder;

    /**
     * @author Peter Levart
     */
    public abstract class MemoryScope {

        public static MemoryScope create(Object ref, Runnable cleanupAction) {
            return new Root(ref, cleanupAction);
        }

        MemoryScope() {}

        public abstract MemoryScope acquire();

        public abstract void close();

        private static class Root extends MemoryScope {
            private final LongAdder enters = new LongAdder();
            private final LongAdder exits = new LongAdder();
            private volatile boolean closed;

            private final Object ref;
            private final Runnable cleanupAction;

            Root(Object ref, Runnable cleanupAction) {
                this.ref = ref;
                this.cleanupAction = cleanupAction;
            }

            @Override
            public MemoryScope acquire() {
                // increment enters 1st
                enters.increment();
                // check closed flag 2nd
                if (closed) {
                    exits.increment();
                    throw new IllegalStateException("This scope is already closed");
                }
                return new MemoryScope() {
                    @Override
                    public MemoryScope acquire() {
                        return Root.this.acquire();
                    }

                    @Override
                    public void close() {
                        exits.increment();
                    }
                };
            }

            private final Object lock = new Object();

            @Override
            public void close() {
                synchronized (lock) {
                    // modify closed flag 1st
                    closed = true;
                    // check for no more active acquired children 2nd
                    // IMPORTANT: 1st sum exits, then sum enters !!!
                    if (exits.sum() != enters.sum()) {
                        throw new IllegalStateException("Cannot close this scope as it has active acquired children");
                    }
                }
                if (cleanupAction != null) {
                    cleanupAction.run();
                }
            }
        }
    }

This MemoryScope is just 2-level. The root is the one that is to be created when the memory segment is allocated. A child is always a child of the root and has no children of its own. So a call to child.acquire() gets forwarded to the Root. Root.acquire() 1st increments the 'enters' scalable counter, then checks the 'closed' flag. child.close() just increments the 'exits' scalable counter.
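The scheme can be exercised with a small standalone harness (not part of the webrev; the scope logic is inlined as static fields for brevity): several threads acquire/close children while the main thread closes the root, and the two LongAdder sums must agree once everything has quiesced:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class ScopeStress {
    static final LongAdder enters = new LongAdder();
    static final LongAdder exits = new LongAdder();
    static volatile boolean closed;

    // child acquire: increment enters 1st, check closed 2nd
    static boolean acquire() {
        enters.increment();
        if (closed) { exits.increment(); return false; }
        return true;
    }

    // child close: just increment exits
    static void release() { exits.increment(); }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.execute(() -> {
                while (!closed) {
                    if (acquire()) release();
                }
            });
        }
        Thread.sleep(100);
        closed = true;                         // root close: set flag 1st
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        long ex = exits.sum(), en = enters.sum(); // sum exits 1st, then enters
        System.out.println(ex == en);          // no child left acquired
    }
}
```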
Root.close() 1st modifies the 'closed' flag, then checks that the sum of 'exits' equals the sum of 'enters' - the important thing here is that 'exits' is summed 1st and 'enters' 2nd. These orderings guarantee that either a child scope is successfully acquired or the root scope is successfully closed, but never both.

WDYT?

Regards, Peter

On 4/28/20 6:12 PM, Peter Levart wrote:
Hi Maurizio,
I'm checking out the thread-confinement in the parallel stream case. I see the Spliterator.trySplit() is calling AbstractMemorySegmentImpl's:
    102     private AbstractMemorySegmentImpl asSliceNoCheck(long offset, long newSize) {
    103         return dup(offset, newSize, mask, owner, scope);
    104     }
...so here the "owner" of the slice is still the same as that of parent segment...
But then later in tryAdvance or forEachRemaining, the segment is acquired/closed for each element of the stream (in case of tryAdvance) or for the whole chunk to the end of spliterator (in case of forEachRemaining). So some pipelines will be more optimal than others...
So I'm thinking. Would it be possible to "lazily" acquire scope just once in tryAdvance and then re-use the scope until the end? Unfortunately Spliterator does not have a close() method to be called when the pipeline is done with it. Perhaps it could be added to the API? This is not the 1st time I wished Spliterator had a close method. I had a similar problem when trying to create a Spliterator with a database backend. When using JDBC API a separate transaction (Connection) is typically required for each thread of execution since several frameworks bind it to the ThreadLocal.
WDYT?
Regards, Peter
On 28/04/2020 21:27, Peter Levart wrote:
Hi,
The problem with current implementation of MemoryScope is that if a child scope is frequently acquired and closed (which increments and then decrements the parent scope counter atomically using CAS), and that is performed from multiple concurrent threads, contention might become prohibitive. And I think that is precisely what happens when a parallel pipeline is such that it might short-circuit the stream:
    final boolean forEachWithCancel(Spliterator<P_OUT> spliterator, Sink<P_OUT> sink) {
        boolean cancelled;
        do {
        } while (!(cancelled = sink.cancellationRequested()) && spliterator.tryAdvance(sink));
        return cancelled;
    }
1st spliterators are created by trySplit (all of them inherit the same MemoryScope) and then FJPool threads are busy concurrently executing above method which calls tryAdvance for each element of the particular spliterator which does the following:
    public boolean tryAdvance(Consumer<? super MemorySegment> action) {
        Objects.requireNonNull(action);
        if (currentIndex < elemCount) {
            AbstractMemorySegmentImpl acquired = segment.acquire();
            try {
                action.accept(acquired.asSliceNoCheck(currentIndex * elementSize, elementSize));
            } finally {
                acquired.closeNoCheck();
                currentIndex++;
                if (currentIndex == elemCount) {
                    segment = null;
                }
            }
            return true;
        } else {
            return false;
        }
    }
... acquire/close at each call. If the Stream is played to the end (i.e. it can't short-circuit), then forEachRemaining is used which performs just one acquire/close for the whole remaining spliterator. So for short-circuiting streams it might be important to have a MemoryScope that is scalable. Here's one such attempt using a pair of scalable counters (just one pair per root memory scope):
The current implementation has performance on par with the previous acquire-based implementation, and also on par with what can be achieved with Unsafe. We do have a micro benchmark in the patch (see ParallelSum (**)) which tests this, and I get _identical_ numbers even if I _comment out_ the body of acquire/release - so that no contention can happen; so, I'm a bit skeptical overall that contention on acquire/release is the main factor at play here - but perhaps we need more targeted benchmarks.

(**) - your email caused me to look deeper at the ParallelSum benchmark which, as currently written, seems to favor Unsafe over the MemorySegment API - but in reality, as I discovered, that is down to an issue in the implementation of the unsafe spliterator, which doesn't sum all the elements; I will fix the benchmark in an upcoming iteration.

So, while I'm open to suggestions as to how to reduce contention on the acquire counter, I think we need more evidence that this is indeed an issue (or the _main_ issue, when it comes to parallel computation). That said, your implementation looks interesting - some questions inline and also below:
import java.util.concurrent.atomic.LongAdder;
    /**
     * @author Peter Levart
     */
    public abstract class MemoryScope {
        public static MemoryScope create(Object ref, Runnable cleanupAction) {
            return new Root(ref, cleanupAction);
        }
MemoryScope() {}
public abstract MemoryScope acquire();
public abstract void close();
        private static class Root extends MemoryScope {
            private final LongAdder enters = new LongAdder();
            private final LongAdder exits = new LongAdder();
            private volatile boolean closed;
            private final Object ref;
            private final Runnable cleanupAction;
            Root(Object ref, Runnable cleanupAction) {
                this.ref = ref;
                this.cleanupAction = cleanupAction;
            }
            @Override
            public MemoryScope acquire() {
                // increment enters 1st
                enters.increment();
                // check closed flag 2nd
                if (closed) {
                    exits.increment();
                    throw new IllegalStateException("This scope is already closed");
                }
                return new MemoryScope() {
                    @Override
                    public MemoryScope acquire() {
                        return Root.this.acquire();
                    }
                    @Override
                    public void close() {
                        exits.increment();
Here -- don't you mean Root.this.exits? Otherwise Root.exits is going to remain != Root.enters?
                    }
                };
            }
private final Object lock = new Object();
            @Override
            public void close() {
                synchronized (lock) {
Why the lock? If we are here we're already in the owner thread - e.g. it's not like multiple threads can call this at the same time. Or are you trying to make the code more robust in the case a segment is created w/o a confinement thread (e.g. via the unsafe API) ?
                    // modify closed flag 1st
                    closed = true;
                    // check for no more active acquired children 2nd
                    // IMPORTANT: 1st sum exits, then sum enters !!!
                    if (exits.sum() != enters.sum()) {
                        throw new IllegalStateException("Cannot close this scope as it has active acquired children");
                    }
                }
                if (cleanupAction != null) {
                    cleanupAction.run();
                }
            }
        }
    }
This MemoryScope is just 2-level. The root is the one that is to be created when the memory segment is allocated. A child is always a child of the root and has no own children. So a call to child.acquire() gets forwarded to the Root. The Root.acquire() 1st increments 'enters' scalable counter then checks the 'closed' flag. The child.close() just increments the 'exits' scalable counter. The Root.close() 1st modifies the 'closed' flag then checks to see that the sum of 'exits' equals the sum of 'enters' - the important thing here is that 'exits' are summed 1st and then 'enters'. These orderings guarantee that either a child scope is successfully acquired or the root scope is successfully closed but never both.
I guess what you mean here is that, by construction, exits <= enters. So, we first read exits, then we read enters - and there can be only two cases:

* exits < enters, in which case some other thread has acquired but not closed (possibly even *after* the call to exits.sum())
* exits == enters, in which case there's no pending acquire and we're golden

While this all seems very clever, there are some things that don't 100% convince me - for instance, I note that `closed` will stay set even if we later throw an ISE during close(). I suppose we *could* reset closed = false in the throwing code path, but then there's a possibility of generating spurious ISEs in MemoryScope::acquire in the time span where `closed == true`. In other words, one of the big realizations of the current synchronization mechanism behind acquire() is that if we fold the "closed" state with the "count" state, we then have to worry about only one access, which makes it easier to reason about the implementation. Here it seems that races between updates to exits/enters/closed would be possible, and I'm not sure we can fully protect against those w/o adding more locks?

Maurizio
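For reference, the "folded state" alternative can be sketched like this (a hypothetical standalone sketch, not the JDK code): one atomic word holds both the acquire count and the closed flag (encoded as -1), so acquire and close contend on a single variable and there is a single linearization point:

```java
import java.util.concurrent.atomic.AtomicLong;

public class FoldedScopeDemo {
    public static void main(String[] args) {
        FoldedScope scope = new FoldedScope();
        System.out.println(scope.acquire()); // true: scope is open
        System.out.println(scope.close());   // false: one acquire still active
        scope.release();
        System.out.println(scope.close());   // true: no active acquires left
        System.out.println(scope.acquire()); // false: already closed
    }
}

class FoldedScope {
    private static final long CLOSED = -1;
    private final AtomicLong state = new AtomicLong(0); // # of active acquires

    boolean acquire() {
        while (true) {
            long s = state.get();
            if (s == CLOSED) return false;              // already closed
            if (state.compareAndSet(s, s + 1)) return true;
        }
    }

    void release() {
        state.decrementAndGet();
    }

    boolean close() {
        // close succeeds only when there are no active acquires
        return state.compareAndSet(0, CLOSED);
    }
}
```

The single CAS avoids the spurious-ISE window discussed above, at the cost of contention on one word - which is exactly the trade-off against the two-counter design.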
WDYT?
Regards, Peter
On 4/28/20 6:12 PM, Peter Levart wrote:
Hi Maurizio,
I'm checking out the thread-confinement in the parallel stream case. I see the Spliterator.trySplit() is calling AbstractMemorySegmentImpl's:
102 private AbstractMemorySegmentImpl asSliceNoCheck(long offset, long newSize) { 103 return dup(offset, newSize, mask, owner, scope); 104 }
...so here the "owner" of the slice is still the same as that of parent segment...
But then later in tryAdvance or forEachRemaining, the segment is acquired/closed for each element of the stream (in case of tryAdvance) or for the whole chunk to the end of spliterator (in case of forEachRemaining). So some pipelines will be more optimal than others...
So I'm thinking. Would it be possible to "lazily" acquire scope just once in tryAdvance and then re-use the scope until the end? Unfortunately Spliterator does not have a close() method to be called when the pipeline is done with it. Perhaps it could be added to the API? This is not the 1st time I wished Spliterator had a close method. I had a similar problem when trying to create a Spliterator with a database backend. When using JDBC API a separate transaction (Connection) is typically required for each thread of execution since several frameworks bind it to the ThreadLocal.
WDYT?
Regards, Peter
On 4/23/20 10:33 PM, Maurizio Cimadamore wrote:
Hi, time has come for another round of foreign memory access API incubation (see JEP 383 [3]). This iteration aims at polishing some of the rough edges of the API, and adds some of the functionalities that developers have been asking for during this first round of incubation. The revised API tightens the thread-confinement constraints (by removing the MemorySegment::acquire method) and instead provides more targeted support for parallel computation via a segment spliterator. The API also adds a way to create a custom native segment; this is, essentially, an unsafe API point, very similar in spirit to the JNI NewDirectByteBuffer functionality [1]. By using this bit of API, power-users will be able to add support, via MemorySegment, to *their own memory sources* (e.g. think of a custom allocator written in C/C++). For now, this API point is called off as "restricted" and a special read-only JDK property will have to be set on the command line for calls to this method to succeed. We are aware there's no precedent for something like this in the Java SE API - but if Project Panama is to remain true about its ultimate goal of replacing bits of JNI code with (low level) Java code, stuff like this has to be *possible*. We anticipate that, at some point, this property will become a true launcher flag, and that the foreign restricted machinery will be integrated more neatly into the module system.
A list of the API, implementation and test changes is provided below. If you have any questions, or need more detailed explanations, I (and the rest of the Panama team) will be happy to point at existing discussions, and/or to provide the feedback required.
Thanks Maurizio
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary...
CSR:
https://bugs.openjdk.java.net/browse/JDK-8243496
API changes ===========
* MemorySegment - drop support for acquire() method - in its place now you can obtain a spliterator from a segment, which supports divide-and-conquer - revamped support for views - e.g. isReadOnly - now segments have access modes - added API to do serial confinement hand-off (MemorySegment::withOwnerThread) - added unsafe factory to construct a native segment out of an existing address; this API is "restricted" and only available if the program is executed using the -Dforeign.unsafe=permit flag. - the MemorySegment::mapFromPath now returns a MappedMemorySegment * MappedMemorySegment - small sub-interface which provides extra capabilities for mapped segments (load(), unload() and force()) * MemoryAddress - added distinction between *checked* and *unchecked* addresses; *unchecked* addresses do not have a segment, so they cannot be dereferenced - added NULL memory address (it's an unchecked address) - added factory to construct MemoryAddress from long value (result is also an unchecked address) - added API point to get raw address value (where possible - e.g. if this is not an address pointing to a heap segment) * MemoryLayout - Added support for layout "attributes" - e.g. store metadata inside MemoryLayouts - Added MemoryLayout::isPadding predicate - Added helper function to SequenceLayout to rehape/flatten sequence layouts (a la NDArray [4]) * MemoryHandles - add support for general VarHandle combinators (similar to MH combinators) - add a combinator to turn a long-VH into a MemoryAddress VH (the resulting MemoryAddress is also *unchecked* and cannot be dereferenced)
Implementation changes ======================
* add support for VarHandle combinators (e.g. IndirectVH)
The idea here is simple: a VarHandle can almost be thought of as a set of method handles (one for each access mode supported by the var handle) that are lazily linked. This gives us a relatively simple idea upon which to build support for custom var handle adapters: we could create a VarHandle by passing an existing var handle and also specify the set of adaptations that should be applied to the method handle for a given access mode in the original var handle. The result is a new VarHandle which might support a different carrier type and more, or less coordinate types. Adding this support was relatively easy - and it only required one low-level surgery of the lambda forms generated for adapted var handle (this is required so that the "right" var handle receiver can be used for dispatching the access mode call).
All the new adapters in the MemoryHandles API (which are really defined inside VarHandles) are really just a bunch of MH adapters that are stitched together into a brand new VH. The only caveat is that, we could have a checked exception mismatch: the VarHandle API methods are specified not to throw any checked exception, whereas method handles can throw any throwable. This means that, potentially, calling get() on an adapted VarHandle could result in a checked exception being thrown; to solve this gnarly issue, we decided to scan all the filter functions passed to the VH combinators and look for direct method handles which throw checked exceptions. If such MHs are found (these can be deeply nested, since the MHs can be adapted on their own), adaptation of the target VH fails fast.
* More ByteBuffer implementation changes
Some more changes to ByteBuffer support were necessary here. First, we have added support for retrieval of "mapped" properties associated with a ByteBuffer (e.g. the file descriptor, etc.). This is crucial if we want to be able to turn an existing byte buffer into the "right kind" of memory segment.
Conversely, we also have to allow creation of mapped byte buffers given existing parameters - which is needed when going from (mapped) segment to a buffer. These two pieces together allow us to go from segment to buffer and back w/o losing any information about the underlying memory mapping (which was an issue in the previous implementation).
Lastly, to support the new MappedMemorySegment abstraction, all the memory mapped supporting functionalities have been moved into a common helper class so that MappedMemorySegmentImpl can reuse that (e.g. for MappedMemorySegment::force).
* Rewritten memory segment hierarchy
The old implementation had a monomorphic memory segment class. In this round we aimed at splitting the various implementation classes so that we have a class for heap segments (HeapMemorySegmentImpl), one for native segments (NativeMemorySegmentImpl) and one for memory mapped segments (MappedMemorySegmentImpl, which extends from NativeMemorySegmentImpl). Not much to see here - although one important point is that, by doing this, we have been able to speed up performances quite a bit, since now e.g. native/mapped segments are _guaranteed_ to have a null "base". We have also done few tricks to make sure that the "base" accessor for heap segment is sharply typed and also NPE checked, which allows C2 to speculate more and hoist. With these changes _all_ segment types have comparable performances and hoisting guarantees (unlike in the old implementation).
* Add workarounds in MemoryAddressProxy, AbstractMemorySegmentImpl to special case "small segments" so that VM can apply bound check elimination
This is another important piece which allows us to get very good performance out of indexed memory access var handles; as you might know, the JIT compiler has trouble optimizing loops where the loop variable is a long [2]. To make up for that, in this round we add an optimization which allows the API to detect whether a segment is *small* or *large*. For small segments, the API realizes that there's no need to perform long computation (e.g. to perform bound checks, or offset additions), so it falls back to integer logic, which in turn allows bound check elimination.
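A sketch of the small/large split (a hypothetical helper, not the actual patch code):

```java
// Hypothetical sketch of the small/large segment split: when a segment's
// length is known to fit in an int, offset checks can be done entirely with
// int arithmetic, which C2's bounds-check elimination handles well; the
// equivalent long arithmetic defeats it [2].
final class BoundsCheck {

    static void check(long offset, long bytes, long length, boolean small) {
        if (small) {
            // all three values are known to fit in an int
            int o = (int) offset, n = (int) bytes, l = (int) length;
            if (o < 0 || n < 0 || o > l - n) {
                throw new IndexOutOfBoundsException(offset + " / " + length);
            }
        } else {
            if (offset < 0 || bytes < 0 || offset > length - bytes) {
                throw new IndexOutOfBoundsException(offset + " / " + length);
            }
        }
    }
}
```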
* renaming of the various var handle classes to conform to "memory access var handle" terminology
This is mostly stylistic, nothing to see here.
Test changes
============
In addition to the tests for the new API changes, we've also added some stress tests for var handle combinators - e.g. there's a flag that can be enabled which turns on some "dummy" var handle adaptations on all var handles created by the runtime. We've used this flag on existing tests to make sure that things work as expected.
To sanity test the new memory segment spliterator, we have wired the new segment spliterator with the existing spliterator test harness.
We have also added several micro benchmarks for the memory segment API (and made some changes to the build script so that native libraries would be handled correctly).
[1] - https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newd... [2] - https://bugs.openjdk.java.net/browse/JDK-8223051 [3] - https://openjdk.java.net/jeps/383 [4] - https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#nump...
On 4/29/20 2:41 AM, Maurizio Cimadamore wrote:
On 28/04/2020 21:27, Peter Levart wrote:
Hi,
The problem with current implementation of MemoryScope is that if a child scope is frequently acquired and closed (which increments and then decrements the parent scope counter atomically using CAS), and that is performed from multiple concurrent threads, contention might become prohibitive. And I think that is precisely what happens when a parallel pipeline is such that it might short-circuit the stream:
final boolean forEachWithCancel(Spliterator<P_OUT> spliterator, Sink<P_OUT> sink) {
    boolean cancelled;
    do { } while (!(cancelled = sink.cancellationRequested())
                  && spliterator.tryAdvance(sink));
    return cancelled;
}
First, spliterators are created by trySplit (all of them inherit the same MemoryScope), and then FJPool threads are busy concurrently executing the above method, which calls tryAdvance for each element of the particular spliterator, which does the following:
public boolean tryAdvance(Consumer<? super MemorySegment> action) {
    Objects.requireNonNull(action);
    if (currentIndex < elemCount) {
        AbstractMemorySegmentImpl acquired = segment.acquire();
        try {
            action.accept(acquired.asSliceNoCheck(currentIndex * elementSize, elementSize));
        } finally {
            acquired.closeNoCheck();
            currentIndex++;
            if (currentIndex == elemCount) {
                segment = null;
            }
        }
        return true;
    } else {
        return false;
    }
}
... acquire/close at each call. If the Stream is played to the end (i.e. it can't short-circuit), then forEachRemaining is used which performs just one acquire/close for the whole remaining spliterator. So for short-circuiting streams it might be important to have a MemoryScope that is scalable. Here's one such attempt using a pair of scalable counters (just one pair per root memory scope):
The current implementation performs on par with the previous acquire-based implementation, and also on par with what can be achieved with Unsafe. We do have a micro benchmark in the patch (see ParallelSum (**)) which tests this, and I get _identical_ numbers even if I _comment out_ the body of acquire/release - so that no contention can happen; so, I'm a bit skeptical overall that contention on acquire/release is the main factor at play here - but perhaps we need more targeted benchmarks.
Right, summing is typically not a short-circuiting operation, so I bet forEachRemaining is used in the leaf spliterators. FindAny or findFirst might be the ones to test. I'll prepare a test and see what experimenting with alternative MemoryScope can do...
(**) - your email caused me to look deeper at the ParallelSum benchmark which, as currently written, seems to favor Unsafe over the MemorySegment API - but in reality, as I discovered, that is down to an issue in the implementation of the unsafe spliterator, which doesn't sum all the elements; I will fix the benchmark in an upcoming iteration
So, while I'm open to suggestions as to how to reduce contention on the acquire counter, I think we need more evidence that this is indeed an issue (or the _main_ issue, when it comes to parallel computation). That said, your implementation looks interesting - some questions inline and also below:
answers inline...
import java.util.concurrent.atomic.LongAdder;
/**
 * @author Peter Levart
 */
public abstract class MemoryScope {
    public static MemoryScope create(Object ref, Runnable cleanupAction) {
        return new Root(ref, cleanupAction);
    }
MemoryScope() {}
public abstract MemoryScope acquire();
public abstract void close();
    private static class Root extends MemoryScope {
        private final LongAdder enters = new LongAdder();
        private final LongAdder exits = new LongAdder();
        private volatile boolean closed;
        private final Object ref;
        private final Runnable cleanupAction;
        Root(Object ref, Runnable cleanupAction) {
            this.ref = ref;
            this.cleanupAction = cleanupAction;
        }
        @Override
        public MemoryScope acquire() {
            // increment enters 1st
            enters.increment();
            // check closed flag 2nd
            if (closed) {
                exits.increment();
                throw new IllegalStateException("This scope is already closed");
            }
            return new MemoryScope() {
                @Override
                public MemoryScope acquire() {
                    return Root.this.acquire();
                }
                @Override
                public void close() {
                    exits.increment();
Here -- don't you mean Root.this.exits? Otherwise Root.exits is gonna remain != from Root.enters?
'exits' field is declared in Root only, so 'exits' in anonymous inner class actually refers to Root.this.exits...
                }
            };
        }
private final Object lock = new Object();
        @Override
        public void close() {
            synchronized (lock) {
Why the lock? If we are here we're already in the owner thread - e.g. it's not like multiple threads can call this at the same time. Or are you trying to make the code more robust in the case a segment is created w/o a confinement thread (e.g. via the unsafe API) ?
That lock is unnecessary. A leftover.
                // modify closed flag 1st
                closed = true;
                // check for no more active acquired children 2nd
                // IMPORTANT: 1st sum exits, then sum enters !!!
                if (exits.sum() != enters.sum()) {
                    throw new IllegalStateException("Cannot close this scope as it has active acquired children");
                }
            }
            if (cleanupAction != null) {
                cleanupAction.run();
            }
        }
    }
}
This MemoryScope is just 2-level. The root is the one that is to be created when the memory segment is allocated. A child is always a child of the root and has no own children. So a call to child.acquire() gets forwarded to the Root. The Root.acquire() 1st increments 'enters' scalable counter then checks the 'closed' flag. The child.close() just increments the 'exits' scalable counter. The Root.close() 1st modifies the 'closed' flag then checks to see that the sum of 'exits' equals the sum of 'enters' - the important thing here is that 'exits' are summed 1st and then 'enters'. These orderings guarantee that either a child scope is successfully acquired or the root scope is successfully closed but never both.
I guess what you mean here is that, by construction, exits <= enters.
So, we first read exits, then we read enters - and there can be only two cases:
* exits < enters, in which case it means some other thread has acquired but not closed (possibly even *after* the call to exits.sum()) * exits == enters, in which case there's no pending acquire and we're golden
While this all seems very clever, there are some things that don't 100% convince me - for instance, I note that `closed` will stay set even if we later throw an ISE during close(). I suppose we *could* reset closed = false in the throwing code path, but then there's the possibility of generating spurious ISEs in MemoryScope::acquire in the time span where `closed == true`.
Right, as you saw in a private email, I did exactly that in a revised version (posted below). The spurious ISE may happen, but only when close is called prematurely relative to all child scope close(s) that were or are still active. Put the other way: if close was not called prematurely, the ISE on acquire would not be spurious - so the ordering of close relative to a later acquire was wrong anyway, and the exception is an "indication" of that wrong ordering. Unless we want to use close() to "probe" whether the scope is still active or not, I don't think this should present a problem.
In other words, one of the big realizations behind the current synchronization mechanism in acquire() is that if we fold the "closed" state with the "count" state, we then have to worry only about one access, which makes it easier to reason about the implementation. Here it seems that races between updates to exits/enters/closed would be possible, and I'm not sure we can fully protect against those w/o adding more locks?
The only problem I see with the (revised) version is this spurious ISE that might be thrown on an acquire executed concurrently with a premature close. I see no other races. For any interested readers, I'm posting the revised version here so it doesn't look like I'm talking about some secret:

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.LongAdder;

/**
 * @author Peter Levart
 */
public abstract class MemoryScope {

    public static MemoryScope create(Object ref, Runnable cleanupAction) {
        return new Root(ref, cleanupAction);
    }

    boolean closed;

    private static final VarHandle CLOSED;
    static {
        try {
            CLOSED = MethodHandles.lookup().findVarHandle(Root.class, "closed", boolean.class);
        } catch (Throwable ex) {
            throw new ExceptionInInitializerError(ex);
        }
    }

    MemoryScope() {}

    public abstract MemoryScope acquire();

    public abstract void close();

    public final boolean isAliveThreadSafe() {
        return !((boolean) CLOSED.getVolatile(this));
    }

    public final boolean isAliveConfined() {
        return !closed;
    }

    private static final class Root extends MemoryScope {
        private final LongAdder acquires = new LongAdder();
        private final LongAdder releases = new LongAdder();
        private final Object ref;
        private final Runnable cleanupAction;

        private Root(Object ref, Runnable cleanupAction) {
            this.ref = ref;
            this.cleanupAction = cleanupAction;
        }

        @Override
        public MemoryScope acquire() {
            // increment acquires 1st
            acquires.increment();
            // check closed flag 2nd
            if ((boolean) CLOSED.getVolatile(this)) {
                releases.increment();
                throw new IllegalStateException("This scope is already closed");
            }
            return new Child();
        }

        @Override
        public void close() {
            if (closed) {
                throw new IllegalStateException("This scope is already closed");
            }
            // modify closed flag 1st
            CLOSED.setVolatile(this, true);
            // check for no more active acquired children 2nd
            // IMPORTANT: 1st sum releases, then sum acquires !!!
            if (releases.sum() != acquires.sum()) {
                CLOSED.setVolatile(this, false); // undo before failing
                throw new IllegalStateException("Cannot close this scope as it has active acquired children");
            }
            if (cleanupAction != null) {
                cleanupAction.run();
            }
        }

        private final class Child extends MemoryScope {
            private Child() {}

            @Override
            public MemoryScope acquire() {
                return Root.this.acquire();
            }

            @Override
            public void close() {
                if (closed) {
                    throw new IllegalStateException("This scope is already closed");
                }
                closed = true;
                // following acts as a volatile write after plain write above so
                // plain write gets flushed too (which is important for isAliveThreadSafe())
                Root.this.releases.increment();
            }
        }
    }
}
Maurizio
Regards, Peter
WDYT?
Regards, Peter
On 4/28/20 6:12 PM, Peter Levart wrote:
Hi Maurizio,
I'm checking out the thread-confinement in the parallel stream case. I see the Spliterator.trySplit() is calling AbstractMemorySegmentImpl's:
private AbstractMemorySegmentImpl asSliceNoCheck(long offset, long newSize) {
    return dup(offset, newSize, mask, owner, scope);
}
...so here the "owner" of the slice is still the same as that of parent segment...
But then later in tryAdvance or forEachRemaining, the segment is acquired/closed for each element of the stream (in case of tryAdvance) or for the whole chunk to the end of spliterator (in case of forEachRemaining). So some pipelines will be more optimal than others...
So I'm thinking. Would it be possible to "lazily" acquire scope just once in tryAdvance and then re-use the scope until the end? Unfortunately Spliterator does not have a close() method to be called when the pipeline is done with it. Perhaps it could be added to the API? This is not the 1st time I wished Spliterator had a close method. I had a similar problem when trying to create a Spliterator with a database backend. When using JDBC API a separate transaction (Connection) is typically required for each thread of execution since several frameworks bind it to the ThreadLocal.
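For what it's worth, one way to approximate a close hook today, using only the existing java.util.stream API (a sketch, not a proposed API change):

```java
import java.util.List;
import java.util.Spliterator;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Sketch: Spliterator has no close(), but Stream does - so a resource tied
// to a spliterator can be released via Stream.onClose + try-with-resources.
// (The 'released' flag here is a stand-in for releasing an acquired scope
// or a JDBC connection.)
final class SpliteratorClose {

    static long sumWithCleanup(List<Integer> data, AtomicBoolean released) {
        Spliterator<Integer> sp = data.spliterator();
        try (Stream<Integer> s = StreamSupport.stream(sp, true)
                                              .onClose(() -> released.set(true))) {
            return s.mapToLong(Integer::longValue).sum();
        } // close() runs here, even if the pipeline short-circuited
    }
}
```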
WDYT?
Regards, Peter
On 4/23/20 10:33 PM, Maurizio Cimadamore wrote:
On 29/04/2020 08:10, Peter Levart wrote:
Right, as you saw in a private Email, I did exactly that in a revised version (posted below). The spurious ISE may happen but only when close is called prematurely relative to all child scope close(s) that were or are still active. So we could say the other way: if close was not called prematurely, the ISE on acquire would not be spurious - so ordering of close relative to later acquire was wrong anyway and the exception is an "indication" of that wrong ordering
Unless we want to use close() to "probe" the scope whether it is still active or not, I don't think this should present a problem.
One quick comment: unless I'm missing something, this is starting to make a pretty strong assumption on how acquire()/close() are going to be used. Yes, if acquire() is not exposed to developers (as in the current API) and only hidden behind a spliterator, we might get away with this; but should we at some point re-introduce some kind of explicit acquire mechanism in the API, then such an assumption would not be a fine one to make. Maurizio
On 29/04/2020 21:40, Maurizio Cimadamore wrote:
On 29/04/2020 08:10, Peter Levart wrote:
Right, as you saw in a private Email, I did exactly that in a revised version (posted below). The spurious ISE may happen but only when close is called prematurely relative to all child scope close(s) that were or are still active. So we could say the other way: if close was not called prematurely, the ISE on acquire would not be spurious - so ordering of close relative to later acquire was wrong anyway and the exception is an "indication" of that wrong ordering
Unless we want to use close() to "probe" the scope whether it is still active or not, I don't think this should present a problem.
One quick comment: unless I'm missing something, this is starting to make a pretty strong assumption on how acquire()/close() are going to be used. Yes, if acquire() is not exposed to developers (as in the current API) and only hidden behind a spliterator, we might get away with this; but should we at some point re-introduce some kind of explicit acquire mechanism in the API, then such an assumption would not be a fine one to make.
Actually, the more I think of it, the less I'm convinced that, even in the restricted case of acquire() that the API has now, the proposed implementation is correct.

A client has a segment, and it creates a spliterator for it. Then it gives the spliterator to some thread pool which will happily work away with the segment. But, the client code prematurely closes the segment - this operation should fail with exception, as per javadoc, if there are other actors working on the segment. Your implementation does that, but that failure leaves a sneaky side-effect - in that the threads in the thread-pool might not be able to continue their work on the segment. This action-at-a-distance between a failed close and a pending acquire is not documented anywhere, and I also find it very counter-intuitive.

So, while I agree there might be ways to have a more scalable scope implementation, I think this is a problem that needs to be addressed. If we were willing to change the spec a bit, I would then be more inclined to say that when you call MemorySegment::close you always close - period; under no circumstances is an exception thrown. If there are pending acquires on the segment, we keep spinning until there aren't, and then we close for good. I think that would be a more stable semantics, rather than one where it seems like the operation failed, but in reality, it succeeded in part ;-) (at least for a period of time)

Maurizio
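A sketch of what such "always close, spin until pending acquires drain" semantics could look like (my own illustration of the idea, not code from the patch):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the alternative semantics: close() does not fail on pending
// acquires; it flips the closed bit and then spins until every pending
// acquire has been released, only then running the cleanup action.
// (Illustration only - a real implementation would use scalable counters.)
final class SpinningScope {
    // low bit = closed flag, remaining bits = pending-acquire count
    private final AtomicLong state = new AtomicLong();

    void acquire() {
        long s = state.addAndGet(2);
        if ((s & 1) != 0) {          // raced with close(): back out
            state.addAndGet(-2);
            throw new IllegalStateException("already closed");
        }
    }

    void release() {
        state.addAndGet(-2);
    }

    void close(Runnable cleanup) {
        long s = state.getAndUpdate(v -> v | 1);
        if ((s & 1) != 0) throw new IllegalStateException("already closed");
        while (state.get() >> 1 != 0) { // spin until pending acquires drain
            Thread.onSpinWait();
        }
        cleanup.run();
    }
}
```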
Maurizio
On 4/29/20 11:16 PM, Maurizio Cimadamore wrote:
On 29/04/2020 21:40, Maurizio Cimadamore wrote:
On 29/04/2020 08:10, Peter Levart wrote:
Right, as you saw in a private Email, I did exactly that in a revised version (posted below). The spurious ISE may happen but only when close is called prematurely relative to all child scope close(s) that were or are still active. So we could say the other way: if close was not called prematurely, the ISE on acquire would not be spurious - so ordering of close relative to later acquire was wrong anyway and the exception is an "indication" of that wrong ordering
Unless we want to use close() to "probe" the scope whether it is still active or not, I don't think this should present a problem.
One quick comment: unless I'm missing something, this is starting to make a pretty strong assumption on how acquire()/close() are going to be used. Yes, if acquire() is not exposed to developers (as in the current API) and only hidden behind a spliterator, we might get away with this; but should we at some point re-introduce some kind of explicit acquire mechanism in the API, then such an assumption would not be a fine one to make.
Actually, the more I think of it, the less I'm convinced that, even in the restricted case of acquire() that the API has now, the proposed implementation is correct.
A client has a segment, and it creates a spliterator for it. Then it gives the spliterator to some thread pool which will happily work away with the segment.
But, the client code prematurely closes the segment - this operation should fail with exception, as per javadoc, if there are other actors working on the segment. Your implementation does that, but that failure leaves a sneaky side-effect - in that the threads in the thread-pool might not be able to continue their work on the segment.
Think differently: what if the client succeeded in closing the segment, just because it did it in a time window when no thread in the thread pool held an open scope (this is entirely possible with parallel stream for example since threads periodically acquire and close scopes). This would have the same effect on threads in the thread pool - they would not be able to continue their work... What I'm trying to say is that this is just a mechanism to make things safe, not to coordinate work. If program wants to avoid trouble, it must carefully coordinate work of threads.
This action-at-a-distance between a failed close and a pending acquire is not documented anywhere, and I also find it very counter-intuitive.
So, while I agree there might be ways to have a more scalable scope implementation, I think this is a problem that needs to be addressed.
If we were willing to change the spec a bit, I would then be more inclined to say that when you call MemorySegment::close you always close - period; under no circumstances is an exception thrown. If there are pending acquires on the segment, we keep spinning until there aren't, and then we close for good.
I think that would be a more stable semantics, rather than one where it seems like the operation failed, but in reality, it succeeded in part ;-) (at least for a period of time)
Well, you might not agree, but I don't think of this as a problem. You are trying to define some "correct" behavior for a program that is incorrectly written anyway. I'm just saying that if the program is correctly written, then there are no exceptions. And not defining the behavior of incorrect programs is not that uncommon. Just think of the Java memory model: how much of a program's behavior is guaranteed when the program has data races?

Regards, Peter
Maurizio
Maurizio
On 30/04/2020 01:06, Peter Levart wrote:
Think differently: what if the client succeeded in closing the segment, just because it did it in a time window when no thread in the thread pool held an open scope (this is entirely possible with parallel stream for example since threads periodically acquire and close scopes). This would have the same effect on threads in the thread pool - they would not be able to continue their work... What I'm trying to say is that this is just a mechanism to make things safe, not to coordinate work. If program wants to avoid trouble, it must carefully coordinate work of threads.
This appears to me to be a bit of a slippery slope. Sure, if somebody prematurely calls close() on a segment while other threads are accessing it, it could be seen as undefined behavior (a la C specifications ;-) ) - but then, if you pull on the same string, why even bother with confinement in the first place? If you call close() prematurely and you get a VM crash, that's on you?

Maurizio
On 4/30/20 8:10 PM, Maurizio Cimadamore wrote:
On 30/04/2020 01:06, Peter Levart wrote:
Think differently: what if the client succeeded in closing the segment, just because it did it in a time window when no thread in the thread pool held an open scope (this is entirely possible with parallel stream for example since threads periodically acquire and close scopes). This would have the same effect on threads in the thread pool - they would not be able to continue their work... What I'm trying to say is that this is just a mechanism to make things safe, not to coordinate work. If program wants to avoid trouble, it must carefully coordinate work of threads.
This appears to me to be a bit of a slippery slope. Sure, if somebody prematurely calls close() on a segment while other threads are accessing it, it could be seen as undefined behavior (a la C specifications ;-) ) - but then, if you pull on the same string, why even bother with confinement in the first place? If you call close() prematurely and you get a VM crash, that's on you?
Luckily, I think I have fixed this shortcoming in the alternative MemoryScope:

http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/v2/MemorySco...

The trick is in using a 'state' with 3 values: OPEN, CLOSING, CLOSED ...

The acquiring thread does the following in order:
- increments the 'acquires' scalable counter (volatile write)
- reads the 'state' (volatile read) and then enters a spin-loop:
  - if state == STATE_OPEN the acquire succeeded (this is the fast path); else
  - if state == STATE_CLOSING it spin-loops re-reading 'state' in each iteration; else
  - if state == STATE_CLOSED it increments the 'releases' scalable counter and throws an exception

The closing thread does the following in order:
- writes STATE_CLOSING to 'state' (volatile write)
- sums the 'releases' scalable counter (volatile reads)
- sums the 'acquires' scalable counter (volatile reads)
- compares both sums and:
  - if they don't match, it writes back STATE_OPEN to 'state' (volatile write) and throws an exception; else
  - it writes STATE_CLOSED to 'state' (volatile write) and executes the cleanup action

This, I think, is better, isn't it?

Regards, Peter
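Condensing the protocol above into a sketch (my own illustration of the described state machine, not the code behind the link):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// Sketch of the OPEN/CLOSING/CLOSED protocol: acquire increments, then reads
// the state, spinning while a close is in flight; close publishes CLOSING,
// compares the counter sums (releases first!), and either commits CLOSED or
// rolls back to OPEN. (Condensed illustration only.)
final class ThreeStateScope {
    static final int OPEN = 0, CLOSING = 1, CLOSED = 2;

    private final AtomicInteger state = new AtomicInteger(OPEN);
    private final LongAdder acquires = new LongAdder();
    private final LongAdder releases = new LongAdder();

    void acquire() {
        acquires.increment();                  // 1st: record the intent
        int s;
        while ((s = state.get()) == CLOSING) { // wait out an in-flight close
            Thread.onSpinWait();
        }
        if (s == CLOSED) {
            releases.increment();              // undo, the scope is gone
            throw new IllegalStateException("already closed");
        }
    }

    void release() {
        releases.increment();
    }

    void close() {
        state.set(CLOSING);                    // 1st: block new acquires
        long r = releases.sum();               // 2nd: sum releases before acquires!
        long a = acquires.sum();
        if (r != a) {
            state.set(OPEN);                   // roll back and fail
            throw new IllegalStateException("active acquired children");
        }
        state.set(CLOSED);                     // commit; run cleanup here
    }
}
```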
Maurizio
Hi Maurizio, On 5/1/20 3:00 PM, Peter Levart wrote:
Luckily, I think I have fixed this shortcoming in the alternative MemoryScope:
http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/v2/MemorySco...
I added support for dup() (I modified the above file in place) and also the last missing part: MemoryScope.GLOBAL. So WDYT? Is it ok to propose this for panama? How do I proceed? Regards, Peter
On 01/05/2020 15:31, Peter Levart wrote:
So WDYT? Is it ok to propose this for panama? How do I proceed?
You can do a pull request against this branch: https://github.com/openjdk/panama-foreign/tree/foreign-memaccess Cheers Maurizio
Hi Peter, this does look better, yes. I suspect this doesn't affect performance negatively, right? (In most cases, when acquiring, state will be OPEN.)

Now there's dup(). I think implementing dup() on a root scope is not too hard - for the child you probably need some trick, but probably nothing too bad - in the sense that in a Child scope, the cleanup action is really the increment of the root exit scope. So you could have a:

close(boolean doCleanup)

like we do now, and avoid the cleanup on dup (which, in the case of the Child, will avoid the adder increment). I *believe* this should be functionally equivalent to what we have now.

One question: the more I look at the code, the less sure I am that a close vs. access race cannot occur. I'm considering this situation:

* thread 1 does acquire, and finds state == OPEN
* thread 2 does close, sets state to CLOSING, then checks if the adders match

But how can we be sure that the value we get back from the adder (e.g. acquires.sum()) is accurate and reflects the fact that thread (1) has entered already? The API doesn't seem to provide any such guarantee:

"The returned value is /NOT/ an atomic snapshot; invocation in the absence of concurrent updates returns an accurate result, but concurrent updates that occur while the sum is being calculated might not be incorporated."

I guess perhaps the trick is in that "while"? I.e. the guarantee is lost only if the concurrent update occurs _while_ sum() is called. Now I think this is ok - because when acquire races with close we can have two cases:

1) "state" has been set to CLOSING _before_ it is read inside acquire()
2) "state" has been set to CLOSING _after_ it is read inside acquire()

In case (1), acquire will start spinning, so nothing harmful can really happen here. Either the read of "state" from acquire() happened when "state" was about to transition to CLOSED (in which case it will fail soon after) - or the read happened before close() had a chance to look at the counter - in which case there might be a chance that the counter will be updated concurrently (e.g. the acquire() thread calls increment() while the close() thread calls sum()). But there will be two outcomes here: either the adder has missed the update, in which case the segment will be closed, and acquire() will fail; or the adder got the update, in which case close() will fail and acquire() will fail.

In case (2) we have a happens-before edge between the "state" read performed by acquire() and the "state" write performed by close(). Which means that, by the time we get to calling acquires.sum(), we are guaranteed that the thread doing the close() will have seen the adder update from the thread doing the acquire (since the update comes _before_ the volatile read in the acquire() method). So, for this case we have that:

* [acquire] acquires.increment() happens before
* [acquire] state > OPEN happens before
* [close] state = CLOSING happens before
* [close] acquires.sum()

Which, I think, proves that the thread performing the acquire cannot possibly have assumed that the scope is OPEN while also updating the adder concurrently with the call to sum() in the close thread.

Maurizio

On 01/05/2020 14:00, Peter Levart wrote:
On 4/30/20 8:10 PM, Maurizio Cimadamore wrote:
On 30/04/2020 01:06, Peter Levart wrote:
Think differently: what if the client succeeded in closing the segment, just because it did it in a time window when no thread in the thread pool held an open scope (this is entirely possible with parallel stream for example since threads periodically acquire and close scopes). This would have the same effect on threads in the thread pool - they would not be able to continue their work... What I'm trying to say is that this is just a mechanism to make things safe, not to coordinate work. If program wants to avoid trouble, it must carefully coordinate work of threads.
This appear to me to be a bit of a slippery slope? Sure, if somebody prematurely calls close() on a segment while other threads are accessing it, it could be seen as undefined behavior (a la C specifications ;-) ), but then, if you pull on the same string, why even bother with confinement in the first place? If you call close() prematurely and you get a VM crash that's on you?
Luckily, I think I have fixed this shortcoming in the alternative MemoryScope:
http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/v2/MemorySco...
The trick is in using a 'state' with 3 values: OPEN, CLOSING, CLOSED ...
The acquiring thread does the following in order:
- increments the 'acquires' scalable counter (volatile write)
- reads the 'state' (volatile read) and then enters a spin-loop:
- if state == STATE_OPEN the acquire succeeded (this is fast path); else
- if state == STATE_CLOSING it spin-loops re-reading 'state' in each iteration; else
- if state == STATE_CLOSED it increments 'releases' scalable counter and throws exception
The closing thread does the following in order:
- writes STATE_CLOSING to 'state' (volatile write)
- sums the 'releases' scalable counter (volatile reads)
- sums the 'acquires' scalable counter (volatile reads)
- compares both sums and:
- if they don't match then it writes back STATE_OPEN to 'state' (volatile write) and throws exception; else
- it writes STATE_CLOSED to 'state' (volatile write) and executes cleanup action
This, I think, is better, isn't it?
Regards, Peter
Maurizio
On 01/05/2020 15:36, Maurizio Cimadamore wrote:
in which case close() will fail and acquire() will fail.

I meant: close() will fail and the segment will be acquired
Hi Maurizio, On 5/1/20 4:36 PM, Maurizio Cimadamore wrote:
Hi Peter, this does look better yes.
I suspect this doesn't affect performance negatively right? (in most cases, when acquiring, state will be OPEN).
Now there's dup(). I think implementing dup() on a root scope is not too hard - for the child you probably need some trick, but probably not too bad - in the sense that in a Child scope, the cleanup action is really the increment of the root exit scope. So you could have a:
close(boolean doCleanup)
like we do now, and avoid the cleanup on dup (which in the case of the Child will avoid the adder increment). I *believe* this should be functionally equivalent to what we have now.
I'll look into optimizing this part.
One question: the more I look at the code, the less I'm sure that a close vs. access race cannot occur. I'm considering this situation:
* thread 1 does acquire, and finds state == OPEN
* thread 2 does close, sets state to CLOSING, then checks if the adders match
But, how can we be sure that the value we get back from the adder (e.g. acquires.sum()) is accurate and reflects the fact that thread (1) has entered already?
Because of ordering. All 4 actions are volatile reads or writes. So we can linearize them:

Thread 1 does:
a) increments 'acquires'
b) reads OPEN from 'state'

Thread 2 does:
c) writes CLOSING to 'state'
d) sums 'acquires'

'a' happens before 'b' happens before 'c' happens before 'd' => 'a' happens before 'd'
The API doesn't seem to provide any such guarantee:
" The returned value is /NOT/ an atomic snapshot; invocation in the absence of concurrent updates returns an accurate result, but concurrent updates that occur while the sum is being calculated might not be incorporated."
As far as the relation of action 'a' to action 'd' is concerned, those two actions are not concurrent: clearly 'a' happens before 'd', so in 'd' we are guaranteed to see the effects of 'a'. Surely we may also see the effects of other concurrent updates. But what those concurrent updates do is increment the 'acquires' counter (more precisely, one of the counters in the LongAdder). So we may see an even larger sum, never a smaller one (unless overflow happens, but then the sum won't be any "closer" to the sum of 'releases', only further away). What is important here is that we will not miss the update of action 'a' above.

The other important part is that we first do the sum of 'releases' and then the sum of 'acquires'. That way we will never miss an update of 'acquires' if we saw the related update of 'releases' made by the same thread, because the acquiring thread first increments 'acquires' and then, when finished, increments 'releases'. It's all a play of action orderings.
I guess perhaps the trick is in that "while" ? E.g. there's no guarantee only if the concurrent update occurs _while_ sum() is called.
The while loop is just an addition to prevent spurious acquire failures. It has nothing to do with "detection".
Now I think this is ok - because when acquire races with close we can have two cases:
1) "state" has been set to CLOSING _before_ it is read inside acquire()
2) "state" has been set to CLOSING _after_ it is read inside acquire()

In case (1), acquire will start spinning, so nothing harmful can really happen here. Either the read of "state" from acquire() happened when "state" was about to transition to CLOSED (in which case it will fail soon after) - or the read happened before close() had a chance to look at the counter - in which case there might be a chance that the counter will be updated concurrently (e.g. the acquire() thread calls increment() while the close() thread calls sum()). But there will be two outcomes here: either the adder has missed the update, in which case the segment will be closed, and acquire() will fail; or the adder got the update, in which case close() will fail and acquire() will fail.

In case (2) we have a happens-before edge between the "state" read performed by acquire() and the "state" write performed by close(). Which means that, by the time we get to calling acquires.sum(), we are guaranteed that the thread doing the close() will have seen the adder update from the thread doing the acquire (since the update comes _before_ the volatile read in the acquire() method). So, for this case we have that:

* [acquire] acquires.increment() happens before
* [acquire] state > OPEN happens before
* [close] state = CLOSING happens before
* [close] acquires.sum()

Which, I think, proves that the thread performing the acquire cannot possibly have assumed that the scope is OPEN while also updating the adder concurrently with the call to sum() in the close thread.
But you just showed above that acquires.increment() (by acquiring thread) happens before acquires.sum() (by closing thread) - they don't execute concurrently, so closing thread can not possibly see the sum of 'acquires' be equal to the sum of 'releases' (which has been taken just before the sum of acquires) and therefore it will transition state back to OPEN and throw exception. So close will fail and acquire will succeed. Regards, Peter
Maurizio
On 01/05/2020 14:00, Peter Levart wrote:
On 4/30/20 8:10 PM, Maurizio Cimadamore wrote:
On 30/04/2020 01:06, Peter Levart wrote:
Think differently: what if the client succeeded in closing the segment, just because it did it in a time window when no thread in the thread pool held an open scope (this is entirely possible with parallel stream for example since threads periodically acquire and close scopes). This would have the same effect on threads in the thread pool - they would not be able to continue their work... What I'm trying to say is that this is just a mechanism to make things safe, not to coordinate work. If program wants to avoid trouble, it must carefully coordinate work of threads.
This appear to me to be a bit of a slippery slope? Sure, if somebody prematurely calls close() on a segment while other threads are accessing it, it could be seen as undefined behavior (a la C specifications ;-) ), but then, if you pull on the same string, why even bother with confinement in the first place? If you call close() prematurely and you get a VM crash that's on you?
Luckily, I think I have fixed this shortcoming in the alternative MemoryScope:
http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/v2/MemorySco...
The trick is in using a 'state' with 3 values: OPEN, CLOSING, CLOSED ...
The acquiring thread does the following in order:
- increments the 'acquires' scalable counter (volatile write)
- reads the 'state' (volatile read) and then enters a spin-loop:
- if state == STATE_OPEN the acquire succeeded (this is fast path); else
- if state == STATE_CLOSING it spin-loops re-reading 'state' in each iteration; else
- if state == STATE_CLOSED it increments 'releases' scalable counter and throws exception
The closing thread does the following in order:
- writes STATE_CLOSING to 'state' (volatile write)
- sums the 'releases' scalable counter (volatile reads)
- sums the 'acquires' scalable counter (volatile reads)
- compares both sums and:
- if they don't match then it writes back STATE_OPEN to 'state' (volatile write) and throws exception; else
- it writes STATE_CLOSED to 'state' (volatile write) and executes cleanup action
This, I think, is better, isn't it?
Regards, Peter
Maurizio
On 02/05/2020 09:45, Peter Levart wrote:
Thread 1 does:
a) increments 'acquires'
b) reads OPEN from 'state'
Thread 2 does:
c) writes CLOSING to 'state'
d) sums 'acquires'
'a' happens before 'b' happens before 'c' happens before 'd' => 'a' happens before 'd'
Where did you get the "b" happens before "c" edge? I think it's more subtle than that? E.g. if thread 1 saw OPEN, _then_ you can establish an HB edge between (b) and (c). But it could also be the case that Thread 1 reads CLOSED and starts to spin - so we also need to prove that there are some HB edges in that case (which I've tried to do in my email). I think this is the actual _difficult_ case to prove (I agree that the other is just the transitive property of HB) - I think I've convinced myself that we should be in the clear also for the path when Thread 1 reads CLOSING, but I found it more subtle there to prove that either one thread or the other would fail. E.g. can there be a situation where BOTH operations fail?

Maurizio
Hi Maurizio, On 5/5/20 4:12 PM, Maurizio Cimadamore wrote:
On 02/05/2020 09:45, Peter Levart wrote:
Thread 1 does:
a) increments 'acquires'
b) reads OPEN from 'state'
Thread 2 does:
c) writes CLOSING to 'state'
d) sums 'acquires'
'a' happens before 'b' happens before 'c' happens before 'd' => 'a' happens before 'd'
Where did you get the "b" happens before "c" edge? I think it's more subtle than that? E.g. if thread 1 saw OPEN, _then_ you can establish an HB edge between (b) and (c). But it could also be the case that Thread 1 reads CLOSED and starts to spin - so we also need to prove that there are some HB edges in that case (which I've tried to do in my email). I think this is the actual _difficult_ case to prove (I agree that the other is just the transitive property of HB) - I think I've convinced myself that we should be in the clear also for the path when Thread 1 reads CLOSING, but I found it more subtle there to prove that either one thread or the other would fail.
E.g. can there be a situation where BOTH operations fail?
Maurizio
There is one precondition which has to hold for this to work correctly: close() is only ever executed by one thread. And in the case of MemoryScope this is true, because calls to close() are pre-screened by MemorySegment, which checks that close() is executed only in the owner thread. So if this holds, then it follows that 'b' happens before 'c', because in 'b' we did not see the effect of 'c', and 'c' is a volatile write while 'b' is a volatile read. So for the case that you were asking about:
One question: the more I look at the code, the less I'm sure that a close vs. access race cannot occur. I'm considering this situation:
* thread 1 does acquire, and finds state == OPEN
* thread 2 does close, sets state to CLOSING, then checks if the adders match
But, how can we be sure that the value we get back from the adder (e.g. acquires.sum()) is accurate and reflects the fact that thread (1) has entered already?
...this is all that needs to be proven. For other cases, the proof of correctness is of course different.

So you are concerned about the case where the acquiring thread reads CLOSED. This is simple: at that point the thread would just increment 'releases' (to level them back with 'acquires') and fail with an exception. If that happens, then we are also sure that the closing thread did successfully close the scope, so we have mutual exclusion covered in this case, where one thread is guaranteed to succeed while the other is guaranteed to fail.

The interesting case is when the acquiring thread reads CLOSING. In that case it will spin-loop, re-reading the state until it either gets reset back to OPEN or set to CLOSED. So we have 2 sub-cases for how the loop ends:

- it finally reads CLOSED - in this case the closing thread decided to successfully close the scope, so the acquiring thread fails
- it finally reads OPEN - in this case the closing thread decided that it should not close the scope, so close() fails, but the acquiring thread succeeds

So in both of the above sub-cases we have mutual exclusion covered. Notice that for these last three cases it was not important what the closing thread read from acquires.sum() and releases.sum(), so we don't have to prove anything about the ordering of those two operations.

Regards, Peter
On 5/1/20 4:36 PM, Maurizio Cimadamore wrote:
Now there's dup(). I think implementing dup() on a root scope is not too hard - for the child you probably need some trick, but probably not too bad - in the sense that in a Child scope, the cleanup action is really the increment of the root exit scope. So you could have a:
close(boolean doCleanup)
like we do now, and avoid the cleanup on dup (which in the case of the Child will avoid the adder increment). I *believe* this should be functionally equivalent to what we have now.
I optimized dup() on both root and child scopes as much as possible in the last version of MemoryScope. I also added some comments about the usage of the methods. I'll make a PR for this version, incorporated into panama, shortly: http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/v3/MemorySco... Regards, Peter
Hi Maurizio, On 4/29/20 2:41 AM, Maurizio Cimadamore wrote:
The current implementation has performance that is on par with the previous acquire-based implementation, and also on par with what can be achieved with Unsafe. We do have a micro benchmark in the patch (see ParallelSum (**)) which tests this, and I get _identical_ numbers even if I _comment out_ the body of acquire/release - so that no contention can happen; so, I'm a bit skeptical overall that contention on acquire/release is the main factor at play here - but perhaps we need more targeted benchmarks.
So I modified your benchmark (just took out the relevant parts) and added some benchmarks that exercise Stream.findAny() and Stream.findFirst(). As I anticipated, the parallel stream variants were slowing the benchmark down, so I had to reduce the number of elements by a factor of 16 to get results in reasonable time:

http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/ParallelSum....

I then swapped in the alternative implementation of MemoryScope (note that this is not a whole implementation - the dup() method is missing):

http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/MemoryScope....

...and I got these results on an i7 2600K (4 cores / 8 threads).

With the proposed MemoryScope:

Benchmark                               Mode  Cnt     Score     Error  Units
ParallelSum.find_any_stream_parallel    avgt   10  1332.687 ± 733.535  ms/op
ParallelSum.find_any_stream_serial      avgt   10   440.260 ±   3.110  ms/op
ParallelSum.find_first_loop_serial      avgt   10     5.809 ±   0.044  ms/op
ParallelSum.find_first_stream_parallel  avgt   10  2070.318 ±  41.072  ms/op
ParallelSum.find_first_stream_serial    avgt   10   440.034 ±   4.672  ms/op
ParallelSum.sum_loop_serial             avgt   10     5.647 ±   0.055  ms/op
ParallelSum.sum_stream_parallel         avgt   10     5.314 ±   0.294  ms/op
ParallelSum.sum_stream_serial           avgt   10    19.179 ±   0.136  ms/op

With the alternative MemoryScope:

Benchmark                               Mode  Cnt     Score     Error  Units
ParallelSum.find_any_stream_parallel    avgt   10    80.280 ±  13.183  ms/op
ParallelSum.find_any_stream_serial      avgt   10   317.388 ±   2.787  ms/op
ParallelSum.find_first_loop_serial      avgt   10     5.790 ±   0.038  ms/op
ParallelSum.find_first_stream_parallel  avgt   10   117.925 ±   1.747  ms/op
ParallelSum.find_first_stream_serial    avgt   10   315.076 ±   5.725  ms/op
ParallelSum.sum_loop_serial             avgt   10     5.652 ±   0.042  ms/op
ParallelSum.sum_stream_parallel         avgt   10     4.881 ±   0.053  ms/op
ParallelSum.sum_stream_serial           avgt   10    19.143 ±   0.035  ms/op

So here it is - the proof that contention does occur.

Regards, Peter
Many thanks Peter; my preference would be to add the benchmark now, and come back to fix this post-integration. The MemoryScope code is finicky, and gaining confidence that the code is race-free takes time; I'd prefer not to change it during the course of this RFR. Your approach seems promising, so let's keep working and thinking on it on the side (and perhaps go for it in the panama repo first, to make sure we don't introduce regressions). Sounds like a plan? Cheers Maurizio On 29/04/2020 20:19, Peter Levart wrote:
Hi Maurizio,
On 4/29/20 2:41 AM, Maurizio Cimadamore wrote:
The current implementation has performances that are on par with the previous acquire-based implementation, and also on par with what can be achieved with Unsafe. We do have a micro benchmark in the patch (see ParallelSum (**)) which tests this, and I get _identical_ numbers even if I _comment_ the body of acquire/release - so that no contention can happen; so, I'm a bit skeptical overall that contention on acquire/release is the main factor at play here - but perhaps we need more targeted benchmarks.
So I modified your benchmark (just took out the relevant parts) and added some benchmarks that exhibit Stream.findAny() and Stream.findFirst(). As I anticipated, the results for parallel stream variants were slowing the benchmark down, so I had to reduce the number of elements by a factor of 16 to get results in reasonable time:
http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/ParallelSum....
I then swapped-in the alternative implementation of MemoryScope (note that this is not a whole implementation - the dup() method is missing):
http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/MemoryScope....
...and I got these results:
i7 2600K (4 cores / 8 threads)
with proposed MemoryScope:
Benchmark                               Mode  Cnt     Score     Error  Units
ParallelSum.find_any_stream_parallel    avgt   10  1332.687 ± 733.535  ms/op
ParallelSum.find_any_stream_serial      avgt   10   440.260 ±   3.110  ms/op
ParallelSum.find_first_loop_serial      avgt   10     5.809 ±   0.044  ms/op
ParallelSum.find_first_stream_parallel  avgt   10  2070.318 ±  41.072  ms/op
ParallelSum.find_first_stream_serial    avgt   10   440.034 ±   4.672  ms/op
ParallelSum.sum_loop_serial             avgt   10     5.647 ±   0.055  ms/op
ParallelSum.sum_stream_parallel         avgt   10     5.314 ±   0.294  ms/op
ParallelSum.sum_stream_serial           avgt   10    19.179 ±   0.136  ms/op
with alternative MemoryScope:
Benchmark                               Mode  Cnt     Score     Error  Units
ParallelSum.find_any_stream_parallel    avgt   10    80.280 ±  13.183  ms/op
ParallelSum.find_any_stream_serial      avgt   10   317.388 ±   2.787  ms/op
ParallelSum.find_first_loop_serial      avgt   10     5.790 ±   0.038  ms/op
ParallelSum.find_first_stream_parallel  avgt   10   117.925 ±   1.747  ms/op
ParallelSum.find_first_stream_serial    avgt   10   315.076 ±   5.725  ms/op
ParallelSum.sum_loop_serial             avgt   10     5.652 ±   0.042  ms/op
ParallelSum.sum_stream_parallel         avgt   10     4.881 ±   0.053  ms/op
ParallelSum.sum_stream_serial           avgt   10    19.143 ±   0.035  ms/op
So here it is. The proof that contention does occur.
Regards, Peter
On 4/29/20 10:23 PM, Maurizio Cimadamore wrote:
Many thanks Peter, my preference would be for adding the benchmark now, and come back to fix this post integration. The MemoryScope code is finicky and getting confidence that the code is race-free takes time and I'd prefer not to change that during the course of this RFR.
Your approach seems promising, so let's keep working and thinking on it on the side (and perhaps let's maybe go for it in the panama repo first, to make sure we don't add regressions).
Sounds like a plan?
I agree. There's still plenty of time until 15 ships... Regards, Peter
Cheers Maurizio
On 29/04/2020 20:19, Peter Levart wrote:
Hi Maurizio,
On 4/29/20 2:41 AM, Maurizio Cimadamore wrote:
The current implementation has performances that are on par with the previous acquire-based implementation, and also on par with what can be achieved with Unsafe. We do have a micro benchmark in the patch (see ParallelSum (**)) which tests this, and I get _identical_ numbers even if I _comment_ the body of acquire/release - so that no contention can happen; so, I'm a bit skeptical overall that contention on acquire/release is the main factor at play here - but perhaps we need more targeted benchmarks.
So I modified your benchmark (just took out the relevant parts) and added some benchmarks that exhibit Stream.findAny() and Stream.findFirst(). As I anticipated, the results for parallel stream variants were slowing the benchmark down, so I had to reduce the number of elements by a factor of 16 to get results in reasonable time:
http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/ParallelSum....
I then swapped-in the alternative implementation of MemoryScope (note that this is not a whole implementation - the dup() method is missing):
http://cr.openjdk.java.net/~plevart/jdk-dev/8243491_MemoryScope/MemoryScope....
...and I got these results:
i7 2600K (4 cores / 8 threads)
with proposed MemoryScope:
Benchmark                               Mode  Cnt     Score     Error  Units
ParallelSum.find_any_stream_parallel    avgt   10  1332.687 ± 733.535  ms/op
ParallelSum.find_any_stream_serial      avgt   10   440.260 ±   3.110  ms/op
ParallelSum.find_first_loop_serial      avgt   10     5.809 ±   0.044  ms/op
ParallelSum.find_first_stream_parallel  avgt   10  2070.318 ±  41.072  ms/op
ParallelSum.find_first_stream_serial    avgt   10   440.034 ±   4.672  ms/op
ParallelSum.sum_loop_serial             avgt   10     5.647 ±   0.055  ms/op
ParallelSum.sum_stream_parallel         avgt   10     5.314 ±   0.294  ms/op
ParallelSum.sum_stream_serial           avgt   10    19.179 ±   0.136  ms/op
with alternative MemoryScope:
Benchmark                               Mode  Cnt     Score     Error  Units
ParallelSum.find_any_stream_parallel    avgt   10    80.280 ±  13.183  ms/op
ParallelSum.find_any_stream_serial      avgt   10   317.388 ±   2.787  ms/op
ParallelSum.find_first_loop_serial      avgt   10     5.790 ±   0.038  ms/op
ParallelSum.find_first_stream_parallel  avgt   10   117.925 ±   1.747  ms/op
ParallelSum.find_first_stream_serial    avgt   10   315.076 ±   5.725  ms/op
ParallelSum.sum_loop_serial             avgt   10     5.652 ±   0.042  ms/op
ParallelSum.sum_stream_parallel         avgt   10     4.881 ±   0.053  ms/op
ParallelSum.sum_stream_serial           avgt   10    19.143 ±   0.035  ms/op
So here it is. The proof that contention does occur.
Regards, Peter
On 28/04/2020 17:12, Peter Levart wrote:
Hi Maurizio,
I'm checking out the thread-confinement in the parallel stream case. I see the Spliterator.trySplit() is calling AbstractMemorySegmentImpl's:
private AbstractMemorySegmentImpl asSliceNoCheck(long offset, long newSize) {
    return dup(offset, newSize, mask, owner, scope);
}
...so here the "owner" of the slice is still the same as that of parent segment...
But then later, in tryAdvance or forEachRemaining, the segment is acquired/closed for each element of the stream (in the case of tryAdvance) or once for the whole chunk up to the end of the spliterator (in the case of forEachRemaining). So some pipelines will be more efficient than others...
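The cost difference can be sketched like this (illustrative code only, not the actual patch; the scope is modeled as a plain counter so the acquire/release traffic is observable):

```java
import java.util.Spliterator;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

// Illustrative sketch: tryAdvance acquires/releases the scope per element,
// while forEachRemaining does it once per chunk. "scopeOps" just counts
// acquires so the difference can be measured.
class SliceSpliterator implements Spliterator.OfInt {
    final int[] data; int pos; final int fence;
    final AtomicInteger scopeOps;

    SliceSpliterator(int[] data, int from, int to, AtomicInteger scopeOps) {
        this.data = data; this.pos = from; this.fence = to; this.scopeOps = scopeOps;
    }

    void acquire() { scopeOps.incrementAndGet(); }  // stand-in for the liveness check
    void release() { /* paired with acquire */ }

    public boolean tryAdvance(IntConsumer action) {
        if (pos >= fence) return false;
        acquire();                                  // one acquire per element
        try { action.accept(data[pos++]); } finally { release(); }
        return true;
    }

    public void forEachRemaining(IntConsumer action) {
        if (pos >= fence) return;
        acquire();                                  // single acquire for the whole chunk
        try { while (pos < fence) action.accept(data[pos++]); } finally { release(); }
    }

    public Spliterator.OfInt trySplit() {
        int mid = (pos + fence) >>> 1;
        if (mid <= pos) return null;
        SliceSpliterator prefix = new SliceSpliterator(data, pos, mid, scopeOps);
        pos = mid;                                  // this spliterator keeps the suffix
        return prefix;
    }

    public long estimateSize() { return fence - pos; }
    public int characteristics() { return SIZED | SUBSIZED | ORDERED | NONNULL; }
}
```

So a pipeline that drains via forEachRemaining pays one acquire per chunk, while a short-circuiting one driven by tryAdvance (findAny/findFirst) pays one per element.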
Not sure I follow here - you have to create a new segment for each element of the stream since you don't know which thread is going to process it anyway, no? Maurizio
So I'm thinking: would it be possible to "lazily" acquire the scope just once in tryAdvance and then re-use it until the end? Unfortunately, Spliterator does not have a close() method to be called when the pipeline is done with it. Perhaps it could be added to the API? This is not the first time I have wished Spliterator had a close method. I had a similar problem when trying to create a Spliterator with a database backend: when using the JDBC API, a separate transaction (Connection) is typically required for each thread of execution, since several frameworks bind it to a ThreadLocal.
WDYT?
Regards, Peter
On 4/23/20 10:33 PM, Maurizio Cimadamore wrote:
Hi, time has come for another round of foreign memory access API incubation (see JEP 383 [3]). This iteration aims at polishing some of the rough edges of the API, and adds some of the functionalities that developers have been asking for during this first round of incubation. The revised API tightens the thread-confinement constraints (by removing the MemorySegment::acquire method) and instead provides more targeted support for parallel computation via a segment spliterator. The API also adds a way to create a custom native segment; this is, essentially, an unsafe API point, very similar in spirit to the JNI NewDirectByteBuffer functionality [1]. By using this bit of API, power-users will be able to add support, via MemorySegment, to *their own memory sources* (e.g. think of a custom allocator written in C/C++). For now, this API point is called off as "restricted" and a special read-only JDK property will have to be set on the command line for calls to this method to succeed. We are aware there's no precedent for something like this in the Java SE API - but if Project Panama is to remain true about its ultimate goal of replacing bits of JNI code with (low level) Java code, stuff like this has to be *possible*. We anticipate that, at some point, this property will become a true launcher flag, and that the foreign restricted machinery will be integrated more neatly into the module system.
A list of the API, implementation and test changes is provided below. If you have any questions, or need more detailed explanations, I (and the rest of the Panama team) will be happy to point at existing discussions, and/or to provide the feedback required.
Thanks Maurizio
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary...
CSR:
https://bugs.openjdk.java.net/browse/JDK-8243496
API changes
===========

* MemorySegment
- drop support for the acquire() method - in its place you can now obtain a spliterator from a segment, which supports divide-and-conquer
- revamped support for views - e.g. isReadOnly - now segments have access modes
- added API to do serial confinement hand-off (MemorySegment::withOwnerThread)
- added unsafe factory to construct a native segment out of an existing address; this API is "restricted" and only available if the program is executed using the -Dforeign.unsafe=permit flag
- MemorySegment::mapFromPath now returns a MappedMemorySegment

* MappedMemorySegment
- small sub-interface which provides extra capabilities for mapped segments (load(), unload() and force())

* MemoryAddress
- added distinction between *checked* and *unchecked* addresses; *unchecked* addresses do not have a segment, so they cannot be dereferenced
- added NULL memory address (it's an unchecked address)
- added factory to construct a MemoryAddress from a long value (the result is also an unchecked address)
- added API point to get the raw address value (where possible - e.g. if this is not an address pointing to a heap segment)

* MemoryLayout
- added support for layout "attributes" - e.g. store metadata inside MemoryLayouts
- added MemoryLayout::isPadding predicate
- added helper function to SequenceLayout to reshape/flatten sequence layouts (a la NDArray [4])

* MemoryHandles
- added support for general VarHandle combinators (similar to MH combinators)
- added a combinator to turn a long-VH into a MemoryAddress-VH (the resulting MemoryAddress is also *unchecked* and cannot be dereferenced)

Implementation changes
======================
* add support for VarHandle combinators (e.g. IndirectVH)
The idea here is simple: a VarHandle can almost be thought of as a set of method handles (one for each access mode supported by the var handle) that are lazily linked. This gives us a relatively simple idea upon which to build support for custom var handle adapters: we could create a VarHandle by passing an existing var handle and also specify the set of adaptations that should be applied to the method handle for a given access mode in the original var handle. The result is a new VarHandle which might support a different carrier type and more, or less coordinate types. Adding this support was relatively easy - and it only required one low-level surgery of the lambda forms generated for adapted var handle (this is required so that the "right" var handle receiver can be used for dispatching the access mode call).
All the new adapters in the MemoryHandles API (which are really defined inside VarHandles) are really just a bunch of MH adapters that are stitched together into a brand new VH. The only caveat is that, we could have a checked exception mismatch: the VarHandle API methods are specified not to throw any checked exception, whereas method handles can throw any throwable. This means that, potentially, calling get() on an adapted VarHandle could result in a checked exception being thrown; to solve this gnarly issue, we decided to scan all the filter functions passed to the VH combinators and look for direct method handles which throw checked exceptions. If such MHs are found (these can be deeply nested, since the MHs can be adapted on their own), adaptation of the target VH fails fast.
* More ByteBuffer implementation changes
Some more changes to ByteBuffer support were necessary here. First, we have added support for retrieval of "mapped" properties associated with a ByteBuffer (e.g. the file descriptor, etc.). This is crucial if we want to be able to turn an existing byte buffer into the "right kind" of memory segment.
Conversely, we also have to allow creation of mapped byte buffers given existing parameters - which is needed when going from (mapped) segment to a buffer. These two pieces together allow us to go from segment to buffer and back w/o losing any information about the underlying memory mapping (which was an issue in the previous implementation).
Lastly, to support the new MappedMemorySegment abstraction, all the memory mapped supporting functionalities have been moved into a common helper class so that MappedMemorySegmentImpl can reuse that (e.g. for MappedMemorySegment::force).
* Rewritten memory segment hierarchy
The old implementation had a monomorphic memory segment class. In this round we aimed at splitting the various implementation classes so that we have a class for heap segments (HeapMemorySegmentImpl), one for native segments (NativeMemorySegmentImpl) and one for memory mapped segments (MappedMemorySegmentImpl, which extends from NativeMemorySegmentImpl). Not much to see here - although one important point is that, by doing this, we have been able to speed up performances quite a bit, since now e.g. native/mapped segments are _guaranteed_ to have a null "base". We have also done few tricks to make sure that the "base" accessor for heap segment is sharply typed and also NPE checked, which allows C2 to speculate more and hoist. With these changes _all_ segment types have comparable performances and hoisting guarantees (unlike in the old implementation).
* Add workarounds in MemoryAddressProxy, AbstractMemorySegmentImpl to special case "small segments" so that VM can apply bound check elimination
This is another important piece which allows to get very good performances out of indexes memory access var handles; as you might know, the JIT compiler has troubles in optimizing loops where the loop variable is a long [2]. To make up for that, in this round we add an optimization which allows the API to detect whether a segment is *small* or *large*. For small segments, the API realizes that there's no need to perform long computation (e.g. to perform bound checks, or offset additions), so it falls back to integer logic, which in turns allows bound check elimination.
* renaming of the various var handle classes to conform to "memory access var handle" terminology
This is mostly stylistic, nothing to see here.
Tests changes =============
In addition to the tests for the new API changes, we've also added some stress tests for var handle combinators - e.g. there's a flag that can be enabled which turns on some "dummy" var handle adaptations on all var handles created by the runtime. We've used this flag on existing tests to make sure that things work as expected.
To sanity test the new memory segment spliterator, we have wired the new segment spliterator with the existing spliterator test harness.
We have also added several micro benchmarks for the memory segment API (and made some changes to the build script so that native libraries would be handled correctly).
[1] - https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newd... [2] - https://bugs.openjdk.java.net/browse/JDK-8223051 [3] - https://openjdk.java.net/jeps/383 [4] - https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#nump...
On 4/28/20 10:49 PM, Maurizio Cimadamore wrote:
On 28/04/2020 17:12, Peter Levart wrote:
Hi Maurizio,
I'm checking out the thread-confinement in the parallel stream case. I see the Spliterator.trySplit() is calling AbstractMemorySegmentImpl's:
102 private AbstractMemorySegmentImpl asSliceNoCheck(long offset, long newSize) { 103 return dup(offset, newSize, mask, owner, scope); 104 }
...so here the "owner" of the slice is still the same as that of parent segment...
But then later in tryAdvance or forEachRemaining, the segment is acquired/closed for each element of the stream (in case of tryAdvance) or for the whole chunk to the end of spliterator (in case of forEachRemaining). So some pipelines will be more optimal than others...
Not sure I follow here - you have to create a new segment for each element of the stream since you don't know what thread is gonna process it anyway no?
When forEachRemaining is called for all remaining elements, you only have to acquire one child scope, since your loop will process all elements in the same thread. You do create slices for each element, but slices don't acquire new child scope and close it afterwards. But when tryAdvance is called for each element from the pipeline, you have to acquire new scope and close it after each element. And this is the point of contention since multiple threads will be doing the same concurrently... So depending on whether the pipeline might shortcut the stream or not, execution could be more optimal or less, maybe to the point where contention is so large that it is prohibitive. Peter
Maurizio
So I'm thinking. Would it be possible to "lazily" acquire scope just once in tryAdvance and then re-use the scope until the end? Unfortunately Spliterator does not have a close() method to be called when the pipeline is done with it. Perhaps it could be added to the API? This is not the 1st time I wished Spliterator had a close method. I had a similar problem when trying to create a Spliterator with a database backend. When using JDBC API a separate transaction (Connection) is typically required for each thread of execution since several frameworks bind it to the ThreadLocal.
WDYT?
Regards, Peter
On 4/23/20 10:33 PM, Maurizio Cimadamore wrote:
Hi, time has come for another round of foreign memory access API incubation (see JEP 383 [3]). This iteration aims at polishing some of the rough edges of the API, and adds some of the functionalities that developers have been asking for during this first round of incubation. The revised API tightens the thread-confinement constraints (by removing the MemorySegment::acquire method) and instead provides more targeted support for parallel computation via a segment spliterator. The API also adds a way to create a custom native segment; this is, essentially, an unsafe API point, very similar in spirit to the JNI NewDirectByteBuffer functionality [1]. By using this bit of API, power-users will be able to add support, via MemorySegment, to *their own memory sources* (e.g. think of a custom allocator written in C/C++). For now, this API point is called off as "restricted" and a special read-only JDK property will have to be set on the command line for calls to this method to succeed. We are aware there's no precedent for something like this in the Java SE API - but if Project Panama is to remain true about its ultimate goal of replacing bits of JNI code with (low level) Java code, stuff like this has to be *possible*. We anticipate that, at some point, this property will become a true launcher flag, and that the foreign restricted machinery will be integrated more neatly into the module system.
A list of the API, implementation and test changes is provided below. If you have any questions, or need more detailed explanations, I (and the rest of the Panama team) will be happy to point at existing discussions, and/or to provide the feedback required.
Thanks Maurizio
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary...
CSR:
https://bugs.openjdk.java.net/browse/JDK-8243496
API changes ===========
* MemorySegment - drop support for acquire() method - in its place now you can obtain a spliterator from a segment, which supports divide-and-conquer - revamped support for views - e.g. isReadOnly - now segments have access modes - added API to do serial confinement hand-off (MemorySegment::withOwnerThread) - added unsafe factory to construct a native segment out of an existing address; this API is "restricted" and only available if the program is executed using the -Dforeign.unsafe=permit flag. - the MemorySegment::mapFromPath now returns a MappedMemorySegment * MappedMemorySegment - small sub-interface which provides extra capabilities for mapped segments (load(), unload() and force()) * MemoryAddress - added distinction between *checked* and *unchecked* addresses; *unchecked* addresses do not have a segment, so they cannot be dereferenced - added NULL memory address (it's an unchecked address) - added factory to construct MemoryAddress from long value (result is also an unchecked address) - added API point to get raw address value (where possible - e.g. if this is not an address pointing to a heap segment) * MemoryLayout - Added support for layout "attributes" - e.g. store metadata inside MemoryLayouts - Added MemoryLayout::isPadding predicate - Added helper function to SequenceLayout to rehape/flatten sequence layouts (a la NDArray [4]) * MemoryHandles - add support for general VarHandle combinators (similar to MH combinators) - add a combinator to turn a long-VH into a MemoryAddress VH (the resulting MemoryAddress is also *unchecked* and cannot be dereferenced)
Implementation changes ======================
* add support for VarHandle combinators (e.g. IndirectVH)
The idea here is simple: a VarHandle can almost be thought of as a set of method handles (one for each access mode supported by the var handle) that are lazily linked. This gives us a relatively simple idea upon which to build support for custom var handle adapters: we could create a VarHandle by passing an existing var handle and also specify the set of adaptations that should be applied to the method handle for a given access mode in the original var handle. The result is a new VarHandle which might support a different carrier type and more, or less coordinate types. Adding this support was relatively easy - and it only required one low-level surgery of the lambda forms generated for adapted var handle (this is required so that the "right" var handle receiver can be used for dispatching the access mode call).
All the new adapters in the MemoryHandles API (which are really defined inside VarHandles) are really just a bunch of MH adapters that are stitched together into a brand new VH. The only caveat is that, we could have a checked exception mismatch: the VarHandle API methods are specified not to throw any checked exception, whereas method handles can throw any throwable. This means that, potentially, calling get() on an adapted VarHandle could result in a checked exception being thrown; to solve this gnarly issue, we decided to scan all the filter functions passed to the VH combinators and look for direct method handles which throw checked exceptions. If such MHs are found (these can be deeply nested, since the MHs can be adapted on their own), adaptation of the target VH fails fast.
* More ByteBuffer implementation changes
Some more changes to ByteBuffer support were necessary here. First, we have added support for retrieval of "mapped" properties associated with a ByteBuffer (e.g. the file descriptor, etc.). This is crucial if we want to be able to turn an existing byte buffer into the "right kind" of memory segment.
Conversely, we also have to allow creation of mapped byte buffers given existing parameters - which is needed when going from (mapped) segment to a buffer. These two pieces together allow us to go from segment to buffer and back w/o losing any information about the underlying memory mapping (which was an issue in the previous implementation).
Lastly, to support the new MappedMemorySegment abstraction, all the memory mapped supporting functionalities have been moved into a common helper class so that MappedMemorySegmentImpl can reuse that (e.g. for MappedMemorySegment::force).
* Rewritten memory segment hierarchy
The old implementation had a monomorphic memory segment class. In this round we aimed at splitting the various implementation classes so that we have a class for heap segments (HeapMemorySegmentImpl), one for native segments (NativeMemorySegmentImpl) and one for memory mapped segments (MappedMemorySegmentImpl, which extends from NativeMemorySegmentImpl). Not much to see here - although one important point is that, by doing this, we have been able to speed up performances quite a bit, since now e.g. native/mapped segments are _guaranteed_ to have a null "base". We have also done few tricks to make sure that the "base" accessor for heap segment is sharply typed and also NPE checked, which allows C2 to speculate more and hoist. With these changes _all_ segment types have comparable performances and hoisting guarantees (unlike in the old implementation).
* Add workarounds in MemoryAddressProxy, AbstractMemorySegmentImpl to special case "small segments" so that VM can apply bound check elimination
This is another important piece which allows to get very good performances out of indexes memory access var handles; as you might know, the JIT compiler has troubles in optimizing loops where the loop variable is a long [2]. To make up for that, in this round we add an optimization which allows the API to detect whether a segment is *small* or *large*. For small segments, the API realizes that there's no need to perform long computation (e.g. to perform bound checks, or offset additions), so it falls back to integer logic, which in turns allows bound check elimination.
* renaming of the various var handle classes to conform to "memory access var handle" terminology
This is mostly stylistic, nothing to see here.
Tests changes =============
In addition to the tests for the new API changes, we've also added some stress tests for var handle combinators - e.g. there's a flag that can be enabled which turns on some "dummy" var handle adaptations on all var handles created by the runtime. We've used this flag on existing tests to make sure that things work as expected.
To sanity test the new memory segment spliterator, we have wired the new segment spliterator with the existing spliterator test harness.
We have also added several micro benchmarks for the memory segment API (and made some changes to the build script so that native libraries would be handled correctly).
[1] - https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newd... [2] - https://bugs.openjdk.java.net/browse/JDK-8223051 [3] - https://openjdk.java.net/jeps/383 [4] - https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#nump...
On 28/04/2020 23:08, Peter Levart wrote:
On 4/28/20 10:49 PM, Maurizio Cimadamore wrote:
On 28/04/2020 17:12, Peter Levart wrote:
Hi Maurizio,
I'm checking out the thread-confinement in the parallel stream case. I see the Spliterator.trySplit() is calling AbstractMemorySegmentImpl's:
102 private AbstractMemorySegmentImpl asSliceNoCheck(long offset, long newSize) { 103 return dup(offset, newSize, mask, owner, scope); 104 }
...so here the "owner" of the slice is still the same as that of parent segment...
But then later in tryAdvance or forEachRemaining, the segment is acquired/closed for each element of the stream (in case of tryAdvance) or for the whole chunk to the end of spliterator (in case of forEachRemaining). So some pipelines will be more optimal than others...
Not sure I follow here - you have to create a new segment for each element of the stream since you don't know what thread is gonna process it anyway no?
When forEachRemaining is called for all remaining elements, you only have to acquire one child scope, since your loop will process all elements in the same thread. You do create slices for each element, but slices don't acquire new child scope and close it afterwards. But when tryAdvance is called for each element from the pipeline, you have to acquire new scope and close it after each element. And this is the point of contention since multiple threads will be doing the same concurrently... So depending on whether the pipeline might shortcut the stream or not, execution could be more optimal or less, maybe to the point where contention is so large that it is prohibitive.
OK - see my other reply Maurizio
Peter
Maurizio
So I'm thinking. Would it be possible to "lazily" acquire scope just once in tryAdvance and then re-use the scope until the end? Unfortunately Spliterator does not have a close() method to be called when the pipeline is done with it. Perhaps it could be added to the API? This is not the 1st time I wished Spliterator had a close method. I had a similar problem when trying to create a Spliterator with a database backend. When using JDBC API a separate transaction (Connection) is typically required for each thread of execution since several frameworks bind it to the ThreadLocal.
WDYT?
Regards, Peter
On 4/23/20 10:33 PM, Maurizio Cimadamore wrote:
Hi, time has come for another round of foreign memory access API incubation (see JEP 383 [3]). This iteration aims at polishing some of the rough edges of the API, and adds some of the functionalities that developers have been asking for during this first round of incubation. The revised API tightens the thread-confinement constraints (by removing the MemorySegment::acquire method) and instead provides more targeted support for parallel computation via a segment spliterator. The API also adds a way to create a custom native segment; this is, essentially, an unsafe API point, very similar in spirit to the JNI NewDirectByteBuffer functionality [1]. By using this bit of API, power-users will be able to add support, via MemorySegment, to *their own memory sources* (e.g. think of a custom allocator written in C/C++). For now, this API point is called off as "restricted" and a special read-only JDK property will have to be set on the command line for calls to this method to succeed. We are aware there's no precedent for something like this in the Java SE API - but if Project Panama is to remain true about its ultimate goal of replacing bits of JNI code with (low level) Java code, stuff like this has to be *possible*. We anticipate that, at some point, this property will become a true launcher flag, and that the foreign restricted machinery will be integrated more neatly into the module system.
A list of the API, implementation and test changes is provided below. If you have any questions, or need more detailed explanations, I (and the rest of the Panama team) will be happy to point at existing discussions, and/or to provide the feedback required.
Thanks Maurizio
Webrev:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/webrev
Javadoc:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/javadoc
Specdiff:
http://cr.openjdk.java.net/~mcimadamore/8243491_v1/specdiff/overview-summary...
CSR:
https://bugs.openjdk.java.net/browse/JDK-8243496
API changes ===========
* MemorySegment - drop support for acquire() method - in its place now you can obtain a spliterator from a segment, which supports divide-and-conquer - revamped support for views - e.g. isReadOnly - now segments have access modes - added API to do serial confinement hand-off (MemorySegment::withOwnerThread) - added unsafe factory to construct a native segment out of an existing address; this API is "restricted" and only available if the program is executed using the -Dforeign.unsafe=permit flag. - the MemorySegment::mapFromPath now returns a MappedMemorySegment * MappedMemorySegment - small sub-interface which provides extra capabilities for mapped segments (load(), unload() and force()) * MemoryAddress - added distinction between *checked* and *unchecked* addresses; *unchecked* addresses do not have a segment, so they cannot be dereferenced - added NULL memory address (it's an unchecked address) - added factory to construct MemoryAddress from long value (result is also an unchecked address) - added API point to get raw address value (where possible - e.g. if this is not an address pointing to a heap segment) * MemoryLayout - Added support for layout "attributes" - e.g. store metadata inside MemoryLayouts - Added MemoryLayout::isPadding predicate - Added helper function to SequenceLayout to rehape/flatten sequence layouts (a la NDArray [4]) * MemoryHandles - add support for general VarHandle combinators (similar to MH combinators) - add a combinator to turn a long-VH into a MemoryAddress VH (the resulting MemoryAddress is also *unchecked* and cannot be dereferenced)
Implementation changes ======================
* add support for VarHandle combinators (e.g. IndirectVH)
The idea here is simple: a VarHandle can almost be thought of as a set of method handles (one for each access mode supported by the var handle) that are lazily linked. This gives us a relatively simple idea upon which to build support for custom var handle adapters: we could create a VarHandle by passing an existing var handle and also specify the set of adaptations that should be applied to the method handle for a given access mode in the original var handle. The result is a new VarHandle which might support a different carrier type and more, or less coordinate types. Adding this support was relatively easy - and it only required one low-level surgery of the lambda forms generated for adapted var handle (this is required so that the "right" var handle receiver can be used for dispatching the access mode call).
All the new adapters in the MemoryHandles API (which are really defined inside VarHandles) are really just a bunch of MH adapters that are stitched together into a brand new VH. The only caveat is that, we could have a checked exception mismatch: the VarHandle API methods are specified not to throw any checked exception, whereas method handles can throw any throwable. This means that, potentially, calling get() on an adapted VarHandle could result in a checked exception being thrown; to solve this gnarly issue, we decided to scan all the filter functions passed to the VH combinators and look for direct method handles which throw checked exceptions. If such MHs are found (these can be deeply nested, since the MHs can be adapted on their own), adaptation of the target VH fails fast.
* More ByteBuffer implementation changes
Some more changes to ByteBuffer support were necessary here. First, we have added support for retrieval of "mapped" properties associated with a ByteBuffer (e.g. the file descriptor, etc.). This is crucial if we want to be able to turn an existing byte buffer into the "right kind" of memory segment.
Conversely, we also have to allow creation of mapped byte buffers given existing parameters - which is needed when going from (mapped) segment to a buffer. These two pieces together allow us to go from segment to buffer and back w/o losing any information about the underlying memory mapping (which was an issue in the previous implementation).
Lastly, to support the new MappedMemorySegment abstraction, all the memory mapped supporting functionalities have been moved into a common helper class so that MappedMemorySegmentImpl can reuse that (e.g. for MappedMemorySegment::force).
* Rewritten memory segment hierarchy
The old implementation had a monomorphic memory segment class. In this round we aimed at splitting the various implementation classes so that we have a class for heap segments (HeapMemorySegmentImpl), one for native segments (NativeMemorySegmentImpl) and one for memory mapped segments (MappedMemorySegmentImpl, which extends from NativeMemorySegmentImpl). Not much to see here - although one important point is that, by doing this, we have been able to speed up performances quite a bit, since now e.g. native/mapped segments are _guaranteed_ to have a null "base". We have also done few tricks to make sure that the "base" accessor for heap segment is sharply typed and also NPE checked, which allows C2 to speculate more and hoist. With these changes _all_ segment types have comparable performances and hoisting guarantees (unlike in the old implementation).
* Add workarounds in MemoryAddressProxy, AbstractMemorySegmentImpl to special case "small segments" so that VM can apply bound check elimination
This is another important piece which allows to get very good performances out of indexes memory access var handles; as you might know, the JIT compiler has troubles in optimizing loops where the loop variable is a long [2]. To make up for that, in this round we add an optimization which allows the API to detect whether a segment is *small* or *large*. For small segments, the API realizes that there's no need to perform long computation (e.g. to perform bound checks, or offset additions), so it falls back to integer logic, which in turns allows bound check elimination.
* renaming of the various var handle classes to conform to "memory access var handle" terminology
This is mostly stylistic, nothing to see here.
Tests changes =============
In addition to the tests for the new API changes, we've also added some stress tests for var handle combinators - e.g. there's a flag that can be enabled which turns on some "dummy" var handle adaptations on all var handles created by the runtime. We've used this flag on existing tests to make sure that things work as expected.
To sanity test the new memory segment spliterator, we have wired the new segment spliterator with the existing spliterator test harness.
We have also added several micro benchmarks for the memory segment API (and made some changes to the build script so that native libraries would be handled correctly).
[1] - https://docs.oracle.com/en/java/javase/14/docs/specs/jni/functions.html#newd... [2] - https://bugs.openjdk.java.net/browse/JDK-8223051 [3] - https://openjdk.java.net/jeps/383 [4] - https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html#nump...
Maurizio,
On 23 Apr 2020, at 21:33, Maurizio Cimadamore <maurizio.cimadamore@oracle.com> wrote:
Javadoc:
I’m still working my way through this update ( BTW it’s looking very good ), but an initial comment on the API. The access modes of MemorySegment are a nice addition, but while reading the API alone, I find myself wondering what access modes are held by segments returned by the static factories. Most hold all access modes but not all, e.g. read-only byte buffer. AND a mapped segment with mapping mode READ_ONLY ? Maybe it is worth specifying this? E.g something like: https://cr.openjdk.java.net/~chegar/foreign/segmentAccessModes/ -Chris.
Looks good - thanks. Perhaps I'd tweak "will hold" and replace it with "will feature", and also add a link between "access modes" to the access mode section of the javadoc, to make it more navigable. Maurizio On 30/04/2020 14:33, Chris Hegarty wrote:
Maurizio,
On 23 Apr 2020, at 21:33, Maurizio Cimadamore <maurizio.cimadamore@oracle.com> wrote:
Javadoc:
I’m still working my way through this update ( BTW it’s looking very good ), but an initial comment on the API.
The access modes of MemorySegment are a nice addition, but while reading the API alone, I find myself wondering what access modes are held by segments returned by the static factories. Most hold all access modes but not all, e.g. read-only byte buffer. AND a mapped segment with mapping mode READ_ONLY ?
Maybe it is worth specifying this? E.g something like: https://cr.openjdk.java.net/~chegar/foreign/segmentAccessModes/
-Chris.
Maurizio,
On 30 Apr 2020, at 15:35, Maurizio Cimadamore <maurizio.cimadamore@oracle.com> wrote:
Looks good - thanks. Perhaps I'd tweak "will hold" and replace it with "will feature", and also add a link between "access modes" to the access mode section of the javadoc, to make it more navigable.
Done. https://cr.openjdk.java.net/~chegar/foreign/segmentAccessModes/ I also added similar clarifying wording + test, for MemorySegment::spliterator Maybe fold this into your webrev, or into the foreign branch. Whatever is easier for you. -Chris.
Thanks Chris, I'll make sure this is included in the next iteration! Cheers Maurizio On 30/04/2020 16:07, Chris Hegarty wrote:
Maurizio,
On 30 Apr 2020, at 15:35, Maurizio Cimadamore <maurizio.cimadamore@oracle.com> wrote:
Looks good - thanks. Perhaps I'd tweak "will hold" and replace it with "will feature", and also add a link between "access modes" to the access mode section of the javadoc, to make it more navigable. Done. https://cr.openjdk.java.net/~chegar/foreign/segmentAccessModes/
I also added similar clarifying wording + test, for MemorySegment::spliterator
Maybe fold this into your webrev, or into the foreign branch. Whatever is easier for you.
-Chris.
participants (8)
-
Chris Hegarty
-
forax@univ-mlv.fr
-
Jorn Vernee
-
Mandy Chung
-
Maurizio Cimadamore
-
Paul Sandoz
-
Peter Levart
-
Remi Forax