RFR: 8207851 JEP Draft: Support ByteBuffer mapped over non-volatile memory
I have finally managed to draft a JEP to formally present the proposal I circulated a month or two back regarding support for MappedByteBuffer access to non-volatile memory.

JEP JIRA: https://bugs.openjdk.java.net/browse/JDK-8207851

The JEP references a re-implementation of the proposed API:

http://cr.openjdk.java.net/~adinn/pmem/webrev.01/

The rework addresses some of the feedback provided in the initial discussion:

Previous Thread: (starts in May archive)
http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-May/053317.html
(continues into June archive at)
http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-June/053673.html

In particular, the re-implementation has split the original Hotspot intrinsic used to guarantee writeback of an address range into 3 intrinsics, relocated into class Unsafe, which:

- write back a single cache line
- perform any necessary memory sync pre-writeback
- perform any necessary memory sync post-writeback

Also, the implementation has been extended from Linux on x86_64 to also cater for Linux on AArch64. The latter version compiles and generates correct code but has not yet been tested because I have not yet obtained access to a box with a suitably up to date kernel.

Comments and suggestions for improvement would be very welcome.

n.b. Jonathan Halliday is about to get his hands on an x86 box with a real NVM DIMM so we may soon be able to try this out with NVM proper rather than testing with volatile memory pretending to be NVM.

regards,

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
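For readers unfamiliar with the mechanics, the three-intrinsic split above can be illustrated with a small, self-contained sketch (this is not the webrev code; the 64-byte line size and the method name are assumptions for illustration only): a force over an arbitrary byte range must be broken into one writeback per cache line covered by the range, bracketed by the pre- and post-sync operations.

```java
import java.util.ArrayList;
import java.util.List;

public class WritebackSketch {
    static final int CACHE_LINE = 64; // assumed cache line size

    // Returns the line-aligned start addresses that a writeback of
    // [addr, addr + len) must cover. A force(long, long) implementation
    // would issue the pre-sync, then one per-line writeback intrinsic call
    // for each of these addresses, then the post-sync.
    static List<Long> linesToWriteBack(long addr, long len) {
        List<Long> lines = new ArrayList<>();
        long line = addr & ~(long) (CACHE_LINE - 1); // align down to line start
        long end = addr + len;
        for (; line < end; line += CACHE_LINE) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) {
        // a 100-byte range starting 8 bytes into a line spans two 64-byte lines
        System.out.println(linesToWriteBack(8, 100)); // prints [0, 64]
    }
}
```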
On 20/07/2018 14:50, Andrew Dinn wrote:
I have finally managed to draft a JEP to formally present the proposal I circulated a month or two back regarding support for MappedByteBuffer access to non-volatile memory.
JEP JIRA: https://bugs.openjdk.java.net/browse/JDK-8207851
The JEP references a re-implementation of the proposed API:
The JEP proposes adding a map_persistent method to FileChannel. An alternative may be to just add new MapModes that you can specify to the existing map method. It would reduce the API surface and I think fit better with the existing API. -Alan
On 20/07/18 15:05, Alan Bateman wrote:
On 20/07/2018 14:50, Andrew Dinn wrote:
I have finally managed to draft a JEP to formally present the proposal I circulated a month or two back regarding support for MappedByteBuffer access to non-volatile memory.
JEP JIRA: https://bugs.openjdk.java.net/browse/JDK-8207851
The JEP references a re-implementation of the proposed API:
The JEP proposes adding a map_persistent method to FileChannel. An alternative may be to just add new MapModes that you can specify to the existing map method. It would reduce the API surface and I think fit better with the existing API.

Doh! That's a good idea (I wish I had thought of that :-)
I'll admit that it looks a tad counter-intuitive to specify the storage characteristics of the data as part of the access mode e.g.

    class MapMode {
        public static final MapMode READ_ONLY = ...
        public static final MapMode READ_WRITE = ...
        public static final MapMode PRIVATE = ...
        public static final MapMode READ_ONLY_PERSISTENT = ...
        public static final MapMode READ_WRITE_PERSISTENT = ...
        . . .
    }

However, it would make the API a damn sight neater.

Also, there is still the question as to whether existing FileChannel implementations would correctly reject these new modes if passed them i.e.

    if (mode == MapMode.READ_ONLY) {
        ...
    } else if (mode == MapMode.READ_WRITE) {
        ...
    } else {
        ...
    }

Thoughts?

regards,

Andrew Dinn
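The rejection concern above can be sketched as follows (all names here are hypothetical, not FileChannelImpl code): an implementation written against the old set of modes needs an explicit unsupported-mode check rather than a trailing else, otherwise a new persistent mode would silently fall into whichever case the else covers.

```java
import java.util.Set;

// Hypothetical sketch: how a FileChannel implementation that predates the
// persistent modes could reject them explicitly instead of letting them
// fall through an else branch.
public class MapModeCheck {
    enum Mode { READ_ONLY, READ_WRITE, PRIVATE, READ_ONLY_PERSISTENT, READ_WRITE_PERSISTENT }

    // The modes this (pre-existing) implementation actually knows about.
    static final Set<Mode> SUPPORTED = Set.of(Mode.READ_ONLY, Mode.READ_WRITE, Mode.PRIVATE);

    static void checkMode(Mode mode) {
        if (!SUPPORTED.contains(mode))
            throw new UnsupportedOperationException("Unsupported map mode: " + mode);
    }

    public static void main(String[] args) {
        checkMode(Mode.READ_WRITE);                 // accepted
        try {
            checkMode(Mode.READ_WRITE_PERSISTENT);  // rejected by the legacy implementation
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```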
* Andrew Dinn:
Comments and suggestions for improvement would be very welcome.
There are a bunch of typos in the JEP (“susbet”, “Rntime”). Formatting of the headlines looks wrong to me, too.

On the implementation side, I find this a bit concerning:

+ // TODO - remove the following defines
+ // for testing we need to define the extra MAP_XXX flags used for
+ // persistent memory. they will eventually get defined by the
+ // system and included via sys/mman.h

Do you really want to change implementation behavior based on which glibc headers (or kernel headers) were used to build the JDK? It's likely more appropriate to define the constants yourself if they are not available. There is some architecture variance for the MAP_* constants, but in case of MAP_SHARED_VALIDATE, even HPPA has the same definition.

How does the MappedByteBuffer::force(long, long) method interact with the memory model?
Hi Florian,

Thank you for the feedback. Comments inline below (extensive comments in the case of the memory model question - it is complex).

regards,

Andrew Dinn

On 20/07/18 23:15, Florian Weimer wrote:
There are a bunch of typos in the JEP (“susbet”, “Rntime”). Formatting of the headlines looks wrong to me, too.
Thanks. Yes, the spelling and, more importantly, format need a lot of work. The JIRA format does not exactly look compelling and the format in the JEP listing is even worse. I am currently working to remedy this.
On the implementation side, I find this a bit concerning:
+ // TODO - remove the following defines
+ // for testing we need to define the extra MAP_XXX flags used for
+ // persistent memory. they will eventually get defined by the
+ // system and included via sys/mman.h
Do you really want to change implementation behavior based on which glibc headers (or kernel headers) were used to build the JDK? It's likely more appropriate to define the constants yourself if they are not available. There is some architecture variance for the MAP_* constants, but in case of MAP_SHARED_VALIDATE, even HPPA has the same definition.
No, I guess I don't really want to change implementation behaviour based on which glibc headers (or kernel headers) were used to build the JDK. So, I think I need to modify this so the constants are always defined by FileChannelImpl.c.

I have followed the Intel libpmem code in defining the values for these constants. Can you provide details of the 'architecture variance' you refer to above? I guess it doesn't matter too much while the code is being compiled only for Linux on x86_64/AArch64.
How does the MappedByteBuffer::force(long, long) method interact with the memory model?

That's a very good question, obviously with different answers for AArch64 and x86_64.
I'll note before citing the relevant documentation that in the current design I have explicitly allowed for independent pre and post sync operations to implement whatever memory sync is required wrt non-writeback memory operations preceding or following the writebacks. This is to make it easy to revise the current barrier generation choices if needed (and this flexibility comes at no cost once the relevant sync methods are intrinsified by the JIT).

I have currently implemented both pre and post sync as an mfence on x86_64, whatever the operation used to flush memory. That is playing safe and the choice merits reviewing. It is a stronger memory sync than that employed by the libpmem code for x86_64, which also relies on mfence (rather than sfence) but i) omits the mfence for the post-sync and ii) when using CLFLUSH to implement the writeback also omits the pre-sync mfence. It is not clear to me from the docs that a post-sync can safely be omitted for CLFLUSHOPT or CLWB but I do read them as confirming that pre- and post-sync can be safely omitted when using CLFLUSH. Also, the docs suggest only an sfence is needed rather than an mfence -- which sounds right. However, I have followed libpmem and employed mfence for now (more below).

On AArch64 I have currently implemented both pre and post sync as a StoreStore barrier -- a dmb(ISHST). This is weaker than the libpmem pre-sync for AArch64, which emits a dmb(ISH), but also stronger as regards the post-sync, in that libpmem does not emit a post-sync fence. On my reading of the ARM Architecture Reference Manual (ARM ARM) I believe dmb(ISHST) pre and post is what is needed and all that is needed (again, see below for more).

It's worth stating up front that I added the post-sync for both architectures to ensure that the thread doing the writeback cannot update any other NVM before the writeback has completed. This may perhaps be overkill. However, I envisaged the following possibility for error to arise.
Imagine that a writeback of 'data' state, e.g. a log record, might be followed by a write to an NVM memory location that marks the log record as 'live' (it might, say, be linked into the tail of a list of valid log records). The danger here is that the memory system might, independent of any writeback scheduled by the application, flush the cache line for the 'live' write before the prior 'data' writeback was complete. A crash at this point might leave the log record live but, potentially, incomplete/inconsistent.

Ok, now here is what the relevant documentation says, with my gloss on what it means for the current draft implementation.

x86_64:

The latest Intel documentation available at

https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-...

describes all 3 available instructions for implementing the cache line writeback: CLFLUSH, CLFLUSHOPT and CLWB.

Vol. 2A 3-139 CLFLUSH

"The memory attribute of the page containing the affected line has no effect on the behavior of this instruction. It should be noted that processors are free to speculatively fetch and cache data from system memory regions assigned a memory-type allowing for speculative reads (such as, the WB, WC, and WT memory types). PREFETCHh instructions can be used to provide the processor with hints for this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, the CLFLUSH instruction is not ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be speculatively loaded into a cache line just before, during, or after the execution of a CLFLUSH instruction that references the cache line).

"Executions of the CLFLUSH instruction are ordered with respect to each other and with respect to writes, locked read-modify-write instructions, fence instructions, and executions of CLFLUSHOPT to the same cache line.
They are not ordered with respect to executions of CLFLUSHOPT to different cache lines."

n.b. the prefetch clauses above (and also repeated in the next two citations) are not important since all a prefetch might do is bring the same flushed data back into cache without affecting the validity of completion of the writeback. Ordering of loads can also be ignored for similar reasons.

So, I believe the above text indicates that there is no need for a pre-sync or post-sync memory barrier when using CLFLUSH. All writes executed before the CLFLUSH will be completed before the flush is performed, all CLFLUSH operations complete in order of execution before each subsequent CLFLUSH and the last CLFLUSH completes before any subsequently executed writes can complete and become visible.

Vol. 2A 3-141 CLFLUSHOPT

[note the references to CLFLUSH below /appear/ exactly as occurs in the manual -- I believe the two occurrences in the first paragraph are meant to read CLFLUSHOPT]

"The memory attribute of the page containing the affected line has no effect on the behavior of this instruction. It should be noted that processors are free to speculatively fetch and cache data from system memory regions assigned a memory-type allowing for speculative reads (such as, the WB, WC, and WT memory types). PREFETCHh instructions can be used to provide the processor with hints for this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, the CLFLUSH [sic] instruction is not ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be speculatively loaded into a cache line just before, during, or after the execution of a CLFLUSH [sic] instruction that references the cache line).
"Executions of the CLFLUSHOPT instruction are ordered with respect to fence instructions and to locked read-modify-write instructions; they are also ordered with respect to the following accesses to the cache line being invalidated: writes, executions of CLFLUSH, and executions of CLFLUSHOPT. They are not ordered with respect to writes, executions of CLFLUSH, or executions of CLFLUSHOPT that access other cache lines; to enforce ordering with such an operation, software can insert an SFENCE instruction between CLFLUSHOPT and that operation."

I believe this indicates that an SFENCE (or a stronger MFENCE) is needed to ensure all writes executed before the first CLFLUSHOPT are visible. Successive CLFLUSHOPT instructions can then complete out of order and may not have completed before any subsequently executed write becomes visible. I think this risks the situation I described above where a controlling update may be flushed before a flush of the data it controls. Hence I think we require a post-sync MFENCE (or SFENCE).

Vol. 2A 3-146 CLWB

"The memory attribute of the page containing the affected line has no effect on the behavior of this instruction. It should be noted that processors are free to speculatively fetch and cache data from system memory regions that are assigned a memory-type allowing for speculative reads (such as, the WB, WC, and WT memory types). PREFETCHh instructions can be used to provide the processor with hints for this speculative behavior. Because this speculative fetching can occur at any time and is not tied to instruction execution, the CLWB instruction is not ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms (that is, data can be speculatively loaded into a cache line just before, during, or after the execution of a CLWB instruction that references the cache line).

"CLWB instruction is ordered only by store-fencing operations.
For example, software can use an SFENCE, MFENCE, XCHG, or LOCK-prefixed instructions to ensure that previous stores are included in the write-back. CLWB instruction need not be ordered by another CLWB or CLFLUSHOPT instruction. CLWB is implicitly ordered with older stores executed by the logical processor to the same address."

I think this is the same story as for CLFLUSHOPT i.e. an SFENCE is needed before any CLWB instructions are executed to ensure that pending writes are visible but then successive CLWB instructions can complete out of order and may not have completed before any subsequently executed write becomes visible. So, I also think this may need an MFENCE (or SFENCE) after the CLWB instructions.

AArch64:

The latest ARM ARM I have is ARM ARM issue C, available at

https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-man...

I am following libpmem in using instruction dc(CVAC) to writeback the data cache lines (clean virtual address to point of coherency). I think it would be preferable if dc(CVAP) were used (clean virtual address to point of persistence). However:

1) support for dc(CVAP) does not seem to be a mandatory part of the architecture
2) user-space access to the CVAP instruction is not guaranteed and not currently configured by default on Linux
3) CVAC is expected normally to guarantee writeback to physical memory

Ordering of cache management instructions is specified in section D3.4.8 under the heading "Ordering and completion of data and instruction cache instructions" on page D3-2069 as follows:

"All data cache instructions, other than DC ZVA, that specify an address:

- Execute in program order relative to loads or stores that access an address in Normal memory with either Inner Write Through or Inner Write Back attributes within the same cache line of minimum size, as indicated by CTR_EL0.DMinLine.
- Can execute in any order relative to loads or stores that access any address with the Device memory attribute, or with Normal memory with Inner Non-cacheable attribute unless a DMB or DSB is executed between the instructions.

- Execute in program order relative to other data cache maintenance instructions, other than DC ZVA, that specify an address within the same cache line of minimum size, as indicated by CTR_EL0.DMinLine.

- Can execute in any order relative to loads or stores that access an address in a different cache line of minimum size, as indicated by CTR_EL0.DMinLine, unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to other data cache maintenance instructions, other than DC ZVA, that specify an address in a different cache line of minimum size, as indicated by CTR_EL0.DMinLine, unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to data cache maintenance instructions that do not specify an address unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to instruction cache maintenance instructions unless a DSB is executed between the instructions."

. . .

"All data cache maintenance instructions that do not specify an address:

- Can execute in any order relative to data cache maintenance instructions that do not specify an address unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to data cache maintenance instructions that specify an address, other than Data Cache Zero, unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to loads or stores unless a DMB or DSB is executed between the instructions.

- Can execute in any order relative to instruction cache maintenance instructions unless a DSB is executed between the instructions.
All instruction cache instructions can execute in any order relative to other instruction cache instructions, data cache instructions, loads, and stores unless a DSB is executed between the instructions.

A cache maintenance instruction can complete at any time after it is executed, but is only guaranteed to be complete, and its effects visible to other observers, following a DSB instruction executed by the PE that executed the cache maintenance instruction. See also the requirements for cache maintenance instructions in Completion and endpoint ordering on page B2-97.

In all cases, where the text in this section refers to a DMB or a DSB, this means a DMB or DSB whose required access type is both loads and stores."

The link to B2-97 provides some further clarification of what it means for the above operations to be 'complete' (I have omitted the clauses regarding completion of translation table walks and TLB invalidates).

"For all memory, the completion rules are defined as:

- A read R1 to a Location is complete for a shareability domain when all of the following are true:
  - Any write to the same Location by an Observer within the shareability domain will be Coherence-after R1.
  - Any translation table walks associated with R1 are complete for that shareability domain.

- A write W1 to a Location is complete for a shareability domain when all of the following are true:
  - Any write to the same Location by an Observer within the shareability domain will be Coherence-after W1.
  - Any read to the same Location by an Observer within the shareability domain will either Reads-from W1 or Reads-from a write that is Coherence-after W1.
  - Any translation table walks associated with the write are complete for that shareability domain.

. . .
- A cache maintenance instruction is complete for a shareability domain when the memory effects of the instruction are complete for that shareability domain, and any translation table walks that arise from the instruction are complete for that shareability domain.

. . .

The completion of any cache or TLB maintenance instruction includes its completion on all PEs that are affected by both the instruction and the DSB operation that is required to guarantee visibility of the maintenance instruction."

So, my reading of this is:

i) a DMB is needed to ensure that writes executed before the dc(CVAC) are completed before the writeback starts
ii) a DMB(ISHST) is all that is needed to achieve this pre-ordering
iii) a DMB is needed to ensure that pending writebacks are completed before any subsequently executed writes can become visible
iv) a DMB(ISHST) is all that is needed to achieve this post-ordering
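To make the data-plus-control-update hazard described earlier in this message concrete, here is a toy Java sketch (purely illustrative; the operation names are invented and nothing here touches real NVM) of the discipline the post-sync is meant to enforce: the 'live' marker store must not be issued until a fence guarantees the data writeback has completed.

```java
import java.util.ArrayList;
import java.util.List;

public class PersistOrder {
    // Records the order of simulated persistence operations so the required
    // discipline (data -> writeback -> fence -> live marker) can be checked.
    static final List<String> ops = new ArrayList<>();

    static void writeData() { ops.add("store-data"); }          // fill the log record
    static void writeback() { ops.add("writeback-data-line"); } // per-line writeback
    static void fence()     { ops.add("post-sync-fence"); }     // post-sync barrier
    static void markLive()  { ops.add("store-live-marker"); }   // publish the record

    public static void main(String[] args) {
        writeData();
        writeback();
        fence();     // without this, hardware could flush the live marker first
        markLive();
        System.out.println(ops);
    }
}
```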
On 23/07/18 12:01, Andrew Dinn wrote:
Hi Florian,
Thank you for the feedback. Comments inline below (extensive comments in the case of the memory model question - it is complex).

Having written up what I assumed was needed, of course I then went back to have another look at the Intel code and as a result I think I have
i) corrected my mistaken understanding of what it does
ii) come to the conclusion that libpmem is doing the right thing

No surprise ;-]

Firstly, I was mistaken in thinking that the libpmem code was executing a memory barrier before the writeback operations (I was confused by talk of pre-drain, hw-drain and drain at various points in the code). The definition of the critical function pmem_persist is as follows:

    void
    pmem_persist(const void *addr, size_t len)
    {
        pmem_flush(addr, len);
        pmem_drain();
    }

where

    pmem_flush is implemented via clflush/clflushopt/clwb or dc(CVAC)
    pmem_drain is implemented as an sfence or dmb(ISH)

So, libpmem is emitting a memory barrier /after/ writeback and this barrier serves to ensure that writeback is complete before any subsequent writes can proceed to completion and become visible. That deals with the data + control update issue I was concerned about.

On the other hand, libpmem does /not/ execute any memory barriers prior to executing the writeback instructions. I had assumed that a pre-sync was also necessary to ensure that prior writes completed before the writeback was initiated. However, looking at the details provided in the documentation this seems to be a spurious requirement of my own invention.

On x86_64, writeback operations proceeding via CLFLUSHOPT or CLWB on some specific cache line are ordered wrt writes on that same cache line. The relevant text is:

"Executions of the CLFLUSHOPT instruction are [...] ordered with respect to the following accesses to the cache line being invalidated: writes, executions of CLFLUSH, and executions of CLFLUSHOPT."

"CLWB is implicitly ordered with older stores executed by the logical processor to the same address"

So, there is no need to execute a memory barrier to avoid the possibility of writing back a cache line before it has been updated by an in-progress write to that same cache line.
Of course, neither is there a need to be concerned about in-progress writes to cache lines that are not going to be flushed. They won't have any effect on the correctness of the writeback. So, a pre-barrier is unnecessary.

The same considerations seem to apply for AArch64:

"All data cache instructions, other than DC ZVA, that specify an address:

- Execute in program order relative to loads or stores that access an address in Normal memory with either Inner Write Through or Inner Write Back attributes within the same cache line of minimum size, as indicated by CTR_EL0.DMinLine"

So, once again a DMB is required after but not before writeback is initiated.

I'm also now unsure that a DMB(ISHST) is actually adequate. The relevant text from the ARM ARM is at the end of the text I previously cited:

"In all cases, where the text in this section refers to a DMB or a DSB, this means a DMB or DSB whose required access type is both loads and stores."

I think that means that the pmem_drain barrier has to be a DMB(ISH) not a DMB(ISHST). I'm not really sure why but it does seem to be the implication.

So, to sum up:

Unsafe.writebackPreSync0 should be implemented as
- a no-op on x86_64
- a no-op on AArch64

Unsafe.writebackPostSync0 should be implemented as
- a no-op on x86_64 when using clflush
- an sfence on x86_64 when using clflushopt or clwb
- a dmb(ISH) on AArch64

I think it may still be worth retaining Unsafe.writebackPreSync0 (even though it currently translates to a no-op) in case new hardware requires some extra preparatory operation before writeback commences.

regards,

Andrew Dinn
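The summary above can be captured as a small lookup sketch (illustrative only, not the Unsafe implementation; the instruction spellings simply follow the discussion in this thread):

```java
// Sketch of the barrier selection for the writeback sync operations,
// modeled as a lookup from (architecture, writeback instruction) to the
// pre/post sync operation each would translate to.
public class BarrierChoice {
    static String preSync(String arch, String insn) {
        // Per the conclusion above, no pre-writeback barrier is needed on
        // either architecture: writebacks are already ordered against older
        // stores to the same cache line.
        return "no-op";
    }

    static String postSync(String arch, String insn) {
        if (arch.equals("x86_64"))
            return insn.equals("clflush") ? "no-op" : "sfence"; // clflushopt/clwb need sfence
        if (arch.equals("aarch64"))
            return "dmb ish"; // full DMB, not DMB ISHST
        throw new IllegalArgumentException("unknown arch: " + arch);
    }

    public static void main(String[] args) {
        System.out.println(postSync("x86_64", "clflush"));  // no-op
        System.out.println(postSync("x86_64", "clwb"));     // sfence
        System.out.println(postSync("aarch64", "dc cvac")); // dmb ish
    }
}
```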
On 07/24/2018 01:17 PM, Andrew Dinn wrote:
I'm also now unsure that a DMB(ISHST) is actually adequate. The relevant text from the ARM ARM is at the end of the text I previously cited.
"In all cases, where the text in this section refers to a DMB or a DSB , this means a DMB or DSB whose required access type is both loads and stores."
I think that means that the pmem_drain barrier has to be a DMB(ISH) not a DMB(ISHST). I'm not really sure why but it does seem to be the implication.
Probably, but that's OK. DMB(ISHST) is only a StoreStore, and I suspect you want LoadStore as well anyway. StoreStore on its own is only in practice correct if what you're storing before the barrier instruction is constant data, and that's really not true in this case.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
* Andrew Dinn:
+ // TODO - remove the following defines
+ // for testing we need to define the extra MAP_XXX flags used for
+ // persistent memory. they will eventually get defined by the
+ // system and included via sys/mman.h
Do you really want to change implementation behavior based on which glibc headers (or kernel headers) were used to build the JDK? It's likely more appropriate to define the constants yourself if they are not available. There is some architecture variance for the MAP_* constants, but in case of MAP_SHARED_VALIDATE, even HPPA has the same definition.
No, I guess I don't really want to change implementation behaviour based on which glibc headers (or kernel headers) were used to build the JDK. So, I think I need to modify this so the constants are always defined by FileChannelImpl.c
I have followed the Intel libpmem code in defining the values for these constants. Can you provide details of the 'architecture variance' you refer to above?
Just a quick note on that: Some of the constants in <sys/mman.h> vary between the Linux architectures for historic reasons. However, for MAP_SHARED_VALIDATE, the constant is consistent across all architectures supported by glibc and in the mainline kernel.

You could add something like this to be on the safe side:

    #ifdef MAP_SHARED_VALIDATE
    # if MAP_SHARED_VALIDATE != 3
    #  error Unexpected value for MAP_SHARED_VALIDATE
    # endif
    #else /* !MAP_SHARED_VALIDATE */
    # define MAP_SHARED_VALIDATE 3
    #endif

But I doubt that this is required for this constant.
On 26/07/18 20:05, Florian Weimer wrote:
* Andrew Dinn:
+ // TODO - remove the following defines
+ // for testing we need to define the extra MAP_XXX flags used for
+ // persistent memory. they will eventually get defined by the
+ // system and included via sys/mman.h
Do you really want to change implementation behavior based on which glibc headers (or kernel headers) were used to build the JDK? It's likely more appropriate to define the constants yourself if they are not available. There is some architecture variance for the MAP_* constants, but in case of MAP_SHARED_VALIDATE, even HPPA has the same definition.
No, I guess I don't really want to change implementation behaviour based on which glibc headers (or kernel headers) were used to build the JDK. So, I think I need to modify this so the constants are always defined by FileChannelImpl.c
I have followed the Intel libpmem code in defining the values for these constants. Can you provide details of the 'architecture variance' you refer to above?
Just a quick note on that: Some of the constants in <sys/mman.h> vary between the Linux architectures for historic reasons. However, for MAP_SHARED_VALIDATE, the constant is consistent across all architectures supported by glibc and in the mainline kernel.
You could add something like this to be on the safe side:
    #ifdef MAP_SHARED_VALIDATE
    # if MAP_SHARED_VALIDATE != 3
    #  error Unexpected value for MAP_SHARED_VALIDATE
    # endif
    #else /* !MAP_SHARED_VALIDATE */
    # define MAP_SHARED_VALIDATE 3
    #endif
But I doubt that this is required for this constant.

Thanks for the update, Florian. I agree that the above tweak is unnecessary at present -- given that the current code is targeted at Linux on x86_64 and AArch64, where a change is extremely unlikely. That decision may need revisiting if/when the implementation is extended to other ports.
regards,

Andrew Dinn
Round 2
-------

I have updated the JEP and uploaded a revised webrev in the light of existing feedback.

JEP JIRA: https://bugs.openjdk.java.net/browse/JDK-8207851
Formatted JEP: http://openjdk.java.net/jeps/8207851
New webrev: http://cr.openjdk.java.net/~adinn/pmem/webrev.02/

I would welcome any further comments (I guess I'd be happy to push this as is but I find it hard to believe it does not require more work. Someone must have something to add :-)

List of Changes
---------------

I have followed Alan's request:

- extending FileChannel.MapMode to add two new modes READ_ONLY_PERSISTENT and READ_WRITE_PERSISTENT
- retaining the single method FileChannel.map to handle the new modes (i.e. dumping the idea of a separate map_persistent API)
- FileChannelImpl.map now handles the relevant XXX_PERSISTENT cases (rather than delegating to an internal method)

I have followed Florian's advice:

- corrected spelling and format of the JEP, including a clearer separation of the description of API changes from implementation details
- modified the conditional compilation conditions and associated comments in FileChannelImpl.c to define MAP_SYNC and MAP_SHARED_VALIDATE when the OS does not do so

I have also brought the implementation of the pre-sync and post-sync operations in line with the libpmem code:

On x86_64:
- pre-sync is a no-op
- post-sync is a no-op if clflush is used for writeback
- post-sync is an sfence if clflushopt or clwb is used for writeback

On AArch64:
- pre-sync is a no-op
- post-sync is a dmb(ISH)

Finally, I have updated the JEP text to reflect all the above changes.

regards,

Andrew Dinn
Well, given the lack of any further input I am left wondering, JEP neophyte that I am, whether:

i) said happy lacuna implies that it is appropriate to submit this JEP (as prompted by both the process blurb provided in JEP 1 and the accordingly labelled button in the JEP JIRA interface)

or

ii) the which unfortunate hiatus in commentary indicates simply that I have failed to engage with the relevant worthies of this parish

Also, Iris (privately) mentioned something about "CSRs that add/modify public APIs in the "java.*" modules". Does that mean there is more paperwork to come? Now or later?

Advice on how to progress would be very welcome.

regards,

Andrew Dinn
Hi Andrew, The JEP is clear on the concept and has many more details than are necessary to submit it. Supplying so much detail at the earliest stage invites more discussion about the implementation than the concept. Usually, the details of API and implementation are added as the JEP progresses. By the time it is implemented and complete, the JEP description will be fully up to date and concrete. Regards, Roger On 7/31/18 4:01 AM, Andrew Dinn wrote:
Well, given the lack of any further input I am left wondering, JEP neophyte that I am, whether:
i) said happy lacuna implies that it is appropriate to submit this JEP (as prompted by both the process blurb provided in JEP 1 and the accordingly labelled button in the JEP JIRA interface)
or
ii) the unfortunate hiatus in commentary indicates simply that I have failed to engage with the relevant worthies of this parish
Also, Iris (privately) mentioned something about "CSRs that add/modify public APIs in the "java.*" modules". Does that mean there is more paperwork to come? Now or later?
Advice on how to progress would be very welcome.
regards,
Andrew Dinn
Hi Andrew, I think most people are concentrating on JVMLS/OJW currently. That could explain a lack of comments. But from the discussion on this thread I see that your proposal is well received. On 7/31/18 4:01 AM, Andrew Dinn wrote:
Well, given the lack of any further input I am left wondering, JEP neophyte that I am, whether:
i) said happy lacuna implies that it is appropriate to submit this JEP (as prompted by both the process blurb provided in JEP 1 and the accordingly labelled button in the JEP JIRA interface)
You should look at http://cr.openjdk.java.net/~mr/jep/jep-2.0-02.html:

"The first three states for a Feature or Infrastructure JEP are:
Draft — Initial state in which the JEP is drafted, reviewed, revised, and endorsed
Submitted — By the JEP's owner, declaring the JEP ready to be evaluated for the JDK Roadmap
Candidate — By the OpenJDK Lead, to add the JEP to the JDK Roadmap"

A JEP needs to be reviewed and endorsed by a "Group or Area Lead" before moving to the "Submitted" state. Alan and Brian should do this from the Libs side and Mikael and I from the Hotspot side: http://openjdk.java.net/projects/jdk/leads And all of us are at JVMLS/OJW this week. Please, wait.
or
ii) the unfortunate hiatus in commentary indicates simply that I have failed to engage with the relevant worthies of this parish
Also, Iris (privately) mentioned something about "CSRs that add/modify public APIs in the "java.*" modules". Does that mean there is more paperwork to come? Now or later?
Yes, you need to file a CSR in JBS, but only when the JEP is in its later stages and the API changes are finalized: https://wiki.openjdk.java.net/display/csr/CSR+FAQs There should be a "Create CSR" option in the "More" menu. It has a format similar to a JEP's: https://wiki.openjdk.java.net/display/csr/Fields+of+a+CSR+Request Regards, Vladimir
Advice on how to progress would be very welcome.
regards,
Andrew Dinn
Hi Vladimir, Thank you for the very helpful explanation. I will happily wait for further feedback (Alan contacted me offline with similar advice). regards, Andrew Dinn
Round 3:

This week Jonathan Halliday was able to access a machine which has both an NVM DIMM and a CPU that implements the clflush and clwb instructions. He is currently preparing some benchmark figures for running transactions using a log stored in NVM. However, in order to get to that point we had to exercise and then fix a few things that had gone unexercised on DRAM and on CPUs without clflush and clwb. So, here is a new webrev which fixes the errors found in the previous drop:

New webrev: http://cr.openjdk.java.net/~adinn/pmem/webrev.03

The changes are all fairly trivial:

MappedByteBuffer.java
- fixed a copy-paste error which meant force(from, to) was passing an incorrect address range length

vm_version_x86.hpp
- corrected an error in the bitwise tests that detect the presence of hw flush insns

assembler_x86.cpp
- ensured a register prefix is generated when the address is encoded in register r8 and upwards

regards,

Andrew Dinn
On 25/07/2018 12:44, Andrew Dinn wrote:
Round 2 ------- I have updated the JEP and uploaded a revised webrev in the light of existing feedback
JEP JIRA: https://bugs.openjdk.java.net/browse/JDK-8207851
Formatted JEP: http://openjdk.java.net/jeps/8207851
New webrev: http://cr.openjdk.java.net/~adinn/pmem/webrev.02/
I would welcome any further comments (I guess I'd be happy to push this as is but I find it hard to believe it does not require more work. Someone must have something to add :-)
I read through the draft JEP and also skimmed the latest webrev.

The current iteration, to introduce new MapMode values, is not too bad but it makes me wonder if we could avoid new modes altogether by detecting that the file is backed by NVM. Is there an fcntl cmd or some other syscall that can be used to detect this? I'm asking because it would open the potential to discuss limiting the API changes to just MappedByteBuffer.

In the draft JEP the Summary, Goals, Non-Goals, Success Metrics, and Motivation sections read well. The Description section is very wordy and includes a lot of implementation detail that I assume will be removed before it is submitted (my guess is that it can be distilled down to a couple of paragraphs). As a comparison, the API surface in JEP 337 is much larger but we were able to reduce the text down to a few paragraphs. The Testing section highlights the challenges and reminds me that we have several features that will need maintainers to continuously test to ensure that a feature doesn't bit rot (SCTP and the proposed APIs for RDMA sockets are in the same boat).

Are you familiar with BufferPoolMXBean, which can be used by management tools to monitor pools of buffers? I'm wondering if we should introduce a new pool for NVM so that it doesn't show up in the "mapped" pool.

I can't tell from this thread if you are looking for detailed feedback on the current patch. A couple of random things:

- there are residual references to map_persistent in several places
- MappedByteBuffer.force will need to specify IAE
- the new methods are missing @since
- I think it would be clearer if MappedByteBuffer.force used "region" rather than "sub-region" as it is used in the context of the buffer, not the original file
- there will be a round of cleanup needed to the changes to java.base to get the javadoc and code consistent with the javadoc/code in this area. I assume we'll get back to that when the patch is closer to review.

-Alan.
Hi Alan, Thanks for looking at this and apologies for the delayed response (c/o our annual Red Hat OpenJDK team meeting taking place all last week). On 06/08/18 10:29, Alan Bateman wrote:
The current iteration, to introduce new MapMode values, is not too bad but it makes me wonder if we could avoid new modes altogether by detecting that the file is backed by NVM. Is there a fcntl cmd or some other syscall that can be used to detect this? I'm asking because it would open the potential to discuss limiting the API changes to just MappedByteBuffer.
That might well be a cleaner, simpler model. However, there are two issues with it:

Firstly, there does not appear to be a supported way to identify whether or not a device -- or a file associated with the device -- is mappable using MAP_SYNC. A hack would be to attempt a map call with MAP_SYNC and see if it failed with a suitably revealing errno code, but that seems rather gross and, perhaps, unreliable (the same errno code might occur for other reasons).

Secondly, it may not always be the case that users will want to map an NVM device with MAP_SYNC and rely on explicit line-by-line writeback. If a user only needs block-level syncing then she might prefer to map the file as a normal device and rely on force to perform block-level writeback via the driver.
In the draft JEP the Summary, Goals, Non-Goals, Success Metrics, and Motivation sections read well. The Description section is very wordy and includes a lot of implementation detail that I assume will be removed before it is submitted (my guess is that it can be distilled down to a couple of paragraphs). As a comparison, the API surface in JEP 337 is much larger but we were able to reduce the text down to a few paragraphs. The Testing section highlights the challenges and reminds me that we have several features that will need maintainers to continuously test to ensure that a feature doesn't bit rot (SCTP and the proposed APIs for RDMA sockets are in the same boat).
I will be happy to perform this trimming down. The detail was merely to ensure the draft implementation was understandable.
Are you familiar with BufferPoolMXBean which can be used by management tools to monitor pool of buffers? I'm wondering if we should introduce a new pool for NVM so that it doesn't show up in the "mapped" pool.
I am not familiar with BufferPoolMXBean but I will investigate. I'm agnostic as to whether to track NVM buffers separately from other buffers but I am happy to consider adding this as a requirement. However, before doing so, can you explain further why there is a need to keep this case separate? In particular, I am puzzled by your use of "mapped". The NVM memory is definitely mapped, although obviously it is of a different type to, say, private mapped volatile memory or a mapped disk file. Also, are these latter cases distinguished as far as BufferPoolMXBean counting is concerned? NVM seems to me to fall somewhere between the two.
I can't tell from this thread if you are looking for detailed feedback on the current patch. A couple of random things:
I guess it's not the primary concern (which is to get the JEP submitted) but comments are welcome now or later.
- there are residual references to map_persistent in several places
- MappedByteBuffer.force will need to specify IAE
- the new methods are missing @since
- I think it would be clearer if MappedByteBuffer.force used "region" rather than "sub-region" as it is used in the context of the buffer, not the original file
- there will be a round of cleanup needed to the changes to java.base to get the javadoc and code consistent with the javadoc/code in this area. I assume we'll get back to that when the patch is closer to review.

Ok, I'll post an update when I have trimmed the implementation details from the JEP and tweaked the code according to the above comments. Thank you for your feedback.
regards, Andrew Dinn
Hi Alan, Just a quick follow-up regarding the BufferPoolMXBean. On 14/08/18 12:34, Andrew Dinn wrote:
On 06/08/18 10:29, Alan Bateman wrote:
Are you familiar with BufferPoolMXBean which can be used by management tools to monitor pool of buffers? I'm wondering if we should introduce a new pool for NVM so that it doesn't show up in the "mapped" pool.
Ok, so I have worked through the code and can see that MappedByteBuffers for file-based mappings (as opposed to direct memory mappings) are currently counted by class Unmapper, local to FileChannelImpl, and the counts are made visible via a BufferPoolMXBean with name "mapped" which is present in the list exposed by method getBufferPoolMXBeans() of class ManagementFactoryHelper.

So, it seems what you are asking for is to keep separate tallies for persistent MappedByteBuffers and expose them in a new BufferPoolMXBean, also inserted in the list exposed by getBufferPoolMXBeans. That sounds like quite a good idea. It will leave any current code that wants to see file mappings counting the 'same' thing, yet still makes it possible to count persistent mappings on their own and also to tally all mappings by iterating over all BufferPoolMXBeans in the list. I suggest giving the bean the name "mapped_persistent".

I would happily update the JEP to include this as an 'API change' of sorts but I'm not quite sure how to document it. Is this ok?

- the JEP requires exposing a new java.lang.management.BufferPoolMXBean with name "mapped_persistent" as a platform managed bean which can be accessed using the existing ManagementFactory.getPlatformMXBeans() API
- the "mapped_persistent" bean records the buffer count, total bytes and total allocation count stats for persistent MappedByteBuffers (i.e. those mapped using MapMode READ_ONLY_PERSISTENT or READ_WRITE_PERSISTENT)
- these stats are collected separately from those currently collected in the "mapped" BufferPoolMXBean, which records the equivalent counts for non-persistent MappedByteBuffers (i.e. those mapped using MapMode READ_ONLY, READ_WRITE or PRIVATE)

It seems straightforward to implement this behaviour. If you like I can add it to the next webrev?

regards,

Andrew Dinn
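For reference, management code discovers these pools through the existing platform MXBean API. On current JDKs the list contains "direct" and "mapped"; the proposed "mapped_persistent" pool would simply appear alongside them. A runnable sketch using only the existing API:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

public class BufferPools {
    // Returns the names of all buffer pools exposed by the platform,
    // e.g. ["direct", "mapped"] today, plus "mapped_persistent" under
    // the proposal.
    public static List<String> poolNames() {
        return ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)
                .stream()
                .map(BufferPoolMXBean::getName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A monitoring tool would read the per-pool count/bytes stats.
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: count=%d, used=%d bytes%n",
                    pool.getName(), pool.getCount(), pool.getMemoryUsed());
        }
    }
}
```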
Hi Alan,

Round 4:

I have redrafted the JEP and updated the implementation in the light of your last feedback:

JEP JIRA: https://bugs.openjdk.java.net/browse/JDK-8207851
Formatted JEP: http://openjdk.java.net/jeps/8207851
New webrev: http://cr.openjdk.java.net/~adinn/pmem/webrev.04/

n.b. the webrev is against the latest jdk dev head.

n.b. also, there is only one change additional to those which respond to your feedback (all documented below). I had to redefine the x86 status bits used to tag the presence of the clflushopt and clwb insns (in class VM_Version). The ones I originally bagged were claimed by some new vector (VAES?) API -- which borked things spectacularly until I worked out what was going on.
On 06/08/18 10:29, Alan Bateman wrote:
The current iteration, to introduce new MapMode values, is not too bad but it makes me wonder if we could avoid new modes altogether by detecting that the file is backed by NVM. Is there a fcntl cmd or some other syscall that can be used to detect this? I'm asking because it would open the potential to discuss limiting the API changes to just MappedByteBuffer.
I have not made this change for reasons outlined in my previous post.
In the draft JEP the Summary, Goals, Non-Goals, Success Metrics, and Motivation sections read well. The Description section is very wordy and includes a lot of implementation detail that I assume will be removed before it is submitted (my guess is that it can be distilled down to a couple of paragraphs). As a comparison, the API surface in JEP 337 is much larger but we were able to reduce the text down to a few paragraphs. The Testing section highlights the challenges and reminds me that we have several features that will need maintainers to continuously test to ensure that a feature doesn't bit rot (SCTP and the proposed APIs for RDMA sockets are in the same boat).
The JEP is now slim and trim, omitting all details of the implementation.
Are you familiar with BufferPoolMXBean which can be used by management tools to monitor pool of buffers? I'm wondering if we should introduce a new pool for NVM so that it doesn't show up in the "mapped" pool.
I have updated the JEP to require that stats for current persistent MappedByteBuffer instances (i.e. those mapped with MapMode.READ_ONLY_PERSISTENT or MapMode.READ_WRITE_PERSISTENT) are exposed via a new BufferPoolMXBean with name "mapped_persistent". This is additional to and separate from the bean & associated stats which currently detail other file-map derived MappedByteBuffer instances. The JEP requires this new bean to be exposed in the list retrieved by a call to ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)
- there are residual references to map_persistent in several places
I think I have removed all of them from the code
- MappedByteBuffer.force will need to specify IAE
IAE was intended for the case where an attempt was made to map with a PERSISTENT MapMode and the fd turned out not to be for NVM. It is not possible to detect this reliably when running on Linux kernels which do not implement MAP_SYNC + MAP_SHARED_VALIDATE. So, I changed the API to throw an IOException in all cases, with an embedded error message indicating the problem as accurately as possible. That means there is no change to the method's exception signature, rather to the possible causes for the IOException.
- The new methods are missing @since
I think I have added these tags in all places where they are needed.
- I think it would be clearer if MappedByteBuffer.force use "region" rather than "sub-region" as it is used in the context of the buffer, not the original file.
I think I fixed all occurrences.
- there will be a round of cleanup needed to the changes to java.base to get the javadoc and code consistent with the javadoc/code in this area. I assume we'll get back to that when the patch is closer to review.

Sure, happy to do that when we get there.
regards, Andrew Dinn
On 20/08/2018 16:18, Andrew Dinn wrote:
Hi Alan,
Round 4:
I have redrafted the JEP and updated the implementation in the light of your last feedback:
JEP JIRA: https://bugs.openjdk.java.net/browse/JDK-8207851
Formatted JEP: http://openjdk.java.net/jeps/8207851
New webrev: http://cr.openjdk.java.net/~adinn/pmem/webrev.04/
The updated JEP looks much better.

I realize we've been through several iterations on this but I'm now wondering if MappedByteBuffer is the right API. As you've shown, it's straightforward to map a region of NVM and use the existing API; I'm just not sure if it's the right API. I think I'd like to see a few examples of how the API might be used. ByteBuffers aren't intended for use by concurrent threads and I just wonder if the examples might need that. I also wonder if there is a possible connection with work in Project Panama and whether it's worth exploring if its scopes and pointers could be backed by NVM. The Risks and Assumptions section mentions the 2GB limit, which is another reminder that the MBB API may not be the right API.

The 2-arg force method to msync a region makes sense although it might be more consistent for the second parameter to be the length rather than the end offset.

A detail for later is whether UOE might be more appropriate for implementations that do not support the XXX_PERSISTENT modes.

-Alan.
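For context on the force discussion: the long-standing no-arg force() makes the whole mapping durable, while the proposed 2-arg form would narrow that to a region. A runnable sketch against the current API; only the commented-out call reflects the proposal, and the (index, length) shape there follows Alan's suggestion rather than the webrev's (from, to):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ForceDemo {
    // Maps a small file, writes data into it and forces it to storage,
    // then reads the file back through an ordinary read for comparison.
    public static byte[] writeAndForce(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_WRITE, 0, data.length);
            mbb.put(data);
            mbb.force(); // proposed region form: mbb.force(0, data.length)
        }
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("force-demo", ".dat");
        byte[] out = writeAndForce(tmp, "durable".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```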
Ping! Could I please get a review of this latest version of the JEP? This includes responses to all previous comments, with changes made both to the JEP and the draft implementation. I would like to get this into JDK 12 if at all possible.

regards,

Andrew Dinn

On 10/09/18 19:05, Alan Bateman wrote:
On 20/08/2018 16:18, Andrew Dinn wrote:
Hi Alan,
Round 4:
I have redrafted the JEP and updated the implementation in the light of your last feedback:
JEP JIRA: https://bugs.openjdk.java.net/browse/JDK-8207851
Formatted JEP: http://openjdk.java.net/jeps/8207851
New webrev: http://cr.openjdk.java.net/~adinn/pmem/webrev.04/
The updated JEP looks much better.
I realize we've been through several iterations on this but I'm now wondering if MappedByteBuffer is the right API. As you've shown, it's straightforward to map a region of NVM and use the existing API; I'm just not sure if it's the right API. I think I'd like to see a few examples of how the API might be used. ByteBuffers aren't intended for use by concurrent threads and I just wonder if the examples might need that. I also wonder if there is a possible connection with work in Project Panama and whether it's worth exploring if its scopes and pointers could be backed by NVM. The Risks and Assumptions section mentions the 2GB limit, which is another reminder that the MBB API may not be the right API.
The 2-arg force method to msync a region makes sense although it might be more consistent for the second parameter to be the length rather than the end offset.
A detail for later is whether UOE might be more appropriate for implementations that do not support the XXX_PERSISTENT modes.
-Alan.
Hi Alan

I'm a middleware engineer (transaction engine, message queues, etc) and I evolved the current API design whilst making some of Red Hat's Jakarta EE stack work with persistent memory. It's a good fit for our needs because it pretty much matches the way we currently do off-heap and persistent storage, so porting existing code is a breeze.

For anything that is a 'make this bunch of bytes persistent' use case there isn't really a complex API. We're not trying to pass data structures to and fro as we would when calling a richer C library. The serialization layer takes care of flattening all the structures to an opaque byte[] or ByteBuffer already. We just need to be able to reason about the persistence guarantees the same way we can with the existing sync() call. We already take care of the threading, since existing storage solutions wouldn't work without those safeguards anyhow. So, there are certainly some use cases for which the current API is a good fit, because those are the ones I designed it for, based on code that already uses and copes with the limitations of MappedByteBuffer.

However... there are cases where we may want to get further optimizations by eliding the serialization to byte[]/ByteBuffer and instead being able to access persistent memory *as objects*. That's a harder problem and may involve language integration rather than just API changes, for example being able to allocate an object whose state (primitive fields, perhaps also object pointers) is backed by an (optionally explicitly specified) area of pmem. It's definitely a more powerful model, but also a much bigger problem to chew on. Some halfway solution in which we can use Java objects to point into specific areas of memory in a typesafe way (e.g. 'that pmem address should be considered an int') would seem to be something that Panama could overlap with, but it's a convenience layer that could also be modelled by putting higher-level abstractions over the proposed low-level API.
Over time we may have e.g. PersistentLong in the same way that today we have AtomicLong, but that's something that could be tested out in a 3rd-party library initially and then migrated into the standard library if it's shown to be useful.

Is the proposed API sufficient for all use cases? Probably not. But it's useful for some and, so far as I can tell, non-harmful to others. Under the new release model, what we have now is useful in its own right and should ship sooner rather than later, with additional functionality following in a modular, agile fashion. I don't really see sufficient advantage in holding this pending e.g. investigation of integration with Panama, though that's definitely an interesting avenue for future work.

Regards

Jonathan

On 10/09/2018 19:05, Alan Bateman wrote: ...
I realize we've been through several iterations on this but I'm now wondering if MappedByteBuffer is the right API. As you've shown, it's straightforward to map a region of NVM and use the existing API; I'm just not sure if it's the right API. I think I'd like to see a few examples of how the API might be used. ByteBuffers aren't intended for use by concurrent threads and I just wonder if the examples might need that. I also wonder if there is a possible connection with work in Project Panama and whether it's worth exploring if its scopes and pointers could be backed by NVM. The Risks and Assumptions section mentions the 2GB limit, which is another reminder that the MBB API may not be the right API.
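The 'make this bunch of bytes persistent' journaling pattern described above can be sketched minimally over today's API. LogSketch and its length-prefixed record format are illustrative, not from the webrev; with the proposed READ_WRITE_PERSISTENT mode the force() at the end of each append would translate into per-cache-line writeback rather than an msync of the whole file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Append-only log over a mapped buffer: each record is a 4-byte length
// prefix followed by the payload, made durable before append() returns.
public class LogSketch {
    private final MappedByteBuffer log;

    public LogSketch(Path file, int capacity) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping remains valid after the channel is closed.
            this.log = ch.map(FileChannel.MapMode.READ_WRITE, 0, capacity);
        }
    }

    // Appends one record and forces the mapping so it is durable.
    public void append(byte[] payload) {
        log.putInt(payload.length);
        log.put(payload);
        log.force();
    }

    // Reads back the record at the given offset, via a duplicate so the
    // append position is left undisturbed.
    public byte[] read(int offset) {
        ByteBuffer view = log.duplicate();
        view.position(offset);
        byte[] out = new byte[view.getInt()];
        view.get(out);
        return out;
    }
}
```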
Hi Alan,

Thanks for the response and apologies for failing to notice you had posted it some days ago (doh!).

Jonathan Halliday has already explained how Red Hat might want to use this API. Well, what he said, essentially! In particular, this model provides a way of ensuring that raw byte data can be persisted coherently from Java with the minimal possible overhead. It would be up to client code above this layer to implement structuring mechanisms for how those raw bytes get populated with data and to manage any associated issues regarding atomicity, consistency and isolation (i.e. to provide the A, C and I of ACID to this API's D).

The main point of the JEP is to ensure that such a performant base capability is available for known important cases where it is needed such as, for example, a transaction manager or a distributed cache. If equivalent middleware written in C can use persistent memory to bring the persistent storage tier nearer to the CPU and, hence, lower data durability overheads, then we really need an equivalently performant option in Java or risk Java dropping out as a player in those middleware markets.

I am glad to hear that other alternatives might be available and would be happy to consider them. However, I'm not sure that this means this option is not still desirable, especially if it is orthogonal to those other alternatives. Most importantly, this one has the advantage that we know it is ready to use and will provide benefits (we have already implemented a journaled transaction log over it with promising results, and someone from our messaging team has already been looking into using it to persist log messages). Indeed, we also know we can use it to provide a base for supporting all the use cases addressed by Intel's libpmem and available to C programmers, e.g. a block store, simply by implementing Java client libraries that provide managed access to the persistent buffer along the same lines as the Intel C libraries.
I'm afraid I am not familiar with Panama 'scopes' and 'pointers' so I can't really compare options here. Can you point me at any info that explains what those terms mean and how it might be possible to use them to access off-heap, persistent data?

regards,

Andrew Dinn
On 21/09/2018 16:44, Andrew Dinn wrote:
Hi Alan,
Thanks for the response and apologies for failing to notice you had posted it some days ago (doh!).
Jonathan Halliday has already explained how Red Hat might want to use this API. Well, what he said, essentially! In particular, this model provides a way of ensuring that raw byte data is able to be persisted coherently from Java with the minimal possible overhead. It would be up to client code above this layer to implement structuring mechanisms for how those raw bytes get populated with data and to manage any associated issues regarding atomicity, consistency and isolation (i.e. to provide the A, C and I of ACID to this API's D).
The main point of the JEP is to ensure that such a performant base capability is available for known important cases where it is needed such as, for example, a transaction manager or a distributed cache. If equivalent middleware written in C can use persistent memory to bring the persistent storage tier nearer to the CPU and, hence, lower data durability overheads, then we really need an equivalently performant option in Java or risk Java dropping out as a player in those middleware markets.
I am glad to hear that other alternatives might be available and would be happy to consider them. However, I'm not sure that this means this option is not still desirable, especially if it is orthogonal to those other alternatives. Most importantly, this one has the advantage that we know it is ready to use and will provide benefits (we have already implemented a journaled transaction log over it with promising results and someone from our messaging team has already been looking into using it to persist log messages). Indeed, we also know we can use it to provide a base for supporting all the use cases addressed by Intel's libpmem and available to C programmers, e.g. a block store, simply by implementing Java client libraries that provide managed access to the persistent buffer along the same lines as the Intel C libraries.
I'm afraid I am not familiar with Panama 'scopes' and 'pointers' so I can't really compare options here. Can you point me at any info that explains what those terms mean and how it might be possible to use them to access off-heap, persistent data.
I'm not questioning the need to support NVM; instead I'm trying to see whether MappedByteBuffer is the right way to expose this in the standard API. Buffers were designed in JSR-51 with specific use cases in mind but they are problematic for many off-heap cases as they aren't thread safe, are limited to 2GB, lack confinement, and only support homogeneous data (no layout support). At the same time, Project Panama (foreign branch in panama/dev) has the potential to provide the right API to work with memory. I see Jonathan's mail where he seems to be using object serialization, so the solution on the table works for his use case, but it may not be the right solution for more general multi-threaded access to NVM. There is some interest in seeing whether this part of Project Panama could be advanced to address many of the cases where developers are resorting to using Unsafe today. There would of course need to be some integration with buffers too. There's no concrete proposal/JEP at this time; I'm just pointing out that many of the complaints about buffers are really cases where it's the wrong API and the real need is something more fundamental.

So where does this leave us? If support for persistent memory is added to FileChannel.map as we've been discussing then it may not be too bad, as the API surface is small: just new map modes and a MappedByteBuffer::isPersistent method. The force method that specifies a range is something useful to add to MBB anyway. If (and I hope when) there is support for memory regions or pointers then I could imagine re-visiting this so that there are alternative ways to get a memory region or pointer that is backed by NVM. If the timing were different then I think we'd skip the new map modes and we would be having a different discussion here. An alternative of course is to create the mapped buffer via a JDK-specific API, as that would be easier to deprecate and remove in the future if needed.
I'm interested to see if there is other input on this topic before it gets locked into extending the standard API. -Alan.
On 09/24/2018 09:14 AM, Alan Bateman wrote:
I'm not questioning the need to support NVM, instead I'm trying to see whether MappedByteBuffer is the right way to expose this in the standard API. Buffers were designed in JSR-51 with specific use-cases in mind but they are problematic for many off-heap cases as they aren't thread safe, are limited to 2GB, lack confinement, only support homogeneous data (no layout support).
I'm baffled by this assertion. It's true that the 2GB limit is turning into a real pain, but all of the rest have nothing to do with ByteBuffers, which are just raw memory. Adding structure is something that can be done by third-party libraries or by some future OpenJDK API.
So where does this leave us? If support for persistent memory is added to FileChannel.map as we've been discussing then it may not be too bad as the API surface is small. The API surface is just new map modes and a MappedByteBuffer::isPersistent method. The force method that specifies a range is something useful to add to MBB anyway.
Yeah, that's right, it is. While something not yet planned might be an alternative, even a better one, the purpose of our faster release cadence is to "evolve the Java SE Platform and the JDK at a more rapid pace, so that new features [can] be delivered in timelier manner". This is timely; waiting for Panama to think of what might be possible, not so much. -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Hi Andrew, I've been starting to look at some of the buffer-related issues and I've been discussing this issue with Alan. On 9/25/18 2:01 AM, Andrew Haley wrote:
On 09/24/2018 09:14 AM, Alan Bateman wrote:
I'm not questioning the need to support NVM, instead I'm trying to see whether MappedByteBuffer is the right way to expose this in the standard API. Buffers were designed in JSR-51 with specific use-cases in mind but they are problematic for many off-heap cases as they aren't thread safe, are limited to 2GB, lack confinement, only support homogeneous data (no layout support).
I'm baffled by this assertion. It's true that the 2Gb limit is turning into a real pain, but all of the rest are nothing to do with ByteBuffers, which are just raw memory. Adding structure is something that can be done by third-party libraries or by some future OpenJDK API.
If you look around Java SE for a public API to represent raw memory, then MappedByteBuffer is the obvious choice; there isn't any realistic alternative right now. By asking whether MBB is "the right way to expose" raw memory, I think Alan is really saying: is MBB the best API to use to expose raw memory in the long run? I think the answer is clearly No. However, that's not an argument against proceeding with MBB. Rather, it's setting expectations that MBB has limitations that impose some pain in the short term, that possibly can be worked around, but which in the long term may prove to be insurmountable.

For an example of something that might be "insurmountable", I'll use the 2GB limitation. Doing something to raise the limit is certainly possible. The question is, after retrofitting this into the API, whether the result will be something that people want to program with, and whether it will perform well enough. It might not.

Another example would be a library layered on top that provides structured access. It's certainly possible to have such a library that will get the job done. However, the Buffer API necessarily requires certain operations to be implemented using multiple method calls, or it might require copying in some cases. Either of these might impose unacceptable overhead.

There are, however, certain things that can be done with buffers in the short term to make things work better. For example, JDK-5029431, absolute bulk put/get methods. I suspect this will be quite helpful for the NVM case, and indeed it's been something that's been asked for repeatedly for quite some time.

If you (collectively) are aware of this and other limitations, then sure, let's proceed with this JEP.
So where does this leave us? If support for persistent memory is added to FileChannel.map as we've been discussing then it may not be too bad as the API surface is small. The API surface is just new map modes and a MappedByteBuffer::isPersistent method. The force method that specifies a range is something useful to add to MBB anyway.
Yeah, that's right, it is. While something not yet planned might be an alternative, even a better one, the purpose of our faster release cadence is to "evolve the Java SE Platform and the JDK at a more rapid pace, so that new features [can] be delivered in timelier manner". This is timely; waiting for Panama to think of what might be possible, not so much.
Agree, "waiting for Panama" is certainly not something that anyone wants to do. s'marks
Hi Stuart, On 26/09/18 03:19, Stuart Marks wrote:
I've been starting to look at some of the buffer-related issues and I've been discussing this issue with Alan.
I'd be interested to hear more details if the discussion has gone far enough for any of it to be aired online.
On 9/25/18 2:01 AM, Andrew Haley wrote:
. . . I'm baffled by this assertion. It's true that the 2Gb limit is turning into a real pain, but all of the rest are nothing to do with ByteBuffers, which are just raw memory. Adding structure is something that can be done by third-party libraries or by some future OpenJDK API.
If you look around Java SE for a public API to represent raw memory, then MappedByteBuffer is the obvious choice; there isn't any realistic alternative right now. By asking whether MBB is "the right way to expose" raw memory, I think Alan is really saying, is MBB the best API to use to expose raw memory in the long run? I think the answer is clearly No. Sorry, it may well be my fault but it's not really clear to me. You mention two issues below, buffer size limit and API verbosity.
I acknowledge the former is a problem but i) there is a proposal (JDK-8180628, referred to in the JEP) to deal with this limitation by adding extra methods that support the creation of larger buffers and use of long indices and ii) there are existing Java libraries built over ByteBuffer that overcome this issue (as Sandhya pointed out in a note somewhere near this one). Sure, both of these remedies have limitations which /might/ lead to problems but I don't see (yet, at least) that they are manifestly unworkable.

As regards the latter issue, I am not really sure what you are suggesting would be a better alternative to using ByteBuffer get and put methods. Are you perhaps thinking of some way of overlaying a record (or object?) structure over the mapped memory that might allow a compiler to provide an equivalent to these ByteBuffer method calls as direct memory loads and stores? Of course, a Java library built on top of this proposal could provide a similar abstraction, although perhaps not with as firm guarantees for compiler optimization and certainly not with the possibility of direct language integration.

Copying might indeed be an issue but surely that depends on the type of data being written, the library design and the way the client needs to operate in order to use it (essentially on whether it can size in advance a data area in which to write the contents direct vs build a separate copy as distinct pieces and then serialize them).

Anyway, I hope the above explains why I'm not sure about your use of the words 'clearly' or (in a later comment) 'insurmountable'. Perhaps more details of your conversation with Alan would help.
There are, however, certain things that can be done with buffers in the short term to make things work better. For example, JDK-5029431 absolute bulk put/get methods. I suspect this will be quite helpful for the NVM case, and indeed it's been something that's been asked for repeatedly for quite some time.
Would it be enough to add a comment to the Risks and Assumptions of the JEP to point out this limitation and the potential need to address it, mentioning this specific JDK issue -- much as was done with JDK-8180628?
If you (collectively) are aware of this and other limitations, then sure, let's proceed with this JEP.
Well, I'm very happy to proceed if Alan is in agreement. One thing he suggested in an earlier post was splitting off the functionality to create a persistent ByteBuffer into a separate method so as to avoid any issues if we have to deprecate this model at a later date. I think that's a quite reasonable precaution and I'd be happy to propose an alternative API or let Alan suggest one. Perhaps Alan can comment?
Agree, "waiting for Panama" is certainly not something that anyone wants to do. Yes, indeed, there are already several important use cases waiting in the wings.
regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
On 09/26/2018 09:44 AM, Andrew Dinn wrote:
As regards the latter issue, I am not really sure what you are suggesting would be a better alternative to using ByteBuffer get and put methods? Are you perhaps thinking of some way of overlaying a record (or object?) structure over the mapped memory that might allow a compiler to provide an equivalent to these ByteBuffer method calls as direct memory loads and stores? Of course, a Java library built on top of this proposal could provide a similar abstraction, although perhaps not with as firm guarantees for compiler optimization and certainly not with the possibility of direct language integration.
Thinking about it some more, I guess that being able to say something like aFoo.bar = n; rather than aFoo.setBar(n); is preferable, although common Java practice (and indeed good OOP practice) is to provide getters and setters rather than directly expose fields. I suppose one advantage of being able to use an object structure is that the compiler can do better (type-based) alias analysis, can track dead stores, etc. But from a language design perspective, the fact that classes internally use direct field accesses but expose a completely different get/set notation is something of a linguistic wart. [ The BETA language used a single notation, the pattern, for assignment, method calls, and argument passing. Therefore, in BETA there would be no API difference between the two examples above. They'd both be something like n -> aFoo.bar Curiously, the first commercial licences for BETA were acquired by Bill Joy and James Gosling, so they knew about this, but I suppose a more C-like notation for Java was the right decision. The BETA notation would have been too frightening for the target audience. :-) ] -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
On 26/09/2018 09:44, Andrew Dinn wrote:
:
If you (collectively) are aware of this and other limitations, then sure, let's proceed with this JEP. Well, I'm very happy to proceed if Alan is in agreement. One thing he suggested in an earlier post was splitting off the functionality to create a persistent ByteBuffer into a separate method so as to avoid any issues if we have to deprecate this model at a later date. I think that's a quite reasonable precaution and I'd be happy to propose an alternative API or let Alan suggest one. Perhaps Alan can comment?
I'm reasonably happy with the approach that we converged on to introduce new map modes and use the existing FileChannel.map method. Ideally new map modes wouldn't need to be exposed but you've looked into that (to my satisfaction at least). One detail that I think may need another iteration or two is whether we need one or two modes. This will become clearer once the javadoc is fleshed out a bit further. It may be that one new map mode, "SYNC" for example, that works with the existing READ_WRITE mode may be clearer and easier to specify. I think that would be consistent with how copy-on-write mappings are exposed with the PRIVATE mode. It also provides a 1-1 mapping to the underlying MAP_SYNC flag too.

As regards the bigger topic of what the right API is for "memory", I don't think ByteBuffer is the right answer. You've touched on a few of the issues in your mail but there are bigger issues around thread safety and confinement, also the issue of the buffer position/limit that gets in the way and is the reason why several libraries use Unsafe. There isn't any concrete proposal or discussion to point at around splitting out this aspect of Project Panama. Stuart and I were just pointing out that a better solution could emerge which could lead to an alternative API to map a region of NVM as "memory" rather than a mapped byte buffer. If I were developing a file system backed by NVM then that is probably the raw API that I would want, not MBB.

As regards introducing an API that we could deprecate, that musing was about introducing a JDK-specific API. If MapMode were an interface then we could have introduced a JDK-specific map mode that wouldn't have required rev'ing the standard API. Introducing a completely separate map method in a JDK-specific module doesn't seem to be worth it as I think we can be confident that the proposed and possible new future approaches will not conflict. -Alan
Hi Alan, On 26/09/18 11:35, Alan Bateman wrote:
I'm reasonably happy with the approach that we converged on to introduce new map modes and use the existing FileChannel.map method. Ideally new map modes wouldn't need to be exposed but you've looked into that (to my satisfaction at least). One detail that I think may need another iteration or two is whether we need one or two modes. This will become clearer once the javadoc is fleshed out a bit further. It may be that one new map mode, "SYNC" for example, that works with the existing READ_WRITE mode may be clearer and easier to specify. I think that would be consistent with how copy-on-write mappings are exposed with the PRIVATE mode. It also provides a 1-1 mapping to the underlying MAP_SYNC flag too.
I'm not clear why we should only use one flag. The two flags I specified reflect two independent use cases, one where data stored in an NVM device is accessed read-only and another where it is accessed read-write. Are you suggesting that the read-only case is redundant? I'm not sure I agree. For example, a utility which might want to review the state of persistent data while a service is off-line would really want to pass flag READ_ONLY_PERSISTENT. Of course, it could employ READ_WRITE_PERSISTENT (or equivalently, SYNC) and just not write the data but, mutatis mutandis, that same argument would remove the case for flag READ_ONLY.
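[Editorial note: for concreteness, here is a minimal sketch of the existing API surface that the proposed modes would extend. It uses only the standard READ_WRITE/READ_ONLY modes and the long-standing no-argument force(); the persistent/SYNC mode names under discussion are from the draft JEP, not the released API, so they appear only in a comment.]

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapAndForce {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mbb", ".dat");
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Today: READ_WRITE gives an ordinary writable mapping.
            // Under the draft JEP, a persistent/SYNC mode (name not yet
            // settled in this thread) would be passed here instead to
            // request a MAP_SYNC mapping over NVM.
            MappedByteBuffer mbb =
                ch.map(FileChannel.MapMode.READ_WRITE, 0, 64);
            mbb.putInt(0, 0xCAFEBABE);   // absolute put; no cursor state used
            // force() writes the mapped region back to the storage device;
            // for NVM-backed mappings the JEP proposes replacing this
            // msync-style call with user-space cache-line writeback.
            mbb.force();
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer ro =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, 64);
            System.out.println(Integer.toHexString(ro.getInt(0)));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The read-only reopen at the end mirrors the "utility reviewing persistent state while the service is off-line" use case: it only needs a read mapping, which is the case the READ_ONLY_PERSISTENT flag is meant to serve.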
As regards the bigger topic of what the right API is for "memory", I don't think ByteBuffer is the right answer. You've touched on a few of the issues in your mail but there are bigger issues around thread safety and confinement, also the issue of the buffer position/limit that gets in the way and is the reason why several libraries use Unsafe. There isn't any concrete proposal or discussion to point at around splitting out this aspect of Project Panama. Stuart and I were just pointing out that a better solution could emerge which could lead to an alternative API to map a region of NVM as "memory" rather than a mapped byte buffer. If I were developing a file system backed by NVM then that is probably the raw API that I would want, not MBB.
I really don't understand how thread safety comes into the argument here. How is some other mechanism going to avoid the need for client threads -- or, rather, the applications which create them -- to manage concurrent updates of NVM? Are you perhaps thinking of some form of software transactional memory? I'm really struggling to understand why you keep raising this point without any further detail to explain how the lack of exclusion and synchronization primitives constitutes a problem for this API that can be bypassed by rolling some equivalent alternative into another API. Also, can you explain what you mean by confinement? (thread confinement?).

Also, I don't think I would label this API an attempt to develop a file system. I think that's rather an overblown characterisation of what it does. This is an attempt to provide an intermediate storage tier somewhere between a file system and volatile memory to create/access/update data across program runs, without incurring the costs associated with implementing that sort of capability on top of existing file system implementations. The use of a byte array layout at the base level is indeed, as the success of Unix/Linux/MS Windows file systems makes clear, a helpful way of enabling a variety of application-defined data layouts to be implemented on top of this storage tier. But I don't really see how that makes this a file system.
As regards introducing an API that we could deprecate then that musing was about introducing a JDK-specific API. If MapMode were an interface then we could have introduce a JDK-specific map mode that wouldn't have required rev'ing the standard API. Introducing a completely separate map method in a JDK-specific module doesn't seem to be worth it as I think we can be confident that the proposed and possible-new.future approaches will not conflict. Ok, so no need for a change there then I guess.
I'm still not quite sure where this reply leaves the JEP though. Shall I update the Risks and Assumptions section to include mention of JDK-5029431 as suggested to Stuart? Is there anything else I can do to progress things? regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
On 26/09/2018 14:27, Andrew Dinn wrote:
: I really don't understand how thread safety comes into the argument here. How is some other mechanism going to avoid the need for client threads -- or, rather, the applications which create them -- to manage concurrent updates of NVM? Are you perhaps thinking of some form of software transactional memory? I'm really struggling to understand why you keep raising this point without any further detail to explain how the lack of exclusion and synchronization primitives constitutes a problem for this API that can be bypassed by rolling some equivalent alternative into another API.

The reason that we've mentioned it a few times is because it's a significant issue. If you have a byte buffer then you can't have different threads accessing different parts of the buffer at the same time, at least not with any of the relative get/put methods as they depend on the buffer position. Sure you can globally synchronize all operations but you'll likely want much finer granularity. This bugbear comes up periodically, particularly when using buffers for cases that they weren't really designed for. Stuart pointed out the lack of absolute bulk get/put operations which is something that I think will help some of these cases.
Also, can you explain what you mean by confinement? (thread confinement?).
Yes, thread vs. global. I haven't been following Panama close enough to say how this is exposed in the API.
Also, I don't think I would label this API an attempt to develop a file system. I think that's rather an overblown characterisation of what it does.
I think you may have mis-read my mail as I was just picking another example where MBB would be problematic.
: I'm still not quite sure where this reply leaves the JEP though. Shall I update the Risks and Assumptions section to include mention of JDK-5029431 as suggested to Stuart? Is there anything else I can do to progress things?
It wouldn't do any harm to have this section mention that an alternative that exposes a more memory centric API may be possible in the future. -Alan
On 26/09/18 17:00, Alan Bateman wrote:
The reason that we've mentioned it a few times is because it's a significant issue. If you have a byte buffer then you can't have different threads accessing different parts of the buffer at the same time, at least not with any of the relative get/put methods as they depend on the buffer position. Sure you can globally synchronize all operations but you'll likely want much finer granularity. This bugbear comes up periodically, particularly when using buffers for cases that they weren't really designed for. Stuart pointed out the lack of absolute bulk get/put operations which is something that I think will help some of these cases.
Ok, I see that there is an issue here where only byte puts at absolute positions can be performed concurrently (assuming threads know how to avoid overlapping writes) while, by contrast, cursor-based byte[] stores require synchronization. Is that the problem in full? Or is there still more that I have missed? I certainly agree that a retro-fit to ByteBuffer which provided for byte[] puts at absolute positions would be of benefit for this proposal. However, such a retro-fit would be equally as useful for volatile memory buffers. I am not clear why this omission suggests to you that we should look at a new, alternative model for managing this particular type of mapped memory rather than just fixing the current one properly for all buffers.
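[Editorial note: a small illustration of the absolute-versus-relative point. Absolute single-byte puts never touch the shared position, so threads writing disjoint index ranges can proceed without locking; the thread roles and the index partitioning below are invented for the sketch.]

```java
import java.nio.ByteBuffer;

public class AbsolutePuts {
    public static void main(String[] args) throws InterruptedException {
        ByteBuffer bb = ByteBuffer.allocateDirect(16);
        // Absolute put(int, byte) checks only the (fixed) limit and never
        // reads or writes position/mark, so two threads writing disjoint
        // index ranges do not trample each other's cursor state.
        Thread lo = new Thread(() -> {
            for (int i = 0; i < 8; i++) bb.put(i, (byte) 'a');
        });
        Thread hi = new Thread(() -> {
            for (int i = 8; i < 16; i++) bb.put(i, (byte) 'b');
        });
        lo.start(); hi.start();
        lo.join(); hi.join();   // join() also publishes the writes safely
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 16; i++) sb.append((char) bb.get(i));
        System.out.println(sb);
        // By contrast, the relative bulk put(byte[]) advances the shared
        // position, so two threads using it concurrently would race.
    }
}
```

This is exactly the asymmetry described above: the single-byte absolute operations compose across threads today, while the bulk operations (the cursor-based byte[] stores) do not.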
Also, can you explain what you mean by confinement? (thread confinement?). Yes, thread vs. global. I haven't been following Panama close enough to say how this is exposed in the API.
Well, my vague stab was obviously in the right ballpark but I'm afraid I still don't know what baseball is. Could you explain what you mean by confinement?
Also, I don't think I would label this API an attempt to develop a file system. I think that's rather an overblown characterisation of what it does. I think you may have mis-read my mail as I was just picking another example where MBB would be problematic. Apologies for my very evident confusion here. I'd be very grateful if you could talk down a notch or two and/or amplify a bit more to help the hard of thinking.
Apologies for my very evident confusion here. I'd be very grateful if you could talk down a notch or two and/or amplify a bit more to help the hard of thinking.
I'm still not quite sure where this reply leaves the JEP though. Shall I update the Risks and Assumptions section to include mention of JDK-5029431 as suggested to Stuart? Is there anything else I can do to progress things?
It wouldn't do any harm to have this section mention that an alternative that exposes a more memory centric API may be possible in the future. Ok, I'll certainly add that.
regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
Hi Andrew, On 09/27/2018 11:23 AM, Andrew Dinn wrote:
On 26/09/18 17:00, Alan Bateman wrote:
The reason that we've mentioned it a few times is because it's a significant issue. If you have a byte buffer then you can't have different threads accessing different parts of the buffer at the same time, at least not with any of the relative get/put methods as they depend on the buffer position. Sure you can globally synchronize all operations but you'll likely want much finer granularity. This bugbear comes up periodically, particularly when using buffers for cases that they weren't really designed for. Stuart pointed out the lack of absolute bulk get/put operations which is something that I think will help some of these cases. Ok, I see that there is an issue here where only byte puts at absolute positions can be performed concurrently (assuming threads know how to avoid overlapping writes) while, by contrast, cursor-based byte[] stores require synchronization. Is that the problem in full? Or is there still more that I have missed?
I certainly agree that a retro-fit to ByteBuffer which provided for byte[] puts at absolute positions would be of benefit for this proposal. However, such a retro-fit would be equally as useful for volatile memory buffers. I am not clear why this omission suggests to you that we should look at a new, alternative model for managing this particular type of mapped memory rather than just fixing the current one properly for all buffers.
May I just note that multithreaded bulk operations are kind of possible without external synchronization (i.e. locks) if you follow a simple protocol:

- never use relative operations on the shared ByteBuffer instance
- never use operations that change internal mark/position/limit/byteOrder on the shared ByteBuffer instance
- a concurrent bulk operation on 'bb' consists of:

  ByteBuffer myBb = bb.slice(0, bb.capacity());
  // use myBb to perform concurrent bulk operation (any operations are allowed)

  and then throw it away or cache it in a ThreadLocal

If you combine this with explicit fences and/or atomic 16, 32 and 64 bit operations via VarHandles (see MethodHandles.byteBufferViewVarHandle(Class, ByteOrder)), concurrent programming with ByteBuffer(s) is entirely possible.

Regards, Peter
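[Editorial note: the protocol above can be sketched as runnable code. The bb.slice(0, bb.capacity()) overload in the mail is assumed to exist; the sketch below uses duplicate() instead, which equally yields a per-thread view with private mark/position/limit over the same backing memory, and combines it with a byteBufferViewVarHandle for absolute int access as suggested.]

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SliceProtocol {
    // VarHandle giving absolute (byte-offset) int access to a ByteBuffer;
    // plain get/set via this handle never consult position/limit.
    static final VarHandle INT = MethodHandles
        .byteBufferViewVarHandle(int[].class, ByteOrder.BIG_ENDIAN);

    public static void main(String[] args) throws InterruptedException {
        ByteBuffer shared = ByteBuffer.allocateDirect(8 * Integer.BYTES);
        Thread[] workers = new Thread[4];
        for (int t = 0; t < workers.length; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                // Per-thread view: same backing memory, private cursor
                // state, so no relative operation on 'shared' is needed.
                ByteBuffer my = shared.duplicate();
                // Each worker writes two disjoint, aligned int slots.
                INT.set(my, id * Integer.BYTES, id * 10);
                INT.set(my, (id + 4) * Integer.BYTES, id * 10 + 1);
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();   // publishes all writes
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            sb.append((int) INT.get(shared, i * Integer.BYTES)).append(' ');
        }
        System.out.println(sb.toString().trim());
    }
}
```

The shared instance's mark/position/limit are never touched, which is the whole point of the protocol: all coordination reduces to agreeing on disjoint byte offsets plus a safe publication step (here, Thread.join).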
Also, can you explain what you mean by confinement? (thread confinement?). Yes, thread vs. global. I haven't been following Panama close enough to say how this is exposed in the API. Well, my vague stab was obviously in the right ballpark but I'm afraid I still don't know what baseball is. Could you explain what you mean by confinement?
Also, I don't think I would label this API an attempt to develop a file system. I think that's rather and overblown characterisation of what it does. I think you may have mis-read my mail as was just picking another example where MBB would be problematic. Apologies for my very evident confusion here. I'd be very grateful if you could talk down a notch or two and/or amplify a bit more to help the hard of thinking.
I'm still not quite sure where this reply leaves the JEP though. Shall I update the Risks and Assumptions section to include mention of JDK-5029431 as suggested to Stuart? Is there anything else I can do to progress things?
It wouldn't do any harm to have this section mention that an alternative that exposes a more memory centric API may be possible in the future. Ok, I'll certainly add that.
regards,
Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
Hi Peter, On 27/09/18 11:28, Peter Levart wrote:
May I just note that multithreaded bulk operations are kind of possible without external synchronization (i.e. locks) if you follow a simple protocol:
- never use relative operations on the shared ByteBuffer instance - never use operations that change internal mark/position/limit/byteOrder on the shared ByteBuffer instance - a concurrent bulk operation on 'bb' consists of:
ByteBuffer myBb = bb.slice(0, bb.capacity()); // use myBb to perform concurrent bulk operation (any operations are allowed) and then throw it away or cache it in ThreadLocal
If you combine this with explicit fences and/or atomic 16, 32 and 64 bit operations via VarHandles (see MethodHandles.byteBufferViewVarHandle(Class, ByteOrder)), concurrent programming with ByteBuffer(s) is entirely possible. Thank you for the usual expert advice. I am sure it will be of great help in implementing a persistent data management library over this JEP's base capability.
regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
Hi Andrew,

Let me first say that this issue of "ByteBuffer might not be the right answer" is something of a digression from the JEP discussion. I think the JEP should proceed forward using MBB with the API that you and Alan had discussed previously. At most, the discussion of the "right thing" issue might affect a side note in the JEP text about possible limitations and future directions of this effort. However, it's not a blocker to the JEP making progress as far as I'm concerned.

With that in mind, I'll discuss the issue of multithreaded access to ByteBuffers and how this bears on whether buffers are or aren't the "right answer." There are actually several issues that figure into the "right answer" analysis. In this message, though, I'll just focus on the issue of multithreaded access.

To recap (possibly for the benefit of other readers) the Buffer class doc has the following statement:

    Buffers are not safe for use by multiple concurrent threads. If a buffer is to be used by more than one thread then access to the buffer should be controlled by appropriate synchronization.

Buffers are primarily designed for sequential operations such as I/O or codeset conversion. Typical buffer operations set the mark, position, and limit before initiating the operation. If the operation completes partially -- not uncommon with I/O or codeset conversion -- the position is updated so that the operation can be resumed easily from where it left off.

The fact that buffers not only contain the data being operated upon but also mutable state information such as mark/position/limit makes it difficult to have multiple threads operate on different parts of the same buffer. Each thread would have to lock around setting the position and limit and performing the operation, preventing any parallelism. The typical way to deal with this is to create multiple buffer slices, one per thread. Each slice has its own mark/position/limit values but shares the same backing data.
We can avoid the need for this by adding absolute bulk operations, right? Let's suppose we were to add something like this (considering ByteBuffer only, setting the buffer views aside):

    get(int srcOff, byte[] dst, int dstOff, int length)
    put(int dstOff, byte[] src, int srcOff, int length)

Each thread can perform its operations on a different part of the buffer, in parallel, without interference from the others. Presumably these operations don't read or write the mark and position.

Oh, wait. The existing absolute put and get overloads *do* respect the buffer's limit, so the absolute bulk operations ought to as well. This means they do depend on shared state. (I guess we could make the absolute bulk ops not respect the limit, but that seems inconsistent.)

OK, let's adopt an approach similar to what was described by Peter Levart a couple messages upthread, where a) there is an initialization step where various things including the limit are set properly; b) the buffer is published to the worker threads properly, e.g., using a lock or other suitable memory operation; and c) all worker threads agree only to use absolute operations and to avoid relative operations.

Now suppose the threads have completed their work and you want to, say, write the buffer's contents to a channel. You have to carefully make sure the threads are all finished and properly publish their results back to some central thread, have that central thread receive the results, set the position and limit, after which the central thread can initiate the I/O operation. This can certainly be made to work. But note what we just did.
We now have an API where:

- there are different "phases", where in one phase all the methods work, but in another phase only certain methods work (otherwise it breaks silently);
- you have to carefully control all the code to ensure that the wrong methods aren't called when the buffer is in the wrong phase (otherwise it breaks silently); and
- you can't hand off the buffer to a library (3rd party or JDK) without carefully orchestrating a transition into the right phase (otherwise it breaks silently).

Frankly, this is pretty crappy. It's certainly possible to work around it. People do, and it is painful, and they complain about it up and down all day long (and rightfully so).

Note that this discussion is based primarily on looking at the ByteBuffer API. I have not done extensive investigation of the impact of the various buffer views (IntBuffer, LongBuffer, etc.), nor have I looked thoroughly at the implementations. I have no doubt that we will run into additional issues when we do those investigations.

If we were designing an API to support multi-threaded access to memory regions, it would almost certainly look nothing like the buffer API. This is what Alan means by "buffers might not be the right answer." As things stand, it appears quite difficult to me to fix the multi-threaded access problem without turning buffers into something they aren't, or fragmenting the API in some complex and uncomfortable way.

Finally, note that this is not an argument against adding bulk absolute operations! I think we should probably go ahead and do that anyway. But let's not fool ourselves into thinking that bulk absolute operations solve the multi-threaded buffer access problem.

s'marks
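[Editorial note: to make the JDK-5029431 discussion concrete, here is one way to approximate absolute bulk put/get on top of the existing API: a throwaway duplicate carries the cursor state, leaving the shared buffer's position and mark untouched. The helper names are invented for this sketch; it is a workaround illustration, not the proposed API, and it still respects the buffer's limit exactly as discussed above.]

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class AbsoluteBulk {
    // Sketch of the semantics of an absolute bulk put: the duplicate has
    // its own position/limit, so the shared buffer's cursor is untouched.
    static void absolutePut(ByteBuffer bb, int index, byte[] src) {
        ByteBuffer view = bb.duplicate();
        view.position(index);
        view.put(src);   // relative put, but on the private view
    }

    // Sketch of the corresponding absolute bulk get.
    static byte[] absoluteGet(ByteBuffer bb, int index, int length) {
        byte[] dst = new byte[length];
        ByteBuffer view = bb.duplicate();
        view.position(index);
        view.get(dst);
        return dst;
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocate(32);
        absolutePut(bb, 4, "hello".getBytes(StandardCharsets.US_ASCII));
        absolutePut(bb, 16, "world".getBytes(StandardCharsets.US_ASCII));
        System.out.println(
            new String(absoluteGet(bb, 4, 5), StandardCharsets.US_ASCII) + " "
          + new String(absoluteGet(bb, 16, 5), StandardCharsets.US_ASCII));
        // The shared buffer's cursor state never moved:
        System.out.println("position untouched: " + bb.position());
    }
}
```

Note this only sidesteps the cursor-sharing problem; it does nothing about the phase problem described above, which is precisely why bulk absolute operations alone don't solve multi-threaded buffer access.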
Hi Stuart,

I mostly agree with your assessment about the suitability of the ByteBuffer API for nice multithreaded use. What would such an API look like? I think pretty much like ByteBuffer, but without the things that mutate mark/position/limit/ByteOrder -- a stripped-down ByteBuffer API, therefore. That would, in my opinion, be the most low-level API possible.

If you add things to such an API that coordinate multithreaded access to the underlying memory, you are already creating a concurrent data structure for a particular set of use cases, which might not cover all possible use cases, or might be sub-optimal at some of them. So I think this is better layered on top of such an API than built into it.

Low-level multithreaded access to memory is, in my opinion, always going to be "unsafe" from the standpoint of coordination. It's not only the mark/position/limit/ByteOrder that is not multithreaded-friendly about the ByteBuffer API, but the underlying memory too. It would be nice if mark/position/limit/ByteOrder weren't in the way, though.

Regards, Peter

On 09/28/2018 07:51 AM, Stuart Marks wrote:
Hi Andrew,
Let me first say that this issue of "ByteBuffer might not be the right answer" is something of a digression from the JEP discussion. I think the JEP should proceed forward using MBB with the API that you and Alan had discussed previously. At most, the discussion of the "right thing" issue might affect a side note in the JEP text about possible limitations and future directions of this effort. However, it's not a blocker to the JEP making progress as far as I'm concerned.
With that in mind, I'll discuss the issue of multithreaded access to ByteBuffers and how this bears on whether buffers are or aren't the "right answer." There are actually several issues that figure into the "right answer" analysis. In this message, though, I'll just focus on the issue of multithreaded access.
To recap (possibly for the benefit of other readers) the Buffer class doc has the following statement:
Buffers are not safe for use by multiple concurrent threads. If a buffer is to be used by more than one thread then access to the buffer should be controlled by appropriate synchronization.
Buffers are primarily designed for sequential operations such as I/O or codeset conversion. Typical buffer operations set the mark, position, and limit before initiating the operation. If the operation completes partially -- not uncommon with I/O or codeset conversion -- the position is updated so that the operation can be resumed easily from where it left off.
The fact that buffers not only contain the data being operated upon but also mutable state information such as mark/position/limit makes it difficult to have multiple threads operate on different parts of the same buffer. Each thread would have to lock around setting the position and limit and performing the operation, preventing any parallelism. The typical way to deal with this is to create multiple buffer slices, one per thread. Each slice has its own mark/position/limit values but shares the same backing data.
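The slice-per-thread workaround described above can be sketched as follows. This is an illustrative example only (the buffer size, region sizes, and thread count are invented for the demo):

```java
import java.nio.ByteBuffer;

public class SliceDemo {
    // Carve a buffer into per-thread slices; each slice has its own
    // mark/position/limit but shares the same backing data.
    static ByteBuffer[] carve(ByteBuffer buf, int parts) {
        int chunk = buf.capacity() / parts;
        ByteBuffer[] slices = new ByteBuffer[parts];
        for (int i = 0; i < parts; i++) {
            // position/limit on the parent select the region to slice
            buf.position(i * chunk).limit((i + 1) * chunk);
            slices[i] = buf.slice();
        }
        buf.clear(); // restore the parent's position/limit
        return slices;
    }

    public static void main(String[] args) throws InterruptedException {
        ByteBuffer buf = ByteBuffer.allocate(64);
        ByteBuffer[] slices = carve(buf, 4);
        Thread[] workers = new Thread[4];
        for (int i = 0; i < 4; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                // each thread uses only its own slice, so relative
                // operations do not interfere across threads
                while (slices[id].hasRemaining()) {
                    slices[id].put((byte) id);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(buf.get(0) + " " + buf.get(16) + " " + buf.get(63)); // 0 1 3
    }
}
```

Note that the carving itself mutates the parent's position and limit, so it must happen before the slices are handed out to the worker threads.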
We can avoid the need for this by adding absolute bulk operations, right?
Let's suppose we were to add something like this (considering ByteBuffer only, setting the buffer views aside):
get(int srcOff, byte[] dst, int dstOff, int length)
put(int dstOff, byte[] src, int srcOff, int length)
Each thread can perform its operations on a different part of the buffer, in parallel, without interference from the others. Presumably these operations don't read or write the mark and position. Oh, wait. The existing absolute put and get overloads *do* respect the buffer's limit, so the absolute bulk operations ought to as well. This means they do depend on shared state. (I guess we could make the absolute bulk ops not respect the limit, but that seems inconsistent.)
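The limit-dependence of the existing absolute operations is easy to demonstrate; a small sketch:

```java
import java.nio.ByteBuffer;

public class LimitDemo {
    // Returns true if an absolute get at the given index is rejected
    // because the index lies at or beyond the buffer's current limit.
    static boolean rejectedByLimit(ByteBuffer buf, int index) {
        try {
            buf.get(index);      // absolute get: checked against the limit
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.limit(8);            // shared state that absolute ops consult
        System.out.println(rejectedByLimit(buf, 4));   // false: within limit
        System.out.println(rejectedByLimit(buf, 12));  // true: within capacity, beyond limit
    }
}
```

So even "absolute" operations read one piece of mutable shared state: the limit.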
OK, let's adopt an approach similar to what was described by Peter Levart a couple messages upthread, where a) there is an initialization step where various things including the limit are set properly; b) the buffer is published to the worker threads properly, e.g., using a lock or other suitable memory operation; and c) all worker threads agree only to use absolute operations and to avoid relative operations.
Now suppose the threads have completed their work and you want to, say, write the buffer's contents to a channel. You have to carefully make sure the threads are all finished and properly publish their results back to some central thread, have that central thread receive the results, set the position and limit, after which the central thread can initiate the I/O operation.
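The protocol just described -- initialize, publish, absolute-only worker writes, then a rendezvous before the central thread does I/O -- can be sketched like this (the in-memory channel and region sizes are invented for the example):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PhasedDemo {
    static byte[] run() throws IOException, InterruptedException {
        // phase (a): initialization -- set the limit before publishing
        ByteBuffer buf = ByteBuffer.allocate(32);
        buf.limit(32);

        // phase (b): publish to workers; the executor's internal
        // synchronization publishes the buffer safely
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            final int base = i * 8;
            // phase (c): workers agree to use absolute puts only
            pool.execute(() -> {
                for (int j = 0; j < 8; j++) {
                    buf.put(base + j, (byte) (base + j));
                }
            });
        }
        // rendezvous: all workers must finish before anyone touches
        // position/limit again
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);

        // the central thread now owns the buffer: set position/limit
        // and initiate the I/O operation
        buf.position(0).limit(32);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (WritableByteChannel ch = Channels.newChannel(sink)) {
            ch.write(buf);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] out = run();
        System.out.println(out.length + " " + out[0] + " " + out[31]); // 32 0 31
    }
}
```

Nothing enforces the phase discipline here: a worker calling a relative put, or the central thread writing before the rendezvous, would corrupt the result silently.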
This can certainly be made to work.
But note what we just did. We now have an API where:
- there are different "phases", where in one phase all the methods work, but in another phase only certain methods work (otherwise it breaks silently);
- you have to carefully control all the code to ensure that the wrong methods aren't called when the buffer is in the wrong phase (otherwise it breaks silently); and
- you can't hand off the buffer to a library (3rd party or JDK) without carefully orchestrating a transition into the right phase (otherwise it breaks silently).
Frankly, this is pretty crappy. It's certainly possible to work around it. People do, and it is painful, and they complain about it up and down all day long (and rightfully so).
Note that this discussion is based primarily on looking at the ByteBuffer API. I have not done extensive investigation of the impact of the various buffer views (IntBuffer, LongBuffer, etc.), nor have I looked thoroughly at the implementations. I have no doubt that we will run into additional issues when we do those investigations.
If we were designing an API to support multi-threaded access to memory regions, it would almost certainly look nothing like the buffer API. This is what Alan means by "buffers might not be the right answer." As things stand, it appears quite difficult to me to fix the multi-threaded access problem without turning buffers into something they aren't, or fragmenting the API in some complex and uncomfortable way.
Finally, note that this is not an argument against adding bulk absolute operations! I think we should probably go ahead and do that anyway. But let's not fool ourselves into thinking that bulk absolute operations solve the multi-threaded buffer access problem.
s'marks
Hi guys!

I'm one of the aforementioned devs (like many others) using external (and unsafe) APIs to access a ByteBuffer's contents concurrently, and a developer of a messaging broker's journal that would benefit from this JEP :)

Re a concurrent access API, how does this look? https://github.com/real-logic/agrona/blob/master/agrona/src/main/java/org/ag...

Note: I don't know how it's considered to appear in these discussions without introducing myself, and I hope I'm not OT, but both this JEP and the comments around it are so interesting that I couldn't resist. I apologize if I'm not respecting some rule here.

Thanks for the hard work, Francesco

On Fri, 28 Sep 2018 at 09:21, Peter Levart <peter.levart@gmail.com> wrote:
Hi Francesco,

Thanks for the pointer to the AtomicBuffer stuff. It's quite interesting. I don't know how directly relevant this JEP is to your work; I guess that's really up to you and possibly Andrew Dinn. However, in my thinking, if you have useful comments and relevant questions, you're certainly welcome to participate in the discussion.

Looking at the AtomicBuffer interface, I see that it supports reading and writing a variety of data items, with a few different memory access modes. That reminds me of the VarHandles API. [1] This enables quite a number of different operations on a data item somewhere in memory, with a variety of memory access modes. What would AtomicBuffer look like if it were to use VarHandles? Or would AtomicBuffer be necessary at all if the rest of the library were to use VarHandles?

Note that a VarHandle can be used to access an arbitrary item within a region of memory, such as an array or a ByteBuffer. [2] An obvious extension to VarHandle is to allow a long offset, not just an int offset.

Note also that while many VarHandle methods return Object and take a varargs parameter of Object..., this does not imply that primitives are boxed! This is a bit of VM magic called "signature polymorphism"; see JVMS 2.9.3 [3].

s'marks

[1] https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/invok...
[2] https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/invok...)
[3] https://docs.oracle.com/javase/specs/jvms/se11/html/jvms-2.html#jvms-2.9.3

On 9/28/18 12:38 AM, Francesco Nigro wrote:
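For readers unfamiliar with it, here is a small sketch of the VarHandle mechanism mentioned above, using the byte-buffer view handles from java.lang.invoke. The offsets passed to the handle are byte offsets and must satisfy the view's alignment for the atomic access modes; a direct buffer is used here so that offset 0 is int-aligned:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class VarHandleDemo {
    // View a ByteBuffer as a sequence of ints; the handle supports a
    // range of access modes (plain, volatile, CAS, ...) without ever
    // touching the buffer's mark/position/limit.
    static final VarHandle INTS =
        MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // Classic CAS retry loop: atomically increment the int at byteOffset.
    static int casIncrement(ByteBuffer buf, int byteOffset) {
        int old;
        do {
            old = (int) INTS.getVolatile(buf, byteOffset);
        } while (!INTS.compareAndSet(buf, byteOffset, old, old + 1));
        return old + 1;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(16);
        INTS.setVolatile(buf, 0, 41);
        System.out.println(casIncrement(buf, 0)); // 42
    }
}
```

Despite the Object-typed signatures, the int arguments and return values above are not boxed -- this is the signature polymorphism referred to in the message.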
On 9/28/18 12:21 AM, Peter Levart wrote:
Right, getting mark/position/limit/ByteOrder out of the way would be a good first step. (I just realized that ByteOrder is mutable too!) I also think you're right that proceeding down a "classic" thread-safe object design won't be fruitful. We don't know what the right set of operations is yet, so it'll be difficult to know how to deal with thread safety.

One complicating factor is timely deallocation. This is an existing problem with direct buffers and MappedByteBuffer (see JDK-4724038). If a "buffer" were confined to a single thread, it could be deallocated safely when that thread is finished. I don't know how to guarantee thread confinement, though.

On the other hand, if a "buffer" is exposed to multiple threads, deallocation requires that proper synchronization and checking be done so that subsequent operations are properly checked (so that they do something reasonable, like throw an exception) instead of accessing unmapped or repurposed memory. If checking is done, this pushes operations to be coarser-grained (bulk) so that the checking overhead is amortized over a more expensive operation.

I know there has been some thought put into this in the Panama project, but I don't know exactly where it stands at the moment. See the MemoryRegion and Scope stuff.

s'marks
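To illustrate the amortized-checking idea, here is a hypothetical sketch -- all names are invented for this example and it is not the Panama MemoryRegion/Scope API -- of a region whose bulk operations check liveness once per call rather than once per element:

```java
import java.nio.ByteBuffer;

// Hypothetical wrapper, invented for illustration. Bulk operations
// check the closed flag once per call, so the liveness check is
// amortized over the whole transfer rather than paid per element.
public class CheckedRegion {
    private final ByteBuffer memory;
    private volatile boolean closed;

    public CheckedRegion(int size) {
        this.memory = ByteBuffer.allocateDirect(size);
    }

    public void put(int offset, byte[] src) {
        if (closed) throw new IllegalStateException("region closed");
        for (int i = 0; i < src.length; i++) {
            memory.put(offset + i, src[i]); // absolute puts, no shared cursor
        }
    }

    public void close() {
        closed = true;
        // a real implementation would also need to fence against
        // in-flight operations before actually unmapping the memory
    }

    public boolean isClosed() { return closed; }
}
```

The hard part is exactly what this sketch does not solve: a volatile flag alone cannot stop a bulk operation that is already executing from touching memory after it has been unmapped, which is the crux of JDK-4724038.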
Hi Stuart, On 28/09/18 06:51, Stuart Marks wrote:
Let me first stay that this issue of "ByteBuffer might not be the right answer" is something of a digression from the JEP discussion. I think the JEP should proceed forward using MBB with the API that you and Alan had discussed previously. At most, the discussion of the "right thing" issue might affect a side note in the JEP text about possible limitations and future directions of this effort. However, it's not a blocker to the JEP making progress as far as I'm concerned.
Thanks for clarifying that point. I have already added a note to that effect to the JEP. I take your other point that these limitations make this JEP a less useful addition than it could be. However, it's hard to see what else might usefully be provided that does not involve a reworking of JDK core-lib (and, potentially, JVM) functionality that has a much larger scope than is needed to crack the specific nut the JEP addresses.
With that in mind, I'll discuss the issue of multithreaded access to ByteBuffers and how this bears on whether buffers are or aren't the "right answer." There are actually several issues that figure into the "right answer" analysis. In this message, though, I'll just focus on the issue of multithreaded access.
Thank you for a very clear and interesting summary of the limitations of the Buffer API. I have cut it from this reply for the sake of brevity but I will respond to a few points.

I think the limitations you point out regarding concurrent clients' mode of operation are less severe in this specific case, because there is not really a need for those client threads to reach a rendezvous point in order to execute some form of FileChannel update. The buffer content is persistent memory, so, essentially, the data writes constitute the update. If independent threads can arrange to coordinate over carving up separate regions of a persistent mapped buffer for parallel update, then they can also write and flush (by which I mean force cache writeback for) those regions independently.

Clearly there will also be a need for threads to write common index regions of the persistent mapped buffer in order to ensure that the associated data updates are committed. That means the writes and flushes for those common regions need to synchronize. However, that is simply business as usual for persistent data management code; a TX manager will already have code in place for this purpose, for example. Certainly, that synchronized update will not need to rely on buffer cursor (position) management.

Also, I am not sure I see any problem arising from your point about absolute puts (and gets) depending on the 'limit' property. The various put operations do indeed /read/ the current limit, but they do not update it. So, you are right to state that a persistent store management library built over this API would need to ensure that put operations were reined in via some form of rendezvous if it ever wanted to adjust the limit. However, I don't think that is going to happen with a library that manages a mapped persistent store.
I would expect that any such code is never going to call clear(), flip(), truncate() -- nor make a direct call to limit() -- except as part of the initialization or reconciliation performed at startup before concurrent clients are unleashed. Anyway, thank you for a clear warning as to the precise perils faced in implementing correct client libraries over the base layer this JEP proposes.
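The write-and-flush-your-own-region pattern described above can be sketched as follows. This assumes a JDK that has the ranged MappedByteBuffer.force(int index, int length) overload (added in later JDKs), and it maps an ordinary temp file rather than real NVM via MAP_SYNC, so force here writes back through the page cache; the threading pattern is the point:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class RegionFlushDemo {
    // Each worker writes and flushes its own region independently;
    // no rendezvous is needed because the writes are the update.
    static void writeAndFlushRegion(MappedByteBuffer mbb, int base, int len, byte val) {
        for (int i = 0; i < len; i++) {
            mbb.put(base + i, val);     // absolute puts only
        }
        mbb.force(base, len);           // writeback for this region only
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        Path file = Files.createTempFile("region", ".dat");
        file.toFile().deleteOnExit();
        try (FileChannel fc = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mbb = fc.map(MapMode.READ_WRITE, 0, 64);
            Thread t1 = new Thread(() -> writeAndFlushRegion(mbb, 0, 32, (byte) 1));
            Thread t2 = new Thread(() -> writeAndFlushRegion(mbb, 32, 32, (byte) 2));
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println(mbb.get(0) + " " + mbb.get(32)); // 1 2
        }
    }
}
```

A shared index region, by contrast, would need the two threads to synchronize their writes and forces against each other, as the message notes.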
If we were designing an API to support multi-threaded access to memory regions, it would almost certainly look nothing like the buffer API. This is what Alan means by "buffers might not be the right answer." As things stand, it appears quite difficult to me to fix the multi-threaded access problem without turning buffers into something they aren't, or fragmenting the API in some complex and uncomfortable way.
Agreed.
Finally, note that this is not an argument against adding bulk absolute operations! I think we should probably go ahead and do that anyway. But let's not fool ourselves into thinking that bulk absolute operations solve the multi-threaded buffer access problem.

Also agreed.
regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
On 9/28/18 2:16 AM, Andrew Dinn wrote:
Thanks for clarifying that point. I have already added a note to that effect to the JEP. I take your other point that these limitations make this JEP a less useful addition than it could be. However, it's hard to see what else might usefully be provided that does not involve a reworking of JDK core-lib (and, potentially, JVM) functionality that has a much larger scope than is needed to crack the specific nut the JEP addresses.
I'm not sure I'd put it quite that way, "less useful than it could be." I guess it depends on what you think the JEP is about. If the JEP is about MBB, and MBB is at some point superseded by something else, then yes, I suppose that means this JEP is less useful than it might be.

On the other hand, suppose that this JEP is primarily about NVM, including access, operations, API, architecture, life cycle issues, etc., and these happen to be surfaced through MBB today. If something supersedes MBB, then the concepts developed by this JEP can be retargeted to that other thing at the appropriate time. Or are the concepts developed by this JEP so closely intertwined with MBB that this idea of "retargeting" doesn't make sense? I don't know.
Thank you for a very clear and interesting summary of the limitations of the Buffer API. I have cut it from this reply for the sake of brevity but I will respond to a few points.
Great, I'm glad this helped. I'm never quite sure whether writing these big essays is helpful. (Note also that there are OTHER limitations of the buffer API that I didn't cover, since the message was getting too long as it was. Example: 2GB limit.)
I think the limitations you point out regarding concurrent clients' mode of operation are less severe in this specific case because there is not really a need for those client threads to reach a rendezvous point in order to execute some form of FileChannel update. The buffer content is persistent memory. So, essentially, the data writes constitute the update.
Sure. It may be that the use cases for NVM aren't particularly affected by limitations of the Buffer APIs. If so, so much the better! But there are other systems where the limitations imposed by buffers are so onerous that they've had to go directly to Unsafe.
Anyway, thank you for a clear warning as to the precise perils faced in implementing correct client libraries over the base layer this JEP proposes.
Yes, this is essentially it. When you run into a problem -- as every project does -- think about whether it's inherent to NVM, or whether it's incidental to NVM and is rooted in the use of Buffers. s'marks
On 26/09/2018 14:27, Andrew Dinn wrote:
: I'm not clear why we should only use one flag. The two flags I specified reflect two independent use cases, one where data stored in an NVM device is accessed read-only and another where it is accessed read-write. Are you suggesting that the read-only case is redundant? I'm not sure I agree. For example, a utility which might want to review the state of persistent data while a service is off-line would really want to pass flag READ_ONLY_PERSISTENT. Of course, it could employ READ_WRITE_PERSISTENT (or equivalently, SYNC) and just not write the data but, mutatis mutandis, that same argument would remove the case for flag READ_ONLY.
I'm wrong on this point. The map takes a single MapMode, not a set of modes as I was assuming, so you are right that it needs two new modes, not one. I do think we should re-visit the name though as the native flag is MAP_SYNC. -Alan
On 30/09/18 16:31, Alan Bateman wrote:
I'm wrong on this point. The map takes a single MapMode, not a set of modes as I was assuming, so you are right that it needs two new modes, not one. I do think we should re-visit the name though as the native flag is MAP_SYNC.
Sure, I'd be happy to change this.
Would READ_ONLY_SYNC and READ_WRITE_SYNC be suitable alternatives? Or do you have something else in mind? regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
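The two proposed names pair each access mode with the MAP_SYNC property, which is why a single new mode cannot cover both use cases. A toy model of that cross product (the `READ_ONLY_SYNC`/`READ_WRITE_SYNC` names come from the discussion above; the enum itself is purely illustrative, not the real `FileChannel.MapMode` API):

```java
// Illustrative only: models how the proposed modes combine an access mode
// with the MAP_SYNC property. The real modes would extend FileChannel.MapMode.
public class MapModeSketch {
    public enum Mode {
        READ_ONLY(true, false),
        READ_WRITE(false, false),
        PRIVATE(false, false),
        READ_ONLY_SYNC(true, true),   // proposed: read-only MAP_SYNC mapping
        READ_WRITE_SYNC(false, true); // proposed: read-write MAP_SYNC mapping

        public final boolean readOnly;
        public final boolean sync; // true when the mapping requests MAP_SYNC

        Mode(boolean readOnly, boolean sync) {
            this.readOnly = readOnly;
            this.sync = sync;
        }
    }

    public static void main(String[] args) {
        // The off-line inspection utility from the thread wants READ_ONLY_SYNC;
        // a live service updating persistent state wants READ_WRITE_SYNC.
        for (Mode m : Mode.values()) {
            System.out.println(m + " readOnly=" + m.readOnly + " sync=" + m.sync);
        }
    }
}
```

Because `map` takes a single `MapMode` rather than a set of flags, both combinations need their own named constant.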
On 03/10/2018 10:14, Andrew Dinn wrote:
: Sure, I'd be happy to change this.
Would READ_ONLY_SYNC and READ_WRITE_SYNC be suitable alternatives? Or do you have something else in mind?
I think that works but it means looking at the proposed MappedByteBuffer::isPersistent too. MappedByteBuffer hasn't needed methods to test if the mapping is read-only, read-write or private mapping and I wonder if isPersistent is really needed. -Alan
Hi Alan, Apologies for the 2 week+ delay in replying -- I was away on holiday. On 03/10/18 15:24, Alan Bateman wrote:
I think that works but it means looking at the proposed MappedByteBuffer::isPersistent too. MappedByteBuffer hasn't needed methods to test if the mapping is read-only, read-write or private mapping and I wonder if isPersistent is really needed.
That's a good question. I guess isPersistent is not really needed as a private method since a field lookup would do. It is obviously of no use as a protected method because MappedByteBuffer sits in a sandwich between ByteBuffer and DirectByteBuffer and, hence, cannot be specialized. So, why make it public?
The immediate intention here is to use the new MappedByteBuffer API as a base layer for implementation of a library of classes that provide equivalent capabilities to the libs that extend Intel's libpmem, providing various managed persistent data overlays on the raw NVM bytes such as journals, block array stores etc. As far as that goal is concerned there is arguably no need to provide isPersistent as a public API because these client classes would normally only employ a buffer mapped to NVM.
However, I am not sure that is always going to be the only desired mode of operation. It may be, for example, that we want to use these classes to operate over mapped files as well as mapped NVM. That's very likely not going to give great performance (although in the case of a block array store whose block sizes are file page multiples it might not make that much difference). However, it does allow for compatibility mode operation when NVM is not available. Even so, I believe those clients will not actually care what type of buffer they use.
Other client classes might also need to be able to provide these two alternative modes of operation -- where the underlying ByteBuffer may or may not be persistent -- and it is not clear to me that they would be indifferent to whether the mapping was to NVM or to a conventional file. It might be that some clients would want to use the buffer in different ways depending on how it is mapped. Jonathan Halliday (in cc) actually defined the method as public on the rather liberal assumption that this might be how it was used but it seems he did not have a specific use case in mind.
Of course, a client could always track this information itself. However, since this datum is i) a property of the ByteBuffer and ii) already stored in the ByteBuffer I felt it was best to expose the property via a public getter -- arguably it ought to be final.
If you think that is inappropriate or would prefer to remove it so as to minimize the new API surface I would be happy to follow your decision. regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
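The compatibility-mode scenario above is the one place a public query would earn its keep: a client library could pick a flush strategy per buffer. A minimal sketch, assuming a boolean stand-in for the proposed `MappedByteBuffer.isPersistent()` query and a hypothetical `flushRange` helper (neither is a real API):

```java
// Sketch: a journal store picks its flush strategy based on whether the
// buffer maps real NVM (cheap per-cache-line writeback) or an ordinary
// file used in compatibility mode (whole-mapping force()).
public class JournalFlushSketch {
    // Stand-in for the proposed MappedByteBuffer.isPersistent()/isSync().
    public static boolean persistent;

    // Hypothetical helper: returns a description of the action taken.
    public static String flushRange(int offset, int length) {
        if (persistent) {
            // NVM path: write back just the dirtied cache lines.
            return "writeback " + offset + "+" + length;
        } else {
            // Compatibility path: msync the mapping via force() (costly).
            return "force";
        }
    }

    public static void main(String[] args) {
        persistent = true;
        System.out.println(flushRange(128, 32));
        persistent = false;
        System.out.println(flushRange(128, 32));
    }
}
```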
Ping! Is it possible to get a response on this? To summarise: I am happy to rename isPersistent to isSync and/or make it private, as well as change the enum tags to use SYNC instead of PERSISTENT, if that is what is needed to get the JEP approved. I'd really like to see this progress towards a release -- preferably jdk12! regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
On 06/11/18 10:17, Andrew Dinn wrote:
Ping! Is it possible to get a response on this?
Hi Alan, I see you have reviewed the JEP. Thank you.
Vladimir, do you have any further feedback/requests for modification? If not, are you able to review the JEP? regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
Hi Andrew, I am Lead for Hotspot [1]. Alan is Group Lead for core libs and he has already given his review. I don't see any reference to Hotspot in the JEP so I am not sure what to review. Do you need any new optimizations/intrinsics in Hotspot for this JEP? You need to ask Alan or Brian Goetz (as Area Lead) for endorsement before 'Submitting' the JEP [2]. Regards, Vladimir [1] http://openjdk.java.net/projects/jdk/leads [2] "Making decisions and building consensus" http://openjdk.java.net/jeps/1
On 07/11/18 17:12, Vladimir Kozlov wrote:
I am Lead for Hotspot [1]. Alan is Group Lead for core libs and he gave review already.
I don't see any reference to Hotspot in JEP so I am not sure what to review. Do you need any new optimizations/intrinsics in Hotspot for this JEP?
Yes, I do need some new intrinsics. I was not clear whether they needed to be documented in the JEP. Perhaps you could advise? n.b. If you need to know what is being proposed in order to answer that, I can point you at my prototype implementation. Details after the sig.
You need to ask Alan or Brian Goetz (as Area Lead) for endorsement before 'Submitting' JEP [2].
Ok, will do once I know whether details of the intrinsics have to be included. Thanks for your help.
regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
----- 8< -------- 8< -------- 8< -------- 8< -------- 8< -------- 8< ---
The basic operation to persist ByteBuffer changes is provided via a new method of jdk.internal.misc.Unsafe which /is/ currently described in the JEP:
public void writebackMemory(long address, long length)
My prototype implements this method using 3 intrinsics: a pre-writeback memory sync, a per-cache-line force (executed in a loop) and a post-writeback memory sync. I also added a native method which allows the (cpu-specific) cache line size to be retrieved at class init time.
Webrev:
http://cr.openjdk.java.net/~adinn/pmem/webrev.04/
Unsafe changes:
http://cr.openjdk.java.net/~adinn/pmem/webrev.04/src/java.base/share/classes...
In the underlying implementation of the intrinsics there are 3 corresponding new IR nodes, CacheWBNode, CacheWBPreSyncNode and CacheWBPostSyncNode. They are matched by processor-specific ad file rules to generate the required assembler.
Intrinsics implementation:
http://cr.openjdk.java.net/~adinn/pmem/webrev.04/src/hotspot/share/opto/memn...
http://cr.openjdk.java.net/~adinn/pmem/webrev.04/src/hotspot/share/opto/libr...
Back end rules:
http://cr.openjdk.java.net/~adinn/pmem/webrev.04/src/hotspot/cpu/aarch64/aar...
http://cr.openjdk.java.net/~adinn/pmem/webrev.04/src/hotspot/cpu/x86/x86_64....
Assembler changes: The assembler implementations are fairly straightforward. There is no need for a pre sync on either AArch64 or x86_64. A post sync is always needed on AArch64. It may not be needed on x86 depending on what type of cache line flush the processor supports.
http://cr.openjdk.java.net/~adinn/pmem/webrev.04/src/hotspot/cpu/aarch64/mac...
http://cr.openjdk.java.net/~adinn/pmem/webrev.04/src/hotspot/cpu/x86/macroAs... http://cr.openjdk.java.net/~adinn/pmem/webrev.04/src/hotspot/cpu/x86/assembl...
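The three-intrinsic decomposition described above can be sketched in plain Java. This is only an illustration of the address arithmetic and call ordering: `preWritebackSync`, `writebackLine` and `postWritebackSync` are hypothetical no-op stand-ins for the real intrinsics, and the cache line size is assumed rather than queried from the JVM.

```java
// Hypothetical sketch of how Unsafe.writebackMemory might decompose into
// the three intrinsics: pre-sync, per-cache-line force, post-sync.
public class WritebackSketch {
    // Really obtained from a native method at class init time; assumed here.
    static final int CACHE_LINE_SIZE = 64;

    static void preWritebackSync()       { /* no-op on x86_64 and AArch64 */ }
    static void writebackLine(long addr) { /* e.g. CLWB / DC CVAC stand-in */ }
    static void postWritebackSync()      { /* e.g. SFENCE / DSB where needed */ }

    // Write back every cache line overlapping [address, address + length).
    // Returns the line count, purely so the arithmetic can be checked.
    public static long writebackMemory(long address, long length) {
        long line = address & ~(long) (CACHE_LINE_SIZE - 1); // align down
        long end = address + length;
        long count = 0;
        preWritebackSync();
        for (; line < end; line += CACHE_LINE_SIZE) {
            writebackLine(line);
            count++;
        }
        postWritebackSync();
        return count;
    }

    public static void main(String[] args) {
        // A 130-byte range starting 8 bytes into a line spans 3 cache lines.
        System.out.println(writebackMemory(0x1008L, 130L));
    }
}
```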
Hi Andrew, If hotspot changes are needed to achieve the performance or functional goals of the JEP, they should be mentioned (no details needed) in the JEP. It helps the reader understand the scope and impact of the change. Regards, Roger
Hi Roger, On 08/11/18 14:51, Roger Riggs wrote:
If to achieve the performance or functional goals of the JEP hotspot changes are needed they should be mentioned (no details needed) in the JEP. It helps the reader understand the scope and impact of the change.
Thanks. I have added a brief overview of the required Hotspot changes. I also added a note about the target OS/CPU combinations. Vladimir, could you please take a look at the revised JEP and comment? The Hotspot-specific details are at the end of the Description section, where details of the proposed new method Unsafe.writebackMemory are provided. Thanks for any feedback you can provide. regards, Andrew Dinn -----------
Hi Andrew, Given that there is platform-specific code, it would be good to be clear which platforms you are intending to implement as part of this JEP, and which platforms will need others to step in to support. I'm quite happy with your plan, but I'd like to see more JEPs being clear about platform support (CPU and OS). - Derek
Hi Derek, On 08/11/18 15:49, White, Derek wrote:
Given that there is platform-specific code, it would be good to be clear which platforms you are intending to implement as part of this JEP, and which platforms will need others to step in to support.
I'm quite happy with your plan, but I'd like to see more JEPs being clear about platform support (CPU and OS).
The prototype implements this on Linux/x86_64 and Linux/AArch64. On other platforms it will refuse to create a persistent MappedByteBuffer by throwing an exception.
I believe it should be straightforward to migrate it to Windows/x86_64 but I don't know for sure. That depends on support for the mmap MAP_SYNC mode being available.
I am not currently in a position to implement or test Windows/x86_64. I really don't know what would work on other hardware or OSes. I'd be happy to take advice but I'd like to get this included targeting the 2 OS/CPU configurations mentioned above and see those other platforms supported as an upgrade.
I'll add a note to the JEP to record this. regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
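The refuse-with-an-exception behaviour described above can be sketched as a simple capability check. The string-based `mapSyncSupported` test is purely illustrative: in the real implementation the supported OS/CPU combinations would be compiled in per platform, not probed at runtime like this.

```java
// Sketch: fail fast on platforms without MAP_SYNC support, as the
// prototype does for anything other than Linux/x86_64 and Linux/AArch64.
public class PlatformCheckSketch {
    public static boolean mapSyncSupported(String os, String arch) {
        return os.equals("Linux")
            && (arch.equals("x86_64") || arch.equals("aarch64"));
    }

    public static void requireMapSync(String os, String arch) {
        if (!mapSyncSupported(os, arch)) {
            throw new UnsupportedOperationException(
                "persistent MappedByteBuffer not supported on " + os + "/" + arch);
        }
    }

    public static void main(String[] args) {
        requireMapSync("Linux", "x86_64"); // ok
        try {
            requireMapSync("Windows", "x86_64");
        } catch (UnsupportedOperationException e) {
            System.out.println("refused: " + e.getMessage());
        }
    }
}
```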
Thanks Andrew, looks great! - Derek
Thank you, Andrew, for sending changes. I am fine with the intrinsics general wording in the JEP. I 'reviewed' the JEP. I have several questions and issues with the proposed changes in Hotspot which need to be discussed during the changes review. My main question is - should the new nodes be treated as global memory barriers (not only RAW memory)? One suggestion I can give is to introduce a feature flag in java code which will be set by the VM if the platform can support the feature. Similar to COMPACT_STRINGS in JEP 254 Compact Strings: http://hg.openjdk.java.net/jdk/jdk/file/66a0e6b3ec1a/src/java.base/share/cla... Thanks, Vladimir
Hi Vladimir, On 08/11/18 18:07, Vladimir Kozlov wrote:
Thank you, Andrew, for sending changes
I am fine with intrinsics general wording in JEP. I 'reviewed' JEP.
Excellent. Thank you very much. On to the next step . . .
I have several questions and issues with proposed changes in Hotspot which needs to be discussed during changes review. My main question is - should new nodes be treated as global memory barriers (not only RAW memory)?
That's a very good question which I did consider and to which, I think, I can offer not just one answer but two, although I may not be understanding correctly what is involved here. Apologies if that is the case. Also for the length of the reply and the 'on the one hand/on the other hand' conclusions.
I chose to type the memory ops using the RAW slice because currently the Unsafe intrinsic is only being used to write data /transactionally/ to a MappedByteBuffer i.e. to RAW addresses. Clearly, in order to ensure that RAW memory is persisted correctly it is necessary to ensure that other RAW memory read and write operations do not get re-ordered around the flushes.
However, in order to commit persistent writes, transactional clients using this API must employ some form of thread synchronization. A commit involves writing some control state to an area of memory that other threads may also be trying to write concurrently. The control state updates a shadow record to a committed record. Synchronization is required to ensure that updates to this shared control state are serialised and, hence, retain consistency.
As a consequence, for the current purposes I believe it does not matter if non-RAW memory operations are re-ordered around the RAW flush operations. Reordering before commit should not affect any other threads. The synchronization needed at commit should enforce serialization of any re-ordered non-RAW and RAW writes that might have occurred before the commit so that they are all visible at the point of commit.
On the other hand ... if in future we want to use this same Unsafe intrinsic to provide writeback guarantees for memory that is not of RAW type -- say for persisting changes to klass metadata or, even, Java heap objects in some future Java persistent programming model -- in that case I think we would need to type the memory ops using a Bottom ptr slice.
Perhaps we should just go for the second option straight away?
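The shadow-record-to-committed-record discipline described above can be sketched with a plain byte array standing in for the persistent region and a no-op stub for the writeback. The record layout and the `writeback` helper are hypothetical illustrations, not the JEP's API; the point is the ordering of writes and flushes.

```java
// Sketch of the transactional commit discipline: persist the payload,
// then flip and persist the commit flag, in that order.
public class CommitSketch {
    public static final byte[] pmem = new byte[64]; // pretend NVM region
    public static final int COMMIT_FLAG = 0;        // offset of commit record
    public static final int DATA = 8;               // offset of payload

    // Stand-in for Unsafe.writebackMemory(address, length): in real code
    // this forces the dirty cache lines out to the persistence domain.
    static void writeback(int offset, int length) { }

    public static void commit(byte[] payload) {
        // 1. Write the payload while the flag still marks it a shadow record.
        System.arraycopy(payload, 0, pmem, DATA, payload.length);
        // 2. Persist the payload *before* the flag flips: ordering matters.
        writeback(DATA, payload.length);
        // 3. Flip the shadow record to a committed record ...
        pmem[COMMIT_FLAG] = 1;
        // 4. ... and persist the flag. A crash before this step replays the
        //    old state; after it, the new state.
        writeback(COMMIT_FLAG, 1);
    }

    public static void main(String[] args) {
        commit(new byte[] { 42 });
        System.out.println(pmem[COMMIT_FLAG] + " " + pmem[DATA]);
    }
}
```

This is why the flushes must not be re-ordered against the RAW writes they cover: if the flag's writeback could drift ahead of the payload's, a crash could leave a committed flag guarding unpersisted data.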
One suggestion I can give is to introduce a feature flag in java code which will be set by VM if platform can support the feature. Similar to COMPACT_STRINGS in JEP 254 Compact Strings:
http://hg.openjdk.java.net/jdk/jdk/file/66a0e6b3ec1a/src/java.base/share/cla... That sounds like a good idea. I will take a look at it (I assume if I grep around in the JVM code looking for COMPACT_STRINGS I will find out where this field actually gets initialised?).
regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
On 11/9/18 9:01 AM, Andrew Dinn wrote:
Hi Vladimir,
On 08/11/18 18:07, Vladimir Kozlov wrote:
Thank you, Andrew, for sending changes
I am fine with intrinsics general wording in JEP. I 'reviewed' JEP.
Excellent. Thank you very much. On to the next step . . .
I have several questions and issues with proposed changes in Hotspot which needs to be discussed during changes review. My main question is - should new nodes be treated as global memory barriers (not only RAW memory)?
That's a very good question which I did consider and to which, I think, I can offer not just one answer but two, although I may not be understanding correctly what is involved here. Apologies if that is the case. Also for the length of the reply and the 'on the one hand/on the other hand' conclusions.
I chose to type the memory ops using the RAW slice because currently the Unsafe intrinsic is only being used to write data /transactionally/ to a MappedByteBuffer i.e. to RAW addresses. Clearly, in order to ensure that RAW memory is persisted correctly it is necessary to ensure that other RAW memory read and write operations do not get re-ordered around the flushes.
However, in order to commit persistent writes, transactional clients using this API must employ some form of thread synchronization. A commit involves writing some control state to an area of memory that other threads may also be trying to write concurrently. The control state updates a shadow record to a committed record. Synchronization is required to ensure that updates to this shared control state ar serialised and, hence, retain consistency.
As a consequence, for the current purposes I believe it does not matter if non-RAW memory operations are re-ordered around the RAW flush operations. Reordering before commit should not affect any other threads. The synchronization needed at commit should enforce serialization of any re-ordered non-RAW and RAW writes that might have occurred before the commit so that they are all visible at the point of commit.
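The commit protocol sketched in the preceding paragraphs can be illustrated with a small, self-contained model. This is a hedged sketch, not the proposed API: flush() below is a placeholder for MappedByteBuffer.force (or the proposed Unsafe writeback/sync intrinsics), the class and method names are illustrative, and an ordinary heap buffer stands in for an NVM mapping.

```java
import java.nio.ByteBuffer;

// Hypothetical model of the transactional write pattern described above.
public class CommitSketch {
    static final int COMMITTED = 1;

    // Placeholder: on a real NVM-backed MappedByteBuffer this would be
    // force(from, to), i.e. pre-writeback sync, per-cache-line writeback,
    // then post-writeback sync.
    static void flush(ByteBuffer buf, int from, int to) {
        // no-op in this volatile-memory model
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(1024); // stand-in for an NVM mapping

        // 1. write the payload into the shadow region (RAW writes)
        buf.putLong(8, 42L);

        // 2. flush the payload before touching the control state; RAW reads
        //    and writes must not be re-ordered around this
        flush(buf, 8, 16);

        // 3. publish the commit record; concurrent committers serialise here
        //    using ordinary thread synchronization
        buf.putInt(0, COMMITTED);
        flush(buf, 0, 4);

        System.out.println(buf.getInt(0) + " " + buf.getLong(8));
    }
}
```

A crash before step 3 leaves the old committed record intact; a crash after it finds the payload already persistent, which is why the flush in step 2 must not be re-ordered past the commit write.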
On the other hand ... if in future we want to use this same Unsafe intrinsic to provide writeback guarantees for memory that is not of RAW type -- say for persisting changes to klass metadata or, even, Java heap objects in some future Java persistent programming model -- in that case I think we would need to type the memory ops using a Bottom ptr slice.
Perhaps we should just go for the second option straight away?
To be safe I think it should be global memory. We can optimize it later if we find issues.
One suggestion I can give is to introduce a feature flag in Java code which will be set by the VM if the platform can support the feature. Similar to COMPACT_STRINGS in JEP 254 Compact Strings:
http://hg.openjdk.java.net/jdk/jdk/file/66a0e6b3ec1a/src/java.base/share/cla... That sounds like a good idea. I will take a look at it (I assume if I grep around in the JVM code looking for COMPACT_STRINGS I will find out where this field actually gets initialised?).
It is controlled by VM flag CompactStrings which is platform specific: http://hg.openjdk.java.net/jdk/jdk/file/13266dac5fdb/src/hotspot/share/runti... Vladimir
regards,
Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
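The feature-flag pattern Vladimir points to can be sketched as follows. This is a hedged, self-contained model, not JDK code: the class and field names are assumptions, and the hard-coded initial value stands in for the value the VM would inject during initialization, as it does for String.COMPACT_STRINGS under JEP 254.

```java
// Hypothetical holder for a VM-injected capability flag.
class MapSyncSupport {
    // In the real JDK the VM would overwrite this during initialization,
    // based on platform support for MAP_SYNC mappings; 'true' here is a
    // placeholder so the model runs.
    static final boolean MAP_SYNC_ENABLED = true;

    static void checkSupported() {
        if (!MAP_SYNC_ENABLED)
            throw new UnsupportedOperationException("sync mapping not supported");
    }
}

public class FlagDemo {
    public static void main(String[] args) {
        // library code guards platform-specific paths on the flag
        MapSyncSupport.checkSupported();
        System.out.println("supported=" + MapSyncSupport.MAP_SYNC_ENABLED);
    }
}
```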
Hi Alan/Brian,

I have finally been able to shelve other commitments and return to this JEP (apologies for the hiatus).

https://openjdk.java.net/jeps/8207851

The JEP has been reviewed positively by Stuart Marks (core libs) and Vladimir Kozlov (intrinsics). It has also been warmly welcomed by several potential users in Red Hat and Intel (including, respectively, Jonathan Halliday and Sandhya Viswanathan, both in cc).

I believe I have addressed all outstanding comments on the JEP per se, including those made by Alan. Is it now possible for one of you to endorse the JEP so it can be submitted?

I am aware that I still need to address a few details in the draft implementation that are not present in the latest webrev. I believe there are two changes requested by Vladimir:

1. correct the type of cache writeback memory nodes to generic memory
2. use the JVM to inject a flag setting which enables/disables mapping of persistent buffers

and also one change requested by Alan:

make method MappedByteBuffer.isPersistent private rather than public

Is there any other impediment to submitting the JEP and proceeding to code review?

regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
It will be wonderful to have the persistent MappedByteBuffer feature proposed by Andrew Dinn in JDK 13. To us it looks to be a seamless extension to the existing API, provides a very good building block for persistent memory support in Java in the current Java paradigm and is directly applicable to a class of workloads. Many Big Data frameworks like Apache HBase use FileChannel map and MappedByteBuffer as the underlying mechanism and so can use the proposed feature to utilize non-volatile memory. We have also reviewed and provided initial feedback to Andrew on the implementation. Best Regards, Sandhya
On 16/01/2019 11:23, Andrew Dinn wrote:
Hi Alan/Brian,
I have finally been able to shelve other commitments and return to this JEP (apologies for the hiatus).
https://openjdk.java.net/jeps/8207851
The JEP has been reviewed positively by Stuart Marks (core libs) and Vladimir Kozlov (intrinsics). It has also been warmly welcomed by several potential users in Red Hat and Intel (including, respectively, Jonathan Halliday and Sandhya Viswanathan, both in cc). I think the proposal is good as a short term/tactical solution, especially as you were able to reduce the API surface down to new FileChannel map modes. I think it can be looked at again once Project Panama is further along and there is some notion of "memory region" that is backed by NVM.
I skimmed through the current draft. In the most recent discussion I think we had converged on "SYNC" rather than "PERSISTENT", the reasoning being that there is persistence already with regular mapped files; it also aligns with the MAP_SYNC flag to mmap. I don't recall if the discussion on isPersistent concluded; that was more of a naming issue, and whether you include an isXXX method or not is not critical to the proposal. The overload of the force method to specify a range is a good addition, irrespective of the JEP.

One thing to clarify is the heading "Proposed Restricted Public JDK API Changes". The proposal (and the early webrevs) exposed writebackMemory in the internal Unsafe, not sun.misc.Unsafe, which I think is right. This makes it a JDK-internal API so it doesn't need to be in the JEP.

Did you get any feedback on the Testing section? Given that the feature needs special hardware, there will need to be a commitment to test it on a regular basis. It's a similar issue to the draft "JEP 337: RDMA Network Sockets" where special hardware is needed to fully test the feature. In the case of JEP 337 some testing with emulation is possible.

Vladimir and I have reviewed the JEP; it will need an area lead to endorse, which I think can be Brian or Mikael in this case. -Alan
Hi Alan, Thanks for your response. On 17/01/2019 12:53, Alan Bateman wrote:
I skimmed through the current draft. In the most recent discussion I think we had converged on "SYNC" rather than "PERSISTENT", the reasoning being that there is persistence already with regular mapped files; it also aligns with the MAP_SYNC flag to mmap. I don't recall if the discussion on isPersistent concluded; that was more of a naming issue, and whether you include an isXXX method or not is not critical to the proposal. The overload of the force method to specify a range is a good addition, irrespective of the JEP.
Ok, thanks. At least sync is now being used consistently in the public API. I will look at renaming internal vars/methods to use sync when I publish the next webrev.
One thing to clarify is the heading "Proposed Restricted Public JDK API Changes". The proposal (and the early webrevs) exposed writebackMemory in the internal Unsafe, not sun.misc.Unsafe, which I think is right. This makes it a JDK-internal API so it doesn't need to be in the JEP.
I am happy to remove it from the JEP if needed. Does it do any harm to leave it?
Did you get any feedback on the Testing section? Given that the feature needs special hardware, there will need to be a commitment to test it on a regular basis. It's a similar issue to the draft "JEP 337: RDMA Network Sockets" where special hardware is needed to fully test the feature. In the case of JEP 337 some testing with emulation is possible.
I believe I received no specific feedback on that topic. Some of the other Red Hat dev teams (i.e. not OpenJDK) and also dev staff at Intel are very keen to base some of their future work on this feature. So, it will certainly get tested /after/ JDK release :-)

Red Hat does have the Intel hardware needed to test this feature but, so far, nothing that can be used to test on AArch64. Our OpenJDK team can access this kit for one-off testing but it is not currently available for continuous integration testing. I will propose to my manager that we acquire the relevant kit and ensure that all JDKs which implement this JEP are tested prior to release. We should also be able to test AArch64 using volatile memory to simulate a non-volatile memory device up to the point where the requisite AArch64-based NVM hardware becomes available. I am fairly confident this plan will be agreeable to the overlords whom I humbly serve. Perhaps Intel could also provide help with testing? [Sandhya, is this an option?]

My bigger concern was that crash recovery tests may never be 100% reliable. A 100% guarantee requires the ability to engineer a machine crash at a precisely defined critical point of execution, and some of the relevant critical locations will be embedded in the middle of JITted code, making it hard to provoke the crash. So, the situations where a crash /can/ be engineered may not fully reflect those that occur in live deployments. That said, a dash of artificiality in test code is, perhaps, not so worthy of remark . . .
Vladimir and I have reviewed the JEP, it will need an area lead to endorse, I think it can be Brian or Mikael in this case. Ok, thanks for the above answers. Looking forward to hearing further from Brian and/or Mikael (Vidstedt, I assume? :-).
regards, Andrew Dinn -----------
Hi Andrew,
Perhaps Intel also could provide help with testing? [Sandhya, is this an option?]
Yes, we can help with testing this feature as needed. Best Regards, Sandhya
On 17/01/2019 14:27, Andrew Dinn wrote:
:
Vladimir and I have reviewed the JEP, it will need an area lead to endorse, I think it can be Brian or Mikael in this case. Ok, thanks for the above answers. Looking forward to hearing further from Brian and/or Mikael (Vidstedt, I assume? :-). I had a brief discussion with Brian about this yesterday. He brought up the same concern about using MBB as it's not the right API for this in the longer term. So this JEP is very much about a short term/tactical solution as we've already concluded here. This leads to the question as to whether this JEP needs to evolve the standard/Java SE API or not. It's convenient for the implementation of course but we should at least explore doing this as a JDK-specific feature.
To that end, one approach to explore is allowing the FC.map method to accept map modes beyond those defined by MapMode. There is precedent for extensibility in this area already, e.g. FC.open allows you to specify options beyond the standard options specified by the method. It would require MapMode to define a protected constructor and would require a bit of plumbing to support MapMode defined in a JDK-specific module but there are examples to point to. Another approach is another class in a JDK-specific module to define the map method. It would require the same plumbing under the covers but would avoid touching the FC spec. -Alan
Hi Alan, On 1/18/19 2:32 PM, Alan Bateman wrote:
On 17/01/2019 14:27, Andrew Dinn wrote:
:
Vladimir and I have reviewed the JEP, it will need an area lead to endorse, I think it can be Brian or Mikael in this case. Ok, thanks for the above answers. Looking forward to hearing further from Brian and/or Mikael (Vidstedt, I assume? :-). I had a brief discussion with Brian about this yesterday. He brought up the same concern about using MBB as it's not the right API for this in the longer term. So this JEP is very much about a short term/tactical solution as we've already concluded here. This leads to the question as to whether this JEP needs to evolve the standard/Java SE API or not. It's convenient for the implementation of course but we should at least explore doing this as a JDK-specific feature.
To that end, one approach to explore is allowing the FC.map method to accept map modes beyond those defined by MapMode. There is precedent for extensibility in this area already, e.g. FC.open allows you to specify options beyond the standard options specified by the method. It would require MapMode to define a protected constructor and would require a bit of plumbing to support MapMode defined in a JDK-specific module but there are examples to point to.
You meant a package-private constructor, right? A protected constructor would allow subclassing MapMode by an arbitrary user class, which is not what would be desirable. So perhaps all that is needed is to declare the static final field in the MapMode class as package-private. That would allow referencing it in the java.nio.channels package. Then add a SharedSecrets mechanism to expose its value to other java.base packages that need it and to the additional module that would expose it publicly... Regards, Peter
On 1/18/19 3:11 PM, Peter Levart wrote:
You meant package-private constructor, right? Protected constructor would allow subclassing MapMode by arbitrary user class which is not what would be desirable.
...unless you actually want users to construct their own MapMode(s), like you mentioned is the case with FileChannel.open() and the FileAttribute interface. But there this makes sense because the backend (FileSystem) is also pluggable, so users can define their own FileSystem implementations that consume their own FileAttribute(s)... Are you proposing to add an SPI for MappedByteBuffers here? That would be overkill for this feature, I think... Regards, Peter
On 18/01/2019 14:28, Peter Levart wrote:
...unless you actually want users to construct their own MapMode(s), like you mentioned is the case with FileChannel.open() and FileAttribute interface. But there this makes sense because the backend (FileSystem) is also pluggable, so users can define their own FileSystem implementations that consume their own FileAttribute(s)...
Are you proposing to add an SPI for MappedByteBuffers here? That would be overkill for this feature, I think...
No, we definitely don't want to go there as buffers are a closed abstraction. Instead, this is just about allowing the JDK to support additional map modes beyond those specified by FileChannel.map. If you create your own MapMode and call the map method with it then it will be rejected, probably with UOE. With the suggestion, a JDK-specific module would define READ_WRITE_SYNC and you could pass that mode to FileChannel.map. There's a bit of plumbing needed to make that work but there are examples of this already (e.g. socket options and file open options). -Alan.
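The plumbing Alan describes can be modelled in miniature. This is a hedged sketch, not the real java.nio code: MapMode here is a stand-in class (the real one lives in FileChannel), ExtendedMapMode is the hypothetical JDK-specific holder, and map() only illustrates how modes minted inside the JDK can be recognised while arbitrary user-created modes are rejected.

```java
// Stand-in for java.nio.channels.FileChannel.MapMode: a closed class whose
// constructor is not public, so only (notionally) JDK code can mint modes.
final class MapMode {
    final String name;
    MapMode(String name) { this.name = name; }  // package-private
    static final MapMode READ_WRITE = new MapMode("READ_WRITE");
}

// Stand-in for a JDK-specific module's API class exposing the extra mode.
final class ExtendedMapMode {
    static final MapMode READ_WRITE_SYNC = new MapMode("READ_WRITE_SYNC");
}

public class MapModeDemo {
    // Stand-in for FileChannel.map: accepts known modes, rejects others
    // (the real method would likely throw UnsupportedOperationException).
    static String map(MapMode mode) {
        if (mode == MapMode.READ_WRITE) return "mmap";
        if (mode == ExtendedMapMode.READ_WRITE_SYNC) return "mmap+MAP_SYNC";
        throw new UnsupportedOperationException(mode.name);
    }

    public static void main(String[] args) {
        System.out.println(map(ExtendedMapMode.READ_WRITE_SYNC));
    }
}
```

The identity comparison against statically known instances is what keeps the abstraction closed: a mode constructed outside the JDK can never compare equal to one of the sanctioned constants.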
Hi Alan, On 18/01/2019 13:32, Alan Bateman wrote:
I had a brief discussion with Brian about this yesterday. He brought up the same concern about using MBB as it's not the right API for this in the longer term. So this JEP is very much about a short term/tactical solution as we've already concluded here. This leads to the question as to whether this JEP needs to evolve the standard/Java SE API or not. It's convenient for the implementation of course but we should at least explore doing this as a JDK-specific feature.
I disagree with your characterization of the use of MBB as a short term/tactical solution. Despite not being entirely suitable for the task, MBB is a de facto standard way for many applications to gain direct access to files of data located on persistent storage. The current proposal is not, as you characterize it, a quick fix to use MBB as a temporary way to access NVM storage until something better comes along. The intention is rather to ensure that the current API caters for a new addition to the persistent memory tier. The imperative is to allow existing code to employ it now.

Of course, a better API may come along for accessing persistent storage, whether that be NVM, flash disk or spinning platter. However, I would hazard that in many cases existing application code and libraries will still want/need to continue to use the MBB API, including cases where that storage can most usefully be NVM. Rewriting application code to use a new API will not always be feasible or cost-effective. Yet, the improved speed of NVM suggests that an API encompassing this new case will be very welcome and may well be cost-effective to adopt.

In sum, far from being a stop-gap, this proposal should be seen as a step towards completing and maintaining the existing MBB API for emergent tech.
To that end, one approach to explore is allowing the FC.map method to accept map modes beyond those defined by MapMode. There is precedent for extensibility in this area already, e.g. FC.open allows you to specify options beyond the standard options specified by the method. It would require MapMode to define a protected constructor and would require a bit of plumbing to support MapMode defined in a JDK-specific module but there are examples to point to. Another approach is another class in a JDK-specific module to define the map method. It would require the same plumbing under the covers but would avoid touching the FC spec.

I'm not sure what this side-step is supposed to achieve nor how it relates to the concerns over use of MBB (perhaps it doesn't). I'm not really clear what problem you are trying to avoid here by allowing the MapMode enum to be extensible via a protected constructor.
If your desire is to avoid adding extra API surface to FileChannel then where would you consider it appropriate to add such a surface? Something is going to have to create and employ the extra enum tags that are currently proposed for addition to MapMode. How is a client application going to reach that something? Perhaps we might benefit from looking at a simple example. Currently, my most basic test program drives the API to create an MBB as follows:

  . . .
  String dir = "/mnt/pmem/test"; // mapSync should work, since fs mount is -o dax
  Path path = new File(dir, "pmemtest").toPath();
  FileChannel fileChannel = (FileChannel) Files.newByteChannel(path,
          EnumSet.of(StandardOpenOption.READ,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.CREATE));
  MappedByteBuffer mappedByteBuffer =
          fileChannel.map(FileChannel.MapMode.READ_WRITE_PERSISTENT, 0, 1024);
  . . .

Could you give a sketch of an alternative way that you see a client operating?

One thing I did wonder about was whether we could insert the relevant behavioural switch in the call to Files.newByteChannel rather than the map call. If we passed an ExtendedOpenOption (e.g. ExtendedOpenOption.SYNC) to this call then method newByteChannel could create and return a corresponding variant of FileChannelImpl, say an instance of a subclass called SyncFileChannelImpl. This could behave as per a normal FileChannelImpl apart from adding the MAP_SYNC flag to the mmap call (well, also rejecting PRIVATE maps). Would that be a better way to drive this? Would it address the concerns you raised above?

regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
Hi Alan, Could you please let us know more about what it means to be a JDK-specific feature? How is it to be implemented? An example would be very helpful. ByteBuffer is a widely used API, and deprecating ByteBuffer at any time would make it difficult for more and more Java software frameworks to move up to the latest JDK. Best Regards, Sandhya
On 28/01/2019 18:39, Viswanathan, Sandhya wrote:
Hi Alan,
Could you please let us know more about what it means to be a JDK-specific feature? How is it to be implemented? An example would be very helpful. ByteBuffer is a widely used API, and deprecating ByteBuffer at any time would make it difficult for more and more Java software frameworks to move up to the latest JDK.
In the API docs, you'll see a number of JDK-specific modules with names that start with "jdk." instead of "java.". Many of these modules export JDK-specific APIs. The jdk.attach module exports the JDK-specific com.sun.tools.attach API. The jdk.management module exports the com.sun.management API which has defined JDK-specific extensions to the management API since JDK 5.

Closer to the feature under discussion are APIs that are extensible to allow for support beyond what the Java SE API specifies. The Direct I/O feature in JDK 10 defined a JDK-specific OpenOption that you can specify to FileChannel.open, e.g.:

  var channel = FileChannel.open(file, StandardOpenOption.WRITE, ExtendedOpenOption.DIRECT);

Another example is socket options. Java SE defines "standard" socket options in java.net.StandardSocketOptions but an implementation can support many others. The JDK has jdk.net.ExtendedSocketOption to define additional socket options so you can do things like this:

  Socket s = ...
  s.setOption(ExtendedSocketOption.TCP_KEEPIDLE, 5);

The suggestion on the table is to see if we can do the same for file mapping modes so that the platform-specific MAP_SYNC mode can be used with the existing API. This would allow for code like this:

  MappedByteBuffer mbb = fc.map(ExtendedMapMode.READ_WRITE_SYNC, position, size);

There's plumbing needed to make this work as the underlying implementation would be in java.base but the platform-specific map mode defined in a JDK-specific module. There are several advantages to the approach, the main one being that it doesn't commit Java SE to supporting this mode. I'm hoping to meet up with Andrew Dinn at FOSDEM to discuss this approach in a bit more detail.

You asked about deprecating ByteBuffer but I don't think there is any suggestion to do that here. Once Panama is further along, specifically the memory region or scope/pointer API, then interop with ByteBuffer will need to be worked out. -Alan
Thanks a lot Alan! This is very helpful. Best Regards, Sandhya
Hi Alan/Brian,
My apologies for having missed this the first time around; messages to lists get swept into folders, and staying on top of many lists is not an exact science.
I believe I have addressed all outstanding comments on the JEP per se, including those made by Alan. Is it now possible for one of you to endorse the JEP so it can be submitted?
You don't actually need my endorsement to Submit a JEP to be a Candidate; all you need is evidence that it's been socialized on the right list (check) and a reviewer (check). So you're not blocked on submitting at all.

That said, Candidate is actually supposed to be an early stage in the lifecycle; making something a candidate is essentially a statement that the feature is worth considering. It is not any sort of commitment that the feature meets the high bar for inclusion, or is done -- just that it is something that is worth continuing to work on in OpenJDK. Clearly, you've done a lot more than the minimum to meet the Candidate bar.

Where I see this JEP at is: you've iterated on the implementation and the design for a few rounds, and where we're at now is we're facing the stewardship question of exactly what the right way to position this feature in the JDK is. Based on the list discussion so far, it seems like we've not yet reached consensus on this, so these discussions can continue, whether the JEP is a Candidate or not.

I concur with Alan and Stuart's assessment that this is very much a "stop gap" measure. (That's not a value judgment on either the design of MBB or your work.) It is clear that MBBs are reaching the end of their design lifetime; the set of mismatches to the problems people want to solve is growing, and the amount of investment needed to bring MBBs up to the task would be significant. It seems likely a much better long-term solution is Panama, but I agree that Panama is not yet ready to replace MBB, and so some stop-gap is needed to prolong the usefulness of MBBs for long enough for a better solution to emerge.

However, Alan's concern -- which I share -- is that by expanding the MBB API now, we open ourselves up to increased maintenance costs in the too-near future. Before we settle on an API, we're going to need to come to consensus on the right tradeoffs here, and I'm not convinced we've found this yet. So we'll keep talking.
Is there any other impediment to submitting the JEP and proceeding to code review?
There do not seem to be impediments to submitting the JEP, but I would call out the assumption that the next stop is code review, and then integration. That seems to want to skip over the far more important “should we / how should we” discussions, which are still in progress. Cheers, -Brian
The latest round of changes to the JEP and draft implementation are now available for review:

JEP: http://openjdk.java.net/jeps/8207851 webrev: http://cr.openjdk.java.net/~adinn/pmem/webrev.05/

Latest Changes: After a very helpful discussion with Alan at FOSDEM I borrowed some hints from him and adjusted the JEP and the latest implementation to address concerns about adding more cruft to the current JDK APIs. The latest solution removes the extra MapMode enum values for SYNC mappings. Instead it makes the enum constructor for MapMode public, allowing JDK-internal FileChannel code to add some extra, extended MapMode tags that are recognized by FileChannel.map. An API class in package jdk.unsupported makes these enum values available for clients that want to perform SYNC maps. This minimizes yet further the changes to JDK public APIs. Thanks are very much due to Alan for explaining how to tease out the dependencies here.

I also followed up Vladimir's suggestion regarding the implementation and initialised static long field Unsafe.CACHE_LINE_FLUSH_SIZE by injecting a value during JVM initialization rather than having Unsafe call out via a native method at clinit. I actually added a separate package-public class UnsafeConstants to own the static field and ran the initialize_class call and the do_local_static_fields fixup of static values on that class rather than on Unsafe. This should make it easy to add a follow-up patch which injects some of the other static constants in Unsafe that are currently being accessed via native callouts (e.g. the page size).

I believe I have also cleaned up the last few remaining issues in previous reviews -- especially: all uses of Persistent have now been replaced with Sync (apart from one comment where I am describing the behaviour of the NVRAM memory). regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 
03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
Hi Andrew,
On 15 Feb 2019, at 16:51, Andrew Dinn <adinn@redhat.com> wrote:
The latest round of changes to the JEP and draft implementation are now available for review:
JEP: http://openjdk.java.net/jeps/8207851 webrev: http://cr.openjdk.java.net/~adinn/pmem/webrev.05/
I'm only following this issue at a high-level, but I have one suggestion that I think would help clarify the scope of this feature. If I understand the changes correctly, then the feature is actually accessed through the JDK-specific ExtendedMapMode READ_ONLY_SYNC and READ_WRITE_SYNC map modes. This kinda implies that the JEP should be scoped at the JDK level ( rather than Java SE ). I see that there are changes to the Java SE Platform, namely to the MapMode constructor and an overload of MappedByteBuffer::force. I see these more as "enablers" in support of this feature ( rather than the core of the feature itself ). They can happen as part of the same changeset, or could possibly be pushed separately upfront. If you agree, then the only action needed is to change the `Scope` of the JEP from `SE` to `JDK`. -Chris.
On 15/02/2019 17:13, Chris Hegarty wrote:
: I see that there are changes to the Java SE Platform, namely to the MapMode constructor and an overload of MappedByteBuffer::force. I see these more as "enablers" in support of this feature ( rather than the core of the feature itself ). They can happen as part of the same changeset, or could possibly be pushed separately upfront.
Yes, the 2-arg force method is useful on its own and could be done in advance (if Andrew wants). There are several detailed API issues with this method but we should be able to agree those quickly (Andrew - these are issues due to MBB being a ByteBuffer so we have to sort out - long from/to vs. int index/size, the upper bound check against the limit rather than the capacity, and IAE vs. IOOBE - I'll put these in another mail). Making map mode extensible is also something that can be done in advance. The only piece that would make it SE scope is the isSync (was isPersistent) method but I don't think it is strictly needed to be exposed initially. -Alan
Hi Chris/Alan, On 17/02/2019 17:37, Alan Bateman wrote:
On 15/02/2019 17:13, Chris Hegarty wrote:
: I see that there are changes to the Java SE Platform, namely to the MapMode constructor and an overload of MappedByteBuffer::force. I see these more as "enablers" in support of this feature ( rather than the core of the feature itself ). They can happen as part of the same changeset, or could possibly be pushed separately upfront.
Yes, the 2-arg force method is useful on its own and could be done in advance (if Andrew wants). There are several detailed API issues with this method but we should be able to agree those quickly (Andrew - these are issues due to MBB being a ByteBuffer so we have to sort out - long from/to vs. int index/size, the upper bound check against the limit rather than the capacity, and IAE vs. IOOBE - I'll put these in another mail). Making map mode extensible is also something that can be done in advance. The only piece that would make it SE scope is the isSync (was isPersistent) method but I don't think it is strictly needed to be exposed initially.

Ah well, perhaps we /can/ change the scope to JDK rather than SE. I finessed the isSync() issue by dropping it from the API. Provision of this method is not critical since I envisage that most clients which know about the SYNC option will always be using a SYNC buffer. If not, they can still easily be written to track how the buffer was created. So, the latest JEP does not mention the isSync method and the implementation makes it private.
I'd be very happy to pre-push separate, enabling changes for 1) the 2-arg force method 2) making MapMode extensible. Alan, I'll wait for your follow-up note before responding regarding the arguments to force. I'll just note that the latest JEP draft accepts start position and length [i.e. force(from, length)] rather than start position and end position [i.e. force(from, to)]. I thought this would be easier for clients to get right -- less danger of out by one errors. regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
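The off-by-one argument for force(from, length) can be made concrete with a toy sketch. The helper names here are invented for illustration and neither signature is the final API: with (from, length) the end of a region is always from + length, so adjacent regions compose without the caller having to decide whether an end bound is inclusive or exclusive.

```java
// Illustrative comparison of the two candidate signatures. Helper names are
// invented; neither is the real MappedByteBuffer API.
public class RangeStyleSketch {
    // Style 1: start offset plus length. The region is [from, from + length).
    public static String forceFromLength(int from, int length) {
        return "[" + from + ", " + (from + length) + ")";
    }

    // Style 2: start and end offsets. The caller must remember whether 'to'
    // is inclusive or exclusive -- the classic off-by-one trap.
    public static String forceFromTo(int from, int to) { // 'to' exclusive here
        return "[" + from + ", " + to + ")";
    }

    public static void main(String[] args) {
        // Two adjacent 512-byte regions, expressed both ways. With
        // (from, length) the second start is just from + length of the first.
        System.out.println(forceFromLength(0, 512) + " then " + forceFromLength(512, 512));
        System.out.println(forceFromTo(0, 512) + " then " + forceFromTo(512, 1024));
    }
}
```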
On 18/02/2019 11:15, Andrew Dinn wrote:
: Alan, I'll wait for your follow-up note before responding regarding the arguments to force. I'll just note that the latest JEP draft accepts start position and length [i.e. force(from, length)] rather than start position and end position [i.e. force(from, to)]. I thought this would be easier for clients to get right -- less danger of out by one errors.
The issues with the 2-arg force method that I think need discussion are: 1. long from/to vs. from/length vs int index/length. Elements in buffers, or the starting index of a region, are addressed by an int index in the existing API. We are currently discussing absolute bulk get/put methods on nio-dev right now and the methods on the table use "int length", this is mostly because they are about bulk copying in/out of byte arrays where offset+length is the norm. 2. limit vs. capacity. If I read the webrev correctly, it checks the upper bound against the buffer capacity. I don't think we have any existing methods where you can specify an index that is >= limit. 3. The javadoc doesn't specify the exception thrown when the bounds checks fail. In the webrev I see that IAE is thrown but the existing buffer methods specify IOOBE. If you agree then it means you can replace the checks with one method: Objects.checkFromIndexSize(index, length, limit()); I think that's it. -Alan
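Alan's third point can be sketched concretely: Objects.checkFromIndexSize collapses all three bounds checks into a single call and throws IndexOutOfBoundsException (IOOBE) on failure. A minimal sketch, with a hypothetical checkForceRange helper and a fixed value standing in for the buffer's limit():

```java
import java.util.Objects;

public class ForceBoundsCheck {
    // Hypothetical stand-in for the buffer's limit(); not the real MappedByteBuffer.
    public static final int LIMIT = 4096;

    // One call validates index >= 0, length >= 0 and index + length <= limit(),
    // throwing IndexOutOfBoundsException on failure, and returns the index.
    public static int checkForceRange(int index, int length) {
        return Objects.checkFromIndexSize(index, length, LIMIT);
    }

    public static void main(String[] args) {
        System.out.println(checkForceRange(0, 4096));   // prints 0: whole buffer ok
        System.out.println(checkForceRange(1024, 512)); // prints 1024: sub-range ok
        try {
            checkForceRange(4000, 200);                 // 4000 + 200 > limit
        } catch (IndexOutOfBoundsException e) {
            System.out.println("IOOBE thrown as expected");
        }
    }
}
```

Objects.checkFromIndexSize returns the validated index, so it can be used inline at the top of force(int, int).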
Hi Alan, Thanks for following this up. On 19/02/2019 16:55, Alan Bateman wrote:
The issues with the 2-arg force method that I think need discussion are:
1. long from/to vs. from/length vs int index/length. Elements in buffers, or the starting index of a region, are addressed by an int index in the existing API. We are currently discussing absolute bulk get/put methods on nio-dev right now and the methods on the table use "int length", this is mostly because they are about bulk copying in/out of byte arrays where offset+length is the norm.
I agree that the force method should take int arguments to align with other methods. Also, I think they should be offset+length for the same reason. I will modify the JEP and the next webrev accordingly.
2. limit vs. capacity. If I read the webrev correctly, it checks the upper bound against the buffer capacity. I don't think we have any existing methods where you can specify an index that is >= limit.
I agree that for consistency with other methods it would be better to check against limit(). I will modify the next webrev accordingly modulo resolution of one small detail. My reason for using capacity() was that I was swayed by the original implementation of force(). It calls

    force0(fd, mappingAddress(offset), mappingLength(offset))

where offset is

    long offset = mappingOffset();

The definition of mappingLength(offset) is

    private long mappingLength(long mappingOffset) {
        return (long)capacity() + mappingOffset;
    }

I'm wondering if this ought to remain as is or ought to change to specify limit()?
3. The javadoc doesn't specify the exception thrown when the bounds checks fail. In the webrev I see that IAE is thrown but the existing buffer methods specify IIOBE. If you agree then it means you can replace the checks with one method: Objects.checkFromIndexSize(index, length, limit());
It sounds like a very good idea to use that method and hence to throw IOOBE. I will modify the JEP and the next webrev accordingly. regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
On 19/02/2019 18:01, Andrew Dinn wrote:
: My reason for using capacity() was that I was swayed by the original implementation of force(). It calls
force0(fd, mappingAddress(offset), mappingLength(offset))
:
I'm wondering if this ought to remain as is or ought to change to specify limit()?
The no-arg force method is specified to force any changes to the buffer content to be written so I don't think it needs to change. We could clarify the spec on this point but I don't think it is strictly needed. -Alan
On 20/02/2019 11:11, Alan Bateman wrote:
On 19/02/2019 18:01, Andrew Dinn wrote:
: My reason for using capacity() was that I was swayed by the original implementation of force(). It calls
force0(fd, mappingAddress(offset), mappingLength(offset))
:
I'm wondering if this ought to remain as is or ought to change to specify limit()?
The no-arg force method is specified to force any changes to the buffer content to be written so I don't think it needs to change. We could clarify the spec on this point but I don't think it is strictly needed. Ok, thanks.
So, in the next webrev when force() with no args is called on a non-SYNC mode buffer I will ensure it continues to call force0(fd, mappingAddress(offset), mappingLength(offset)) For a SYNC buffer I'll redirect to call force(0, limit()) That will retain the existing behaviour for force() on non-SYNC buffers and ensure the call to Objects.checkFromIndexSize in force(int, int) does not throw any surprise IOOBEs. regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
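The redirect Andrew describes can be sketched as follows. This is an invented, self-contained model of the dispatch logic, not the real MappedByteBuffer code, and the class and method names are hypothetical:

```java
// Hypothetical sketch of the force() dispatch: non-SYNC buffers keep the old
// whole-mapping force0 path; SYNC buffers redirect to the ranged force.
public class ForceDispatchSketch {
    final boolean isSync;   // was the buffer mapped with a SYNC map mode?
    final int limit;        // stand-in for the buffer's limit()

    public ForceDispatchSketch(boolean isSync, int limit) {
        this.isSync = isSync;
        this.limit = limit;
    }

    // force(): SYNC buffers redirect to force(0, limit()); non-SYNC buffers
    // keep the existing native path, so its behaviour is unchanged.
    public String force() {
        if (isSync) {
            return force(0, limit);
        }
        return "force0(fd, mappingAddress(offset), mappingLength(offset))";
    }

    // Ranged force: a real SYNC buffer would write back cache lines here;
    // this sketch just validates the range and reports what would be flushed.
    public String force(int index, int length) {
        java.util.Objects.checkFromIndexSize(index, length, limit);
        return "writeback [" + index + ", " + (index + length) + ")";
    }
}
```

Because the redirect passes (0, limit()), the checkFromIndexSize call in the ranged method can never throw for a plain force() on a SYNC buffer, which is the "no surprise IOOBEs" property Andrew notes above.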
On 2/20/19 11:28 AM, Andrew Dinn wrote:
So, in the next webrev when force() with no args is called on a non-SYNC mode buffer I will ensure it continues to call
force0(fd, mappingAddress(offset), mappingLength(offset))
For a SYNC buffer I'll redirect to call
force(0, limit())
<panic> Whoa! Is that first argument a file descriptor? If it is, you do know that 0 is a legal fd, right? </panic> -- Andrew Haley Java Platform Lead Engineer Red Hat UK Ltd. <https://www.redhat.com> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
On 20/02/2019 11:33, Andrew Haley wrote:
On 2/20/19 11:28 AM, Andrew Dinn wrote:
So, in the next webrev when force() with no args is called on a non-SYNC mode buffer I will ensure it continues to call
force0(fd, mappingAddress(offset), mappingLength(offset))
For a SYNC buffer I'll redirect to call
force(0, limit())
<panic> Whoa! Is that first argument a file descriptor? If it is, you do know that 0 is a legal fd, right? </panic> No need to panic. That first argument is not a file descriptor. I was explaining to Alan that I can (and will) implement the force() API method for SYNC buffers by redirecting to force(int,int).
force(int,int) is a new API method to force a specific subregion of the buffer. If it needs to do writeback via an fd it pulls that out of the private MappedByteBuffer field (i.e. this.fd) and passes it in the call to native method force0. n.b. for the above redirect case the force will be for a SYNC mapped buffer so an fd will not actually be needed. regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
The latest JEP and draft implementation now address the 3 outstanding issues:

1) the 2-arg force method now uses an integer start offset and length: force(int from, int length)
2) length is checked against the buffer limit rather than the capacity
3) start position and length checks are implemented using Objects.checkFromIndexSize; in consequence, force(int,int) now throws IOOBE

I updated the JEP and javadoc to record this.

JEP: http://openjdk.java.net/jeps/8207851 webrev: http://cr.openjdk.java.net/~adinn/pmem/webrev.06

I also made one small, additional correction to the implementation. When force(int,int) is called with a non-SYNC buffer it is expected to redirect to force0. The address arg passed in this call was being computed by adding the supplied offset to the buffer start address. However, the underlying implementation of force0 calls msync, which requires a page-aligned address. The latest version rounds down the computed address to a page boundary. It also increments length by the amount thus subtracted, ensuring all the bytes in the requested range are written back. regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
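The alignment correction described above is simple address arithmetic. A sketch under the assumption of a 4096-byte page (the real code would obtain the page size from the VM rather than hard-coding it):

```java
// Sketch of the rounding for the non-SYNC force path: msync(2) requires a
// page-aligned address, so round the computed address down to a page boundary
// and grow the length by the amount subtracted, ensuring the whole requested
// range is still written back.
public class PageAlignSketch {
    static final long PAGE_SIZE = 4096; // assumed; really queried from the VM

    // Returns {alignedAddress, adjustedLength} for a writeback request.
    public static long[] alignForMsync(long address, long length) {
        long pageOffset = address % PAGE_SIZE;   // distance past the page start
        return new long[] { address - pageOffset, length + pageOffset };
    }

    public static void main(String[] args) {
        long[] r = alignForMsync(8192 + 100, 50); // request starting mid-page
        System.out.println(r[0] + " " + r[1]);    // prints: 8192 150
    }
}
```

An already-aligned address passes through unchanged, since the page offset is zero.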
Hi Alan, On 17/02/2019 17:37, Alan Bateman wrote:
On 15/02/2019 17:13, Chris Hegarty wrote:
: I see that there are changes to the Java SE Platform, namely to the MapMode constructor and an overload of MappedByteBuffer::force. I see these more as "enablers" in support of this feature ( rather than the core of the feature itself ). They can happen as part of the same changeset, or could possibly be pushed separately upfront.
Yes, the 2-arg force method is useful on its own and could be done in advance (if Andrew wants). There are several detailed API issues with this method but we should be able to agree those quickly (Andrew - these are issues due to MBB being a ByteBuffer so we have to sort out - long from/to vs. int index/size, the upper bound check against the limit rather than the capacity, and IAE vs. IOOBE - I'll put these in another mail). Making map mode extensible is also something that can be done in advance. The only piece that would make it SE scope is the isSync (was isPersistent) method but I don't think it is strictly needed to be exposed initially.

Having dealt with the above issues (including removing the isSync method) I have reclassified the scope of the JEP to JDK.
I would like now to submit this JEP and begin review of the implementation patches. In particular, I'd like to proceed with the preparatory patches to i) make map mode extensible and ii) overload force. Is it ok to go ahead with this? regards, Andrew Dinn ----------- Senior Principal Software Engineer Red Hat UK Ltd Registered in England and Wales under Company Registration No. 03798903 Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
On 01/03/2019 11:05, Andrew Dinn wrote:
: Having dealt with the above issues (including removing the isSync method) I have reclassified the scope of the JEP to JDK.
I would like now to submit this JEP and begin review of the implementation patches. In particular, I'd like to proceed with the preparatory patches to i) make map mode extensible and ii) overload force. Is it ok to go ahead with this?
I haven't had time to look at the updated webrev and wording in the JEP but "the plan" that we've converged on here seems good to me, as the 2-arg force is useful on its own and having MapMode be extensible is consistent with the other extension points in this API. -Alan.
Hi Alan, It looks like Apache HBASE also uses the FileChannel map and MappedByteBuffer mechanism. The feature proposed by Andrew Dinn and Jonathan Halliday will be very useful for Big Data frameworks as well and will help them use NVM without a need to go to JNI. Copying HBASE experts Anoop and Ram to this thread.

Apache has API layers to overcome the 2GB limitation through the MultiByteBuff class: https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/nio/MultiByteBuf... https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/util/ByteBufferA...

Some example uses of ByteBuffer in HBASE today: https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/bucket/... https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/ByteBuffInput... https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/ipc/ServerRpcCon...

Best Regards, Sandhya

-----Original Message----- From: Alan Bateman [mailto:Alan.Bateman@oracle.com] Sent: Monday, September 24, 2018 1:15 AM To: Andrew Dinn <adinn@redhat.com>; core-libs-dev@openjdk.java.net; hotspot compiler <hotspot-compiler-dev@openjdk.java.net>; Aundhe, Shirish <shirish.aundhe@intel.com>; Dohrmann, Steve <steve.dohrmann@intel.com>; Viswanathan, Sandhya <sandhya.viswanathan@intel.com>; Deshpande, Vivek R <vivek.r.deshpande@intel.com>; Jonathan Halliday <jonathan.halliday@redhat.com> Subject: Re: RFR: 8207851 JEP Draft: Support ByteBuffer mapped over non-volatile memory On 21/09/2018 16:44, Andrew Dinn wrote:
Hi Alan,
Thanks for the response and apologies for failing to notice you had posted it some days ago (doh!).
Jonathan Halliday has already explained how Red Hat might want to use this API. Well, what he said, essentially! In particular, this model provides a way of ensuring that raw byte data is able to be persisted coherently from Java with the minimal possible overhead. It would be up to client code above this layer to implement structuring mechanisms for how those raw bytes get populated with data and to manage any associated issues regarding atomicity, consistency and isolation (i.e. to provide the A, C and I of ACID to this API's D).
The main point of the JEP is to ensure that such a performant base capability is available for known important cases where it is needed such as, for example, a transaction manager or a distributed cache. If equivalent middleware written in C can use persistent memory to bring the persistent storage tier nearer to the CPU and, hence, lower data durability overheads, then we really need an equivalently performant option in Java or risk Java dropping out as a player in those middleware markets.
I am glad to hear that other alternatives might be available and would be happy to consider them. However, I'm not sure that this means this option is not still desirable, especially if it is orthogonal to those other alternatives. Most importantly, this one has the advantage that we know it is ready to use and will provide benefits (we have already implemented a journaled transaction log over it with promising results and someone from our messaging team has already been looking into using it to persist log messages). Indeed, we also know we can use it to provide a base for supporting all the use cases addressed by Intel's libpmem and available to C programmers, e.g. a block store, simply by implementing Java client libraries that provide managed access to the persistent buffer along the same lines as the Intel C libraries.
I'm afraid I am not familiar with Panama 'scopes' and 'pointers' so I can't really compare options here. Can you point me at any info that explains what those terms mean and how it might be possible to use them to access off-heap, persistent data.
I'm not questioning the need to support NVM; instead I'm trying to see whether MappedByteBuffer is the right way to expose this in the standard API. Buffers were designed in JSR-51 with specific use-cases in mind but they are problematic for many off-heap cases as they aren't thread safe, are limited to 2GB, lack confinement, and only support homogeneous data (no layout support). At the same time, Project Panama (foreign branch in panama/dev) has the potential to provide the right API to work with memory. I see Jonathan's mail where he seems to be using object serialization, so the solution on the table works for his use-case, but it may not be the right solution for more general multi-threaded access to NVM. There is some interest in seeing whether this part of Project Panama could be advanced to address many of the cases where developers are resorting to using Unsafe today. There would of course need to be some integration with buffers too. There's no concrete proposal/JEP at this time; I'm just pointing out that many of the complaints about buffers are really cases where it's the wrong API and the real need is something more fundamental.

So where does this leave us? If support for persistent memory is added to FileChannel.map as we've been discussing then it may not be too bad, as the API surface is small: just the new map modes and a MappedByteBuffer::isPersistent method. The force method that specifies a range is something useful to add to MBB anyway. If (and I hope when) there is support for memory regions or pointers then I could imagine re-visiting this so that there are alternative ways to get a memory region or pointer that is backed by NVM. If the timing were different then I think we'd skip the new map modes and we would be having a different discussion here. An alternative of course is to create the mapped buffer via a JDK-specific API as that would be easier to deprecate and remove in the future if needed.
I'm interested to see if there is other input on this topic before it gets locked into extending the standard API. -Alan.
participants (14)
- Alan Bateman
- Andrew Dinn
- Andrew Haley
- Brian Goetz
- Chris Hegarty
- Florian Weimer
- Francesco Nigro
- Jonathan Halliday
- Peter Levart
- Roger Riggs
- Stuart Marks
- Viswanathan, Sandhya
- Vladimir Kozlov
- White, Derek