Question: ByteBuffer vs MemorySegment for binary (de)serialization and in-memory buffer pool
Gavin Ray
ray.gavin97 at gmail.com
Sun Sep 4 15:48:52 UTC 2022
Ahh okay, this clears a lot of things up, thanks again =)
Really appreciate how helpful everyone has been.
By the way, I wrote a blog post about what I learned about the Foreign
Memory API + MemorySegment; I thanked those on the mailing list, and Jorn
Vernee, in the acknowledgements section:
Panama: Not-so-Foreign Memory. Using MemorySegment as a high-performance
ByteBuffer replacement. (gavinray97.github.io)
<https://gavinray97.github.io/blog/panama-not-so-foreign-memory>
On Sun, Sep 4, 2022 at 11:21 AM Radosław Smogura <mail at smogura.eu> wrote:
> Hi Gavin,
>
>
>
> In the context of enlarging the DB file: you can map more than the file's
> size. For instance, you can map 4 MB of a file even if the file's size is 0.
> But before you read from or write to that memory, you have to **enlarge**
> the file – please read the FileChannel.position JavaDoc.
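>
> A minimal sketch of this, assuming the JDK 19 preview API
> (java.lang.foreign) and a hypothetical path variable:
>
>     try (var session = MemorySession.openConfined();
>          var channel = FileChannel.open(path, StandardOpenOption.READ,
>                  StandardOpenOption.WRITE, StandardOpenOption.CREATE)) {
>         // Map 4 MB even though the file may currently be empty.
>         MemorySegment ms = channel.map(FileChannel.MapMode.READ_WRITE,
>                 0, 4 * 1024 * 1024, session);
>         // Enlarge the file before touching the mapped region, e.g. by
>         // writing one byte at the desired end of the file.
>         channel.position(4 * 1024 * 1024 - 1);
>         channel.write(ByteBuffer.wrap(new byte[1]));
>         ms.set(ValueLayout.JAVA_INT, 0, 42); // safe to access now
>     }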
>
>
>
> In context of managing pages, and zero-copy, I would write something like
> this in pseudo code.
>
>
>
> writeData(MemorySegment dbData, long offset) {
>     dbData.setAtOffset(offset, ...);
> }
>
> commit(MemorySegment ms) {
>     ms.force(); // just ensure the data is on disk, so a failure loses
>                 // nothing; with a shared mapping the OS can sync the
>                 // data to disk at any time
> }
>
> readDataAtDbPosition(long offset) {
>     long pageNo = offset / DB_PAGE_SIZE; // does not have to be the system
>                                          // page size; it can be a few MB
>                                          // or a few GB
>     if (pageInBuffer(pageNo)) return buffer.getPage(pageNo);
>     removeOldestPageFromBuffer();
>     loadPage(pageNo);
> }
>
> loadPage(pageNo) {
>     fileChannel.map(SHARED, pageNo * DB_PAGE_SIZE, DB_PAGE_SIZE,
>             memorySession);
> }
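>
> In real Java the last two pieces could look roughly like the sketch below,
> assuming the JDK 19 preview API, with fileChannel, memorySession and
> DB_PAGE_SIZE as fields:
>
>     MemorySegment loadPage(long pageNo) throws IOException {
>         // Each DB page gets its own shared mapping; the session controls
>         // when the mapping is released.
>         return fileChannel.map(FileChannel.MapMode.READ_WRITE,
>                 pageNo * DB_PAGE_SIZE, DB_PAGE_SIZE, memorySession);
>     }
>
>     void commit(MemorySegment ms) {
>         ms.force(); // msync: flush the mapped pages to disk
>     }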
>
>
>
> I’ve found an article about mmap which describes this in a more detailed
> way [1].
>
>
>
> I hope this could help,
>
> Radoslaw Smogura
>
>
>
> [1] How to use mmap function in C language? (linuxhint.com)
> <https://linuxhint.com/using_mmap_function_linux/>
>
>
>
> *From: *Gavin Ray <ray.gavin97 at gmail.com>
> *Sent: *Sunday, September 4, 2022 5:07 PM
> *To: *Johannes Lichtenberger <lichtenberger.johannes at gmail.com>
> *Cc: *Radosław Smogura <mail at smogura.eu>; Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com>; panama-dev at openjdk.org
> *Subject: *Re: Question: ByteBuffer vs MemorySegment for binary
> (de)serialization and in-memory buffer pool
>
>
>
> Ah okay, I see what you mean, that is good advice -- thank you!
>
>
>
> Two questions:
>
>
>
> 1) I'd try to create a segment for the whole file and get pageSize
> fragments at specified offsets from this file memory segment
>
>
>
> How do you do this when the DB file size is dynamic? With seek()/write()
> calls, the file expands as the writes happen.
>
> Initially the DB file is 0 bytes and has no data; only as pages are
> flushed to disk and write() is called does the file expand.
>
>
>
> 2) use MemorySegment::mapFile and directly obtain a MemorySegment from
> that call
>
>
>
> How do I make sure that the MemorySegment buffer holding the file contents
> is not memory-mapped, if I use memory-mapping to read it in?
>
> I don't mind using mmap for:
>
>
>
> A) Reading files, if it's faster
>
> B) Allocating memory arenas, with -1 as file descriptor
>
>
>
> But I want to avoid using mmap for the actual page buffers and persistence:
>
> Are You Sure You Want to Use MMAP in Your Database Management System?
> (CIDR 2022) (cmu.edu) <https://db.cs.cmu.edu/mmap-cidr2022/>
>
>
>
> Basically I just want to make sure that once I read a page into the
> buffer pool, it's backed by program-managed memory rather than mapped
> memory, if that makes sense?
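>
> A minimal sketch of what I'm after, assuming the JDK 19 preview API
> (POOL_PAGES, PAGE_SIZE, frameNo and pageNo are hypothetical names):
>
>     // Allocate program-managed (non-mapped) native memory for the pool...
>     MemorySegment pool = MemorySession.openShared()
>             .allocate((long) POOL_PAGES * PAGE_SIZE);
>     MemorySegment frame = pool.asSlice((long) frameNo * PAGE_SIZE, PAGE_SIZE);
>     // ...and fill a frame with a plain positional channel read, no mmap.
>     fileChannel.read(frame.asByteBuffer(), (long) pageNo * PAGE_SIZE);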
>
>
>
>
>
> On Sat, Sep 3, 2022 at 7:15 PM Johannes Lichtenberger <
> lichtenberger.johannes at gmail.com> wrote:
>
> I think you should probably use MemorySegment::mapFile and directly obtain
> a MemorySegment from that call. I'd try to create a segment for the whole
> file and get pageSize fragments at specified offsets from this file memory
> segment. With your current approach, I think you're calling mmap and munmap
> way too often, which involves costly system calls and mappings to virtual
> addresses (the contents should also be paged in on demand, not all at
> once). Also, I think that even multiple calls to the `mapFile` method with
> the same file + start address and offset should not add up virtual memory
> if it's a shared mapping (MAP_SHARED, and I think it is).
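>
> A sketch of that idea, assuming the JDK 19 preview API (session, fileSize,
> pageNo and PAGE_SIZE are assumed names):
>
>     // Map the file once...
>     MemorySegment file = fileChannel.map(FileChannel.MapMode.READ_WRITE,
>             0, fileSize, session);
>     // ...then hand out zero-copy, page-sized views into the mapping.
>     MemorySegment page = file.asSlice((long) pageNo * PAGE_SIZE, PAGE_SIZE);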
>
>
>
> As far as I can tell the approach works in SirixDB so far, but as I want
> to cache some pages in memory (the counterpart of your buffer manager,
> currently on-heap), which in my case have to be reconstructed from
> scattered page fragments, it doesn't seem to be either faster or slower
> than a pure FileChannel based approach. Usually with memory mapping I
> think you'd rather not want an application cache, as the data is already
> available off-heap (that said, if you store these page-sized memory
> segment views in the buffer manager it should work IMHO).
>
>
>
> I hope anything I mentioned that isn't correct will be corrected by more
> experienced engineers.
>
>
>
> Kind regards
>
> Johannes
>
>
>
> Gavin Ray <ray.gavin97 at gmail.com> wrote on Sun., Sep. 4, 2022, 00:26:
>
> Radosław, I tried to implement your advice but I think I might have
> implemented it incorrectly
>
> With JMH, I get very poor results:
>
>
>
> Benchmark                                                 Mode  Cnt       Score        Error  Units
> DiskManagerBenchmarks._01_01_writePageDiskManager        thrpt    4  552595.393 ±  77869.814  ops/s
> DiskManagerBenchmarks._01_02_writePageMappedDiskManager  thrpt    4     174.588 ±    111.846  ops/s
> DiskManagerBenchmarks._02_01_readPageDiskManager         thrpt    4  640469.183 ± 104851.381  ops/s
> DiskManagerBenchmarks._02_02_readPageMappedDiskManager   thrpt    4  133564.674 ±  10693.985  ops/s
>
>
>
> The difference in writing is ~550,000 vs 174(!) ops/s.
>
> In reading it is ~640,000 vs ~130,000 ops/s.
>
>
>
> This is the implementation code:
>
>
>
> public void readPage(PageId pageId, MemorySegment pageBuffer) {
>     int pageOffset = pageId.value() * Constants.PAGE_SIZE;
>     // Map just this page, page it in, copy it out, then drop the mapping.
>     MemorySegment mappedBuffer = raf.getChannel().map(
>             FileChannel.MapMode.READ_WRITE, pageOffset,
>             Constants.PAGE_SIZE, session);
>     mappedBuffer.load();
>     pageBuffer.copyFrom(mappedBuffer);
>     mappedBuffer.unload();
> }
>
> public void writePage(PageId pageId, MemorySegment pageBuffer) {
>     int pageOffset = pageId.value() * Constants.PAGE_SIZE;
>     // Map just this page, copy into it, sync to disk, then drop the mapping.
>     MemorySegment mappedBuffer = raf.getChannel().map(
>             FileChannel.MapMode.READ_WRITE, pageOffset,
>             Constants.PAGE_SIZE, session);
>     mappedBuffer.copyFrom(pageBuffer);
>     mappedBuffer.force();
>     mappedBuffer.unload();
> }
>
>
>
> Am I doing something wrong here? (I think I probably am.)
>
>
>
> On Fri, Sep 2, 2022 at 1:16 PM Gavin Ray <ray.gavin97 at gmail.com> wrote:
>
> Thank you very much for the advice, I will implement these suggestions =)
>
>
>
> On Fri, Sep 2, 2022 at 12:12 PM Radosław Smogura <mail at smogura.eu> wrote:
>
> Hi Gavin,
>
>
>
> I see you're making good progress.
>
>
>
> This is a good approach. A minor improvement would be to use
> MemorySegment.ofBuffer() to create a memory segment from a _*direct*_ byte
> buffer. This way you would have consistency (using only MemorySegment) and
> could still use FileChannel or other methods to manage the file size.
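>
> A sketch of that wrapping, assuming the JDK 19 preview API (PAGE_SIZE and
> pageOffset are assumed names):
>
>     // A direct buffer, so the segment is backed by off-heap memory.
>     ByteBuffer direct = ByteBuffer.allocateDirect(PAGE_SIZE);
>     MemorySegment seg = MemorySegment.ofBuffer(direct);
>     seg.set(ValueLayout.JAVA_INT, 0, 42);       // uniform segment API
>     raf.getChannel().write(direct, pageOffset); // channels still work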
>
>
>
> Most probably you would like to use MappedByteBuffer.force() to flush
> changes to disk (the equivalent of sync on Linux), i.e. to be sure a
> transaction is persisted, or for a write-ahead log.
>
>
>
> In most cases, if you want to work with zero-copy reads, you have to map
> the whole file as a direct buffer / memory segment. You would need to
> enlarge the file (most probably using a FileChannel) or other methods if
> you want to append new data (otherwise a SIGBUS or SIGSEGV can be
> generated, which can result in an exception or a crash).
>
>
>
> You can compare the different approaches using JMH to measure read and
> write performance.
>
>
>
> Kind regards,
>
> Rado Smogura
>
>
>
> *From: *Gavin Ray <ray.gavin97 at gmail.com>
> *Sent: *Friday, September 2, 2022 5:50 PM
> *To: *Johannes Lichtenberger <lichtenberger.johannes at gmail.com>
> *Cc: *Maurizio Cimadamore <maurizio.cimadamore at oracle.com>;
> panama-dev at openjdk.org
> *Subject: *Re: Question: ByteBuffer vs MemorySegment for binary
> (de)serialization and in-memory buffer pool
>
>
>
> On a related note, is there any way to do zero-copy reads from files using
> MemorySegments for non-Memory-Mapped files?
>
>
>
> Currently I'm using "SeekableByteChannel" and wrapping the MemorySegment
> using ".asByteBuffer()"
>
> Is this the most performant way?
>
>
>
> ========================
>
>
>
> class DiskManager {
>     private final RandomAccessFile raf;
>     private final SeekableByteChannel dbFileChannel;
>
>     public void readPage(PageId pageId, MemorySegment pageBuffer) {
>         int pageOffset = pageId.value() * Constants.PAGE_SIZE;
>         dbFileChannel.position(pageOffset);
>         dbFileChannel.read(pageBuffer.asByteBuffer());
>     }
>
>     public void writePage(PageId pageId, MemorySegment pageBuffer) {
>         int pageOffset = pageId.value() * Constants.PAGE_SIZE;
>         dbFileChannel.position(pageOffset);
>         dbFileChannel.write(pageBuffer.asByteBuffer());
>     }
> }
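>
> (A possible variant, sketched on the assumption that the channel is the
> FileChannel obtained from raf: positional reads avoid the separate
> position() call and are safe for concurrent use. Not necessarily faster.)
>
>     public void readPage(PageId pageId, MemorySegment pageBuffer)
>             throws IOException {
>         long pageOffset = (long) pageId.value() * Constants.PAGE_SIZE;
>         // Reads at an absolute offset without touching the channel position.
>         raf.getChannel().read(pageBuffer.asByteBuffer(), pageOffset);
>     }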
>
>
>
> On Thu, Sep 1, 2022 at 6:13 PM Johannes Lichtenberger <
> lichtenberger.johannes at gmail.com> wrote:
>
> I think it's a really good idea to use off-heap memory for the buffer
> manager / the pages with the stored records. In my case, I'm working on an
> immutable, persistent DBMS currently storing JSON and XML, with only one
> read-write trx per resource concurrently and, if desired, N read-only trx
> in parallel bound to specific revisions (in the relational world the term
> for a resource is a relation/table). During an import of a close to 4 GB
> JSON file with intermediate commits, I found out that depending on the
> number of records/nodes accumulated in the trx intent log (more or less a
> trx-private map), after which a commit and thus a sync to disk with
> removing the pages from the log is issued, the GC runs are >= 100 ms most
> of the time; the objects are long-lived and are obviously promoted to the
> old gen, which seems to take these >= 100 ms. That is, I'll have to study
> how Shenandoah works, but in this case it brings no advantage regarding
> the latency.
>
>
>
> Maybe it would make sense to store the data in the record instances
> off-heap as well, as Gavin did with his simple buffer manager :-) That
> said, lowering the maximum number of records after which to commit and
> sync to disk also has a tremendous effect, and with Shenandoah the GC
> times are at most a few ms.
>
>
>
> I'm already using the Foreign Memory API, however, to store the data in
> memory-mapped files, once the pages (or page fragments) and the records
> therein are serialized and then written to the memory segment after
> compression and, hopefully soon, encryption.
>
>
>
> Kind regards
>
> Johannes
>
>
>
>
>
>
>
> On Thu., Sept. 1, 2022 at 22:52, Maurizio Cimadamore <
> maurizio.cimadamore at oracle.com> wrote:
>
>
> On 01/09/2022 19:26, Gavin Ray wrote:
> > I think this is where my impression of verbosity is coming from, in
> > [1] I've linked a gist of ByteBuffer vs MemorySegment implementation
> > of a page header struct,
> > and it's the layout/varhandles that are the only difference, really.
> >
> Ok, I see what you mean, of course; thanks for the Gist.
>
> In this case I think the instance accessors we added on MemorySegment
> will bring the code more or less back to the same shape it had with the
> ByteBuffer API.
>
> Using var handles is very useful when you want to access elements (e.g.
> structs inside other structs inside arrays) as it takes all the offset
> computation out of the way.
>
> If you're happy enough with hardwired offsets (and I agree that in this
> case things might be good enough), then there's nothing wrong with using
> the ready-made accessor methods.
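>
> A small sketch of the two styles, assuming the JDK 19 preview API and a
> hypothetical page-header layout:
>
>     // Hardwired offset with the ready-made accessor methods:
>     int pageId = header.get(ValueLayout.JAVA_INT, 0);
>
>     // Layout + var handle, which does the offset computation for you:
>     MemoryLayout HEADER = MemoryLayout.structLayout(
>             ValueLayout.JAVA_INT.withName("pageId"),
>             ValueLayout.JAVA_INT.withName("numSlots"));
>     VarHandle PAGE_ID = HEADER.varHandle(
>             MemoryLayout.PathElement.groupElement("pageId"));
>     int samePageId = (int) PAGE_ID.get(header);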
>
> Maurizio
>
>
>
>
>