Question: ByteBuffer vs MemorySegment for binary (de)serialization and in-memory buffer pool
Johannes Lichtenberger
lichtenberger.johannes at gmail.com
Sun Sep 4 21:55:15 UTC 2022
Great article. I just wanted to mention that for writes I'm no longer using
the Foreign Memory API, that is, mmap. Instead, in SirixDB I opted for the
Lucene approach of using a memory-mapped file only for reads, because of the
potentially many munmap / mmap calls on ever-growing files. In SirixDB's
case every change is a copy-on-write and is appended to the file.
Kind regards
Johannes
Gavin Ray <ray.gavin97 at gmail.com> schrieb am So., 4. Sep. 2022, 17:49:
> Ahh okay, this clears a lot of things up, thanks again =)
>
> Really appreciate how helpful everyone has been.
>
> By the way, I wrote a blogpost about what I learned about Foreign Memory
> API + MemorySegment:
> I gave thanks to those on the mailing list, and Jorn Vernee, in the
> acknowledgements section
>
> Panama: Not-so-Foreign Memory. Using MemorySegment as a high-performance
> ByteBuffer replacement. (gavinray97.github.io)
> <https://gavinray97.github.io/blog/panama-not-so-foreign-memory>
>
> On Sun, Sep 4, 2022 at 11:21 AM Radosław Smogura <mail at smogura.eu> wrote:
>
>> Hi Gavin,
>>
>>
>>
>> In the context of enlarging the DB file: you can map more than the file's
>> current size. For instance, you can map 4 MB of a file even if the file
>> has size 0. Before you read from or write to that memory, you have to
>> **enlarge** the file – please read the FileChannel.position JavaDoc.
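A minimal runnable sketch of that point, using the stable MappedByteBuffer API (the FFM `FileChannel.map` overload returning a `MemorySegment` has the same shape). Here the file is enlarged with `RandomAccessFile.setLength` as one way to do the **enlarge** step (writing a byte at the new end via the channel position is another); the class and constant names are illustrative:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MapBeyondSize {
    static long enlargeAndMap(Path file, int mapSize) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            // The file may be 0 bytes here: enlarge it first so the whole
            // mapped region is backed by the file before we touch it.
            if (raf.length() < mapSize) {
                raf.setLength(mapSize);
            }
            MappedByteBuffer buf =
                raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, mapSize);
            buf.putLong(0, 42L); // now safe to read/write through the mapping
            return raf.length();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("db", ".bin"); // starts at 0 bytes
        long size = enlargeAndMap(tmp, 4 * 1024);
        System.out.println(size); // prints 4096
        Files.delete(tmp);
    }
}
```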
>>
>>
>>
>> In the context of managing pages and zero-copy, I would write something
>> like this in pseudocode:
>>
>>
>>
>> writeData(MemorySegment dbData, long offset) {
>>     dbData.setAtOffset(offset, ….);
>> }
>>
>> commit(MemorySegment ms) {
>>     ms.force(); // Just ensures data are on disk, so a disk failure loses
>>     // nothing; with a shared mapping the OS can sync data to disk at any
>>     // time anyway
>> }
>>
>> readDataAtDbPosition(long offset) {
>>     long pageNo = offset / DB_PAGE_SIZE; // Does not have to be the system
>>     // page size; it can be a few MB or a few GB
>>     if (pageInBuffer()) return buffer.getPage(pageNo);
>>     removeOldestPageFromBuffer();
>>     loadPage(pageNo);
>> }
>>
>> loadPage(pageNo) {
>>     fileChannel.map(SHARED, pageNo * DB_PAGE_SIZE, DB_PAGE_SIZE,
>>         memorySession);
>> }
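The pseudocode above can be sketched as runnable Java. This version assumes a plain positional FileChannel read per page miss (the mmap variant would map the region instead); DB_PAGE_SIZE, the LRU eviction policy, and the class name are illustrative choices, not part of the original mail:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

public class PageBuffer {
    static final int DB_PAGE_SIZE = 4096;
    static final int MAX_PAGES = 8;

    private final FileChannel fileChannel;
    // An access-ordered LinkedHashMap gives us "remove oldest page" for free.
    private final LinkedHashMap<Long, ByteBuffer> buffer =
        new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, ByteBuffer> e) {
                return size() > MAX_PAGES;
            }
        };

    public PageBuffer(FileChannel fileChannel) {
        this.fileChannel = fileChannel;
    }

    public ByteBuffer readDataAtDbPosition(long offset) throws IOException {
        long pageNo = offset / DB_PAGE_SIZE;
        ByteBuffer page = buffer.get(pageNo);
        if (page == null) {
            page = loadPage(pageNo);
            buffer.put(pageNo, page); // may evict the oldest page
        }
        return page;
    }

    private ByteBuffer loadPage(long pageNo) throws IOException {
        ByteBuffer page = ByteBuffer.allocateDirect(DB_PAGE_SIZE);
        fileChannel.read(page, pageNo * DB_PAGE_SIZE); // pread-style read
        page.flip();
        return page;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pages", ".bin");
        byte[] data = new byte[2 * DB_PAGE_SIZE];
        data[DB_PAGE_SIZE] = 7; // first byte of page 1
        Files.write(tmp, data);
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            PageBuffer pool = new PageBuffer(ch);
            System.out.println(pool.readDataAtDbPosition(DB_PAGE_SIZE + 100).get(0)); // prints 7
        }
        Files.delete(tmp);
    }
}
```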
>>
>>
>>
>> I’ve found some details about mmap which describe this in a more detailed
>> way [1].
>>
>>
>>
>> I hope this could help,
>>
>> Radoslaw Smogura
>>
>>
>>
>> [1] How to use mmap function in C language? (linuxhint.com)
>> <https://linuxhint.com/using_mmap_function_linux/>
>>
>>
>>
>> *From: *Gavin Ray <ray.gavin97 at gmail.com>
>> *Sent: *Sunday, September 4, 2022 5:07 PM
>> *To: *Johannes Lichtenberger <lichtenberger.johannes at gmail.com>
>> *Cc: *Radosław Smogura <mail at smogura.eu>; Maurizio Cimadamore
>> <maurizio.cimadamore at oracle.com>; panama-dev at openjdk.org
>> *Subject: *Re: Question: ByteBuffer vs MemorySegment for binary
>> (de)serialization and in-memory buffer pool
>>
>>
>>
>> Ah okay, I see what you mean, that is good advice -- thank you!
>>
>>
>>
>> Two questions:
>>
>>
>>
>> 1) I'd try to create a segment for the whole file and get pageSize
>> fragments at specified offsets from this file memory segment
>>
>>
>>
>> How do you do this when the DB file size is dynamic? With seek()/write()
>> calls, the file expands as it goes.
>>
>> Initially the DB file is 0 bytes and has no data; only as pages are
>> flushed to disk and write() is called does the file expand.
>>
>>
>>
>> 2) use MemorySegment::mapFile and directly obtain a MemorySegment from
>> that call
>>
>>
>>
>> How do I make sure that the MemorySegment buffer holding the file
>> contents is not memory-mapped, if I use memory-mapping to read it in?
>>
>> I don't mind using mmap for:
>>
>>
>>
>> A) Reading files, if it's faster
>>
>> B) Allocating memory arenas, with -1 as file descriptor
>>
>>
>>
>> But I want to avoid using mmap for the actual page buffers and
>> persistence:
>>
>> Are You Sure You Want to Use MMAP in Your Database Management System?
>> (CIDR 2022) (cmu.edu) <https://db.cs.cmu.edu/mmap-cidr2022/>
>>
>>
>>
>> Basically I just want to make sure that once I read the buffer into the
>> buffer pool, it's not mapped memory but program-managed memory if that
>> makes sense?
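A minimal sketch of that idea, assuming the stable NIO API: the mapping is only a transient read source, while the buffer the pool keeps is an ordinary direct allocation, so evicting or overwriting it never touches the kernel's mapping machinery. The names are illustrative:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PoolLoader {
    // Read a page through a short-lived mapping, but hand the pool an
    // ordinary (non-mapped) direct buffer holding a copy of the bytes.
    static ByteBuffer readIntoPoolBuffer(FileChannel ch, long pageOffset, int pageSize)
            throws IOException {
        MappedByteBuffer mapped =
            ch.map(FileChannel.MapMode.READ_ONLY, pageOffset, pageSize);
        ByteBuffer poolBuffer = ByteBuffer.allocateDirect(pageSize); // program-managed
        poolBuffer.put(mapped).flip(); // copy out of the mapping
        return poolBuffer;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pool", ".bin");
        Files.write(tmp, new byte[]{1, 2, 3, 4});
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            ByteBuffer page = readIntoPoolBuffer(ch, 0, 4);
            System.out.println(page.get(2)); // prints 3
        }
        Files.delete(tmp);
    }
}
```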
>>
>>
>>
>>
>>
>> On Sat, Sep 3, 2022 at 7:15 PM Johannes Lichtenberger <
>> lichtenberger.johannes at gmail.com> wrote:
>>
>> I think you should probably use MemorySegment::mapFile and directly
>> obtain a MemorySegment from that call. I'd try to create a segment for the
>> whole file and get pageSize fragments at specified offsets from this file
>> memory segment. I think otherwise, with your current approach, you're
>> calling mmap and munmap way too often, which involves costly system calls
>> and mappings to virtual addresses (the contents should also be paged in on
>> demand, not all at once). Also, I think that even multiple calls to the
>> `mapFile` method with the same file + start address and offset should not
>> add up virtual memory if it's a shared mapping (MAP_SHARED, and I think
>> it is).
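The map-once-and-slice approach can be sketched with `MappedByteBuffer.slice(index, length)` (Java 13+); with the FFM API the same shape is `asSlice(offset, size)` on a single segment mapped over the whole file. PAGE_SIZE and the class name are assumptions for the example:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SlicedFile {
    static final int PAGE_SIZE = 4096;

    // One map system call for the whole file; per-page views are cheap
    // slices with no further mmap/munmap. The int cast reflects the
    // ByteBuffer 2 GB limit -- one reason a MemorySegment (whose asSlice
    // takes long offsets) is the nicer fit for big files.
    static ByteBuffer pageSlice(MappedByteBuffer whole, long pageNo) {
        return whole.slice((int) (pageNo * PAGE_SIZE), PAGE_SIZE);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sliced", ".bin");
        byte[] data = new byte[2 * PAGE_SIZE];
        data[PAGE_SIZE] = 9; // first byte of page 1
        Files.write(tmp, data);
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            MappedByteBuffer whole =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            System.out.println(pageSlice(whole, 1).get(0)); // prints 9
        }
        Files.delete(tmp);
    }
}
```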
>>
>>
>>
>> As far as I can tell the approach works in SirixDB so far, but as I want
>> to cache some pages in memory (the counterpart of your buffer manager --
>> currently on-heap), which in my case have to be reconstructed from
>> scattered page fragments, it seems to be neither faster nor slower than a
>> pure FileChannel-based approach. Usually with memory mapping I think you'd
>> rather not want to have an application cache, as the data is already
>> available off-heap (that said, if you store these page-sized memory
>> segment views in the buffer manager it should work IMHO).
>>
>>
>>
>> I hope anything I mentioned that's not correct will be corrected by
>> more experienced engineers.
>>
>>
>>
>> Kind regards
>>
>> Johannes
>>
>>
>>
>> Gavin Ray <ray.gavin97 at gmail.com> schrieb am So., 4. Sep. 2022, 00:26:
>>
>> Radosław, I tried to implement your advice but I think I might have
>> implemented it incorrectly
>>
>> With JMH, I get very poor results:
>>
>>
>>
>> Benchmark                                                 Mode  Cnt       Score        Error  Units
>> DiskManagerBenchmarks._01_01_writePageDiskManager        thrpt    4  552595.393 ±  77869.814  ops/s
>> DiskManagerBenchmarks._01_02_writePageMappedDiskManager  thrpt    4     174.588 ±    111.846  ops/s
>> DiskManagerBenchmarks._02_01_readPageDiskManager         thrpt    4  640469.183 ± 104851.381  ops/s
>> DiskManagerBenchmarks._02_02_readPageMappedDiskManager   thrpt    4  133564.674 ±  10693.985  ops/s
>>
>>
>>
>> The difference in writing is ~550,000 vs 174(!)
>>
>> In reading it is ~640,000 vs ~130,000
>>
>>
>>
>> This is the implementation code:
>>
>>
>>
>> public void readPage(PageId pageId, MemorySegment pageBuffer) throws IOException {
>>     long pageOffset = (long) pageId.value() * Constants.PAGE_SIZE;
>>     MemorySegment mappedBuffer = raf.getChannel().map(
>>         FileChannel.MapMode.READ_WRITE, pageOffset, Constants.PAGE_SIZE, session);
>>     mappedBuffer.load();
>>     pageBuffer.copyFrom(mappedBuffer);
>>     mappedBuffer.unload();
>> }
>>
>> public void writePage(PageId pageId, MemorySegment pageBuffer) throws IOException {
>>     long pageOffset = (long) pageId.value() * Constants.PAGE_SIZE;
>>     MemorySegment mappedBuffer = raf.getChannel().map(
>>         FileChannel.MapMode.READ_WRITE, pageOffset, Constants.PAGE_SIZE, session);
>>     mappedBuffer.copyFrom(pageBuffer);
>>     mappedBuffer.force();
>>     mappedBuffer.unload();
>> }
>>
>>
>>
>> Am I doing something wrong here? (I think I probably am.)
>>
>>
>>
>> On Fri, Sep 2, 2022 at 1:16 PM Gavin Ray <ray.gavin97 at gmail.com> wrote:
>>
>> Thank you very much for the advice, I will implement these suggestions =)
>>
>>
>>
>> On Fri, Sep 2, 2022 at 12:12 PM Radosław Smogura <mail at smogura.eu> wrote:
>>
>> Hi Gavin,
>>
>>
>>
>> I see you're making good progress.
>>
>>
>>
>> This is a good approach. A minor improvement would be to use
>> MemorySegment.ofBuffer() to create a memory segment from a _*direct*_ byte
>> buffer. This way you would have consistency (using only MemorySegment) and
>> FileChannel or other methods to manage the file size.
>>
>>
>>
>> Most probably you would like to use MappedByteBuffer.force() to flush
>> changes to disk (the equivalent of msync on Linux) – i.e., to be sure a
>> transaction is persisted, or for a write-ahead log.
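A minimal sketch of using force() for a write-ahead log, on the stable MappedByteBuffer API. Mapping per record is only for brevity; a real WAL would keep one mapping, and the names here are illustrative:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class Wal {
    // Map the record's region, write it, and force() before the commit is
    // acknowledged, so the record survives a crash.
    static void appendRecord(FileChannel ch, long offset, byte[] record)
            throws IOException {
        MappedByteBuffer wal =
            ch.map(FileChannel.MapMode.READ_WRITE, offset, record.length);
        wal.put(record);
        wal.force(); // msync: the data is on disk when this returns
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("wal", ".bin");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            appendRecord(ch, 0, new byte[]{42});
        }
        System.out.println(Files.readAllBytes(tmp)[0]); // prints 42
        Files.delete(tmp);
    }
}
```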
>>
>>
>>
>> In most cases, if you want to work with zero-copy reads, you have to map
>> the whole file as a direct buffer / memory segment. You would need to
>> enlarge the file (most probably using a file channel or other methods) if
>> you want to append new data (otherwise a SIGBUS or SIGSEGV can be
>> generated – which can result in an exception or a crash).
>>
>>
>>
>> You can compare the different approaches using JMH to measure read and
>> write performance.
>>
>>
>>
>> Kind regards,
>>
>> Rado Smogura
>>
>>
>>
>> *From: *Gavin Ray <ray.gavin97 at gmail.com>
>> *Sent: *Friday, September 2, 2022 5:50 PM
>> *To: *Johannes Lichtenberger <lichtenberger.johannes at gmail.com>
>> *Cc: *Maurizio Cimadamore <maurizio.cimadamore at oracle.com>;
>> panama-dev at openjdk.org
>> *Subject: *Re: Question: ByteBuffer vs MemorySegment for binary
>> (de)serialization and in-memory buffer pool
>>
>>
>>
>> On a related note, is there any way to do zero-copy reads from files
>> using MemorySegments for non-Memory-Mapped files?
>>
>>
>>
>> Currently I'm using "SeekableByteChannel" and wrapping the MemorySegment
>> using ".asByteBuffer()"
>>
>> Is this the most performant way?
>>
>>
>>
>> ========================
>>
>>
>>
>> class DiskManager {
>>     private final RandomAccessFile raf;
>>     private final SeekableByteChannel dbFileChannel;
>>
>>     public void readPage(PageId pageId, MemorySegment pageBuffer) throws IOException {
>>         long pageOffset = (long) pageId.value() * Constants.PAGE_SIZE;
>>         dbFileChannel.position(pageOffset);
>>         dbFileChannel.read(pageBuffer.asByteBuffer());
>>     }
>>
>>     public void writePage(PageId pageId, MemorySegment pageBuffer) throws IOException {
>>         long pageOffset = (long) pageId.value() * Constants.PAGE_SIZE;
>>         dbFileChannel.position(pageOffset);
>>         dbFileChannel.write(pageBuffer.asByteBuffer());
>>     }
>> }
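One hedged alternative to the shared-position read/write above: FileChannel's positional read(ByteBuffer, long) and write(ByteBuffer, long) (pread/pwrite style) take the offset explicitly and don't mutate the channel's position, so concurrent page accesses don't race on position(). A self-contained sketch with illustrative names:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalIo {
    static final int PAGE_SIZE = 4096;

    // Positional read: the offset is a parameter, the channel's shared
    // position is left untouched.
    static void readPage(FileChannel ch, long pageNo, ByteBuffer dst)
            throws IOException {
        ch.read(dst, pageNo * PAGE_SIZE);
        dst.flip();
    }

    static void writePage(FileChannel ch, long pageNo, ByteBuffer src)
            throws IOException {
        ch.write(src, pageNo * PAGE_SIZE);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("pio", ".bin");
        try (FileChannel ch = FileChannel.open(tmp,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            ByteBuffer page = ByteBuffer.allocateDirect(PAGE_SIZE);
            page.put(0, (byte) 5);
            writePage(ch, 1, page);
            ByteBuffer back = ByteBuffer.allocateDirect(PAGE_SIZE);
            readPage(ch, 1, back);
            System.out.println(back.get(0)); // prints 5
        }
        Files.delete(tmp);
    }
}
```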
>>
>>
>>
>> On Thu, Sep 1, 2022 at 6:13 PM Johannes Lichtenberger <
>> lichtenberger.johannes at gmail.com> wrote:
>>
>> I think it's a really good idea to use off-heap memory for the buffer
>> manager / the pages with the stored records. In my case, I'm working on an
>> immutable, persistent DBMS currently storing JSON and XML, with only one
>> read-write trx per resource concurrently and, if desired, N read-only trx
>> in parallel bound to specific revisions (in the relational world the term
>> for a resource is a relation/table). During an import of a close to 4 GB
>> JSON file with intermediate commits, I found that, depending on the number
>> of records/nodes accumulated in the trx intent log (more or less a
>> trx-private map), after which a commit and thus a sync to disk with
>> removal of the pages from the log is issued, the GC runs are >= 100 ms
>> most of the time; the objects are long-lived and are obviously promoted to
>> the old gen, which seems to take these >= 100 ms. I'll have to study how
>> Shenandoah works, but in this case it brings no advantage regarding
>> latency.
>>
>>
>>
>> Maybe it would also make sense to store the data in the record instances
>> off-heap, as Gavin did with his simple buffer manager :-) That said,
>> lowering the max number of records after which to commit and sync to disk
>> also has a tremendous effect, and with Shenandoah the GC times are at
>> least less than a few ms.
>>
>>
>>
>> However, I'm already using the Foreign Memory API to store the data in
>> memory-mapped files: the pages (or page fragments) and the records therein
>> are serialized and then written to the memory segment after compression
>> and, hopefully soon, encryption.
>>
>>
>>
>> Kind regards
>>
>> Johannes
>>
>>
>>
>>
>>
>>
>>
>> Am Do., 1. Sept. 2022 um 22:52 Uhr schrieb Maurizio Cimadamore <
>> maurizio.cimadamore at oracle.com>:
>>
>>
>> On 01/09/2022 19:26, Gavin Ray wrote:
>> > I think this is where my impression of verbosity is coming from, in
>> > [1] I've linked a gist of ByteBuffer vs MemorySegment implementation
>> > of a page header struct,
>> > and it's the layout/varhandles that are the only difference, really.
>> >
>> Ok, I see what you mean, of course; thanks for the Gist.
>>
>> In this case I think the instance accessor we added on MemorySegment
>> will bring the code more or less to the same shape as what it used to be
>> with the ByteBuffer API.
>>
>> Using var handles is very useful when you want to access elements (e.g.
>> structs inside other structs inside arrays) as it takes all the offset
>> computation out of the way.
>>
>> If you're happy enough with hardwired offsets (and I agree that in this
>> case things might be good enough), then there's nothing wrong with using
>> the ready-made accessor methods.
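The trade-off can be illustrated on the stable ByteBuffer API with a byte-buffer view var handle (java.lang.invoke, Java 9+) next to a ready-made accessor with a hardwired offset; the page-header field name and layout here are made up for the example:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class HeaderAccess {
    // Hypothetical page-header layout: a long LSN at offset 0.
    static final int LSN_OFFSET = 0;

    static final VarHandle LONG_AT =
        MethodHandles.byteBufferViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    // Var-handle route: the offset can be composed/derived from a layout.
    static long lsnViaVarHandle(ByteBuffer header) {
        return (long) LONG_AT.get(header, LSN_OFFSET);
    }

    // Ready-made accessor with a hardwired offset.
    static long lsnViaAccessor(ByteBuffer header) {
        return header.order(ByteOrder.LITTLE_ENDIAN).getLong(LSN_OFFSET);
    }

    public static void main(String[] args) {
        ByteBuffer header = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        header.putLong(LSN_OFFSET, 123L);
        System.out.println(lsnViaVarHandle(header) == lsnViaAccessor(header)); // prints true
    }
}
```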
>>
>> Maurizio
>>
>>
>>
>>
>>
>