Best practices with reading/writing to a Memory Mapped File
Johannes Lichtenberger
lichtenberger.johannes at gmail.com
Tue Jun 30 13:35:26 UTC 2020
I just implemented a simple FileChannel-based implementation with direct
I/O, but reads and writes have to be aligned to the file store's block size
(currently 4096 bytes for me), so direct I/O doesn't really work, as SirixDB
deliberately reads and writes fine-grained page-fragments. At least with
byte-addressable Intel Optane memory it should be possible to fetch 256
bytes at a time efficiently, which might be fine-grained enough to fetch a
predefined number of page-fragments, each with a few changed records,
concurrently and to reconstruct the full page in memory.
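For reference, the aligned read path would have to look roughly like this
(just a sketch; DirectIoReadSketch/readAligned are made-up names, and it
assumes JDK 10+ for ExtendedOpenOption.DIRECT and FileStore.getBlockSize()):

import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class DirectIoReadSketch {
    // Read [offset, offset + length) with direct I/O; position, count and
    // buffer address all have to be aligned to the file store's block size.
    static ByteBuffer readAligned(Path file, long offset, int length) throws IOException {
        int blockSize = (int) Files.getFileStore(file).getBlockSize();
        long alignedOffset = (offset / blockSize) * blockSize;
        int padding = (int) (offset - alignedOffset);
        int alignedLength = ((padding + length + blockSize - 1) / blockSize) * blockSize;

        // Over-allocate and slice to obtain a block-aligned direct buffer.
        ByteBuffer buffer = ByteBuffer.allocateDirect(alignedLength + blockSize)
                                      .alignedSlice(blockSize);
        buffer.limit(alignedLength);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ,
                                                    ExtendedOpenOption.DIRECT)) {
            channel.read(buffer, alignedOffset);
        }
        // Expose only the originally requested, unaligned range.
        buffer.position(padding).limit(padding + length);
        return buffer.slice();
    }
}
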
SirixDB is always appending. I guess I'll just do the appending with a
FileChannel or directly through a RandomAccessFile, and serve reads through
a read-only mapped memory segment that spans the whole file (or through
segments of at least 1 GB each).
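Roughly like this, I suppose (just a sketch; AppendOnlyStore/appendPage are
made-up names, and it is written against FileChannel.map returning a
MemorySegment plus an Arena, so the factory methods differ from the
incubator API):

import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: appends go through a plain FileChannel, reads go through one
// read-only mapping spanning the whole file.
final class AppendOnlyStore implements AutoCloseable {
    private final FileChannel channel;
    private final Arena arena = Arena.ofShared();
    private MemorySegment mapped;

    AppendOnlyStore(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        // One mapping over the current file size; it only has to be recreated
        // once appended data should become readable through the mapping.
        mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
    }

    // Append a serialized page fragment at the end of the file; returns its
    // offset (a real implementation would loop until the buffer is drained).
    long appendPage(ByteBuffer serializedPage) throws IOException {
        long offset = channel.size();
        channel.write(serializedPage, offset);
        return offset;
    }

    // Random read of an already committed page fragment through the mapping.
    MemorySegment readAt(long offset, long length) {
        return mapped.asSlice(offset, length);
    }

    @Override
    public void close() throws IOException {
        arena.close();
        channel.close();
    }
}
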
Kind regards
Johannes
Johannes Lichtenberger <lichtenberger.johannes at gmail.com> schrieb am Mo.,
29. Juni 2020, 23:32:
> Some DBMSes even seem to use direct I/O to bypass the kernel page cache:
> https://www.scylladb.com/2017/10/05/io-access-methods-scylla/
>
> Kind regards
> Johannes
>
> Maurizio Cimadamore <maurizio.cimadamore at oracle.com> schrieb am Mo., 29.
> Juni 2020, 18:17:
>
>> Uwe - any tips here?
>>
>> Cheers
>> Maurizio
>> On 29/06/2020 17:15, Johannes Lichtenberger wrote:
>>
>> Hi Maurizio,
>>
>> indeed, now I basically map the whole file again for reading
>> page-fragments. For 2.7 GB of test data stored in a SirixDB resource
>> (roughly 223_000_000 nodes, I think) it currently takes about 5:50 min to
>> traverse the whole file in preorder vs. 6:30 min with the RandomAccessFile
>> implementation, but that's just from running it in IntelliJ. I could also
>> see in the profiler that the mapping and unmapping is the problem when
>> appending (during a transaction commit).
>>
>> The page references keep track of the offsets in the file or memory-mapped
>> region (the pages form a tree-of-tries structure, which borrows ideas from
>> ART (the adaptive radix tree) and from hash array mapped tries).
>>
>> I think another issue when reading is that I probably don't even need a
>> cache anymore. Usually I set the loaded page via PageReference.setPage(Page)
>> and also store it in a cache. The cache is, for instance, also used to
>> evict entries and thus to null the referenced page again (a sort of pointer
>> swizzling). Neither might be needed when using the mapped memory
>> segment(s).
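>>
>> For instance, a page reference might then only keep its offset and
>> deserialize on demand straight from the mapping, roughly like this (just a
>> sketch; the length-prefixed layout and the PageFragmentRef name are made
>> up, and it assumes the java.lang.foreign MemorySegment API):
>>
>> import java.lang.foreign.MemorySegment;
>> import java.lang.foreign.ValueLayout;
>> import java.nio.ByteBuffer;
>>
>> // Sketch: a fragment is stored as a 4-byte length prefix followed by the
>> // serialized bytes; the reference only remembers its offset in the file.
>> final class PageFragmentRef {
>>     private final long offset;
>>
>>     PageFragmentRef(long offset) {
>>         this.offset = offset;
>>     }
>>
>>     // Deserialize on demand instead of pinning an on-heap Page in a cache.
>>     ByteBuffer read(MemorySegment mapped) {
>>         int length = mapped.get(ValueLayout.JAVA_INT_UNALIGNED, offset);
>>         return mapped.asSlice(offset + Integer.BYTES, length)
>>                      .asByteBuffer()
>>                      .asReadOnlyBuffer();
>>     }
>> }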
>>
>> Regarding writing, I'm not sure if I can simply delegate reads to the
>> MemoryMappedFileReader and do the appending directly through the
>> RandomAccessFile. From what I've read, appending might be the worst use
>> case for memory-mapped files: except on Linux, you probably have to
>> preallocate chunks, track the logical size yourself, and afterwards
>> truncate to that size again. On Linux there's a remap function (mremap), I
>> think.
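>>
>> I.e., if I went with a writable mapping, the bookkeeping would roughly
>> look like this (just a sketch; the chunk size and the PreallocatingAppender
>> name are arbitrary):
>>
>> import java.io.IOException;
>> import java.io.RandomAccessFile;
>>
>> // Sketch: grow the file in large chunks so a writable mapping over it
>> // rarely has to be recreated, track the logical end separately, and
>> // truncate back to it on close.
>> final class PreallocatingAppender implements AutoCloseable {
>>     private static final long CHUNK = 256L * 1024 * 1024; // 256 MiB
>>
>>     private final RandomAccessFile file;
>>     private long logicalSize;
>>
>>     PreallocatingAppender(RandomAccessFile file) throws IOException {
>>         this.file = file;
>>         this.logicalSize = file.length();
>>     }
>>
>>     long append(byte[] data) throws IOException {
>>         if (logicalSize + data.length > file.length()) {
>>             // Preallocate a whole chunk; this is where a writable mapping
>>             // would have to be recreated over the new length.
>>             file.setLength(file.length() + CHUNK);
>>         }
>>         file.seek(logicalSize);
>>         file.write(data);
>>         long offset = logicalSize;
>>         logicalSize += data.length;
>>         return offset;
>>     }
>>
>>     @Override
>>     public void close() throws IOException {
>>         // Truncate back to the bytes actually written.
>>         file.setLength(logicalSize);
>>         file.close();
>>     }
>> }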
>>
>> But maybe appending can simply be done through the RandomAccessFile while
>> reads go through the mapped memory segment. What do you think? As
>> mentioned earlier, I should probably also get rid of the cache and stop
>> setting the on-heap Page instance, but I'm not sure about this either.
>>
>> From what I've read, basically every database system seems to use mmap
>> nowadays: some are in-memory data stores, for instance HyPer and SAP HANA,
>> but also MongoDB and, I think, Cassandra...
>>
>> Kind regards
>>
>> Maurizio Cimadamore <maurizio.cimadamore at oracle.com> schrieb am Mo., 29.
>> Juni 2020, 17:31:
>>
>>> Hi Johannes,
>>> glad that you managed to make everything work.
>>>
>>> While I'm not an expert in mmap fine-tuning, one thing that comes to
>>> mind is that memory mapped files are mapped into main memory one page at
>>> a time, so if your pattern of access is really random/sparse, maybe
>>> there's not a lot to be gained by using a mapped file in your use case.
>>>
>>> Also, looking at the code, it seems like you are creating a mapped
>>> segment for each page write, which seems odd - typically you'd want a
>>> mapped segment to contain all the memory you need to access, and then
>>> leave the loading/unloading of pages to the OS, which generally knows
>>> better. It seems to me that your application instead selects which
>>> PageReference to write, then creates a mapped segment for that page, and
>>> then persists the changes via the mapped segment; I think doing this
>>> probably nullifies all the advantages of keeping the contents of the
>>> file in memory. In fact, with your approach, since the mapped segment is
>>> not stashed anywhere, I don't think the file will even be kept in memory
>>> (you map and then discard soon after, page after page).
>>>
>>> I'd expect some state to remain cached from one write to the next (e.g.
>>> the mapped segment should, ideally, be stashed in some field, and only
>>> discarded if, for some reason, the original bounds are no longer valid -
>>> e.g. because the file is truncated, or expanded). But, assuming your
>>> file size remains stable, your code should keep accessing memory using
>>> _the same_ mapped segment, and the OS will load/unload pages for you as
>>> it sees fit (using heuristics to keep frequently used pages loaded, and
>>> discard the ones that have been used less frequently - all taking into
>>> account how much memory your system has).
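>>>
>>> Something along these lines (just a sketch with illustrative names; here
>>> written against FileChannel::map returning a MemorySegment, so the exact
>>> mapping factory depends on the API version you are on):
>>>
>>> import java.io.IOException;
>>> import java.lang.foreign.Arena;
>>> import java.lang.foreign.MemorySegment;
>>> import java.nio.channels.FileChannel;
>>>
>>> // Keep one mapped segment stashed in a field; remap only when the file's
>>> // bounds change (truncation or expansion). Which pages of the mapping are
>>> // resident is left entirely to the OS.
>>> final class CachedMapping {
>>>     private final FileChannel channel;
>>>     private final Arena arena = Arena.ofShared();
>>>     private MemorySegment segment;
>>>
>>>     CachedMapping(FileChannel channel) throws IOException {
>>>         this.channel = channel;
>>>         this.segment = channel.map(FileChannel.MapMode.READ_ONLY, 0,
>>>                                    channel.size(), arena);
>>>     }
>>>
>>>     MemorySegment segment() throws IOException {
>>>         long size = channel.size();
>>>         if (size != segment.byteSize()) {
>>>             // Bounds are no longer valid: remap once, then keep using the
>>>             // same segment for all subsequent accesses.
>>>             segment = channel.map(FileChannel.MapMode.READ_ONLY, 0, size, arena);
>>>         }
>>>         return segment;
>>>     }
>>> }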
>>>
>>> Maurizio
>>>
>>> On 27/06/2020 11:50, Johannes Lichtenberger wrote:
>>> > Hi,
>>> >
>>> > I've fixed my memory-mapped file implementation using your Foreign
>>> > Memory API.
>>> >
>>> >
>>> > https://github.com/sirixdb/sirix/tree/master/bundles/sirix-core/src/main/java/org/sirix/io/memorymapped
>>> >
>>> > Running my tests (mostly simple integration tests, which check whether
>>> > the data I'm storing can be retrieved again and whether query results
>>> > are what I expect), I can't see a clear performance difference between the
>>> > RandomAccessFile implementation
>>> >
>>> >
>>> > https://github.com/sirixdb/sirix/tree/master/bundles/sirix-core/src/main/java/org/sirix/io/file
>>> >
>>> > and the new memorymapped implementation.
>>> >
>>> > So far, I have to create a new mapping every time I'm appending to the
>>> > memory-mapped segment of the underlying file, I guess (otherwise the
>>> > bounds checks will obviously fail):
>>> >
>>> >
>>> > https://github.com/sirixdb/sirix/blob/627fa5a57a302b04d7165aad75a780d74e14c2e9/bundles/sirix-core/src/main/java/org/sirix/io/memorymapped/MemoryMappedFileWriter.java#L141
>>> >
>>> > When writing I'm only ever appending data; when reading I read randomly
>>> > based on offsets.
>>> >
>>> > I haven't done any microbenchmarks as of now, did not check bigger files
>>> > ranging from 1 GB upwards, nor did I use a profiler to check what's
>>> > going on. However, maybe creating the mapping that often is costly and
>>> > maybe you can simply spot a performance issue. Or it's IntelliJ and my
>>> > rather small files for testing as of now.
>>> >
>>> > I will next check if importing a 3.8 GB JSON file is faster, or iterating
>>> > through the whole imported file with around 400_000_000 nodes :-)
>>> >
>>> > If anyone wants to check it, it's simply a matter of changing
>>> >
>>> > private static final StorageType STORAGE = StorageType.FILE;
>>> >
>>> > to
>>> >
>>> > private static final StorageType STORAGE = StorageType.MEMORY_MAPPED;
>>> >
>>> > in the class: org.sirix.access.ResourceConfiguration
>>> >
>>> > Thanks for all the suggestions and hints so far
>>> > Johannes
>>>
>>