Off heap vs on heap memory access performance for a DBS
Johannes Lichtenberger
lichtenberger.johannes at gmail.com
Sat Jul 29 11:18:54 UTC 2023
Well, thinking about it again today (after sleeping on it), every DBS has
to write the byte sequences at some point (which is probably what shows up
so prominently in the CPU-based flame graph). I'm not even sure I'm seeing
a real issue here. That said, I should first check whether other DBSes are
actually faster at ingesting JSON after all!? ;-)

So... GC may matter, of course, but the serialization, i.e. the writing of
the byte "arrays", has to happen at some point, regardless of whether I'm
storing intermediate in-memory node instances or not!?
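
Just to make that concrete for myself, roughly the following is what I
mean (a minimal sketch with made-up field names and offsets, assuming the
Java 21 java.lang.foreign API; the Java 20 preview names differ slightly):
whether or not an intermediate on-heap node instance exists first, these
are the bytes that have to be written eventually.

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;

    // Hypothetical fixed-size part of a struct node: key, parent and the
    // left/right sibling references (all made-up offsets).
    final class NodeWriter {
      static final long KEY = 0, PARENT = 8, LEFT = 16, RIGHT = 24, SIZE = 32;

      // Writes the node fields straight into the (off-heap) page segment;
      // the same bytes would have to be produced when serializing an
      // intermediate on-heap node instance instead.
      static void write(MemorySegment page, long slot,
                        long key, long parent, long left, long right) {
        long base = slot * SIZE;
        page.set(ValueLayout.JAVA_LONG, base + KEY, key);
        page.set(ValueLayout.JAVA_LONG, base + PARENT, parent);
        page.set(ValueLayout.JAVA_LONG, base + LEFT, left);
        page.set(ValueLayout.JAVA_LONG, base + RIGHT, right);
      }

      public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
          MemorySegment page = arena.allocate(1024 * SIZE, 8);
          write(page, 0, 1L, 0L, -1L, 2L);
          System.out.println(page.get(ValueLayout.JAVA_LONG, KEY)); // 1
        }
      }
    }
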
Kind regards
Johannes
On Sat, 29 Jul 2023 at 02:45, Johannes Lichtenberger <
lichtenberger.johannes at gmail.com> wrote:
> Hi Maurizio,
>
> I could, for instance, allocate a MemorySegment for each node (but it must
> be fast, of course, with 1024 nodes per page and, in one test, over
> 300_000_000 nodes overall). However, I'd have to adapt the neighbor
> pointers once more nodes are inserted. Maybe this architecture diagram can
> give some insight (JSON mapped to fine-granular nodes); the nodes are
> accessed through a trie, as it's a disk-based DBS (essentially an array
> stored in a tree structure to fetch nodes by their 64-bit identifier):
> https://raw.githubusercontent.com/sirixdb/sirixdb.github.io/master/images/sirix-json-tree-encoding.png
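>
> For the per-node idea, this rough sketch is what I have in mind
> (hypothetical names and a made-up 32-byte fixed part; the real node
> layout is more involved): one slicing allocator per page, so allocating
> the up to 1024 nodes doesn't hit malloc each time.
>
>     import java.lang.foreign.Arena;
>     import java.lang.foreign.MemorySegment;
>     import java.lang.foreign.SegmentAllocator;
>     import java.lang.foreign.ValueLayout;
>
>     final class PerNodeAllocation {
>       public static void main(String[] args) {
>         try (Arena arena = Arena.ofConfined()) {
>           // One backing block per page; the node allocations then only
>           // slice it instead of each hitting malloc.
>           MemorySegment backing = arena.allocate(1024 * 32, 8);
>           SegmentAllocator perPage =
>               SegmentAllocator.slicingAllocator(backing);
>
>           MemorySegment node = perPage.allocate(32, 8); // fixed-size part
>           node.set(ValueLayout.JAVA_LONG, 0, 42L);      // e.g. node key
>           node.set(ValueLayout.JAVA_LONG, 8, 43L);      // e.g. right sibling
>           // ... adapt the neighbor pointers of already inserted nodes here
>         }
>       }
>     }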
>
> Thus, while adding nodes, at least the relationships between the nodes
> have to be adapted (long references):
> https://github.com/sirixdb/sirix/blob/master/bundles/sirix-core/src/main/java/io/sirix/node/delegates/StructNodeDelegate.java
> All of this can be done without having to resize either a MemorySegment or
> a byte array. Still, once the read-write trx sets a different name for an
> object field or a field value, the String is of variable length. The
> MemorySegment or byte array may have to grow, but these cases are rare, so
> allocating a new one, for instance, may be acceptable.
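>
> Something like this is what I mean by "these cases are rare, so
> allocating a new one may be acceptable" (again just a sketch with a
> hypothetical 32-byte fixed part; setUtf8String is the Java 20/21 name):
>
>     // (imports as in the sketch above, plus java.nio.charset.StandardCharsets)
>     final class NodeRename {
>       // Grow-by-copy when a variable-length field (here: an object field
>       // name) no longer fits. Hypothetical layout: 32 bytes of fixed
>       // fields followed by the UTF-8 encoded name.
>       static MemorySegment withNewName(Arena arena, MemorySegment oldNode,
>                                        String newName) {
>         int utf8Len = newName.getBytes(StandardCharsets.UTF_8).length;
>         MemorySegment newNode = arena.allocate(32 + utf8Len + 1, 8);
>         MemorySegment.copy(oldNode, 0, newNode, 0, 32); // fixed-size fields
>         newNode.setUtf8String(32, newName);  // new variable-length part
>         return newNode;                      // the old slice becomes garbage
>       }
>     }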
>
> I'd have to rewrite a lot of stuff, however (and as I'm switching jobs to
> embedded C/C++ work, I'm not sure I'll be able to do it anytime soon :-/; I
> keep hoping for an easier solution somehow, but can't think of one):
> https://github.com/sirixdb/sirix/blob/fcefecc32567b6c4ead38d38ebeae3c050fab6f2/bundles/sirix-core/src/main/java/io/sirix/access/trx/page/NodePageTrx.java#L212
>
> Another tricky thing is the variable-length pages, if I were to switch
> completely to storing each page as one MemorySegment (instead of only the
> fine-granular nodes as proposed above).
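>
> If I went down that route, I guess the page itself would need something
> like a slotted-page layout (a purely hypothetical sketch, not what
> SirixDB does today), so that variable-length records can still be
> addressed by slot number:
>
>     // (imports as above) -- hypothetical layout in one segment:
>     // [int slotCount][int offset per slot][...node records...]. The slot
>     // array keeps O(1) access by slot number despite the records being
>     // of variable size.
>     final class SlottedPage {
>       static MemorySegment recordAt(MemorySegment page, int slot) {
>         int count = page.get(ValueLayout.JAVA_INT, 0);
>         long off  = page.get(ValueLayout.JAVA_INT, 4 + slot * 4L);
>         long end  = slot + 1 < count
>             ? page.get(ValueLayout.JAVA_INT, 4 + (slot + 1) * 4L)
>             : page.byteSize();
>         return page.asSlice(off, end - off);
>       }
>     }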
>
> But judging by the flame graph, serialization is one of the most
> time-intensive tasks, right? Even though I'm already parallelizing it for
> the leaf data pages as one of the first tasks during a commit, your
> proposed approach of matching the in-memory layout to the serialization
> layout may be the only way to get better latency.
>
> kind regards
> Johannes
>
> On Sat, 29 Jul 2023 at 01:22, Maurizio Cimadamore <
> maurizio.cimadamore at oracle.com> wrote:
>
>> If you need to serialize big chunks of memory to files, I think you are
>> in the use case Brian S was describing, i.e. you need some kind of
>> memory-mapped solution.
>>
>> That is, you probably want the in-memory layout to match the serialized
>> layout, so that your memory operations can be persisted directly to disk
>> (e.g. by calling MS::force).
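>>
>> Something along these lines (just a sketch; the file name, sizes and the
>> Java 21 API names are assumptions on my part):
>>
>>     import java.io.IOException;
>>     import java.io.UncheckedIOException;
>>     import java.lang.foreign.Arena;
>>     import java.lang.foreign.MemorySegment;
>>     import java.lang.foreign.ValueLayout;
>>     import java.nio.channels.FileChannel;
>>     import java.nio.file.Path;
>>     import static java.nio.file.StandardOpenOption.*;
>>
>>     final class MappedPage {
>>       public static void main(String[] args) {
>>         try (Arena arena = Arena.ofConfined();
>>              FileChannel ch = FileChannel.open(Path.of("pages.db"),
>>                                                CREATE, READ, WRITE)) {
>>           // Map one 64 KiB page region; writes go through the segment
>>           // and force() flushes the dirty region to disk.
>>           MemorySegment page =
>>               ch.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 16, arena);
>>           page.set(ValueLayout.JAVA_LONG, 0, 42L); // "serialize" in place
>>           page.force();                            // persist (MS::force)
>>         } catch (IOException e) {
>>           throw new UncheckedIOException(e);
>>         }
>>       }
>>     }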
>>
>> That said, if you have the requirement to go back and forth between byte
>> arrays and off-heap memory, you might be in trouble, because there's no
>> way to "wrap an array" over a piece of off-heap memory. You would have to
>> allocate and then copy, which is very expensive if you have a lot of data.
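>>
>> In other words, the best you can do is a bulk copy, e.g. (sketch):
>>
>>     // (imports as above)
>>     final class HeapNativeCopies {
>>       // Bulk copies between a byte[] (viewed as a heap segment) and a
>>       // native segment; there is no way to view native memory *as* a
>>       // byte[].
>>       static void copyIn(byte[] src, MemorySegment nativeSeg) {
>>         MemorySegment.copy(MemorySegment.ofArray(src), 0,
>>                            nativeSeg, 0, src.length);
>>       }
>>
>>       static byte[] copyOut(MemorySegment nativeSeg) {
>>         return nativeSeg.toArray(ValueLayout.JAVA_BYTE); // allocates + copies
>>       }
>>     }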
>>
>> Maurizio
>> On 28/07/2023 22:45, Johannes Lichtenberger wrote:
>>
>> I think the main issue is not even allocation or GC, but rather the
>> serialization of, in this case, over 300_000_000 nodes in total.
>>
>> I've attached a screenshot showing the flamegraph of the profiler output
>> I posted the link to...
>>
>> What do you think? For starters, the page already has a slot array that
>> stores byte arrays for the nodes. The whole page should probably be
>> something like a growable MemorySegment, but I could first try to simply
>> write the byte arrays directly and read from them using something like
>> Chronicle Bytes, or even MemorySegments, and then convert them back to
>> byte arrays?
>>
>> Kind regards
>> Johannes
>>
>> Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote on Fri, 28
>> Jul 2023, 11:38:
>>
>>>
>>> On 28/07/2023 08:15, Johannes Lichtenberger wrote:
>>> > Hello,
>>> >
>>> > I think I mentioned it already, but currently I'm thinking about it
>>> > again.
>>> >
>>> > Regarding the index trie in my spare-time project, I'm wondering
>>> > whether it makes sense, as currently I'm creating fine-granular
>>> > on-heap nodes during insertions/updates/deletes (1024 per page). Once
>>> > a page is read again from storage, I keep these nodes in an array of
>>> > byte arrays until each node is read for the first time. One thing,
>>> > though, is that the nodes may store strings inline and thus are of
>>> > variable size (and thus the pages are of variable size, too, padded
>>> > to word alignment IIRC).
>>> >
>>> > I'm currently auto-committing after approx 500_000 nodes have been
>>> > created (afterwards they can be garbage collected) and in total there
>>> > are more than 320 million nodes in one test.
>>> >
>>> > I think I could store the nodes in MemorySegments instead of using
>>> > on-heap classes/instances and dynamically reallocate memory if a node
>>> > value is changed.
>>> >
>>> > However, I'm not sure, as it means a lot of work, and maybe off-heap
>>> > memory access is always slightly worse than on-heap!?
>>>
>>> I don't think that's necessarily the case. I mean, array access is the
>>> best: there are more optimizations for it, and the access is more
>>> scrutable to the optimizing compiler.
>>>
>>> If you start using APIs, such as ByteBuffer or MemorySegment, they take
>>> a bit of a hit, depending on usage, as each access has to verify certain
>>> access properties. That said, if your access is "well-behaved" and
>>> SIMD-friendly (e.g. counted loop and such), you can expect performance
>>> of MS/BB to be very good, as all the relevant checks will be hoisted out
>>> of loops.
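>>>
>>> For instance, a simple counted loop like this (just a sketch) is the
>>> kind of access pattern where the bounds and liveness checks get hoisted
>>> and the body can be auto-vectorized:
>>>
>>>     // (assumes java.lang.foreign.MemorySegment and ValueLayout imports)
>>>     static long sum(MemorySegment seg) {
>>>       long sum = 0;
>>>       long n = seg.byteSize() / Long.BYTES;
>>>       for (long i = 0; i < n; i++) {
>>>         sum += seg.getAtIndex(ValueLayout.JAVA_LONG, i);
>>>       }
>>>       return sum;
>>>     }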
>>>
>>> With memory segments, since you can also create unsafe segments on the
>>> fly, we're investigating approaches where you can get (at least in
>>> synthetic benchmarks) the same assembly and performance as raw Unsafe
>>> calls:
>>>
>>> https://mail.openjdk.org/pipermail/panama-dev/2023-July/019487.html
>>>
>>> I think one area that requires a lot of thought when it comes to
>>> off-heap is allocation. The GC is mightily fast at allocating objects,
>>> especially small ones that might die soon. The default implementation
>>> of Arena::allocate uses malloc under the hood, so it's not going to be
>>> anywhere near as fast.
>>>
>>> That said, starting from Java 20 you can define a custom arena with a
>>> better allocation scheme. For instance, if you are allocating in a tight
>>> loop you can write an "allocator" which just recycles memory (see
>>> SegmentAllocator::slicingAllocator). With malloc out of the way the
>>> situation should improve significantly.
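>>>
>>> A minimal sketch of what such a recycling allocator could look like
>>> (the names are made up; callers must not touch old slices after a
>>> reset):
>>>
>>>     // (assumes the java.lang.foreign imports)
>>>     final class RecyclingAllocator {
>>>       private final MemorySegment block;  // allocated once, e.g. per trx
>>>       private SegmentAllocator current;
>>>
>>>       RecyclingAllocator(Arena arena, long size) {
>>>         block = arena.allocate(size, 8);  // malloc is only hit here
>>>         current = SegmentAllocator.slicingAllocator(block);
>>>       }
>>>
>>>       MemorySegment allocate(long size) {
>>>         return current.allocate(size, 8); // hand out consecutive slices
>>>       }
>>>
>>>       void reset() {                      // e.g. after a commit
>>>         current = SegmentAllocator.slicingAllocator(block);
>>>       }
>>>     }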
>>>
>>> Ultimately, picking the right allocation scheme depends on your
>>> workload; there is no one-size-fits-all (as I'm sure you know). But
>>> there should be enough building blocks in the API to allow you to do
>>> what you need.
>>>
>>>
>>> >
>>> > I'll check GC pressure again by logging it, but an IntelliJ profiler
>>> > (async profiler JFR) output of a run to store a big JSON file in
>>> > SirixDB can be seen here:
>>> >
>>> > https://github.com/sirixdb/sirix/blob/refactoring-serialization/JsonShredderTest_testChicago_2023_07_27_131637.jfr
>>> >
>>> > I think I had better performance/latency with Shenandoah (not
>>> > generational), but ZGC was worse in other workloads due to the
>>> > Caffeine caches and it not being generational (but that's changing,
>>> > of course).
>>> >
>>> > So, looking at the profiler output, and probably the flame graph
>>> > where G1 work seems to be prominent, do you think a refactoring to
>>> > MemorySegments would be appropriate, or is it maybe an ideal "big
>>> > data" use case for the generational low-latency GCs, and the number
>>> > of objects is not an issue at all!?
>>>
>>> Hard to say. Generational GCs are very, very good, and object allocation
>>> might be cheaper than you think. Where off-heap (typically) becomes
>>> advantageous is when you need to work with memory-mapped files (and/or
>>> native calls), which is common for database-like use cases.
>>>
>>> Maurizio
>>>
>>> >
>>> > Kind regards
>>> > Johannes
>>>
>>