Off heap vs on heap memory access performance for a DBS
Johannes Lichtenberger
lichtenberger.johannes at gmail.com
Sat Jul 29 13:03:31 UTC 2023
Well, when importing into MongoDB I run into the issue that the import gets
killed after some time:
johannes@luna:~/IdeaProjects/sirix/bundles/sirix-core/src/test/resources/json$
mongoimport --db test1 --collection docs1 --file cityofchicago.json
2023-07-29T14:30:06.819+0200 connected to: mongodb://localhost/
2023-07-29T14:30:09.820+0200 [#.......................] test1.docs1
256MB/3.56GB (7.0%)
2023-07-29T14:30:12.820+0200 [###.....................] test1.docs1
512MB/3.56GB (14.0%)
2023-07-29T14:30:15.819+0200 [######..................] test1.docs1
1024MB/3.56GB (28.1%)
2023-07-29T14:30:18.820+0200 [######..................] test1.docs1
1024MB/3.56GB (28.1%)
2023-07-29T14:30:21.819+0200 [######..................] test1.docs1
1024MB/3.56GB (28.1%)
2023-07-29T14:30:24.820+0200 [#############...........] test1.docs1
2.00GB/3.56GB (56.2%)
2023-07-29T14:30:27.820+0200 [#############...........] test1.docs1
2.00GB/3.56GB (56.2%)
2023-07-29T14:30:30.819+0200 [####################....] test1.docs1
3.00GB/3.56GB (84.3%)
2023-07-29T14:30:33.819+0200 [####################....] test1.docs1
3.00GB/3.56GB (84.3%)
2023-07-29T14:30:36.819+0200 [####################....] test1.docs1
3.00GB/3.56GB (84.3%)
2023-07-29T14:30:39.819+0200 [########################] test1.docs1
3.56GB/3.56GB (100.0%)
2023-07-29T14:32:34.906+0200 [########################] test1.docs1
3.56GB/3.56GB (100.0%)
Killed
Maybe some limit was reached... but I'm sure I managed to import the file some
time ago!?
With PostgreSQL I run into similar issues, somehow (but maybe there's a way
to stream the JSON file into the jsonb column):
jsonb_test=# select pg_read_file('cityofchicago.json', 0, 100);
pg_read_file
--------------------------------------------
{ +
"meta" : { +
"view" : { +
"id" : "ijzp-q8t2", +
"name" : "Crimes - 2001 to present",+
(1 row)
jsonb_test=# select pg_read_file('cityofchicago.json');
ERROR: file length too large
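Probably the usual workaround is to read the file on the client and pass it
as a bind parameter; a rough JDBC sketch (connection details, credentials and
the "docs" table are made up) could look like this, although it still holds
the whole document in memory instead of streaming it:

import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JsonbImport {
  public static void main(String[] args) throws Exception {
    // Read the whole file into memory (so no real streaming into jsonb here).
    String json = Files.readString(Path.of("cityofchicago.json"));
    try (Connection con = DriverManager.getConnection(
            "jdbc:postgresql://localhost/jsonb_test", "postgres", "secret");
         PreparedStatement ps = con.prepareStatement(
            "INSERT INTO docs (doc) VALUES (?::jsonb)")) {
      ps.setString(1, json); // the cast to jsonb happens server-side
      ps.executeUpdate();
    }
  }
}
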
Maybe they're just not meant to store a single large JSON file in the DB
:-D
So that's an advantage of SirixDB, as it has no document semantics but a much
finer granularity...
kind regards
Johannes
On Sat, Jul 29, 2023 at 13:18, Johannes Lichtenberger <
lichtenberger.johannes at gmail.com> wrote:
> Well, thinking about it again today (after sleeping on it), every DBS has to
> write the byte sequences at some point (which we may see prominently in the
> CPU-based flame graph). I'm not even sure if I'm seeing a real issue
> here. That said, I should first check whether other DBSes are actually
> faster at ingesting JSON!? ;-)
>
> So... regarding GC, it may matter, of course, but the serialization or
> writing of the byte "arrays" has to be done at some point, regardless of
> whether I'm storing intermediate in-memory node instances or not!?
>
> Kind regards
> Johannes
>
> On Sat, Jul 29, 2023 at 02:45, Johannes Lichtenberger <
> lichtenberger.johannes at gmail.com> wrote:
>
>> Hi Maurizio,
>>
>> I could allocate a MemorySegment, for instance, for each node (but it
>> must be fast, of course, due to 1024 nodes per page; overall, in one
>> test I have over 300_000_000 nodes). However, I'd have to adapt the
>> neighbor pointers once more nodes are inserted. Maybe this architecture
>> diagram can give some insight (JSON mapped to fine-granular nodes); the
>> nodes are accessed through a trie, as it's a disk-based DBS (mainly an
>> array stored in a tree structure to fetch nodes by their 64-bit
>> identifier):
>> https://raw.githubusercontent.com/sirixdb/sirixdb.github.io/master/images/sirix-json-tree-encoding.png
>>
>> Thus, while adding nodes, at least the relationships between the nodes
>> have to be adapted (long references):
>> https://github.com/sirixdb/sirix/blob/master/bundles/sirix-core/src/main/java/io/sirix/node/delegates/StructNodeDelegate.java
>> All of this can be done without needing either a resized MemorySegment or
>> a byte array. Still, once the read-write trx sets a different name for an
>> object field value or a field itself, the String is of variable length.
>> The MemorySegment or byte array may have to grow, but these cases are
>> rare, so creating a new byte array, for instance, may be acceptable.
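>>
>> A rough sketch of what I have in mind for the fixed-size part of a node
>> (the layout and field names here are just illustrative, not the actual
>> SirixDB node format):
>>
>> import java.lang.foreign.Arena;
>> import java.lang.foreign.MemoryLayout;
>> import java.lang.foreign.MemorySegment;
>> import java.lang.foreign.ValueLayout;
>> import static java.lang.foreign.MemoryLayout.PathElement.groupElement;
>>
>> class OffHeapNodeSketch {
>>   // Fixed-size "struct" part of a node; variable-length data (inlined
>>   // strings) would have to live at the end of the segment or elsewhere.
>>   static final MemoryLayout NODE_LAYOUT = MemoryLayout.structLayout(
>>       ValueLayout.JAVA_LONG.withName("nodeKey"),
>>       ValueLayout.JAVA_LONG.withName("parentKey"),
>>       ValueLayout.JAVA_LONG.withName("leftSiblingKey"),
>>       ValueLayout.JAVA_LONG.withName("rightSiblingKey"),
>>       ValueLayout.JAVA_LONG.withName("firstChildKey"));
>>
>>   static final long RIGHT_SIBLING_OFFSET =
>>       NODE_LAYOUT.byteOffset(groupElement("rightSiblingKey"));
>>
>>   public static void main(String[] args) {
>>     try (Arena arena = Arena.ofConfined()) {
>>       MemorySegment node = arena.allocate(NODE_LAYOUT);
>>       // Adapting a neighbor pointer is just an in-place write of a long,
>>       // so no resizing is needed for that.
>>       node.set(ValueLayout.JAVA_LONG, RIGHT_SIBLING_OFFSET, 42L);
>>     }
>>   }
>> }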
>>
>> I'd, however, have to rewrite a lot of stuff (and due to switching jobs
>> to embedded C/C++ work, I'm not sure I'll be able to do it soon :-/ I hope
>> for an easier solution somehow, but can't think of one):
>> https://github.com/sirixdb/sirix/blob/fcefecc32567b6c4ead38d38ebeae3c050fab6f2/bundles/sirix-core/src/main/java/io/sirix/access/trx/page/NodePageTrx.java#L212
>>
>> Another tricky thing is the variable-length pages, if I were to switch
>> completely to storing each page as one MemorySegment (instead of only the
>> fine-granular nodes as proposed above).
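>>
>> A naive way to handle that would be a small wrapper which reallocates and
>> copies once a page segment runs out of space (just a sketch; names and
>> sizes are made up):
>>
>> import java.lang.foreign.Arena;
>> import java.lang.foreign.MemorySegment;
>>
>> // Very naive "growable" page segment: reallocate and copy when full.
>> final class GrowablePage {
>>   private final Arena arena;
>>   private MemorySegment segment;
>>   private long used;
>>
>>   GrowablePage(Arena arena, long initialSize) {
>>     this.arena = arena;
>>     this.segment = arena.allocate(initialSize);
>>   }
>>
>>   // Hands out a slice for the next node, growing the backing segment if needed.
>>   MemorySegment reserve(long bytes) {
>>     if (used + bytes > segment.byteSize()) {
>>       MemorySegment bigger =
>>           arena.allocate(Math.max(segment.byteSize() * 2, used + bytes));
>>       MemorySegment.copy(segment, 0, bigger, 0, used);
>>       segment = bigger; // the old segment stays alive until the arena is closed
>>     }
>>     MemorySegment slice = segment.asSlice(used, bytes);
>>     used += bytes;
>>     return slice;
>>   }
>> }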
>>
>> But judging by the flame graph, serialization is one of the most
>> time-intensive tasks, right? Even though I'm already parallelizing it for
>> the leaf data pages as one of the first tasks during a commit, your
>> proposed approach of matching the in-memory layout to the serialization
>> layout may be the only way to get better latency.
>>
>> kind regards
>> Johannes
>>
>> On Sat, Jul 29, 2023 at 01:22, Maurizio Cimadamore <
>> maurizio.cimadamore at oracle.com> wrote:
>>
>>> If you need to serialize big chunks of memory in files, I think you are
>>> in the use case Brian S was describing - e.g. you need some kind of memory
>>> mapped solution.
>>>
>>> E.g. you probably want the memory layout to match the serialized layout,
>>> so that your memory operations can be persisted directly on the disk (e.g.
>>> calling MS::force).
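>>>
>>> A minimal sketch of what that could look like (file name and size are
>>> made up):
>>>
>>> import java.io.IOException;
>>> import java.lang.foreign.Arena;
>>> import java.lang.foreign.MemorySegment;
>>> import java.lang.foreign.ValueLayout;
>>> import java.nio.channels.FileChannel;
>>> import java.nio.file.Path;
>>> import java.nio.file.StandardOpenOption;
>>>
>>> class MappedPageSketch {
>>>   public static void main(String[] args) throws IOException {
>>>     try (Arena arena = Arena.ofConfined();
>>>          FileChannel channel = FileChannel.open(Path.of("pages.bin"),
>>>              StandardOpenOption.CREATE, StandardOpenOption.READ,
>>>              StandardOpenOption.WRITE)) {
>>>       // The mapped segment *is* the serialized layout: writes go to the
>>>       // OS page cache without a separate serialization step.
>>>       MemorySegment page =
>>>           channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096, arena);
>>>       page.set(ValueLayout.JAVA_LONG, 0, 42L);
>>>       page.force(); // persist the dirty pages to disk
>>>     }
>>>   }
>>> }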
>>>
>>> That said, if you have the requirement to go back and forth between byte
>>> arrays and off-heap memory, you might be in trouble, because there's no
>>> way to "wrap an array" over a piece of off-heap memory. You would have to
>>> allocate and then copy, which is very expensive if you have a lot of data.
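>>>
>>> E.g. something like this always allocates a fresh array and does a bulk
>>> copy (just to illustrate the cost):
>>>
>>> import java.lang.foreign.Arena;
>>> import java.lang.foreign.MemorySegment;
>>> import java.lang.foreign.ValueLayout;
>>>
>>> class CopyBackSketch {
>>>   public static void main(String[] args) {
>>>     try (Arena arena = Arena.ofConfined()) {
>>>       MemorySegment offHeap = arena.allocate(1024);
>>>       // There is no way to view the off-heap memory as a byte[]; toArray
>>>       // allocates a new on-heap array and copies every byte.
>>>       byte[] onHeap = offHeap.toArray(ValueLayout.JAVA_BYTE);
>>>       // The reverse direction copies as well.
>>>       MemorySegment.copy(onHeap, 0, offHeap, ValueLayout.JAVA_BYTE, 0,
>>>           onHeap.length);
>>>     }
>>>   }
>>> }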
>>>
>>> Maurizio
>>> On 28/07/2023 22:45, Johannes Lichtenberger wrote:
>>>
>>> I think the main issue is not even allocation or GC, but instead the
>>> serialization of, in this case, over 300_000_000 nodes in total.
>>>
>>> I've attached a screenshot showing the flamegraph of the profiler output
>>> I posted the link to...
>>>
>>> What do you think? For starters, the page already has a slot array to
>>> store byte arrays for the nodes. The whole page should probably be
>>> something like a growable MemorySegment, but I could probably first try
>>> to simply write byte arrays directly and read from them using something
>>> like Chronicle Bytes or even MemorySegments, and then convert them back
>>> to byte arrays?
>>>
>>> Kind regards
>>> Johannes
>>>
>>> Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote on Fri,
>>> Jul 28, 2023, 11:38:
>>>
>>>>
>>>> On 28/07/2023 08:15, Johannes Lichtenberger wrote:
>>>> > Hello,
>>>> >
>>>> > I think I mentioned it already, but currently I'm thinking about it
>>>> again.
>>>> >
>>>> > Regarding the index trie in my spare-time project, I'm wondering if it
>>>> > makes sense, as currently I'm creating fine-granular on-heap nodes
>>>> > during insertions/updates/deletes (1024 per page). Once a page is read
>>>> > again from storage, I'm storing these nodes in a byte array of byte
>>>> > arrays until they are read for the first time. One thing, though, is
>>>> > that the nodes may store strings inline and thus are of variable size
>>>> > (and thus the pages are of variable size, too, padded to word
>>>> > alignment IIRC).
>>>> >
>>>> > I'm currently auto-committing after approx 500_000 nodes have been
>>>> > created (afterwards they can be garbage collected) and in total there
>>>> > are more than 320 million nodes in one test.
>>>> >
>>>> > I think I could store the nodes in MemorySegments instead of using
>>>> > on-heap classes/instances and dynamically reallocate memory if a node
>>>> > value is changed.
>>>> >
>>>> > However, I'm not sure, as it means a lot of work, and maybe off-heap
>>>> > memory access is always slightly worse than on-heap!?
>>>>
>>>> I don't think that's necessarily the case. I mean, array access is the
>>>> best; there are more optimizations for it, and the access is more
>>>> scrutable to the optimizing compiler.
>>>>
>>>> If you start using APIs, such as ByteBuffer or MemorySegment, they take
>>>> a bit of a hit, depending on usage, as each access has to verify
>>>> certain
>>>> access properties. That said, if your access is "well-behaved" and
>>>> SIMD-friendly (e.g. counted loop and such), you can expect performance
>>>> of MS/BB to be very good, as all the relevant checks will be hoisted
>>>> out
>>>> of loops.
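>>>>
>>>> For instance, a simple counted loop over a segment like the one below
>>>> should end up with the checks hoisted (just an illustration, not a
>>>> benchmark):
>>>>
>>>> import java.lang.foreign.Arena;
>>>> import java.lang.foreign.MemorySegment;
>>>> import java.lang.foreign.ValueLayout;
>>>>
>>>> class CountedLoopSketch {
>>>>   // Sums all longs in the segment; bounds/liveness checks can be
>>>>   // hoisted out of this counted loop by the JIT.
>>>>   static long sum(MemorySegment segment) {
>>>>     long sum = 0;
>>>>     for (long offset = 0; offset < segment.byteSize(); offset += Long.BYTES) {
>>>>       sum += segment.get(ValueLayout.JAVA_LONG, offset);
>>>>     }
>>>>     return sum;
>>>>   }
>>>>
>>>>   public static void main(String[] args) {
>>>>     try (Arena arena = Arena.ofConfined()) {
>>>>       MemorySegment segment = arena.allocate(1024 * Long.BYTES, Long.BYTES);
>>>>       System.out.println(sum(segment));
>>>>     }
>>>>   }
>>>> }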
>>>>
>>>> With memory segments, since you can also create unsafe segments on the
>>>> fly, we're investigating approaches where you can get (at least in
>>>> synthetic benchmarks) the same assembly and performance as raw Unsafe
>>>> calls:
>>>>
>>>> https://mail.openjdk.org/pipermail/panama-dev/2023-July/019487.html
>>>>
>>>> I think one area that requires a lot of thought when it comes to
>>>> off-heap is allocation. The GC is mightily fast at allocating objects,
>>>> especially small ones that might die soon. The default implementation
>>>> of Arena::allocate uses malloc under the hood, so it's not going to be
>>>> anywhere near as fast.
>>>>
>>>> That said, starting from Java 20 you can define a custom arena with a
>>>> better allocation scheme. For instance, if you are allocating in a
>>>> tight
>>>> loop you can write an "allocator" which just recycles memory (see
>>>> SegmentAllocator::slicingAllocator). With malloc out of the way the
>>>> situation should improve significantly.
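>>>>
>>>> A rough sketch of that pattern (sizes are arbitrary):
>>>>
>>>> import java.lang.foreign.Arena;
>>>> import java.lang.foreign.MemorySegment;
>>>> import java.lang.foreign.SegmentAllocator;
>>>> import java.lang.foreign.ValueLayout;
>>>>
>>>> class SlicingAllocatorSketch {
>>>>   public static void main(String[] args) {
>>>>     try (Arena arena = Arena.ofConfined()) {
>>>>       // One big upfront allocation; the slicing allocator then hands out
>>>>       // slices of it without going through malloc again.
>>>>       MemorySegment block = arena.allocate(1024 * 1024);
>>>>       SegmentAllocator allocator = SegmentAllocator.slicingAllocator(block);
>>>>       for (int i = 0; i < 1_000; i++) {
>>>>         MemorySegment node = allocator.allocate(64); // cheap: bumps an offset
>>>>         node.set(ValueLayout.JAVA_LONG, 0, i);       // ... fill the node ...
>>>>       }
>>>>     }
>>>>   }
>>>> }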
>>>>
>>>> Ultimately, picking the right allocation scheme depends on your
>>>> workload; there is no one-size-fits-all (as I'm sure you know). But
>>>> there should be enough building blocks in the API to allow you to do
>>>> what you need.
>>>>
>>>>
>>>> >
>>>> > I'll check GC pressure again by logging it, but an IntelliJ profiler
>>>> > (async profiler JFR) output of a run to store a big JSON file in
>>>> > SirixDB can be seen here:
>>>> >
>>>> https://github.com/sirixdb/sirix/blob/refactoring-serialization/JsonShredderTest_testChicago_2023_07_27_131637.jfr
>>>> >
>>>> > I think I had better performance/latency with Shenandoah (not
>>>> > generational), but ZGC was worse in other workloads due to the
>>>> > Caffeine caches and it not being generational (but that's changing,
>>>> > of course).
>>>> >
>>>> > So, by looking at the profiler output and probably the flame graph,
>>>> > where G1 work seems to be prominent, do you think a refactoring to use
>>>> > MemorySegments would be appropriate, or is it maybe an ideal "big
>>>> > data" use case for the generational low-latency GCs, and the amount of
>>>> > objects is not an issue at all!?
>>>>
>>>> Hard to say. Generational GCs are very very good. And object allocation
>>>> might be cheaper than you think. Where off-heap becomes advantageous is
>>>> (typically) if you need to work with memory mapped files (and/or native
>>>> calls), which is common for database-like use cases.
>>>>
>>>> Maurizio
>>>>
>>>> >
>>>> > Kind regards
>>>> > Johannes
>>>>
>>>