Off-heap vs. on-heap memory access performance for a DBS

Johannes Lichtenberger lichtenberger.johannes at gmail.com
Fri Jul 28 21:57:01 UTC 2023


Well, I also need to be able to change the contents of the byte arrays
after the initial creation (for instance, changing node references such
as leftSiblingNodeKey/rightSiblingNodeKey/firstChildKey/lastChildKey...
string values, booleans, numbers, null values...).
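
For what it's worth, here is a minimal sketch of what such an in-place
field update could look like with the finalized java.lang.foreign API
shape (the offset, class, and field names are made up for illustration;
the real records are variable-sized, so offsets would have to come from
the page's slot array):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_LONG_UNALIGNED;

class NodeFieldUpdate {
    // Hypothetical offset of a fixed-width reference field in a node.
    static final long RIGHT_SIBLING_OFFSET = 8;

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment node = arena.allocate(32);
            // In-place update of one field: no reallocation and no full
            // deserialization of the node required.
            node.set(JAVA_LONG_UNALIGNED, RIGHT_SIBLING_OFFSET, 42L);
            System.out.println(
                node.get(JAVA_LONG_UNALIGNED, RIGHT_SIBLING_OFFSET));
        }
    }
}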

Johannes Lichtenberger <lichtenberger.johannes at gmail.com> wrote on
Fri, Jul 28, 2023, 23:45:

> I think the main issue is not even allocation or GC, but instead the
> serialization of, in this case, over 300_000_000 nodes in total.
>
> I've attached a screenshot showing the flamegraph of the profiler output I
> posted the link to...
>
> What do you think? For starters, the page already has a slot array to
> store byte arrays for the nodes. The whole page should probably be
> something like a growable MemorySegment, but I could probably first try
> to simply write byte arrays directly and read from them using something
> like Chronicle Bytes or even MemorySegments, then convert them back to
> byte arrays?
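>
> For illustration, a rough sketch of the zero-copy route (wrapping the
> existing slot byte arrays in MemorySegments; the names are made up):
>
> import java.lang.foreign.MemorySegment;
> import static java.lang.foreign.ValueLayout.JAVA_BYTE;
> import static java.lang.foreign.ValueLayout.JAVA_INT_UNALIGNED;
>
> class SlotAccess {
>     public static void main(String[] args) {
>         byte[] slot = new byte[16];
>         // Zero-copy view over the existing on-heap slot array:
>         MemorySegment seg = MemorySegment.ofArray(slot);
>         // Read/write primitives directly instead of deserializing the
>         // whole node (unaligned layouts, since it's a byte[] view):
>         seg.set(JAVA_INT_UNALIGNED, 0, 123);
>         int value = seg.get(JAVA_INT_UNALIGNED, 0);
>         // Copy out to a byte[] only where an array is still required:
>         byte[] copy = seg.toArray(JAVA_BYTE);
>         System.out.println(value + " " + copy.length);
>     }
> }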
>
> Kind regards
> Johannes
>
> Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote on Fri,
> Jul 28, 2023, 11:38:
>
>>
>> On 28/07/2023 08:15, Johannes Lichtenberger wrote:
>> > Hello,
>> >
>> > I think I mentioned it already, but currently I'm thinking about it
>> > again.
>> >
>> > Regarding the index trie in my spare-time project, I'm wondering if
>> > it makes sense, as currently I'm creating fine-grained on-heap nodes
>> > during insertions/updates/deletes (1024 per page). Once a page is
>> > read again from storage, I'm storing these nodes in an array of byte
>> > arrays until each node is read for the first time. One thing to note
>> > is that the nodes may store strings inline and are thus of variable
>> > size (and thus the pages are of variable size, too, padded to word
>> > alignment IIRC).
>> >
>> > I'm currently auto-committing after approx. 500_000 nodes have been
>> > created (afterwards, they can be garbage collected), and in total
>> > there are more than 320 million nodes in one test.
>> >
>> > I think I could store the nodes in MemorySegments instead of using
>> > on-heap classes/instances and dynamically reallocate memory if a
>> > node value is changed.
>> >
>> > However, I'm not sure, as it means a lot of work, and maybe off-heap
>> > memory access is always slightly worse than on-heap!?
>>
>> I don't think that's necessarily the case. I mean, array access is the
>> best: there are more optimizations for it, and the access is more
>> scrutable to the optimizing compiler.
>>
>> If you start using APIs such as ByteBuffer or MemorySegment, they take
>> a bit of a hit, depending on usage, as each access has to verify
>> certain access properties. That said, if your access is "well-behaved"
>> and SIMD-friendly (e.g. counted loops and such), you can expect the
>> performance of MS/BB to be very good, as all the relevant checks will
>> be hoisted out of loops.
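>>
>> For example, a minimal sketch of such a counted loop (summing ints out
>> of a segment; the bound is loop-invariant and the stride is constant,
>> so C2 can hoist the liveness/bounds checks and vectorize the body):
>>
>> import java.lang.foreign.Arena;
>> import java.lang.foreign.MemorySegment;
>> import static java.lang.foreign.ValueLayout.JAVA_INT;
>>
>> class SumLoop {
>>     static long sum(MemorySegment seg) {
>>         long sum = 0;
>>         // Counted loop: checks hoist out, body can be SIMDized.
>>         for (long off = 0; off < seg.byteSize(); off += 4) {
>>             sum += seg.get(JAVA_INT, off);
>>         }
>>         return sum;
>>     }
>>
>>     public static void main(String[] args) {
>>         try (Arena arena = Arena.ofConfined()) {
>>             MemorySegment seg = arena.allocate(1024, 4);
>>             System.out.println(sum(seg)); // 0, memory is zeroed
>>         }
>>     }
>> }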
>>
>> With memory segments, since you can also create unsafe segments on the
>> fly, we're investigating approaches where you can get (at least in
>> synthetic benchmarks) the same assembly and performance as raw Unsafe
>> calls:
>>
>> https://mail.openjdk.org/pipermail/panama-dev/2023-July/019487.html
>>
>> I think one area that requires a lot of thought when it comes to
>> off-heap is allocation. The GC is mighty fast at allocating objects,
>> especially small ones that might die soon. The default implementation
>> of Arena::allocate uses malloc under the hood, so it's not going to be
>> anywhere near as fast.
>>
>> That said, starting from Java 20 you can define a custom arena with a
>> better allocation scheme. For instance, if you are allocating in a
>> tight loop, you can write an "allocator" which just recycles memory
>> (see SegmentAllocator::slicingAllocator). With malloc out of the way,
>> the situation should improve significantly.
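>>
>> A rough sketch of that idea (one big segment allocated up front, then
>> cheap pointer-bump slices; a fresh slicing allocator over the same
>> segment can be created per batch to reuse the memory):
>>
>> import java.lang.foreign.Arena;
>> import java.lang.foreign.MemorySegment;
>> import java.lang.foreign.SegmentAllocator;
>>
>> class SlicingDemo {
>>     public static void main(String[] args) {
>>         try (Arena arena = Arena.ofConfined()) {
>>             // One upfront allocation (a single malloc)...
>>             MemorySegment pool = arena.allocate(1 << 20);
>>             // ...then slices instead of per-node mallocs:
>>             SegmentAllocator allocator =
>>                 SegmentAllocator.slicingAllocator(pool);
>>             for (int i = 0; i < 1_000; i++) {
>>                 MemorySegment node = allocator.allocate(64);
>>                 // ... write node data into the slice ...
>>             }
>>         } // freeing is one cheap operation: closing the arena
>>     }
>> }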
>>
>> Ultimately, picking the right allocation scheme depends on your
>> workload; there is no one-size-fits-all (as I'm sure you know). But
>> there should be enough building blocks in the API to allow you to do
>> what you need.
>>
>>
>> >
>> > I'll check GC pressure again by logging it, but an IntelliJ profiler
>> > (async-profiler JFR) output of a run to store a big JSON file in
>> > SirixDB can be seen here:
>> >
>> > https://github.com/sirixdb/sirix/blob/refactoring-serialization/JsonShredderTest_testChicago_2023_07_27_131637.jfr
>> >
>> > I think I had better performance/latency with Shenandoah (not
>> > generational), but ZGC was worse in other workloads due to Caffeine
>> > caches and it not being generational (but that's changing, of
>> > course).
>> >
>> > So, looking at the profiler output and the flame graph, where G1
>> > work seems to be prominent, do you think a refactoring to
>> > MemorySegments would be appropriate, or is it maybe an ideal "big
>> > data" use case for the generational low-latency GCs, such that the
>> > number of objects is not an issue at all!?
>>
>> Hard to say. Generational GCs are very, very good. And object
>> allocation might be cheaper than you think. Where off-heap becomes
>> advantageous is (typically) when you need to work with memory-mapped
>> files (and/or native calls), which is common for database-like use
>> cases.
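>>
>> For instance, a minimal sketch of mapping a file region straight into
>> a MemorySegment (file name and sizes are made up):
>>
>> import java.io.IOException;
>> import java.lang.foreign.Arena;
>> import java.lang.foreign.MemorySegment;
>> import java.nio.channels.FileChannel;
>> import java.nio.file.Path;
>> import static java.nio.file.StandardOpenOption.*;
>> import static java.lang.foreign.ValueLayout.JAVA_BYTE;
>>
>> class MappedPage {
>>     public static void main(String[] args) throws IOException {
>>         try (Arena arena = Arena.ofConfined();
>>              FileChannel ch = FileChannel.open(Path.of("pages.db"),
>>                      CREATE, READ, WRITE)) {
>>             // Map one 4 KiB page; reads and writes go straight to the
>>             // OS page cache, no copying into a byte[] needed:
>>             MemorySegment page =
>>                 ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096, arena);
>>             page.set(JAVA_BYTE, 0, (byte) 1);
>>         } // unmapped deterministically when the arena closes
>>     }
>> }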
>>
>> Maurizio
>>
>> >
>> > Kind regards
>> > Johannes
>>
>