MemorySegment off-heap usage and GC

Johannes Lichtenberger lichtenberger.johannes at gmail.com
Wed Sep 18 16:36:14 UTC 2024


Hi Maurizio,

there are about 310_000_000 nodes created in this test, and thus as many
slots. In our architecture updates are cheap, due to the versioning of the
leaf pages, the dynamic page size, and the fine-grained nodes/slots. You're
right that the heap is now using more memory because of the byte array plus
the MemorySegment wrapping it for each slot (in the future the parameter
for setSlot will usually be a MemorySegment anyway). But I'm not sure I
really have to wrap the byte array in a MemorySegment only to copy it to
the right place in the big slots MemorySegment of the page. In the worst
case I could simply loop over the array and copy byte by byte, so on-heap
usage should be similar to, or slightly less than, the current approach in
main, which uses a byte array of byte arrays for the slots.
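
For what it's worth, neither the wrapper nor a byte-by-byte loop should be
necessary: MemorySegment.copy has an overload that copies straight from a
heap array into a target segment. A minimal sketch of what I mean (the
names slotsSegment, slotData and offset are just placeholders here, not the
actual SirixDB fields):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;

    class SlotCopyExample {
        public static void main(String[] args) {
            try (Arena arena = Arena.ofConfined()) {
                // Hypothetical page-level slots segment (size chosen arbitrarily).
                MemorySegment slotsSegment = arena.allocate(4096);

                byte[] slotData = "some serialized node".getBytes();
                long offset = 128; // where this slot starts within the page

                // Copies the heap array directly into the off-heap segment,
                // without wrapping it in MemorySegment.ofArray(...) first.
                MemorySegment.copy(slotData, 0, slotsSegment,
                        ValueLayout.JAVA_BYTE, offset, slotData.length);
            }
        }
    }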

The KeyValueLeafPages are cached in Caffeine caches once read from disk,
but I think that should be ok.

Some further background about the main storage:

The current architecture is that, during shredding / insertion of JSON
data, these nodes are created as instances of classes representing the
different node types in JSON (Array, Object, ObjectField, String, Boolean,
Null, ...) and stored in an index (a simple trie). This trie consists of
IndirectPages, each referencing up to 1024 further IndirectPages on the
next level, or KeyValueLeafPages at the bottom of the tree (with at most
1024 slots of assigned or deserialized DataRecords). The unique nodeKey for
each node is assigned monotonically increasing. Thus, at first there is no
IndirectPage; once node 1024 is created, an IndirectPage is constructed at
the top of the tree with a reference to the KeyValueLeafPage storing the
1024 nodes which have already been created. The IndirectPages are rather
small in comparison, but kept on-heap.
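
Roughly, the mapping from a nodeKey to its place in the trie works like
base-1024 digits. A simplified sketch of the idea (not the actual SirixDB
code; PAGE_SIZE and the method names are made up for illustration):

    class TrieAddressingExample {
        static final int PAGE_SIZE = 1024; // refs per IndirectPage / slots per leaf page

        // Index of the slot inside the KeyValueLeafPage at the bottom of the trie.
        static int slotIndex(long nodeKey) {
            return (int) (nodeKey % PAGE_SIZE);
        }

        // Index of the reference to follow on a given IndirectPage level
        // (level 0 is the level directly above the leaf pages).
        static int referenceIndex(long nodeKey, int level) {
            long leafPageNumber = nodeKey / PAGE_SIZE;
            for (int i = 0; i < level; i++) {
                leafPageNumber /= PAGE_SIZE;
            }
            return (int) (leafPageNumber % PAGE_SIZE);
        }

        public static void main(String[] args) {
            long nodeKey = 310_000_000L;
            System.out.println("leaf slot:   " + slotIndex(nodeKey));
            System.out.println("level-0 ref: " + referenceIndex(nodeKey, 0));
            System.out.println("level-1 ref: " + referenceIndex(nodeKey, 1));
        }
    }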

Another architectural consideration is that the KeyValueLeafPages are
versioned: if nodes / slots therein get updated or deleted, a new
KeyValueLeafPage fragment is added and the path to the root is copied (at
the top of each revision sits a RevisionRootPage). The IndirectPage then
references two leaf page fragments on disk (offsets), and they have to be
recombined after being fetched.
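
Conceptually, recombining fragments just means taking each slot from the
newest fragment that contains it. A very reduced sketch of that idea (the
real versioning/recombination logic is of course more involved):

    class PageRecombinationExample {
        static final int SLOT_COUNT = 1024;

        // fragments[0] is the newest page fragment, fragments[n-1] the oldest.
        // A null entry means the fragment does not contain that slot.
        static byte[][] recombine(byte[][][] fragments) {
            byte[][] combined = new byte[SLOT_COUNT][];
            for (int slot = 0; slot < SLOT_COUNT; slot++) {
                for (byte[][] fragment : fragments) {
                    if (fragment[slot] != null) {
                        combined[slot] = fragment[slot];
                        break; // newest fragment wins
                    }
                }
            }
            return combined;
        }
    }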

As you can imagine, a lot of byte arrays are allocated on heap due to the
fine-grained nodes. Even more so, as each page is first read into a byte
array, then decompressed into another, and then deserialized, which fills
the slots array. Later, when nodes are fetched, the individual slots are
deserialized into DataRecords, which are cached in yet another array in the
KeyValueLeafPage.
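
To make that allocation chain concrete, per page read it currently looks
roughly like this (a schematic sketch only; java.util.zip stands in for the
actual compression, and the method names are placeholders, not the real
SirixDB code):

    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    class PageReadSketch {
        static byte[][] readPage(byte[] compressedFromDisk, int uncompressedSize)
                throws DataFormatException {
            // 1) one on-heap array holding the raw page bytes (compressedFromDisk)
            // 2) a second on-heap array for the decompressed page
            byte[] decompressed = new byte[uncompressedSize];
            Inflater inflater = new Inflater();
            inflater.setInput(compressedFromDisk);
            inflater.inflate(decompressed);
            inflater.end();

            // 3) deserialization then fills the slots array: one more byte[] per slot
            return deserializeSlots(decompressed);
        }

        static byte[][] deserializeSlots(byte[] page) {
            // placeholder for the actual deserialization logic
            return new byte[1024][];
        }
    }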

I would hope to reduce GC pressure and allocations with a bigger
refactoring: one slots MemorySegment per KeyValueLeafPage (off-heap), and
during insertion smaller MemorySegments for each node as a constructor
parameter, which of course have to be copied into the slots MemorySegment
at the end, during serialization. During deserialization the nodes would
get a slice of the big slots MemorySegment, and as a plus we'd also get rid
of most of the per-node serialization/deserialization as it is currently
done in:

https://github.com/sirixdb/sirix/blob/update-slots-to-memorysegment/bundles/sirix-core/src/main/java/io/sirix/node/NodeKind.java
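
To illustrate the direction (a rough sketch of the idea, not the code in
the branch; the class and field names are invented here): each
KeyValueLeafPage would own one off-heap slots segment, and deserialized
nodes would only see zero-copy slices of it:

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;

    class KeyValueLeafPageSketch implements AutoCloseable {
        private final Arena arena = Arena.ofShared();
        private final MemorySegment slots;  // one big off-heap segment for all slot data
        private final long[] slotOffsets;   // start offset of each slot within 'slots'
        private final long[] slotLengths;

        KeyValueLeafPageSketch(long pageSizeInBytes, int maxSlots) {
            this.slots = arena.allocate(pageSizeInBytes);
            this.slotOffsets = new long[maxSlots];
            this.slotLengths = new long[maxSlots];
        }

        // During deserialization a node just gets a zero-copy view of its slot.
        MemorySegment slot(int index) {
            return slots.asSlice(slotOffsets[index], slotLengths[index]);
        }

        @Override
        public void close() {
            arena.close(); // frees the off-heap memory, e.g. on cache eviction
        }
    }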

I hope that my description sheds some more light.

This should basically be a very, very reduced example of the usage I'm
envisioning (though in Kotlin and not up to date):
https://github.com/GavinRay97/rosetta-db/blob/master/kotlin/app/src/main/kotlin/rosetta/db/App.kt


I've simply switched back to ZGC + GraalVM for now; ZGC is already a much
better fit for the current workload, even in the main branch. But I think
the design with the MemorySegments and off-heap storage probably makes more
sense architecturally.

That said, it's quite some work for me to update all the nodes to work with
MemorySegments and to change serialization/deserialization as well... if
you say it makes no sense, I can save a lot of my spare time (and I'm also
a software engineer in my day-to-day job ;))

I think the example from Gavin Ray should be a good, very simplified
version; maybe it helps your understanding.

Kind regards
Johannes

Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote on Mon, 16
Sept 2024 at 22:59:

> Do you have evidence that on-heap usage has gotten worse in the rewrite?
> Let's leave aside the cryptic GC failures, as I don't think they have
> directly anything to do with memory segment (given you said you are not
> using JNI nor FFM - maybe some library you depend on is... hard to say).
>
> I suspect it's probably hard to say if on-heap is better or worse, given
> that before the DB storage was on-heap, and now it's off-heap. But if the
> storage is off-heap, why is there 4GB on-heap data?
>
> Anyway, my suggestion is to keep chipping away at the problem (not now,
> when you have the time and energy of course!) until you can e.g. reproduce
> what you are seeing with something a bit simpler than the full application.
> When we get there, we'd probably be in a better position to help. At this
> point in time there seem to be too many variables that are influencing the
> discussion, and it's not clear to me that going down the "memory segment is
> the culprit" route will be helpful. The rewrite clearly did something (bad)
> to the overall performance of the application - but I don't think you have
> 4GB of small segment wrappers around heap arrays (or, if that's the case,
> you have bigger problems going on :-) ).
>
> I'm also not 100% sure that the assumption: there's more GC activity, so
> that is the explanation of the 3x slowdown is valid (again, we need a
> smoking gun). Sometimes you get more GC, but that doesn't affect things too
> much - because a lot of the objects are short-lived and
> allocation/deallocation in the TLAB is so cheap. Also, by moving data
> off-heap you should (at least in principle) have reduced the load on the
> GC somewhat.
>
> Maurizio
> On 16/09/2024 21:40, Johannes Lichtenberger wrote:
>
> It stays at 4 GB max (I've configured 8 GB max, so this should be OK, but
> there are always Evacuation Failure GCs). Currently I don't have the energy
> to look further into it due to a cold, but I could upload the 3.8 GB JSON
> file somewhere if needed for the test to run, if that helps...
>
> On Mon, 16 Sept 2024 at 19:03, Maurizio Cimadamore <
> maurizio.cimadamore at oracle.com> wrote:
>
>>
>> On 16/09/2024 17:49, Johannes Lichtenberger wrote:
>> > but I thought it shouldn't be that runtime for shredding a resource
>> > (importing JSON data) suddenly is 3x worse even now in the middle of a
>> > bigger refactoring in my spare time
>>
>> Ok, so there's a new bullet:
>>
>> * the _execution time_ (e.g. throughput) with the memory segment version
>> is now 3x slower than w/o memory segment
>>
>> What does the memory usage look like? I understand that you see issues
>> with GC failures due to "pinning" - but does the overall heap usage seem
>> ok? Or has heap usage also increased?
>>
>> Maurizio
>>
>>