MemorySegment off-heap usage and GC

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Mon Sep 16 15:58:37 UTC 2024


Hi Johannes,
I'm trying to uplevel as much as possible here. Is this correct:

1. your application, even when using a byte[][] backing storage, 
already had an allocation issue
2. it is not clear from the information you shared where this 
allocation issue is coming from (it predates memory segments)
3. when you made the switch to use memory segments instead of 
byte[][], things got worse, not better.

Does that accurately reflect your case? IMHO, the crux of the issue is 
(1)/(2). If there was already some allocation issue in your 
application/framework, then adopting memory segments is unlikely to make 
that disappear (esp. with the kind of code we're staring at right now, 
which I think is allocating _more_ temp objects in the heap).
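
For example, a pattern along these lines (purely illustrative, not your 
actual code) allocates a fresh byte[] on every slot access, so a memory 
segment backing store ends up allocating at least as much as the 
byte[][] did:

    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;

    class CopyingReads {
        // copies the slot into a temporary on-heap byte[] on every access
        static byte[] readSlot(MemorySegment slots, long offset, int length) {
            return slots.asSlice(offset, length).toArray(ValueLayout.JAVA_BYTE);
        }
    }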

You referred to these big traversals several times. What does a 
traversal do? In principle, if your data is already in memory, I'd 
expect a traversal not to allocate any memory (regardless of the backing 
storage being used).
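
For instance, reading node fields in place, along these lines (just a 
sketch, with made-up names and offsets), should not allocate at all:

    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;

    class InPlaceReads {
        // reads a field straight out of the segment; no temporary objects
        // (unaligned layout, since slot data may not be 8-byte aligned)
        static long firstChildKey(MemorySegment slots, long nodeOffset) {
            return slots.get(ValueLayout.JAVA_LONG_UNALIGNED, nodeOffset);
        }
    }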

So I feel like I probably don't understand what's going on :-)

It would be beneficial for the discussion to come up with some 
simplified model of how the code used to work before (using some mock 
pseudo code and data structures), which problems you identified, and why 
and how you thought using a memory segment would improve over that. This 
might also allow people (other than me!) to provide more feedback.
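
Even something as small as this (completely made up, just to show the 
level of detail I mean) would help:

    // hypothetical mock, not SirixDB's real types
    class Page {
        byte[][] slots;             // before: one byte[] per record
        // MemorySegment slots;     // after: one segment per page?
    }

    interface NodeTraversal {
        void preorder(Page rootPage); // what gets allocated here, and why?
    }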

Maurizio



On 16/09/2024 16:23, Johannes Lichtenberger wrote:
>
> Hi Maurizio,
>
> thanks for all the input! AFAICT I'm not using any JNI...
>
> So, the problem is that I'm creating too many allocations (I think in 
> one test it was 2.7 GB/s), and it's much more with > 1 trx (by far the 
> most allocated objects were/are byte arrays) in the main branch. Thus, 
> I had the idea to replace the slots byte array of byte arrays with a 
> single MemorySegment. I think for now it might even be optimal to use 
> a single on-heap byte array.
>
> The setSlot method is currently mainly called once during 
> serialization of the DataRecords, during a sync / commit to disk. It's 
> also called during deserialization; even though slots may be added in 
> random order, they are appended to the MemorySegment. I think that 
> usually records are added/deleted rather than updated (besides the 
> long "pointers" to neighbour nodes/records).
>
> It's basically a binary encoding for tree-structured data with 
> fine-grained nodes (firstChild/rightSibling/leftSibling/parent/ 
> lastChild), and the nodes are stored in a dense trie where the leaf 
> pages mostly hold 1024 nodes.
>
> Up to a predefined, very small threshold N, page fragments are fetched 
> in parallel from disk if there's no in-memory reference and the page 
> is not found in a Caffeine cache; the fragments are then combined into 
> a full page. Thus setSlot is called once during reconstruction of the 
> full page, for slots which are not currently set but are set in the 
> current page fragment.
>
> So I assume that afterwards slots are only ever set in a single 
> read-write trx per resource, and only seldom is variable-length data 
> adapted. If that's not the case, I could also try to leave some space 
> after each slot, so that it can grow without having to shift other 
> data, or something like that.
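>
> For example (again just a hypothetical sketch), an update could then 
> stay in place as long as the reserved capacity fits, and only relocate 
> otherwise:
>
>     import java.lang.foreign.MemorySegment;
>
>     // hypothetical sketch: per-slot slack lets a slot grow in place
>     final class SlotUpdate {
>         static void updateSlot(MemorySegment slots, long slotOffset,
>                                long slotCapacity, MemorySegment newData) {
>             if (newData.byteSize() <= slotCapacity) {
>                 // fits into the reserved space: overwrite in place
>                 MemorySegment.copy(newData, 0, slots, slotOffset, newData.byteSize());
>             } else {
>                 // would have to relocate the slot (e.g. append at the end)
>                 throw new UnsupportedOperationException("relocation not sketched");
>             }
>         }
>     }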
>
> At least I think the issue of a much worse runtime when traversing 
> roughly 310_000_000 nodes in a preorder traversal (remember that they 
> are stored in pages), currently when switching from 1 to 5 trxs in 
> parallel, is due to the objects allocated (without the MemorySegment):
>
> https://github.com/sirixdb/sirix/blob/main/analysis-single-trx.jfr
>
> vs.
>
> https://github.com/sirixdb/sirix/blob/main/analysis-5-trxs.jfr
>
> Andrei Pangin helped a bit with analyzing the async-profiler 
> snapshots; the runtime with 5 trxs in parallel is almost exactly 4x 
> slower than with a single trx, and it's most probably due to the 
> amount of allocations (even though GC seems OK).
>
> So, all in all, I've had a specific runtime performance problem (and 
> the system was also paging a lot), so I think it makes sense that it 
> may be due to the allocation rate.
>
> I hope the nodes can simply get a MemorySegment constructor param in 
> the future instead of a couple of object delegates... so that I can 
> directly use MemorySegments instead of having to convert back and 
> forth between byte arrays during serialization/deserialization. Maybe 
> we can even get (almost) rid of the whole step, and we'd gain better 
> data locality.
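>
> Something like this is what I'm hoping for (hypothetical and 
> simplified, with made-up field offsets):
>
>     import java.lang.foreign.MemorySegment;
>     import java.lang.foreign.ValueLayout;
>
>     // hypothetical node view backed directly by a page segment
>     record NodeView(MemorySegment page, long offset) {
>         long parentKey()       { return page.get(ValueLayout.JAVA_LONG_UNALIGNED, offset); }
>         long firstChildKey()   { return page.get(ValueLayout.JAVA_LONG_UNALIGNED, offset + 8); }
>         long lastChildKey()    { return page.get(ValueLayout.JAVA_LONG_UNALIGNED, offset + 16); }
>         long leftSiblingKey()  { return page.get(ValueLayout.JAVA_LONG_UNALIGNED, offset + 24); }
>         long rightSiblingKey() { return page.get(ValueLayout.JAVA_LONG_UNALIGNED, offset + 32); }
>     }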
>
> Hope it makes some sense now, but it may also be worth looking into a 
> single bigger byte array instead of a MemorySegment (even though I 
> think that off-heap memory usage might not be a bad idea for a 
> database system).
>
> You may want to have a quick look at the 2 profiles I provided...
>
> Kind regards, and thanks a lot for your input. If it helps, I can 
> provide a bigger JSON file I used for importing / the test.
>
> Johannes
>
>
> Maurizio Cimadamore <maurizio.cimadamore at oracle.com> schrieb am Mo., 
> 16. Sept. 2024, 12:31:
>
>
>     On 16/09/2024 11:26, Maurizio Cimadamore wrote:
>     > I've rarely had these "Evacuation Failure: Pinned" log entries
>     > regarding the current "master" branch on Github, but now it's
>     even worse.
>
>     Zooming in on this aspect: this would suggest that your heap
>     memory is
>     being kept "pinned" somewhere.
>
>     Are you, by any chance, using downcall method handles with the
>     "critical" Linker option? Or any form of critical JNI?
>
>     It would be interesting (separately from the "architectural" angle
>     discussed in my previous reply) to see which method call(s) is
>     causing
>     this exactly...
>
>     Cheers
>     Maurizio
>