MemorySegment off-heap usage and GC
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Sep 16 10:26:06 UTC 2024
Hi Johannes,
without knowing more high-level information about what the usage patterns
for the methods in your code are, it’s hard for me to provide guidance, or
to try and explain why heap allocation is more pronounced than before. I
will try nevertheless :-)
Eyeballing your code, I only see four places where segments are created:
* KeyValueLeafPage constructor (because of allocate)
* resize (because of allocate)
* setSlot (because of MemorySegment::ofArray)
* setDeweyId (because of MemorySegment::ofArray)
So, it doesn’t seem like you are creating tons of heap objects. But all
the setXYZ methods do seem to create a new heap object (a new memory
segment) on each call, whereas before you just stored the incoming byte
array into some field of the class. If your application is sensitive to
that kind of thing (e.g. these set methods are called a lot), that could
explain the increase in heap pressure.
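To make that concrete, here is a minimal sketch (the method shape and names are hypothetical, not your actual code) of what each set call now does, using java.lang.foreign:

    import java.lang.foreign.MemorySegment;

    class SlotSetSketch {
        // Each call wraps the incoming byte[] in a fresh heap MemorySegment view
        // before copying - one extra small heap object per set, on top of the copy itself.
        static void setSlot(MemorySegment pageData, long slotOffset, byte[] recordData) {
            MemorySegment heapView = MemorySegment.ofArray(recordData);               // new heap object per call
            MemorySegment.copy(heapView, 0, pageData, slotOffset, recordData.length); // bulk copy into the page
        }
    }

(There is also a MemorySegment.copy overload that takes the byte[] directly, which would avoid the intermediate ofArray wrapper; whether that matters depends on how hot these methods are.)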
In general, looking at the before/after code, there is now more code in
the set/get code path, as you have to go from byte[] to segment and then
back. If your public-facing API takes byte[], you might not have many
options other than to have your internal representation match that.
(That said, I’d note that the “before” implementation doesn’t perform
defensive copies of the incoming byte arrays, and that might also affect
performance.)
Another place where performance might be lost is the resize method.
Every now and then, when you call setData, the memory segment might not
be big enough, so the code allocates a new one and copies all the data
over. This is not going to be fast (even ignoring the heap pressure
problem, though the two could be related).
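What that grow-and-copy boils down to is roughly the following (just a sketch, assuming the new segment comes from the same arena as the old one):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;

    class ResizeSketch {
        // Allocate a bigger segment and copy everything that is already there.
        static MemorySegment resize(Arena arena, MemorySegment old, long newByteSize) {
            MemorySegment bigger = arena.allocate(newByteSize);      // fresh off-heap allocation
            MemorySegment.copy(old, 0, bigger, 0, old.byteSize());   // full copy of the existing data
            return bigger;  // the old segment is only reclaimed when the arena itself is closed
        }
    }

Note that with an arena the old segment isn’t freed by the resize; it stays around until the arena is closed, so frequent resizing also grows the page’s overall footprint.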
The reality here is that it is going to be hard to match the performance
of something like this:

    public void setSlot(byte[] recordData, int offset) {
        slots[offset] = recordData;
    }

There’s no moving of the data off-heap, no need for resizing ever, and
this sets you up for avoiding deserialization altogether:

    public byte[] getSlot(int slotNumber) {
        return slots[slotNumber];
    }
Anything you change here is added cost. MemorySegment is just the tip of
the iceberg here, I think (you’d face exactly the same problems trying to
use ByteBuffer, or any other such API).
Something that would be more “apples to apples” would be if you could
avoid resizing - after all, in the old code you always allocate a number
of slots/records/deweyIds equal to some known constant
(Constants.NDP_NODE_COUNT). Maybe this constant is super high (it seems to
be 1024), so you cannot afford to pre-allocate the equivalent bytes when
you allocate a page (but that’s another “shortcut” you can take because
you’re exploiting an on-heap representation). But even if you
pre-allocated big-enough segments, you’d still have to convert the input
data into a segment, and then extract the output data into a byte array
(instead of simply setting a pointer/getting a pointer - another
on-heap-oriented assumption).
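For illustration, a pre-allocated page might look roughly like this (a sketch only; the class name and the fixed per-slot capacity SLOT_SIZE are assumptions on my part, not something from your code):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;

    class PreallocatedPage {
        static final int SLOT_COUNT = 1024;  // stand-in for Constants.NDP_NODE_COUNT
        static final int SLOT_SIZE  = 256;   // hypothetical fixed capacity per slot

        private final MemorySegment data;

        PreallocatedPage(Arena arena) {
            data = arena.allocate((long) SLOT_COUNT * SLOT_SIZE);  // no resizing ever needed
        }

        void setSlot(byte[] recordData, int slot) {
            // copy-in: heap array -> off-heap segment
            MemorySegment.copy(recordData, 0, data, ValueLayout.JAVA_BYTE,
                               (long) slot * SLOT_SIZE, recordData.length);
        }

        byte[] getSlot(int slot, int length) {
            // copy-out: off-heap segment -> fresh heap array
            return data.asSlice((long) slot * SLOT_SIZE, length)
                       .toArray(ValueLayout.JAVA_BYTE);
        }
    }

The resizing is gone, but every set and get still pays for a copy across the heap boundary - that’s the point I’m trying to make.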
To sum up, I think that for the switch to off-heap/segments to be
beneficial, there has to be some use case you want to address that cannot
be addressed in the current setup. Your setup is currently highly
optimized under the assumption that data lives on-heap (and that you
don’t care too much about data being manipulated via the array instances
being fed to the API /after/ said arrays have been stored in the page).
If those assumptions work well for you, given they also provide the best
possible performance, why change? Of course, if you want to pass the
page data to some native function you will run into a road-block, as you
will then have to collect all your page data by chasing pointers and
copying it into some contiguous memory region. But if you don’t need
that, why bother? Sometimes the simplest design is also the best.

That’s not to say that there aren’t ways to perhaps use memory segments
more effectively to do what you want to do. For instance, if you “squint”,
your old code is really creating a page with a fixed number
(Constants.NDP_NODE_COUNT) of data pointers. So, you could create a page
with an array of N memory segments, to be filled later. Each page would be
associated with an _arena_, so that all the segment data you create
effectively shares the same lifetime. This would allow you to avoid
resizing, while still only “allocating as you need”. But you would need to
copy data in/out of the heap, so this would add serialization cost
compared to your original solution.
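A sketch of what I have in mind (hypothetical names; whether you need a confined or a shared arena depends on how your pages are accessed):

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;

    class SegmentSlotPage implements AutoCloseable {
        static final int SLOT_COUNT = 1024;  // stand-in for Constants.NDP_NODE_COUNT

        private final Arena arena = Arena.ofConfined();  // or Arena.ofShared() for cross-thread pages
        private final MemorySegment[] slots = new MemorySegment[SLOT_COUNT];

        void setSlot(byte[] recordData, int slot) {
            // allocate exactly as much as needed, only when a slot is actually set;
            // no resizing, but the bytes are still copied off-heap
            MemorySegment seg = arena.allocate(recordData.length);
            MemorySegment.copy(recordData, 0, seg, ValueLayout.JAVA_BYTE, 0, recordData.length);
            slots[slot] = seg;
        }

        byte[] getSlot(int slot) {
            MemorySegment seg = slots[slot];
            // copy-out on every read - the serialization cost mentioned above
            return seg == null ? null : seg.toArray(ValueLayout.JAVA_BYTE);
        }

        @Override
        public void close() {
            arena.close();  // all slot segments of this page become inaccessible at once
        }
    }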
Just trying to be honest here, and set the right expectations. Going
off-heap is not a “make me go fast” kind of toggle. It often requires
compromises to make the Java side and the off-heap side of the world
“align” somehow.
Maurizio
On 14/09/2024 15:17, Johannes Lichtenberger wrote:
> Hello,
>
> I'm currently refactoring my little database project in my spare time
> from using a very simple byte[][] slots array of byte arrays to a
> single MemorySegment (or two, depending on whether DeweyIDs are stored
> or not (usually not)):
>
> from
>
> https://github.com/sirixdb/sirix/blob/main/bundles/sirix-core/src/main/java/io/sirix/page/KeyValueLeafPage.java
>
> to
>
> https://github.com/sirixdb/sirix/blob/1aaafd13693c0cf7e073d400766525eed7a24ad6/bundles/sirix-core/src/main/java/io/sirix/page/KeyValueLeafPage.java
>
> However, now I had to introduce reference counting / pinning/unpinning
> of the pages, and they have to be closed, for instance, once they are
> evicted from cache(s).
>
> Implementing a "real" slotted page with shifting and resizing... has
> gotten much more complicated. Furthermore (besides that,
> pinning/unpinning and deterministic closing are tricky ;-)), I'm also
> facing much worse GC performance (attached).
>
> Of course, I'm in the middle of refactoring, and I'd give the
> nodes/records in the page a slice from the MemorySegment of the page.
> Currently, I have to convert back and forth for
> serialization/deserialization from byte-arrays to MemorySegments and
> then copy these into the page MemorySegment... which is currently one
> issue, but I'm not sure if that's all.
>
> All in all, I'm not sure if there's other stuff I'm missing, because I'm
> now using `Arena.ofShared()`, and I think this stuff is a bit strange:
>
> [3,127s][info ][gc ] GC(7) Pause Young (Normal) (G1 Evacuation
> Pause) (Evacuation Failure: Pinned) 645M->455M(5124M) 9,563ms
> [3,253s][info ][gc ] GC(8) Pause Young (Normal) (G1 Evacuation
> Pause) 783M->460M(5124M) 4,580ms
> [5,094s][info ][gc ] GC(9) Pause Young (Normal) (G1 Evacuation
> Pause) 3524M->897M(5124M) 40,103ms
> [5,200s][info ][gc ] GC(10) Pause Young (Normal) (G1 Evacuation
> Pause) (Evacuation Failure: Pinned) 1381M->947M(5124M) 29,005ms
> [5,696s][info ][gc ] GC(11) Pause Young (Normal) (G1 Evacuation
> Pause) 1499M->1191M(5124M) 25,405ms
> [5,942s][info ][gc ] GC(12) Pause Young (Normal) (G1 Evacuation
> Pause) (Evacuation Failure: Pinned) 1647M->1379M(5124M) 22,006ms
> [5,979s][info ][gc ] GC(13) Pause Young (Normal) (G1 Evacuation
> Pause) (Evacuation Failure: Pinned) 1899M->1411M(5124M) 7,634ms
> [6,628s][info ][gc ] GC(14) Pause Young (Normal) (G1 Evacuation
> Pause) 2243M->1801M(5124M) 36,093ms
> [6,725s][info ][gc ] GC(15) Pause Young (Normal) (G1 Evacuation
> Pause) (Evacuation Failure: Pinned) 2469M->1873M(5124M) 13,836ms
> [7,436s][info ][gc ] GC(16) Pause Young (Normal) (G1 Evacuation
> Pause) 2857M->2283M(5740M) 64,219ms
> [7,525s][info ][gc ] GC(17) Pause Young (Normal) (G1 Evacuation
> Pause) (Evacuation Failure: Pinned) 3115M->2343M(5740M) 14,110ms
> [8,274s][info ][gc ] GC(18) Pause Young (Normal) (G1 Evacuation
> Pause) 3659M->2783M(5740M) 42,159ms
> [9,011s][info ][gc ] GC(19) Pause Young (Concurrent Start) (G1
> Evacuation Pause) (Evacuation Failure: Pinned) 4027M->3239M(5740M)
> 51,686ms
> [9,011s][info ][gc ] GC(20) Concurrent Mark Cycle
> [9,165s][info ][gc ] GC(20) Pause Remark 4171M->2535M(5360M)
> 3,315ms
> [9,446s][info ][gc ] GC(20) Pause Cleanup 2759M->2759M(5360M)
> 0,253ms
> [9,448s][info ][gc ] GC(20) Concurrent Mark Cycle 436,601ms
> [9,500s][info ][gc ] GC(21) Pause Young (Prepare Mixed) (G1
> Evacuation Pause) 2783M->1789M(5360M) 30,267ms
> [10,575s][info ][gc ] GC(22) Pause Young (Mixed) (G1 Evacuation
> Pause) 3745M->2419M(5360M) 73,025ms
> [11,266s][info ][gc ] GC(23) Pause Young (Normal) (G1
> Evacuation Pause) 3987M->2829M(5360M) 55,028ms
> [11,762s][info ][gc ] GC(24) Pause Young (Concurrent Start) (G1
> Evacuation Pause) 4149M->3051M(6012M) 65,550ms
> [11,762s][info ][gc ] GC(25) Concurrent Mark Cycle
> [11,869s][info ][gc ] GC(25) Pause Remark 3143M->1393M(5120M)
> 4,415ms
> [12,076s][info ][gc ] GC(25) Pause Cleanup 1593M->1593M(5120M)
> 0,240ms
> [12,078s][info ][gc ] GC(25) Concurrent Mark Cycle 316,410ms
>
> I've rarely had these "Evacuation Failure: Pinned" log entries
> with the current "master" branch on GitHub, but now it's even
> worse. Plus, I think I'm still failing to close/clear pages in all
> cases (to close the arenas), which turned out to be tricky. I'm also
> storing the two most recently accessed pages in fields; sometimes,
> they are not read/put into a cache; there are page "fragments" that
> must be recombined for a full page...
>
> So maybe you know why the GC is much worse now (I guess even if I fail
> to close a page, I'd get an OutOfMemoryError or something like that,
> as the segments are off-heap (despite my array-based memory segments
> (ofArray), which may be a problem, hmm)).
>
> All in all, I faced much worse performance with N read-only trxs
> traversing a large file in parallel, likely due to a ~2,7Gb object
> allocation rate for a single trx already (and maybe not that much read
> from the page caches); that's why I thought I'd have to try the single
> MemorySegment approach for each page.
>
> The G1 log:
>
> https://raw.githubusercontent.com/sirixdb/sirix/main/g1.log.4
>
> kind regards
> Johannes
>
>