Off heap vs on heap memory access performance for a DBS
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Jul 28 23:21:50 UTC 2023
If you need to serialize big chunks of memory to files, I think you are
in the use case Brian S was describing, i.e. you need some kind of
memory-mapped solution.
That is, you probably want the memory layout to match the serialized layout,
so that your memory operations can be persisted directly to disk
(e.g. by calling MemorySegment::force).
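
To make the memory-mapped idea a bit more concrete, here is a minimal sketch, assuming Java 22 or later (where the FFM API is final). The page layout (an int count padded to 8 bytes, followed by 8-byte keys) and the names MappedPageSketch, writePage, readPage and roundTrip are made up for illustration; they are not SirixDB's actual format.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedPageSketch {
    // Hypothetical layout: 4-byte node count, 4 bytes of padding, then 8-byte keys.
    static final long HEADER_BYTES = 8;

    static void writePage(Path file, long[] keys) {
        long size = HEADER_BYTES + (long) keys.length * Long.BYTES;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                     StandardOpenOption.READ, StandardOpenOption.WRITE);
             Arena arena = Arena.ofConfined()) {
            // The in-memory layout is the on-disk layout: writes go to the mapping.
            MemorySegment page = ch.map(FileChannel.MapMode.READ_WRITE, 0, size, arena);
            page.set(ValueLayout.JAVA_INT, 0, keys.length);
            for (int i = 0; i < keys.length; i++) {
                page.set(ValueLayout.JAVA_LONG, HEADER_BYTES + (long) i * Long.BYTES, keys[i]);
            }
            page.force(); // flush the mapped contents to the storage device
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static long[] readPage(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ);
             Arena arena = Arena.ofConfined()) {
            MemorySegment page = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            int count = page.get(ValueLayout.JAVA_INT, 0);
            long[] keys = new long[count];
            for (int i = 0; i < count; i++) {
                keys[i] = page.get(ValueLayout.JAVA_LONG, HEADER_BYTES + (long) i * Long.BYTES);
            }
            return keys;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the keys to a temporary file and reads them back.
    static long[] roundTrip(long[] keys) {
        try {
            Path file = Files.createTempFile("page", ".bin");
            try {
                writePage(file, keys);
                return readPage(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that the values are written in native byte order here; a real on-disk format would likely want to pin the byte order explicitly.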
That said, if you have the requirement to go back and forth between
off-heap memory and byte arrays, you might be in trouble, because there's
no way to "wrap an array" over a piece of off-heap memory. You would have
to allocate and then copy, which is very expensive if you have a lot of data.
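
A small sketch of the asymmetry, assuming Java 22 or later: a heap byte[] can be viewed as a segment without copying, but getting a byte[] out of off-heap memory always allocates and copies. The class and method names below are illustrative only.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class ArrayCopySketch {
    // Heap array -> segment: a view, no copy. Writes through the view hit the array.
    static boolean heapViewSharesStorage() {
        byte[] node = {1, 2, 3};
        MemorySegment view = MemorySegment.ofArray(node);
        view.set(ValueLayout.JAVA_BYTE, 1, (byte) 9);
        return node[1] == 9;
    }

    // Off-heap segment -> byte[]: there is no view, so this allocates and copies.
    static byte[] throughOffHeap(byte[] node) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment offHeap = arena.allocate(node.length);
            // Bulk copy from the array into off-heap memory.
            MemorySegment.copy(node, 0, offHeap, ValueLayout.JAVA_BYTE, 0, node.length);
            // toArray creates a fresh byte[] and copies the segment contents into it.
            return offHeap.toArray(ValueLayout.JAVA_BYTE);
        }
    }
}
```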
Maurizio
On 28/07/2023 22:45, Johannes Lichtenberger wrote:
> I think the main issue is not even allocation or GC, but instead the
> serialization of, in this case, over 300_000_000 nodes in total.
>
> I've attached a screenshot showing the flamegraph of the profiler
> output I posted the link to...
>
> What do you think? For starters, the page already has a slot array to
> store byte arrays for the nodes. The whole page should probably be
> something like a growable MemorySegment, but I could probably first
> try to simply write byte arrays directly and read from them using
> something like Chronicle Bytes or even MemorySegments, and then convert
> the result to byte arrays?
>
> Kind regards
> Johannes
>
> Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote on Fri,
> 28 July 2023, 11:38:
>
>
> On 28/07/2023 08:15, Johannes Lichtenberger wrote:
> > Hello,
> >
> > I think I mentioned it already, but currently I'm thinking about
> > it again.
> >
> > Regarding the index trie in my spare-time project, I'm wondering whether
> > it makes sense, as currently I'm creating fine-grained on-heap nodes
> > during insertions/updates/deletes (1024 per page). Once a page is read
> > again from storage, I'm storing these nodes in an array of byte
> > arrays until they are read for the first time. One caveat is that the
> > nodes may store strings inline and are thus of variable size (and
> > so the pages are of variable size, too, padded to word alignment IIRC).
> >
> > I'm currently auto-committing after approx. 500_000 nodes have been
> > created (afterwards they can be garbage collected), and in total there
> > are more than 320 million nodes in one test.
> >
> > I think I could store the nodes in MemorySegments instead of using
> > on-heap classes / instances and dynamically reallocate memory if a
> > node value is changed.
> >
> > However, I'm not sure as it means a lot of work and maybe off heap
> > memory access is always slightly worse than on heap!?
>
> I don't think that's necessarily the case. I mean, array access is
> the best: there are more optimizations for it, and the access is more
> scrutable to the optimizing compiler.
>
> If you start using APIs such as ByteBuffer or MemorySegment, they take
> a bit of a hit, depending on usage, as each access has to verify certain
> access properties. That said, if your access is "well-behaved" and
> SIMD-friendly (e.g. counted loops and such), you can expect the
> performance of MS/BB to be very good, as all the relevant checks will be
> hoisted out of loops.
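
For illustration, a "well-behaved" access pattern might look like the sketch below (assuming Java 22 or later; the class and method names are made up). It is a plain counted loop with a fixed stride over one segment; whether the JIT actually hoists the checks depends on the JVM version and the surrounding code, so treat this as a shape to aim for rather than a guarantee.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class CountedLoopSketch {
    // Simple counted loop: int induction variable, fixed stride, single segment.
    static long sum(MemorySegment seg) {
        int n = (int) (seg.byteSize() / Integer.BYTES);
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += seg.getAtIndex(ValueLayout.JAVA_INT, i);
        }
        return total;
    }

    // Fills an off-heap segment with 0..count-1 and sums it.
    static long demo(int count) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate((long) count * Integer.BYTES, Integer.BYTES);
            for (int i = 0; i < count; i++) {
                seg.setAtIndex(ValueLayout.JAVA_INT, i, i);
            }
            return sum(seg);
        }
    }
}
```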
>
> With memory segments, since you can also create unsafe segments on the
> fly, we're investigating approaches where you can get (at least in
> synthetic benchmarks) the same assembly and performance as raw
> Unsafe calls:
>
> https://mail.openjdk.org/pipermail/panama-dev/2023-July/019487.html
>
> I think one area that requires a lot of thought when it comes to
> off-heap is allocation. The GC is mighty fast at allocating objects,
> especially small ones that might die soon. The default implementation of
> Arena::allocate uses malloc under the hood, so it's not going to be
> anywhere near as fast.
>
> That said, starting from Java 20 you can define a custom arena with a
> better allocation scheme. For instance, if you are allocating in a tight
> loop you can write an "allocator" which just recycles memory (see
> SegmentAllocator::slicingAllocator). With malloc out of the way the
> situation should improve significantly.
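
One possible shape for such a recycling allocator, as a sketch only (assuming Java 22 or later; NODE_BYTES, the batch structure and the class name are hypothetical): a single block is allocated once, and each batch carves node-sized slices out of it through a fresh slicing allocator, so the tight loop never calls malloc.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SegmentAllocator;
import java.lang.foreign.ValueLayout;

final class RecyclingAllocatorSketch {
    static final long NODE_BYTES = 16; // hypothetical fixed node size

    static long process(int batches, int nodesPerBatch) {
        long checksum = 0;
        try (Arena arena = Arena.ofConfined()) {
            // One up-front allocation, reused by every batch.
            MemorySegment block = arena.allocate(nodesPerBatch * NODE_BYTES, 8);
            for (int b = 0; b < batches; b++) {
                // A new slicing allocator starts again at offset 0 of the same block,
                // which effectively recycles the memory of the previous batch.
                SegmentAllocator alloc = SegmentAllocator.slicingAllocator(block);
                for (int i = 0; i < nodesPerBatch; i++) {
                    MemorySegment node = alloc.allocate(NODE_BYTES, 8);
                    // Slices are not zeroed, so write before reading.
                    node.set(ValueLayout.JAVA_LONG, 0, (long) b * nodesPerBatch + i);
                    checksum += node.get(ValueLayout.JAVA_LONG, 0);
                }
            }
        }
        return checksum;
    }
}
```

Slices from an earlier batch must not be used once the next batch starts, since they alias the same memory; the slicing allocator also throws once the block is exhausted, so the block has to be sized for the largest batch.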
>
> Ultimately, picking the right allocation scheme depends on your workload;
> there is no one-size-fits-all (as I'm sure you know). But there should
> be enough building blocks in the API to allow you to do what you need.
>
>
> >
> > I'll check GC pressure again by logging it, but an IntelliJ profiler
> > (async profiler JFR) output of a run to store a big JSON file in
> > SirixDB can be seen here:
> >
> > https://github.com/sirixdb/sirix/blob/refactoring-serialization/JsonShredderTest_testChicago_2023_07_27_131637.jfr
> >
> > I think I had better performance/latency with Shenandoah (not
> > generational), but ZGC was worse in other workloads due to caffeine
> > caches and not being generational (but that's changing of course).
> >
> > So, looking at the profiler output and probably the flame graph,
> > where G1 work seems to be prominent, do you think a refactoring would
> > be appropriate using MemorySegments, or maybe it's an ideal "big data"
> > use case for the generational low-latency GCs and the amount of
> > objects is not an issue at all!?
>
> Hard to say. Generational GCs are very very good. And object allocation
> might be cheaper than you think. Where off-heap becomes advantageous is
> (typically) if you need to work with memory-mapped files (and/or native
> calls), which is common for database-like use cases.
>
> Maurizio
>
> >
> > Kind regards
> > Johannes
>