Allocation scheme using mmap
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Oct 29 09:52:08 UTC 2024
On 27/10/2024 13:33, Johannes Lichtenberger wrote:
> Hello,
>
> I'm trying to implement something similar to the BufferManager
> described in 2.1 in [1] for UmbraDB:
>
> So I wanted to create the first buffer size class with fixed sized
> pages as follows:
>
> // Allocate the memory
> MemorySegment reservedMemory = bufferManager.allocateMemory(totalMemorySize);
>
> // Partition into page-sized chunks
> List<MemorySegment> pages = new ArrayList<>();
>
> for (long offset = 0; offset + pageSize <= totalMemorySize; offset += pageSize) {
>     MemorySegment pageSegment = reservedMemory.asSlice(offset, pageSize);
>     pages.add(pageSegment);
> }
> I'm using `mmap` to create the virtual address mapping for the big
> chunk allocation, like this:
> public MemorySegment allocateMemory(long size) throws Throwable {
>     // Call mmap to reserve virtual memory
>     MemorySegment addr = (MemorySegment) mmap.invoke(
>             MemorySegment.NULL,          // Let OS choose the starting address
>             size,                        // Size of the memory to reserve
>             PROT_READ | PROT_WRITE,      // Read and write permissions
>             MAP_PRIVATE | MAP_ANONYMOUS, // Private, anonymous mapping
>             -1,                          // No file descriptor
>             0                            // No offset
>     );
>     if (addr.address() == -1L) { // mmap signals failure with MAP_FAILED (-1), not NULL
>         throw new OutOfMemoryError("Failed to allocate memory via mmap");
>     }
>     return addr;
> }
> First thing I noticed is that I need addr.reinterpret(size) here
Yes, you get back a zero-length memory segment (as you are calling a raw
mmap downcall method handle), so you need to resize.
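For illustration, the raw downcall plus resize could look something
like this -- a minimal sketch, assuming 64-bit Linux; PROT_READ,
PROT_WRITE, MAP_PRIVATE and MAP_ANONYMOUS are constants you'd define
yourself, and reinterpret is a restricted method, so
--enable-native-access applies:

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

Linker linker = Linker.nativeLinker();
MethodHandle mmap = linker.downcallHandle(
        linker.defaultLookup().find("mmap").orElseThrow(),
        FunctionDescriptor.of(ValueLayout.ADDRESS, // returns void*
                ValueLayout.ADDRESS,               // void *addr
                ValueLayout.JAVA_LONG,             // size_t length
                ValueLayout.JAVA_INT,              // int prot
                ValueLayout.JAVA_INT,              // int flags
                ValueLayout.JAVA_INT,              // int fd
                ValueLayout.JAVA_LONG));           // off_t offset

// The returned segment has byteSize() == 0; resize before slicing:
MemorySegment addr = (MemorySegment) mmap.invoke(
        MemorySegment.NULL, size, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0L);
MemorySegment segment = addr.reinterpret(size);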
> , but now I wonder how Arena::allocate is actually implemented for
> Linux (calling malloc?). I think the native memory shouldn't be
> allocated in this case up until something is written to the
> MemorySegment slices, right? Of course we already have a lot of
> MemorySegment instances on the Java heap which are allocated, in
> comparison to a C or C++ version.
Arena::allocate in its basic implementation just calls malloc (well, we
do that via Unsafe::allocateMemory, but it's similar). Then we also
reinterpret to the correct size. Of course there could be smarter
allocation strategies, but they can be built on top by defining custom
arenas (the Arena interface can be implemented for exactly this
purpose). We will investigate better strategies, especially for the
confined case -- but I think delaying allocation until the bits are
actually accessed (which is kind of what you get with mmap) might not
be a great general strategy.
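To make that concrete, here is a minimal sketch of a custom arena that
hands out slices of a pre-reserved backing segment (e.g. your mmap'ed
region), delegating lifetime to an ordinary arena -- the names and the
lack of bounds/error handling are illustrative only:

import java.lang.foreign.*;

static Arena slicingArena(MemorySegment backing, Arena delegate) {
    // Bump-allocates slices out of the single backing segment
    SegmentAllocator slicer = SegmentAllocator.slicingAllocator(backing);
    return new Arena() {
        @Override
        public MemorySegment allocate(long byteSize, long byteAlignment) {
            return slicer.allocate(byteSize, byteAlignment);
        }
        @Override
        public MemorySegment.Scope scope() { return delegate.scope(); }
        @Override
        public void close() { delegate.close(); }
    };
}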
>
> Next, I'm not sure if I missed something, but it seems ridiculously
> hard to get a file descriptor (the actual int) for some syscalls. I
> guess that's because it's platform-specific, but if I didn't miss
> something I'd have to use reflection, right? For instance, if you
> have a FileChannel.
There's this:
https://bugs.openjdk.org/browse/JDK-8292771
We came close to adding a new restricted method to get the descriptor.
Other workarounds were highlighted in the JBS issue, but perhaps this
is something that can be reassessed. Perhaps Uwe or Per can chime in
here and say whether this is still needed (if not, we should just
close the JBS issue).
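One of those workarounds is plain deep reflection -- a sketch that
pokes at JDK internals (both sun.nio.ch.FileChannelImpl and
java.io.FileDescriptor keep a field named "fd"), so it needs
--add-opens java.base/sun.nio.ch=ALL-UNNAMED and
--add-opens java.base/java.io=ALL-UNNAMED, and may break across
releases:

import java.io.FileDescriptor;
import java.lang.reflect.Field;
import java.nio.channels.FileChannel;

static int nativeFd(FileChannel channel) throws ReflectiveOperationException {
    // FileChannelImpl stores its FileDescriptor in a field named "fd"
    Field fdField = channel.getClass().getDeclaredField("fd");
    fdField.setAccessible(true);
    FileDescriptor fd = (FileDescriptor) fdField.get(channel);
    // FileDescriptor stores the raw POSIX descriptor in an int, also "fd"
    Field intField = FileDescriptor.class.getDeclaredField("fd");
    intField.setAccessible(true);
    return intField.getInt(fd);
}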
>
> My idea of using something like this is motivated by reducing
> allocations on the Java heap, a problem I described a couple of
> months ago. Up until now I never "recycled/reused" pages when a read
> from disk was issued and when the page was not cached. I've always
> created new instances of these potentially big objects after a disk
> read, so in addition I'd implement something to cache and reuse Java
> pages (which kind of wrap the MemorySegments):
>
> In addition to the actual native memory I want to buffer the page
> instances which sit on top of the MemorySegments; these can be
> reclaimed by filling the underlying MemorySegments with 0 again, plus
> some other cleanup stuff. So essentially I'd never unmap or call
> madvise(MADV_DONTNEED), as long as the process is running.
I think this is an area where perhaps Uwe can help? The latest Lucene
is doing something quite similar to what you are trying to do, I believe:
https://www.elastic.co/search-labs/blog/lucene-and-java-moving-forward-together
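Something like the pool you describe (a map keyed by page size, each
entry holding a deque of zero-filled free pages) could be sketched
like this -- class and method names are made up, and sizing/eviction
policy is left out:

import java.lang.foreign.MemorySegment;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ConcurrentMap;

final class PagePool {
    private final ConcurrentMap<Long, ConcurrentLinkedDeque<MemorySegment>> freePages =
            new ConcurrentHashMap<>();

    // Returns a recycled page of the given size, or null if the caller
    // must carve a fresh slice from the reserved region
    MemorySegment acquire(long pageSize) {
        var deque = freePages.get(pageSize);
        return deque == null ? null : deque.pollFirst();
    }

    void release(MemorySegment page) {
        page.fill((byte) 0); // scrub before reuse, as you describe
        freePages.computeIfAbsent(page.byteSize(), k -> new ConcurrentLinkedDeque<>())
                 .addFirst(page);
    }
}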
Cheers
Maurizio
>
> Hope that makes sense, and hopefully the ConcurrentHashMap to retrieve
> a page with a certain size, plus taking the first entry from a Deque
> of free pages... doesn't add too much CPU and synchronization
> overhead -- but the allocation rate was 2.7 GB for a single txn.
>
> Kind regards
> Johannes
>
> [1] https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf