MemorySegment JVM memory leak
Uwe Schindler
uschindler at apache.org
Wed Apr 22 23:22:38 UTC 2020
Hi Maurizio,
I did not want to hijack this thread, but my comments about the new unsafe memory segments come below.
> Ok - glad that we're on the same page. Have you looked into the new
> 'unsafe' native segment creation? If you already have a mapped memory
> address you can basically create a custom memory segment with:
>
> * custom address
> * custom size
> * _optional_ thread owner (meaning if there's no owner, you are unconfined)
> * custom cleanup action (you'll need to do something to unmap the
> address here, perhaps in native code)
>
>
> See here:
>
> https://github.com/openjdk/panama-foreign/blob/foreign-memaccess/src/jdk.incubator.foreign/share/classes/jdk/incubator/foreign/MemorySegment.java#L510
>
>
> The only caveat is that, if you want to use this you need to pass a
> command line flag - which might or might not be ok in your case.
Hi, I have not had too much time to look into that yet, but as discussed in person, this looks like a way to allow Lucene/Solr/Elasticsearch to get rid of thread confinement (which we can't handle) and still have the same as what we have today: we can crash the JVM if you close the index at the wrong time because of a bug in our tracking. For Solr and Elasticsearch that's unlikely, but people are already used to it. By default the current code uses Unsafe.invokeCleaner(ByteBuffer).
So in short, adding a command line flag to Elasticsearch or Solr isn't an issue. As we are on Java 11 minimum at the moment, we will keep the current code, and only optionally can the user switch to the new PanamaMMapDirectory I/O abstraction class (I just call it that), which allows mapping segments larger than 1 Gigabyte (the ByteBuffer limit regarding sizes in powers of 2). One issue you see with that is that the previously mentioned people with 1 Terabyte of index data (and possibly several such indexes) need a lot of maps, so configuring Elasticsearch servers also requires raising the kernel's maximum map count (because we have many maps). So having larger segments is essential in the future, but we can't use them because of the current way thread confinement works. Serial confinement does not help us either, because we have no control over the threads and over who transfers ownership to where.
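To make the map-count issue concrete, here is a rough illustration (not Lucene's actual code; the class and method names are made up) of the current ByteBuffer-based approach: FileChannel#map returns a MappedByteBuffer whose size is limited to an int, so a large file has to be split into chunks, and every chunk is one kernel mapping that counts against the kernel's maximum map count (vm.max_map_count on Linux).

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChunkedMmapIllustration {
    static MappedByteBuffer[] mapInChunks(Path path, long chunkSize) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            long size = ch.size();
            int nChunks = (int) ((size + chunkSize - 1) / chunkSize);
            MappedByteBuffer[] buffers = new MappedByteBuffer[nChunks];
            for (int i = 0; i < nChunks; i++) {
                long pos = i * chunkSize;
                long len = Math.min(chunkSize, size - pos);
                // one kernel mapping per chunk: a 1 TiB index at 1 GiB chunks needs ~1024 maps
                buffers[i] = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
            }
            return buffers;
        }
    }
}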
So what we would like to have:
- Memory-map segments larger than 1 GiB, with 64-bit offsets
- No native code to do that, no sun.misc.Unsafe#invokeCleaner()
- No thread confinement
- When we are done and close the index, we would like to release the segment, knowing about the risk of crashing the JVM. That's the same as what sun.misc.Unsafe#invokeCleaner does for us.
With the proposed API it looks to me that one can do the following (a rough sketch follows after these steps):
- Mmap a segment on a defined thread (I would use one created by the directory abstraction, so you always know which thread created the segments) using MemorySegment#mapFromPath
- Retrieve the MemoryAddress via baseAddress() and its long size from the segment
- Use the new API to create an unsafe segment from that address and size (call the new method above), without a cleaner and without an owner thread
On close:
- Close the original segment (this needs to be done by the thread that created it, which is why I would spawn a thread to handle that)
- Make sure to stop using the unsafe MemorySegment
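To make the steps above concrete, here is a minimal sketch, assuming the jdk.incubator.foreign API as in the current JDK 15 EA / panama-foreign branch (MemorySegment#mapFromPath, baseAddress(), byteSize(), and the new restricted factory MemorySegment#ofNativeRestricted(address, size, owner, cleanup, attachment)), and assuming restricted access has been enabled on the command line. The class and field names are made up; this is a sketch, not a proposed implementation:

import jdk.incubator.foreign.MemoryAddress;
import jdk.incubator.foreign.MemorySegment;

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

final class UnconfinedMmapSketch {
    private final MemorySegment mapped;      // confined to the thread that mapped it
    private final MemorySegment unconfined;  // unsafe view: no owner thread, no cleaner

    UnconfinedMmapSketch(Path path, long size) throws IOException {
        // 1) mmap on a well-defined thread (here: the constructing thread)
        mapped = MemorySegment.mapFromPath(path, size, FileChannel.MapMode.READ_ONLY);
        // 2) retrieve base address and size
        MemoryAddress base = mapped.baseAddress();
        long byteSize = mapped.byteSize();
        // 3) wrap them in an unsafe segment: null owner => unconfined, null cleanup => no cleaner
        unconfined = MemorySegment.ofNativeRestricted(base, byteSize, null, null, null);
    }

    MemorySegment segment() {
        return unconfined;   // hand this out to arbitrary reader threads
    }

    void close() {
        // Callers must have stopped using 'unconfined' before this point; any later
        // access may crash the JVM, just like after Unsafe.invokeCleaner today.
        // close() must run on the thread that created 'mapped' (its owner).
        mapped.close();      // unmaps the file
    }
}

Passing null as owner makes the unsafe view unconfined, and passing null as cleanup means nothing happens when that view is closed; the actual unmap only happens when the original confined segment is closed.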
Questions:
- Is there a way to figure out at runtime whether the user has set the command line option (without trying and catching an error)?
- Can you mapFromPath using a custom offset? Of course you can map the whole file and create a slice (see the sketch after this list), but you are wasting address space if you don't need it. The code behind MappedMemorySegmentImpl uses FileChannel, but always passes 0L as startOffset. Can we add the offset, too? It should not be too hard to add this missing parameter.
- Would it be possible to create a memory mapped segment as described above and make it unsafe from the beginning?
- In which JDK 15 EA build can we test it, or do I need another build?
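Regarding the slicing workaround mentioned in the offset question, here is a small sketch of what I mean, assuming MemorySegment#asSlice(offset, newSize) from the incubator API (the class and method names are made up); the drawback is that address space for the whole file is still reserved:

import jdk.incubator.foreign.MemorySegment;

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

final class SliceWorkaroundSketch {
    // Maps the whole file and returns a window starting at 'offset'. The full file's
    // virtual address space stays reserved until the parent segment is closed,
    // which is exactly the waste the question above is about.
    static MemorySegment mapWindow(Path path, long fileSize, long offset, long length)
            throws IOException {
        MemorySegment whole = MemorySegment.mapFromPath(path, fileSize, FileChannel.MapMode.READ_ONLY);
        return whole.asSlice(offset, length);
    }
}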
Thanks,
Uwe
> >> I agree, it's a problem for anonymous mappings if they can't be cleaned up,
> >> but those should be unmapped after usage. For mmapped files, the memory
> >> consumption is not higher, it's just better use of the file system cache. If the
> >> kernel lacks enough space for other stuff that has no disk backend (anonymous
> >> mappings), it will free the occupied resources or add a disk backend by using
> >> the swap file.
> >>
> >> This is not what I observed on my machines (and I suspect Ahmed is also
> >> seeing the same). If you just do a loop iterating over a 100G mapped
> >> file, you will eventually run out of RAM and the system will start
> >> swapping like crazy to the point of stopping being responsive. I don't
> >> think this is an acceptable behavior, at least in this specific case.
> > The problem is that you are looping from the beginning to the end of the
> > region. I am not fully familiar with the Linux kernel code in recent versions,
> > but it tries to be intelligent regarding memory mapping. If you read the whole
> > file like this you are somehow misusing the mmap API. Reading the file with
> > sequential IO is much better. MMAP is ideal for random access to files where
> > you don't need everything at once.
> >
> > If you touch every block one by one, it's the same as MappedByteBuffer#load().
> > Stuff that was recently loaded is preferred to be kept in physical memory, so
> > stuff that was not accessed for a longer time has to go to swap. How this
> > happens depends on the vm.swappiness sysctl kernel setting (which is 60 by
> > default, a bad setting e.g. for some workloads on servers, see below). With
> > that default, swapping out is preferred over just freeing recently claimed
> > buffers. Especially with the sequential read antipattern, I would not be
> > surprised if the Linux kernel has an optimization that assumes this stuff is
> > needed more often (as sequential reads are mostly a sign of database scans,
> > where file system caching is hardly required).
> Right, I've been bitten by swappiness many times. I also thought that
> was probably the culprit here.
> >
> >
> > If you remember my talk at the committers meeting in Brussels: Elasticsearch
> > servers sometimes memory-map up to a terabyte of memory on 64 or 128 GiB
> > physical RAM machines and still work fine. All of this is mmapped in
> > MappedByteBuffers of 1 GiB each (due to the 32 bit limitation). The difference
> > to your case is: we use random access and not everything is needed at the same
> > time. IO pressure is much lower than in your synthetic test, where the system
> > has no time to clean up, as it wants to swap in pages as fast as possible. If
> > you have random access instead of sequential access, the system also has time
> > to free other resources and to decide to free resources that were not used for
> > a longer time.
>
> True, I suppose that access idiom accounts for the biggest difference.
>
> Thanks
> Maurizio