Replacing mmap with userfaultfd
Radim Vansa
rvansa at azul.com
Tue May 16 08:17:17 UTC 2023
Hi all,
I was exploring alternative options to support repeated checkpoints [1]
and I'd like to share results for review and further suggestions.
Currently the CRaC fork of CRIU uses --mmap-page-image by default during
restore - that significantly speeds up loading the image, but if another
checkpoint were performed we would keep the mapping to the directory
with the previous checkpoint. In [1] I've temporarily mapped those pages
read-only, blocking any thread that would access them, copied the data
to newly allocated memory and then remapped the copy to the original
address space. This solution has two downsides:
1) there is asymmetry in the design, as one part (the mapping) is
handled in CRIU while the 'fixup' before the next checkpoint happens in
the JVM
2) writes to those write-protected pages are handled through SIGSEGV.
The JVM already supports user handling of signals (in addition to its
own), but from the POSIX point of view the state of a process after
SIGSEGV is undefined, so this is a rather crude solution.
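For illustration, the core of that fixup could look roughly like the
following (a minimal sketch, not the actual code in [1];
detach_from_image is a hypothetical helper, and the coordination with
threads blocked in the SIGSEGV handler is omitted):

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

static int detach_from_image(void *addr, size_t len) {
    /* 1. Write-protect the range; threads that write into it now
     *    fault and are held in the SIGSEGV handler until the copy
     *    below is in place. */
    if (mprotect(addr, len, PROT_READ) != 0)
        return -1;
    /* 2. Copy the contents into fresh anonymous memory. */
    void *copy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (copy == MAP_FAILED)
        return -1;
    memcpy(copy, addr, len);
    /* 3. Move the copy over the original range; MREMAP_FIXED
     *    atomically replaces the file-backed mapping there. */
    if (mremap(copy, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, addr)
            == MAP_FAILED)
        return -1;
    return 0;
}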
Newer Linux kernels support handling missing pages through
`userfaultfd`. This would be a great solution for (2), but it is not
possible to register this handling on file-backed memory mappings
(except tmpfs on recent kernels). As for loading all memory on demand
through userfaultfd, CRIU already supports that with --lazy-pages,
where a server process is started and the starting application fetches
pages from it as it needs them. If we're looking for a solution to the
asymmetry (1) as well, we could implant the handler directly into the
JVM process. When testing that, though, I found that there's still a
big gap between this lazy loading and a simple mmapped file. To check
the achievable performance I've refactored and modified [2] - the
source code for the test is in the attachment.
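For context, the essence of the userfaultfd flow in the test is roughly
the following (a simplified sketch, not the attached code; error
handling and reading real data from the image file are left out):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static long page_size;

/* Fault-handling thread: resolves each missing-page fault
 * with UFFDIO_COPY. */
static void *handler(void *arg) {
    int uffd = (int)(long)arg;
    char *page = malloc(page_size);
    memset(page, 'x', page_size); /* stand-in for data from the image */
    for (;;) {
        struct uffd_msg msg;
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            break;
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;
        struct uffdio_copy copy = {
            .dst = msg.arg.pagefault.address & ~(page_size - 1),
            .src = (unsigned long)page,
            .len = page_size,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);
    }
    return NULL;
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);
    size_t len = 16 * page_size;

    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    /* Registration only works on anonymous memory here; file-backed
     * mappings are rejected (tmpfs/hugetlbfs aside). */
    char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    pthread_t t;
    pthread_create(&t, NULL, handler, (void *)(long)uffd);

    /* Each first touch below traps into the handler thread. */
    long sum = 0;
    for (size_t i = 0; i < len; i += page_size)
        sum += addr[i];
    printf("sum = %ld\n", sum);
    return 0;
}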
In order to run this I had to enable non-root userfaultfd, set the
number of huge pages on my system and generate a 512 MB testing file:
echo 1 | sudo tee /proc/sys/vm/unprivileged_userfaultfd
echo 256 | sudo tee /proc/sys/vm/nr_hugepages
dd if=/dev/random of=/home/rvansa/random bs=1048576 count=512
After compiling and running this on my machine I got these results
pretty consistently:
Tested file has 512 MB (536870912 bytes)
mmaped file:
Page size: 4096
Num pages: 131072
TOTAL 35222423, AVG 268, MIN 12, MAX 47249 ... VALUE -920981768
--------------------------------------
Userfaultfd, regular pages:
Page size: 4096
Num pages: 131072
TOTAL 729833293, AVG 5568, MIN 4358, MAX 126324 ... VALUE -920981768
--------------------------------------
Userfaultfd, huge pages:
Page size: 2097152
Num pages: 256
TOTAL 104413970, AVG 407867, MIN 351902, MAX 2019050 ... VALUE 899929091
This shows that userfaultfd with 4kB pages (within a single process) is
20x slower, and with 2MB pages 3x slower, than simply reading the
mmapped file. The most critical factor is probably the number of
context switches and their latency; the need to copy the data multiple
times still slows down the hugepages version significantly, though. We
could use a hybrid approach that would proactively load the data,
trying to limit the number of page faults that actually need to be
handled; I am not sure how I could quickly PoC such an approach,
though. In any case I think that getting to performance comparable to
mmapped files would be very hard.
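One way such a hybrid could work, building on the handler sketch above:
a background thread walks the registered range and fills pages with
UFFDIO_COPY before they are touched, so the fault handler only resolves
pages accessed ahead of the prefetcher. This is a hypothetical sketch
(prefetch_range is not in the attached test); note that UFFDIO_COPY on
an already-resolved page fails with EEXIST, which can simply be skipped:

#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical prefetch loop: populate the registered range up front
 * so that most accesses never reach the fault handler. */
static void prefetch_range(int uffd, char *addr, const char *src,
                           size_t len, long page_size) {
    for (size_t off = 0; off < len; off += page_size) {
        struct uffdio_copy copy = {
            .dst = (unsigned long)(addr + off),
            .src = (unsigned long)(src + off),
            .len = page_size,
        };
        /* EEXIST means the fault handler got there first - skip. */
        if (ioctl(uffd, UFFDIO_COPY, &copy) == -1 && errno != EEXIST)
            break;
    }
}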
Thanks for any further suggestions.
Radim
[1] https://github.com/openjdk/crac/pull/57
[2] https://noahdesu.github.io/2016/10/10/userfaultfd-hello-world.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: experiment.c
Type: text/x-csrc
Size: 7192 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/crac-dev/attachments/20230516/edc3b14e/experiment-0001.c>