Replacing mmap with userfaultfd
Radim Vansa
rvansa at azul.com
Tue May 16 08:17:17 UTC 2023
Hi all,
I was exploring alternative options to support repeated checkpoints [1]
and I'd like to share results for review and further suggestions.
Currently the CRaC fork of CRIU uses --mmap-page-image by default during
restore - that significantly speeds up loading the image, but if another
checkpoint were performed we would keep the mapping to the directory
with the previous checkpoint. In [1] I've temporarily mapped those pages
read-only, blocking any thread that would access them, copied the data
to newly allocated memory and then remapped the copy to the original
address space. This solution has two downsides:
1) there is asymmetry in the design, as one part (the mapping) is
handled in CRIU while the 'fixup' before the next checkpoint happens in
the JVM
2) writes to those write-protected pages are handled through SIGSEGV.
The JVM already supports user handling of signals (in addition to its
own), but from the POSIX point of view the state of a process after
SIGSEGV is undefined, so this is a rather crude solution.
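For illustration, the core of that fixup could look roughly like the
following (a minimal sketch, not the actual code in [1];
detach_from_image is a hypothetical helper, and the coordination with
threads blocked in the SIGSEGV handler is omitted):

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

static int detach_from_image(void *addr, size_t len) {
    /* 1. Write-protect the range; threads that write into it now
     *    fault and are held in the SIGSEGV handler until the copy
     *    below is in place. */
    if (mprotect(addr, len, PROT_READ) != 0)
        return -1;
    /* 2. Copy the contents into fresh anonymous memory. */
    void *copy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (copy == MAP_FAILED)
        return -1;
    memcpy(copy, addr, len);
    /* 3. Move the copy over the original range; MREMAP_FIXED
     *    atomically replaces the file-backed mapping there. */
    if (mremap(copy, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, addr)
            == MAP_FAILED)
        return -1;
    return 0;
}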
Newer Linux kernels support handling missing pages through
`userfaultfd`. This would be a great solution for (2), but it is not
possible to register this handling on file-backed memory mappings
(except tmpfs on recent kernels). As for loading all memory on demand
through userfaultfd, CRIU already supports that with --lazy-pages,
where a server process is started and the starting application fetches
pages from it as it needs them. If we're looking for a solution to the
asymmetry (1) as well, we could implant the handler directly into the
JVM process. When testing that, though, I found that there's still a
big gap between this lazy loading and a simple mmapped file. To check
the achievable performance I've refactored and modified [2] - the
source code for the test is in the attachment.
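For context, the essence of the userfaultfd flow in the test is roughly
the following (a simplified sketch, not the attached code; error
handling and reading real data from the image file are left out):

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

static long page_size;

/* Fault-handling thread: resolves each missing-page fault
 * with UFFDIO_COPY. */
static void *handler(void *arg) {
    int uffd = (int)(long)arg;
    char *page = malloc(page_size);
    memset(page, 'x', page_size); /* stand-in for data from the image */
    for (;;) {
        struct uffd_msg msg;
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            break;
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;
        struct uffdio_copy copy = {
            .dst = msg.arg.pagefault.address & ~(page_size - 1),
            .src = (unsigned long)page,
            .len = page_size,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);
    }
    return NULL;
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);
    size_t len = 16 * page_size;

    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    /* Registration only works on anonymous memory here; file-backed
     * mappings are rejected (tmpfs/hugetlbfs aside). */
    char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    pthread_t t;
    pthread_create(&t, NULL, handler, (void *)(long)uffd);

    /* Each first touch below traps into the handler thread. */
    long sum = 0;
    for (size_t i = 0; i < len; i += page_size)
        sum += addr[i];
    printf("sum = %ld\n", sum);
    return 0;
}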
In order to run this I had to enable non-root userfaultfd, set the
number of huge pages on my system and generate a 512 MB testing file:
echo 1 | sudo tee /proc/sys/vm/unprivileged_userfaultfd
echo 256 | sudo tee /proc/sys/vm/nr_hugepages
dd if=/dev/random of=/home/rvansa/random bs=1048576 count=512
After compiling and running this on my machine I got these results
pretty consistently:
Tested file has 512 MB (536870912 bytes)
mmaped file:
Page size: 4096
Num pages: 131072
TOTAL 35222423, AVG 268, MIN 12, MAX 47249 ... VALUE -920981768
--------------------------------------
Userfaultfd, regular pages:
Page size: 4096
Num pages: 131072
TOTAL 729833293, AVG 5568, MIN 4358, MAX 126324 ... VALUE -920981768
--------------------------------------
Userfaultfd, huge pages:
Page size: 2097152
Num pages: 256
TOTAL 104413970, AVG 407867, MIN 351902, MAX 2019050 ... VALUE 899929091
This shows that userfaultfd with 4kB pages (within a single process) is
20x slower, and with 2MB pages 3x slower, than simply reading the
mmapped file. The most critical factor is probably the number of
context switches and their latency; the need to copy the data multiple
times still slows down the hugepages version significantly, though. We
could use a hybrid approach that would proactively load the data,
trying to limit the number of page faults that actually need to be
handled; I am not sure how I could quickly PoC such an approach,
though. In any case I think that getting to performance comparable to
mmapped files would be very hard.
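One way such a hybrid could work, building on the handler sketch above:
a background thread walks the registered range and fills pages with
UFFDIO_COPY before they are touched, so the fault handler only resolves
pages accessed ahead of the prefetcher. This is a hypothetical sketch
(prefetch_range is not in the attached test); note that UFFDIO_COPY on
an already-resolved page fails with EEXIST, which can simply be skipped:

#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Hypothetical prefetch loop: populate the registered range up front
 * so that most accesses never reach the fault handler. */
static void prefetch_range(int uffd, char *addr, const char *src,
                           size_t len, long page_size) {
    for (size_t off = 0; off < len; off += page_size) {
        struct uffdio_copy copy = {
            .dst = (unsigned long)(addr + off),
            .src = (unsigned long)(src + off),
            .len = page_size,
        };
        /* EEXIST means the fault handler got there first - skip. */
        if (ioctl(uffd, UFFDIO_COPY, &copy) == -1 && errno != EEXIST)
            break;
    }
}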
Thanks for any further suggestions.
Radim
[1] https://github.com/openjdk/crac/pull/57
[2] https://noahdesu.github.io/2016/10/10/userfaultfd-hello-world.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: experiment.c
Type: text/x-csrc
Size: 7192 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/crac-dev/attachments/20230516/edc3b14e/experiment-0001.c>