Replacing mmap with userfaultfd
Radim Vansa
rvansa at azul.com
Tue May 16 10:53:12 UTC 2023
I've since experimented a bit with two more things. The first was to
remove the extra copy by allocating new anonymous memory, reading the
data directly into it, remapping it to the original location and waking
up the faulting thread. This proved to be even slower than UFFDIO_COPY
for both regular and huge pages.
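
For illustration, the remap variant resolves a fault roughly along these
lines - a simplified sketch rather than the exact code in the attached
experiment.c; error handling is omitted and uffd/file_fd/file_offset are
just placeholder names (fault_addr is the page-aligned faulting address):

  #define _GNU_SOURCE
  #include <sys/mman.h>
  #include <sys/ioctl.h>
  #include <linux/userfaultfd.h>
  #include <unistd.h>

  static void resolve_by_remap(int uffd, int file_fd, void *fault_addr,
                               off_t file_offset, size_t page_size)
  {
      /* allocate a scratch anonymous page */
      void *tmp = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      /* read the backing data directly into it - no intermediate copy */
      pread(file_fd, tmp, page_size, file_offset);
      /* move the populated page over the faulting address */
      mremap(tmp, page_size, page_size,
             MREMAP_MAYMOVE | MREMAP_FIXED, fault_addr);
      /* wake the thread blocked on the page fault */
      struct uffdio_range range = {
          .start = (unsigned long)fault_addr,
          .len   = page_size,
      };
      ioctl(uffd, UFFDIO_WAKE, &range);
  }
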
The second was to use the MAP_POPULATE flag on the mmapped file: the
mmap call itself then took significantly longer (almost as if the flag
were synchronous rather than a hint), but the total time for mmap plus
reading was lower.
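
The populated variant differs only in the flags passed to mmap, roughly
as below (file_fd/file_size stand for the test file):

  #include <sys/mman.h>

  static void *map_populated(int file_fd, size_t file_size)
  {
      /* MAP_POPULATE asks the kernel to fault the whole file in up
       * front: the mmap call itself gets slower, later reads are cheap */
      return mmap(NULL, file_size, PROT_READ,
                  MAP_PRIVATE | MAP_POPULATE, file_fd, 0);
  }
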
Results from attached test:
Tested file has 512 MB (536870912 bytes)
mmap call took 872 ns
mmapped file:
Page size: 4096
Num pages: 131072
TOTAL 35825555 (33845372), AVG 258, MIN 12, MAX 31720 ... VALUE -920981768
--------------------------------------
mmap call took 20329669 ns
mmapped file, populated:
Page size: 4096
Num pages: 131072
TOTAL 8158203 (6215554), AVG 47, MIN 12, MAX 8432 ... VALUE -920981768
--------------------------------------
Userfaultfd, regular pages, copy:
Page size: 4096
Num pages: 131072
TOTAL 673848374 (671470568), AVG 5122, MIN 4305, MAX 285817 ... VALUE -920981768
--------------------------------------
Userfaultfd, regular pages, remap:
Page size: 4096
Num pages: 131072
TOTAL 1126162943 (1123755653), AVG 8573, MIN 7070, MAX 203940 ... VALUE -920981768
--------------------------------------
Userfaultfd, huge pages, copy:
Page size: 2097152
Num pages: 256
TOTAL 105654950 (105609575), AVG 412537, MIN 356688, MAX 2008166 ... VALUE 899929091
--------------------------------------
Userfaultfd, huge pages, remap:
Page size: 2097152
Num pages: 256
TOTAL 165370794 (165326676), AVG 645807, MIN 575171, MAX 3516155 ... VALUE 899929091
--------------------------------------
Radim
On 16. 05. 23 10:17, Radim Vansa wrote:
>
> Hi all,
>
> I have been exploring alternative options to support repeated
> checkpoints [1] and I'd like to share the results for review and
> further suggestions.
>
> Currently the CRaC fork of CRIU uses --mmap-page-image by default
> during restore - that significantly speeds up loading the image, but if
> another checkpoint were performed we would keep the mapping to the
> directory with the previous checkpoint. In [1] I've temporarily mapped
> those pages read-only, blocking any thread that would access them,
> copied the data to newly allocated memory and then remapped the copy to
> the original address space. This solution has two downsides:
>
> 1) there is asymmetry in the design, as one part (the mapping) is
> handled in CRIU while the 'fixup' before the next checkpoint happens in
> the JVM
>
> 2) writes to those write-protected pages are handled through a SIGSEGV
> handler. The JVM already supports user handling of signals (in addition
> to its own), but from the POSIX point of view the state of the process
> after SIGSEGV is undefined, so this is a rather crude solution (the
> rough shape is sketched below).
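>
> The rough shape of that mechanism, just for illustration (not the
> actual JVM/CRIU code; page_size, image_start and image_len are
> placeholders, and the real handler would do the copy and remap where
> the comment indicates):
>
>   #include <signal.h>
>   #include <string.h>
>   #include <stdint.h>
>   #include <sys/mman.h>
>
>   static size_t page_size;   /* sysconf(_SC_PAGESIZE) */
>
>   static void segv_handler(int sig, siginfo_t *info, void *ctx)
>   {
>       void *page = (void *)((uintptr_t)info->si_addr
>                             & ~(uintptr_t)(page_size - 1));
>       /* here the real code copies the page contents and remaps a
>        * fresh copy; the sketch just lifts the protection so that the
>        * faulting write can be restarted */
>       mprotect(page, page_size, PROT_READ | PROT_WRITE);
>   }
>
>   static void protect_image(void *image_start, size_t image_len)
>   {
>       struct sigaction sa;
>       memset(&sa, 0, sizeof(sa));
>       sa.sa_sigaction = segv_handler;
>       sa.sa_flags = SA_SIGINFO;
>       sigaction(SIGSEGV, &sa, NULL);
>       /* write-protect the pages mapped from the checkpoint image */
>       mprotect(image_start, image_len, PROT_READ);
>   }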
>
> Newer Linux kernels support handling missing pages through
> `userfaultfd`. This would be a great solution for (2), but it is not
> possible to register this handling on file-backed memory (except tmpfs
> on recent kernels). As for loading all memory on demand through
> userfaultfd, CRIU already supports that with --lazy-pages, where a
> server process is started and the restored application fetches pages as
> it needs them. If we're looking for a more symmetric solution to (1) we
> could implant the handler directly into the JVM process. When testing
> that, though, I found that there's still a big gap between this lazy
> loading and a simple mmapped file. To check the achievable performance
> I've refactored and modified [2] - the source code for the test is
> attached.
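>
> For reference, the in-process handling follows the usual userfaultfd
> pattern, roughly as below (simplified; error handling and the handler
> thread setup are omitted, and page_with_file_data is a placeholder
> buffer already filled from the test file):
>
>   #include <linux/userfaultfd.h>
>   #include <sys/syscall.h>
>   #include <sys/ioctl.h>
>   #include <unistd.h>
>   #include <fcntl.h>
>
>   static int register_region(void *addr, size_t len)
>   {
>       int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
>       struct uffdio_api api = { .api = UFFD_API };
>       ioctl(uffd, UFFDIO_API, &api);
>       /* the region must be anonymous (or tmpfs on recent kernels) */
>       struct uffdio_register reg = {
>           .range = { .start = (unsigned long)addr, .len = len },
>           .mode  = UFFDIO_REGISTER_MODE_MISSING,
>       };
>       ioctl(uffd, UFFDIO_REGISTER, &reg);
>       return uffd;
>   }
>
>   /* handler thread: read a fault event, resolve it with UFFDIO_COPY */
>   static void handle_one_fault(int uffd, void *page_with_file_data,
>                                size_t page_size)
>   {
>       struct uffd_msg msg;
>       read(uffd, &msg, sizeof(msg));    /* blocks until a fault */
>       struct uffdio_copy copy = {
>           .dst = msg.arg.pagefault.address & ~(__u64)(page_size - 1),
>           .src = (unsigned long)page_with_file_data,
>           .len = page_size,
>       };
>       ioctl(uffd, UFFDIO_COPY, &copy);  /* installs the page and wakes */
>   }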
>
> In order to run this I had to enable non-root userfaultfd, set the
> number of huge pages on my system and generate a 512 MB test file:
>
> echo 1 | sudo tee /proc/sys/vm/unprivileged_userfaultfd
> echo 256 | sudo tee /proc/sys/vm/nr_hugepages
> dd if=/dev/random of=/home/rvansa/random bs=1048576 count=512
>
> After compiling and running this on my machine I got these results
> pretty consistently:
>
> Tested file has 512 MB (536870912 bytes)
> mmaped file:
> Page size: 4096
> Num pages: 131072
> TOTAL 35222423, AVG 268, MIN 12, MAX 47249 ... VALUE -920981768
> --------------------------------------
> Userfaultfd, regular pages:
> Page size: 4096
> Num pages: 131072
> TOTAL 729833293, AVG 5568, MIN 4358, MAX 126324 ... VALUE -920981768
> --------------------------------------
> Userfaultfd, huge pages:
> Page size: 2097152
> Num pages: 256
> TOTAL 104413970, AVG 407867, MIN 351902, MAX 2019050 ... VALUE 899929091
>
> This shows that userfaultfd with 4kB pages (within a single process) is
> 20x slower, and with 2MB pages 3x slower, than simply reading the
> mmapped file. The most critical factor is probably the number of
> context switches and their latency; the need to copy the data multiple
> times still slows the huge-pages version significantly, though.
>
> We could use a hybrid approach that proactively loads the data, trying
> to limit the number of page faults that actually need to be handled; I
> am not sure how I could PoC such an approach quickly, though (an
> untested sketch of one possible shape follows below). In any case I
> think that getting to performance comparable to mmapped files would be
> very hard.
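>
> Just to make that idea a bit more concrete, the handler could answer
> each fault with a multi-page UFFDIO_COPY; this is untested and reuses
> the uffd/msg/page_size names from the handler above, with buf being a
> placeholder buffer already filled from the image:
>
>   /* resolve a fault by copying a whole aligned window of pages, so
>    * that neighbouring accesses never reach the handler at all */
>   size_t window = 64 * page_size;
>   __u64 start = msg.arg.pagefault.address & ~(__u64)(window - 1);
>   struct uffdio_copy copy = {
>       .dst = start,
>       .src = (unsigned long)buf,
>       .len = window,
>   };
>   ioctl(uffd, UFFDIO_COPY, &copy);
>   /* a real version would clamp the window to the registered range and
>    * track already-present pages, since UFFDIO_COPY stops with EEXIST
>    * when it hits a page that is already mapped */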
>
> Thanks for any further suggestions.
>
> Radim
>
> [1] https://github.com/openjdk/crac/pull/57
>
> [2] https://noahdesu.github.io/2016/10/10/userfaultfd-hello-world.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: experiment.c
Type: text/x-csrc
Size: 9092 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/crac-dev/attachments/20230516/5463a063/experiment.c>