RFR: 8341334: CDS: Parallel pretouch and relocation [v6]

Thomas Stuefe stuefe at openjdk.org
Tue Nov 5 15:01:36 UTC 2024


On Mon, 4 Nov 2024 15:06:18 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

>> In Leyden performance investigations, we have figured out that `ArchiveRelocationMode=0` saves 5..7 ms on HelloWorld startup. Mainline defaults to `ARM=1`, _losing_ as much. `ARM=0` was switched to `ARM=1` with [JDK-8294323](https://github.com/openjdk/jdk/commit/be70bc1c58eaec876aa1ab36eacba90b901ac9b8), which was delivered to JDK 17+ in Apr 2023.
>> 
>> Profiling shows we spend the time mem-faulting the memory while loading the RO/RW regions, about 15 MB total. 15 MB in 5 ms amounts to about 3 GB/sec, close to the single-threaded limits. I suspect the impact is larger if we relocate a larger Metaspace, e.g. after dumping a CDS archive from a large application. There is little we can do to make the actual relocation part faster: the overwhelming majority of samples is on the kernel side.
>> 
>> This PR implements the infrastructure for fast archive workers, and leverages it to do two things: 
>> 1. Parallel pretouch of mmap-ed regions. This causes the faults we would otherwise process in a single thread to be processed in multiple threads. The key thing is to eat the page faults in multiple threads; pretouching with a single thread would not help. This improvement gives the biggest bang in both `ArchiveRelocationMode` settings.
>> 2. Parallel core region relocation. The RW/RO regions this code traverses are large, and we can squeeze out more performance by parallelizing the traversal. Without the pretouch from (1), this step serves as the pretouch for the RW/RO regions.
>> 
>> (I'll put some performance data in the comments to show how these interact)
>> 
>> We can, in principle, do only (1), and leave only a few hundred microseconds of unrealized gain by skipping (2). I think we could parallelize heap region relocation as well, but so far I see it is a very small fraction of the time spent in loading, so I left it for future work. I think (1) already covers a lot of ground for heap region relocations.
>> 
>> Additional testing:
>>  - [x] Linux x86_64 server fastdebug, `runtime/cds`
>>  - [x] Linux AArch64 server fastdebug, `all`
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision:
> 
>  - Merge branch 'master' into JDK-8341334-cds-parallel-relocation
>  - Make sure we gracefully shutdown whatever happens, refix shutdown race
>  - Simpler bitmap distribution
>  - Capitalize constants
>  - Do not create worker threads too early: Mac/Windows are not yet ready to use Semaphores
>  - Don't change the patching order in -ArchiveParallelIteration case
>  - Flags
>  - Work

Looks good.

I ran through this in my head a couple of times, especially wrt the communication with the master thread. Thank you for the good comments. 

It seems good. It took me a few minutes to understand the end-of-task semaphore communication: the actual counter lives in _running_workers, and only the last worker signals.
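
For my own understanding, here is a minimal, self-contained sketch of that handshake as I read it (plain C++20, not the HotSpot code; only the _running_workers name is borrowed from the patch, the semaphore name is made up):

    // Each worker decrements the shared counter when it finishes; only the
    // worker that drops it to zero posts the "end" semaphore the master waits on.
    #include <atomic>
    #include <semaphore>
    #include <thread>
    #include <vector>
    #include <cstdio>

    static std::atomic<int>      _running_workers{0};  // name borrowed from the patch
    static std::binary_semaphore _end_sem{0};          // made-up name

    static void worker(int id) {
      std::printf("worker %d: doing its share of the work\n", id);
      if (_running_workers.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        _end_sem.release();  // last worker out signals the master exactly once
      }
    }

    int main() {
      const int nworkers = 4;
      _running_workers.store(nworkers, std::memory_order_release);
      std::vector<std::thread> threads;
      for (int i = 0; i < nworkers; i++) {
        threads.emplace_back(worker, i);
      }
      _end_sem.acquire();  // master blocks here until the last worker posts
      for (std::thread& t : threads) t.join();
      std::printf("master: all workers done\n");
      return 0;
    }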

Some questions:

- You currently start the threads unconditionally whenever a FileMapInfo is created. That means we also start them when dumping. Can this be avoided?
- If we load both the dynamic and the static archive, do thread group startup and shutdown happen twice?
- On machines with fewer than 12 cores, we run with just one worker thread. Should we not just do the work inside the master thread itself then? Or do you already consider this parallel, since the master thread partakes in the work?
- Should we limit the number of workers on machines with many cores? (See also the related question on the pretouch task, and the sketch after this list.)

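To make the last two questions concrete, this is the kind of sizing policy I have in mind; the function name, thresholds and cap below are purely illustrative, not what the patch does:

    #include <algorithm>
    #include <cstdio>
    #include <initializer_list>

    // Hypothetical sizing policy: no worker threads on small machines (the
    // master does the work alone), and a cap on very large machines where
    // additional faulting threads stop helping.
    static int choose_archive_workers(int active_cores) {
      const int MIN_CORES_FOR_WORKERS = 12;  // below this: master-only
      const int MAX_WORKERS           = 8;   // illustrative cap
      if (active_cores < MIN_CORES_FOR_WORKERS) {
        return 0;
      }
      return std::min(active_cores / 2, MAX_WORKERS);
    }

    int main() {
      for (int cores : {4, 12, 32, 128}) {
        std::printf("%3d cores -> %d workers\n", cores, choose_archive_workers(cores));
      }
      return 0;
    }
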
src/hotspot/share/cds/filemap.cpp line 1758:

> 1756:     char* start = _from + MIN2(_bytes, _bytes * chunk / max_chunks);
> 1757:     char* end   = _from + MIN2(_bytes, _bytes * (chunk + 1) / max_chunks);
> 1758:     os::pretouch_memory(start, end);

What happens if I have many cores and a small memory range? We would have many workers for a potentially smallish total range. Could the start..end slices then end up being tiny?

On Linux, we would do madvise with MADV_POPULATE_WRITE. Could we end up feeding madvise invalid range lengths that are not page aligned? Or could it just be inefficient if many threads try to madvise the same overlapping areas? (See the len calculation in os::pd_pretouch_memory.)
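
As a standalone illustration of what worries me, replaying the chunk arithmetic from the quoted lines for a small range and many chunks (the sizes below are made up):

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>

    int main() {
      const size_t bytes      = 64 * 1024 + 123;  // small, oddly sized region
      const size_t max_chunks = 32;               // many workers -> many chunks
      for (size_t chunk = 0; chunk < max_chunks; chunk++) {
        // Same split as filemap.cpp lines 1756..1757 above, just on offsets.
        size_t start = std::min(bytes, bytes * chunk / max_chunks);
        size_t end   = std::min(bytes, bytes * (chunk + 1) / max_chunks);
        // Each slice is ~2 KB here and not page aligned; these lengths would
        // flow into os::pretouch_memory() and, on Linux, into madvise().
        std::printf("chunk %2zu: [%zu, %zu) len = %zu\n", chunk, start, end, end - start);
      }
      return 0;
    }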

src/hotspot/share/cds/filemap.cpp line 1992:

> 1990:     bm->iterate(reloc, start, end);
> 1991:   }
> 1992: };

I wondered why you did not do a SharedDataRelocationTask with a single bitmap and a single relocator, and then run that task twice, once for the RO region and once for the RW region.

But I assume you want to save the communication overhead per run_task.
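
For reference, the shape I had in mind, as a self-contained sketch with placeholder types (the real code would use BitMapView and SharedDataRelocator; the obvious downside is one extra run_task round-trip per region):

    #include <cstddef>
    #include <cstdio>

    struct Relocator {                  // stand-in for the relocating closure
      const char* _region;
      void do_bit(size_t bit) { (void)bit; /* patch one pointer per set bit */ }
    };

    struct Bitmap {                     // stand-in for BitMapView
      size_t _size_in_bits;
      void iterate(Relocator* reloc, size_t start, size_t end) {
        for (size_t bit = start; bit < end; bit++) {
          reloc->do_bit(bit);           // the real code visits only set bits
        }
      }
    };

    // One bitmap, one relocator; the task would be submitted twice, once per region.
    struct SingleRegionRelocationTask {
      Bitmap*    _bm;
      Relocator* _reloc;
      void work(size_t start, size_t end) {
        _bm->iterate(_reloc, start, end);   // same per-chunk body as the quoted code
      }
    };

    int main() {
      Relocator rw_reloc{"rw"}, ro_reloc{"ro"};
      Bitmap    rw_bm{1024},    ro_bm{2048};
      SingleRegionRelocationTask rw_task{&rw_bm, &rw_reloc};
      SingleRegionRelocationTask ro_task{&ro_bm, &ro_reloc};
      rw_task.work(0, rw_bm._size_in_bits);  // in the real code: workers.run_task(&rw_task)
      ro_task.work(0, ro_bm._size_in_bits);  // ... then run_task(&ro_task)
      std::printf("relocated rw, then ro\n");
      return 0;
    }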

-------------

PR Review: https://git.openjdk.org/jdk/pull/21302#pullrequestreview-2415865581
PR Review Comment: https://git.openjdk.org/jdk/pull/21302#discussion_r1829496265
PR Review Comment: https://git.openjdk.org/jdk/pull/21302#discussion_r1829504351

