RFR: 8341334: CDS: Parallel relocation [v7]

Wed Nov 6 10:03:33 UTC 2024

On Tue, 5 Nov 2024 20:12:06 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

>> In Leyden performance investigations, we have found that `ArchiveRelocationMode=0` saves 5..7 ms on HelloWorld startup. Mainline defaults to `ARM=1`, _losing_ as much. `ARM=0` was switched to `ARM=1` with [JDK-8294323](https://github.com/openjdk/jdk/commit/be70bc1c58eaec876aa1ab36eacba90b901ac9b8), which was delivered to JDK 17+ in Apr 2023.
>> 
>> Profiling shows we spend the time faulting in memory while loading the RO/RW regions, about 15 MB total. 15 MB in 5ms amounts to >4GB/sec, close to the single-threaded limits. I suspect the impact is larger if we relocate a larger Metaspace, e.g. after dumping a CDS archive from a large application. There is little we can do to make the actual relocation part faster: the overwhelming majority of samples are on the kernel side.
>> 
>> This PR implements the infrastructure for fast archive workers, and leverages it to relocate the core regions in parallel. The RW/RO regions this code traverses are large, and we can squeeze out more performance by parallelizing the traversal. Without pretouch from (1), this step serves as one for RW/RO regions.
>> 
>> (I'll put some performance data in the comments to show how these interact)
>> 
>> Additional testing:
>>  - [x] Linux x86_64 server fastdebug, `runtime/cds`
>>  - [x] Linux AArch64 server fastdebug, `all`
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision:
> 
>  - More perf touchups
>  - Cap the max number of workers
>  - Revert pre-touch parts: there are startup regressions on smaller thread counts
>  - Handle single-threaded modes better
>  - Merge branch 'master' into JDK-8341334-cds-parallel-relocation
>  - Merge branch 'master' into JDK-8341334-cds-parallel-relocation
>  - Make sure we gracefully shutdown whatever happens, refix shutdown race
>  - Simpler bitmap distribution
>  - Capitalize constants
>  - Do not create worker threads too early: Mac/Windows are not yet ready to use Semaphores
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/7909eb5a...d4d739b1

Small nit, otherwise fine.

src/hotspot/share/cds/filemap.cpp line 1972:

> 1970:     BitMap::idx_t size  = bm->size();
> 1971:     BitMap::idx_t start = MIN2(size, size * chunk / max_chunks);
> 1972:     BitMap::idx_t end   = MIN2(size, size * (chunk + 1) / max_chunks);

Can you assert here that `(end - start) > 1`? A violation should never happen with the limited number of workers, but it would be nice to know, e.g. if someone changes the number of workers again.

If it were to happen, the effect would be that we silently skip relocation. Better to fall over this early.
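
For illustration, here is a minimal standalone sketch of the chunk arithmetic quoted above, assuming plain `std::size_t` in place of `BitMap::idx_t` and `std::min` in place of `MIN2`. The helper names `chunk_range` and `all_chunks_nontrivial` are hypothetical and are not part of the patch; the sketch only shows when the per-chunk range degenerates, which is the condition the requested assert would catch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper (not HotSpot code): half-open bit range [start, end)
// assigned to worker `chunk` out of `max_chunks`, using the same arithmetic
// as the quoted lines from filemap.cpp.
std::pair<std::size_t, std::size_t> chunk_range(std::size_t size,
                                                std::size_t chunk,
                                                std::size_t max_chunks) {
  assert(max_chunks > 0 && chunk < max_chunks);
  std::size_t start = std::min(size, size * chunk / max_chunks);
  std::size_t end   = std::min(size, size * (chunk + 1) / max_chunks);
  return std::make_pair(start, end);
}

// Hypothetical check: true only if every chunk satisfies (end - start) > 1,
// i.e. the condition suggested for the assert.
bool all_chunks_nontrivial(std::size_t size, std::size_t max_chunks) {
  for (std::size_t chunk = 0; chunk < max_chunks; chunk++) {
    std::pair<std::size_t, std::size_t> r = chunk_range(size, chunk, max_chunks);
    if (r.second - r.first <= 1) {
      return false;  // a degenerate chunk: the assert would fire here
    }
  }
  return true;
}
```

With a bitmap of 100 bits and 4 chunks, every chunk gets 25 bits and the check passes. With 4 bits and 8 chunks, the integer division yields empty ranges (chunk 0 gets `[0, 0)`), so the check fails.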

-------------

Marked as reviewed by stuefe (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/21302#pullrequestreview-2417757100
PR Review Comment: https://git.openjdk.org/jdk/pull/21302#discussion_r1830711538