RFR: 8341334: CDS: Parallel pretouch and relocation [v3]
Andrew Dinn
adinn at openjdk.org
Wed Oct 2 16:26:36 UTC 2024
On Wed, 2 Oct 2024 14:24:50 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
>> In Leyden performance investigations, we have found that `ArchiveRelocationMode=0` saves 5..7 ms on HelloWorld startup. Mainline defaults to `ARM=1`, _losing_ as much. `ARM=0` was switched to `ARM=1` with [JDK-8294323](https://github.com/openjdk/jdk/commit/be70bc1c58eaec876aa1ab36eacba90b901ac9b8), which was delivered to JDK 17+ in Apr 2023.
>>
>> Profiling shows we spend time mem-faulting the memory while loading the RO/RW regions, about 15 MB total. 15 MB in 5 ms amounts to about 3 GB/sec, close to the single-threaded page-fault limits. I suspect the impact is larger if we relocate a larger Metaspace, e.g. after dumping a CDS archive from a large application. There is little we can do to make the actual relocation part faster: the overwhelming majority of samples are on the kernel side.
>>
>> This PR implements the infrastructure for fast archive workers, and leverages it to do two things:
>> 1. Parallel pretouch of mmap-ed regions. This causes the faults we would otherwise process in a single thread to be processed in multiple threads. The key thing is to eat the page faults in multiple threads; pretouching with a single thread would not help. This improvement gives the biggest bang in both `ArchiveRelocationMode`-s.
>> 2. Parallel core regions relocation. The RW/RO regions this code traverses are large, and we can squeeze out more performance by parallelizing the traversal. Without the pretouch from (1), this step effectively acts as the pretouch for the RW/RO regions.
>>
>> (I'll put some performance data in the comments to show how these interact)
>>
>> We could, in principle, do only (1) and leave just a few hundred microseconds of gain unrealized by skipping (2). I think we can parallelize heap region relocation as well, but so far it is a very small fraction of the time spent in loading, so I left it for future work. I think (1) already covers a lot of ground for heap region relocations.
>>
>> Additional testing:
>> - [x] Linux x86_64 server fastdebug, `runtime/cds`
>> - [ ] Linux x86_64 server fastdebug, `all`
>
> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision:
>
> Do not create worker threads too early: Mac/Windows are not yet ready to use Semaphores
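For readers following along, here is a minimal, self-contained sketch of the pretouch idea from item (1) in the quoted description. This is hypothetical illustration only, not the actual patch or the HotSpot worker infrastructure: the region is split into page-aligned per-thread chunks, and each thread touches one byte per page so that it eats its own share of the page faults.

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Touch one byte per page so the kernel faults the pages in on this thread.
    static void pretouch_range(char* start, size_t size, size_t page) {
      for (size_t off = 0; off < size; off += page) {
        volatile char* p = start + off;
        *p;  // volatile read forces the fault; a write fault would use *p = *p
      }
    }

    // Split [base, base + size) into page-aligned chunks, one thread per chunk.
    static void parallel_pretouch(char* base, size_t size,
                                  unsigned workers, size_t page = 4096) {
      std::vector<std::thread> pool;
      size_t chunk = (size / workers + page - 1) / page * page;
      for (unsigned i = 0; i < workers; i++) {
        size_t begin = i * chunk;
        if (begin >= size) break;
        size_t len = (begin + chunk <= size) ? chunk : size - begin;
        pool.emplace_back(pretouch_range, base + begin, len, page);
      }
      for (auto& t : pool) t.join();
    }

The point of the per-thread split is exactly the one made in the quoted text: a single thread pretouching the whole mapping would still serialize the fault handling.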
src/hotspot/share/cds/archiveUtils.hpp line 346:
> 344: // to take 1/4 CPUs to provide decent parallelism without letting workers
> 345: // stumble over each other.
> 346: static constexpr int cpus_per_worker = 4;
Could these static constexpr values maybe be declared with either an ALL_CAPS_NAME or a _leading_underscore? They currently look like locals/params at the point of use.
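For instance (illustrative only, not a prescribed name):

    // instead of:
    static constexpr int cpus_per_worker = 4;
    // something like:
    static constexpr int CPUS_PER_WORKER = 4;   // or: _cpus_per_worker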
src/hotspot/share/cds/filemap.cpp line 1959:
> 1957: int chunks_per_bitmap = max_chunks / 2;
> 1958: if (chunk < chunks_per_bitmap) {
> 1959: bm = _rw_bm;
How is it guaranteed that the two regions will each have half the chunks?
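If the even split is just an assumption, one defensive option would be to derive the split from the bitmap sizes instead. A hypothetical sketch (names other than `_rw_bm` are invented, and I am assuming both bitmaps expose their sizes):

    // Split chunks between the two bitmaps in proportion to their sizes,
    // rather than assuming each gets exactly half of max_chunks.
    size_t rw_bits = _rw_bm->size();
    size_t ro_bits = _ro_bm->size();
    int chunks_for_rw = (int)((max_chunks * rw_bits) / (rw_bits + ro_bits));
    auto* bm = (chunk < chunks_for_rw) ? _rw_bm : _ro_bm;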
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/21302#discussion_r1784860716
PR Review Comment: https://git.openjdk.org/jdk/pull/21302#discussion_r1784860903