RFR: 8341334: CDS: Parallel pretouch and relocation

Aleksey Shipilev shade at openjdk.org
Wed Oct 2 10:54:06 UTC 2024


On Wed, 2 Oct 2024 10:45:14 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> In Leyden performance investigations, we found that `ArchiveRelocationMode=0` saves 5..7 ms on HelloWorld startup. Mainline defaults to `ARM=1`, _losing_ as much. The default was switched from `ARM=0` to `ARM=1` by [JDK-8294323](https://github.com/openjdk/jdk/commit/be70bc1c58eaec876aa1ab36eacba90b901ac9b8), which was delivered to JDK 17+ in Apr 2023.
> 
> Profiling shows we spend that time taking page faults while loading the RO/RW regions, about 15 MB total. 15 MB in 5ms amounts to >4GB/sec, close to the single-threaded limits. I suspect the impact is larger if we relocate a larger Metaspace, e.g. after dumping a CDS archive from a large application. There is little we can do to make the actual relocation part faster: the overwhelming majority of samples are on the kernel side.
> 
> This PR implements the infrastructure for fast archive workers, and leverages it to do two things: 
> 1. Parallel pretouch of mmap-ed regions. This causes the faults we would otherwise process in a single thread to be processed in multiple threads. This gives the biggest bang in both `ArchiveRelocationMode`-s.
> 2. Parallel core regions relocation. The RW/RO regions this code traverses are large, and we can squeeze out more performance by parallelizing the traversal. Without the pretouch from (1), this step also serves as the pretouch for the RW/RO regions. We could, in principle, do only the pretouch in (1) and leave only a few hundred microseconds of unrealized gain.
> 
> (I'll put some performance data in the comments to show how these interact)
> 
> I think we can parallelize heap region relocations as well, but so far I see they are a very small fraction of the time spent in loading, so I left that for future work. I think (1) covers a lot of ground for heap region relocations already.
> 
> Additional testing:
>  - [x] Linux x86_64 server fastdebug, `runtime/cds`
>  - [ ] Linux x86_64 server fastdebug, `all`
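
The two steps in the quoted description can be illustrated with a minimal, standalone C++ sketch. It is not the code from the PR: `parallel_chunks`, `parallel_pretouch` and `parallel_relocate` are hypothetical names, plain `std::thread` stands in for the archive workers, pretouch is modeled as one read per page, and relocation is modeled as adding a fixed delta to every slot flagged as a pointer in a byte mask.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: split [0, total) into contiguous chunks and run
// fn(first, last) for each chunk on its own thread, then wait for all.
template <typename Fn>
void parallel_chunks(std::size_t total, unsigned workers, Fn fn) {
    if (workers == 0) workers = 1;
    const std::size_t chunk = (total + workers - 1) / workers;
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t first = std::min(total, w * chunk);
        const std::size_t last = std::min(total, first + chunk);
        if (first < last) threads.emplace_back(fn, first, last);
    }
    for (auto& t : threads) t.join();
}

// Step 1 (sketch): read one byte per page so that page faults are taken
// by several threads instead of one. Assumes page_size > 0.
// Returns the number of pages covered by [base, base + size).
std::size_t parallel_pretouch(const unsigned char* base, std::size_t size,
                              std::size_t page_size, unsigned workers) {
    const std::size_t pages = (size + page_size - 1) / page_size;
    parallel_chunks(pages, workers, [=](std::size_t first, std::size_t last) {
        // volatile keeps the compiler from dropping the otherwise unused loads
        const volatile unsigned char* p = base;
        for (std::size_t page = first; page < last; ++page) {
            unsigned char sink = p[page * page_size];
            (void)sink;
        }
    });
    return pages;
}

// Step 2 (sketch): add `delta` to every slot marked in `is_pointer`.
// Assumes is_pointer.size() >= slots.size(). Workers own disjoint slot
// ranges, so no synchronization is needed between them.
void parallel_relocate(std::vector<std::uintptr_t>& slots,
                       const std::vector<std::uint8_t>& is_pointer,
                       std::intptr_t delta, unsigned workers) {
    parallel_chunks(slots.size(), workers,
                    [&, delta](std::size_t first, std::size_t last) {
        for (std::size_t i = first; i < last; ++i) {
            if (is_pointer[i]) slots[i] += static_cast<std::uintptr_t>(delta);
        }
    });
}
```

Calling `parallel_relocate` on untouched memory faults the pages in as a side effect. This is one way to read the remark that, without (1), the relocation step doubles as the pretouch for the regions it walks.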

With the `HelloStream` example from [JEP-483](https://openjdk.org/jeps/483), on my 5950X desktop:

Default `-XX:ArchiveRelocationMode=1` is responsive to both `ArchivePreTouch` and `ArchiveParallelRelocation`:


# -ArchivePreTouch -ArchiveParallelRelocation (current mainline)
  Time (mean ± σ):      45.8 ms ±   0.5 ms    [User: 30.2 ms, System: 23.0 ms]
  Range (min … max):    44.8 ms …  47.4 ms    500 runs

# +ArchivePreTouch -ArchiveParallelRelocation
  Time (mean ± σ):      40.9 ms ±   0.5 ms    [User: 29.6 ms, System: 30.7 ms]
  Range (min … max):    39.7 ms …  42.6 ms    500 runs

# -ArchivePreTouch +ArchiveParallelRelocation
  Time (mean ± σ):      40.1 ms ±   0.4 ms    [User: 30.4 ms, System: 25.0 ms]
  Range (min … max):    39.1 ms …  41.3 ms    500 runs

# +ArchivePreTouch +ArchiveParallelRelocation
  Time (mean ± σ):      39.9 ms ±   0.3 ms    [User: 30.7 ms, System: 30.2 ms]
  Range (min … max):    38.9 ms …  41.2 ms    500 runs


`-XX:ArchiveRelocationMode=0` (current Leyden premain; JDK prior to JDK-8294323) is responsive only to `ArchivePreTouch`, since it never actually goes into the improved relocation code.


# -ArchivePreTouch -ArchiveParallelRelocation (current mainline)
  Time (mean ± σ):      41.1 ms ±   0.4 ms    [User: 28.5 ms, System: 20.2 ms]
  Range (min … max):    40.3 ms …  43.0 ms    500 runs

# +ArchivePreTouch -ArchiveParallelRelocation
  Time (mean ± σ):      39.8 ms ±   0.4 ms    [User: 28.8 ms, System: 30.5 ms]
  Range (min … max):    38.8 ms …  41.1 ms    500 runs

# -ArchivePreTouch +ArchiveParallelRelocation
  Time (mean ± σ):      41.1 ms ±   0.4 ms    [User: 28.2 ms, System: 20.4 ms]
  Range (min … max):    40.1 ms …  42.6 ms    500 runs

# +ArchivePreTouch +ArchiveParallelRelocation
  Time (mean ± σ):      39.8 ms ±   0.4 ms    [User: 29.4 ms, System: 29.8 ms]
  Range (min … max):    38.8 ms …  41.1 ms    500 runs

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21302#issuecomment-2388349450


More information about the hotspot-runtime-dev mailing list