RFR: 8259380: Correct pretouch chunk size to cap with actual page size

Thomas Schatzl tschatzl at openjdk.java.net
Fri Jan 8 11:08:57 UTC 2021


On Thu, 7 Jan 2021 16:56:37 GMT, Patrick Zhang <qpzhang at openjdk.org> wrote:

> This is a regression that causes an extreme JVM startup slowdown, initially found on an aarch64 platform (Ampere Altra core).
> 
> The pretouch chunk size should be no smaller than the input page size, which is probably the large page size if UseLargePages is set. Otherwise, processing chunks much smaller than the large page they sit in hurts performance.
> 
> This issue was introduced by a refactoring of the chunk calculations in [JDK-8254972](https://bugs.openjdk.java.net/browse/JDK-8254972) (https://github.com/openjdk/jdk/commit/2c7fc85be92c60f4262aff3bc80e704792c1e810), but it did not cause any problems immediately, since the default PreTouchParallelChunkSize on all platforms was 1GB, which covers the large page sizes commonly used by most kernels. Later, [JDK-8254699](https://bugs.openjdk.java.net/browse/JDK-8254699) (https://github.com/openjdk/jdk/commit/805d05812c5e831947197419d163f9c83d55634a) set the default to 4MB on Linux, which helps startup time on some platforms: for example, on most x64 systems the common default large page size (e.g. on CentOS) is 2MB. In contrast, the default large page size on most aarch64 platforms/kernels (e.g. CentOS) is 512MB, so walking through a 512MB large page in 4MB chunks hurts startup time.
> 
> In addition, a similar problem occurs if we set -XX:PreTouchParallelChunkSize=4k on an x64 Linux platform: the same startup slowdown shows up.
> 
> Tests:
> https://bugs.openjdk.java.net/secure/attachment/92623/pretouch_chunk_size_fix_testing.txt
> The 4 before-after comparisons show the JVM startup time going back to normal.
> 1). 33.381s to 0.870s
> 2). 20.333s to 2.740s
> 3). 15.090s to 6.268s
> 4). 38.983s to 6.709s
> (These use the start time of pretouching the first Survivor space as a rough measurement; \time or GCTraceTime give similar results.)
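
A minimal sketch of the capping described in the quoted text, assuming "cap with the page size" means the chunk size is never allowed to fall below the page size backing the region. The helper names `effective_chunk_size` and `chunks_per_page` are hypothetical, and this is an illustration rather than the HotSpot code from the PR.

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t K = 1024;
constexpr std::size_t M = 1024 * K;

// Hypothetical helper: use the requested chunk size unless it is smaller
// than the page size, in which case fall back to the page size.
constexpr std::size_t effective_chunk_size(std::size_t requested_chunk,
                                           std::size_t page_size) {
  return std::max(requested_chunk, page_size);
}

// Hypothetical helper: how many chunks land inside a single page.
// A value above 1 means several chunks are processed within one page.
constexpr std::size_t chunks_per_page(std::size_t chunk,
                                      std::size_t page_size) {
  return chunk >= page_size ? 1 : page_size / chunk;
}
```

With the sizes mentioned above, an uncapped 4MB chunk on a 512MB large page gives `chunks_per_page(4 * M, 512 * M) == 128`, while `effective_chunk_size(4 * M, 512 * M)` yields 512MB, i.e. one chunk per page. On a 2MB large page the 4MB chunk is already large enough and is left unchanged.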

Thanks for finding and reporting this issue and even providing a patch.

After looking at the issue, we (in the Oracle GC team) think this problem is serious enough to go into JDK16. Since backporting after the change has been pushed to another repo (this one) is extra effort, would you mind closing this PR here on openjdk/jdk and opening a new one on openjdk/jdk16?

It will then be automatically forward-ported to this repo. Not only is backporting additional effort, there is also concern that the fix won't make it into jdk16 otherwise: Jan 14 is the cutoff date for bugs of this seriousness, and we would need to get an exception for it after that.

As one of the people who typically triage new issues in the bug tracker, I would also like to ask you not to open new issues immediately. We look at these issues three times a week, and if you open them yourself, they might not be handled correctly (in this case, that would have meant putting the issue into openjdk/jdk16 immediately). You can still create a PR and everything else even if an issue is in the "New" state.

I'll start looking at your change immediately.

Thanks,
  Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/1978


