RFC: improving NMethod code locality in CodeCache

Astigeevich, Evgeny eastig at amazon.co.uk
Wed Jan 5 19:16:33 UTC 2022


Hi Boris,

Thank you for the comments.

> You say [1] that branch prediction hardware can become overloaded in the case of 15K compiled methods.
> In your numbers, I see the maximum is 7K methods ~ 50MB (on the Renaissance benchmark).

This 15K is not a threshold; it is application dependent. I found that the DaCapo eclipse benchmark got a ~6.0% improvement on Graviton2 when tiered compilation was off and the code cache was limited to 64M. The eclipse benchmark has ~5K C2 methods. These numbers show the total number of C2 methods created during a benchmark run. However, the most important metrics are the number of the hottest C2 methods and their memory map, and these metrics change while an application runs. It is the sparse memory map that causes the branch prediction issues.
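In case it is useful for collecting these metrics, the compiled-method data can be pulled from a running VM with jcmd; the exact output format varies between JDK versions, so treat this as a sketch rather than a verified transcript:

  jcmd <pid> Compiler.codecache   # code heap bounds and usage
  jcmd <pid> Compiler.codelist    # one line per compiled method, including its compile level

Counting the level-4 entries of Compiler.codelist gives the number of C2 methods at that moment.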

> Also on the aws-graviton-getting-started page [2] we see that the recommended CodeCacheSize value is 64M - more than that has a performance impact.
> These cases may also differ in the contents of the code cache: I guess it's tiered compilation in the benchmarks and non-tiered C2 in [2].

The advice is to turn off tiered compilation and, at the same time, to limit the code cache to 64M (see the example flags after the list below).
It is based on two observations:
1. All compiled methods will be C2-compiled. Non-tiered C2 methods are smaller than tiered C2 methods. One thing I noticed is that tiered C2 methods have more code; I'll check what else differs.
2. Limiting the code cache to 64M forces the Sweeper to remove cold methods, which keeps the set of the hottest methods compact. For our application we saw code cache usage drop from 130M to 37M.
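For reference, the combination from [2] boils down to flags like the following (a sketch of the relevant options, not the exact command line from the guide):

  java -XX:-TieredCompilation -XX:ReservedCodeCacheSize=64M ... <app>

Adding -XX:+PrintCodeCache makes the VM report code cache usage at exit, which is an easy way to verify the effect of the limit.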

> - What is the typical CodeCache size for real-world applications? Is it common for the CodeCache to get to hundreds of megabytes? Can it be simulated with benchmarks?

Besides the data from our application, I don't have such data. I'll think about how to collect it.
With tiered compilation on, the default code cache size is 240M: 116M for non-profiled methods, 116M for profiled methods, and 8M for non-nmethods. This is for x86_64 and arm64. With tiered compilation off, the default code cache size is 48M. I have not seen much research on code cache usage by real-world applications on x86.
I think we can use some DaCapo and Renaissance benchmarks; SPECjbb needs to be checked as well.
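For completeness, the defaults and the split between the code heaps can be checked directly; treat the exact command lines below as a sketch:

  java -XX:+PrintFlagsFinal -version | grep -iE 'CodeCache|CodeHeap'
  java -XX:+PrintCodeCache -version

The first command prints the configured sizes (ReservedCodeCacheSize, NonProfiledCodeHeapSize, ProfiledCodeHeapSize, NonNMethodCodeHeapSize); the second prints the usage of each code heap when the VM exits.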
  
> I am not sure that branch predictors are often limited to a certain amount of memory, which is much less than the possible size of the code.

They are. See:
"The BTB in contemporary Intel chips" https://xania.org/201602/bpu-part-three
"Branch predictor: How many "if"s are too many? Including x86 and M1 benchmarks!" https://blog.cloudflare.com/branch-predictor/

> There are now 3 generations of AWS Graviton HW. Do you observe the same branch prediction and code cache size effects on all three?

I have no data for Graviton 1 or Graviton 3. I plan to try Graviton 3 as soon as I get access to it. Graviton 1 based instances (A1) are not as widely used as Graviton 2 instances, but it might still be interesting to get data from Graviton 1: it is based on Cortex-A72, which differs considerably from the Neoverse N1. Among other arm64 implementations, Apple M1 is a good candidate to check.

> - What does the maximum CodeCache limit mean, is this the distance from the first method to the last?

This is the maximum amount of memory reserved for the CodeCache; it is not the distance between methods. If tiered compilation is on, the reserved memory is split into several code heaps.
See https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/codeCache.cpp#L1485 for details.
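If it helps, the split is also visible from Java through the standard memory pool MXBeans. A minimal sketch (the pool names assume a segmented code cache; without -XX:+SegmentedCodeCache there is a single "CodeCache" pool):

  import java.lang.management.ManagementFactory;
  import java.lang.management.MemoryPoolMXBean;

  public class PrintCodeHeaps {
      public static void main(String[] args) {
          for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
              // Pools named "CodeHeap 'non-nmethods'", "CodeHeap 'profiled nmethods'",
              // "CodeHeap 'non-profiled nmethods'" or "CodeCache".
              if (pool.getName().contains("CodeHeap") || pool.getName().contains("CodeCache")) {
                  System.out.printf("%s: used=%d committed=%d max=%d%n",
                          pool.getName(),
                          pool.getUsage().getUsed(),
                          pool.getUsage().getCommitted(),
                          pool.getUsage().getMax());
              }
          }
      }
  }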

> Will it help if C2 put the metadata and other things on the next page after the instructions page? I mean it is worth putting them not too far from each other.

What does "page" mean in this context? OS page? Or CodeCache page?

> I believe it makes sense to work with the Sweeper so that it actively removes cold methods from the CodeCache (see the Hotness Code picture on page 65, [3]).

Currently the Sweeper relies on nmethod allocation and state-change events; these events trigger updates of nmethod temperatures. If no such events happen, the temperatures are not updated.
To remove cold methods more actively, we need a sampling mechanism. See the ideas in https://bugs.openjdk.java.net/browse/JDK-8279184.
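To make the sampling idea concrete, here is a rough Java-level analogy (this is not how HotSpot would implement it; a real mechanism would sample inside the VM and work on nmethods, not Java stack traces):

  import java.util.HashMap;
  import java.util.Map;
  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;

  public class HotnessSampler {
      // Hypothetical per-method temperature: bumped when a method is seen on a
      // sampled stack, decayed otherwise. Entries that stay cold would be
      // candidates for removal.
      private final Map<String, Integer> temperature = new HashMap<>();

      public void start() {
          ScheduledExecutorService sampler = Executors.newSingleThreadScheduledExecutor();
          sampler.scheduleAtFixedRate(this::sampleOnce, 100, 100, TimeUnit.MILLISECONDS);
      }

      private synchronized void sampleOnce() {
          // Cool everything down a little...
          temperature.replaceAll((m, t) -> Math.max(0, t - 1));
          // ...then heat up whatever is currently on some thread's stack.
          for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
              for (StackTraceElement frame : stack) {
                  temperature.merge(frame.getClassName() + "." + frame.getMethodName(), 2, Integer::sum);
              }
          }
      }
  }

The point of the sketch is only that a periodic sampler updates temperatures independently of allocation and state-change events, so methods that stop executing cool down even when nothing else happens in the code cache.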

> In general a GC-like approach can be applied to the CodeCache to make it clean, small and hot.

IMHO, this can get arbitrarily complex, up to a full-blown generational CodeCache GC.

Thanks,
Evgeny


From: Boris Ulasevich <boris.ulasevich at bell-sw.com>
Date: Thursday, 23 December 2021 at 15:59
To: "Astigeevich, Evgeny" <eastig at amazon.co.uk>, "hotspot-dev at openjdk.java.net" <hotspot-dev at openjdk.java.net>
Subject: RE: RFC: improving NMethod code locality in CodeCache

Hi Evgeny,

Thank you for sharing the data. It is very detailed and well structured. It is indeed interesting that the code itself takes ~1/2 of the volume and sometimes even less. So, judging from the numbers, we can (theoretically) double the code density. I agree that it is worth doing.

You say [1] that branch prediction hardware can become overloaded in the case of 15K compiled methods. In your numbers, I see the maximum is 7K methods ~ 50MB (on the Renaissance benchmark). This is quite a load, yes. Also on the aws-graviton-getting-started page [2] we see that the recommended CodeCacheSize value is 64M - more than that has a performance impact. These cases may also differ in the contents of the code cache: I guess it's tiered compilation in the benchmarks and non-tiered C2 in [2].

My questions are
- What is the typical CodeCache size for real-world applications? Is it common for the CodeCache to get to hundreds of megabytes? Can it be simulated with benchmarks?
- I am not sure that branch predictors are often limited to a certain amount of memory, which is much less than the possible size of the code. There are now 3 generations of AWS Graviton HW. Do you observe the same branch prediction and code cache size effects on all three?
- What does the maximum CodeCache limit mean, is this the distance from the first method to the last? Will it help if C2 put the metadata and other things on the next page after the instructions page? I mean it is worth putting them not too far from each other.

Besides the code density issue in the case of a limited CodeCache size (either a small amount of memory or a limitation of the branch predictor), I believe it makes sense to work with the Sweeper so that it actively removes cold methods from the CodeCache (see the Hotness Code picture on page 65, [3]). After the virtual machine warms up, the compiler threads are idle anyway. In general a GC-like approach can be applied to the CodeCache to make it clean, small and hot.

thanks,
Boris

[1] https://mail.openjdk.java.net/pipermail/hotspot-dev/2021-November/056198.html
[2] https://github.com/aws/aws-graviton-getting-started/blob/main/java.md
[3] http://cr.openjdk.java.net/~thartmann/papers/2014-Code_Cache_Optimizations-thesis.pdf 




