RFR: 8350852: Implement JMH benchmark for sparse CodeCache

Mon Mar 10 22:18:52 UTC 2025

On Thu, 27 Feb 2025 22:23:23 GMT, Evgeny Astigeevich <eastigeevich at openjdk.org> wrote:

> This benchmark is used to check the performance impact of the code cache being sparse.
> 
> We use the C2 compiler to compile the same Java method multiple times to produce as much code as needed. The Java method is not trivial: it adds two 40-digit positive integers. These compiled methods represent the active methods in the code cache. We split the active methods into groups and put each group into a fixed-size code region. Each code region is aligned to its size. The CodeCache becomes sparse when code regions are not fully filled. We measure the time taken to call all active methods.
> 
> Results: code region size 2M (2097152) bytes
> - Intel Xeon Platinum 8259CL
> 
> |activeMethodCount	|groupCount	|Methods/Group	|Score	|Error	|Units	|Diff	|
> |---	|---	|---	|---	|---	|---	|---	|
> |128	|1	|128	|19.577	|0.619	|us/op	|	|
> |128	|32	|4	|22.968	|0.314	|us/op	|17.30%	|
> |128	|48	|3	|22.245	|0.388	|us/op	|13.60%	|
> |128	|64	|2	|23.874	|0.84	|us/op	|21.90%	|
> |128	|80	|2	|23.786	|0.231	|us/op	|21.50%	|
> |128	|96	|1	|26.224	|1.16	|us/op	|34%	|
> |128	|112	|1	|27.028	|0.461	|us/op	|38.10%	|
> |256	|1	|256	|47.43	|1.146	|us/op	|	|
> |256	|32	|8	|63.962	|1.671	|us/op	|34.90%	|
> |256	|48	|5	|63.396	|0.247	|us/op	|33.70%	|
> |256	|64	|4	|66.604	|2.286	|us/op	|40.40%	|
> |256	|80	|3	|59.746	|1.273	|us/op	|26%	|
> |256	|96	|3	|63.836	|1.034	|us/op	|34.60%	|
> |256	|112	|2	|63.538	|1.814	|us/op	|34%	|
> |512	|1	|512	|172.731	|4.409	|us/op	|	|
> |512	|32	|16	|206.772	|6.229	|us/op	|19.70%	|
> |512	|48	|11	|215.275	|2.228	|us/op	|24.60%	|
> |512	|64	|8	|212.962	|2.028	|us/op	|23.30%	|
> |512	|80	|6	|201.335	|12.519	|us/op	|16.60%	|
> |512	|96	|5	|198.133	|6.502	|us/op	|14.70%	|
> |512	|112	|5	|193.739	|3.812	|us/op	|12.20%	|
> |768	|1	|768	|325.154	|5.048	|us/op	|	|
> |768	|32	|24	|346.298	|20.196	|us/op	|6.50%	|
> |768	|48	|16	|350.746	|2.931	|us/op	|7.90%	|
> |768	|64	|12	|339.445	|7.927	|us/op	|4.40%	|
> |768	|80	|10	|347.408	|7.355	|us/op	|6.80%	|
> |768	|96	|8	|340.983	|3.578	|us/op	|4.90%	|
> |768	|112	|7	|353.949	|2.98	|us/op	|8.90%	|
> |1024	|1	|1024	|368.352	|5.961	|us/op	|	|
> |1024	|32	|32	|463.822	|6.274	|us/op	|25.90%	|
> |1024	|48	|21	|457.674	|15.144	|us/op	|24.20%	|
> |1024	|64	|16	|477.694	|0.986	|us/op	|29.70%	|
> |1024	|80	|13	|484.901	|32.601	|us/op	|31.60%	|
> |1024	|96	|11	|480.8	|27.088	|us/op	|30.50%	|
> |1024	|112	|9	|474.416	|10.053	|us/op	|28.80%	|
> 
> - AArch64 Neoverse N1
> 
> |activeMethodCount	|groupCount	|Methods/Group	|Score	|Error	|Units	|Diff	|
> |---	|---	|---	|---	|---	|---	|---	|
> |128	|1	|128	|25.297	|0.792	|us/op	|	|
> |128	|32	|4	|31.451...

@dean-long 

> On Neoverse, what's the size of a region

I didn't find anything about this in the Neoverse docs. However, in the Arm Neoverse N1 Software Optimization Guide, section 4.9 "Branch instruction alignment", I found:

> Branch instruction and branch target instruction alignment and density can affect performance.
> For best-case performance, consider the following guidelines.
> ...
> - When possible, a branch and its target should be located within the same 2M aligned memory region.
>
> Consider aligning subroutine entry points and branch targets to 32B boundaries, within the bounds of the code-density requirements of the program.

This is where I got the idea for the 2M region size used in the benchmark.
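
To illustrate what "within the same 2M aligned memory region" means, here is a minimal sketch. It is not code from the benchmark: the class name `RegionCheck`, its methods, and the sample addresses are hypothetical. It only assumes the 2M (2097152 bytes) region size from the results above.

```java
public class RegionCheck {
    // Assumed region size: 2M (2097152 bytes), a power of two.
    static final long REGION_SIZE = 2L * 1024 * 1024;

    // Base address of the size-aligned region containing the given address.
    static long regionBase(long address) {
        return address & ~(REGION_SIZE - 1);
    }

    // True if a branch and its target fall into the same 2M-aligned region.
    static boolean sameRegion(long branch, long target) {
        return regionBase(branch) == regionBase(target);
    }

    public static void main(String[] args) {
        long branch = 0x7f00001ffff0L; // hypothetical address near the end of a region
        long near   = 0x7f0000100000L; // about 1M away, same region
        long far    = 0x7f0000200010L; // only 32 bytes away, but in the next region
        System.out.println(sameRegion(branch, near));
        System.out.println(sameRegion(branch, far));
    }
}
```

The point of the example is that distance alone does not decide this: what matters is whether the two addresses share the same 2M-aligned base.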

> why must it split the code into separate regions at all?

According to the Arm blog post, this is a front-end concern related to the branch predictor. So I guess it helps to predict branch targets.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23831#issuecomment-2711970096