RFC: AArch64: Set Segmented CodeCache default size to 127M
Astigeevich, Evgeny
eastig at amazon.co.uk
Mon Feb 21 16:49:32 UTC 2022
Hi Andrew,
Sorry for the late reply. It was half term time.
Thank you for your feedback.
> I have seen bug reports from customers mystified
> at poor OpenJDK performance which have turned out
> to be code cache thrashing.
I think we have a case of code cache thrashing. An application consumes
~90% of the code cache regardless of how much code cache it is given. If a
big code cache is given, the set of hot methods becomes sparse. The size of
the set is less than 32M. We have a few ideas to solve the thrashing.
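The thrashing mode can be illustrated with a toy LRU model (purely illustrative, hypothetical 1 MB method granularity, not HotSpot's actual sweeper policy): as long as the hot set fits in the cache there are no evictions, but once it exceeds the cache, every pass over the hot set evicts methods that are about to be needed again.

```python
from collections import OrderedDict

def churn(hot_methods_mb, cache_mb):
    """Count evictions when repeatedly touching a hot set of methods
    under an LRU-evicted cache (1 MB granularity, illustrative only)."""
    cache = OrderedDict()
    evictions = 0
    for _ in range(2):                  # two passes over the hot set
        for m in range(hot_methods_mb):
            if m in cache:
                cache.move_to_end(m)    # hit: refresh LRU position
            else:
                if len(cache) >= cache_mb:
                    cache.popitem(last=False)  # evict least recently used
                    evictions += 1
                cache[m] = True
    return evictions

# Hot set fits: no churn.     churn(32, 64)  -> 0
# Hot set exceeds the cache:  churn(128, 64) -> every access misses
```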
> I'd like to see more information. What was the *average performance
> gain* of all your benchmarks?
Full DaCapo results ('-' means the benchmark's time decreased, '+' means it increased):
+------------+-------------+------------------+-----------------+
| Bench | New vs Base | COV base results | COV new results |
+------------+-------------+------------------+-----------------+
| tradebeans | -9.10% | 11.99% | 3.41% |
| eclipse | -3.57% | 1.04% | 0.91% |
| tradesoap | -3.03% | 0.86% | 0.46% |
| tomcat | -1.45% | 0.99% | 0.86% |
| pmd | -1.05% | 0.62% | 0.87% |
| lusearch | -0.81% | 0.29% | 0.39% |
| zxing | -0.46% | 1.28% | 0.82% |
| biojava | -0.04% | 0.18% | 0.19% |
| jme | 0.01% | 0.01% | 0.01% |
| batik | 0.08% | 0.40% | 0.45% |
| luindex | 0.42% | 0.56% | 0.70% |
| fop | 0.58% | 1.18% | 1.09% |
| avrora | 0.72% | 2.05% | 1.45% |
| xalan | 0.82% | 2.82% | 3.63% |
| sunflow | 4.57% | 10.86% | 10.84% |
+------------+-------------+------------------+-----------------+
Each benchmark was run 10 times, with 10 iterations per run. The result of the 10th iteration was used.
Renaissance results ('-' means the benchmark's time decreased, '+' means it increased):
+------------------+--------------+-------------------+-----------------+
| Bench | New vs Base | COV base results | COV new results |
+------------------+--------------+-------------------+-----------------+
| scrabble | -13.47% | 7.01% | 7.43% |
| dotty | -9.03% | 1.77% | 1.82% |
| naive-bayes | -4.14% | 9.72% | 8.94% |
| finagle-http | -3.93% | 0.95% | 0.83% |
| finagle-chirper | -2.75% | 2.45% | 3.09% |
| movie-lens | -1.79% | 1.39% | 1.12% |
| scala-doku | -1.72% | 27.20% | 29.54% |
| als | -1.64% | 0.64% | 1.24% |
| par-mnemonics | -1.09% | 11.69% | 11.39% |
| rx-scrabble | -0.98% | 1.36% | 0.36% |
| future-genetic | -0.95% | 1.14% | 2.06% |
| log-regression | -0.86% | 0.99% | 1.62% |
| dec-tree | -0.74% | 1.52% | 1.69% |
| chi-square | -0.51% | 1.20% | 0.85% |
| mnemonics | -0.05% | 0.74% | 0.75% |
| fj-kmeans | 0.01% | 0.95% | 0.90% |
| page-rank | 0.06% | 1.02% | 0.80% |
| scala-stm-bench7 | 0.16% | 6.90% | 7.43% |
| reactors | 0.97% | 28.07% | 12.42% |
| scala-kmeans | 1.22% | 0.88% | 0.39% |
| gauss-mix | 1.70% | 1.83% | 1.42% |
| akka-uct | 4.30% | 5.20% | 9.94% |
| philosophers | 12.64% | 18.43% | 17.64% |
+------------------+--------------+-------------------+-----------------+
Each benchmark was run 10 times, 180 seconds per run. The second half of each run's results was used.
I created https://bugs.openjdk.java.net/browse/JDK-8280872 "AArch64: Position non-nmethod segment in between profiled and non-profiled segments for 128M+ CodeCache".
It should reduce the number of trampolines.
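Why positioning helps, as a back-of-the-envelope sketch (segment offsets below are hypothetical, not the actual HotSpot layout): an AArch64 `bl` reaches ±128 MB, so a call only needs a trampoline when its target is out of that range. Stubs in the non-nmethod segment are called from both nmethod segments, so placing it between them roughly halves the worst-case distance to it.

```python
BL_RANGE_MB = 128  # AArch64 bl/b immediate branch reach: +/-128 MB

def needs_trampoline(caller_mb, callee_mb):
    """A direct call needs a trampoline when the target lies outside
    the +/-128 MB immediate branch range (offsets in MB)."""
    return abs(callee_mb - caller_mb) > BL_RANGE_MB

# Edge layout: non-nmethod stubs at offset 0, caller at 200 MB -> trampoline.
# Middle layout: stubs at 120 MB, same caller -> direct call suffices.
```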
There are also:
https://bugs.openjdk.java.net/browse/JDK-8280152 "AArch64: Duplicated trampolines in C2 NMethod Stub Code section"
https://bugs.openjdk.java.net/browse/JDK-8280481 "Duplicated static stubs in NMethod Stub Code section"
Implementing them will improve code cache usage, but they won't fix the code cache thrashing.
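The deduplication idea behind the two duplicated-stub issues above can be sketched as follows (a Python model, not HotSpot code; names are illustrative): allocate at most one trampoline per distinct call target in a stub section, instead of one per call site.

```python
def dedup_trampolines(call_sites):
    """Allocate trampoline slots for a stub section, emitting at most
    one trampoline per distinct call target rather than one per call
    site. Returns (slots, saved): slots maps target -> slot index,
    saved is the number of duplicate trampolines avoided."""
    slots = {}
    for target in call_sites:
        if target not in slots:
            slots[target] = len(slots)  # next free slot index
    return slots, len(call_sites) - len(slots)

# Three calls to 0x100 and one to 0x200 need only two trampolines.
```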
Thanks,
Evgeny
On 11/02/2022, 16:30, "hotspot-dev on behalf of Andrew Haley" <hotspot-dev-retn at openjdk.java.net on behalf of aph-open at littlepinkcloud.com> wrote:
On 2/10/22 23:02, Astigeevich, Evgeny wrote:
> We’d like to discuss a proposal for setting TieredCompilation Segmented CodeCache default size to 127M on AArch64 (https://bugs.openjdk.java.net/browse/JDK-8280150).
I don't think so, at least not without a lot more information.
This would halve the size of the code cache, potentially causing
severe regressions in production. I have seen bug reports from
customers mystified at poor OpenJDK performance which have turned out
to be code cache thrashing. This is very hard to diagnose without
making some inspired guesses at what the root cause may be. We'd be
moving the threshold for cache exhaustion much closer to our default
configuration.
So, this is a trade off between a small expected gain and a much
larger (but hopefully rare) loss.
I'd like to see more information. What was the *average performance
gain* of all your benchmarks? I don't think anyone is interested in
cherry-picked best cases.
A quick back-of-the-envelope calculation tells me that about 3.5% of
the code cache is occupied by trampolines and the extra bytes used by
far calls. However, many of the far calls are never needed; I don't
have stats for that, but I'd guess about half of them. But given the
(plausible ?) assumption that the dynamic frequency of calls is the
same as the static frequency, I wouldn't be surprised if the cost of
trampoline calls is about 2% of the total instruction count, so it'd
be nice to be rid of them if there were no cost; but there is a cost.
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
Amazon Development Centre (London) Ltd. Registered in England and Wales with registration number 04543232 with its registered office at 1 Principal Place, Worship Street, London EC2A 2FA, United Kingdom.