RFC: improving NMethod code locality in CodeCache

Thu Dec 2 17:16:26 UTC 2021

Hi Tobias,

Thank you for your comments and references.

> Is it really a problem with branch prediction or more with instruction caching?

This is a problem with dynamic branch prediction.
For example, to improve branch prediction on Graviton2 we recommend disabling tiered compilation and restricting the size of the code cache (see https://github.com/aws/aws-graviton-getting-started/blob/main/java.md). This has shown large (1.5x) improvements in some Java workloads.

> (Re-)moving the metadata will improve locality but does that really have an effect on branch prediction?

Code placement also affects the BTB (branch target buffer), which is part of dynamic branch prediction.
See:
https://arxiv.org/pdf/1804.00261.pdf "A Survey of Techniques for Dynamic Branch Prediction"
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.2910&rep=rep1&type=pdf "The Effect of Code Reordering on Branch Prediction"

> Did you gather some numbers via hardware performance counters (iCache, ITLB, branch prediction misses)?

I don't have hardware performance counter numbers for the DaCapo and Renaissance benchmarks.
We found the branch prediction problem with the CodeCache in our Java workloads by analysing hardware performance counters and creating a dynamic map of CodeCache hot regions.
80% - 90% of the hot code was in C2 nmethods. The map of hot regions showed that the code of those C2 nmethods was sparse.
The analysis also showed that nmethods had a high ratio of non-executable data (aka metadata) to code. The DaCapo and Renaissance benchmarks confirmed this was not specific to our Java workloads.
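
A minimal sketch of what a hot-region map could look like. It is an illustration only, not the tooling used for the analysis above: the function names, the 4 KB region size, the base address and the sample addresses are all hypothetical.

```python
from collections import Counter

def hot_region_map(samples, base, region_size):
    """Count sampled program-counter addresses per fixed-size region."""
    return Counter((pc - base) // region_size for pc in samples)

def regions_covering(region_counts, fraction):
    """Smallest number of regions (hottest first) holding `fraction` of samples."""
    total = sum(region_counts.values())
    covered = 0
    for n, (_, count) in enumerate(region_counts.most_common(), start=1):
        covered += count
        if covered >= fraction * total:
            return n
    return 0

# Hypothetical data: 10 PC samples falling into three widely spaced regions.
base = 0x10000000
samples = [base + 0x10] * 5 + [base + 0x5000] * 3 + [base + 0x9008] * 2
counts = hot_region_map(samples, base, region_size=4096)

hottest = regions_covering(counts, 0.8)         # regions holding 80% of samples
spread = max(counts) - min(counts) + 1          # regions between first and last hot one
print(dict(counts), hottest, spread)
```

In this made-up data only three regions are hot, but they are spread over ten regions of address space, which is the kind of sparseness described above.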

> There is lots of code that depends on the current layout and we would need to make all of that dependent on a flag.

Yes, this is one of the challenges, maybe the biggest one.

> > 2. Where to put:
> >     a. Different segments for code and nmethod data. This will require updating NMethod because it uses code_offset, stub_offset from header_begin.
> >     b. The same segment but in a different part (e.g., code grows from lower addresses upwards and metadata from high addresses downwards). This might allow keeping NMethod using code_offset, stub_offset.
> >     c.  Or in a completely different place (C-heap, Metaspace,...)
>
>  It depends on what we want to improve: (i) Code locality in the same nmethod or (ii) code locality
>  between different nmethods.

We want to improve code locality between different nmethods. For example, suppose the CodeCache holds nmethods in this order: A, B, C, D, E, F, G, and the two hottest call chains are A->E->G and C->G. We could relocate A, C, E and G next to each other to help the BTB: A, C, E, G, B, D, F. The problem is that we would have to do this every time the hottest call chains/graphs change. With the metadata removed, the code of all nmethods A - G can be covered by the BTB, so no relocations are needed. In theory, at some point (e.g. 50,000 or more C2 nmethods) this will stop working and we will need to do relocations.
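
The A - G example can be sketched numerically. This is a simplified model under stated assumptions: every nmethod is treated as one contiguous block, and the sizes (400 bytes of code plus 600 bytes of metadata each) are hypothetical, not measured.

```python
def hot_span(order, sizes, hot):
    """Bytes between the first and last hot byte when blocks are laid out
    back to back in `order` (each block treated as one contiguous unit)."""
    addr, starts = 0, {}
    for name in order:
        starts[name] = addr
        addr += sizes[name]
    lo = min(starts[n] for n in hot)
    hi = max(starts[n] + sizes[n] for n in hot)
    return hi - lo

# Hypothetical sizes: 400 bytes of code plus 600 bytes of metadata per nmethod.
names = "ABCDEFG"
full = {n: 1000 for n in names}       # code and metadata kept together
code_only = {n: 400 for n in names}   # metadata moved elsewhere
hot = {"A", "C", "E", "G"}            # nmethods on the chains A->E->G and C->G

current = hot_span("ABCDEFG", full, hot)         # today's layout
relocated = hot_span("ACEGBDF", full, hot)       # hot nmethods moved together
separated = hot_span("ABCDEFG", code_only, hot)  # no relocation, code only
print(current, relocated, separated)
```

With these made-up sizes the hot span shrinks from 7000 bytes to 2800 bytes without moving any nmethod, compared with 4000 bytes after relocation. The real effect depends on the actual ratio of code to metadata.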

BTW, I tried to relocate an NMethod. It is not easy to implement because not being relocated is a part of the NMethod design. Once an nmethod gets into the CodeCache, it stays at the same place until it dies.

>  Solution b) would only improve code locality in the same nmethod but the overall layout of
>  executable code in the code cache would still be sparse.

Why would b) not improve code locality of all methods? If methods use 64 MB, we will have 32 MB of metadata at the beginning and 32 MB of code at the end of the segment. The code of the methods would be close to each other.
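
A minimal sketch of the two-ended segment from option 2.b, to show why the code ends up contiguous. The class name, the allocation sizes and the even code/metadata split are hypothetical. The sketch grows code upwards from the low end and metadata downwards from the high end, as in the option text; the mirrored arrangement behaves the same way.

```python
KB = 1024
MB = 1024 * KB

class TwoEndedSegment:
    """Toy allocator: code grows up from address 0, metadata grows down from the top."""

    def __init__(self, size):
        self.size = size
        self.code_top = 0        # next free byte for code
        self.meta_bottom = size  # lowest byte currently used by metadata

    def allocate(self, code_size, meta_size):
        """Reserve space for one nmethod; return (code_addr, meta_addr)."""
        if self.code_top + code_size > self.meta_bottom - meta_size:
            raise MemoryError("segment full")
        code_addr = self.code_top
        self.code_top += code_size
        self.meta_bottom -= meta_size
        return code_addr, self.meta_bottom

# Hypothetical workload: 1024 nmethods, each with 32 KB of code and 32 KB of metadata.
seg = TwoEndedSegment(64 * MB)
placements = [seg.allocate(32 * KB, 32 * KB) for _ in range(1024)]

# All code sits in [0, code_top) with no metadata in between.
print(seg.code_top // MB, seg.meta_bottom // MB)
```

In this toy model the code of all nmethods occupies one contiguous 32 MB range, regardless of how many nmethods are allocated, until the two ends meet.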

>  I think c) would be the ideal solution: The code cache would only contain executable code and all
>  the metadata would be somewhere else. But solution a) would lead to the same layout and might be
>  easier to implement.

I think b) might be easier to implement. I need to do some low-level design work to estimate the complexity.

I've opened a review of a design document: https://github.com/eastig/codecache/pull/1.
I'll keep updating it in line with the email discussion.

Thanks,
Evgeny

On 29/11/2021, 09:02, "Tobias Hartmann" <tobias.hartmann at oracle.com> wrote:

    Hi Evgeny,

    Thanks for sharing these results and starting the discussion.

    Some comments below.

    On 23.11.21 18:34, Astigeevich, Evgeny wrote:
    > We have cases where the CodeCache contains more than 15,000 compiled methods. In these cases, we saw a negative performance effect. The hot executable code is not contiguous, so branch prediction hardware can become overloaded.

    Is it really a problem with branch prediction or more with instruction caching? With the current
    implementation, the hot instructions of a single nmethod are already contiguous but different
    nmethods might be located far away (and there's lots of metadata in-between). (Re-)moving the
    metadata will improve locality but does that really have an effect on branch prediction?

    Did you gather some numbers via hardware performance counters (iCache, ITLB, branch prediction misses)?

    > The data show that due to intervening non-executable data in NMethods, executable code is sparse in the CodeCache. The data also show the most contributors of non-executable data are the header and scopes sections. Arm64 vs x86_64 looks consistent except the stub code. On arm64 the size of the stub code is 4-5 times bigger.
    >
    > We’d like to have an option to configure the CodeCache to support C2 nmethods with separated executable code and non-executable data.

    It would definitely be nice to have this as an option (rather than replacing the current
    implementation) but I wonder how feasible it is. There is lots of code that depends on the current
    layout and we would need to make all of that dependent on a flag.

    > According to the fixed JDK-8152664 (https://bugs.openjdk.java.net/browse/JDK-8152664) “Support non-continuous CodeBlobs in HotSpot”, NMethod sections can be located in different places of memory. The discussion of it: https://mail.openjdk.java.net/pipermail/hotspot-dev/2016-April/022500.html. Separating code will complicate maintenance of the CodeCache. Different parts of memory for a nmethod need to be allocated/released.

    Ever since I finished the implementation of the Segmented Code Cache
    (https://openjdk.java.net/jeps/197), I wanted to work on this but never got to it. I think that the
    additional complexity in the code cache is worth it but of course that has to be proven by a
    performance evaluation.

    For reference, here's my old thesis and the paper we published back then:
    http://cr.openjdk.java.net/~thartmann/papers/2014-Code_Cache_Optimizations-thesis.pdf
    http://cr.openjdk.java.net/~thartmann/papers/2014-PPPJ-Efficient_Code_Cache_Management.pdf

    > There is JDK-7072317 “move metadata from CodeCache” (https://bugs.openjdk.java.net/browse/JDK-7072317) which the implementation works can be done under.

    Yes, that makes sense.

    > There can be different approaches for the implementation:
    >
    > 1. What to separate:
    >     a. All code (main plus stub) from other sections.
    >     b. Or only main code because this is the code where an application should spend most of the time.
    >     c. Or the header and scope sections.

    I would say that from a performance perspective, only the main code matters because the stubs are
    used for slow paths. If it simplifies prototyping, I would go with b) first.

    > 2. Where to put:
    >     a. Different segments for code and nmethod data. This will require updating NMethod because it uses code_offset, stub_offset from header_begin.
    >     b. The same segment but in a different part (e.g., code grows from lower addresses upwards and metadata from high addresses downwards). This might allow keeping NMethod using code_offset, stub_offset.
    >     c.  Or in a completely different place (C-heap, Metaspace,...)

    It depends on what we want to improve: (i) Code locality in the same nmethod or (ii) code locality
    between different nmethods.

    Solution b) would only improve code locality in the same nmethod but the overall layout of
    executable code in the code cache would still be sparse.

    I think c) would be the ideal solution: The code cache would only contain executable code and all
    the metadata would be somewhere else. But solution a) would lead to the same layout and might be
    easier to implement.

    > It needs to be investigated if the separation of sections which are frequently accessed during the normal execution of the code (e.g., oop section) affects the performance negatively. We might need to change NMethodSweeper to preserve the code locality property.

    Yes, that is a concern. A thorough performance evaluation is required.

    > We would like to get feedback on the above approaches (or something different) before implementing JDK-7072317.

    Hope that helps. I'm curious what others think.

    Best regards,
    Tobias
