RFC: improving NMethod code locality in CodeCache

Sun Dec 26 18:58:10 UTC 2021

I've filed a draft "umbrella" JEP that mostly discusses periodically compacting the code cache, but mentions this work as a preliminary project. See

https://bugs.openjdk.java.net/browse/JDK-8279184: Instruction Issue Cache Hardware Accommodation

Thanks,
Paul

-----Original Message-----
From: hotspot-dev <hotspot-dev-retn at openjdk.java.net> on behalf of "Astigeevich, Evgeny" <eastig at amazon.co.uk>
Date: Thursday, December 2, 2021 at 2:08 PM
To: "Schmidt, Lutz" <lutz.schmidt at sap.com>, "hotspot-dev at openjdk.java.net" <hotspot-dev at openjdk.java.net>
Cc: Tobias Hartmann <tobias.hartmann at oracle.com>
Subject: Re: RFC: improving NMethod code locality in CodeCache

Hi Lutz,

Thank you for your comments.
From the data I have, the NMethod constants sections do not take a lot of space: in total they are below 0.6%. This is similar for both x86_64 and arm64, and I guess it should be similar for other architectures.

A decision to move stub code out depends on the stub code contribution. For x86_64 it is below 2%, so it might be kept with the main code. On arm64 it is currently up to 12%. I found a couple of issues in the generated stub code. Resolving them should reduce the arm64 stub code size, but I am not sure it would be possible to get arm64 stub code below 2%.

> When considering performance, it is beneficial to have data which is being patched (frequently) separated from the instruction stream.

None of the CPUs I have worked with like self-modifying code or mixing code with modifiable data. An exception is literal pools holding constants embedded into code.
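
A minimal sketch of the kind of check this implies: given an instruction range and a patchable data range, test whether they touch a common cache line. The class name, the half-open address ranges, and the 64-byte line size are assumptions made for illustration; nothing here is taken from HotSpot.

```java
final class CacheLineOverlap {
    /** Index of the cache line that contains the given byte address. */
    static long lineOf(long address, int lineSize) {
        return Math.floorDiv(address, (long) lineSize);
    }

    /**
     * Returns true if the half-open ranges [codeStart, codeEnd) and
     * [dataStart, dataEnd) touch at least one common cache line.
     */
    static boolean sharesCacheLine(long codeStart, long codeEnd,
                                   long dataStart, long dataEnd, int lineSize) {
        if (lineSize <= 0 || codeStart >= codeEnd || dataStart >= dataEnd) {
            throw new IllegalArgumentException("ranges must be non-empty and lineSize positive");
        }
        long codeFirst = lineOf(codeStart, lineSize);
        long codeLast = lineOf(codeEnd - 1, lineSize);
        long dataFirst = lineOf(dataStart, lineSize);
        long dataLast = lineOf(dataEnd - 1, lineSize);
        // Two intervals of line indices overlap if each starts before the other ends.
        return codeFirst <= dataLast && dataFirst <= codeLast;
    }

    public static void main(String[] args) {
        int lineSize = 64; // assumed line size; real hardware differs per architecture
        // Data placed directly behind 100 bytes of code lands in the code's last line.
        System.out.println(sharesCacheLine(0, 100, 100, 120, lineSize)); // true
        // Data aligned to the next free line does not share a line with the code.
        System.out.println(sharesCacheLine(0, 100, 128, 160, lineSize)); // false
    }
}
```

The second call shows the usual remedy: padding the data start to a line boundary keeps patched data out of lines that instructions are fetched from.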

> Instruction stream compactness may have an influence if the prediction engine not only remembers the branch direction, but the (limited length) distance as well.

In "A Survey of Techniques for Dynamic Branch Prediction" (https://arxiv.org/pdf/1804.00261.pdf) I found that a distance between branches can be taken into account. I've seen this.

Thanks,
Evgeny

On 29/11/2021, 12:20, "Schmidt, Lutz" <lutz.schmidt at sap.com> wrote:

    Hi,

    a few thoughts immediately popped up when reading Evgeny's RFC and Tobias' comments. If my comments seem influenced by s390x - that might well be. It's the architecture I know best.

     - The biggest concern I have relates to pc-relative addressing.
        o nmethod constants are currently located next to the instruction section.
          Putting them into a separately allocated area may break the pc-relative limit.
          s390x limit: +/- 4GB, no fallback implemented.
        o relative branches either are
           + short distance, mostly intra-nmethod
           + long distance, mostly inter-nmethod
           + not possible in general, e.g., runtime calls
          The branch optimization (in shorten_branches) might be possible less often.
          One example would be if stub code is moved to a separately allocated area.
     - When considering performance, it is beneficial to have data which is being
       patched (frequently) separated from the instruction stream.
       s390x: never modify data in a cache line where instructions are fetched from.
       That will kill your performance big time.
     - I'm not a branch prediction expert. Instruction stream compactness may have an
       influence if the prediction engine not only remembers the branch direction, but
       the (limited length) distance as well.
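
The pc-relative concern in the first bullet can be sketched as a simple reach check. This assumes the +/- 4GB limit comes from a signed 32-bit displacement counted in 2-byte units, so byte offsets must be even and lie in [-2^32, 2^32 - 2]. The class name, constants, and addresses are hypothetical and only illustrate the arithmetic.

```java
final class PcRelativeReach {
    // Assumed reach of a pc-relative operand: signed 32-bit count of 2-byte units.
    static final long MIN_OFFSET = -(1L << 32);
    static final long MAX_OFFSET = (1L << 32) - 2;

    /** Returns true if target can be addressed relative to pc under the assumed limit. */
    static boolean reachable(long pc, long target) {
        long offset = Math.subtractExact(target, pc);
        // The offset must be a multiple of 2 bytes and within the signed range.
        return (offset & 1) == 0 && offset >= MIN_OFFSET && offset <= MAX_OFFSET;
    }

    public static void main(String[] args) {
        long pc = 0x1_0000_0000L; // hypothetical instruction address
        // Constants right next to the code are trivially in range.
        System.out.println(reachable(pc, pc + 0x1000));
        // Constants in an area allocated 4GB away are just out of range.
        System.out.println(reachable(pc, pc + (1L << 32)));
    }
}
```

If constants or stubs move to a separately allocated area, a check of this kind would have to hold for every pc-relative reference, or a fallback sequence would be needed.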

    Thanks,
    Lutz

    On 29.11.21, 10:03, "hotspot-dev on behalf of Tobias Hartmann" <hotspot-dev-retn at openjdk.java.net on behalf of tobias.hartmann at oracle.com> wrote:

        Hi Evgeny,

        Thanks for sharing these results and starting the discussion.

        Some comments below.

        On 23.11.21 18:34, Astigeevich, Evgeny wrote:
        > We have cases where the CodeCache contains more than 15,000 compiled methods. In these cases, we saw a negative performance effect. The hot executable code is not contiguous, so branch prediction hardware can become overloaded.

        Is it really a problem with branch prediction or more with instruction caching? With the current
        implementation, the hot instructions of a single nmethod are already contiguous but different
        nmethods might be located far away (and there's lots of metadata in-between). (Re-)moving the
        metadata will improve locality but does that really have an effect on branch prediction?

        Did you gather some numbers via hardware performance counters (iCache, ITLB, branch prediction misses)?

        > The data show that due to intervening non-executable data in NMethods, executable code is sparse in the CodeCache. The data also show that the largest contributors of non-executable data are the header and scopes sections. Arm64 and x86_64 look consistent except for the stub code. On arm64 the stub code is 4-5 times bigger.
        >
        > We’d like to have an option to configure the CodeCache to support C2 nmethods with separated executable code and non-executable data.

        It would definitely be nice to have this as an option (rather than replacing the current
        implementation) but I wonder how feasible it is. There is lots of code that depends on the current
        layout and we would need to make all of that dependent on a flag.

        > According to the fixed JDK-8152664 (https://bugs.openjdk.java.net/browse/JDK-8152664) “Support non-continuous CodeBlobs in HotSpot”, NMethod sections can be located in different places of memory. The discussion of it: https://mail.openjdk.java.net/pipermail/hotspot-dev/2016-April/022500.html. Separating code will complicate maintenance of the CodeCache. Different parts of memory for a nmethod need to be allocated/released.

        Ever since I finished the implementation of the Segmented Code Cache
        (https://openjdk.java.net/jeps/197), I wanted to work on this but never got to it. I think that the
        additional complexity in the code cache is worth it but of course that has to be proven by a
        performance evaluation.

        For reference, here's my old thesis and the paper we published back then:
        http://cr.openjdk.java.net/~thartmann/papers/2014-Code_Cache_Optimizations-thesis.pdf
        http://cr.openjdk.java.net/~thartmann/papers/2014-PPPJ-Efficient_Code_Cache_Management.pdf

        > There is JDK-7072317 “move metadata from CodeCache” (https://bugs.openjdk.java.net/browse/JDK-7072317) under which the implementation work can be done.

        Yes, that makes sense.

        > There can be different approaches for the implementation:
        >
        > 1. What to separate:
        >     a. All code (main plus stub) from other sections.
        >     b. Or only main code because this is the code where an application should spend most of the time.
        >     c. Or the header and scope sections.

        I would say that from a performance perspective, only the main code matters because the stubs are
        used for slow paths. If it simplifies prototyping, I would go with b) first.

        > 2. Where to put:
        >     a. Different segments for code and nmethod data. This will require updating NMethod because it uses code_offset, stub_offset from header_begin.
        >     b. The same segment but in a different part (e.g., code grows from lower addresses upwards and metadata from high addresses downwards). This might allow keeping NMethod using code_offset, stub_offset.
        >     c.  Or in a completely different place (C-heap, Metaspace,...)

        It depends on what we want to improve: (i) Code locality in the same nmethod or (ii) code locality
        between different nmethods.

        Solution b) would only improve code locality in the same nmethod but the overall layout of
        executable code in the code cache would still be sparse.

        I think c) would be the ideal solution: The code cache would only contain executable code and all
        the metadata would be somewhere else. But solution a) would lead to the same layout and might be
        easier to implement.
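
To picture option 2b as it is worded in the quoted proposal (code growing upwards from low addresses, metadata growing downwards from high addresses within one segment), here is a minimal sketch of a two-ended bump allocator. The class and method names are hypothetical, offsets stand in for addresses, and freeing, alignment, and compaction are left out entirely.

```java
final class TwoEndedSegment {
    private long codeTop = 0;   // next free offset for code, grows upwards
    private long dataBottom;    // lowest offset used by metadata, grows downwards

    TwoEndedSegment(long size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        this.dataBottom = size;
    }

    /** Reserves bytes for code at the low end; returns the start offset or -1 if full. */
    long allocateCode(long bytes) {
        if (bytes < 0 || bytes > dataBottom - codeTop) {
            return -1;
        }
        long start = codeTop;
        codeTop += bytes;
        return start;
    }

    /** Reserves bytes for metadata at the high end; returns the start offset or -1 if full. */
    long allocateData(long bytes) {
        if (bytes < 0 || bytes > dataBottom - codeTop) {
            return -1;
        }
        dataBottom -= bytes;
        return dataBottom;
    }

    /** Size of the contiguous region at the low end that holds only code. */
    long codeSpan() {
        return codeTop;
    }

    public static void main(String[] args) {
        TwoEndedSegment segment = new TwoEndedSegment(1024);
        System.out.println(segment.allocateCode(100)); // 0
        System.out.println(segment.allocateData(200)); // 824
        System.out.println(segment.allocateCode(50));  // 100
        System.out.println(segment.codeSpan());        // 150
    }
}
```

In this sketch the code of consecutive allocations stays adjacent at the low end while their metadata collects at the high end. How well that property survives once nmethods are freed and holes appear is exactly what a real implementation would have to address.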

        > It needs to be investigated if the separation of sections which are frequently accessed during the normal execution of the code (e.g., oop section) affects the performance negatively. We might need to change NMethodSweeper to preserve the code locality property.

        Yes, that is a concern. A thorough performance evaluation is required.

        > We would like to get feedback on the above approaches (or something different) before implementing JDK-7072317.

        Hope that helps. I'm curious what others think.

        Best regards,
        Tobias
