RFR(M): 8231460: Performance issue (CodeHeap) with large free blocks

Schmidt, Lutz lutz.schmidt at sap.com
Tue Nov 5 14:54:25 UTC 2019


Hi Thomas,

thank you very much for your elaborate comments and suggestions! Please find my comments inline below.

Regards,
Lutz


On 04.11.19, 18:02, "Thomas Stüfe" <thomas.stuefe at gmail.com> wrote:

    Hi Lutz,
    
    I like this patch and have a number of remarks:
    
    Segment map splicing:
    
    - If the last byte of the leading block happens to contain 0xFD, you do not need to increment the fragmentation counter after splicing since the leading hop would be fully formed.

R: You are right. I'm not sure if this extra check will pay off performance-wise, though.
    
    - If patch complexity is a concern, I would maybe leave out the segmap_template. That would simplify the coding a bit and I am not sure how much this actually brings. I really am curious. Is it better to let the compiler store immediates or to memcpy the template? Would the latter not mean I pay for the loads too?

R: Introducing the segmap_template was my first attempt to boost performance. Improvement was visible, but of course in no way comparable with the effect we see now. Unfortunately, I did not archive the numbers. The improvement over a byte loop depends on the compiler, the memcpy implementation, and CPU capabilities. On s390x, for example, there exists special hardware support for mem->mem moves of up to 256 bytes. 
    
    - In ::allocate(), I like that you re-use block. This is just bikeshedding, but you also could reshape this function a tiny bit for a clean single exit at the end of the function.

R: Where to exit a function is personal taste. Or is there an OpenJDK coding guideline for that? In this particular case, an additional local variable would be needed to hold the function result (block->allocated_space() or NULL). I do not really like that.
    
    - Comment nits:
    
    - * code block (e.g. nmethod) when only a pointer to somewhere inside the
    + * code block (e.g. nmethod) when only a pointer to a location inside the
    
    - *  - Segment addresses should be aligned to be multiples of CodeCacheSegmentSize.
    + *  - Segment start addresses should be aligned to be multiples of CodeCacheSegmentSize.
    
    - *  - Allocation in the code cache can only happen at segment boundaries.
    + *  - Allocation in the code cache can only happen at segment start addresses.

R: agreed. Will be in next webrev.
    
    - I would like comments for the newly added functions in heap.hpp, not in the cpp file, because my IDE expands those comments from the hpp. But that's just me :=)

R: I prefer interface documentation in *.hpp (where you look up what's available) and functional description in *.cpp (where the code is).
    
    - block_at() could be implemented now using address_for and a cast

R: I tried to keep the diff in check. 
    
    - fragmentation_limit, freelist_limit: I thought using enums for constants is not done anymore, and the preferred way is to use static const. See e.g. markWord.hpp.

R: If static const is the way to go, I'll go that way. 
    
    - +  // Boundaries of committed space.
      +  // Boundaries of reserved space.
       Thanks for the comments. Out of curiosity, can the lower part be uncommitted? Or would low() == low_boundary() always be true?

R: The interface distinguishes low() and low_boundary(). I did not dive into VirtualSpace to find out if the two values are always the same. Instead of relying on implementation details, I prefer to adhere to the API.
    
    - +  bool contains(const void* p) const             { return low() <= p && p < high(); }
      Is comparing with void* legal C++?

R: None of our various compilers complained...
    
    - CodeHeap::merge_right():
        - Nits, I would maybe rename "follower" to "beg_follower", or, alternatively, "beg" to "leader".
        - I wonder whether we can have a "wipe header function" which only wipes the FreeBlock header of the follower, instead of wiping the whole segment.

R: If naming is to be changed, I'd rather do something like
     size_t end_ix = segment_for(a) + a->length();
    
    - CodeHeap::add_to_freelist():
       I am astonished that this brings such a performance increase. I would naively have thought the deallocation to be pretty random, and therefore this technique to have an improvement of factor 2 in general, but we also need to find the start of the last_insert_free_block which may take some hops, and then, the block might not even be free...

R: Well, it all depends on the deallocation pattern. If you deallocate with ascending block addresses, the optimization is perfect. If you deallocate in the other direction, the optimization is not so good. But it will help as long as the deallocated block is at a higher address than the remembered free block.
    
    --
    
    General remarks, not necessarily for your patch:
    
    - This code could really gain readability if the segment id had its own type, e.g. "typedef int segid_t". APIs like "mark_segmap_as_used(segid_t beg, segid_t end)" are immediately clearer than when size_t is used.

R: Same as above: wanted to keep the diff in check.
    
    - In CodeHeap::verify(), do we actually check that the segment map is correctly formed? That from every point in the map, we reach the correct start of the associated code blob? I could not find this but I may be blind.

R: I had that, but it was way too expensive, even for a fastdebug build. Check webrev version *.00

    Thanks, Thomas
    
    
    On Thu, Oct 31, 2019 at 5:55 PM Schmidt, Lutz <lutz.schmidt at sap.com> wrote:
    Hi Andrew, (and hi to the interested crowd),
    
    Please accept my apologies for taking so long to get back.
    
    These tests (OverflowCodeCacheTest and StressCodeCacheTest) were causing me quite some headaches. Some layer between me and the test prevents the VM (in particular: the VMThread) from terminating normally. The final output from my time measurements is therefore not generated, or is thrown away. On top of that, there were periods of test machine unavailability and a bug in my measurement code that caused crashes.
    
    Anyway, I added some on-the-fly output, printing the timer values after 10k measurement intervals. This reveals some interesting, additional facts about the tests and the CodeHeap management methods. For detailed numbers, refer to the files attached to the bug (https://bugs.openjdk.java.net/browse/JDK-8231460). For even more detail, I can provide the jtr files on request. 
    
    
    OverflowCodeCacheTest
    =====================
    This test runs (in my setup) with a 1GB CodeCache.
    
    For this test, CodeHeap::mark_segmap_as_used() is THE performance hog. 40% of all calls have to mark more than 16k segment map entries (in the unoptimized case). Basically all of these calls convert to len=1 calls with the optimization turned on. Note that during FreeBlock joining, the segment count is forced to 1 (one). No wonder the time spent in CodeHeap::mark_segmap_as_used() collapses from >80sec (half of the test runtime) to <100msec.
    
    CodeHeap::add_to_freelist(), on the other hand, is almost not observable. The average free list length is about two elements, making even linear search really quick. 
    
    
    StressCodeCacheTest
    ===================
    With a 1GB CodeCache, this test runs into a 12 min timeout, set by our internal test environment. Scaling back to 300MB prevents the test from timing out.
    
    For this test, CodeHeap::mark_segmap_as_used() is not a factor. Of 200,000 calls, only a few (less than 3%) had to process a block consisting of more than 16 segments. Note that during FreeBlock joining, the segment count is forced to 1 (one). 
    
    Another method is popping up as performance hog instead: CodeHeap::add_to_freelist(). More than 8 out of 40 seconds of test runtime (before optimization) are spent in this method, for just 160,000 calls. The test seems to create a long list of non-contiguous free blocks (around 5,500 on average). This list is linearly scanned to find the insert point for the free block at hand.
    
    Suffering as well from the long free block list is CodeHeap::search_freelist(). It uses another 2.7 seconds for 270,000 calls.  
    
    
    SPECjvm2008 suite
    =================
    With respect to the task at hand, this is a well-behaved test suite. Timing shows some before/after difference, but nothing spectacular. The measurements do not provide evidence of a performance bottleneck. 
    
    
    There were some minor adjustments to the code. Unused code blocks have been removed as well. I have therefore created a new webrev. You can find it here:
       http://cr.openjdk.java.net/~lucy/webrevs/8231460.01/ 
    
    Thanks for investing your time!
    Lutz
    
    
    On 21.10.19, 15:06, "Andrew Dinn" <adinn at redhat.com> wrote:
    
        Hi Lutz,
    
        On 21/10/2019 13:37, Schmidt, Lutz wrote:
        > I understand what you are interested in. And I was hoping to be able
        > to provide some (first) numbers by today. Unfortunately, the
        > measurement code I activated last Friday was buggy and blew most of
        > the tests I had hoped to run over the weekend.
        > 
        > I will take your modified test and run it with and without my
        > optimization. In parallel, I will try to generate some (non-random)
        > numbers for other tests.
        > 
        > I'll be back as soon as I have results.
    
        Thanks for trying the test and also for deriving some call stats from a
        real example. I'm keen to see how much your patch improves things.
    
        regards,
    
    
        Andrew Dinn
        -----------
        Senior Principal Software Engineer
        Red Hat UK Ltd
        Registered in England and Wales under Company Registration No. 03798903
        Directors: Michael Cunningham, Michael ("Mike") O'Neill
    
    
    
    



More information about the hotspot-compiler-dev mailing list