RFR(M): 8231460: Performance issue (CodeHeap) with large free blocks

Schmidt, Lutz lutz.schmidt at sap.com
Mon Nov 4 15:35:30 UTC 2019


Hi Andrew, 

thank you for your thoughts. I do not agree with your conclusion, though.

There are two bottlenecks in the CodeHeap management code. One is in CodeHeap::mark_segmap_as_used(), uncovered by OverflowCodeCacheTest.java. The other is in CodeHeap::add_to_freelist(), uncovered by StressCodeCacheTest.java.

Both bottlenecks are tackled by the proposed changeset.

CodeHeap::mark_segmap_as_used() is no longer O(n*n) for the critical "FreeBlock-join" case. It actually is O(1) now. The time reduction from > 80 seconds to just a few milliseconds is proof of that statement.
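
To illustrate why the join case can be O(1), here is a minimal sketch of a segment map. It is a simplified model, not the HotSpot code: the names mark_used, mark_joined, find_start and kMaxHop are hypothetical, and the real CodeHeap layout may differ in detail. Each map entry holds a hop distance back toward the block start, with 0 marking the start itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cap on the hop distance stored in one map entry.
constexpr std::uint8_t kMaxHop = 0xFE;

// Full marking: writes every entry in [beg, end), so the cost is O(end - beg).
// Entry 0 marks the block start; other entries hop back toward it.
void mark_used(std::vector<std::uint8_t>& map, std::size_t beg, std::size_t end) {
    std::uint8_t hop = 0;
    for (std::size_t i = beg; i < end; ++i) {
        map[i] = hop;
        hop = (hop == kMaxHop) ? std::uint8_t{1} : static_cast<std::uint8_t>(hop + 1);
    }
}

// Join marking (assumed simplification): the block starting at 'second' is
// appended to the block directly to its left. Only one entry is rewritten,
// so the cost is O(1) regardless of block size.
void mark_joined(std::vector<std::uint8_t>& map, std::size_t second) {
    map[second] = 1;  // hop one segment left, into the preceding block
}

// Follow the hop chain until an entry of 0 (the block start) is reached.
std::size_t find_start(const std::vector<std::uint8_t>& map, std::size_t i) {
    while (map[i] > 0) {
        i -= map[i];
    }
    return i;
}
```

After mark_joined(), a lookup from inside the appended block first hops to its old start, then one segment left into the preceding block, and from there to the combined block's start. The price for the cheap join is a longer hop chain on lookup.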

CodeHeap::add_to_freelist() is still O(n*n), with n being the free list length. But the kick-in point of the non-linearity could be significantly shifted towards larger n. The time reduction from approx. 8 seconds to 160 milliseconds supports this statement.
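
As an illustration of how the non-linearity can be pushed out without changing the worst case, consider an address-sorted, singly linked free list that remembers its last insert point. This is a sketch under assumptions, not the changeset itself: FreeNode, FreeList, hint and steps are hypothetical names, and block removal and merging are left out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a free block header.
struct FreeNode {
    std::size_t addr;
    FreeNode*   next;
    explicit FreeNode(std::size_t a) : addr(a), next(nullptr) {}
};

struct FreeList {
    FreeNode*   head  = nullptr;
    FreeNode*   hint  = nullptr;  // last insert point (assumed optimization)
    std::size_t steps = 0;        // nodes skipped while searching, for illustration

    // Insert b so that the list stays sorted by ascending address.
    void insert(FreeNode* b) {
        if (head == nullptr || b->addr < head->addr) {
            b->next = head;
            head = b;
            hint = b;
            return;
        }
        // Start at the hint when it lies before b; otherwise scan from the head.
        FreeNode* prev = (hint != nullptr && hint->addr < b->addr) ? hint : head;
        while (prev->next != nullptr && prev->next->addr < b->addr) {
            prev = prev->next;
            ++steps;
        }
        b->next = prev->next;
        prev->next = b;
        hint = b;
    }
};
```

When successive frees land close to each other in address order, the scan starts near the insert point and stays short. An unfavorable insert order still degrades to a full linear scan, so the overall effort remains O(n*n). A real allocator would also have to reset the hint whenever the block it points to is removed or merged.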

I agree it would be helpful to have a "real-world" example showing some improvement. Providing such evidence is hard, though. I could instrument the code and print some values from time to time. It's certain this additional output will mess up success/failure decisions in our test environment. Not sure everybody likes that. But I will give it a try and take the hits. This will be a multi-day effort.

On a general note, I am always uncomfortable knowing of an O(n*n) effort, in particular when it could be removed or at least tamed considerably. Experience tells me that, at some point in time, n will be large enough to hurt.

I'll be back.

Thanks, 
Lutz


On 04.11.19, 11:08, "Andrew Dinn" <adinn at redhat.com> wrote:

    Hi Lutz,
    
    I'll summarize my thoughts here rather than answer point by point.
    
    The patch successfully addresses the worst case performance but it seems
    to me extremely unlikely that we will see anything that approaches that
    case in real applications. So, that doesn't argue for pushing the patch.
    
    The patch does not seem to make a significant difference to the stress
    test. This test is also not necessarily 'representative' of real cases
    but it is much more likely to be so than the worst case test. That
    suggests to me that the current patch is perhaps not worth pursuing (it
    ain't really broke so ...).  Especially so given that it is not possible
    to distinguish any benefit when running the Spec benchmark apps. One
    could argue that the patch looks like it will do no harm and may do good
    in pathological  cases but that's not really good enough reason to make
    a change. We really need evidence that this is worth doing.
    
    The free list 'search bottleneck' certainly looks like a more promising
    problem to tackle than the 'merge problem'. However, once again this
    'problem' may just be an artefact of running this specific test rather
    than anything that might happen in real life.
    
    I think the only way to find out for sure whether the current patch or a
    patch that addresses the 'search bottleneck' is going to be beneficial
    is to instrument the JVM to record traces for code-cache use from real
    apps and then replay allocations/frees based on those traces to see what
    difference a patch makes and how much this might help the overall
    execution time.
    
    regards,
    
    
    Andrew Dinn
    -----------
    
    On 31/10/2019 16:55, Schmidt, Lutz wrote:
    > Hi Andrew, (and hi to the interested crowd),
    > 
    > Please accept my apologies for taking so long to get back.
    > 
    > These tests (OverflowCodeCacheTest and StressCodeCacheTest) were causing me quite some headaches. Some layer between me and the test prevents the vm (in particular: the VMThread) from terminating normally. The final output from my time measurements is therefore not generated or thrown away. Adding to that were some test machine unavailabilities and a bug in my measurement code, causing crashes. 
    > 
    > Anyway, I added some on-the-fly output, printing the timer values after 10k measurement intervals. This reveals some interesting, additional facts about the tests and the CodeHeap management methods. For detailed numbers, refer to the files attached to the bug (https://bugs.openjdk.java.net/browse/JDK-8231460). For even more detail, I can provide the jtr files on request. 
    > 
    > 
    > OverflowCodeCacheTest
    > =====================
    > This test runs (in my setup) with a 1GB CodeCache.
    > 
    > For this test, CodeHeap::mark_segmap_as_used() is THE performance hog. 40% of all calls have to mark more than 16k segment map entries (in the non-optimized case). Basically all of these calls convert to len=1 calls with the optimization turned on. Note that during FreeBlock joining, the segment count is forced to 1 (one). No wonder the time spent in CodeHeap::mark_segmap_as_used() collapses from >80sec (half of the test runtime) to <100msec.
    > 
    > CodeHeap::add_to_freelist() on the other hand, is almost not observable. Average free list length is at two elements, making even linear search really quick. 
    > 
    > 
    > StressCodeCacheTest
    > ===================
    > With a 1GB CodeCache, this test runs into a 12 min timeout, set by our internal test environment. Scaling back to 300MB prevents the test from timing out.
    > 
    > For this test, CodeHeap::mark_segmap_as_used() is not a factor. Of 200,000 calls, only a few (less than 3%) had to process a block consisting of more than 16 segments. Note that during FreeBlock joining, the segment count is forced to 1 (one).
    > 
    > Another method is popping up as performance hog instead: CodeHeap::add_to_freelist(). More than 8 out of 40 seconds of test runtime (before optimization) are spent in this method, for just 160,000 calls. The test seems to create a long list of non-contiguous free blocks (around 5,500 on average). This list is linearly scanned to find the insert point for the free block at hand.
    > 
    > Suffering as well from the long free block list is CodeHeap::search_freelist(). It uses another 2.7 seconds for 270,000 calls.  
    > 
    > 
    > SPECjvm2008 suite
    > =================
    > With respect to the task at hand, this is a well-behaved test suite. Timing shows some before/after difference, but nothing spectacular. The measurements do not provide evidence of a performance bottleneck.
    > 
    > 
    > There were some minor adjustments to the code. Unused code blocks have been removed as well. I have therefore created a new webrev. You can find it here:
    >    http://cr.openjdk.java.net/~lucy/webrevs/8231460.01/ 
    > 
    > Thanks for investing your time!
    > Lutz
    > 
    > 
    > On 21.10.19, 15:06, "Andrew Dinn" <adinn at redhat.com> wrote:
    > 
    >     Hi Lutz,
    >     
    >     On 21/10/2019 13:37, Schmidt, Lutz wrote:
    >     > I understand what you are interested in. And I was hoping to be able
    >     > to provide some (first) numbers by today. Unfortunately, the
    >     > measurement code I activated last Friday was buggy and blew most of
    >     > the tests I had hoped to run over the weekend.
    >     > 
    >     > I will take your modified test and run it with and without my
    >     > optimization. In parallel, I will try to generate some (non-random)
    >     > numbers for other tests.
    >     > 
    >     > I'll be back as soon as I have results.
    >     
    >     Thanks for trying the test and also for deriving some call stats from a
    >     real example. I'm keen to see how much your patch improves things.
    >     
    >     regards,
    >     
    >     
    >     Andrew Dinn
    >     -----------
    >     Senior Principal Software Engineer
    >     Red Hat UK Ltd
    >     Registered in England and Wales under Company Registration No. 03798903
    >     Directors: Michael Cunningham, Michael ("Mike") O'Neill
    >     
    >     
    > 
    > 
    
    



More information about the hotspot-compiler-dev mailing list