RFR(S): 8020151: PSR:PERF Large performance regressions when code cache is filled

Fri Sep 13 13:11:15 PDT 2013

Hi Valadimir,

thanks for looking at this. See comments inline:

On 13.09.2013 21:16, Vladimir Kozlov wrote:
> We need to ask PSR (Scott Oaks) to test these changes before we push 
> them. Prepare JPRT build bundles for them.
>
Started jprt (Stockholm) more than 5 hours ago - still in the queue.
> I am glad that we can remove speculatively_disconnect without any 
> performance issues. But we need to ask Scott to test it.
Yes, we have to have test the patch on the original program which 
exposed the bug.
>
> You need to fix an other bug (8022968) filed by Chris Pl. when C2 
> failed to allocate scratch buffer we do not report full codecache. 
> Otherwise we continue to compile and fail without sweeping. Currently 
> handle_full_code_cache() is called only from ciEnv::register_method(). 
> See calls init_scratch_buffer_blob() and cb->initialize() in 
> Compile::init_buffer().
>
Yes, I'll look at this on Monday. I think it should be fairly 
independent of this change.
> Also you need to coordinate your changes with Chris's changes for:
>
> 8016270: "CodeCache is full" message and compilation disabled, but 
> codecache is not actually full
>
> it switch off compilation gradually based on failed compilation method 
> size.
>
I have to look this bug in more detail.
> About changes:
>
> nmethod.cpp:
>
> Adjust comment that race is during sweeping:
>
>     // transition its state from 'not_entrant' to 'zombie'
>     // during CodeCache sweeping without having to wait
>     // for stack scanning.
>
> mark_as_seen_on_stack() has assert which you will hit. I surprise that 
> you don't not. You can relax the assert to check is_alive() instead.
>
Thanks for catching this. That was a last second change.
> compileBroker.cpp:
>
> Why you need this? And why scaling?:
>
> +        // Only one thread at a time can do sweeping. Scale the
> +        // wait time according to the number of compiler threads.
> +        wait_time = 100 * CICompilerCount;
>
I am not sure if we need this. The idea behind is that currently only 
one thread can sweep
at a time. The scaling has the effect that - on average - every 100ms 
there is at least one
sweep if the code cache is full.
> Swap next lines:
>
> 1953         NMethodSweeper::possibly_sweep();
> 1954         NMethodSweeper::log_sweep("disable_compiler");
>
>
I'll change that.
> I will review sweeper.cpp changes later.
>
Many thanks,
Albert

> Thanks,
> Vladimir
>
> On 9/13/13 8:48 AM, Albert Noll wrote:
>> Hi,
>>
>> thanks again for the feedback. This is the updated version.
>> http://cr.openjdk.java.net/~anoll/8020151/webrev.02/
>> <http://cr.openjdk.java.net/%7Eanoll/8020151/webrev.02/>
>>
>> @Vladimir:
>> I hope I could fix the code issues properly. I tried your suggested
>> change in JavaCalls::call_helper() but it did not lead
>> to more code reanimation.
>>
>> @Chris P.
>> Thanks again for your feedback. You are right, the algorithm could be
>> made more efficient.
>> However, I did not consider that code is being re-animated, so it
>> probably does not help
>> to reduce fragmentation. We are working on a patch where we have
>> multiple code heaps.
>> This should reduce fragmentation, since we can put subs, adapters, code
>> compiled with
>> different tiers into different heaps. More about this will follow most
>> likely next week.
>>
>> I did experiments that show the performance of the current version.
>> Short version:
>> The current patch performs, with the used benchmarks, significantly
>> better than
>> the default way of handling a full code cache. This version has no
>> performance regression
>> when the code cache does not filled up.
>>
>> More detailed data can be found here:
>> https://bugs.openjdk.java.net/browse/JDK-8020151
>>
>>
>> Change since last version (webrev.01) :
>>
>> Since the standard way of handling a full code cache
>> (speculatively_disconnect + re-animate)
>> performs worse (and does not solve the problem) I see no reason to keep.
>> As a result, I removed
>> that code.
>>
>> This is a large change to an import part of Hotspot; detailed and
>> critical reviews are very welcome.
>>
>> Best,
>> Albert
>>
>>
>> On 04.09.2013 22:33, Vladimir Kozlov wrote:
>>> On 9/4/13 4:07 AM, Albert Noll wrote:
>>>> Hi,
>>>>
>>>> here is an updated version of the patch. In summary, changes to the
>>>> previous version are:
>>>>
>>>> 1) Adapt NmethodSweepFraction based on the code cache size:
>>>>      Before the patch, NmethodSweepFraction was defined as a constant
>>>> value.
>>>>      I.e., NmethodSweepFraction was 16 for a code cache size of 32m as
>>>> well as a
>>>>      code cache size of 256m. This seems unreasonable, since
>>>> NmethodSweepFraction
>>>>      was originally introduce to keep the time spent in the sweeper
>>>> (note that sweeping
>>>>      was originally done at safepoints) at a reasonable level.
>>>
>>> Albert,
>>>
>>> General question first. Do we still need to sweep in fractions since
>>> we don't do sweeping at safepoint and it stops for safepoints? The
>>> only drawback is one Compiler thread is used for sweeping and can't
>>> compile when it sweeps. But we will address that later by using
>>> special sweeping thread.
>>>
>>>>      The patch adapts NmethodSweepFraction to the 
>>>> ReservedCodeCacheSize.
>>>> I.e.,
>>>>      each sweep operation roughly  covers 16m of the code cache.
>>>
>>> Why you stop at 16?:
>>>
>>> +    } else if (ReservedCodeCacheSize > (256 * M)) {
>>> +      FLAG_SET_DEFAULT(NmethodSweepFraction, 16);
>>>
>>> ReservedCodeCacheSize could be 2GB so one sweep have to process 128Mb.
>>> Next should cover all cases, I think:
>>>
>>> FLAG_SET_DEFAULT(NmethodSweepFraction, 1 + ReservedCodeCacheSize / (16
>>> * M))
>>>
>>>>
>>>> 2) The parameter NmethodSweepFraction is replaced by a function that
>>>> computes the
>>>>      NmethodSweepFraction based on the code cache size. See
>>>> 'sweeper.cpp' for a detailed
>>>>      description
>>>>
>>>> 3) The initial hotness counter depends on the code cache size. More
>>>> specifically, the initial
>>>>      hotness counter is 2*ReservedCodeCacheSize. As a result, a method
>>>> stays longer in the
>>>>      code cache if ReservedCodeCacheSize is larger.
>>>
>>> I don't like spreading '(ReservedCodeCacheSize / M) * 2' through the
>>> code. Use NMethodSweeper::get_hotness_counter_reset_val() in nmethod
>>> constuctors.
>>>
>>> Also you need to check for min value if ReservedCodeCacheSize < 1Mb.
>>>
>>> Also drop 'get_' from name. We usually don't use 'get_' in accessor
>>> methods. the same for nmethod::get_hotness_counter().
>>>
>>> I already asked you to make hotness_counter_decay member of nmethod so
>>> that you don't need to pass it to dec_hotness_counter(). Or remove it
>>> at all since it is always 1:
>>>
>>> void dec_hotness_counter() { _hotness_counter--;  }
>>>
>>> I would prefer all code which modifies _hotness_counter be in nmethod
>>> class.
>>>
>>>>
>>>> 4) Newly compiled methods are guaranteed to not be evicted by the
>>>> sweeper for 10
>>>>      sweep cycles. This ensures the newly compiled methods are not
>>>> immediately made
>>>>      not entrant after compialtion.
>>>>
>>>> 5) The hotness counter is reset EVERY TIME active methods are scanned.
>>>> In the previous version
>>>>      the hotness counter was only reset after a full sweep cycle of 
>>>> the
>>>> code cache. Resetting the
>>>>     hotness counter more frequently provides a better hotness coverage
>>>> of methods.
>>>>
>>>> 6) Methods are flushed in blocks of 1m. The algorithm computes the
>>>> average hotness of a
>>>>      a nmethod block and evicts the entire block. This should reduce
>>>> fragmentation.
>>>>
>>>> Please let me know what you think about these changes. Performance
>>>> results can be found at:
>>>> https://bugs.openjdk.java.net/browse/JDK-8020151
>>>> I will continuously provide more results.
>>>>
>>>> Here is the new webrev:
>>>> http://cr.openjdk.java.net/~anoll/8020151/webrev.01/
>>>> <http://cr.openjdk.java.net/%7Eanoll/8020151/webrev.01/>
>>>
>>> Very nice comment in sweeper.hpp. Could you please extend it?
>>>
>>> 2) sweep nmethods
>>>
>>> Sweeping is currently done by Compiler thread between compilations or
>>> at least each 5 sec (NmethodSweepCheckInterval) when CodeCache is full.
>>>
>>> The comment is not cleare that nmethods could be marked as not-entrant
>>> by different code (deoptimization, dependency invalidation, replace
>>> old nmethod) and not just by sweeper (which does that only for flushed
>>> nmethods).
>>>
>>> Each nmethod's state change happens during separate sweeps. It may
>>> take at least 3 sweeps before nmethod's space is freed.
>>>
>>> 3) code cache flushing
>>> ... as a VM operation at safepoint.
>>>
>>> Flashing VM operation requires safepoint.
>>>
>>> As you pointed current code re-animates nmethods only if we get
>>> compilation request. It means that it depends on profiling counters.
>>> It could be the case that a method have to be executed in interpreter
>>> for some time again to hit compilation threshold.
>>>
>>> I think we need to re-animate nmethod immediately during java call:
>>>
>>> JavaCalls::call_helper()
>>>
>>>   if (CompilationPolicy::must_be_compiled(method)) {
>>>     CompileBroker::compile_method(method, InvocationEntryBci,
>>>
>>> CompilationPolicy::policy()->initial_compile_level(),
>>>                                   methodHandle(), 0,
>>> "must_be_compiled", CHECK);
>>> + } else if (UseCodeCacheFlushing) {
>>> +   nmethod* saved = CodeCache::reanimate_saved_code(method());
>>> +   if (saved != NULL) {
>>> +     method->set_code(method, saved);
>>> +   }
>>>   }
>>>
>>> Can you verify how this work?
>>>
>>> I will send additional comments when look through the rest of code.
>>>
>>> Thanks,
>>> Vladimir
>>>
>>>>
>>>> Best,
>>>> Albert
>>>>
>>>>
>>>> On 22.08.2013 15:16, Albert Noll wrote:
>>>>> Hi Igor,
>>>>>
>>>>> thanks again for your comments. You are right. I will run some
>>>>> benchmarks to see if removing
>>>>> the disconnect logic is feasible or not.
>>>>>
>>>>> Best,
>>>>> Albert
>>>>>
>>>>> On 22.08.2013 12:39, Igor Veresov wrote:
>>>>>> It's worth a try. But the "hotness" logic is probabilistic and
>>>>>> imprecise - it will notice only methods that are on stack during
>>>>>> safepoints. Those are going to be pretty narrow snapshots of
>>>>>> activity. I suspect that for large flat profiles (like enterprise
>>>>>> apps and friends) you could be missing methods that are rather warm
>>>>>> in reality, which will cause recompilation oscillations. Although if
>>>>>> the statistics are allowed to accumulate enough may be it's going to
>>>>>> work out, an experiment will tell. The patch is a good start, the
>>>>>> stack sampling, IMO, is totally the right approach for filtering out
>>>>>> the hot methods.
>>>>>>
>>>>>> The "disconnect" logic on the other hand is sort of precise.
>>>>>> Although, looking now at the code it's not quite clear to me how it
>>>>>> works, it doesn't seem to be any patching going on to divert the
>>>>>> control for the cases when the nmethod is called directly or through
>>>>>> an IC. So I guess it's not really a full disconnect? Anyways, in
>>>>>> theory, with some work, we can make the disconnect logic to 
>>>>>> precisely
>>>>>> measure the time the method is inactive. Which should provide 
>>>>>> precise
>>>>>> information about the warm/cold methods.
>>>>>>
>>>>>> Btw, also just noticed a bunch of flaws in the interaction of the
>>>>>> disconnect logic and tiered. The nmethod's "reconnection" happens in
>>>>>> CompileBroker::compile_method(), which firstly will be called by
>>>>>> tiered only after a rather substantial number of invocations in the
>>>>>> interpreter (up to 128), and secondly will be subject to all the
>>>>>> prioritization rules (it probably should not), and also we don't
>>>>>> check if the comp level of the reanimated nmethod matches the
>>>>>> request. If the disconnect logic is to stay, the interpreter should
>>>>>> be able to know if the method has saved code and be able to call 
>>>>>> into
>>>>>> the runtime immediately to reanimate it.
>>>>>>
>>>>>> igor
>>>>>>
>>>>>> On Aug 21, 2013, at 10:27 PM, Albert Noll <albert.noll at oracle.com
>>>>>> <mailto:albert.noll at oracle.com>> wrote:
>>>>>>
>>>>>>> Hi Igor,
>>>>>>>
>>>>>>> thanks for looking at the patch. Actually, I think - just as
>>>>>>> Vladimir pointed out - that we can get rid of
>>>>>>> the "disconnect" logic. We now have the hotness of a method and if
>>>>>>> the code cache fills up, we and
>>>>>>> we decide to schedule the method for removal, we set it to
>>>>>>> not_entrant.
>>>>>>> It seems that adding the method to the list of disconnected methods
>>>>>>> just buys a little more time until we decide to make the method
>>>>>>> not-entrant. However, we can have the same effect by setting the
>>>>>>> threshold differently.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Albert
>>>>>>>
>>>>>>> On 22.08.2013 10:02, Igor Veresov wrote:
>>>>>>>> May be instead of "(_traversals > _last_flush_traversal_id + 
>>>>>>>> 2)" we
>>>>>>>> should timestamp a method when it's disconnected, and then use a
>>>>>>>> rule like if a method has been disconnected for k *
>>>>>>>> reverse_free_ratio() seconds then it's ok to kill it. We can also
>>>>>>>> sort the nmethods that pass that filter by the amount of time they
>>>>>>>> were disconnected and select most likely candidates for flushing.
>>>>>>>> This should allow to basically do disconnect/flush in every
>>>>>>>> traversal, which should make things faster. Timestamps would be
>>>>>>>> obtained only once per traversal or something like that. What do
>>>>>>>> you think?
>>>>>>>>
>>>>>>>> Pretty cool idea to reverse-prioritize disconnects on hotness.
>>>>>>>>
>>>>>>>> igor
>>>>>>>>
>>>>>>>> On Aug 21, 2013, at 4:42 AM, Albert Noll <albert.noll at oracle.com
>>>>>>>> <mailto:albert.noll at oracle.com>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> could I have reviews for this patch? Please note
>>>>>>>>> that I do not yet feel very confident with the sweeper,
>>>>>>>>> so please take a close look.
>>>>>>>>>
>>>>>>>>> jbs: https://jbs.oracle.com/bugs/browse/JDK-8020151
>>>>>>>>> webrev: http://cr.openjdk.java.net/~anoll/8020151/webrev.00/
>>>>>>>>> <http://cr.openjdk.java.net/%7Eanoll/8020151/webrev.00/>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Many thanks in advance,
>>>>>>>>> Albert
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Problem: There can be large performance regressions when the code
>>>>>>>>> cache fills up. There are
>>>>>>>>> several reasons for the performance regression: First (1), when
>>>>>>>>> the code cache is full and methods
>>>>>>>>> are speculatively disconnected, the oldest methods (based on
>>>>>>>>> compilation ID) are scheduled for
>>>>>>>>> flushing. This can result in flushing hot methods. Second (2),
>>>>>>>>> when compilation is disabled due to a full
>>>>>>>>> code cache, the number of sweeps can go down. A lower number of
>>>>>>>>> sweep operations results
>>>>>>>>> in slower method flushing.
>>>>>>>>>
>>>>>>>>> Solution:
>>>>>>>>> Introduce a hotness counter that is set to a particular value
>>>>>>>>> (e.g., 100) when there is an activation
>>>>>>>>> of the method during stack scanning. The counter is 
>>>>>>>>> decremented by
>>>>>>>>> 1 every time the sweeper
>>>>>>>>> is invoked.
>>>>>>>>>
>>>>>>>>> ad (1):
>>>>>>>>>   A VM operation that speculatively disconnects nmethods, selects
>>>>>>>>> the methods that should be
>>>>>>>>>   flushed based on the hotness. For example, if 50% of the code
>>>>>>>>> cache shall be flushed, we flush
>>>>>>>>>   those methods that have not been active while stack scanning 
>>>>>>>>> for
>>>>>>>>> the longest time. Note that
>>>>>>>>>   while this strategy is more likely to flush cold methods, it is
>>>>>>>>> not clear to what extent the new
>>>>>>>>>   strategy fragments the code cache.
>>>>>>>>>
>>>>>>>>>   Changes in NMethodSweeper::speculative_disconnect_nmethods(bool
>>>>>>>>> is_full)
>>>>>>>>>
>>>>>>>>> ad (2)
>>>>>>>>>   Currently, methods are removed from the code cache if:
>>>>>>>>>     a) code cache is full
>>>>>>>>>     b) class is unloaded
>>>>>>>>>     c) method is replaced by another version (i.e., compiled with
>>>>>>>>> a different tier)
>>>>>>>>>     d) deopt
>>>>>>>>>
>>>>>>>>>    The current patch adds a 5-th possibility to remove a method
>>>>>>>>> from the code cache.
>>>>>>>>>    In particular, if a method has not been active during stack
>>>>>>>>> scanning for a long-enough
>>>>>>>>>    amount of time, the method is removed from the code cache. The
>>>>>>>>> amount of time
>>>>>>>>>    required to flush the method depends on the available space in
>>>>>>>>> the code cache.
>>>>>>>>>
>>>>>>>>>    Here is one example: If a method was seen on a stack the
>>>>>>>>> hotness counter
>>>>>>>>>    is set to 100. A sweep operation takes roughly place every
>>>>>>>>> 100ms. I.e., it takes
>>>>>>>>>    100ms * 100 = 10s until the hotness counter reaches 0. The
>>>>>>>>> threshold that determines
>>>>>>>>>    if a method should be removed from the code cache is 
>>>>>>>>> calculated
>>>>>>>>> as follows:
>>>>>>>>>
>>>>>>>>>    threshold = -100 + (CodeCache::reverse_free_ratio() *
>>>>>>>>> NMethodSweepActivity)
>>>>>>>>>
>>>>>>>>>     For example, if 25% of the code cache is free,
>>>>>>>>> reverse_free_ratio returns 4.
>>>>>>>>>     The default value of NMethodSweepActivity is 10. As a result,
>>>>>>>>> threshold = -60.
>>>>>>>>>     Consequently, all methods that have a hotness value smaller
>>>>>>>>> than -60 (which
>>>>>>>>>     means they have not been seen on the stack for 16s) are
>>>>>>>>> scheduled to be flushed
>>>>>>>>>     from the code cache. See an illustration of the threshold 
>>>>>>>>> as a
>>>>>>>>> function of the available
>>>>>>>>>     code cache in threshold.pdf
>>>>>>>>>
>>>>>>>>>     Note that NMethodSweepActivity is a parameter that can be
>>>>>>>>> specified via a -XX
>>>>>>>>>     flag.
>>>>>>>>>
>>>>>>>>> Changes in NMethodSweeper::sweep_code_cache()
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> A very preliminary performance evaluation looks promising. I used
>>>>>>>>> the DaCapo
>>>>>>>>> benchmarks where a series of benchmarks is executed in the 
>>>>>>>>> same VM
>>>>>>>>> instance.
>>>>>>>>> See performance.pdf . The x-axis shows the benchmarks. Assume we
>>>>>>>>> have 2 benchmarks
>>>>>>>>> (BM). The execution sequence is as follows:
>>>>>>>>>
>>>>>>>>> BM1 (Run 1-1)
>>>>>>>>> BM1 (Run 2-1)
>>>>>>>>> BM2 (Run 1-1)
>>>>>>>>> BM2 (Run 2-1)
>>>>>>>>>
>>>>>>>>> BM1 (Run 1-2)
>>>>>>>>> BM1 (Run 2-2)
>>>>>>>>> BM2 (Run 1-2)
>>>>>>>>> BM2 (Run 2-2)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> A value larger than 0 on the x-axis indicates that the version
>>>>>>>>> including the proposed patch is faster.
>>>>>>>>> I.e., the values are calculated as follows: (T_original /
>>>>>>>>> T_with_patch) - 1. T is the execution time
>>>>>>>>> (wall clock time) of the benchmark. ReservedCodeCacheSize is set
>>>>>>>>> to 50m.  I used three runs and
>>>>>>>>> the arithmetic average to compare the numbers. I know, we need
>>>>>>>>> much more data, however,
>>>>>>>>> I think we can see a trend.
>>>>>>>>>
>>>>>>>>> The current patch does not trigger a warning that the code cache
>>>>>>>>> is full and compilation has been
>>>>>>>>> disabled.
>>>>>>>>>
>>>>>>>>> Please let me know that you think.
>>>>>>>>> <threshold.pdf><performance.pdf>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>