RFR(S): 8020151: PSR:PERF Large performance regressions when code cache is filled
Vladimir Kozlov
vladimir.kozlov at oracle.com
Fri Sep 13 15:00:12 PDT 2013
sweeper:
Nice comment about nmethod removal!
I thought you would remove the next check in mark_active_nmethods(); we
talked about that:
if (CompiledIC_lock->is_locked() || Patching_lock->is_locked()) return;
Swap the next lines:
223 // Only set hotness counter
224 } else {
The next comment does not match the check (flushed_memory > 0); also rename
it to 'freed_memory':
367 // Note that typically several kB are released per sweep cycle (16MB).
The next original comment was correct:
397 // since the locks acquired below might safepoint.
It means that a safepoint may happen while the locks are held; maybe:
397 // since a safepoint may happen while the locks acquired below are held.
Move the next comment below the 'if (nm->hotness_counter() < threshold)' check.
I think you need to explain what the threshold formula means:
+ double threshold = -reset_val + (CodeCache::reverse_free_ratio() * NmethodSweepActivity);
The less free space we have, the bigger reverse_free_ratio() gets; once it
reaches reset_val / NmethodSweepActivity the threshold becomes > 0 and we
start marking nmethods.
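For illustration, here is a small standalone sketch of that behaviour with the
defaults mentioned in this thread (reverse_free_ratio() is modeled simply as
total/free; this is not the sweeper code):

#include <cstdio>

// Stand-in for CodeCache::reverse_free_ratio(): total/free, grows as free space shrinks.
static double reverse_free_ratio(int free_pct) {
  return 100.0 / free_pct;
}

int main() {
  const double reset_val = 100.0;            // hotness counter reset value
  const double NmethodSweepActivity = 10.0;  // default flag value
  // The threshold turns positive once reverse_free_ratio() exceeds
  // reset_val / NmethodSweepActivity == 10, i.e. below 10% free space.
  for (int free_pct = 50; free_pct >= 5; free_pct -= 5) {
    double threshold = -reset_val + reverse_free_ratio(free_pct) * NmethodSweepActivity;
    printf("free=%2d%%  threshold=%6.1f\n", free_pct, threshold);
  }
  return 0;
}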
What happened to sort_nmethods_by_hotness()?
arguments.cpp: add checks that the specified NmethodSweepActivity and
NmethodSweepFraction values are reasonable.
Additional comments:
I talked with Igor and we think that instead of setting and checking
traversal_count we need to add several additional states to the nmethod
and have a well-defined state transition machine.
Thanks,
Vladimir
On 9/13/13 1:11 PM, Albert Noll wrote:
> Hi Vladimir,
>
> thanks for looking at this. See comments inline:
>
> On 13.09.2013 21:16, Vladimir Kozlov wrote:
>> We need to ask PSR (Scott Oaks) to test these changes before we push
>> them. Prepare JPRT build bundles for them.
>>
> Started jprt (Stockholm) more than 5 hours ago - still in the queue.
>> I am glad that we can remove speculatively_disconnect without any
>> performance issues. But we need to ask Scott to test it.
> Yes, we have to test the patch on the original program which exposed
> the bug.
>>
>> You need to fix another bug (8022968), filed by Chris Pl.: when C2 fails
>> to allocate a scratch buffer we do not report a full code cache.
>> Otherwise we continue to compile and fail without sweeping. Currently
>> handle_full_code_cache() is called only from ciEnv::register_method().
>> See the calls to init_scratch_buffer_blob() and cb->initialize() in
>> Compile::init_buffer().
>>
> Yes, I'll look at this on Monday. I think it should be fairly
> independent of this change.
>> Also you need to coordinate your changes with Chris's changes for:
>>
>> 8016270: "CodeCache is full" message and compilation disabled, but
>> codecache is not actually full
>>
>> It switches off compilation gradually, based on the size of the method
>> whose compilation failed.
>>
> I have to look at this bug in more detail.
>> About changes:
>>
>> nmethod.cpp:
>>
>> Adjust the comment to state that the race is during sweeping:
>>
>> // transition its state from 'not_entrant' to 'zombie'
>> // during CodeCache sweeping without having to wait
>> // for stack scanning.
>>
>> mark_as_seen_on_stack() has an assert which you will hit. I am surprised
>> that you do not. You can relax the assert to check is_alive() instead.
>>
> Thanks for catching this. That was a last second change.
>> compileBroker.cpp:
>>
>> Why do you need this? And why the scaling?:
>>
>> + // Only one thread at a time can do sweeping. Scale the
>> + // wait time according to the number of compiler threads.
>> + wait_time = 100 * CICompilerCount;
>>
> I am not sure if we need this. The idea behind it is that currently only
> one thread can sweep at a time. The scaling has the effect that, on
> average, there is at least one sweep every 100ms if the code cache is full.
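> A back-of-the-envelope sketch of that reasoning (a standalone toy, not the
> compileBroker.cpp code; the numbers are illustrative): each compiler thread
> waits wait_time = 100 * CICompilerCount ms, so the combined rate stays at
> roughly one sweep attempt per 100ms regardless of the thread count:
>
> #include <cstdio>
>
> int main() {
>   for (int CICompilerCount = 1; CICompilerCount <= 8; CICompilerCount *= 2) {
>     int wait_time_ms = 100 * CICompilerCount;  // per-thread wait between attempts
>     // N threads, each waking every wait_time_ms => combined attempts per second
>     double attempts_per_sec = 1000.0 * CICompilerCount / wait_time_ms;
>     printf("threads=%d  wait_time=%4dms  combined attempts/sec=%.0f\n",
>            CICompilerCount, wait_time_ms, attempts_per_sec);
>   }
>   return 0;
> }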
>> Swap the next lines:
>>
>> 1953 NMethodSweeper::possibly_sweep();
>> 1954 NMethodSweeper::log_sweep("disable_compiler");
>>
>>
> I'll change that.
>> I will review sweeper.cpp changes later.
>>
> Many thanks,
> Albert
>
>> Thanks,
>> Vladimir
>>
>> On 9/13/13 8:48 AM, Albert Noll wrote:
>>> Hi,
>>>
>>> thanks again for the feedback. This is the updated version.
>>> http://cr.openjdk.java.net/~anoll/8020151/webrev.02/
>>>
>>> @Vladimir:
>>> I hope I could fix the code issues properly. I tried your suggested
>>> change in JavaCalls::call_helper() but it did not lead
>>> to more code reanimation.
>>>
>>> @Chris P.
>>> Thanks again for your feedback. You are right, the algorithm could be
>>> made more efficient.
>>> However, I did not consider that code is being re-animated, so it
>>> probably does not help
> to reduce fragmentation. We are working on a patch where we have multiple
> code heaps. This should reduce fragmentation, since we can put stubs,
> adapters, and code compiled with different tiers into different heaps.
> More about this will most likely follow next week.
>>>
> I did experiments that show the performance of the current version.
> Short version: with the benchmarks used, the current patch performs
> significantly better than the default way of handling a full code cache.
> This version has no performance regression when the code cache does not
> fill up.
>>>
>>> More detailed data can be found here:
>>> https://bugs.openjdk.java.net/browse/JDK-8020151
>>>
>>>
>>> Change since last version (webrev.01) :
>>>
> Since the standard way of handling a full code cache
> (speculatively_disconnect + re-animate) performs worse (and does not solve
> the problem), I see no reason to keep it. As a result, I removed that code.
>>>
> This is a large change to an important part of HotSpot; detailed and
> critical reviews are very welcome.
>>>
>>> Best,
>>> Albert
>>>
>>>
>>> On 04.09.2013 22:33, Vladimir Kozlov wrote:
>>>> On 9/4/13 4:07 AM, Albert Noll wrote:
>>>>> Hi,
>>>>>
>>>>> here is an updated version of the patch. In summary, changes to the
>>>>> previous version are:
>>>>>
>>>>> 1) Adapt NmethodSweepFraction based on the code cache size:
>>>>> Before the patch, NmethodSweepFraction was defined as a constant
>>>>> value.
>>>>> I.e., NmethodSweepFraction was 16 for a code cache size of 32m as
>>>>> well as a
>>>>> code cache size of 256m. This seems unreasonable, since
>>>>> NmethodSweepFraction
>>>>> was originally introduced to keep the time spent in the sweeper
>>>>> (note that sweeping
>>>>> was originally done at safepoints) at a reasonable level.
>>>>
>>>> Albert,
>>>>
>>>> General question first. Do we still need to sweep in fractions, since
>>>> sweeping is no longer done at a safepoint and it stops for safepoints?
>>>> The only drawback is that one compiler thread is used for sweeping and
>>>> cannot compile while it sweeps. But we will address that later by using
>>>> a special sweeping thread.
>>>>
>>>>> The patch adapts NmethodSweepFraction to the ReservedCodeCacheSize.
>>>>> I.e., each sweep operation roughly covers 16m of the code cache.
>>>>
>>>> Why do you stop at 16?:
>>>>
>>>> + } else if (ReservedCodeCacheSize > (256 * M)) {
>>>> + FLAG_SET_DEFAULT(NmethodSweepFraction, 16);
>>>>
>>>> ReservedCodeCacheSize could be 2GB, so one sweep would have to process
>>>> 128MB. The following should cover all cases, I think:
>>>>
>>>> FLAG_SET_DEFAULT(NmethodSweepFraction, 1 + ReservedCodeCacheSize / (16 * M))
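>>>> A quick standalone check of what that formula yields (M here is just
>>>> 1024*1024 and the sizes are examples, not the HotSpot globals):
>>>>
>>>> #include <cstdio>
>>>>
>>>> int main() {
>>>>   const long long M = 1024 * 1024;
>>>>   long long sizes[] = { 16 * M, 32 * M, 256 * M, 512 * M, 2048 * M };
>>>>   for (long long ReservedCodeCacheSize : sizes) {
>>>>     long long NmethodSweepFraction = 1 + ReservedCodeCacheSize / (16 * M);
>>>>     // each sweep then covers roughly ReservedCodeCacheSize / NmethodSweepFraction
>>>>     printf("cache=%4lldM  fraction=%4lld  per-sweep~=%lldM\n",
>>>>            ReservedCodeCacheSize / M, NmethodSweepFraction,
>>>>            (ReservedCodeCacheSize / NmethodSweepFraction) / M);
>>>>   }
>>>>   return 0;
>>>> }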
>>>>
>>>>>
>>>>> 2) The parameter NmethodSweepFraction is replaced by a function that
>>>>> computes the NmethodSweepFraction based on the code cache size. See
>>>>> 'sweeper.cpp' for a detailed description.
>>>>>
>>>>> 3) The initial hotness counter depends on the code cache size. More
>>>>> specifically, the initial
>>>>> hotness counter is 2*ReservedCodeCacheSize. As a result, a method
>>>>> stays longer in the
>>>>> code cache if ReservedCodeCacheSize is larger.
>>>>
>>>> I don't like spreading '(ReservedCodeCacheSize / M) * 2' through the
>>>> code. Use NMethodSweeper::get_hotness_counter_reset_val() in nmethod
>>>> constructors.
>>>>
>>>> Also you need to check for a minimum value if ReservedCodeCacheSize < 1MB.
>>>>
>>>> Also drop 'get_' from the name. We usually don't use 'get_' in accessor
>>>> methods; the same goes for nmethod::get_hotness_counter().
>>>>
>>>> I already asked you to make hotness_counter_decay a member of nmethod so
>>>> that you don't need to pass it to dec_hotness_counter(). Or remove it
>>>> altogether, since it is always 1:
>>>>
>>>> void dec_hotness_counter() { _hotness_counter--; }
>>>>
>>>> I would prefer that all code which modifies _hotness_counter be in the
>>>> nmethod class.
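>>>> Something along these lines is the shape I mean (a sketch only; the names
>>>> and the reset value are illustrative, not the actual nmethod.hpp):
>>>>
>>>> // Keep all _hotness_counter updates inside the class.
>>>> class nmethod_sketch {
>>>>  private:
>>>>   int _hotness_counter;
>>>>  public:
>>>>   // no 'get_' prefix on accessors
>>>>   static int hotness_counter_reset_val() {
>>>>     return 100;  // e.g. derived from ReservedCodeCacheSize, with a sane minimum
>>>>   }
>>>>   nmethod_sketch() : _hotness_counter(hotness_counter_reset_val()) {}
>>>>   int  hotness_counter() const    { return _hotness_counter; }
>>>>   void set_hotness_counter(int c) { _hotness_counter = c; }
>>>>   void dec_hotness_counter()      { _hotness_counter--; }
>>>> };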
>>>>
>>>>>
>>>>> 4) Newly compiled methods are guaranteed not to be evicted by the
>>>>> sweeper for 10 sweep cycles. This ensures that newly compiled methods
>>>>> are not immediately made not-entrant after compilation.
>>>>>
>>>>> 5) The hotness counter is reset EVERY TIME active methods are scanned.
>>>>> In the previous version the hotness counter was only reset after a full
>>>>> sweep cycle of the code cache. Resetting the hotness counter more
>>>>> frequently provides a better hotness coverage of methods.
>>>>>
>>>>> 6) Methods are flushed in blocks of 1m. The algorithm computes the
>>>>> average hotness of an nmethod block and evicts the entire block. This
>>>>> should reduce fragmentation.
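>>>>> As a toy model of that block decision (purely illustrative; the real code
>>>>> walks actual nmethods in the code cache, and the threshold value is just
>>>>> the example from this thread):
>>>>>
>>>>> #include <cstdio>
>>>>>
>>>>> int main() {
>>>>>   // hypothetical hotness values of the nmethods in one ~1m block
>>>>>   int hotness[] = { -70, -90, 10, -80, -95 };
>>>>>   const int n = sizeof(hotness) / sizeof(hotness[0]);
>>>>>   const double threshold = -60.0;  // e.g. the value at 25% free space
>>>>>
>>>>>   double sum = 0;
>>>>>   for (int i = 0; i < n; i++) sum += hotness[i];
>>>>>   double avg = sum / n;
>>>>>
>>>>>   // evict the whole block if its average hotness is below the threshold,
>>>>>   // even though one method in it (hotness 10) is still warm
>>>>>   printf("avg=%.1f -> %s\n", avg, avg < threshold ? "evict block" : "keep block");
>>>>>   return 0;
>>>>> }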
>>>>>
>>>>> Please let me know what you think about these changes. Performance
>>>>> results can be found at:
>>>>> https://bugs.openjdk.java.net/browse/JDK-8020151
>>>>> I will continuously provide more results.
>>>>>
>>>>> Here is the new webrev:
>>>>> http://cr.openjdk.java.net/~anoll/8020151/webrev.01/
>>>>
>>>> Very nice comment in sweeper.hpp. Could you please extend it?
>>>>
>>>> 2) sweep nmethods
>>>>
>>>> Sweeping is currently done by a compiler thread between compilations, or
>>>> at least every 5 sec (NmethodSweepCheckInterval) when the CodeCache is full.
>>>>
>>>> The comment is not clear that nmethods could be marked as not-entrant by
>>>> different code (deoptimization, dependency invalidation, replacing an
>>>> old nmethod) and not just by the sweeper (which does that only for
>>>> flushed nmethods).
>>>>
>>>> Each nmethod state change happens during a separate sweep. It may take
>>>> at least 3 sweeps before an nmethod's space is freed.
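>>>> Maybe spell that out in the comment roughly like this (state names are
>>>> approximate and should be checked against nmethod.hpp):
>>>>
>>>> // Rough per-sweep progression for a cold nmethod:
>>>> //   sweep N  : alive       -> not_entrant  (the sweeper does this only for
>>>> //                                           cold nmethods; deoptimization,
>>>> //                                           dependency invalidation and
>>>> //                                           replacement also produce
>>>> //                                           not_entrant nmethods)
>>>> //   sweep N+1: not_entrant -> zombie       (once no activations remain on any stack)
>>>> //   sweep N+2: zombie      -> freed        (space is released back to the code cache)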
>>>>
>>>> 3) code cache flushing
>>>> ... as a VM operation at safepoint.
>>>>
>>>> The flushing VM operation requires a safepoint.
>>>>
>>>> As you pointed out, the current code re-animates nmethods only if we get
>>>> a compilation request. It means that it depends on profiling counters.
>>>> It could be the case that a method has to be executed in the interpreter
>>>> for some time again to hit the compilation threshold.
>>>>
>>>> I think we need to re-animate the nmethod immediately during the Java call:
>>>>
>>>> JavaCalls::call_helper()
>>>>
>>>>   if (CompilationPolicy::must_be_compiled(method)) {
>>>>     CompileBroker::compile_method(method, InvocationEntryBci,
>>>>                                   CompilationPolicy::policy()->initial_compile_level(),
>>>>                                   methodHandle(), 0,
>>>>                                   "must_be_compiled", CHECK);
>>>> + } else if (UseCodeCacheFlushing) {
>>>> +   nmethod* saved = CodeCache::reanimate_saved_code(method());
>>>> +   if (saved != NULL) {
>>>> +     method->set_code(method, saved);
>>>> +   }
>>>>   }
>>>>
>>>> Can you verify how this works?
>>>>
>>>> I will send additional comments when look through the rest of code.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>>>
>>>>> Best,
>>>>> Albert
>>>>>
>>>>>
>>>>> On 22.08.2013 15:16, Albert Noll wrote:
>>>>>> Hi Igor,
>>>>>>
>>>>>> thanks again for your comments. You are right. I will run some
>>>>>> benchmarks to see if removing
>>>>>> the disconnect logic is feasible or not.
>>>>>>
>>>>>> Best,
>>>>>> Albert
>>>>>>
>>>>>> On 22.08.2013 12:39, Igor Veresov wrote:
>>>>>>> It's worth a try. But the "hotness" logic is probabilistic and
>>>>>>> imprecise - it will notice only methods that are on stack during
>>>>>>> safepoints. Those are going to be pretty narrow snapshots of
>>>>>>> activity. I suspect that for large flat profiles (like enterprise
>>>>>>> apps and friends) you could be missing methods that are rather warm
>>>>>>> in reality, which will cause recompilation oscillations. Although if
>>>>>>> the statistics are allowed to accumulate enough, maybe it's going to
>>>>>>> work out, an experiment will tell. The patch is a good start, the
>>>>>>> stack sampling, IMO, is totally the right approach for filtering out
>>>>>>> the hot methods.
>>>>>>>
>>>>>>> The "disconnect" logic on the other hand is sort of precise.
>>>>>>> Although, looking now at the code it's not quite clear to me how it
>>>>>>> works; there doesn't seem to be any patching going on to divert the
>>>>>>> control for the cases when the nmethod is called directly or through
>>>>>>> an IC. So I guess it's not really a full disconnect? Anyways, in
>>>>>>> theory, with some work, we can make the disconnect logic precisely
>>>>>>> measure the time the method is inactive, which should provide precise
>>>>>>> information about the warm/cold methods.
>>>>>>>
>>>>>>> Btw, also just noticed a bunch of flaws in the interaction of the
>>>>>>> disconnect logic and tiered. The nmethod's "reconnection" happens in
>>>>>>> CompileBroker::compile_method(), which firstly will be called by
>>>>>>> tiered only after a rather substantial number of invocations in the
>>>>>>> interpreter (up to 128), and secondly will be subject to all the
>>>>>>> prioritization rules (it probably should not), and also we don't
>>>>>>> check if the comp level of the reanimated nmethod matches the
>>>>>>> request. If the disconnect logic is to stay, the interpreter should
>>>>>>> be able to know if the method has saved code and be able to call
>>>>>>> into
>>>>>>> the runtime immediately to reanimate it.
>>>>>>>
>>>>>>> igor
>>>>>>>
>>>>>>> On Aug 21, 2013, at 10:27 PM, Albert Noll <albert.noll at oracle.com> wrote:
>>>>>>>
>>>>>>>> Hi Igor,
>>>>>>>>
>>>>>>>> thanks for looking at the patch. Actually, I think - just as
>>>>>>>> Vladimir pointed out - that we can get rid of
>>>>>>>> the "disconnect" logic. We now have the hotness of a method and if
>>>>>>>> the code cache fills up, we and
>>>>>>>> we decide to schedule the method for removal, we set it to
>>>>>>>> not_entrant.
>>>>>>>> It seems that adding the method to the list of disconnected methods
>>>>>>>> just buys a little more time until we decide to make the method
>>>>>>>> not-entrant. However, we can have the same effect by setting the
>>>>>>>> threshold differently.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Albert
>>>>>>>>
>>>>>>>> On 22.08.2013 10:02, Igor Veresov wrote:
>>>>>>>>> Maybe instead of "(_traversals > _last_flush_traversal_id + 2)" we
>>>>>>>>> should timestamp a method when it's disconnected, and then use a
>>>>>>>>> rule like if a method has been disconnected for k *
>>>>>>>>> reverse_free_ratio() seconds then it's ok to kill it. We can also
>>>>>>>>> sort the nmethods that pass that filter by the amount of time they
>>>>>>>>> were disconnected and select most likely candidates for flushing.
>>>>>>>>> This should allow us to basically do a disconnect/flush in every
>>>>>>>>> traversal, which should make things faster. Timestamps would be
>>>>>>>>> obtained only once per traversal or something like that. What do
>>>>>>>>> you think?
>>>>>>>>>
>>>>>>>>> Pretty cool idea to reverse-prioritize disconnects on hotness.
>>>>>>>>>
>>>>>>>>> igor
>>>>>>>>>
>>>>>>>>> On Aug 21, 2013, at 4:42 AM, Albert Noll <albert.noll at oracle.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> could I have reviews for this patch? Please note
>>>>>>>>>> that I do not yet feel very confident with the sweeper,
>>>>>>>>>> so please take a close look.
>>>>>>>>>>
>>>>>>>>>> jbs: https://jbs.oracle.com/bugs/browse/JDK-8020151
>>>>>>>>>> webrev: http://cr.openjdk.java.net/~anoll/8020151/webrev.00/
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Many thanks in advance,
>>>>>>>>>> Albert
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Problem: There can be large performance regressions when the code
>>>>>>>>>> cache fills up. There are
>>>>>>>>>> several reasons for the performance regression: First (1), when
>>>>>>>>>> the code cache is full and methods
>>>>>>>>>> are speculatively disconnected, the oldest methods (based on
>>>>>>>>>> compilation ID) are scheduled for
>>>>>>>>>> flushing. This can result in flushing hot methods. Second (2),
>>>>>>>>>> when compilation is disabled due to a full
>>>>>>>>>> code cache, the number of sweeps can go down. A lower number of
>>>>>>>>>> sweep operations results
>>>>>>>>>> in slower method flushing.
>>>>>>>>>>
>>>>>>>>>> Solution:
>>>>>>>>>> Introduce a hotness counter that is set to a particular value
>>>>>>>>>> (e.g., 100) when there is an activation of the method during stack
>>>>>>>>>> scanning. The counter is decremented by 1 every time the sweeper is
>>>>>>>>>> invoked.
>>>>>>>>>>
>>>>>>>>>> ad (1):
>>>>>>>>>> A VM operation that speculatively disconnects nmethods selects the
>>>>>>>>>> methods that should be flushed based on their hotness. For example,
>>>>>>>>>> if 50% of the code cache shall be flushed, we flush those methods
>>>>>>>>>> that have not been active during stack scanning for the longest
>>>>>>>>>> time. Note that while this strategy is more likely to flush cold
>>>>>>>>>> methods, it is not clear to what extent the new strategy fragments
>>>>>>>>>> the code cache.
>>>>>>>>>>
>>>>>>>>>> Changes in NMethodSweeper::speculative_disconnect_nmethods(bool
>>>>>>>>>> is_full)
>>>>>>>>>>
>>>>>>>>>> ad (2)
>>>>>>>>>> Currently, methods are removed from the code cache if:
>>>>>>>>>> a) code cache is full
>>>>>>>>>> b) class is unloaded
>>>>>>>>>> c) method is replaced by another version (i.e., compiled with
>>>>>>>>>> a different tier)
>>>>>>>>>> d) deopt
>>>>>>>>>>
>>>>>>>>>> The current patch adds a fifth possibility to remove a method
>>>>>>>>>> from the code cache.
>>>>>>>>>> In particular, if a method has not been active during stack
>>>>>>>>>> scanning for a long-enough
>>>>>>>>>> amount of time, the method is removed from the code cache. The
>>>>>>>>>> amount of time
>>>>>>>>>> required to flush the method depends on the available space in
>>>>>>>>>> the code cache.
>>>>>>>>>>
>>>>>>>>>> Here is one example: If a method was seen on a stack, the hotness
>>>>>>>>>> counter is set to 100. A sweep operation takes place roughly every
>>>>>>>>>> 100ms. I.e., it takes 100ms * 100 = 10s until the hotness counter
>>>>>>>>>> reaches 0. The threshold that determines if a method should be
>>>>>>>>>> removed from the code cache is calculated as follows:
>>>>>>>>>> threshold = -100 + (CodeCache::reverse_free_ratio() *
>>>>>>>>>> NMethodSweepActivity)
>>>>>>>>>>
>>>>>>>>>> For example, if 25% of the code cache is free, reverse_free_ratio
>>>>>>>>>> returns 4. The default value of NMethodSweepActivity is 10. As a
>>>>>>>>>> result, threshold = -60. Consequently, all methods that have a
>>>>>>>>>> hotness value smaller than -60 (which means they have not been seen
>>>>>>>>>> on a stack for 16s) are scheduled to be flushed from the code cache.
>>>>>>>>>> See an illustration of the threshold as a function of the available
>>>>>>>>>> code cache in threshold.pdf.
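>>>>>>>>>> A small standalone sketch of exactly this computation (using the
>>>>>>>>>> example numbers above; reverse_free_ratio is modeled as total/free):
>>>>>>>>>>
>>>>>>>>>> #include <cstdio>
>>>>>>>>>>
>>>>>>>>>> int main() {
>>>>>>>>>>   const double NMethodSweepActivity = 10.0;  // default flag value
>>>>>>>>>>   const double hotness_reset = 100.0;        // counter value after a stack scan
>>>>>>>>>>   const double sweep_interval_s = 0.1;       // one sweep roughly every 100ms
>>>>>>>>>>   const int free_pct = 25;                   // 25% of the code cache is free
>>>>>>>>>>   double reverse_free_ratio = 100.0 / free_pct;  // -> 4
>>>>>>>>>>   double threshold = -hotness_reset + reverse_free_ratio * NMethodSweepActivity;  // -> -60
>>>>>>>>>>   // a method is flushed once its counter drops below the threshold, i.e.
>>>>>>>>>>   // after (hotness_reset - threshold) sweeps without a stack activation
>>>>>>>>>>   double seconds = (hotness_reset - threshold) * sweep_interval_s;  // -> 16s
>>>>>>>>>>   printf("threshold=%.0f, flushed after ~%.0fs without activation\n",
>>>>>>>>>>          threshold, seconds);
>>>>>>>>>>   return 0;
>>>>>>>>>> }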
>>>>>>>>>>
>>>>>>>>>> Note that NMethodSweepActivity is a parameter that can be
>>>>>>>>>> specified via a -XX
>>>>>>>>>> flag.
>>>>>>>>>>
>>>>>>>>>> Changes in NMethodSweeper::sweep_code_cache()
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> A very preliminary performance evaluation looks promising. I used
>>>>>>>>>> the DaCapo benchmarks, where a series of benchmarks is executed in
>>>>>>>>>> the same VM instance. See performance.pdf. The x-axis shows the
>>>>>>>>>> benchmarks. Assume we have 2 benchmarks (BM). The execution
>>>>>>>>>> sequence is as follows:
>>>>>>>>>>
>>>>>>>>>> BM1 (Run 1-1)
>>>>>>>>>> BM1 (Run 2-1)
>>>>>>>>>> BM2 (Run 1-1)
>>>>>>>>>> BM2 (Run 2-1)
>>>>>>>>>>
>>>>>>>>>> BM1 (Run 1-2)
>>>>>>>>>> BM1 (Run 2-2)
>>>>>>>>>> BM2 (Run 1-2)
>>>>>>>>>> BM2 (Run 2-2)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> A value larger than 0 on the y-axis indicates that the version
>>>>>>>>>> including the proposed patch is faster. I.e., the values are
>>>>>>>>>> calculated as follows: (T_original / T_with_patch) - 1. T is the
>>>>>>>>>> execution time (wall clock time) of the benchmark.
>>>>>>>>>> ReservedCodeCacheSize is set to 50m. I used three runs and the
>>>>>>>>>> arithmetic average to compare the numbers. I know we need much more
>>>>>>>>>> data; however, I think we can see a trend.
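>>>>>>>>>> (As a purely illustrative reading of that formula, with made-up times
>>>>>>>>>> rather than measured data: if T_original = 60s and T_with_patch = 50s,
>>>>>>>>>> the plotted value is 60/50 - 1 = 0.2, i.e. the patched version is 20%
>>>>>>>>>> faster; a negative value would indicate a regression.)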
>>>>>>>>>>
>>>>>>>>>> The current patch does not trigger a warning that the code cache
>>>>>>>>>> is full and compilation has been
>>>>>>>>>> disabled.
>>>>>>>>>>
>>>>>>>>>> Please let me know what you think.
>>>>>>>>>> <threshold.pdf><performance.pdf>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>