RFR(S): 8020151: PSR:PERF Large performance regressions when code cache is filled
Chris Plummer
chris.plummer at oracle.com
Fri Aug 30 12:46:19 PDT 2013
Hi Albert,
A couple more comments inline:
On 8/30/13 6:44 AM, Albert Noll wrote:
> Hi Chris,
>
> thanks again for sharing your thoughts. See the comments below:
>
> On 27.08.2013 21:20, Chris Plummer wrote:
>> Hi Albert,
>>
>> Comments inline below:
>>
>> On 8/27/13 6:03 AM, Albert Noll wrote:
>>> Hi Chris,
>>>
>>> thanks a lot for your feedback. I will do the performance evaluation
>>> that you proposed
>>> to validate that the code changes also perform well for small code
>>> cache sizes.
>>>
>>> On 22.08.2013 23:29, Chris Plummer wrote:
>>>> Hi Albert,
>>>>
>>>> sort_nmentod_by_hotness should be sort_nmethod_by_hotness.
>>>>
>>> Thanks, I will fix the typo.
>>>> We already store NMethodSweeper::_traversal_count in
>>>> nmethod::_stack_traversal_mark each time scan_stacks() is called.
>>>> Why not just use that to determine how long it's been since an
>>>> nmethod was active? No need to decrement it. Just look at
>>>> nmethod::_stack_traversal_mark when sorting.
>>>>
>>> _stack_traversal_mark is only set for not-entrant methods. However,
>>> the hotness counter should be available for all methods, including
>>> methods that are in the 'in_use' state. As a result, we need
>>> _hotness_counter.
>> I see what you mean about it only being set for not-entrant methods,
>> but I don't see anything that would indicate it would be problematic
>> to always set it, not just when the method is not-entrant. In fact,
>> the comment suggests that this is what is actually done, although it is not:
>>
> Currently, stack scanning and marking are only performed if there is no
> sweep in progress (!sweep_in_progress() ...). Note that this is required
> for the correctness of the sweeping procedure.
> Having a separate field (the _hotness_counter) enables us to do stack
> scanning every time the scan_stacks() function is called.
But currently you are not doing that. The change you made to initialize
the hotness is still protected by the !sweep_in_progress() code.
> This improves the precision of live method detection by approximately
> a factor of 6. So, if we want more precision, we have to accept the
> extra field plus some overhead due to the more frequent stack scanning.
> I am currently evaluating the overhead.
>> // not_entrant method removal. Each mark_sweep pass will update
>> // this mark to current sweep invocation count if it is seen on the
>> // stack. An not_entrant method can be removed when there is no
>> // more activations, i.e., when the _stack_traversal_mark is less than
>> // current sweep traversal index.
>> long _stack_traversal_mark;
>>
>> BTW, there is a bit of comment rot in the following. Maybe you can
>> clean it up as part of your changes. It should reference
>> _stack_traversal_mark instead of _last_seen_on_stack:
>>
>> // See comment at definition of _last_seen_on_stack
>> void mark_as_seen_on_stack();
>> bool can_not_entrant_be_converted();
>>
> I will clean up the comments.
>>
>>>> Also, keep in mind that how long it's been since a method was last
>>>> used does not always equate to how hot it is. You might have a
>>>> method that is called once every second, and another that is called
>>>> 1000 times every 10 seconds. The latter is actually hotter, but your
>>>> algorithm would favor sweeping the former. Not a deal breaker, but
>>>> something to consider when trying to improve your criteria for
>>>> selecting how hot a method is.
>>>>
>>> I cannot follow your example. Since we use stack scanning to
>>> determine the hotness of a method, it is irrelevant how often a
>>> method is called. The probability that a method 'M' is seen as hot
>>> depends only on the time that is spent executing 'M'; the more time
>>> we spend in 'M', the more likely it is that we hit 'M' during stack
>>> scanning. For example, assume that a program spends 10ms/second in
>>> 'M'. Whether 'M' is called once per second with an execution time of
>>> 10ms, or 10 times per second with an execution time of 1ms, is
>>> likely to yield the same hotness.
>> I was referring to what would happen if you instead used
>> _last_seen_on_stack, which only indicates how recently the method was
>> seen, not how frequently. However, I think your approach suffers from
>> the same problem, since scan_stacks sets the hotness to 100, rather
>> than increasing the hotness each time the method is found. So it too
>> is only an indicator of how recently the method was seen.
> That is true. However, hot methods are more likely to be seen
> 'recently' than cold methods. I.e.,
> the average 'temperature' of methods in which the application spends a
> significant amount of time will be higher than the average temperature
> of methods that are barely used, simply because 'hot methods' will be
> 'reheated' (by stack scanning) more often.
>
> If we increased a counter every time we see a method on the stack, we
> would have to reset/decrement the counter somewhere else (e.g., in the
> sweeper).
You already decrement in NMethodSweeper::process_nmethod().
> Otherwise, once a method is marked as hot, it will stay hot forever
> even if it no longer is. If we decrement the hotness_counter, the
> information we get is, I think, less precise than with the current
> approach. I.e., simply looking at the counter does not tell you when
> the method was last used: a value of, e.g., 10 could mean that the
> method is 'medium hot' (we do not want to evict), or that the method
> was hot a long time ago and is now cold (we want to evict). Do you
> have any thoughts about this?
In general, the issue with incrementing hotness in scan_stacks (not
just setting it to a fixed value as you are now doing) and decrementing
somewhere like process_nmethod() is that the increments can get out of
hand, and you may never decrement to anything close to 0 again. One way
around this is to have each decrement halve the hotness value. That way
you are never more than (worst case) 32 decrements from reaching 0. You
then need to decide on the proper increment value (1 is probably too
small).
Halving is desirable because over time it impacts older increments more
than newer ones. For example, let's say 10 seconds ago a hot method
accumulated increments totaling 256, and that same hot method
accumulated another 256 in the past second. If there were 3 halving
decrements between the first round of increments and the most recent
round, then the first round only accounts for 32 (256 halved three
times) while the most recent accounts for 256, for a total of 288.
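
To be concrete, here is a minimal sketch of the kind of counter I have
in mind. The class and member names are made up (they are not the
actual nmethod or sweeper fields), and the increment value is just a
placeholder:

// Illustrative sketch only: names and the increment value are made up,
// not taken from the actual nmethod/NMethodSweeper code.
class HotnessCounter {
 private:
  int _hotness;                            // accumulated "temperature" of a method
  static const int StackScanIncrement = 8; // bump per stack sighting (1 is probably too small)

 public:
  HotnessCounter() : _hotness(0) {}

  // Would be called from scan_stacks() each time the method is seen on a stack.
  void seen_on_stack() {
    _hotness += StackScanIncrement;
  }

  // Would be called once per sweep (e.g., from process_nmethod()).
  // Halving bounds the counter: any value reaches 0 within ~32 halvings,
  // and older increments lose weight faster than recent ones.
  void decay() {
    _hotness /= 2;
  }

  int hotness() const { return _hotness; }
};

The halving in decay() is what keeps the counter bounded and gives
recent stack sightings more weight than old ones.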
You still have the issue of, for example, 10 indicating either a newly
warm method or a method that was very hot not too long ago but has
since gone cold. I'm not so sure the distinction matters. A method that
was really hot maybe 10 seconds ago could be seen as just as valuable as
a method that is just starting to look hot but hasn't really proven
itself yet. The method that was really hot may get hot in cycles, and
the next cycle may be coming up soon. Meanwhile, the currently warm
method could be done with its hot cycle, and may never have gotten that
hot in the first place.
Chris
>>> Using stack scanning to determine the hotness of a method is only an
>>> approximation based on sampling. I.e., if we are unlucky and never
>>> see a hot method on the stack, the hot method will eventually be
>>> flushed.
>> True, but it doesn't need to be perfect. However, I think with tiered
>> you need to be a lot more careful than when, for example, just using
>> C1. It doesn't take that much work to get a C1 method compiled,
>> whereas there is a lot of work put into a tiered C2 compilation
>> (including doing all the tiered compilations again). So throwing away
>> a C2 tiered compiled method that is indeed hot can be costly. Maybe
>> you should take this into account when determining the hotness, and
>> give C2 tiered compiled methods a higher hotness count when found on
>> the stack.
>>
> That's a good idea. I will look into different eviction strategies.
>
>> cheers,
>>
>> Chris
>>>
>>>
>>>> Please verify your results with specjvm98 using an artificially
>>>> small codecache size. First do a run to get code cache usage using
>>>> -XX:+PrintCodeCache. For example:
>>>>
>>>> CodeCache: size=32768Kb used=3758Kb max_used=3758Kb free=29009Kb
>>>>
>>>> Take the "used" size, divide by 2, and then add 1500k
>>>> (CodeCacheFlushingMinimumFreeSpace). Use that for your new
>>>> ReservedCodeCacheSize. From my recent experience, you should see
>>>> less than a 10% performance degradation than when not constraining
>>>> the codecache size in this manner. However, what's most important
>>>> is that performance using this configuration is the same or better
>>>> than without your changes. Take another 500-1000k off the code
>>>> cache size and measure again. You'll see much bigger performance
>>>> degradation, but once again what is important is that with your
>>>> change in place, performance has not gotten any worse. Do the same
>>>> running Nashorn with the v8 benchmarks.
>>>>
>>>> thanks,
>>>>
>>>> Chris
>>>>
>>>> On 8/21/13 7:41 AM, hotspot-compiler-dev-request at openjdk.java.net
>>>> wrote:
>>>>> Hi all,
>>>>>
>>>>> could I have reviews for this patch? Please note
>>>>> that I do not yet feel very confident with the sweeper,
>>>>> so please take a close look.
>>>>>
>>>>> jbs:https://jbs.oracle.com/bugs/browse/JDK-8020151
>>>>> webrev:http://cr.openjdk.java.net/~anoll/8020151/webrev.00/
>>>>>
>>>>>
>>>>> Many thanks in advance,
>>>>> Albert
>>>>>
>>>>>
>>>>> Problem: There can be large performance regressions when the code
>>>>> cache fills up. There are several reasons for the performance
>>>>> regression: First (1), when the code cache is full and methods are
>>>>> speculatively disconnected, the oldest methods (based on compilation
>>>>> ID) are scheduled for flushing. This can result in flushing hot
>>>>> methods. Second (2), when compilation is disabled due to a full code
>>>>> cache, the number of sweeps can go down. A lower number of sweep
>>>>> operations results in slower method flushing.
>>>>>
>>>>> Solution:
>>>>> Introduce a hotness counter that is set to a particular value (e.g.,
>>>>> 100) when there is an activation of the method during stack scanning.
>>>>> The counter is decremented by 1 every time the sweeper is invoked.
>>>>>
>>>>> ad (1):
>>>>> The VM operation that speculatively disconnects nmethods selects
>>>>> the methods to be flushed based on their hotness. For example, if
>>>>> 50% of the code cache shall be flushed, we flush those methods that
>>>>> have not been active during stack scanning for the longest time.
>>>>> Note that while this strategy is more likely to flush cold methods,
>>>>> it is not clear to what extent the new strategy fragments the code
>>>>> cache.
>>>>>
>>>>> Changes in
>>>>> NMethodSweeper::speculative_disconnect_nmethods(bool is_full)
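
As a standalone illustration of the selection step described in ad (1),
the sketch below sorts compiled methods by hotness and picks the
coldest ones until the requested amount of code cache space is covered.
The struct and function names are invented for the example; they are
not HotSpot's actual data structures:

// Illustrative sketch only: CompiledMethod is a stand-in struct, not
// HotSpot's nmethod; only the selection idea follows the description.
#include <algorithm>
#include <cstddef>
#include <vector>

struct CompiledMethod {
  int hotness;      // e.g., 100 when last seen on a stack, decremented per sweep
  size_t code_size; // space the method occupies in the code cache
};

// Select the coldest methods until roughly 'bytes_to_flush' of code
// cache space has been covered (e.g., 50% of the code cache).
std::vector<CompiledMethod*> select_for_flushing(std::vector<CompiledMethod*>& all,
                                                 size_t bytes_to_flush) {
  // Coldest first: methods not seen on a stack for the longest time
  // have the lowest hotness.
  std::sort(all.begin(), all.end(),
            [](const CompiledMethod* a, const CompiledMethod* b) {
              return a->hotness < b->hotness;
            });

  std::vector<CompiledMethod*> victims;
  size_t flushed = 0;
  for (CompiledMethod* m : all) {
    if (flushed >= bytes_to_flush) break;
    victims.push_back(m);
    flushed += m->code_size;
  }
  return victims;
}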
>>>>>
>>>>> ad (2)
>>>>> Currently, methods are removed from the code cache if:
>>>>> a) the code cache is full
>>>>> b) the class is unloaded
>>>>> c) the method is replaced by another version (i.e., compiled at a
>>>>> different tier)
>>>>> d) deopt
>>>>>
>>>>> The current patch adds a fifth way to remove a method from the code
>>>>> cache. In particular, if a method has not been active during stack
>>>>> scanning for a long enough amount of time, the method is removed
>>>>> from the code cache. The amount of time required to flush the
>>>>> method depends on the available space in the code cache.
>>>>>
>>>>> Here is one example: If a method was seen on a stack, the hotness
>>>>> counter is set to 100. A sweep operation takes place roughly every
>>>>> 100ms, i.e., it takes 100ms * 100 = 10s until the hotness counter
>>>>> reaches 0. The threshold that determines if a method should be
>>>>> removed from the code cache is calculated as follows:
>>>>>
>>>>> threshold = -100 + (CodeCache::reverse_free_ratio() *
>>>>> NMethodSweepActivity)
>>>>>
>>>>> For example, if 25% of the code cache is free, reverse_free_ratio
>>>>> returns 4. The default value of NMethodSweepActivity is 10. As a
>>>>> result, threshold = -60. Consequently, all methods that have a
>>>>> hotness value smaller than -60 (which means they have not been seen
>>>>> on the stack for 16s) are scheduled to be flushed from the code
>>>>> cache. See an illustration of the threshold as a function of the
>>>>> available code cache space in threshold.pdf.
>>>>>
>>>>> Note that NMethodSweepActivity is a parameter that can be specified
>>>>> via a -XX flag.
>>>>>
>>>>> Changes in NMethodSweeper::sweep_code_cache()
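
As a standalone illustration of the flushing criterion described in
ad (2), the sketch below implements the threshold check. The function
and parameter names are invented; only the formula and the example
numbers follow the description above:

// Illustrative sketch of the eviction check described above; the names
// are invented, only the formula follows the description in the patch.
#include <cassert>

// reverse_free_ratio: e.g., 4 when 25% of the code cache is free.
// sweep_activity:     the proposed NMethodSweepActivity flag (default 10).
// hotness:            the method's hotness counter; set to 100 when last
//                     seen on a stack, decremented by 1 per sweep (~100ms).
bool should_flush(int hotness, double reverse_free_ratio, int sweep_activity) {
  double threshold = -100.0 + reverse_free_ratio * sweep_activity;
  // With 25% free: threshold = -100 + 4 * 10 = -60.  Dropping from 100
  // to below -60 takes 160 sweeps, i.e., roughly 16 seconds without a
  // stack sighting at one sweep per 100ms.
  return hotness < threshold;
}

int main() {
  assert(!should_flush(/*hotness=*/-50, /*reverse_free_ratio=*/4.0, /*sweep_activity=*/10));
  assert(should_flush(/*hotness=*/-70, /*reverse_free_ratio=*/4.0, /*sweep_activity=*/10));
  return 0;
}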
>>>>>
>>>>>
>>>>> A very preliminary performance evaluation looks promising. I used
>>>>> the DaCapo benchmarks, where a series of benchmarks is executed in
>>>>> the same VM instance. See performance.pdf. The x-axis shows the
>>>>> benchmarks. Assume we have 2 benchmarks (BM). The execution
>>>>> sequence is as follows:
>>>>>
>>>>> BM1 (Run 1-1)
>>>>> BM1 (Run 2-1)
>>>>> BM2 (Run 1-1)
>>>>> BM2 (Run 2-1)
>>>>>
>>>>> BM1 (Run 1-2)
>>>>> BM1 (Run 2-2)
>>>>> BM2 (Run 1-2)
>>>>> BM2 (Run 2-2)
>>>>>
>>>>>
>>>>> A value larger than 0 on the y-axis indicates that the version
>>>>> including the proposed patch is faster. I.e., the values are
>>>>> calculated as follows: (T_original / T_with_patch) - 1, where T is
>>>>> the execution time (wall clock time) of the benchmark.
>>>>> ReservedCodeCacheSize is set to 50m. I used three runs and the
>>>>> arithmetic average to compare the numbers. I know we need much more
>>>>> data; however, I think we can see a trend.
>>>>>
>>>>> The current patch does not trigger a warning that the code cache is
>>>>> full and compilation has been disabled.
>>>>>
>>>>> Please let me know what you think.
>>>>> -------------- next part --------------
>>>>> A non-text attachment was scrubbed...
>>>>> Name: threshold.pdf
>>>>> Type: application/pdf
>>>>> Size: 13321 bytes
>>>>> Desc: not available
>>>>> Url: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20130821/6478f023/threshold.pdf
>>>>> -------------- next part --------------
>>>>> A non-text attachment was scrubbed...
>>>>> Name: performance.pdf
>>>>> Type: application/pdf
>>>>> Size: 14641 bytes
>>>>> Desc: not available
>>>>> Url: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/attachments/20130821/6478f023/performance.pdf