RFR(XXS): 8244278: Excessive code cache flushes and sweeps

Wed May 6 17:12:26 UTC 2020

After posting this I noticed another problem. The sweeper should wake up 
regularly, but now it is only awakened when hitting the SweepAggressive 
threshold. This is wrong.

I suggest holding of the fix until all the problems have been ironed out.

Best regards,
Nils Eliasson

On 2020-05-06 11:52, Nils Eliasson wrote:
>
>
> On 2020-05-05 20:57, Man Cao wrote:
>> Hi Laurent, Nils,
>>
>> > Using 40mb code cache looks quite small compared to default values 
>> (256mb ?).
>> The 40MB code cache size is for -XX:-TieredCompilation, for the 
>> DaCapo runs.
>> "java -XX:-TieredCompilation -XX:+PrintFlagsFinal |& grep 
>> ReservedCodeCacheSize" shows the default size is 48MB for 
>> -TieredCompilation.
>> So 40MB is not particularly small compared to 48MB.
>>
>> The Blaze runs use "-XX:+TieredCompilation" and a Google-default code 
>> cache size of 480MB (we have doubled the default size).
>> This actually tests a case that is closer to the OpenJDK default.
>>
>> > Could you run 1 quick experiment with default code cache settings 
>> to see if there is no regression in the main use case?
>> Yes, I just launched an experiment running DaCapo at JDK source tip, 
>> only with "-Xms4g -Xmx4g" and some logging flags.
>> This is a meaningful experiment as it uses the real OpenJDK default 
>> flags, and the most up-to-date source.
>>
>> > Why do you expect code cache usage would increase a lot? The sweeper
>> > still wakes up regularly and cleans the code cache. The code path 
>> fixed
>> > is just about sweeping extra aggressively under some circumstances. 
>> Some
>> > nmethod might live a little longer, but they will still be cleaned.
>> I found the aggressive sweeping deoptimized a non-trivial number of 
>> nmethods, which kept the usage low.
>> After my bugfix, the sweeper only wakes up when the usage reaches 
>> (100-StartAggressiveSweepingAt)% of code cache capacity, which 
>> default to 90%.
>> This means many nmethods will not be cleaned until there is pressure 
>> in code cache usage.
>>
>> > One number you could add to your benchmark numbers is the number of
>> > nmethods reclaimed and code cache usage. I expect both to remain 
>> the same.
>> The benchmark result htmls already show the code cache usage (Code 
>> Cache Used (MB)).
>> You can CTRL-F for "Code Cache" in the browser.
>>
>> There is a significant increase in code cache usage:
>> For DaCapo, up to 13.5MB (for tradesoap) increase for the 40MB 
>> ReservedCodeCacheSize.
>> For Blaze, the increase is 33MB-80MB for the 
>> 480MB ReservedCodeCacheSize.
>> The code cache usage metric is measure like this (we added an 
>> hsperfdata counter for it), at the end of a benchmark run:
>>   size_t result = 0;
>>   FOR_ALL_ALLOCABLE_HEAPS(heap) {
>>     result += (*heap)->allocated_capacity();
>>   }
>>   _code_cache_used_size->set_value(result);
>>
>> I also looked at the logs, it shows that the bugfix eliminated almost 
>> all of the code cache flushes. They also contain the number of 
>> nmethods reclaimed.
>> E.g., for tradesoap:
>>
>> Without the fix:
>> Code cache sweeper statistics:
>>    Total sweep time:                1902 ms
>>    Total number of full sweeps:     23301
>>    Total number of flushed methods: 7080 (thereof 7080 C2 methods)
>>    Total size of flushed methods:   20468 kB
>> With the fix:
>> Code cache sweeper statistics:
>>    Total sweep time:                0 ms
>>    Total number of full sweeps:     0
>>    Total number of flushed methods: 0 (thereof 0 C2 methods)
>>    Total size of flushed methods:   0 kB
>>
>
> Interesting results and a clear indication that something is broken in 
> the sweeper heuristics - Nmethods should still be flushed!
>
> Looking at sweeper.cpp I see something that looks wrong. The 
> _last_sweep counter is updated even if no sweep was done. In low code 
> cache usage scenarios that means will might never reach the threshold.
>
> I think the curly brace should be moved done like this:
>
> --- a/src/hotspot/share/runtime/sweeper.cpp     Tue May 05 21:28:46 
> 2020 +0200
> +++ b/src/hotspot/share/runtime/sweeper.cpp     Wed May 06 11:46:01 
> 2020 +0200
> @@ -445,16 +445,17 @@
>    if (_should_sweep || forced) {
>      init_sweeper_log();
>      sweep_code_cache();
> +
> +    // We are done with sweeping the code cache once.
> +    _total_nof_code_cache_sweeps++;
> +    _last_sweep = _time_counter;
> +    // Reset flag; temporarily disables sweeper
> +    _should_sweep = false;
> +    // If there was enough state change, 'possibly_enable_sweeper()'
> +    // sets '_should_sweep' to true
> +    possibly_enable_sweeper();
>    }
>
> -  // We are done with sweeping the code cache once.
> -  _total_nof_code_cache_sweeps++;
> -  _last_sweep = _time_counter;
> -  // Reset flag; temporarily disables sweeper
> -  _should_sweep = false;
> -  // If there was enough state change, 'possibly_enable_sweeper()'
> -  // sets '_should_sweep' to true
> -  possibly_enable_sweeper();
>    // Reset _bytes_changed only if there was enough state change. 
> _bytes_changed
>    // can further increase by calls to 'report_state_change'.
>    if (_should_sweep) {
>
> Can you try it out and see if things improve?
>
> Best regards,
> Nils Eliasson
>
>
>> Anyway, just to reiterate, we think the improvement in throughput and 
>> CPU usage is well worth the increase in code cache usage.
>> If the increase causes any problem, we would advise users to accept 
>> the increase and fully provision memory for the 
>> entire ReservedCodeCacheSize.
>>
>> -Man
>>
>>
>> On Tue, May 5, 2020 at 3:18 AM Nils Eliasson 
>> <nils.eliasson at oracle.com <mailto:nils.eliasson at oracle.com>> wrote:
>>
>>     Hi Man,
>>
>>     Why do you expect code cache usage would increase a lot? The sweeper
>>     still wakes up regularly and cleans the code cache. The code path
>>     fixed
>>     is just about sweeping extra aggressively under some
>>     circumstances. Some
>>     nmethod might live a little longer, but they will still be cleaned.
>>
>>     Without your bugfix the sweeper will be notified for every new
>>     allocation in the codecache as soon as code cache usages has gone
>>     beyond
>>     10%. That could in the worst case be one sweep for every allocation.
>>
>>     One number you could add to your benchmark numbers is the number of
>>     nmethods reclaimed and code cache usage. I expect both to remain
>>     the same.
>>
>>     Best regards,
>>     Nils
>>
>>
>>
>>     On 2020-05-04 21:21, Man Cao wrote:
>>     > Hi,
>>     >
>>     > Thanks for the review!
>>     > Yes, the code change is trivial, but runtime behavior change is
>>     > considerable.
>>     > In particular, throughput and CPU usage could improve, but code
>>     cache usage
>>     > could increase a lot. In our experience, the improvement in
>>     throughput and
>>     > CPU is well worth the code cache increase.
>>     >
>>     > I have attached some benchmarking results in JBS. They are based
>>     on JDK11.
>>     > We have not rolled out this fix to our production JDK11 yet, as
>>     I'd like to
>>     > confirm that this large change in runtime behavior is OK with
>>     the OpenJDK
>>     > community.
>>     > We are happy to share some performance numbers from production
>>     workload
>>     > once we have them.
>>     >
>>     > -Man
>>     >
>>     >
>>     > On Mon, May 4, 2020 at 4:11 AM Laurent Bourgès
>>     <bourges.laurent at gmail.com <mailto:bourges.laurent at gmail.com>>
>>     > wrote:
>>     >
>>     >> Hi,
>>     >>
>>     >> Do you have performance results to justify your assumption "could
>>     >> significantly improve performance" ?
>>     >>
>>     >> Please share numbers in the jbs bug
>>     >>
>>     >> Laurent
>>     >>
>>     >> Le sam. 2 mai 2020 à 07:35, Man Cao <manc at google.com
>>     <mailto:manc at google.com>> a écrit :
>>     >>
>>     >>> Hi all,
>>     >>>
>>     >>> Can I have reviews for this one-line change that fixes a bug
>>     and could
>>     >>> significantly improve performance?
>>     >>> Webrev: https://cr.openjdk.java.net/~manc/8244278/webrev.00/
>>     >>> Bug: https://bugs.openjdk.java.net/browse/JDK-8244278
>>     >>>
>>     >>> It passes tier-1 tests locally, as well as
>>     >>> "vm/mlvm/meth/stress/compiler/deoptimize" (for the original
>>     JDK-8046809).
>>     >>>
>>     >>> -Man
>>     >>>
>>
>