RFR(M): 8244660: Code cache sweeper heuristics is broken

Mon May 11 13:42:39 UTC 2020

Hi Man,

On 2020-05-09 03:28, Man Cao wrote:
> Hi Nils,
>
> Thanks for fixing this so quickly, and simplifying the logic!
>
> Some high-level questions and suggestions:
> *1. Sweep frequency *
> With this new approach, is the sweeping expected to be less
> frequent than the current approach, or more frequent? It looks more
> frequent to me.

It should be a lot less frequent than when StartAggressiveSweep
kicked in because of the bug.
I hope that it is about the same as before. The byte threshold is
lower, but there are no longer any safepoints that can trigger
additional sweeps.

I am open to adjusting the threshold until a good balance is achieved.

>
> If I understand correctly, for the current approach, the conditions to
> sweep are:
> - Bytes from make_not_entrant_or_zombie() and make_unloaded() reach 1/100
> of ReservedCodeCacheSize
> OR
> - Heuristics based on ReservedCodeCacheSize/16M, _time_counter
> and _last_sweep.
> I suppose the second condition is the "Number of safepoints with stack
> scans" condition you mentioned,
> which is not currently triggered.
>
> With the new approach, the condition is:
> - Bytes from make_not_entrant_or_zombie() and make_unloaded()
> reach SweeperThreshold (1/256 of ReservedCodeCacheSize)
You understand it correctly. I also capped the threshold at 1 MB,
because a user running with a large reserved code cache still
wants to clean out L3 code during startup.
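
To make the arithmetic concrete, here is a minimal sketch of the threshold
and the trigger as described above. It is not the code from the patch: the
names default_sweep_threshold and SweepTrigger are hypothetical, and
resetting the counter after a sweep is an assumption.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical names for illustration; this is not HotSpot's actual code.
constexpr std::size_t kMB = 1024 * 1024;

// Default threshold: 1/256 of the reserved code cache, capped at 1 MB.
constexpr std::size_t default_sweep_threshold(std::size_t reserved_code_cache_size) {
    return std::min(reserved_code_cache_size / 256, kMB);
}

// Accumulates the bytes of nmethods that changed state (made not-entrant,
// zombie, or unloaded) and reports when a sweep is due.
class SweepTrigger {
public:
    explicit SweepTrigger(std::size_t threshold) : threshold_(threshold) {}

    // Returns true once the accumulated bytes reach the threshold.
    // Assumption: the counter starts over after each requested sweep.
    bool report_state_change(std::size_t nmethod_bytes) {
        pending_bytes_ += nmethod_bytes;
        if (pending_bytes_ < threshold_) {
            return false;
        }
        pending_bytes_ = 0;
        return true;
    }

private:
    std::size_t threshold_;
    std::size_t pending_bytes_ = 0;
};
```

With the default 240 MB code cache this gives 240 MB / 256 = 960 KB, just
under the cap. A 1 GB code cache would give 4 MB uncapped, so the cap
brings it down to 1 MB.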
>
> Is it better to make SweeperThreshold default to 1/100
> of ReservedCodeCacheSize like before?
I think 1/100 of a 240 MB default code cache seems a bit high. During
startup we produce a lot of L3 code that will be thrown away. We want to
recycle it fairly quickly, to avoid fragmenting the code cache, but not
so often that we affect startup.

I've done some startup measurements. With the new threshold we sweep
about every other second in a benchmark that produces a lot of code.

What results are you seeing?

> Also, could it be a percentage instead of a byte-size value?
> In our experience, a percentage value is also easier to maintain for
> production users.
SweeperThreshold could absolutely be a percentage. I will change that.
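
A percentage-based flag could be converted to bytes roughly as follows.
This is only a sketch under assumptions: the function name is hypothetical,
the flag is taken to be a floating-point percentage of
ReservedCodeCacheSize, and the 1 MB cap discussed earlier is left out.

```cpp
#include <cstddef>

// Hypothetical helper; not HotSpot's actual code.
// Converts a threshold given as a percentage of the reserved code cache
// into a byte count, truncating any fractional byte.
constexpr std::size_t sweep_threshold_bytes(std::size_t reserved_code_cache_size,
                                            double threshold_percent) {
    return static_cast<std::size_t>(
        static_cast<double>(reserved_code_cache_size) * threshold_percent / 100.0);
}
```

For example, 0.5 percent of a 240 MB code cache is about 1.2 MB, and
1 percent of a 100 MB code cache is exactly 1 MB.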
>
> We also would like to reduce the default sweep frequency, especially for
> -XX:-TieredCompilation.
> Because in JDK11, we have seen the higher sweep frequency caused regression
> compared to JDK8,
> and turning off code cache flushing could significantly improve performance.
Code cache flushing has another heuristic - it might be broken too. But
it would be interesting to see how it works with the new sweep
heuristic. If you know that you have enough code cache, turning it off
is no loss. It only helps when you are running out of code cache.

>
> *2. Sweep and make non-entrant*
>> The threshold is capped at 1M because even if you have an enormous code
>> cache - you don't want to fragment it, and you probably don't want to
>> commit more than needed.
> It is possible that sweeping will deoptimize some cold nmethods that will
> be used soon.
> Such deoptimizations could hurt performance more than fragmenting the code
> cache.
When we are doing normal sweeping, we don't deoptimize cold code. That
is handled by the method flushing, which should only kick in when we
start to run out of code cache.

> Taking a closer look, perhaps the root problem is not just the sweep
> frequency itself, but coupled
> with the logic in NMethodSweeper::possibly_flush() to determine when to
> make an nmethod not-entrant.
> Perhaps the two flags NmethodSweepActivity and MinPassesBeforeFlush could
> be adjusted
> accordingly to the higher sweep frequency, to make JVM deoptimize fewer
> cold but usable nmethods.
>
> Do you think I should open a CR to investigate changing the default values
> of these flags later?
> It would be better if we could deprecate one of these two flags if they
> serve the same purpose.
I think we should address MethodFlushing in a separate RFE/BUG.

The heuristics for CodeAging may have been negatively affected by the 
transition to handshakes. Also the SetHotnessClosure should be replaced 
by a mechanism using the NMethodEntry barriers.

I see that we are missing JFR events for MethodFlushing. I have created 
another patch for that.

Best regards,
Nils Eliasson

> -Man