RFR: 8290025: Remove the Sweeper [v6]

Sat Aug 6 00:47:27 UTC 2022

On Fri, 5 Aug 2022 10:10:28 GMT, Erik Österlund <eosterlund at openjdk.org> wrote:

>> When the world was still young, the sweeper was built to unload bad smelling nmethods. While it has been going through various revisions, the GCs got support for class unloading, and the need for the GCs to get rid of nmethods with a different unpleasant scent. 
>> 
>> The two systems would now compete for unloading nmethods, and the responsibility of throwing away nmethods would blur. The sweeper was still good at throwing away nmethods faster as it only needs to scan stacks, and not do a full GC. 
>> 
>> With the advent of Loom, the situation has gotten even worse. The stacks are now also in the Java heap. The sweeper is unable to throw away nmethods without the liveness analysis of a full GC, which also performs code cache unloading, but isn't allowed to actually delete nmethods due to races with the sweeper. In a way we have the worst of both worlds, where both the sweeper and GC are crippled, unable to unload nmethods without the help of the other. And there are a very large number of complicated races that the JVM needs to deal with, especially with concurrent code cache unloading not interfering with concurrent sweeping. And concurrent sweeping not interfering with the application. 
>> 
>> The sweeper cycle exposes 2 flavours of nmethods that are "dead" to the system. So whenever nmethods are used, we have to know they are not dead. But we typically don't have the tools to really know they are not dead. For example, one might think grabbing the CodeCache_lock and using an iterator that only walks is_alive() nmethods would help make sure you don't get dead nmethods in your iterator. However, that is not the case, because the CodeCache_lock can't be held across the entire zombie transition due to "reasons" that are not trivial to actually change. Because of this, code has to deal with nmethods flipping around randomly to a dead state. 
>> 
>> I propose to get out of this sad situation, by removing the sweeper. If we need a full GC anyway to remove nmethods, we might as well let the GC do everything. This removes the notion of is_zombie(), is_unloaded() and hence is_alive() from the JVM. It also removes the notion of the orthogonal but related nmethodLocker to keep nmethods around, without preventing them from dying. We instead throw away nmethods the way we throw away pretty much anything else in the unloading GC code: 
>> 1. Unlink 
>> 2. Global sync 
>> 3. Throw away 
>> 4. Profit! 
>> This way, if you get a reference to an nmethod, it won't go away until the next safepoint poll, and will not flip around liveness due to concurrent transitions.
>> 
>> In the new model, we use nmethod entry barriers to keep track of the last time an nmethod was on-stack. This is then used to 1) prove that not_entrant nmethods that haven't been on-stack for an entire GC can be removed, and 2) heuristically remove nmethods that have never been called for N full GCs, where N is calculated based on code cache allocation rate, GC frequency, remaining free memory until "trouble", etc. Similar to metaspace, there is also some threshold GC trigger to start GC when the code cache is filling up, and nothing else is triggering full GCs. The threshold gets smaller as we approach a point of being uncomfortably close to code cache exhaustion. Past said point, we GC very aggressively, and you probably want a larger code cache.
>> 
>> I have tested this in mach5 tier1-7, I have run through perf aurora with no regressions, and also run an "internal large application" to see how it scales, also with no regressions. Since testing tier1-7 a few small tweaks have been made so I am running some extra testing.
>> 
>> I have tried to be as compatible as possible to previous sweeping related JVM flags, arguing that nothing in the flags implies whether the implementation is using a GC or a separate sweeper thread. However, the UseCodeAging flag I have obsoleted, as UseCodeCacheFlushing is the flag for deciding cold nmethods should be removed, and with the new mechanism for doing that, there is no need for UseCodeAging flag as well.
>
> Erik Österlund has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains six commits:
> 
>  - Merge branch 'master' into 8290025_remove_sweeper
>  - Removed more stale things
>  - Remove incorrect comment
>  - Feedback v2
>  - Feedback v1
>  - 8290025: Remove the Sweeper

Wow, this is great.

src/hotspot/share/code/codeCache.cpp line 984:

> 982:     nm->flush();
> 983:     if (next == nm) {
> 984:       // Self looped means end of list

Why not nullptr for the end of the list?

src/hotspot/share/code/codeCache.cpp line 1312:

> 1310:       CompiledMethod* cm = old_compiled_method_table->at(i);
> 1311:       // Only walk !is_unloading nmethods, the other ones will get removed by the GC.
> 1312:       if (!cm->is_unloading()) {

I thought we were going to walk the unloading methods?  Or should that be another bug fix?  It's unlikely to ever find a reproducer so I wouldn't mind if it was fixed with this giant patch.

src/hotspot/share/code/nmethod.cpp line 761:

> 759:     _comp_level              = comp_level;
> 760:     _orig_pc_offset          = orig_pc_offset;
> 761:     _gc_epoch                = CodeCache::gc_epoch();

Good to move this.

src/hotspot/share/code/nmethod.cpp line 1316:

> 1314: 
> 1315:     // Change state
> 1316:     bool success = try_transition(not_entrant);

I suppose in a further RFE, you could rename try_transition or do the transition inline.  Does it still CAS?

src/hotspot/share/code/nmethod.cpp line 1712:

> 1710:     mark_as_seen_on_stack();
> 1711:   }
> 1712: 

Thank you! It's nice to see this code which still may have bugs removed.

src/hotspot/share/code/nmethod.hpp line 268:

> 266: 
> 267:   // Protected by CompiledMethod_lock
> 268:   volatile signed char _state;         // {not_installed, in_use, not_entrant}

There's a fourth state: not_used which you might as well add to the comment ?

src/hotspot/share/oops/constantPool.cpp line 2222:

> 2220:   }
> 2221: 
> 2222:   // If the condition below is true, it means that the nmethod was found to

I think this comment should say Method and not nmethod.  I think this was a cut and paste error.  I assume that the is_maybe_on_stack is generalizing "stack" meaning on the execution OR continuation stack ?

-------------

Marked as reviewed by coleenp (Reviewer).

PR: https://git.openjdk.org/jdk/pull/9741