RFR (L): 8230706: Waiting on completion of strong nmethod processing causes long pause times with G1

Fri Sep 27 21:16:30 UTC 2019

Hi Thomas,

> 25 sep. 2019 kl. 12:42 skrev Thomas Schatzl <thomas.schatzl at oracle.com>:
> 
> Hi all,
> 
>  can I have reviews for this change that fixes a regression introduced in jdk11?
> 
> So, currently in g1, in the concurrent start pause we have to mark through all live oops referenced from thread stacks to keep those alive across the concurrent marking in presence of class unloading. These are "strong" nmethods.
> 
> However G1 also scans oops of nmethods as part of the per-region nmethod remembered sets, but should not mark their oops as their classes should remain unloadable. These are "weak" nmethods.
> 
> Regardless of either way of processing, nmethods are claimed, to prevent processing them multiple times as happens often. Which means that if an nmethod is claimed first via the "weak" path, its oops will not be marked properly, and *boom*.
> 
> In order to prevent this, currently there is a hard synchronization, i.e. wait barrrier between strong nmethod processing and the actual evacuation (which ultimately does the weak nmethod processing).
> 
> This is a problem, particularly since jdk11, because stacks are claimed on a per thread basis. I.e. an extremely deep stack could make all 100's or 1000's of your cpu cores wait idly. The "particular since jdk11" part is that since then the amount of work done during nmethod iteration is only bounded by the amount of live objects to copy (i.e. JDK-6672778 added task queue trimming which is generally a good thing), which can take a lot of time.
> 
> In this case, in some internal Cassandra setup we have seen ridiculously long waiting times as a single thread apparently evacuates the whole young gen... :(
> 
> This change moves the wait to the end of the evacuation, remembering any nmethods that were claimed by the weak nmethod processing before the strong nmethod processing got to it.
> 
> Since the amount of nmethods that need to be marked through has always been very low (single digit), that phase took <0.1ms typically. There is some attempt to parallelize this phase based on the number of nmethods (this pass only needs to mark the oops <= TAMS in nmethods, no more) anyway though.
> 
> I will look into merging parallel phases into a single one in the post-evacuation phase soon to get rid of this additional spin-up of threads (only during concurrent mark) again.
> 
> During this work I tried several alternatives that were rejected:
> - disabling task queue trimming; that works, but still has the problem with deep thread stacks
> - moving the actual wait deep into the evacuation code: made a mess with  the code, and still does not really solve the problem
> - instead of remembering nmethods, concurrently process them. Does not work because x86 embedded oops are not word-aligned, so this is not a good idea.
> 
> There is always the option of doing the stack scanning in the concurrent phase: that seemingly requires much more work, e.g. by using method return barriers and has been left as a future enhancement.
> 
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8230706
> Webrev:
> http://cr.openjdk.java.net/~tschatzl/8230706/webrev/

Thanks for hunting down and fixing this problem. The change looks good in general, but I have some comments:
src/hotspot/share/code/nmethod.cpp
—
I would like the bit-operations to be hidden in functions, something like:
  is_claimed(nmethod* nm) { return ((uintptr_t)nm & 0x1) != 0;}
  claim(nmethod* nm) { return (nmethod*)((uintptr_t)nm | 0x1); }
  mask_claimed(nmethod* nm) { return (nmethod*)((uintptr_t)nm & ~0x1); }
The names might be improved, but you get my point.
—

src/hotspot/share/gc/g1/g1CodeBlobClosure.hpp
—
  55   // Do we need to collect nmethods if they have already been claimed.
  56   bool is_strong_iteration() const { return _pss != NULL; }

I’m not sure about this name, especially since it won’t be true for all ”strong” closures, just the ones during initial mark right? Naming is hard, so I’m not sure I have a better suggestion overall, but what do you think about calling this method should_collect_strong_nmethods()?
---

src/hotspot/share/gc/g1/g1CollectedHeap.cpp
—
3845   class G1MarkStrongNMethodsTask : public AbstractGangTask {
3846     G1ParScanThreadStateSet* _per_thread_states;
3847     volatile uint _claim;
3848   public:
3849     G1MarkStrongNMethodsTask(G1ParScanThreadStateSet* per_thread_states)
3850     : AbstractGangTask("Remark strong nmethods"), _per_thread_states(per_thread_states), _claim(0) { }
3851 
3852     void work(uint worker_id) {
3853       G1RemarkStrongNMethodsClosure cl(worker_id);
3854       _per_thread_states->iterate(&cl, &_claim);
3855     }
3856   } cl(per_thread_states);

Move the G1MarkStrongNMethodsTask definition outside the function, it think that is more along the lines what we usually do.
—

Thanks,
Stefan

> Testing:
> hs-tier1-5, many cassandra runs - no more exceptionally long concurrent start pauses :)
> 
> Thanks,
>  Thomas