RFR: 8209189: Make CompiledMethod::do_unloading more concurrent

Vladimir Kozlov vladimir.kozlov at oracle.com
Fri Oct 26 15:15:31 UTC 2018


Thank you, Erik, for data.

Looks good. And I am fine with the virtual call in is_unloading(), since you will have another variant for ZGC - I 
thought there was only one.

Thanks,
Vladimir

On 10/26/18 5:20 AM, Erik Österlund wrote:
> Hi Vladimir,
> 
> Thank you for having a look at these changes.
> 
> On 2018-10-25 19:34, Vladimir Kozlov wrote:
>> Hi Erik,
>>
>> First, what testing did you do? This is a significant rewrite and requires a lot of testing. I don't see a link to test 
>> results in the RFE. Any performance testing?
> 
> In terms of testing, I have run this through tier 1-6 in mach 5 with no failures related to my changes.
> 
> As far as performance testing goes, I paid some extra attention to the G1 parallel case, as it is the most 
> performance-sensitive one.
> I ran through DaCapo2009, SPECjvm98, SPECjvm2008, SPECjbb2000, SPECjbb2005 and SPECjbb2015 with a fairly constrained heap 
> size to cause a lot of old GCs. I recorded the time taken for the parallel class unloading and cleanup task, which is 
> where my changes would affect things.
> 
> Here is a summary of the results:
> new DaCapo2009:
> mean    19.46   min     4.97    max     41.94   std     10.74
> old DaCapo2009:
> mean    20.22   min     6.54    max     42.55   std     10.41
> 
> new SPECjvm98:
> mean    2.4     min     0.34    max     6.43    std     1.66
> old SPECjvm98:
> mean    3.18    min     0.88    max     8.15    std     2.01
> 
> new SPECjvm2008:
> mean    374.75  min     69.72   max     477.76  std     73.42
> old SPECjvm2008:
> mean    345.14  min     84.85   max     496.55  std     101.24
> 
> new SPECjbb2000:
> mean    354.69  min     0.61    max     482.97  std     87.87
> old SPECjbb2000:
> mean    360.56  min     0.56    max     482.86  std     85.83
> 
> new SPECjbb2005:
> mean    174.69  min     0.8     max     242.71  std     36.37
> old SPECjbb2005:
> mean    174.66  min     0.74    max     242.76  std     36.82
> 
> new SPECjbb2015:
> mean    2.71    min     2.38    max     3.05    std     0.47
> old SPECjbb2015:
> mean    4.05    min     3.52    max     4.58    std     0.75
> 
> As we can see, my approach seems to perform mostly "a bit better". The odd one out where I don't perform better looks 
> like SPECjvm2008, but that is also the benchmark where I see the highest standard deviations, so it might just be 
> noise. The reason I think I get overall better results is that I always do everything in one pass over the code heap, 
> instead of two passes with an intrusive wait barrier separating the two parallel phases. And, as a bonus, when you do it 
> this way, for each nmethod you first walk the oop reloc entries to figure out whether the nmethod is alive, and then 
> immediately clean the same nmethod, walking the same reloc entries again to find IC caches, while the caches are warm. 
> That improved data locality gives this approach an edge. And the bigger the code cache, the bigger that advantage becomes.
> 
> I have not looked with the same level of detail at Serial GC and Parallel GC, because they seem strictly less interesting 
> (when you do a full STW GC, the unloading time is far from the biggest problem, which is why it's done with one 
> thread). However, FWIW, the same improved data locality should be beneficial there too, by again doing everything in one 
> single pass over the code cache instead of two.
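As a rough sketch of the one-pass structure described above (all names are invented for illustration, not the actual HotSpot code): liveness is decided and cleanup performed for each nmethod before moving on to the next, instead of two parallel phases over the whole code heap separated by a wait barrier.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for an nmethod; in reality, liveness is decided
// by walking the oop relocation entries.
struct NMethodSketch {
    bool broken_oops;        // would be computed by the first reloc walk
    bool unloaded    = false;
    bool ics_cleaned = false;
};

// One pass over the code heap: decide liveness and immediately act on it,
// so the second reloc walk happens while the entries are still cache-warm.
void unload_and_clean_one_pass(std::vector<NMethodSketch>& code_heap) {
    for (NMethodSketch& nm : code_heap) {
        if (nm.broken_oops) {
            nm.unloaded = true;      // dead: make it unloaded right away
        } else {
            nm.ics_cleaned = true;   // alive: clean ICs in the same visit
        }
    }
}
```

The point of the sketch is only the loop shape: all work for one nmethod completes before the next nmethod is touched.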
> 
>> It seems you removed the only case (in do_unloading()) where CodeCache::do_unloading_nmethod_caches() is called. I 
>> can't find any other place where it is called after your change. Why keep it then?
> 
> Good point. Removed.
> 
>> It seems CompiledMethod::is_unloading() is performance critical. Can we make 
>> IsUnloadingBehaviour::current()->is_unloading(this) a non-virtual call? I see only one subclass of 
>> IsUnloadingBehaviour, ClosureIsUnloadingBehaviour.
> 
> Yes, is_unloading() does get pretty hot indeed. However, after it has been called once on an nmethod, the result of the 
> computation is cached. So the virtual call you point out will only be made once per nmethod, and then the cached value 
> will be used instead. And I need to override that virtual call with ZGC, as it will be called outside of safepoints, 
> and we have to use locking to guarantee the safety of computing the value, especially since the immediate oops in the 
> code stream are misaligned and will get patched concurrently by nmethod entry barriers.
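As a rough sketch of the caching pattern described above (invented names, not the actual HotSpot declarations): the virtual query is only made on a cache miss, i.e. once per nmethod per unloading epoch, so its cost is negligible on the hot path.

```cpp
#include <cassert>
#include <cstdint>

struct CompiledMethodSketch;

// Stand-in for the behaviour interface; a GC like ZGC would install
// its own subclass with a different is_unloading() computation.
struct IsUnloadingBehaviourSketch {
    virtual ~IsUnloadingBehaviourSketch() = default;
    virtual bool is_unloading(const CompiledMethodSketch* cm) const = 0;
};

struct CompiledMethodSketch {
    static uint8_t _global_epoch;                    // bumped per GC unloading pass
    static IsUnloadingBehaviourSketch* _behaviour;   // installed by the GC

    uint8_t _cached_epoch        = 0;                // 0 means "no cached value yet"
    bool    _cached_is_unloading = false;

    bool is_unloading() {
        if (_cached_epoch == _global_epoch) {
            return _cached_is_unloading;             // fast path: no virtual call
        }
        // Slow path, taken once per epoch: the GC-specific virtual call.
        _cached_is_unloading = _behaviour->is_unloading(this);
        _cached_epoch = _global_epoch;
        return _cached_is_unloading;
    }
};

uint8_t CompiledMethodSketch::_global_epoch = 1;
IsUnloadingBehaviourSketch* CompiledMethodSketch::_behaviour = nullptr;

// A counting behaviour, just to demonstrate how rarely the virtual call fires.
struct CountingBehaviour : IsUnloadingBehaviourSketch {
    mutable int calls = 0;
    bool result;
    explicit CountingBehaviour(bool r) : result(r) {}
    bool is_unloading(const CompiledMethodSketch*) const override {
        ++calls;
        return result;
    }
};
```

Bumping the epoch invalidates every cached value at once, without touching each nmethod; the real code additionally has to make the computation safe outside safepoints, which the sketch ignores.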
> 
> Thanks,
> /Erik
> 
>> Thanks,
>> Vladimir
>>
>> On 10/19/18 7:22 AM, Erik Österlund wrote:
>>> Hi,
>>>
>>> Today, nmethods are unloaded either serially or in parallel due to GC (and e.g. class unloading). With ZGC's 
>>> concurrent class unloading, a concurrent fashion is required. Rather than inventing yet a third way of unloading 
>>> nmethods due to class unloading, it would be ideal if there could be one unified way of doing it that makes everyone 
>>> happy.
>>>
>>> To solve this problem in a more general way, a new member function on CompiledMethod is added: is_unloading(). It 
>>> returns whether a CompiledMethod has broken oops. In order not to have to iterate over all the oops every time this 
>>> question is asked, it caches the result, so it needs to be computed only once per "epoch". Every time a CodeCache 
>>> unloading is triggered by the GC, a new epoch is started, which forces the value to be recomputed the first time 
>>> is_unloading() is called on each CompiledMethod.
>>>
>>> So by having is_unloading() be lazily evaluated, it is now possible to build a do_unloading() method on nmethod that 
>>> simply makes the nmethod unloaded if it is_unloading(), and otherwise cleans its various caches. Cleaning e.g. the 
>>> inline caches of said nmethod uses the same is_unloading() method on their targets to figure out whether the IC cache 
>>> should be cleaned or not. Again, the epoched caching mechanism makes sure we don't recompute this value.
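As a rough sketch of that shape (invented names, not the actual HotSpot code): do_unloading() either makes the nmethod unloaded or cleans its caches, and IC cleaning asks the same is_unloading() question of each call target (the sketch computes it directly; in reality the answer is epoch-cached as described above).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct NMethodWithICs {
    bool broken_oops = false;                 // what is_unloading() would compute
    bool unloaded    = false;
    std::vector<NMethodWithICs*> ic_targets;  // stand-in for inline-cache entries

    bool is_unloading() const { return broken_oops; }  // epoch-cached in reality

    void do_unloading() {
        if (is_unloading()) {
            unloaded = true;                  // broken oops: unload the nmethod
            return;
        }
        // Alive: clean inline caches whose targets are themselves unloading.
        for (std::size_t i = 0; i < ic_targets.size(); ) {
            if (ic_targets[i]->is_unloading()) {
                ic_targets.erase(ic_targets.begin() +
                                 static_cast<std::ptrdiff_t>(i));
            } else {
                ++i;
            }
        }
    }
};
```

Because callers and callees consult the same cached answer, the decision is consistent across the whole pass, regardless of the order in which nmethods are visited.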
>>>
>>> The new do_unloading() method may be called in serial, parallel and concurrent contexts. So I removed the 
>>> parallel variation of this that we had before, which unnecessarily postponed the unloading because it had not yet 
>>> been computed whether the nmethod should be unloaded or not. Since that is now computed lazily on demand, there is no 
>>> longer a need to postpone this, nor to have two phases for parallel nmethod unloading.
>>>
>>> While there is more work involved in making concurrent nmethod unloading work, this is a good cleanup towards that goal.
>>>
>>> Bug ID:
>>> https://bugs.openjdk.java.net/browse/JDK-8209189
>>>
>>> Webrev:
>>> cr.openjdk.java.net/~eosterlund/8209189/webrev.00/
>>>
>>> Thanks,
>>> /Erik
> 


More information about the hotspot-dev mailing list