RFR: 8209189: Make CompiledMethod::do_unloading more concurrent
Erik Österlund
erik.osterlund at oracle.com
Fri Oct 26 15:18:07 UTC 2018
Hi Vladimir,
Thank you for the review!
/Erik
On 2018-10-26 17:15, Vladimir Kozlov wrote:
> Thank you, Erik, for data.
>
> Looks good. And I am fine with the virtual call in is_unloading() since
> you will have another variant for ZGC - I thought there was only one.
>
> Thanks,
> Vladimir
>
> On 10/26/18 5:20 AM, Erik Österlund wrote:
>> Hi Vladimir,
>>
>> Thank you for having a look at these changes.
>>
>> On 2018-10-25 19:34, Vladimir Kozlov wrote:
>>> Hi Erik,
>>>
>>> First, what testing did you do? This is a significant rewrite and
>>> requires a lot of testing. I don't see a link to test results in the
>>> RFE. Any performance testing?
>>
>> In terms of testing, I have run this through tier 1-6 in mach 5 with
>> no failures related to my changes.
>>
>> As far as performance testing goes, I paid some extra attention to
>> the G1 parallel case, as it is the most performance-sensitive. I ran
>> DaCapo2009, SPECjvm98, SPECjvm2008, SPECjbb2000, SPECjbb2005 and
>> SPECjbb2015 with a fairly constrained heap size to cause a lot of
>> old GCs, and recorded the time taken for the parallel class
>> unloading and cleanup task, which is where my changes would affect
>> things.
>>
>> Here is a summary of the results:
>> new DaCapo2009:
>> mean 19.46 min 4.97 max 41.94 std 10.74
>> old DaCapo2009:
>> mean 20.22 min 6.54 max 42.55 std 10.41
>>
>> new SPECjvm98:
>> mean 2.4 min 0.34 max 6.43 std 1.66
>> old SPECjvm98:
>> mean 3.18 min 0.88 max 8.15 std 2.01
>>
>> new SPECjvm2008:
>> mean 374.75 min 69.72 max 477.76 std 73.42
>> old SPECjvm2008:
>> mean 345.14 min 84.85 max 496.55 std 101.24
>>
>> new SPECjbb2000:
>> mean 354.69 min 0.61 max 482.97 std 87.87
>> old SPECjbb2000:
>> mean 360.56 min 0.56 max 482.86 std 85.83
>>
>> new SPECjbb2005:
>> mean 174.69 min 0.8 max 242.71 std 36.37
>> old SPECjbb2005:
>> mean 174.66 min 0.74 max 242.76 std 36.82
>>
>> new SPECjbb2015:
>> mean 2.71 min 2.38 max 3.05 std 0.47
>> old SPECjbb2015:
>> mean 4.05 min 3.52 max 4.58 std 0.75
>>
>> As we can see, my approach mostly performs a bit better. The odd one
>> out where it does not is SPECjvm2008, but that is also the benchmark
>> with the highest standard deviations, so the difference might just
>> be noise. The reason I think I get better results overall is that I
>> always do everything in one pass over the code heap, instead of two
>> parallel phases separated by an intrusive wait barrier. As a bonus,
>> when you do it this way, for each nmethod you first walk the oop
>> reloc entries to figure out whether the nmethod is alive, and then
>> immediately clean that same nmethod, walking the same reloc entries
>> again to find the IC caches, while the caches are still warm. That
>> improved data locality gives this approach an edge, and the bigger
>> the code cache, the bigger that advantage becomes.
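>>
>> The one-pass structure can be sketched roughly like this (not
>> HotSpot code; the Nmethod struct and flags here are made up for
>> illustration):

```cpp
#include <vector>

// Illustrative sketch of the single-pass structure described above:
// for each nmethod, liveness is decided and the caches are cleaned in
// the same visit, while its relocation data is still warm.
struct Nmethod {
  bool has_broken_oops;    // in HotSpot, computed by walking the oop relocs
  bool unloaded = false;
  bool ics_cleaned = false;
};

// One walk over the code heap; no second phase and no wait barrier.
void do_unloading_one_pass(std::vector<Nmethod>& code_heap) {
  for (Nmethod& nm : code_heap) {
    if (nm.has_broken_oops) {
      nm.unloaded = true;     // stands in for make_unloaded()
    } else {
      nm.ics_cleaned = true;  // clean IC caches immediately, same visit
    }
  }
}
```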
>>
>> I have not looked at SerialGC and ParallelGC with the same level of
>> detail, because they seem strictly less interesting (when you do a
>> full STW GC, the unloading time is far from the biggest problem,
>> which is why it is done with a single thread). However, FWIW, the
>> same improved data locality should be beneficial there too, again by
>> doing everything in one single pass over the code cache instead of
>> two.
>>
>>> It seems you removed the only case (in do_unloading()) where
>>> CodeCache::do_unloading_nmethod_caches() is called. I can't find
>>> other place where it is called after your change. Why keep it then?
>>
>> Good point. Removed.
>>
>>> It seems CompiledMethod::is_unloading() is performance critical. Can
>>> we make IsUnloadingBehaviour::current()->is_unloading(this) a
>>> non-virtual call? I see only the ClosureIsUnloadingBehaviour
>>> subclass of IsUnloadingBehaviour.
>>
>> Yes, is_unloading() does get pretty hot. However, after it has been
>> called once on an nmethod, the result of the computation is cached,
>> so the virtual call you point out is made only once per nmethod, and
>> the cached value is used after that. And I need to override that
>> virtual call for ZGC, as it will be called outside of safepoints,
>> and we have to use locking to guarantee the safety of computing the
>> value, especially since the immediate oops in the code stream are
>> misaligned and will get patched concurrently by nmethod entry
>> barriers.
>>
>> Thanks,
>> /Erik
>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 10/19/18 7:22 AM, Erik Österlund wrote:
>>>> Hi,
>>>>
>>>> Today, nmethods are unloaded either serially or in parallel due to
>>>> GC (and e.g. class unloading). With ZGC's concurrent class
>>>> unloading, a concurrent way of unloading is required. Rather than
>>>> inventing yet a third way of unloading nmethods due to class
>>>> unloading, it would be ideal to have one unified way of doing it
>>>> that makes everyone happy.
>>>>
>>>> To solve this problem in a more general way, a new member function
>>>> on CompiledMethod is added: is_unloading(). It returns whether a
>>>> CompiledMethod has broken oops. In order to not have to iterate
>>>> over all the oops every time this question is asked, it caches the
>>>> result, so it needs to be computed only once per "epoch". Every
>>>> time a CodeCache unloading is triggered by the GC, a new epoch is
>>>> started, which forces re-computation of whether the CompiledMethod
>>>> is_unloading() the first time it is called.
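>>>>
>>>> A rough sketch of that epoch-based caching (names, fields and the
>>>> state layout are made up for illustration; the real CompiledMethod
>>>> stores this state differently):

```cpp
#include <cstdint>

uint32_t g_unloading_epoch = 1;  // bumped once per GC unloading cycle
int g_compute_calls = 0;         // counts the expensive recomputations

struct CompiledMethodSketch {
  uint32_t _cached_epoch = 0;    // epoch the cached result belongs to
  bool _cached_result = false;
  bool _has_broken_oops;         // stand-in for walking the oop relocations

  explicit CompiledMethodSketch(bool broken) : _has_broken_oops(broken) {}

  // The expensive part: in HotSpot this walks the nmethod's oops.
  bool compute_is_unloading() {
    ++g_compute_calls;
    return _has_broken_oops;
  }

  bool is_unloading() {
    if (_cached_epoch != g_unloading_epoch) {  // stale or never computed
      _cached_result = compute_is_unloading();
      _cached_epoch = g_unloading_epoch;
    }
    return _cached_result;  // cached for the rest of this epoch
  }
};

// Starting a new unloading cycle invalidates every cached result at
// once, without touching the nmethods themselves.
void start_unloading_cycle() { ++g_unloading_epoch; }
```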
>>>>
>>>> With is_unloading() lazily evaluated, it is now possible to build a
>>>> do_unloading() method on nmethod that simply makes the nmethod
>>>> unloaded if it is_unloading(), and otherwise cleans its various
>>>> caches. Cleaning e.g. the inline caches of said nmethod uses the
>>>> same is_unloading() method on their targets to figure out whether
>>>> an IC cache entry should be cleaned or not. Again, the epoched
>>>> caching mechanism ensures we don't recompute this value.
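>>>>
>>>> A minimal sketch of that per-nmethod do_unloading() logic (again,
>>>> illustrative names only, with is_unloading() reduced to a stored
>>>> flag instead of the cached computation):

```cpp
#include <algorithm>
#include <vector>

struct NM {
  bool unloading;               // stand-in for the epoch-cached is_unloading()
  bool unloaded = false;
  std::vector<NM*> ic_targets;  // callees recorded in the inline caches

  bool is_unloading() const { return unloading; }

  void do_unloading() {
    if (is_unloading()) {
      unloaded = true;          // make_unloaded(): this nmethod had broken oops
      return;
    }
    // Otherwise clean the inline caches: drop IC entries whose target
    // nmethod is itself unloading, reusing the same cached query.
    ic_targets.erase(
        std::remove_if(ic_targets.begin(), ic_targets.end(),
                       [](NM* t) { return t->is_unloading(); }),
        ic_targets.end());
  }
};
```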
>>>>
>>>> The new do_unloading() method may be called in serial, parallel and
>>>> concurrent contexts. So I removed the parallel variation we had
>>>> before, which unnecessarily postponed the unloading because it had
>>>> not yet been computed whether the nmethod should be unloaded.
>>>> Since that is now computed lazily on demand, there is no longer a
>>>> need to postpone this, nor to have two phases for parallel nmethod
>>>> unloading.
>>>>
>>>> While there is more work involved in making concurrent nmethod
>>>> unloading work, this is a good cleanup towards that goal.
>>>>
>>>> Bug ID:
>>>> https://bugs.openjdk.java.net/browse/JDK-8209189
>>>>
>>>> Webrev:
>>>> cr.openjdk.java.net/~eosterlund/8209189/webrev.00/
>>>>
>>>> Thanks,
>>>> /Erik
>>