Request for review (S): 8005972: ParNew should not update the tenuring threshold when promotion failed has occurred

Tue Jan 15 17:43:27 UTC 2013

Good that the full GC's do the -XX:+AlwaysPreTouch thing.  Still, something to remember for GC performance measurements in general.

			... peter

Bengt Rutisson wrote:
> 
> Hi Peter,
> 
> Thanks for looking at this!
> 
> On 1/11/13 11:18 PM, Peter B. Kessler wrote:
>> I don't see -XX:+AlwaysPreTouch in your command line.  (Mostly because 
>> I'm not sure I see the command line: for example, I don't see your 1GB 
>> heap setting.)
>>
>> When you are watching GC performance before the first full collection, 
>> you have to remember those faults for the OS to populate the old 
>> generation.  If you do that during promotions, you do it one page at a 
>> time.  -XX:+AlwaysPreTouch touches all of the committed old generation 
>> during start-up, which turns out to be much faster.  (And doesn't bill 
>> the time to GC. :-)
>>
>> Maybe those forced System.gc() calls at the beginning touch all of the 
>> old generation?  Try a runs with and without -XX:+AlwaysPreTouch and 
>> report the results, because I'm curious.
> 
> AlwaysPreTouch should not make a difference since SPECjbb does two 
> System.gc()s in the begin. But I did do two runs with 
> -XX:+AlwaysPreTouch just in case. The resulting log files are attached. 
> As you can see the results are the same.
> 
> defnew:  56964
> parnew:  59224
> 
> Here is the command line for the runs:
> 
> java -server -XX:+AlwaysPreTouch -XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps -XX:-PrintGCCause -cp ./jbb.jar:./check.jar 
> -Xms1g -Xmx1g spec.jbb.JBBmain -propfile SPECjbb.props
> 
> I add "-XX:+UseSerialGC" for the defnew runs and "-XX:+UseParNewGC 
> -XX:ParallelGCThreads=1" for the parnew runs.
> 
>>
>>
>> I also see that the young generation spaces are different between the 
>> runs.  Looking at the heap shape printed at th end of the GC logs
>>
>>     def new generation   total 314560K, used 232441K 
>> [0x00000000c0000000, 0x00000000d5550000, 0x00000000d5550000)
>>      eden space 279616K,  76% used [0x00000000c0000000, 
>> 0x00000000cd0b4d38, 0x00000000d1110000)
>>      from space 34944K,  53% used [0x00000000d1110000, 
>> 0x00000000d2359a18, 0x00000000d3330000)
>>      to   space 34944K,   0% used [0x00000000d3330000, 
>> 0x00000000d3330000, 0x00000000d5550000)
>>
>>     par new generation   total 314560K, used 74831K 
>> [0x00000000c0000000, 0x00000000d5550000, 0x00000000d5550000)
>>      eden space 279616K,  18% used [0x00000000c0000000, 
>> 0x00000000c33154f8, 0x00000000d1110000)
>>      from space 34944K,  64% used [0x00000000d1110000, 
>> 0x00000000d270e760, 0x00000000d3330000)
>>      to   space 34944K,   0% used [0x00000000d3330000, 
>> 0x00000000d3330000, 0x00000000d5550000)
>>
>>     PSYoungGen      total 329728K, used 93207K [0x00000000eaab0000, 
>> 0x0000000100000000, 0x0000000100000000)
>>      eden space 309952K, 23% used 
>> [0x00000000eaab0000,0x00000000ef30dd08,0x00000000fd960000)
>>      from space 19776K, 96% used 
>> [0x00000000fd960000,0x00000000fec08000,0x00000000fecb0000)
>>      to   space 19712K, 0% used 
>> [0x00000000fecc0000,0x00000000fecc0000,0x0000000100000000)
>>
>> The PSYoung eden is 10% larger than the others (because the survivors 
>> are smaller?).
> 
> Right. This is because the PS runs have UseAdaptiveSizePolicy enabled. I 
> mostly included the PS runs for comparison. My main interest in in the 
> difference between DefNew and ParNew. They have the same size eden and 
> survivors.
>> The sizes and occupancy of the survivors is different between DefNew 
>> and ParNew if you look in the logs.
> Yes, this is what triggered me to go and look at the code for how we 
> update the tenuring threshold. It looks like we use the same ergonomics 
> except for the difference that this review request is addressing. But 
> this difference is not the cause of the performance difference since I 
> don't get any promotion failures.
> 
> Thanks again for looking at this!
> Bengt
> 
>>
>>             ... peter
>>
>> Bengt Rutisson wrote:
>>>
>>>
>>> Hi Ramki,
>>>
>>> On 1/11/13 6:50 PM, Srinivas Ramakrishna wrote:
>>>> Hi Bengt --
>>>>
>>>> Try computing the GC overhead by normalizing wrt the work done (for 
>>>> which the net allocation volume might be a good proxy). As you 
>>>> state, the performance numbers will then likely make sense. Of 
>>>> course, they still won't explain why ParNew does better.  As Vitaly 
>>>> conjectures, the difference is likely in better object co-location 
>>>> with ParNew's slightly more DFS-like evacuation compared with 
>>>> DefNew's considerably more BFS-like evacuation because of the 
>>>> latter's use of a pure Cheney scan compared with the use of (a) 
>>>> marking stack(s) in the former, as far as i can remember the code. 
>>>> One way to tell if that accounts for the difference is to measure 
>>>> the cache-miss rates in the two cases (and may be use a good tool 
>>>> like Solaris perf analyzer to show you where the misses are coming 
>>>> from as well).
>>>
>>> Thanks for bringing the DFS/BFS difference up. This is exactly the 
>>> kind of difference I was looking for. My guess is that this is what 
>>> causes the difference in JBB score. I'll see if I can investigate 
>>> this further.
>>>> Also curious if you can share the two sets of GC logs, by chance? 
>>>> (specJBB is a for-fee benchmark so is not freely available to the 
>>>> individual developer.)
>>>
>>> I have a fairly large set of logs, but the runs are very stable so 
>>> I'm just attaching logs for one run for each collector. For 
>>> comparison I have also been running ParallelScavenge with one thread. 
>>> I'm using separate gc logs and jbb logs. The log files called 
>>> ".result" are the jbb output. The other logs are the gc logs.
>>>
>>> I'm running with a heap size of 1GB to avoid full GCs. All runs have 
>>> the two System.gc() induced full GCs but no other. ParallelScavenge 
>>> is performing even better than ParNew, but I am mostly interested in 
>>> the difference between ParNew and DefNew.
>>>
>>> A quick summary of the data in the logs:
>>>
>>>         Score  #GCs  AverageGCTime
>>> DefNew: 57903  2083  0.044053195391262644
>>> ParNew: 61363  2213  0.05931835969272489
>>> PS:     69697  2213  0.06117092860370538
>>>
>>> ParNew has a better score even though it does more GCs and they take 
>>> longer.
>>>
>>> If you have any insights from looking at the logs I would be very 
>>> happy to hear about it.
>>>
>>> Thanks,
>>> Bengt
>>>>
>>>> thanks.
>>>> -- ramki
>>>>
>>>> On Fri, Jan 11, 2013 at 4:57 AM, Bengt Rutisson 
>>>> <bengt.rutisson at oracle.com <mailto:bengt.rutisson at oracle.com>> wrote:
>>>>
>>>>
>>>>     Hi Vitaly,
>>>>
>>>>
>>>>     On 1/11/13 1:45 PM, Vitaly Davidovich wrote:
>>>>>
>>>>>     Hi Bengt,
>>>>>
>>>>>     Regarding the benchmark score, are you saying ParNew has longer
>>>>>     cumulative GC time or just the average is higher? If it's just
>>>>>     average, maybe the total # of them (and cumulative time) is
>>>>>     less.  I don't know the characteristics of this particular
>>>>>     specjbb benchmark, but perhaps having fewer total GCs is better
>>>>>     because of the overhead of getting all threads to a safe point,
>>>>>     going go the OS to suspend them, and then restarting them.  After
>>>>>     they're restarted, the CPU cache may be cold for it because the
>>>>>     GC thread polluted it.  Or I'm entirely wrong in my speculation
>>>>>     ... :).
>>>>>
>>>>
>>>>     You have a good point about the number of GCs. The problem in my
>>>>     runs is that ParNew does more GCs than DefNew. So there are both
>>>>     more of them and their average time is higher, but the score is
>>>>     still better. That ParNew does more GCs is not that strange. It
>>>>     has a higher score, which means that it had higher throughput and
>>>>     had time to create more objects. So, that is kind of expected. But
>>>>     I don't understand how it can have higher throughput when the GCs
>>>>     take longer. My current guess is that it does something
>>>>     differently with how objects are copied in a way that is
>>>>     beneficial for the execution time between GCs.
>>>>
>>>>     It also seems like ParNew keeps more objects alive for each GC.
>>>>     That is either the reason why it does more and more frequent GCs
>>>>     than DefNew, or it is an effect of the fact that more objects are
>>>>     created due to the higher throughput. This is the reason I started
>>>>     looking at the tenuring threshold.
>>>>
>>>>     Bengt
>>>>
>>>>
>>>>>     Thanks
>>>>>
>>>>>     Sent from my phone
>>>>>
>>>>>     On Jan 11, 2013 6:02 AM, "Bengt Rutisson"
>>>>>     <bengt.rutisson at oracle.com <mailto:bengt.rutisson at oracle.com>> 
>>>>> wrote:
>>>>>
>>>>>
>>>>>         Hi Ramki,
>>>>>
>>>>>         Thanks for looking at this!
>>>>>
>>>>>         On 1/10/13 9:28 PM, Srinivas Ramakrishna wrote:
>>>>>>         Hi Bengt --
>>>>>>
>>>>>>         The change looks reasonable, but I have a comment and a
>>>>>>         follow-up question.
>>>>>>
>>>>>>         Not your change, but I'd elide the "half the real survivor
>>>>>>         size" since it's really a configurable parameter based on
>>>>>>         TargetSurvivorRatio with default half.
>>>>>>         I'd leave the comment as "set the new tenuring threshold and
>>>>>>         desired survivor size".
>>>>>
>>>>>         I'm fine with removing this from the comment, but I thought
>>>>>         the "half the real survivor size" aimed at the fact that we
>>>>>         pass only the "to" capacity and not the "from" capacity in to
>>>>>         compute_tenuring_threshold(). With that interpretation I
>>>>>         think the comment is correct.
>>>>>
>>>>>         Would you like me to remove it anyway? Either way is fine
>>>>>         with me.
>>>>>
>>>>>>         I'm curious though, as to what performance data prompted
>>>>>>         this change,
>>>>>         Good point. This change was preceded by an internal
>>>>>         discussion in the GC team, so I should probably have
>>>>>         explained the background more in my review request to the 
>>>>> open.
>>>>>
>>>>>         I was comparing the ParNew and DefNew implementation since I
>>>>>         am seeing some strange differences in some SPECjbb2005
>>>>>         results. I am running ParNew with a single thread and get
>>>>>         much better score than with DefNew. But I also get higher
>>>>>         average GC times. So, I was trying to figure out what DefNew
>>>>>         and ParNew does differently.
>>>>>
>>>>>         When I was looking at DefNewGeneration::collect() and
>>>>>         ParNewGeneration::collect() I saw that they contain a whole
>>>>>         lot of code duplication. It would be tempting to try to
>>>>>         extract the common code out into DefNewGeneration since it is
>>>>>         the super class. But there are some minor differences. One of
>>>>>         them was this issue with how they handle the tenuring 
>>>>> threshold.
>>>>>
>>>>>         We tried to figure out if there is a reason for ParNew and
>>>>>         DefNew to behave different in this regard. We could not come
>>>>>         up with any good reason for that. So, we needed to figure out
>>>>>         if we should change ParNew or DefNew to make them consistent.
>>>>>         The decision to change ParNew was based on two things. First,
>>>>>         it seems wrong to use the data from a collection that got
>>>>>         promotion failure. This collection will not have allowed the
>>>>>         tenuring threshold to fulfill its purpose. Second,
>>>>>         ParallelScavenge works the same way as DefNew.
>>>>>
>>>>>         BTW, the difference between DefNew and ParNew seems to have
>>>>>         been there from the start. So, there is no bug or changeset
>>>>>         in mercurial or TeamWare to explain why the difference was
>>>>>         introduced.
>>>>>
>>>>>         (Just to be clear, this difference was not the cause of my
>>>>>         performance issue. I still don't have a good explanation for
>>>>>         how ParNew can have longer GC times but better SPECjbb score.)
>>>>>
>>>>>>         and whether it might make sense, upon a promotion failure to
>>>>>>         do something about the tenuring threshold for the next
>>>>>>         scavenge (i.e. for example make the tenuring threshold half
>>>>>>         of its current value as a reaction to the fact that
>>>>>>         promotion failed). Is it currently left at its previous
>>>>>>         value or is it asjusted back to the default max value (which
>>>>>>         latter may be the wrong thing to do) or something else?
>>>>>
>>>>>         As far as I can tell the tenuring threshold is left untouched
>>>>>         if we get a promotion failure. It is probably a good idea to
>>>>>         update it in some way. But I would prefer to handle that as a
>>>>>         separate bug fix.
>>>>>
>>>>>         This change is mostly a small cleanup to make
>>>>>         DefNewGeneration::collect() and ParNewGeneration::collect()
>>>>>         be more consistent. We've done the thinking so, it's good to
>>>>>         make the change in preparation for the next person that comes
>>>>>         a long and has a few cycles over and would like to merge the
>>>>>         two collect() methods in some way.
>>>>>
>>>>>         Thanks again for looking at this!
>>>>>         Bengt
>>>>>
>>>>>>
>>>>>>         -- ramki
>>>>>>
>>>>>>         On Thu, Jan 10, 2013 at 1:30 AM, Bengt Rutisson
>>>>>>         <bengt.rutisson at oracle.com
>>>>>>         <mailto:bengt.rutisson at oracle.com>> wrote:
>>>>>>
>>>>>>
>>>>>>             Hi everyone,
>>>>>>
>>>>>>             Could I have a couple of reviews for this small change
>>>>>>             to make DefNew and ParNew be more consistent in the way
>>>>>>             they treat the tenuring threshold:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~brutisso/8005972/webrev.00/
>>>>>> <http://cr.openjdk.java.net/%7Ebrutisso/8005972/webrev.00/>
>>>>>>
>>>>>>             Thanks,
>>>>>>             Bengt
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>