request for review (M): 6883834: ParNew: assert(!_g->to()->is_in_reserved(obj), "Scanning field twice?") with LargeObjects tests

Y. Srinivas Ramakrishna y.s.ramakrishna at oracle.com
Sat May 7 23:24:45 UTC 2011


Final webrev, I promise :-) The webrev.3 link below is now
superseded by webrev.4 :-

   http://cr.openjdk.java.net/~ysr/6883834/webrev.4

thanks.
-- ramki

PS: for those who may have looked at webrev.3, I reverted the
computation of last_block in process_chunk_boundaries()
to the previous form, using block_start(chunk_mr.end()),
from the block_start(chunk_mr.last()) that I had
briefly experimented with based upon a discussion
with one of the reviewers. I elaborated the preceding
comment to explain why we prefer the end() form:
it potentially requires a smaller BOT search and
fewer object walks in the case where objects
are small (which is by far the more frequent case).
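To make the trade-off concrete, here is a toy model (illustrative
only, not HotSpot code: block_start() below is a stand-in for the
real BOT query, and the uniform object size is an assumption):

    #include <cassert>
    #include <cstddef>

    // Toy heap of equal-sized objects laid out back to back from
    // word 0. block_start(addr) returns the first word of the
    // object covering 'addr'.
    static size_t block_start(size_t addr, size_t obj_words) {
      return addr - (addr % obj_words);
    }

    int main() {
      const size_t chunk_words = 256;
      const size_t chunk_end   = chunk_words;    // one past the last word
      const size_t chunk_last  = chunk_end - 1;  // the last word
      const size_t obj_words   = 4;              // small objects: the common case

      // With small objects, one typically starts exactly at the chunk
      // boundary, so block_start(end()) needs no walk back at all ...
      assert(block_start(chunk_end, obj_words) == chunk_end);

      // ... whereas block_start(last()) always resolves to an object
      // starting strictly inside the chunk: one more object walk.
      assert(block_start(chunk_last, obj_words) == chunk_end - obj_words);
      return 0;
    }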

On 5/7/2011 2:12 AM, Y. Srinivas Ramakrishna wrote:
> Hi again ...
>
> Testing revealed another long-standing and nicely hidden bug.
> (The bug revealed itself when I was testing with smallish
> values of ParGCCardsPerStrideChunk, which, by the way,
> also reproduce the original crash very easily without
> the fix that is here under review -- after all, it's the relative
> sizes of the non-array objects and the chunks that are the
> cause of the original problem... while non-array objects
> exceeding 128K in size may be rare, those exceeding 512 bytes
> are clearly not!)
> The flag ParallelGCRetainPLAB controls whether
> so-called "promotion LABs" in the old gen are retained
> and reused across scavenges. This is useful when PLAB
> resizing isn't optimal, or when there is so much volatility
> in the amount a GC thread allocates that its
> promotion rate fluctuates rapidly from one scavenge to
> another, wasting space to internal fragmentation within
> the promotion LAB allocated to it. However, a little thought
> reveals that if PLABs are retained across scavenges, then
> GC worker threads may be promoting into cards that are
> subject to scanning by other worker threads, and this can cause,
> e.g., BOT walks to go awry. The expedient solution
> is to switch off the PLAB retention feature so that no
> promotion activity happens into the "used_region_at_save_marks()"
> portion of the old gen when using ParNewGC. (CMS is not
> affected because its old gen, being a non-contiguous freelist space,
> is already prepared, at some cost, for such promotion activity into
> cards subject to scanning at the same time.)
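> To illustrate the hazard, here is a toy model (made-up numbers and
> names; not the actual PLAB code):
>
>     #include <cstdio>
>
>     int main() {
>       // Word-indexed toy old gen. At the start of a scavenge the
>       // collector saves a mark at the top of the used region; card
>       // scanning covers exactly [0, saved_top).
>       const long saved_top = 1000;  // end of used_region_at_save_marks()
>
>       // A PLAB retained from the previous scavenge: its unused tail
>       // [plab_top, plab_end) necessarily lies below saved_top ...
>       const long plab_top = 900, plab_end = 960;
>
>       // ... so the first promotion of this scavenge allocates there,
>       long promoted = plab_top;
>
>       if (promoted < saved_top && promoted < plab_end) {
>         // ... mutating cards and BOT entries inside the very region
>         // that the scanning threads are concurrently walking.
>         std::printf("promotion at %ld inside scanned region [0, %ld)\n",
>                     promoted, saved_top);
>       }
>       return 0;
>     }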
>
> I also declared certain local variables const in process_chunk_boundaries()
> and made a few other minor changes related to assertion checking,
> documentation comments, and so forth.
>
> However, the only real semantic change since the last webrev
> is simply the flipping of the default value of ParallelGCRetainPLAB
> and its conversion to a diagnostic flag.
>
> Please point your browser to:
>
> http://cr.openjdk.java.net/~ysr/6883834/webrev.3
>
> I'll do some performance measurements to determine the
> extent of performance loss, if any, from disabling this
> feature and thus potentially losing space in the old gen
> to internal fragmentation in these promotion LABs.
> (The frequency of old gen collections would likely increase
> by the extent to which space is being wasted in these
> PLABs.)
>
> thanks.
> -- ramki
>
> On 5/6/2011 2:59 PM, Y. Srinivas Ramakrishna wrote:
>> In case you haven't started looking yet, an updated version is in:-
>>
>> http://cr.openjdk.java.net/~ysr/6883834/webrev.1
>>
>> with changes based on the review(s) so far.
>> If you've already started reviewing, it's fine to
>> stick to the old one if you want to.
>>
>> thanks!
>> -- ramki
>>
>> On 5/5/2011 1:23 AM, Y. Srinivas Ramakrishna wrote:
>>>
>>> 6883834: ParNew: assert(!_g->to()->is_in_reserved(obj),"Scanning field twice?") with LargeObjects
>>> tests
>>>
>>> Investigation of this crash led to several avenues
>>> towards fixing the problem, including a
>>> more scalable and performant rewrite of the parallelization
>>> of card scanning which would have fixed the bug as well,
>>> and which was indeed the direction in which I had been
>>> initially headed. However, I realized last week that,
>>> given the upcoming hs21 code freeze, and the subtleties
>>> in the current code, that might not be the most sensible
>>> thing to do at this time. At that point, I decided to fix
>>> only the correctness problem(s) indicated by the assertion
>>> that we hit and not address in any manner the relative
>>> lack of scalability in the current implementation in
>>> the presence of large non-array objects. Somewhat
>>> offsetting that lack of scalability, however, is
>>> the fact that the situation in which scalability suffers --
>>> very large, sparsely dirty non-array objects -- is the exception
>>> in most programs (not so very large, sparsely dirty
>>> array objects, which are by far the more common).
>>>
>>> The problem in the current code turned out to be
>>> two-fold:
>>> (1) it may scan too little and miss some pointers:
>>> when a large non-array object straddles entirely over a
>>> complete "chunk" (the parallel unit of work) of the card table,
>>> and its head is dirty (so we are forced to assume an imprecise
>>> card mark and scan it in its entirety),
>>> we may miss any part of it that extends past the end
>>> of its second chunk (see the toy sketch after (2) below).
>>> Currently chunks are fixed at 256 cards == 128K
>>> (256 cards x 512 bytes/card), so any object larger than
>>> that size is potentially vulnerable.
>>> (2) it may scan too much and thus scan some pointers twice:
>>> when a large object straddles onto a subsequent chunk,
>>> its head card is dirty, and the next card is not dirty,
>>> the intention of the code was that
>>> the thread scanning the chunk in which the head of the
>>> object lay would scan the object in its entirety,
>>> and the thread scanning the next chunk would not.
>>> However, there was a mismatch between the way card-dirtiness
>>> was tested for an object straddling into a
>>> chunk from the left and the way it was tested for
>>> an object straddling past the end (to the right).
>>> This could, under the right conditions, result
>>> in a suffix of the object in the
>>> second of the two chunks above being scanned twice.
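>>> Here is a toy illustration of scenario (1), with made-up card
>>> numbers and a tiny 4-card chunk standing in for the real
>>> 256-card one (illustrative only, not the actual scanning code):
>>>
>>>     #include <cstdio>
>>>
>>>     int main() {
>>>       const int cards_per_chunk = 4;  // stand-in for the real 256
>>>       // A large non-array object with an imprecise (head-only)
>>>       // card mark:
>>>       const int head_card = 2;   // head in chunk 0 (cards 0..3)
>>>       const int last_card = 13;  // tail reaches into chunk 3
>>>
>>>       // The old code stopped the rightward search at the end of
>>>       // the object's second chunk (chunk 1, cards 4..7) ...
>>>       const int buggy_limit =
>>>           (head_card / cards_per_chunk + 2) * cards_per_chunk - 1;
>>>
>>>       if (last_card > buggy_limit) {
>>>         // ... leaving cards 8..13 of the object unscanned:
>>>         std::printf("cards %d..%d are never scanned\n",
>>>                     buggy_limit + 1, last_card);
>>>       }
>>>       return 0;
>>>     }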
>>>
>>> While it was the second scenario that caused the
>>> assertion violation, the first could lead to more
>>> insidious problems with stale pointers left in old
>>> generation or permanent generation objects.
>>>
>>> The fix for (1) was to terminate the rightward search
>>> for the last card to scan not at the end of the
>>> next chunk, but at the end of the object in
>>> question.
>>>
>>> The fix for (2) was to make sure the correct predicate
>>> was being tested on the card values, in order to ensure
>>> that the thread for the left chunk did not scan beyond
>>> a specific point, as recorded in the
>>> "LNC" value.
>>>
>>> Along the way, I added a few assertion checks and some verbose logging
>>> (disabled in all builds, but which can be enabled via a source-level
>>> change: I'd like to leave this in for future debuggability
>>> and code comprehensibility), elaborated and corrected the documentation
>>> comments, and swapped the order of filling the "LNC array"
>>> (which is consulted by leftward-neighbouring workers)
>>> and determining the rightward extent of the chunk scan
>>> past the end of the chunk. I also tightened the set of
>>> conditions under which we extend our scan to the right,
>>> which can save some unnecessary card-value examination
>>> in some cases.
>>>
>>> http://cr.openjdk.java.net/~ysr/6883834/webrev.0/
>>>
>>> Note that this is a nearly minimal fix to the correctness
>>> problem in parallel card-scanning with ParNew. A more
>>> extensive rewrite of the code with an eye to simplifying
>>> some of the convolutedness of its current form and of
>>> improving its scalability when there are very large
>>> objects in the heap (or, as in this case, large constant
>>> pools in the perm gen), will be the subject of an RFE
>>> post-hs21.
>>>
>>> PS: I also pushed some of the closure construction
>>> closer to its use. This will also make it easier to
>>> implement the performance RFE to follow in hs22.
>>> In the same vein, I changed some defined constants
>>> related to chunk size and parallel chunk strides
>>> into global diagnostic flags. I did leave the current
>>> values unchanged, although it is quite likely that these
>>> values could be tuned some, an activity that will be
>>> part of the aforementioned performance RFE.
>>>
>>> Performance:
>>> A quick performance run indicated no change in performance,
>>> as one might expect given the rarity of very large
>>> non-array objects in most applications. More thorough performance
>>> testing (with reference server refworkload) is in progress.
>>>
>>> Correctness Testing:
>>> The jck60001 test that tripped the assert; refworkload; JPRT;
>>> all three with and without heap verification on a variety
>>> of different platforms and with the chunk size varied for
>>> good measure.
>>>
>>> thanks for your reviews!
>>> -- ramki
>>>
>>>
>>
>



