[jdk16] RFR: 8257999: Parallel GC crash in gc/parallel/TestDynShrinkHeap.java: new region is not in covered_region [v2]

Kim Barrett kbarrett at openjdk.java.net
Thu Dec 17 14:21:14 UTC 2020


> Please review this change to ParallelGC oldgen allocation, adding a missing
> memory barrier.
> 
> The problem arises in the interaction between concurrent oldgen allocations,
> where each would, if done serially (in either order), require expansion of
> the generation.
> 
> An allocation of size N compares the mutable space's (end - top) with N to
> determine if space is available.  If available, use top as the start of the
> object of size N (adjusting top atomically) and assert the resulting memory
> region is in the covered area.  If not, then expand.
> 
> Expansion updates the covered region, then updates the space (i.e. end).
> There is currently no memory barrier between those operations.
> 
> As a result, we can have thread1 having done an expansion, updating the
> covered region and the space end. Because there's no memory barrier there,
> the space end may be updated before the covered region as far as some other
> thread is concerned.
> 
> Meanwhile thread2's allocation reads the new end and goes ahead with the
> allocation (which would not have fit with the old end value), then fails the
> covered region check because it used the old covered range.  Although the
> reads of end and the covered range are ordered here by the intervening CAS
> of top, that doesn't help if the writes by thread1 are not also properly
> ordered.
> 
> There is even a comment about this in PSOldGen::post_resize(), saying the
> space update must be last (including after the covered region update).  But
> without a memory barrier, there's nothing other than source order to ensure
> that ordering.  So add a memory barrier.
> 
> I'm not sure whether this out-of-order update of the space end could lead to  
> problems in a product build (where the assert doesn't apply).  Without
> looking carefully, there appear to be opportunities for problems, such as
> accessing uncovered parts of the card table.
>  
> There's another issue that I'm not addressing with this change.  Various
> values are being read while subject to concurrent writes, without being in
> any way tagged as atomic. (The writes are under the ExpandHeap_lock, the
> reads are not.) This includes at least the covering region bounds and space
> end.
> 
> Testing:
> mach5 tier1
> I was unable to reproduce the failure, so can't show any before / after
> improvement.

Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:

 - Merge branch 'master' into shrink_heap_crash
 - stefanj review
 - add memory barrier

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk16/pull/35/files
  - new: https://git.openjdk.java.net/jdk16/pull/35/files/8a5b6ce5..50ee993d

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk16&pr=35&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk16&pr=35&range=00-01

  Stats: 545 lines in 30 files changed: 462 ins; 2 del; 81 mod
  Patch: https://git.openjdk.java.net/jdk16/pull/35.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk16 pull/35/head:pull/35

PR: https://git.openjdk.java.net/jdk16/pull/35



More information about the hotspot-gc-dev mailing list