[jdk16] RFR: 8257999: Parallel GC crash in gc/parallel/TestDynShrinkHeap.java: new region is not in covered_region [v2]
Kim Barrett
kbarrett at openjdk.java.net
Thu Dec 17 14:21:14 UTC 2020
> Please review this change to ParallelGC oldgen allocation, adding a missing
> memory barrier.
>
> The problem arises in the interaction between concurrent oldgen allocations,
> where each would, if done serially (in either order), require expansion of
> the generation.
>
> An allocation of size N compares the mutable space's (end - top) with N to
> determine if space is available. If available, use top as the start of the
> object of size N (adjusting top atomically) and assert the resulting memory
> region is in the covered area. If not, then expand.
>
> Expansion updates the covered region, then updates the space (i.e. end).
> There is currently no memory barrier between those operations.
>
> As a result, we can have thread1 having done an expansion, updating the
> covered region and the space end. Because there's no memory barrier there,
> the space end may be updated before the covered region as far as some other
> thread is concerned.
>
> Meanwhile thread2's allocation reads the new end and goes ahead with the
> allocation (which would not have fit with the old end value), then fails the
> covered region check because it used the old covered range. Although the
> reads of end and the covered range are ordered here by the intervening CAS
> of top, that doesn't help if the writes by thread1 are not also properly
> ordered.
>
> There is even a comment about this in PSOldGen::post_resize(), saying the
> space update must be last (including after the covered region update). But
> without a memory barrier, there's nothing other than source order to ensure
> that ordering. So add a memory barrier.
>
> I'm not sure whether this out-of-order update of the space end could lead to
> problems in a product build (where the assert doesn't apply). Without
> looking carefully, there appear to be opportunities for problems, such as
> accessing uncovered parts of the card table.
>
> There's another issue that I'm not addressing with this change. Various
> values are being read while subject to concurrent writes, without being in
> any way tagged as atomic. (The writes are under the ExpandHeap_lock, the
> reads are not.) This includes at least the covering region bounds and space
> end.
>
> Testing:
> mach5 tier1
> I was unable to reproduce the failure, so can't show any before / after
> improvement.
Kim Barrett has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
- Merge branch 'master' into shrink_heap_crash
- stefanj review
- add memory barrier
-------------
Changes:
- all: https://git.openjdk.java.net/jdk16/pull/35/files
- new: https://git.openjdk.java.net/jdk16/pull/35/files/8a5b6ce5..50ee993d
Webrevs:
- full: https://webrevs.openjdk.java.net/?repo=jdk16&pr=35&range=01
- incr: https://webrevs.openjdk.java.net/?repo=jdk16&pr=35&range=00-01
Stats: 545 lines in 30 files changed: 462 ins; 2 del; 81 mod
Patch: https://git.openjdk.java.net/jdk16/pull/35.diff
Fetch: git fetch https://git.openjdk.java.net/jdk16 pull/35/head:pull/35
PR: https://git.openjdk.java.net/jdk16/pull/35
More information about the hotspot-gc-dev
mailing list