RFR (XS): 8164948: Initializing stores of HeapRegions are not ordered with regards to its use in G1ConcurrentMark

Fri Sep 9 13:01:59 UTC 2016

Hi all,

 StefanJ had an initial look and commented about the removal of the
reinitialization of the HeapRegion in heapRegion.cpp, i.e. that this is
still necessary.

He is right, I updated the webrev at
http://cr.openjdk.java.net/~tschatzl/8164948/webrev.1 (full)
http://cr.openjdk.java.net/~tschatzl/8164948/webrev.0_to_1 (diff)

Analysis of the new version shows that this will not cause any
additional issues: the effect is that (like before the previous webrev,
and before finding this issue), the marking thread will just scan the
next bitmap occupied by the region unnecessarily (it will not find any
mark, this is correct because that region has been allocated after mark
start, and those are always(tm) kept live).

This is true because G1 only uncommits the region and the associated
auxiliary data at full gc, which will cancel all marking.

I.e. the original problem had been that since we got a TAMS value for
the region that was outside the region, we scanned into non-committed
areas.

I will do some more testing with a new build though.

Thanks,
  Thomas

On Fri, 2016-09-09 at 12:44 +0200, Thomas Schatzl wrote:
> Hi all,
> 
>   can I have reviews for this change that adds required barriers
> before
> publishing HeapRegion instances to the rest of the VM to avoid random
> crashes seen on ARM* machines with G1?
> 
> The problem (in this instance - it might have caused other issues in
> other code before) is that in G1ConcurrentMark when moving the finger
> before marking we may load HeapRegion instances containing invalid
> data. This is because before publishing the HeapRegion* instance to
> the
> global HeapRegionManager::_regions map, we do not make sure that the
> initializating stores retire. So the subsequent loads of values from
> that data structure may not have been visible for the other threads
> (the marking thread in particular).
> 
> About the patch: I think the loadload in G1ConcurrentMark is not
> strictly necessary, as the following loads are always dependent on
> the
> HeapRegion* just loaded. Even ARM (but not supported Alpha)
> processors
> have an implicit loadload barrier here.
> (I could remove it if not wanted).
> 
> Actually a more comprehensive change would remove the need to
> actually
> have these loads from the HeapRegion* like envisioned in the comments
> to this bug (somewhat like JDK-7127707). That has been deemed to be
> too
> complex at this time, and may in fact open up more of these issues
> without very good thought.
> 
> There are lots of other places that might or might not be susceptible
> to this problems, alone G1CollectedHeap::heap_region_containing() is
> used in 61 places, and in very frequently used ones (e.g.
> G1CollectedHeap::is_in()). I did not conduct a full-scale review of
> all
> these places here, just briefly looked at the locations. They all
> seem
> to be covered by the implicit loadload barrier as far as I can tell
> quickly. That would realistically not complete soon. I will file a CR
> for that though.
> 
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8164948
> Webrev:
> http://cr.openjdk.java.net/~tschatzl/8164948/webrev/
> Testing:
> jprt, running crashing test >3000 times without issue (crash
> frequency
> typically 50-1000 iterations, but will keep on running it over the
> weekend) on arm64 hardware - the symptoms shown in the CR very much
> point to this exact problem.
> 
> Thanks,
>   Thomas