RFR: 8272170: Missing memory barrier when checking active state for regions

Wed Nov 10 08:44:59 UTC 2021

Hi all,

  can I have reviews for this fix to memory visibility race where we miss to make sure that the `HeapRegion` pointer in `HeapRegionManager::_regions` is visible before the "active"-check for that `HeapRegion` can return true?

I.e. the problem is this (according to my understanding):

void HeapRegionManager::expand(uint start, uint num_regions, WorkerThreads* pretouch_workers) {
  commit_regions(start, num_regions, pretouch_workers);
  for (uint i = start; i < start + num_regions; i++) {
    HeapRegion* hr = _regions.get_by_index(i);
    if (hr == NULL) {
      hr = new_heap_region(i);
      OrderAccess::storestore();
      _regions.set_by_index(i, hr);
      _allocated_heapregions_length = MAX2(_allocated_heapregions_length, i + 1);
    }
    G1CollectedHeap::heap()->hr_printer()->commit(hr);
  }
  activate_regions(start, num_regions);
}

E.g. we first commit the memory, then create a `HeapRegion` instance which we properly guard with a `StoreStore` before assigning it to the `_regions` array. After that at the end we activate the regions (via `activate_regions`), meaning that the contents of the new `HeapRegion*` in the region table are valid.
These bits in the `_active_regions` bitmap are put with memory barriers (via `par_set_range`).

In `HeapRegionManager::par_iterate`, if we iterate over the region map, we first check whether the region is available in the `_active_regions` bitmap, and then get and pass on the corresponding `HeapRegion*` to the given closure.

However there is no memory ordering between reading the available-bit in the `_active_regions` bitmap, so that bit could become visible to the thread iterating over the regions before the contents of the `_region` map itself, in effect passing a `nullptr` to the closure which isn't allowed. (Some details about the debugging session in the CR).

void HeapRegionManager::par_iterate(HeapRegionClosure* blk, HeapRegionClaimer* hrclaimer, const uint start_index) const {
  // Every worker will actually look at all regions, skipping over regions that
  // are currently not committed.
  // This also (potentially) iterates over regions newly allocated during GC. This
  // is no problem except for some extra work.
  const uint n_regions = hrclaimer->n_regions();
  for (uint count = 0; count < n_regions; count++) {
    const uint index = (start_index + count) % n_regions;
    assert(index < n_regions, "sanity");
    // Skip over unavailable regions
    if (!is_available(index)) {
      continue;
    }
    HeapRegion* r = _regions.get_by_index(index);
[...]
    bool res = blk->do_heap_region(r);
    if (res) {
      return;
    }
  }
}

The suggested fix is to ensure proper memory ordering in `is_available`, i.e. an acquire when accessing the bitmaps to make sure that other stores before the bit store we check are visible.

I did check a bit around the usage of this method and `is_unavailable`, but I do not think there is a similar issue.

Testing: test crashed with a frequency of around 1/500, now passing 5k runs

Thanks,
  Thomas

-------------

Commit messages:
 - Add memory barrier

Changes: https://git.openjdk.java.net/jdk/pull/6324/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6324&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8272170
  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.java.net/jdk/pull/6324.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/6324/head:pull/6324

PR: https://git.openjdk.java.net/jdk/pull/6324