RFR: Tsan oop map table hash fix

Thu Aug 15 02:36:08 UTC 2024

On Thu, 15 Aug 2024 00:07:22 GMT, Jiangli Zhou <jiangli at openjdk.org> wrote:

> Additional testing on JDK 21 with tsan found some crashes due to using `oop` `identity_hash` for computing the TsanOopMap entry hash. The crashes manifest as following two types:
> 
> 1) `guarantee(moves_this_cycle) failed: "Impossible to reconcile GC"`
> 
> 2) `ThreadSanitizer: CHECK failed: tsan_sync.cpp:261 "((*dst_meta)) == ((0))"`
> 
> With `-Xlog:tsan=trace` output, I found the same `oop` is added to the map twice in some cases. The second time when an `oop` is added occurs during `InterpreterMacroAssembler::lock_object` path. Apparently a new identity hash is computed for the `oop` with a previous computed identity hash. That causes the `oop` being added again. Later when notifying tsan if the `oop` is moved, it triggers the above crashes depending on the memory layout.
> 
> This fix computes `primitive_hash` using the oop address instead of `identity_hash` for entry hash. During GC, `TsanOopMapTable::collect_moved_objects_and_notify_freed` unlinks an entry from the map if the enclosing `oop` is moved. The moved entries are collected into an array and  are inserted back to the map after iterating the map (within a worker thread). That's needed since the entry hash would be different after an `oop` is moved. The operations are protected by the `TsanOopMap_lock`.
> 
> Since `primitive_hash` is now used, I also changed `TsanOopMapTableKey::get_hash` and `TsanOopMapTableKey::equals` to use `entry.object_no_keepalive()` instead of `entry._obj`. If `entry._obj` is outdated, it could result incorrect hash value for the current entry. So the updated oop address should be used instead. Running tsan jtreg tests on slowdebug binary run into new failures without the change.
> 
> Testing
> 
> Tsan jtreg tests on linux-x64:
> 
> $ bash configure --with-boot-jdk=/usr/local/google/home/jianglizhou/openjdk/jdk-21.0.1 --with-debug-level=release --disable-warnings-as-errors --with-jtreg=/usr/local/google/home/jianglizhou/github/jtreg/build/images/jtreg --with-stdc++lib=static --disable-precompiled-headers --enable-unlimited-crypto --with-native-debug-symbols=internal --with-default-make-target=jdk-image --disable-warnings-as-errors --with-toolchain-type=clang
> 
> 
> ==============================
> Test summary
> ==============================
>    TEST                                              TOTAL  PASS  FAIL ERROR   
>    jtreg:test/hotspot/jtreg/tsan                        79    79     0     0   
> ==============================
> TEST SUCCESS

Changes requested by manc (Reviewer).

src/hotspot/share/tsan/tsanOopMap.cpp line 235:

> 233:       const TsanOopMapImpl::MovedEntry &e = moved_entries.at(i);
> 234:       _oop_map->add_entry(e.key(), e.value());
> 235:     }

Is there a memory leak on the `MovedEntry::TsanOopMapTableKey*` after here?
`TsanOopMapTable::add_entry()` makes a copy `*(e.key())`, but `e.key()` appears never freed.

src/hotspot/share/tsan/tsanOopMapTable.cpp line 76:

> 74:   bool added;
> 75:   size_t* v = _table.put_if_absent(*entry, size, &added);
> 76:   assert(added, "must be");

> Since primitive_hash is now used, I also changed TsanOopMapTableKey::get_hash and TsanOopMapTableKey::equals to use entry.object_no_keepalive() instead of entry._obj. If entry._obj is outdated, it could result incorrect hash value for the current entry. So the updated oop address should be used instead. 

I think the concurrent processing of `TsanOopMapTable` and another worker thread processing `OopStorageSet` to update TSAN's `WeakHandles` is a scary situation, when we start using `entry.object_no_keepalive()` or `WeakHandle::peek()`. `peek()` may return an updated or not-yet-updated value, which will likely cause hidden bugs.

I came up with the buggy scenario below:
Suppose object A was at 0x1200, object B was at 0x2300. During GC, A becomes dead, and GC moves B to A's location, so B is now at 0x1200.
Worker thread T1 processed part of the `OopStorageSet` and updated B's weak handle (wh_B).  The weak handle for A (wh_A) is not yet processed.
T1 starts to process `TsanOopMapTable`, and it will try adding the new B to the table, with `wh_B.peek()` returing 0x1200. It will fail to add this entry, because the new B looks identical to the entry for A, with `wh_A.peek()` still returning 0x1200.
Then object B will be lost from the `TsanOopMapTable`.

I think a better approach is ensure the `TsanOopMapTable` is only processed once after all of the `OopStorageSet` is updated.
One option is the alternative approach of inserting  `TsanOopMap::notify_tsan_for_freed_and_moved_objects()` to individual GCs as mentioned in https://github.com/openjdk/tsan/pull/19.
Another possibility is to develop a mechanism to check whether `OopStorageSet` has been fully processed, before doing actual work to update `TsanOopMapTable`.

src/hotspot/share/tsan/tsanOopMapTable.cpp line 161:

> 159:         }
> 160: 
> 161:         entry.update_obj();

It is probably better to remove the `TsanOopMapTableKey::update_obj()` function altogether, so `TsanOopMapTableKey` becomes an immutable object. There is no point to update the old object when it is going to be removed immediately afterwards.

The code can use `wh_obj` to create a new `TsanOopMapTableKey`. `TsanOopMapTableKey` may need a new constructor though:

TsanOopMapTableKey::TsanOopMapTableKey(const TsanOopMapTableKey& src, oop obj) {
  _wh = src._wh;
  _obj = obj;
}

-------------

PR Review: https://git.openjdk.org/tsan/pull/23#pullrequestreview-2239376358
PR Review Comment: https://git.openjdk.org/tsan/pull/23#discussion_r1717670966
PR Review Comment: https://git.openjdk.org/tsan/pull/23#discussion_r1717793720
PR Review Comment: https://git.openjdk.org/tsan/pull/23#discussion_r1717677419