RFR: Tsan oop map table hash fix

Thu Aug 15 23:06:09 UTC 2024

On Thu, 15 Aug 2024 02:17:36 GMT, Man Cao <manc at openjdk.org> wrote:

>> Additional testing on JDK 21 with tsan found some crashes due to using `oop` `identity_hash` for computing the TsanOopMap entry hash. The crashes manifest as following two types:
>> 
>> 1) `guarantee(moves_this_cycle) failed: "Impossible to reconcile GC"`
>> 
>> 2) `ThreadSanitizer: CHECK failed: tsan_sync.cpp:261 "((*dst_meta)) == ((0))"`
>> 
>> With `-Xlog:tsan=trace` output, I found the same `oop` is added to the map twice in some cases. The second time when an `oop` is added occurs during `InterpreterMacroAssembler::lock_object` path. Apparently a new identity hash is computed for the `oop` with a previous computed identity hash. That causes the `oop` being added again. Later when notifying tsan if the `oop` is moved, it triggers the above crashes depending on the memory layout.
>> 
>> This fix computes `primitive_hash` using the oop address instead of `identity_hash` for entry hash. During GC, `TsanOopMapTable::collect_moved_objects_and_notify_freed` unlinks an entry from the map if the enclosing `oop` is moved. The moved entries are collected into an array and  are inserted back to the map after iterating the map (within a worker thread). That's needed since the entry hash would be different after an `oop` is moved. The operations are protected by the `TsanOopMap_lock`.
>> 
>> Since `primitive_hash` is now used, I also changed `TsanOopMapTableKey::get_hash` and `TsanOopMapTableKey::equals` to use `entry.object_no_keepalive()` instead of `entry._obj`. If `entry._obj` is outdated, it could result incorrect hash value for the current entry. So the updated oop address should be used instead. Running tsan jtreg tests on slowdebug binary run into new failures without the change.
>> 
>> Testing
>> 
>> Tsan jtreg tests on linux-x64:
>> 
>> $ bash configure --with-boot-jdk=/usr/local/google/home/jianglizhou/openjdk/jdk-21.0.1 --with-debug-level=release --disable-warnings-as-errors --with-jtreg=/usr/local/google/home/jianglizhou/github/jtreg/build/images/jtreg --with-stdc++lib=static --disable-precompiled-headers --enable-unlimited-crypto --with-native-debug-symbols=internal --with-default-make-target=jdk-image --disable-warnings-as-errors --with-toolchain-type=clang
>> 
>> 
>> ==============================
>> Test summary
>> ==============================
>>    TEST                                              TOTAL  PASS  FAIL ERROR   
>>    jtreg:test/hotspot/jtreg/tsan                        79    79     0     0   
>> ==============================
>> TEST SUCCESS
>
> src/hotspot/share/tsan/tsanOopMapTable.cpp line 76:
> 
>> 74:   bool added;
>> 75:   size_t* v = _table.put_if_absent(*entry, size, &added);
>> 76:   assert(added, "must be");
> 
>> Since primitive_hash is now used, I also changed TsanOopMapTableKey::get_hash and TsanOopMapTableKey::equals to use entry.object_no_keepalive() instead of entry._obj. If entry._obj is outdated, it could result incorrect hash value for the current entry. So the updated oop address should be used instead. 
> 
> I think the concurrent processing of `TsanOopMapTable` and another worker thread processing `OopStorageSet` to update TSAN's `WeakHandles` is a scary situation, when we start using `entry.object_no_keepalive()` or `WeakHandle::peek()`. `peek()` may return an updated or not-yet-updated value, which will likely cause hidden bugs.
> 
> I came up with the buggy scenario below:
> Suppose object A was at 0x1200, object B was at 0x2300. During GC, A becomes dead, and GC moves B to A's location, so B is now at 0x1200.
> Worker thread T1 processed part of the `OopStorageSet` and updated B's weak handle (wh_B).  The weak handle for A (wh_A) is not yet processed.
> T1 starts to process `TsanOopMapTable`, and it will try adding the new B to the table, with `wh_B.peek()` returing 0x1200. It will fail to add this entry, because the new B looks identical to the entry for A, with `wh_A.peek()` still returning 0x1200.
> Then object B will be lost from the `TsanOopMapTable`.
> 
> I think a better approach is ensure the `TsanOopMapTable` is only processed once after all of the `OopStorageSet` is updated.
> One option is the alternative approach of inserting  `TsanOopMap::notify_tsan_for_freed_and_moved_objects()` to individual GCs as mentioned in https://github.com/openjdk/tsan/pull/19.
> Another possibility is to develop a mechanism to check whether `OopStorageSet` has been fully processed, before doing actual work to update `TsanOopMapTable`.

Thanks for the analysis. The scenario describes a potential rare case where an `oop` is moved to an address location containing a freed `oop` and tsan is notified about the moved `oop` before the freed `oop`. I agree that's a potential bug.

Doing `TsanOopMap::notify_tsan_for_freed_and_moved_objects()` to individual GCs has higher risks of not covering all different cases if GC changes. I like you suggestion of implementing "a mechanism to check whether OopStorageSet has been fully processed". Implementing within `OopStorageSet` could be complicated. I think it might be doable in `WeakProcessor::Task::work` to check if all tasks are done. I'm trying that approach.

-------------

PR Review Comment: https://git.openjdk.org/tsan/pull/23#discussion_r1719096635