RFR: 8326936: RISC-V: Shenandoah GC crashes due to incorrect atomic memory operations [v3]

Fei Yang fyang at openjdk.org
Fri Mar 1 02:07:44 UTC 2024


On Thu, 29 Feb 2024 07:13:02 GMT, MaxXing <duke at openjdk.org> wrote:

>> With compressed oops enabled, Shenandoah GC crashes in the concurrent marking phase due to some incorrect atomic memory operations. This resulted in the failure of multiple related tests, including `gc/shenandoah/TestSmallHeap.java`, `gc/metaspace/TestMetaspacePerfCounters.java#Shenandoah-64`, and so on, tested on XuanTie C910 and QEMU.
>> 
>> This failure is related to a GCC bug we recently discovered: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114130.
>> 
>> In detail, there's a pattern in method `ShenandoahHeap::conc_update_with_forwarded`:
>> 
>> 
>> if (in_collection_set(obj)) {
>>   // Dereferences `obj` without explicit null check.
>>   oop fwd = ShenandoahBarrierSet::resolve_forwarded_not_null(obj);
>>   // Then calls atomic built-in.
>>   atomic_update_oop(fwd, p, obj);
>> }
>> 
>> 
>> `atomic_update_oop` then compresses `obj` into a 32-bit word and calls the built-in `__atomic_compare_exchange`. The latter produces incorrect assembly that comparing this unsigned 32-bit word with the sign-extended result of `lr.w` instructions.
>> 
>> This bug can be bypassed by adding an explicit null check (like `if (obj && in_collection_set(obj))`), or adding compiler flag `-fno-delete-null-pointer-checks`.
>> 
>> In previous commits, JDK-8316186 removed RISC-V's `PlatformCmpxchg<4>` but left the flag `-fno-delete-null-pointer-checks` enabled. Then JDK-8316893 removed the flag and made the bug visible. This patch adds `PlatformCmpxchg<4>` back.
>
> MaxXing has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fix static assert.

Hi, I have several minor comments. BTW: I am curious about whether GCC old versions like GCC-11/12 would work. Can you take a look? Thanks.

src/hotspot/os_cpu/linux_riscv/atomic_linux_riscv.hpp line 47:

> 45: #define CORRECT_COMPILER_ATOMIC_SUPPORT
> 46: #endif
> 47: 

It doesn't make sense to me to have this `CORRECT_COMPILER_ATOMIC_SUPPORT` macro.

src/hotspot/os_cpu/linux_riscv/atomic_linux_riscv.hpp line 126:

> 124: template<>
> 125: template<typename T>
> 126: inline T Atomic::PlatformCmpxchg<4>::operator()(T volatile* dest __attribute__((unused)),

Can you add a code comment about why we should have this defined? Better to add a link to the related GCC bug.

src/hotspot/os_cpu/linux_riscv/atomic_linux_riscv.hpp line 133:

> 131: 
> 132:   T old_value;
> 133:   long flag;

To be consistent with the preceding `Atomic::PlatformCmpxchg<1>`, I would suggest change `long flag;` into `uint64_t rc_temp;`

-------------

Changes requested by fyang (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18039#pullrequestreview-1910185137
PR Review Comment: https://git.openjdk.org/jdk/pull/18039#discussion_r1508396098
PR Review Comment: https://git.openjdk.org/jdk/pull/18039#discussion_r1508397082
PR Review Comment: https://git.openjdk.org/jdk/pull/18039#discussion_r1508398409


More information about the hotspot-runtime-dev mailing list