RFR: 8326936: RISC-V: Shenandoah GC crashes due to incorrect atomic memory operations [v4]

Robbin Ehn rehn at openjdk.org
Mon Mar 4 07:45:52 UTC 2024


On Fri, 1 Mar 2024 09:43:18 GMT, MaxXing <duke at openjdk.org> wrote:

>> With compressed oops enabled, Shenandoah GC crashes in the concurrent marking phase due to some incorrect atomic memory operations. This resulted in the failure of multiple related tests, including `gc/shenandoah/TestSmallHeap.java`, `gc/metaspace/TestMetaspacePerfCounters.java#Shenandoah-64`, and so on, tested on XuanTie C910 and QEMU.
>> 
>> This failure is related to a GCC bug we recently discovered: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114130.
>> 
>> In detail, there's a pattern in method `ShenandoahHeap::conc_update_with_forwarded`:
>> 
>> 
>> if (in_collection_set(obj)) {
>>   // Dereferences `obj` without explicit null check.
>>   oop fwd = ShenandoahBarrierSet::resolve_forwarded_not_null(obj);
>>   // Then calls atomic built-in.
>>   atomic_update_oop(fwd, p, obj);
>> }
>> 
>> 
>> `atomic_update_oop` then compresses `obj` into a 32-bit word and calls the built-in `__atomic_compare_exchange`. The latter produces incorrect assembly that comparing this unsigned 32-bit word with the sign-extended result of `lr.w` instructions.
>> 
>> This bug can be bypassed by adding an explicit null check (like `if (obj && in_collection_set(obj))`), or adding compiler flag `-fno-delete-null-pointer-checks`.
>> 
>> In previous commits, JDK-8316186 removed RISC-V's `PlatformCmpxchg<4>` but left the flag `-fno-delete-null-pointer-checks` enabled. Then JDK-8316893 removed the flag and made the bug visible. This patch adds `PlatformCmpxchg<4>` back.
>
> MaxXing has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Add comments, change `flag` to `rc_temp`.

Hey,

> So the return value should be sign-extended, even if it's unsigned (according to my understand). Here's a simple example: https://godbolt.org/z/n6vfoYPPx. GCC generates `slliw` for the unsigned return value.
> 

This method should, at least often, be inlined, so old_value will not be passed as a return value.
And someone may add additional code in the same method.
Now it seems like the compiler always sign extends every word from inline asm, but I don't think that is a guarantee?
I was told always pass registers to inline asm and not types.
So I suggest having the correct sign extended word type, e.g. int32_t, as old_value.

> > Anyhow can't we just use a int32_t when comparing, and cast that back to the 'uint32_t' on return? (and still use _atomic...)
> 
> Seems we can't. Check this example for more details: https://godbolt.org/z/cv5ejTfxh. I wrote both the inline assembly version and the built-in version CAS, and tested them on the same code snippet. The inline assembly version works well due to the ABI, but the built-in version is still buggy.

Yes, I don't find a good way to force it. It possible using volatile but it involves storing and load from stack.

Thanks, Robbin

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18039#issuecomment-1975909072


More information about the hotspot-runtime-dev mailing list