RFR: 8352185: Shenandoah: Invalid logic for remembered set verification [v14]
Xiaolong Peng
xpeng at openjdk.org
Fri Mar 28 00:54:25 UTC 2025
On Wed, 26 Mar 2025 20:37:59 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:
>> There are some scenarios in which GenShen may have improper remembered set verification logic:
>>
>> 1. Concurrent young cycles following a Full GC:
>>
>> At the end of ShenandoahFullGC, the bitmaps for the entire heap are reset without setting the marking context to incomplete, but ShenandoahVerifier uses code like the following to get a complete old marking context for remembered set verification:
>>
>>
>> ShenandoahVerifier
>> ShenandoahMarkingContext* ShenandoahVerifier::get_marking_context_for_old() {
>>   shenandoah_assert_generations_reconciled();
>>   if (_heap->old_generation()->is_mark_complete() || _heap->gc_generation()->is_global()) {
>>     return _heap->complete_marking_context();
>>   }
>>   return nullptr;
>> }
>>
>>
>> For concurrent young GC cycles after a full GC, the old marking context used for remembered set verification is therefore stale and may cause unexpected results.
>>
>> 2. The implementation of `ShenandoahVerifier::get_marking_context_for_old` shown above always returns a marking context for global GC, but the marking bitmaps have already been reset before init-mark, so `ShenandoahVerifier::help_verify_region_rem_set` always skips verification in this case.
>>
>> 3. ShenandoahConcurrentGC always cleans the remembered set read table, but only swaps the read/write tables when the GC generation is young. This causes remembered set verification before init-mark to use a completely clean remembered set, but the problem is hidden by issue 2.
>>
>> 4. After a concurrent young cycle evacuates objects from a young region, it updates references using the marking bitmaps from the marking context, so it won't update references held by dead old objects (`is_marked(obj)` is false: obj is not marked strong/weak and it is below TAMS). In this case, if the next cycle is a global concurrent GC, the remembered set can't be verified before init-mark because of the dead pointers.
>>
>> ### Solution
>> * After a full GC, always set the marking completeness flag to false after resetting the marking bitmaps.
>> * Because dead pointers in old gen may not have been updated to point to the new addresses after evacuation and update-refs, we should disable rem-set validation before init-mark and update-refs if the old marking context is incomplete.
>>
>> ### Test
>> - [x] `make test TEST=hotspot_gc_shenandoah`
>> - [x] GHA
>
> Xiaolong Peng has updated the pull request incrementally with one additional commit since the last revision:
>
> Add comments
I have reproduced the bug https://bugs.openjdk.org/browse/JDK-8345399 on ppc64le hardware with tip. The crash happens in a young cycle after a full GC, which is one of the problems I'm trying to fix in this PR:
[13.990s][info][gc,start ] GC(101) Pause Full
[13.990s][info][gc,task ] GC(101) Using 4 of 4 workers for full gc
[13.990s][info][gc,start ] GC(101) Verify Before Full GC, Level 4
[13.998s][info][gc ] GC(101) Verify Before Full GC, Level 4 (22772 reachable, 0 marked)
[13.998s][info][gc,phases,start] GC(101) Phase 1: Mark live objects
[14.003s][info][gc,ref ] GC(101) Clearing All SoftReferences
[14.003s][info][gc,ref ] GC(101) Clearing All SoftReferences
[14.009s][info][gc,ref ] GC(101) Encountered references: Soft: 49, Weak: 101, Final: 0, Phantom: 8
[14.009s][info][gc,ref ] GC(101) Discovered references: Soft: 31, Weak: 39, Final: 0, Phantom: 8
[14.009s][info][gc,ref ] GC(101) Enqueued references: Soft: 0, Weak: 0, Final: 0, Phantom: 0
[14.012s][info][gc,phases ] GC(101) Phase 1: Mark live objects 13.674ms
[14.012s][info][gc,phases,start] GC(101) Phase 2: Compute new object addresses
[14.026s][info][gc,phases ] GC(101) Phase 2: Compute new object addresses 14.166ms
[14.026s][info][gc,phases,start] GC(101) Phase 3: Adjust pointers
[14.030s][info][gc,phases ] GC(101) Phase 3: Adjust pointers 3.626ms
[14.030s][info][gc,phases,start] GC(101) Phase 4: Move objects
[14.128s][info][gc,phases ] GC(101) Phase 4: Move objects 98.264ms
[14.128s][info][gc,phases,start] GC(101) Phase 5: Full GC epilog
[14.146s][info][gc,ergo ] GC(101) Transfer 234 region(s) from Old to Young, yielding increased size: 790M
[14.146s][info][gc,ergo ] GC(101) FullGC done: young usage: 450M, old usage: 231M
[14.146s][info][gc,free ] Free: 296M, Max: 512K regular, 296M humongous, Frag: 0% external, 0% internal; Used: 0B, Mutator Free: 592 Collector Reserve: 40959K, Max: 512K; Used: 16B Old Collector Reserve: 1307K, Max: 511K; Used: 740K
[14.146s][info][gc,ergo ] GC(101) After Full GC, successfully transferred 0 regions to none to prepare for next gc, old available: 1307K, young_available: 296M
[14.146s][info][gc,barrier ] GC(101) Cleaned read_table from 0x0000754a50290000 to 0x0000754a5048ffff
[14.146s][info][gc,barrier ] GC(101) Current write_card_table: 0x0000754a4fc90000
[14.148s][info][gc,phases ] GC(101) Phase 5: Full GC epilog 20.265ms
[14.148s][info][gc,start ] GC(101) Verify After Full GC, Level 4
[14.182s][info][gc ] GC(101) Verify After Full GC, Level 4 (22664 reachable, 125 marked)
[14.182s][info][gc,ergo ] GC(101) At end of Full GC: GCU: 6.9%, MU: 9.9% during period of 0.261s
[14.182s][info][gc,ergo ] GC(101) At end of Full GC: Young generation used: 450M, used regions: 454M, humongous waste: 3532K, soft capacity: 1024M, max capacity: 790M, available: 296M
[14.182s][info][gc,ergo ] GC(101) At end of Full GC: Old generation used: 231M, used regions: 234M, humongous waste: 1654K, soft capacity: 0B, max capacity: 234M, available: 1307K
[14.182s][info][gc,ergo ] GC(101) Good progress for free space: 296M, need 10485K
[14.182s][info][gc,ergo ] GC(101) Good progress for used space: 148M, need 512K
[14.182s][info][gc ] GC(101) Pause Full 829M->681M(1024M) 192.311ms
...
[14.196s][info][gc ] Trigger (Young): Free (65536K) is below minimum threshold (80895K)
[14.196s][info][gc,free ] Free: 65536K, Max: 512K regular, 65536K humongous, Frag: 0% external, 0% internal; Used: 0B, Mutator Free: 128 Collector Reserve: 40959K, Max: 512K; Used: 16B Old Collector Reserve: 1307K, Max: 511K; Used: 740K
[14.196s][info][gc,ergo ] GC(102) Start GC cycle (Young)
[14.196s][info][gc,start ] GC(102) Concurrent reset (Young)
[14.196s][info][gc,task ] GC(102) Using 2 of 4 workers for Concurrent reset (Young)
[14.196s][info][gc,ergo ] GC(102) Pacer for Reset. Non-Taxable: 1024M
Allocated: 732 Mb
Allocated: 699 Mb
Allocated: 715 Mb
[14.200s][info][gc,thread ] Cancelling GC: unknown GCCause
[14.200s][info][gc ] Failed to allocate Shared, 61709K
[14.202s][info][gc ] GC(102) Concurrent reset (Young) 6.371ms
[14.203s][info][gc,barrier ] GC(102) Cleaned read_table from 0x0000754a50080000 to 0x0000754a5027ffff
[14.203s][info][gc,start ] GC(102) Pause Init Mark (Young)
[14.203s][info][gc,task ] GC(102) Using 4 of 4 workers for init marking
[14.205s][info][gc,barrier ] GC(102) Current write_card_table: 0x0000754a4fa80000
[14.205s][info][gc,start ] GC(102) Verify Before Mark, Level 4
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/home/xlpeng/repos/jdk/src/hotspot/share/gc/shenandoah/shenandoahVerifier.cpp:1270), pid=2167519, tid=2167538
# Error: Verify init-mark remembered set violation; clean card, it should be dirty.
Referenced from:
interior location: 0x00000000c00c2bfc
inside Java heap
not in collection set
region: | 1|R |O|BTE c0080000, c00c2c78, c0100000|TAMS c0080000|UWM c00c2c78|U 267K|T 0B|G 0B|P 0B|S 267K|L 267K|CP 0
Object:
0x00000000e8c00000 - klass 0x000001df001abfa0 [I
not allocated after mark start
not after update watermark
not marked strong
not marked weak
not in collection set
age: 0
mark: mark(is_unlocked no_hash age=0)
region: | 1304|H |Y|BTE e8c00000, e8c80000, e8c80000|TAMS e8c80000|UWM e8c80000|U 512K|T 0B|G 0B|P 0B|S 512K|L 0B|CP 0
Forwardee:
(the object itself)
I'll run the same test to confirm whether this PR fixes the bug.
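
For readers following along, the failure mode behind this crash (a young cycle after a full GC consulting a stale old marking context) can be sketched with a toy model. This is not HotSpot code: `ToyMarkingContext`, `context_for_verification`, `full_gc_epilog`, and `verifier_sees_stale_context` are hypothetical names invented for illustration, and the `fix_applied` switch only approximates the first bullet under "Solution".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a marking context (hypothetical, not the HotSpot class):
// one mark bit per object slot plus a "marking is complete" flag.
struct ToyMarkingContext {
    std::vector<bool> bitmap;
    bool complete = false;

    explicit ToyMarkingContext(std::size_t slots) : bitmap(slots, false) {}

    // Clears every mark bit but deliberately leaves the flag untouched.
    void reset_bitmap() { bitmap.assign(bitmap.size(), false); }
};

// Same shape as the verifier helper quoted above: the context is handed
// out for verification only when it is flagged complete.
const ToyMarkingContext* context_for_verification(const ToyMarkingContext& ctx) {
    return ctx.complete ? &ctx : nullptr;
}

// Toy full GC epilog. Without the fix the bitmap is cleared while the
// completeness flag keeps its old value; with the fix the flag is cleared too.
void full_gc_epilog(ToyMarkingContext& ctx, bool fix_applied) {
    ctx.reset_bitmap();
    if (fix_applied) {
        ctx.complete = false;
    }
}

// Returns true when the verifier would trust a context whose bitmap no
// longer holds the mark of an object that was marked before the full GC.
bool verifier_sees_stale_context(bool fix_applied) {
    ToyMarkingContext ctx(4);
    ctx.bitmap[2] = true;   // an old object marked by an earlier old cycle
    ctx.complete = true;    // that old marking had finished
    full_gc_epilog(ctx, fix_applied);
    const ToyMarkingContext* seen = context_for_verification(ctx);
    return seen != nullptr && !seen->bitmap[2];
}
```

In this model, `verifier_sees_stale_context(false)` reports the stale-context situation, while `verifier_sees_stale_context(true)` makes the helper return no context, so verification would be skipped instead of running against cleared bitmaps.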
-------------
PR Comment: https://git.openjdk.org/jdk/pull/24092#issuecomment-2759904598