[premain] need help merging g1BarrierSetC2.cpp
Andrew Dinn
adinn at redhat.com
Thu Oct 10 19:13:17 UTC 2024
On 10/10/2024 09:46, Andrew Dinn wrote:
> On 09/10/2024 22:23, ioi.lam at oracle.com wrote:
>> I'm also seeing failures on macos-aarch64 and linux-aarch64. In
>> addition to the tests I listed below, these two also fail reliably
>>
>> runtime/cds/appcds/applications/MicronautFirstApp.java#leyden
>> runtime/cds/appcds/applications/MicronautFirstApp.java#leyden_old
>
> I'll see if I can debug these failures.
I have been debugging MicronautFirstApp.java#leyden on aarch64 and found
a few details of what is going wrong but no fix yet.
The error happens in the production VM run. It is bailing out with a
SEGV at a pc that is at a mapped address but does not identify real
code. The problem appears consistently but not always in the same place.
I have been able to identify some details of what is happening under
calls to ConcurrentHashMap$ValueIterator::next or
ReentrantLock$NonfairSync::initialTryLock. It also crashes in
interpreted code but in that case I have not been yet able to decipher
the stack and identify the method.
When the crash happens it is under compiled code that has been loaded
from the AOT cache. The methods cited above are actually the ones whose
return address is on the stack when the SEGV happens. The preceding
instruction in both cases is a branch to a stub that links to another
method pulled from the cache.
So, for ConcurrentHashMap$ValueIterator::next the stub transfers control
to ConcurrentHashMap$Traverser::advance. The frame that sits above the
return address and frame pointer certainly looks like we are in method
advance (the frame is certainly the right size) but somehow we have
branched off into hyperspace and hit a SEGV.
I assumed this might relate to the GC change I made to the post barrier
to indirectly load grain and card shift from the AOTConstants area. The
called methods both included object writes and gc post barriers. So, I
reverted the GC Barrier code to use hard wired shifts. The error
continued to happen in the same places. So, I am leaning towards the
idea that this is some other conflict with the GC changes to implement
late barriers.
I'll keep on debugging and post when I make progress. Perhaps Vladimir
or Ashu could check the merge and also the tweak I made to the x86
barrier code to make the register pushes safe.
regards,
Andrew Dinn
-----------
>> On 10/9/24 9:02 AM, ioi.lam at oracle.com wrote:
>>> I am seeing some failures on linux/x64 with various assertion
>>> failures in the GC code, either during collection or gc
>>> verification. I merged up to ReentrantLock$NonfairSync::initialTryLocku
>>> d504df0e3d7928f0e7ff10ace03e002953c6a58f from Andrew's repo.
>>>
>>> runtime/cds/appcds/applications/HelidonQuickStartSE.java#leyden
>>> runtime/cds/appcds/applications/HelidonQuickStartSE.java#leyden_old
>>> runtime/cds/appcds/applications/JavacBench.java#leyden
>>> runtime/cds/appcds/applications/JavacBench.java#leyden_old
>>> runtime/cds/appcds/applications/SpringPetClinic.java#leyden_old
>>> runtime/cds/appcds/leyden/ExcludedClasses.java
>>>
>>> A sample stack trace
>>>
>>> # SIGSEGV (0xb) at pc=0x00007ff1d9260f78, pid=472145, tid=472252
>>> V [libjvm.so+0x636f78] oopDesc::size_given_klass(Klass*)+0x28
>>> (oop.inline.hpp:196)
>>> V [libjvm.so+0xd9e66d]
>>> G1ParScanThreadState::do_copy_to_survivor_space(G1HeapRegionAttr,
>>> oop, markWord)+0x9d (g1ParScanThreadState.cpp:480)
>>> V [libjvm.so+0xda43db] void
>>> G1ParScanThreadState::do_oop_evac<narrowOop>(narrowOop*)+0x12b
>>> (g1ParScanThreadState.cpp:221)
>>> V [libjvm.so+0xd9fa98]
>>> G1ParScanThreadState::trim_queue_to_threshold(unsigned int)+0x488
>>> (g1ParScanThreadState.cpp:334)
>>> V [libjvm.so+0xdb96fa]
>>> G1ParScanThreadState::trim_queue_partially()+0x3a
>>> (g1ParScanThreadState.inline.hpp:53)
>>> V [libjvm.so+0xdbe943] void
>>> G1ScanHRForRegionClosure::ChunkScanner::on_dirty_cards<G1ScanHRForRegionClosure::scan_heap_roots(G1HeapRegion*)::{lambda(unsigned char*, unsigned char*)#1}>(G1ScanHRForRegionClosure::scan_heap_roots(G1HeapRegion*)::{lambda(unsigned char*, unsigned char*)#1}&&)+0x2c3 (g1RemSet.cpp:519)
>>> V [libjvm.so+0xdb6f39]
>>> G1RemSet::scan_heap_roots(G1ParScanThreadState*, unsigned int,
>>> G1GCPhaseTimes::GCParPhases, G1GCPhaseTimes::GCParPhases, bool)+0x5e9
>>> (g1RemSet.cpp:664)
>>> V [libjvm.so+0xdea89b]
>>> G1EvacuateRegionsTask::scan_roots(G1ParScanThreadState*, unsigned
>>> int)+0x4b (g1YoungCollector.cpp:670)
>>> V [libjvm.so+0xdeac6a] G1EvacuateRegionsBaseTask::work(unsigned
>>> int)+0x8a (g1YoungCollector.cpp:656)
>>> V [libjvm.so+0x1a2b480] WorkerThread::run()+0x80 (workerThread.cpp:70)
>>> V [libjvm.so+0x18cc40a] Thread::call_run()+0xba (thread.cpp:245)
>>> V [libjvm.so+0x159e0f9] thread_native_entry(Thread*)+0x1c9
>>> (os_linux.cpp:858)
>>>
>>> siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr:
>>> 0x000000000000010e
>>>
>>> I'll run tests on aarch64 as well and will report back when I get the
>>> results.
>>>
>>> Thanks
>>>
>>> - Ioi
>>>
>>>
>>> On 10/9/24 8:23 AM, ioi.lam at oracle.com wrote:
>>>> Thanks Andrew.
>>>>
>>>> I'll test your merge on our CI and then push to premain.
>>>>
>>>> - Ioi
>>>>
>>>> On 10/9/24 7:46 AM, Andrew Dinn wrote:
>>>>> Hi Ioi/Vladimir,
>>>>>
>>>>> I merged the rest of master into premain, fixed the conflicts and
>>>>> pushed the result to my premain branch.
>>>>>
>>>>> https://urldefense.com/v3/__https://github.com/adinn/leyden/tree/premain__;!!ACWV5N9M2RV99hQ!L_H5Q7HrxgwFxgFHEDodn9bmN5PlYgD0PbjCSMB0D8AJpVB-zvJog8wGxXubkqGBozWm8phOCOc$
>>>>> The merged code built and ran ok on aarch64. On x86 it built ok but
>>>>> crashed during the assembly run.
>>>>>
>>>>> The problem was that the x86 post barrier can now be called from C2
>>>>> with tmp and tmp2 allocated by the register allocator. So, that
>>>>> means both rcx and rscratch1 are possible values for register tmp.
>>>>> This is a problem because the x86 AOT variant of the barrier code
>>>>> pushes rcx and rscratch1 to the stack, loads the region grain shift
>>>>> address in to rscratch1 and the grain shift into into rcx. It has
>>>>> to be able to use rcx in order to be able to execute shrptr(tmp).
>>>>> However, if tmp == rcx then the unshifted value is restored into
>>>>> tmp when rcx is popped.
>>>>>
>>>>> I pushed a follow-up patch to my premain that avoids this problem.
>>>>> It uses a pushed register to do the shift and ensures that register
>>>>> differs from tmp and rcx.
>>>>>
>>>>> regards,
>>>>>
>>>>>
>>>>> Andrew Dinn
>>>>> -----------
>>>>>
>>>>>
>>>>> On 09/10/2024 07:08, ioi.lam at oracle.com wrote:
>>>>>> In the premain branch, we have made some changes in
>>>>>> g1BarrierSetC2.cpp
>>>>>>
>>>>>> commit cc111760a0dd123935e039feb19eaac742f865b2
>>>>>> Author: Vladimir Kozlov <vladimir.kozlov at oracle.com>
>>>>>> Date: Mon Aug 26 14:21:28 2024
>>>>>>
>>>>>> Cleanup AOT runtime constants code
>>>>>>
>>>>>> commit c319a3ee2be0236feab88c9be66d71e6e93b53fc
>>>>>> Author: Andrew Dinn <adinn at redhat.com>
>>>>>> Date: Wed Aug 21 07:48:22 2024
>>>>>>
>>>>>> AOT G1 barrier loads region grain and card shifts via runtime
>>>>>> constants data area
>>>>>>
>>>>>> Recently, the following changes in g1BarrierSetC2.cpp have been
>>>>>> pushed to mainline that caused conflicts with the leyden changes.
>>>>>>
>>>>>> commit 81ebbb2463df8b014bb209dc4028668fc78e8327
>>>>>> Author: Roberto CastaƱeda Lozano <rcastanedalo at openjdk.org>
>>>>>> Date: Mon Oct 7 01:28:18 2024
>>>>>>
>>>>>> 8341525: G1: use bit clearing to remove tightly-coupled
>>>>>> initialization store pre-barriers
>>>>>>
>>>>>> commit 0b467e902d591ae9feeec1669918d1588987cd1c
>>>>>> Author: Roberto CastaƱeda Lozano <rcastanedalo at openjdk.org>
>>>>>> Date: Thu Oct 3 01:36:33 2024
>>>>>>
>>>>>> 8334060: Implementation of Late Barrier Expansion for G1
>>>>>>
>>>>>>
>>>>>> When I tried merging with mainline, I couldn't merge these
>>>>>> changes, so I ended up merging to this one (just an hour prior to
>>>>>> 0b467e9), and pushed the result to premain.
>>>>>>
>>>>>> commit c6e7e551928c04b74775b5d4c03eb31232aeb2c9 (master)
>>>>>> Author: Aleksey Shipilev <shade at openjdk.org>
>>>>>> Date: Thu Oct 3 00:25:42 2024
>>>>>>
>>>>>> 8341091: CDS: Segmented roots array misses roots
>>>>>>
>>>>>> Andrew Dinn or Vladimir Kozlov, could you help merging with mainline?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> - Ioi
>>>>>>
>>>>>
>>
More information about the leyden-dev
mailing list