RFR: 8241603: ZGC: java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on macOS
Per Liden
per.liden at oracle.com
Sun Apr 19 11:03:15 UTC 2020
Hi Christoph,
On 4/19/20 8:11 AM, Langer, Christoph wrote:
> Hi Per,
>
> we've encountered the crash once again, but I just discovered that your original patch from http://cr.openjdk.java.net/~pliden/8241603/webrev.0/ had been applied as well. They didn't seem to interfere. I have removed the old one. Let's wait for clear results tomorrow.
That's a bit strange since webrev.1 does not apply cleanly on top of
webrev.0 (nor vice versa), so you should have seen a conflict there.
cheers,
Per
>
> Cheers
> Christoph
>
>> -----Original Message-----
>> From: Per Liden <per.liden at oracle.com>
>> Sent: Thursday, 16 April 2020 12:58
>> To: Zeller, Arno <arno.zeller at sap.com>; hotspot-gc-dev
>> <hotspot-gc-dev at openjdk.java.net>
>> Cc: Baesken, Matthias <matthias.baesken at sap.com>; Langer, Christoph
>> <christoph.langer at sap.com>
>> Subject: Re: RFR: 8241603: ZGC:
>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>> macOS
>>
>> Hi,
>>
>> I think I've figured out what's going on here. The code querying the
>> x2APIC id miscalculates the possible number space for the mapping table.
>> There are actually two bugs here:
>>
>> 1) The code assumes that cpuid leaf 0xb will give us the "level shift"
>> for the socket ("package") sub id. But that's not the case; you only
>> get it for the "core" and "thread" sub ids.
>>
>> 2) We incorrectly accumulate the "level shift" values from the "core"
>> and "thread" levels, instead of picking the max. This has the effect of
>> sometimes hiding bug #1, for example, when hyper-threading is enabled on
>> a 2-socket machine.
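>>
>> For illustration, here's a minimal sketch (not the actual HotSpot
>> code, and assuming GCC/Clang's <cpuid.h> helpers) of how the leaf
>> 0xb enumeration should work, and where the accumulation goes wrong:
>>
>>   #include <cpuid.h>
>>   #include <stdint.h>
>>   #include <algorithm>
>>
>>   // For each sub-leaf of cpuid leaf 0xb, EAX[4:0] is the number of
>>   // bits to shift the x2APIC id right to get the id of the next
>>   // topology level, and ECX[15:8] is the level type (1 = thread,
>>   // 2 = core, 0 = invalid). Only the thread and core levels are
>>   // enumerated; there is no sub-leaf for the package level.
>>   static uint32_t x2apic_id_bits() {
>>     uint32_t bits = 0;
>>     for (uint32_t level = 0; ; level++) {
>>       uint32_t eax, ebx, ecx, edx;
>>       if (!__get_cpuid_count(0xb, level, &eax, &ebx, &ecx, &edx)) {
>>         break;
>>       }
>>       const uint32_t level_type = (ecx >> 8) & 0xff;
>>       if (level_type == 0) {
>>         break; // No more levels
>>       }
>>       // The shift at each level is cumulative (the core shift
>>       // already includes the thread bits), so pick the max.
>>       // Summing the shifts (bits += ...) double-counts the thread
>>       // bits, which is bug #2 and can mask the missing package
>>       // bits from bug #1.
>>       bits = std::max(bits, eax & 0x1fu);
>>     }
>>     return bits;
>>   }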
>>
>> As the comment in the patch describes, I'm falling back to using the
>> initial APIC ids, instead of x2APIC ids. While this can be sub-optimal,
>> I don't believe it's a big problem in practice.
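>>
>> For reference, the initial APIC id is the 8-bit id from cpuid leaf 1,
>> EBX bits 31:24. A rough sketch, again assuming GCC/Clang's <cpuid.h>:
>>
>>   #include <cpuid.h>
>>   #include <stdint.h>
>>
>>   // Read the 8-bit initial APIC id assigned to this CPU at reset.
>>   static uint32_t initial_apic_id() {
>>     uint32_t eax, ebx, ecx, edx;
>>     __get_cpuid(1, &eax, &ebx, &ecx, &edx);
>>     return (ebx >> 24) & 0xff;
>>   }
>>
>> The 8-bit cap is presumably part of why this can be sub-optimal on
>> very large machines.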
>>
>> Could you please take this new patch for a spin and see if we've
>> finally fixed the problem? Fingers crossed!
>>
>> http://cr.openjdk.java.net/~pliden/8241603/webrev.1/
>>
>> cheers,
>> Per
>>
>> On 4/14/20 1:12 PM, Per Liden wrote:
>>> Thanks a lot for testing! The APIC id issue seems to come in a slightly
>>> different shape from what I expected. I'll try to dig deeper and get back.
>>>
>>> cheers,
>>> Per
>>>
>>> On 4/9/20 8:46 AM, Zeller, Arno wrote:
>>>> Hi Per,
>>>>
>>>> Thanks for trying to find a solution for this issue! I am sorry to
>>>> report that the patch did not help. The SIGSEGV still occurs. I copied
>>>> some parts of the hs_err file below.
>>>>
>>>> The VMware VM is always configured to have 6 cores. The difference is
>>>> that in the crashing case it is configured as 2 sockets * 3 cores.
>>>> When set to 1 socket * 6 cores, it works fine.
>>>> Sorry for not being able to give you better information. I have no
>>>> direct access to the hypervisor myself and have to ask our IT
>>>> colleagues to make changes and then explain to me what they have
>>>> done 😊.
>>>>
>>>> Best regards,
>>>> Arno
>>>> ----
>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>> # SIGSEGV (0xb) at pc=0x000000010c3aff88, pid=74065, tid=9219
>>>> ...
>>>> Host: MacPro6,1 x86_64 3337 MHz, 6 cores, 16G, Darwin 18.5.0
>>>> Time: Thu Apr 9 00:12:46 2020 CEST elapsed time: 0.160864 seconds (0d
>>>> 0h 0m 0s)
>>>> ...
>>>> Current thread (0x00007fe038801000): JavaThread "main"
>>>> [_thread_in_vm, id=9219,
>>>> stack(0x000070000d45e000,0x000070000d55e000)]
>>>>
>>>> Stack: [0x000070000d45e000,0x000070000d55e000],
>>>> sp=0x000070000d55d380, free space=1020k
>>>> Native frames: (J=compiled Java code, A=aot compiled Java code,
>>>> j=interpreted, Vv=VM code, C=native code)
>>>> V [libjvm.dylib+0x7a0f88] ZCPU::id_slow()+0x56
>>>> V [libjvm.dylib+0x7aef1b] ZObjectAllocator::shared_small_page_addr()
>>>> const+0x41
>>>> V [libjvm.dylib+0x7af7d9] ZObjectAllocator::remaining() const+0x9
>>>> V [libjvm.dylib+0x7a4369] ZHeap::unsafe_max_tlab_alloc() const+0xd
>>>> V [libjvm.dylib+0x56218b]
>>>> ThreadLocalAllocBuffer::compute_size(unsigned long)+0x33
>>>> V [libjvm.dylib+0x562080]
>>>> MemAllocator::allocate_inside_tlab_slow(MemAllocator::Allocation&)
>>>> const+0xca
>>>> V [libjvm.dylib+0x562270]
>>>> MemAllocator::mem_allocate(MemAllocator::Allocation&) const+0x24
>>>> V [libjvm.dylib+0x5622d1] MemAllocator::allocate() const+0x47
>>>> V [libjvm.dylib+0x7a1318] ZCollectedHeap::array_allocate(Klass*,
>>>> int, int, bool, Thread*)+0x28
>>>> V [libjvm.dylib+0x32c0c7] InstanceKlass::allocate_objArray(int, int,
>>>> Thread*)+0xd7
>>>> ----
>>>>
>>>>> -----Original Message-----
>>>>> From: Per Liden <per.liden at oracle.com>
>>>>> Sent: Tuesday, 7 April 2020 12:53
>>>>> To: Baesken, Matthias <matthias.baesken at sap.com>; hotspot-gc-dev
>>>>> <hotspot-gc-dev at openjdk.java.net>; Langer, Christoph
>>>>> <christoph.langer at sap.com>; Zeller, Arno <arno.zeller at sap.com>
>>>>> Subject: Re: RFR: 8241603: ZGC:
>>>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>>>>> macOS
>>>>>
>>>>> Thanks! Just checking, are you testing without the workaround[1] you
>>>>> did to your VMware instances?
>>>>> to your VMware instances?
>>>>>
>>>>> cheers,
>>>>> Per
>>>>>
>>>>> [1] "We solved our issue by reconfiguring the VMWare VM to have no
>>>>> hyperthreading and have the CPUs pinned to the VM. This solved the
>>>>> issues for us." -
>>>>> https://bugs.openjdk.java.net/browse/JDK-8241603?focusedCommentId=14327438&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14327438
>>>>>
>>>>>
>>>>> On 4/7/20 12:07 PM, Baesken, Matthias wrote:
>>>>>> Hi Per, I put your patch into our build/test queue.
>>>>>>
>>>>>> Best regards, Matthias
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Per Liden <per.liden at oracle.com>
>>>>>> Sent: Monday, 6 April 2020 17:04
>>>>>> To: hotspot-gc-dev <hotspot-gc-dev at openjdk.java.net>; Langer,
>>>>>> Christoph <christoph.langer at sap.com>; Baesken, Matthias
>>>>>> <matthias.baesken at sap.com>; Zeller, Arno <arno.zeller at sap.com>
>>>>>> Subject: RFR: 8241603: ZGC:
>>>>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>>>>>> macOS
>>>>>>
>>>>>> It was reported that "Every few days, the test
>>>>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>>>>>> macOS. It is macOS 10.14.4, and it is a virtualized machine running
>>>>>> with VMWare hypervisor."
>>>>>>
>>>>>> The problem seems to be that the hypervisor (in some configurations)
>>>>>> can migrate a "virtual CPU" from one physical CPU to another, and
>>>>>> start to report a different APIC id. As a result, it can appear as if
>>>>>> there are more than os::processor_count() CPUs in the system. To
>>>>>> avoid this, we allow more than one APIC id to be mapped to the same
>>>>>> logical processor id, so that os::processor_id() always returns a
>>>>>> processor id that is less than os::processor_count().
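>>>>>>
>>>>>> To sketch the idea in code (hypothetical, not the actual patch;
>>>>>> names like APICIdMap are made up, and synchronization is omitted
>>>>>> for brevity):
>>>>>>
>>>>>>   #include <stdint.h>
>>>>>>
>>>>>>   // Hand out logical processor ids round-robin as new APIC ids
>>>>>>   // appear, so a "new" APIC id showing up after a migration just
>>>>>>   // reuses an id in [0, count).
>>>>>>   class APICIdMap {
>>>>>>   private:
>>>>>>     static const int UnknownId = -1;
>>>>>>     int      _map[256];  // Initial APIC id -> logical id
>>>>>>     uint32_t _count;     // os::processor_count()
>>>>>>     uint32_t _next;      // Next logical id to hand out
>>>>>>
>>>>>>   public:
>>>>>>     APICIdMap(uint32_t count) : _count(count), _next(0) {
>>>>>>       for (int i = 0; i < 256; i++) {
>>>>>>         _map[i] = UnknownId;
>>>>>>       }
>>>>>>     }
>>>>>>
>>>>>>     // Always returns an id < _count, even if the hypervisor
>>>>>>     // reports more than _count distinct APIC ids over time.
>>>>>>     uint32_t processor_id(uint32_t apic_id) {
>>>>>>       if (_map[apic_id] == UnknownId) {
>>>>>>         _map[apic_id] = (int)(_next++ % _count);
>>>>>>       }
>>>>>>       return (uint32_t)_map[apic_id];
>>>>>>     }
>>>>>>   };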
>>>>>>
>>>>>> One could argue that this is really a hypervisor bug, but we can still
>>>>>> make an effort to mitigate the problem in the JVM.
>>>>>>
>>>>>> SAP-folks (CC:ing those who commented in the bug), since you ran into
>>>>>> this problem and I don't have access to a VMware setup where I can
>>>>>> test/reproduce this, could you please test this patch to verify it
>>>>>> solves the problem? If so, that would be much appreciated.
>>>>>>
>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8241603
>>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8241603/webrev.0
>>>>>> Testing: Tier 1-6 on macOS (but not macOS on top of VMware)
>>>>>>
>>>>>> cheers,
>>>>>> Per
>>>>>>