RFR: 8241603: ZGC: java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on macOS
Per Liden
per.liden at oracle.com
Thu Apr 16 10:57:41 UTC 2020
Hi,
I think I've figured out what's going on here. The code querying the
x2APIC id miscalculates the possible number space for the mapping table.
There are actually two bugs here:
1) The code assumes that cpuid leaf 0xb will give us the "level shift"
for the socket ("package") sub id. But that's not the case, you only get
it for the "core" and "thread" sub ids.
2) We incorrectly accumulate the "level shift" values from the "core"
and "thread" levels, instead of picking the max. This has the effect of
sometimes hiding bug #1, for example, when hyper-threading is enabled on
a 2 socket machine.
As the comment in the patch describes, I'm falling back to using the
initial APIC ids, instead of x2APIC ids. While this can be sub-optimal,
I don't believe it's a big problem in practice.
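
To make bug #2 concrete, here is a minimal sketch. It is not the code
from the patch: the helper names and the shift values used with them are
made up for illustration, and the sketch does not execute cpuid itself.
It assumes that the "level shift" reported for the "core" level already
includes the bits used by the "thread" level, which is why the max (and
not the sum) gives the width of the id space below the package level.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helpers, not HotSpot code. Each vector entry stands for the
// "level shift" reported by cpuid leaf 0xb for one level (thread, core).

// Bug #2: summing the shifts. Under the assumption stated above, this
// counts the thread-level bits twice.
inline uint32_t id_bits_accumulated(const std::vector<uint32_t>& level_shifts) {
  uint32_t bits = 0;
  for (uint32_t shift : level_shifts) {
    bits += shift;
  }
  return bits;
}

// Picking the max instead: the number of low id bits used by the
// thread and core levels together.
inline uint32_t id_bits_max(const std::vector<uint32_t>& level_shifts) {
  assert(!level_shifts.empty());
  return *std::max_element(level_shifts.begin(), level_shifts.end());
}

// Number of entries in a mapping table indexed by an id of that width.
inline uint64_t table_size(uint32_t id_bits) {
  assert(id_bits < 64);
  return uint64_t{1} << id_bits;
}
```

With made-up shifts {1, 4} (hyper-threading enabled), the sum is 5 and
the max is 4, so the accumulated version sizes the table at 32 entries
instead of 16. That extra bit can accidentally leave room for a second
package, which is how bug #2 can hide bug #1. With shifts {0, 3}
(hyper-threading disabled), both versions give 3 bits and an 8-entry
table, and an id from a second package (8 or higher in this example)
falls outside the table.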
Could you please take this new patch for a spin and see if we've
finally fixed the problem? Fingers crossed!
http://cr.openjdk.java.net/~pliden/8241603/webrev.1/
cheers,
Per
On 4/14/20 1:12 PM, Per Liden wrote:
> Thanks a lot for testing! The APIC id issue seems to come in a slightly
> different shape from what I expected. I'll try to dig deeper and get back.
>
> cheers,
> Per
>
> On 4/9/20 8:46 AM, Zeller, Arno wrote:
>> Hi Per,
>>
>> thanks for trying to find a solution for this issue! I am sorry to
>> report that the patch did not help. The SIGSEGV still occurs. I copied
>> some parts of the hs_err file below
>>
>> The VMware VM is always configured to have 6 cores. The difference is
>> that, in the crashing case, it is configured to have 2 * 3 cores. When
>> it is set to 1 * 6 cores, it works fine.
>> Sorry for not being able to give you better information. I have no
>> direct access to the hypervisor myself and have to ask our IT
>> colleagues to do changes and then to explain to me what they have done
>> 😊.
>>
>> Best regards,
>> Arno
>> ----
>> # A fatal error has been detected by the Java Runtime Environment:
>> # SIGSEGV (0xb) at pc=0x000000010c3aff88, pid=74065, tid=9219
>> ...
>> Host: MacPro6,1 x86_64 3337 MHz, 6 cores, 16G, Darwin 18.5.0
>> Time: Thu Apr 9 00:12:46 2020 CEST elapsed time: 0.160864 seconds (0d
>> 0h 0m 0s)
>> ...
>> Current thread (0x00007fe038801000): JavaThread "main"
>> [_thread_in_vm, id=9219, stack(0x000070000d45e000,0x000070000d55e000)]
>>
>> Stack: [0x000070000d45e000,0x000070000d55e000],
>> sp=0x000070000d55d380, free space=1020k
>> Native frames: (J=compiled Java code, A=aot compiled Java code,
>> j=interpreted, Vv=VM code, C=native code)
>> V [libjvm.dylib+0x7a0f88] ZCPU::id_slow()+0x56
>> V [libjvm.dylib+0x7aef1b] ZObjectAllocator::shared_small_page_addr()
>> const+0x41
>> V [libjvm.dylib+0x7af7d9] ZObjectAllocator::remaining() const+0x9
>> V [libjvm.dylib+0x7a4369] ZHeap::unsafe_max_tlab_alloc() const+0xd
>> V [libjvm.dylib+0x56218b]
>> ThreadLocalAllocBuffer::compute_size(unsigned long)+0x33
>> V [libjvm.dylib+0x562080]
>> MemAllocator::allocate_inside_tlab_slow(MemAllocator::Allocation&)
>> const+0xca
>> V [libjvm.dylib+0x562270]
>> MemAllocator::mem_allocate(MemAllocator::Allocation&) const+0x24
>> V [libjvm.dylib+0x5622d1] MemAllocator::allocate() const+0x47
>> V [libjvm.dylib+0x7a1318] ZCollectedHeap::array_allocate(Klass*,
>> int, int, bool, Thread*)+0x28
>> V [libjvm.dylib+0x32c0c7] InstanceKlass::allocate_objArray(int, int,
>> Thread*)+0xd7
>> ----
>>
>>> -----Original Message-----
>>> From: Per Liden <per.liden at oracle.com>
>>> Sent: Dienstag, 7. April 2020 12:53
>>> To: Baesken, Matthias <matthias.baesken at sap.com>; hotspot-gc-dev
>>> <hotspot-gc-dev at openjdk.java.net>; Langer, Christoph
>>> <christoph.langer at sap.com>; Zeller, Arno <arno.zeller at sap.com>
>>> Subject: Re: RFR: 8241603: ZGC:
>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>>> macOS
>>>
>>> Thanks! Just checking, are you testing without the workaround[1] you did
>>> to your VMware instances?
>>>
>>> cheers,
>>> Per
>>>
>>> [1] "We solved our issue by reconfiguring the VMWare VM to have no
>>> hyperthreading and have the CPUs pinned to the VM. This solved the
>>> issues for us." -
>>> https://bugs.openjdk.java.net/browse/JDK-8241603?focusedCommentId=14327438&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14327438
>>>
>>>
>>> On 4/7/20 12:07 PM, Baesken, Matthias wrote:
>>>> Hi Per, I put your patch into our build/test queue.
>>>>
>>>> Best regards, Matthias
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Per Liden <per.liden at oracle.com>
>>>> Sent: Montag, 6. April 2020 17:04
>>>> To: hotspot-gc-dev <hotspot-gc-dev at openjdk.java.net>; Langer,
>>> Christoph <christoph.langer at sap.com>; Baesken, Matthias
>>> <matthias.baesken at sap.com>; Zeller, Arno <arno.zeller at sap.com>
>>>> Subject: RFR: 8241603: ZGC:
>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>>> macOS
>>>>
>>>> It was reported that "Every few days, the test
>>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>>>> macOS. It is macOS 10.14.4, and it is a virtualized machine running
>>>> with VMWare hypervisor."
>>>>
>>>> The problem seems to be that the hypervisor (in some configurations)
>>>> can migrate a "virtual CPU" from one physical CPU to another, and
>>>> start to report a different APIC id. As a result, it can appear as if
>>>> there are more than os::processor_count() CPUs in the system. To
>>>> avoid this, we allow more than one APIC id to be mapped to the same
>>>> logical processor id, so that os::processor_id() always returns a
>>>> processor id that is less than os::processor_count().
>>>>
>>>> One could argue that this is really a hypervisor bug, but we can still
>>>> make an effort to mitigate the problem in the JVM.
>>>>
>>>> SAP-folks (CC:ing those who commented in the bug), since you ran into
>>>> this problem and I don't have access to a VMware setup where I can
>>>> test/reproduce this, could you please test this patch to verify it
>>>> solves the problem? If so, that would be much appreciated.
>>>>
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8241603
>>>> Webrev: http://cr.openjdk.java.net/~pliden/8241603/webrev.0
>>>> Testing: Tier 1-6 on macOS (but not macOS on top of VMware)
>>>>
>>>> cheers,
>>>> Per
>>>>