RFR: 8241603: ZGC: java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on macOS
Per Liden
per.liden at oracle.com
Thu Apr 23 12:37:25 UTC 2020
Thanks all, for reviewing and testing! I'll go ahead and push this.
/Per
On 4/23/20 1:59 PM, Zeller, Arno wrote:
> Hi Per.
>
> Looks fine to me too. It seems to have solved our problems. Great work!
> Thanks a lot.
>
> Best regards,
> Arno
>
>> -----Original Message-----
>> From: Baesken, Matthias <matthias.baesken at sap.com>
>> Sent: Thursday, 23 April 2020 12:45
>> To: Langer, Christoph <christoph.langer at sap.com>; Per Liden
>> <per.liden at oracle.com>; hotspot-gc-dev
>> <hotspot-gc-dev at openjdk.java.net>
>> Cc: Zeller, Arno <arno.zeller at sap.com>
>> Subject: RE: RFR: 8241603: ZGC:
>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>> macOS
>>
>> Hi,
>>
>>> So, I think this is good to go (unless Arno or Matthias disagree)
>>
>> Looks good to me as well.
>>
>> Best regards, Matthias
>>
>> -----Original Message-----
>> From: Langer, Christoph <christoph.langer at sap.com>
>> Sent: Thursday, 23 April 2020 11:12
>> To: Per Liden <per.liden at oracle.com>; hotspot-gc-dev
>> <hotspot-gc-dev at openjdk.java.net>
>> Cc: Baesken, Matthias <matthias.baesken at sap.com>; Zeller, Arno
>> <arno.zeller at sap.com>
>> Subject: RE: RFR: 8241603: ZGC:
>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>> macOS
>>
>> Hi Per,
>>
>> indeed it was strange that both patches seemed to apply.
>>
>> However, after I corrected it, I didn't see issues with the test case anymore. I
>> also put a version of the fix for jdk14u into our 14u test queue, where we
>> had seen the problem as well, and there we didn't see problems with the test
>> afterwards either. So, I think this is good to go (unless Arno or Matthias disagree). 😊
>>
>> Cheers
>> Christoph
>>
>>> -----Original Message-----
>>> From: Per Liden <per.liden at oracle.com>
>>> Sent: Sunday, 19 April 2020 13:03
>>> To: Langer, Christoph <christoph.langer at sap.com>; hotspot-gc-dev
>>> <hotspot-gc-dev at openjdk.java.net>
>>> Cc: Baesken, Matthias <matthias.baesken at sap.com>; Zeller, Arno
>>> <arno.zeller at sap.com>
>>> Subject: Re: RFR: 8241603: ZGC:
>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>>> macOS
>>>
>>> Hi Christoph,
>>>
>>> On 4/19/20 8:11 AM, Langer, Christoph wrote:
>>>> Hi Per,
>>>>
>>>> we've encountered the crash once again, but I just discovered that your
>>>> original patch of http://cr.openjdk.java.net/~pliden/8241603/webrev.0/ was
>>>> applied as well. They didn't seem to interfere. I have removed the old one.
>>>> Let's wait for clear results tomorrow.
>>>
>>> That's a bit strange since webrev.1 does not apply cleanly on top of
>>> webrev.0 (nor vice versa), so you should have seen a conflict there.
>>>
>>> cheers,
>>> Per
>>>
>>>>
>>>> Cheers
>>>> Christoph
>>>>
>>>>> -----Original Message-----
>>>>> From: Per Liden <per.liden at oracle.com>
>>>>> Sent: Thursday, 16 April 2020 12:58
>>>>> To: Zeller, Arno <arno.zeller at sap.com>; hotspot-gc-dev
>>>>> <hotspot-gc-dev at openjdk.java.net>
>>>>> Cc: Baesken, Matthias <matthias.baesken at sap.com>; Langer, Christoph
>>>>> <christoph.langer at sap.com>
>>>>> Subject: Re: RFR: 8241603: ZGC:
>>>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>>>>> macOS
>>>>>
>>>>> Hi,
>>>>>
>>>>> I think I've figured out what's going on here. The code querying the
>>>>> x2APIC id miscalculates the possible number space for the mapping table.
>>>>> There are actually two bugs here:
>>>>>
>>>>> 1) The code assumes that cpuid leaf 0xb will give us the "level shift"
>>>>> for the socket ("package") sub id. But that's not the case; you only get
>>>>> it for the "core" and "thread" sub ids.
>>>>>
>>>>> 2) We incorrectly accumulate the "level shift" values from the "core"
>>>>> and "thread" levels, instead of picking the max. This has the effect of
>>>>> sometimes hiding bug #1, for example, when hyper-threading is enabled
>>>>> on a 2 socket machine.
>>>>>
>>>>> As the comment in the patch describes, I'm falling back to using the
>>>>> initial APIC ids, instead of x2APIC ids. While this can be sub-optimal,
>>>>> I don't believe it's a big problem in practice.
>>>>>
>>>>> Could you please take this new patch for a spin and see if we've
>>>>> finally fixed the problem? Fingers crossed!
>>>>>
>>>>> http://cr.openjdk.java.net/~pliden/8241603/webrev.1/
>>>>>
>>>>> cheers,
>>>>> Per
>>>>>
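The two bugs described in the message above can be illustrated with a small sketch. It is not the HotSpot code and does not query cpuid; the shift values are hypothetical, the function names are invented for the example, and it assumes that the "core" level shift counts bits from bit 0 of the id (so it already covers the "thread" bits).

```python
# Illustrative sketch only. The shift values are hypothetical and are not
# read from real hardware; the function names are made up for this example.

def accumulated_shift(level_shifts):
    """Sum the per-level shifts (the behaviour described as bug #2)."""
    return sum(level_shifts.values())


def max_shift(level_shifts):
    """Pick the largest per-level shift instead of summing."""
    return max(level_shifts.values())


def table_size(shift):
    """Number of ids representable in `shift` bits."""
    return 1 << shift


def fits(x2apic_id, shift):
    """True if the id indexes inside a table sized from `shift`."""
    return x2apic_id < table_size(shift)


# Hypothetical 2-socket machine with hyper-threading enabled:
# 1 bit for the thread sub id, 4 bits for thread + core together.
with_ht = {"thread": 1, "core": 4}
# First logical CPU on the second socket: its package bit sits above bit 3.
second_socket_id = 1 << with_ht["core"]

print(fits(second_socket_id, max_shift(with_ht)))          # False
print(fits(second_socket_id, accumulated_shift(with_ht)))  # True

# Hypothetical 2-socket machine without hyper-threading.
without_ht = {"thread": 0, "core": 3}
second_socket_id_no_ht = 1 << without_ht["core"]
print(fits(second_socket_id_no_ht, accumulated_shift(without_ht)))  # False
```

Under these assumptions, summing the shifts happens to leave room for the second socket when hyper-threading is on, which is one way bug #2 could mask bug #1. Neither calculation accounts for the package bits, since no shift is reported for them.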
>>>>> On 4/14/20 1:12 PM, Per Liden wrote:
>>>>>> Thanks a lot for testing! The APIC id issue seems to come in a slightly
>>>>>> different shape from what I expected. I'll try to dig deeper and get back.
>>>>>>
>>>>>> cheers,
>>>>>> Per
>>>>>>
>>>>>> On 4/9/20 8:46 AM, Zeller, Arno wrote:
>>>>>>> Hi Per,
>>>>>>>
>>>>>>> thanks for trying to find a solution for this issue! I am sorry to
>>>>>>> report that the patch did not help. The SIGSEGV still occurs. I copied
>>>>>>> some parts of the hs_err file below.
>>>>>>>
>>>>>>> The VMware VM is always configured to have 6 cores. The difference is
>>>>>>> that, in the crashing case, it is configured as 2 * 3 cores. When
>>>>>>> set to 1 * 6 cores it works fine.
>>>>>>> Sorry for not being able to give you better information. I have no
>>>>>>> direct access to the hypervisor myself and have to ask our IT
>>>>>>> colleagues to make changes and then explain to me what they have
>>>>>>> done 😊.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Arno
>>>>>>> ----
>>>>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>>>>> # SIGSEGV (0xb) at pc=0x000000010c3aff88, pid=74065, tid=9219
>>>>>>> ...
>>>>>>> Host: MacPro6,1 x86_64 3337 MHz, 6 cores, 16G, Darwin 18.5.0
>>>>>>> Time: Thu Apr 9 00:12:46 2020 CEST elapsed time: 0.160864 seconds (0d 0h 0m 0s)
>>>>>>> ...
>>>>>>> Current thread (0x00007fe038801000): JavaThread "main" [_thread_in_vm, id=9219, stack(0x000070000d45e000,0x000070000d55e000)]
>>>>>>>
>>>>>>> Stack: [0x000070000d45e000,0x000070000d55e000], sp=0x000070000d55d380, free space=1020k
>>>>>>> Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
>>>>>>> V [libjvm.dylib+0x7a0f88] ZCPU::id_slow()+0x56
>>>>>>> V [libjvm.dylib+0x7aef1b] ZObjectAllocator::shared_small_page_addr() const+0x41
>>>>>>> V [libjvm.dylib+0x7af7d9] ZObjectAllocator::remaining() const+0x9
>>>>>>> V [libjvm.dylib+0x7a4369] ZHeap::unsafe_max_tlab_alloc() const+0xd
>>>>>>> V [libjvm.dylib+0x56218b] ThreadLocalAllocBuffer::compute_size(unsigned long)+0x33
>>>>>>> V [libjvm.dylib+0x562080] MemAllocator::allocate_inside_tlab_slow(MemAllocator::Allocation&) const+0xca
>>>>>>> V [libjvm.dylib+0x562270] MemAllocator::mem_allocate(MemAllocator::Allocation&) const+0x24
>>>>>>> V [libjvm.dylib+0x5622d1] MemAllocator::allocate() const+0x47
>>>>>>> V [libjvm.dylib+0x7a1318] ZCollectedHeap::array_allocate(Klass*, int, int, bool, Thread*)+0x28
>>>>>>> V [libjvm.dylib+0x32c0c7] InstanceKlass::allocate_objArray(int, int, Thread*)+0xd7
>>>>>>> ----
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Per Liden <per.liden at oracle.com>
>>>>>>>> Sent: Tuesday, 7 April 2020 12:53
>>>>>>>> To: Baesken, Matthias <matthias.baesken at sap.com>; hotspot-gc-dev
>>>>>>>> <hotspot-gc-dev at openjdk.java.net>; Langer, Christoph
>>>>>>>> <christoph.langer at sap.com>; Zeller, Arno <arno.zeller at sap.com>
>>>>>>>> Subject: Re: RFR: 8241603: ZGC:
>>>>>>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>>>>>>>> macOS
>>>>>>>>
>>>>>>>> Thanks! Just checking, are you testing without the workaround[1] you
>>>>>>>> did to your VMware instances?
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>> Per
>>>>>>>>
>>>>>>>> [1] "We solved our issue by reconfiguring the VMWare VM to have no
>>>>>>>> hyperthreading and have the CPUs pinned to the VM. This solved the
>>>>>>>> issues for us." -
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8241603?focusedCommentId=14327438&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14327438
>>>>>>>>
>>>>>>>>
>>>>>>>> On 4/7/20 12:07 PM, Baesken, Matthias wrote:
>>>>>>>>> Hi Per, I put your patch into our build/test queue.
>>>>>>>>>
>>>>>>>>> Best regards, Matthias
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Per Liden <per.liden at oracle.com>
>>>>>>>>> Sent: Monday, 6 April 2020 17:04
>>>>>>>>> To: hotspot-gc-dev <hotspot-gc-dev at openjdk.java.net>; Langer,
>>>>>>>>> Christoph <christoph.langer at sap.com>; Baesken, Matthias
>>>>>>>>> <matthias.baesken at sap.com>; Zeller, Arno <arno.zeller at sap.com>
>>>>>>>>> Subject: RFR: 8241603: ZGC:
>>>>>>>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>>>>>>>>> macOS
>>>>>>>>>
>>>>>>>>> It was reported that "Every few days, the test
>>>>>>>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on macOS.
>>>>>>>>> It is macOS 10.14.4, and it is a virtualized machine running with
>>>>>>>>> VMWare hypervisor."
>>>>>>>>>
>>>>>>>>> The problem seems to be that the hypervisor (in some configurations)
>>>>>>>>> can migrate a "virtual CPU" from one physical CPU to another, and start
>>>>>>>>> to report a different APIC id. As a result, it can appear as if there
>>>>>>>>> are more than os::processor_count() CPUs in the system. To avoid this,
>>>>>>>>> we allow more than one APIC id to be mapped to the same logical
>>>>>>>>> processor id, so that os::processor_id() always returns a processor id
>>>>>>>>> that is less than os::processor_count().
>>>>>>>>>
>>>>>>>>> One could argue that this is really a hypervisor bug, but we can still
>>>>>>>>> make an effort to mitigate the problem in the JVM.
>>>>>>>>>
>>>>>>>>> SAP-folks (CC:ing those who commented in the bug), since you ran into
>>>>>>>>> this problem and I don't have access to a VMware setup where I can
>>>>>>>>> test/reproduce this, could you please test this patch to verify it
>>>>>>>>> solves the problem? If so, that would be much appreciated.
>>>>>>>>>
>>>>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8241603
>>>>>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8241603/webrev.0
>>>>>>>>> Testing: Tier 1-6 on macOS (but not macOS on top of VMware)
>>>>>>>>>
>>>>>>>>> cheers,
>>>>>>>>> Per
>>>>>>>>>
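The mitigation described in the original review request (letting more than one APIC id map to the same logical processor id) can be sketched roughly as follows. This is only an illustration of the idea, not the code in the webrev: the class name `ProcessorIdMap`, the wrap-around assignment policy, and the sample APIC ids are all assumptions made for the example.

```python
# Illustrative sketch only: a hypothetical mapping from APIC ids to logical
# processor ids that never returns an id >= processor_count.

class ProcessorIdMap:
    def __init__(self, processor_count):
        self._count = processor_count
        self._next = 0   # number of distinct APIC ids seen so far
        self._map = {}   # APIC id -> logical processor id

    def processor_id(self, apic_id):
        """Return a stable logical id in [0, processor_count) for apic_id."""
        if apic_id not in self._map:
            # Wrap around so that an unexpected extra APIC id shares a
            # logical id with an earlier one instead of overflowing.
            self._map[apic_id] = self._next % self._count
            self._next += 1
        return self._map[apic_id]


# Six CPUs are reported, but eight distinct APIC ids show up over time,
# for example after a virtual CPU has been migrated by the hypervisor.
id_map = ProcessorIdMap(processor_count=6)
seen = [id_map.processor_id(a) for a in [0, 2, 4, 6, 8, 10, 32, 34]]
print(seen)  # [0, 1, 2, 3, 4, 5, 0, 1]
```

In this sketch a late-arriving APIC id shares a logical id with an earlier one, so two CPUs may end up using the same per-CPU slot. That trades some locality for the guarantee that the returned id stays below the processor count.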