RFR: 8241603: ZGC: java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on macOS

Erik Österlund erik.osterlund at oracle.com
Thu Apr 23 10:20:59 UTC 2020


Hi,

This looks good.

Thanks,
/Erik

On 2020-04-16 12:57, Per Liden wrote:
> Hi,
>
> I think I've figured out what's going on here. The code querying the 
> x2APIC id miscalculates the size of the id space used for the mapping 
> table. There are actually two bugs here:
>
> 1) The code assumes that cpuid leaf 0xb will give us the "level shift" 
> for the socket ("package") sub id. But that's not the case; you only 
> get it for the "core" and "thread" sub ids.
>
> 2) We incorrectly accumulate the "level shift" values from the "core" 
> and "thread" levels, instead of picking the max (see the sketch below). 
> This has the effect of sometimes hiding bug #1, for example when 
> hyper-threading is enabled on a 2-socket machine.
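>
> To make the arithmetic concrete, here is a rough sketch of the kind of 
> leaf 0xb enumeration involved (illustrative only, not the actual webrev; 
> it uses the __cpuid_count() macro from GCC/Clang's cpuid.h and a made-up 
> function name). Each level reports its shift relative to bit 0 of the 
> x2APIC id, so the shift below the package is the maximum reported shift, 
> not the sum, and no "package" level is ever reported:
>
> ----
> #include <cstdint>
> #include <cpuid.h>
>
> static uint32_t x2apic_package_shift() {
>   uint32_t max_shift = 0;
>   for (uint32_t level = 0; /* until an invalid level */; level++) {
>     uint32_t eax, ebx, ecx, edx;
>     __cpuid_count(0xb, level, eax, ebx, ecx, edx);
>     const uint32_t level_type = (ecx >> 8) & 0xff;  // 1 = SMT, 2 = Core
>     if (level_type == 0) {
>       break;                                        // no more levels
>     }
>     const uint32_t shift = eax & 0x1f;  // bits to strip to reach the next level
>     if (shift > max_shift) {
>       max_shift = shift;
>     }
>   }
>   // Note: leaf 0xb has no "package" level, so there is no socket
>   // "level shift" to be had here (bug #1 above).
>   return max_shift;
> }
> ----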
>
> As the comment in the patch describes, I'm falling back to using the 
> initial APIC ids, instead of x2APIC ids. While this can be 
> sub-optimal, I don't believe it's a big problem in practice.
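>
> Concretely, reading the initial APIC id only needs cpuid leaf 1, where 
> EBX bits 31:24 hold an 8-bit id, so the mapping table never needs more 
> than 256 entries regardless of what leaf 0xb reports. Again, just a 
> sketch of the idea (not the patch itself), using the __cpuid() macro 
> from cpuid.h:
>
> ----
> #include <cstdint>
> #include <cpuid.h>
>
> static uint32_t initial_apic_id() {
>   uint32_t eax, ebx, ecx, edx;
>   __cpuid(0x1, eax, ebx, ecx, edx);  // cpuid leaf 1
>   return (ebx >> 24) & 0xff;         // EBX[31:24] = initial APIC id
> }
> ----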
>
> Could you please take this new patch for a spin and see if we've 
> finally fixed the problem? Fingers crossed!
>
> http://cr.openjdk.java.net/~pliden/8241603/webrev.1/
>
> cheers,
> Per
>
> On 4/14/20 1:12 PM, Per Liden wrote:
>> Thanks a lot for testing! The APIC id issue seems to come in a 
>> slightly different shape from what I expected. I'll try to dig deeper 
>> and get back.
>>
>> cheers,
>> Per
>>
>> On 4/9/20 8:46 AM, Zeller, Arno wrote:
>>> Hi Per,
>>>
>>> Thanks for trying to find a solution for this issue! I am sorry to 
>>> report that the patch did not help; the SIGSEGV still occurs. I 
>>> copied some parts of the hs_err file below.
>>>
>>> The VMware VM is always configured with 6 cores. The difference is 
>>> that in the crashing case it is configured as 2 sockets * 3 cores; 
>>> when configured as 1 socket * 6 cores it works fine.
>>> Sorry for not being able to give you better information. I have no 
>>> direct access to the hypervisor myself and have to ask our IT 
>>> colleagues to make changes and then explain to me what they have 
>>> done 😊.
>>>
>>> Best regards,
>>> Arno
>>> ----
>>> # A fatal error has been detected by the Java Runtime Environment:
>>> #  SIGSEGV (0xb) at pc=0x000000010c3aff88, pid=74065, tid=9219
>>> ...
>>> Host: MacPro6,1 x86_64 3337 MHz, 6 cores, 16G, Darwin 18.5.0
>>> Time: Thu Apr  9 00:12:46 2020 CEST elapsed time: 0.160864 seconds (0d 0h 0m 0s)
>>> ...
>>> Current thread (0x00007fe038801000):  JavaThread "main" [_thread_in_vm, id=9219, stack(0x000070000d45e000,0x000070000d55e000)]
>>>
>>> Stack: [0x000070000d45e000,0x000070000d55e000], sp=0x000070000d55d380,  free space=1020k
>>> Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
>>> V  [libjvm.dylib+0x7a0f88]  ZCPU::id_slow()+0x56
>>> V  [libjvm.dylib+0x7aef1b]  ZObjectAllocator::shared_small_page_addr() const+0x41
>>> V  [libjvm.dylib+0x7af7d9]  ZObjectAllocator::remaining() const+0x9
>>> V  [libjvm.dylib+0x7a4369]  ZHeap::unsafe_max_tlab_alloc() const+0xd
>>> V  [libjvm.dylib+0x56218b]  ThreadLocalAllocBuffer::compute_size(unsigned long)+0x33
>>> V  [libjvm.dylib+0x562080]  MemAllocator::allocate_inside_tlab_slow(MemAllocator::Allocation&) const+0xca
>>> V  [libjvm.dylib+0x562270]  MemAllocator::mem_allocate(MemAllocator::Allocation&) const+0x24
>>> V  [libjvm.dylib+0x5622d1]  MemAllocator::allocate() const+0x47
>>> V  [libjvm.dylib+0x7a1318]  ZCollectedHeap::array_allocate(Klass*, int, int, bool, Thread*)+0x28
>>> V  [libjvm.dylib+0x32c0c7]  InstanceKlass::allocate_objArray(int, int, Thread*)+0xd7
>>> ----
>>>
>>>> -----Original Message-----
>>>> From: Per Liden <per.liden at oracle.com>
>>>> Sent: Tuesday, 7 April 2020 12:53
>>>> To: Baesken, Matthias <matthias.baesken at sap.com>; hotspot-gc-dev
>>>> <hotspot-gc-dev at openjdk.java.net>; Langer, Christoph
>>>> <christoph.langer at sap.com>; Zeller, Arno <arno.zeller at sap.com>
>>>> Subject: Re: RFR: 8241603: ZGC:
>>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
>>>> macOS
>>>>
>>>> Thanks! Just checking, are you testing without the workaround[1] you 
>>>> applied to your VMware instances?
>>>>
>>>> cheers,
>>>> Per
>>>>
>>>> [1] "We solved our issue by reconfiguring the VMWare VM to have no
>>>> hyperthreading and have the CPUs pinned to the VM. This solved the
>>>> issues for us." -
>>>> https://bugs.openjdk.java.net/browse/JDK-8241603?focusedCommentId=14327438&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14327438
>>>>
>>>>
>>>> On 4/7/20 12:07 PM, Baesken, Matthias wrote:
>>>>> Hi Per, I put your patch into our build/test queue.
>>>>>
>>>>> Best regards, Matthias
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Per Liden <per.liden at oracle.com>
>>>>> Sent: Monday, 6 April 2020 17:04
>>>>> To: hotspot-gc-dev <hotspot-gc-dev at openjdk.java.net>; Langer, 
>>>>> Christoph <christoph.langer at sap.com>; Baesken, Matthias 
>>>>> <matthias.baesken at sap.com>; Zeller, Arno <arno.zeller at sap.com>
>>>>> Subject: RFR: 8241603: ZGC: 
>>>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on macOS
>>>>>
>>>>> It was reported that "Every few days, the test
>>>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on macOS. 
>>>>> It is macOS 10.14.4, and it is a virtualized machine running with 
>>>>> VMWare hypervisor."
>>>>>
>>>>> The problem seems to be that the hypervisor (in some configurations) 
>>>>> can migrate a "virtual CPU" from one physical CPU to another, and 
>>>>> start to report a different APIC id. As a result, it can appear as 
>>>>> if there are more than os::processor_count() CPUs in the system. To 
>>>>> avoid this, we allow more than one APIC id to be mapped to the same 
>>>>> logical processor id, so that os::processor_id() always returns a 
>>>>> processor id that is less than os::processor_count().
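>>>>>
>>>>> To illustrate the idea (just a sketch with made-up names, not the
>>>>> actual webrev): APIC ids are lazily assigned small logical ids, and
>>>>> the assignment wraps around so the result always stays below
>>>>> processor_count(), even if the hypervisor ends up reporting more
>>>>> distinct APIC ids than that.
>>>>>
>>>>> ----
>>>>> #include <atomic>
>>>>> #include <cstdint>
>>>>>
>>>>> static const int MaxApicIds = 256;          // initial APIC id is 8 bits
>>>>> static std::atomic<int> id_map[MaxApicIds]; // zero-initialized statics:
>>>>>                                             // 0 = unassigned, else id + 1
>>>>> static std::atomic<int> next_id{0};
>>>>>
>>>>> static int logical_processor_id(uint32_t apic_id, int processor_count) {
>>>>>   int v = id_map[apic_id].load(std::memory_order_acquire);
>>>>>   if (v == 0) {
>>>>>     // First time this APIC id is seen: hand out the next logical id,
>>>>>     // modulo processor_count, so two APIC ids may share the same slot.
>>>>>     const int id = next_id.fetch_add(1) % processor_count;
>>>>>     int expected = 0;
>>>>>     if (id_map[apic_id].compare_exchange_strong(expected, id + 1)) {
>>>>>       return id;
>>>>>     }
>>>>>     v = expected;  // lost a race; use the winner's assignment
>>>>>   }
>>>>>   return v - 1;
>>>>> }
>>>>> ----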
>>>>>
>>>>> One could argue that this is really a hypervisor bug, but we can 
>>>>> still make an effort to mitigate the problem in the JVM.
>>>>>
>>>>> SAP-folks (CC:ing those who commented in the bug), since you ran into
>>>>> this problem and I don't have access to a VMware setup where I can
>>>>> test/reproduce this, could you please test this patch to verify it
>>>>> solves the problem? If so, that would be much appreciated.
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8241603
>>>>> Webrev: http://cr.openjdk.java.net/~pliden/8241603/webrev.0
>>>>> Testing: Tier 1-6 on macOS (but not macOS on top of VMware)
>>>>>
>>>>> cheers,
>>>>> Per
>>>>>



