RFR: 8241603: ZGC: java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on macOS

Langer, Christoph christoph.langer at sap.com
Sun Apr 19 06:11:35 UTC 2020


Hi Per,

we've encountered the crash once again, but I just discovered that your original patch from http://cr.openjdk.java.net/~pliden/8241603/webrev.0/ was still applied as well. The two didn't seem to interfere. I have removed the old one. Let's wait for clear results tomorrow.

Cheers
Christoph

> -----Original Message-----
> From: Per Liden <per.liden at oracle.com>
> Sent: Thursday, April 16, 2020 12:58
> To: Zeller, Arno <arno.zeller at sap.com>; hotspot-gc-dev
> <hotspot-gc-dev at openjdk.java.net>
> Cc: Baesken, Matthias <matthias.baesken at sap.com>; Langer, Christoph
> <christoph.langer at sap.com>
> Subject: Re: RFR: 8241603: ZGC:
> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
> macOS
> 
> Hi,
> 
> I think I've figured out what's going on here. The code querying the
> x2APIC id miscalculates the possible number space for the mapping table.
> There are actually two bugs here:
> 
> 1) The code assumes that cpuid leaf 0xb will give us the "level shift"
> for the socket ("package") sub id. But that's not the case; you only get
> it for the "core" and "thread" sub ids.
> 
> 2) We incorrectly accumulate the "level shift" values from the "core"
> and "thread" levels, instead of picking the max. This has the effect of
> sometimes hiding bug #1, for example, when hyper-threading is enabled on
> a 2 socket machine.
> 
> As the comment in the patch describes, I'm falling back to using the
> initial APIC ids, instead of x2APIC ids. While this can be sub-optimal,
> I don't believe it's a big problem in practice.
> 
> Could you please take this new patch for a spin and see if we've
> finally fixed the problem? Fingers crossed!
> 
> http://cr.openjdk.java.net/~pliden/8241603/webrev.1/
> 
> cheers,
> Per
> 
> On 4/14/20 1:12 PM, Per Liden wrote:
> > Thanks a lot for testing! The APIC id issue seems to come in a slightly
> > different shape from what I expected. I'll try to dig deeper and get back.
> >
> > cheers,
> > Per
> >
> > On 4/9/20 8:46 AM, Zeller, Arno wrote:
> >> Hi Per,
> >>
> >> thanks for trying to find a solution for this issue! I am sorry to
> >> report that the patch did not help. The SIGSEGV still occurs. I copied
> >> some parts of the hs_err file below
> >>
> >> The VMware VM is always configured to have 6 cores. The difference is
> >> that in the crashing case it is configured as 2 sockets * 3 cores,
> >> while a 1 socket * 6 cores configuration works fine.
> >> Sorry for not being able to give you better information. I have no
> >> direct access to the hypervisor myself and have to ask our IT
> >> colleagues to make changes and then explain to me what they have done
> >> 😊.
> >>
> >> Best regards,
> >> Arno
> >> ----
> >> # A fatal error has been detected by the Java Runtime Environment:
> >> #  SIGSEGV (0xb) at pc=0x000000010c3aff88, pid=74065, tid=9219
> >> ...
> >> Host: MacPro6,1 x86_64 3337 MHz, 6 cores, 16G, Darwin 18.5.0
> >> Time: Thu Apr  9 00:12:46 2020 CEST elapsed time: 0.160864 seconds (0d
> >> 0h 0m 0s)
> >> ...
> >> Current thread (0x00007fe038801000):  JavaThread "main"
> >> [_thread_in_vm, id=9219,
> stack(0x000070000d45e000,0x000070000d55e000)]
> >>
> >> Stack: [0x000070000d45e000,0x000070000d55e000],
> >> sp=0x000070000d55d380,  free space=1020k
> >> Native frames: (J=compiled Java code, A=aot compiled Java code,
> >> j=interpreted, Vv=VM code, C=native code)
> >> V  [libjvm.dylib+0x7a0f88]  ZCPU::id_slow()+0x56
> >> V  [libjvm.dylib+0x7aef1b]  ZObjectAllocator::shared_small_page_addr()
> >> const+0x41
> >> V  [libjvm.dylib+0x7af7d9]  ZObjectAllocator::remaining() const+0x9
> >> V  [libjvm.dylib+0x7a4369]  ZHeap::unsafe_max_tlab_alloc() const+0xd
> >> V  [libjvm.dylib+0x56218b]
> >> ThreadLocalAllocBuffer::compute_size(unsigned long)+0x33
> >> V  [libjvm.dylib+0x562080]
> >> MemAllocator::allocate_inside_tlab_slow(MemAllocator::Allocation&)
> >> const+0xca
> >> V  [libjvm.dylib+0x562270]
> >> MemAllocator::mem_allocate(MemAllocator::Allocation&) const+0x24
> >> V  [libjvm.dylib+0x5622d1]  MemAllocator::allocate() const+0x47
> >> V  [libjvm.dylib+0x7a1318]  ZCollectedHeap::array_allocate(Klass*,
> >> int, int, bool, Thread*)+0x28
> >> V  [libjvm.dylib+0x32c0c7]  InstanceKlass::allocate_objArray(int, int,
> >> Thread*)+0xd7
> >> ----
> >>
> >>> -----Original Message-----
> >>> From: Per Liden <per.liden at oracle.com>
> >>> Sent: Tuesday, April 7, 2020 12:53
> >>> To: Baesken, Matthias <matthias.baesken at sap.com>; hotspot-gc-dev
> >>> <hotspot-gc-dev at openjdk.java.net>; Langer, Christoph
> >>> <christoph.langer at sap.com>; Zeller, Arno <arno.zeller at sap.com>
> >>> Subject: Re: RFR: 8241603: ZGC:
> >>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on macOS
> >>>
> >>> Thanks! Just checking, are you testing without the workaround[1] you
> >>> did to your VMware instances?
> >>>
> >>> cheers,
> >>> Per
> >>>
> >>> [1] "We solved our issue by reconfiguring the VMWare VM to have no
> >>> hyperthreading and have the CPUs pinned to the VM. This solved the
> >>> issues for us." -
> >>> https://bugs.openjdk.java.net/browse/JDK-8241603?focusedCommentId=14327438&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14327438
> >>>
> >>>
> >>> On 4/7/20 12:07 PM, Baesken, Matthias wrote:
> >>>> Hi Per, I put your patch into our build/test queue.
> >>>>
> >>>> Best regards, Matthias
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: Per Liden <per.liden at oracle.com>
> >>>> Sent: Monday, April 6, 2020 17:04
> >>>> To: hotspot-gc-dev <hotspot-gc-dev at openjdk.java.net>; Langer,
> >>>> Christoph <christoph.langer at sap.com>; Baesken, Matthias
> >>>> <matthias.baesken at sap.com>; Zeller, Arno <arno.zeller at sap.com>
> >>>> Subject: RFR: 8241603: ZGC:
> >>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on macOS
> >>>>
> >>>> It was reported that "Every few days, the test
> >>>> java/lang/management/MemoryMXBean/MemoryTestZGC.sh crashes on
> >>>> macOS. It is macOS 10.14.4, and it is a virtualized machine running
> >>>> with VMWare hypervisor."
> >>>>
> >>>> The problem seems to be that the hypervisor (in some configurations)
> >>>> can migrate a "virtual CPU" from one physical CPU to another, and
> >>>> start to report a different APIC id. As a result, it can appear as if
> >>>> there are more than os::processor_count() CPUs in the system. To
> >>>> avoid this, we allow more than one APIC id to be mapped to the same
> >>>> logical processor id, so that os::processor_id() always returns a
> >>>> processor id that is less than os::processor_count().
> >>>>
> >>>> One could argue that this is really a hypervisor bug, but we can still
> >>>> make an effort to mitigate the problem in the JVM.
> >>>>
> >>>> SAP-folks (CC:ing those who commented in the bug), since you ran into
> >>>> this problem and I don't have access to a VMware setup where I can
> >>>> test/reproduce this, could you please test this patch to verify it
> >>>> solves the problem? If so, that would be much appreciated.
> >>>>
> >>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8241603
> >>>> Webrev: http://cr.openjdk.java.net/~pliden/8241603/webrev.0
> >>>> Testing: Tier 1-6 on macOS (but not macOS on top of VMware)
> >>>>
> >>>> cheers,
> >>>> Per
> >>>>

