From per.liden at oracle.com Wed Apr 1 07:14:54 2020 From: per.liden at oracle.com (Per Liden) Date: Wed, 1 Apr 2020 09:14:54 +0200 Subject: JVM stalls around uncommitting In-Reply-To: <2000f65b-07a1-6b90-f065-ede64d9f9413@gmail.com> References: <2000f65b-07a1-6b90-f065-ede64d9f9413@gmail.com> Message-ID: <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com> Hi, On 3/31/20 9:59 PM, Zoltan Baranyi wrote: > Hi ZGC Team, > > I run benchmarks against our application using ZGC on heaps in the > few-hundred-GB range. In the beginning everything goes smoothly, but > eventually I experience very long JVM stalls, sometimes longer than one > minute. According to the JVM log, reaching safepoints occasionally takes > a very long time, matching the duration of the stalls I experience. > > After a few iterations, I started looking at uncommitting and learned > that the way ZGC performs uncommitting - flushing the pages, punching > holes, removing blocks from the backing file - can be expensive [1] when > uncommitting tens or more than a hundred GB of memory. The trace-level > heap logs confirmed that uncommitting blocks of this size takes many > seconds. After disabling uncommitting, my benchmark runs without the huge > stalls and the overall experience with ZGC is quite good. > > Since uncommitting is done asynchronously to the mutators, I expected it > not to interfere with them. My understanding is that flushing, > bookkeeping and uncommitting are done under a mutex [2], and contention on > that can be the source of the stalls I see, such as when there is a > demand to commit memory while uncommitting is taking place. Can you > confirm whether this explanation makes sense to you? If so, is there a > cure for this that I couldn't find? Like a time bound, or a cap > on the amount of memory that can be uncommitted in one go. Yes, uncommitting is relatively expensive. And it's also true that there is a potential for lock contention affecting mutators.
That can be improved in various ways. Like you say, uncommitting in smaller chunks, or possibly by releasing the lock while doing the actual syscall. If you still want uncommit to happen, one thing to try is using large pages (-XX:+UseLargePages), since committing/uncommitting large pages is typically less expensive. This issue is on our radar, so we intend to improve this going forward. cheers, Per > > This is an example log captured during a stall: > > [1778,704s][info ][safepoint] Safepoint "ZMarkStart", Time since last: > 34394880194 ns, Reaching safepoint: 247308 ns, At safepoint: 339634 ns, > Total: 586942 ns > [1833,707s][trace][gc,heap ] Uncommitting memory: 459560M-459562M (2M) > [...] > [... zillions of continuous uncommitting log lines ...] > [...] > [1846,076s][trace][gc,heap ] Uncommitting memory: 84M-86M (2M) > [1846,076s][info ][gc,heap ] Capacity: 528596M(86%)->386072M(63%), > Uncommitted: 142524M > [1846,076s][trace][gc,heap ] Uncommit Timeout: 1s > [1846,078s][info ][safepoint] Safepoint "Cleanup", Time since last: > 18001682918 ns, Reaching safepoint: 49371131055 ns, At safepoint: 252559 > ns, Total: 49371383614 ns > > In the above case TTSP is 49s, while the uncommitting lines cover only > 13s. The TTSP would indicate that the safepoint request was signaled at > 1797s, but the log is empty between 1778s and 1833s. If my understanding > above is correct, could it be that waiting for the mutex, flushing etc. > take that much time and are just not visible in the log? > > If needed, I can dig out more details since I can reliably reproduce the > stalls. > > My environment is OpenJDK 14 running on Linux 5.2.9 with these > arguments: "-Xmx600G -XX:+HeapDumpOnOutOfMemoryError > -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:+UseNUMA > -XX:+AlwaysPreTouch -Xlog:gc,safepoint,gc+heap=trace:jvm.log".
> > Best regards, > Zoltan > > [1] > https://github.com/openjdk/zgc/blob/d90d2b1097a9de06d8b6e3e6f2f6bd4075471fa0/src/hotspot/os/linux/gc/z/zPhysicalMemoryBacking_linux.cpp#L566-L573 > > [2] > https://github.com/openjdk/zgc/blob/d90d2b1097a9de06d8b6e3e6f2f6bd4075471fa0/src/hotspot/share/gc/z/zPageAllocator.cpp#L685-L711 > From blazember at gmail.com Thu Apr 2 23:27:41 2020 From: blazember at gmail.com (=?UTF-8?Q?Zolt=C3=A1n_Baranyi?=) Date: Fri, 3 Apr 2020 01:27:41 +0200 Subject: JVM stalls around uncommitting In-Reply-To: <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com> References: <2000f65b-07a1-6b90-f065-ede64d9f9413@gmail.com> <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com> Message-ID: Hi Per, Thank you for confirming the issue and for recommending large pages. I re-ran my benchmarks with large pages and it gave me a 25-30% performance boost, which is a bit more than what I expected. My benchmarks run on a 600G heap with 1.5-2GB/s allocation rate on a 40 core machine, so ZGC is busy. Since a significant part of the workload is ZGC itself, I assume - besides the higher TLB hit rate - this gain is from managing the ZPages more effectively on large pages. I have a good experience overall, nice to see ZGC getting more and more mature. Cheers, Zoltan On Wed, Apr 1, 2020 at 9:15 AM Per Liden wrote: > Hi, > > On 3/31/20 9:59 PM, Zoltan Baranyi wrote: > > Hi ZGC Team, > > > > I run benchmarks against our application using ZGC on heaps in few > > hundreds GB scale. In the beginning everything goes smooth, but > > eventually I experience very long JVM stalls, sometimes longer than one > > minute. According to the JVM log, reaching safepoints occasionally takes > > very long time, matching to the duration of the stalls I experience.
> > > > After a few iterations, I started looking at uncommitting and learned > > that the way ZGC performs uncommitting - flushing the pages, punching > > holes, removing blocks from the backing file - can be expensive [1] when > > uncommitting tens or more than a hundred GB of memory. The trace level > > heap logs confirmed that uncommitting blocks in this size takes many > > seconds. After disabled uncommitting my benchmark runs without the huge > > stalls and the overall experience with ZGC is quite good. > > > > Since uncommitting is done asynchronously to the mutators, I expected it > > not to interfere with them. My understanding is that flushing, > > bookeeping and uncommitting is done under a mutex [2], and contention on > > that can be the source of the stalls I see, such as when there is a > > demand to commit memory while uncommitting is taking place. Can you > > confirm if this above is an explanation that makes sense to you? If so, > > is there a cure to this that I couldn't find? Like a time bound or a cap > > on the amount of the memory that can be uncommitted in one go. > > Yes, uncommitting is relatively expensive. And it's also true that there > is a potential for lock contention affecting mutators. That can be > improved in various ways. Like you say, uncommitting in smaller chunks, > or possibly by releasing the lock while doing the actual syscall. > > If you still want uncommit to happen, one thing to try is using large > pages (-XX:+UseLargePages), since committing/uncommitting large pages is > typically less expensive. > > This issue is on our radar, so we intend to improve this going forward. 
> > cheers, > Per > > From per.liden at oracle.com Fri Apr 3 07:35:55 2020 From: per.liden at oracle.com (Per Liden) Date: Fri, 3 Apr 2020 09:35:55 +0200 Subject: JVM stalls around uncommitting In-Reply-To: References: <2000f65b-07a1-6b90-f065-ede64d9f9413@gmail.com> <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com> Message-ID: <6a8582aa-a0d7-810c-ee7d-82a1bb6ec171@oracle.com> Hi Zoltan, On 4/3/20 1:27 AM, Zoltán Baranyi wrote: > Hi Per, > > Thank you for confirming the issue and for recommending large pages. I > re-ran my benchmarks with large pages and it gave me a 25-30% performance > boost, which is a bit more than what I expected. My benchmarks run on a > 600G heap with 1.5-2GB/s allocation rate on a 40 core machine, so ZGC is > busy. Since a significant part of the workload is ZGC itself, I assume - > besides the higher TLB hit rate - this gain is from managing the ZPages > more effectively on large pages. A 25-30% improvement is indeed more than I would have expected. ZGC's internal handling of ZPages is the same regardless of the underlying page size, but as you say, you'll get better TLB hit-rate and the mmap/fallocate syscalls become a lot less expensive. Another reason for the boost might be that ZGC's NUMA-awareness, until recently, worked much better when using large pages. But this has now been fixed, see https://bugs.openjdk.java.net/browse/JDK-8237649. Btw, which JDK version are you using? > > I have a good experience overall, nice to see ZGC getting more and more > mature. Good to hear. Thanks for the feedback! /Per > > Cheers, > Zoltan > > On Wed, Apr 1, 2020 at 9:15 AM Per Liden wrote: > >> Hi, >> >> On 3/31/20 9:59 PM, Zoltan Baranyi wrote: >>> Hi ZGC Team, >>> >>> I run benchmarks against our application using ZGC on heaps in few >>> hundreds GB scale. In the beginning everything goes smooth, but >>> eventually I experience very long JVM stalls, sometimes longer than one >>> minute.
According to the JVM log, reaching safepoints occasionally takes >>> very long time, matching to the duration of the stalls I experience. >>> >>> After a few iterations, I started looking at uncommitting and learned >>> that the way ZGC performs uncommitting - flushing the pages, punching >>> holes, removing blocks from the backing file - can be expensive [1] when >>> uncommitting tens or more than a hundred GB of memory. The trace level >>> heap logs confirmed that uncommitting blocks in this size takes many >>> seconds. After disabled uncommitting my benchmark runs without the huge >>> stalls and the overall experience with ZGC is quite good. >>> >>> Since uncommitting is done asynchronously to the mutators, I expected it >>> not to interfere with them. My understanding is that flushing, >>> bookeeping and uncommitting is done under a mutex [2], and contention on >>> that can be the source of the stalls I see, such as when there is a >>> demand to commit memory while uncommitting is taking place. Can you >>> confirm if this above is an explanation that makes sense to you? If so, >>> is there a cure to this that I couldn't find? Like a time bound or a cap >>> on the amount of the memory that can be uncommitted in one go. >> >> Yes, uncommitting is relatively expensive. And it's also true that there >> is a potential for lock contention affecting mutators. That can be >> improved in various ways. Like you say, uncommitting in smaller chunks, >> or possibly by releasing the lock while doing the actual syscall. >> >> If you still want uncommit to happen, one thing to try is using large >> pages (-XX:+UseLargePages), since committing/uncommitting large pages is >> typically less expensive. >> >> This issue is on our radar, so we intend to improve this going forward. 
>> >> cheers, >> Per >> >> From thomas.stuefe at gmail.com Sat Apr 4 08:00:33 2020 From: thomas.stuefe at gmail.com (=?UTF-8?Q?Thomas_St=C3=BCfe?=) Date: Sat, 4 Apr 2020 10:00:33 +0200 Subject: JVM stalls around uncommitting In-Reply-To: <6a8582aa-a0d7-810c-ee7d-82a1bb6ec171@oracle.com> References: <2000f65b-07a1-6b90-f065-ede64d9f9413@gmail.com> <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com> <6a8582aa-a0d7-810c-ee7d-82a1bb6ec171@oracle.com> Message-ID: Hi Per, Zoltan, sorry for getting in a question sideways, but I was curious. I always thought large pages are memory-pinned, so cannot be uncommitted? Or are you talking about using THPs? Cheers, Thomas On Fri, Apr 3, 2020 at 9:38 AM Per Liden wrote: > Hi Zoltan, > > On 4/3/20 1:27 AM, Zoltán Baranyi wrote: > > Hi Per, > > > > Thank you for confirming the issue and for recommending large pages. I > > re-ran my benchmarks with large pages and it gave me a 25-30% performance > > boost, which is a bit more than what I expected. My benchmarks run on a > > 600G heap with 1.5-2GB/s allocation rate on a 40 core machine, so ZGC is > > busy. Since a significant part of the workload is ZGC itself, I assume - > > besides the higher TLB hit rate - this gain is from managing the ZPages > > more effectively on large pages. > > A 25-30% improvement is indeed more than I would have expected. ZGC's > internal handling of ZPages is the same regardless of the underlying > page size, but as you say, you'll get better TLB hit-rate and the > mmap/fallocate syscalls become a lot less expensive. > > Another reason for the boost might be that ZGC's NUMA-awareness, until > recently, worked much better when using large pages. But this has now > been fixed, see https://bugs.openjdk.java.net/browse/JDK-8237649. > > Btw, which JDK version are you using? > > > > > I have a good experience overall, nice to see ZGC getting more and more > > mature. > > Good to hear. Thanks for the feedback!
> > /Per > > > > > Cheers, > > Zoltan > > > > On Wed, Apr 1, 2020 at 9:15 AM Per Liden wrote: > > > >> Hi, > >> > >> On 3/31/20 9:59 PM, Zoltan Baranyi wrote: > >>> Hi ZGC Team, > >>> > >>> I run benchmarks against our application using ZGC on heaps in few > >>> hundreds GB scale. In the beginning everything goes smooth, but > >>> eventually I experience very long JVM stalls, sometimes longer than one > >>> minute. According to the JVM log, reaching safepoints occasionally > takes > >>> very long time, matching to the duration of the stalls I experience. > >>> > >>> After a few iterations, I started looking at uncommitting and learned > >>> that the way ZGC performs uncommitting - flushing the pages, punching > >>> holes, removing blocks from the backing file - can be expensive [1] > when > >>> uncommitting tens or more than a hundred GB of memory. The trace level > >>> heap logs confirmed that uncommitting blocks in this size takes many > >>> seconds. After disabled uncommitting my benchmark runs without the huge > >>> stalls and the overall experience with ZGC is quite good. > >>> > >>> Since uncommitting is done asynchronously to the mutators, I expected > it > >>> not to interfere with them. My understanding is that flushing, > >>> bookeeping and uncommitting is done under a mutex [2], and contention > on > >>> that can be the source of the stalls I see, such as when there is a > >>> demand to commit memory while uncommitting is taking place. Can you > >>> confirm if this above is an explanation that makes sense to you? If so, > >>> is there a cure to this that I couldn't find? Like a time bound or a > cap > >>> on the amount of the memory that can be uncommitted in one go. > >> > >> Yes, uncommitting is relatively expensive. And it's also true that there > >> is a potential for lock contention affecting mutators. That can be > >> improved in various ways. 
Like you say, uncommitting in smaller chunks, > >> or possibly by releasing the lock while doing the actual syscall. > >> > >> If you still want uncommit to happen, one thing to try is using large > >> pages (-XX:+UseLargePages), since committing/uncommitting large pages is > >> typically less expensive. > >> > >> This issue is on our radar, so we intend to improve this going forward. > >> > >> cheers, > >> Per > >> > >> > From thomas.stuefe at gmail.com Sat Apr 4 09:21:33 2020 From: thomas.stuefe at gmail.com (=?UTF-8?Q?Thomas_St=C3=BCfe?=) Date: Sat, 4 Apr 2020 11:21:33 +0200 Subject: JVM stalls around uncommitting In-Reply-To: References: <2000f65b-07a1-6b90-f065-ede64d9f9413@gmail.com> <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com> <6a8582aa-a0d7-810c-ee7d-82a1bb6ec171@oracle.com> Message-ID: Sorry, let me formulate this question in a more precise manner: Assuming you use the "traditional" huge pages, HugeTLBFS, we take the pages from the huge page pool. The content of the huge page pool is not usable for other applications. So uncommitting would only benefit other applications using huge pages. But that's okay and would be useful too. The question to me would be if reserving but not committing memory backed by huge pages is any different from committing them right away. Or, whether uncommitted pages are returned to the pool. I made a simple test with UseLargePages and a VM with a 100M heap, and see that both heap and code heap are now backed by huge pages as expected. I ran once with AlwaysPreTouch, once without. I do not see any difference from the outside with regard to the number of used huge pages. In /proc/pid/smaps the memory segments look identical in each case. I may be doing this test wrong though... Thanks a lot, and sorry again for hijacking this thread, Thomas p.s. without doubt using huge pages is hugely beneficial even without uncommitting.
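The smaps check described above can be made explicit by looking at each mapping's KernelPageSize field: HugeTLBFS-backed segments report the huge page size (e.g. 2048 kB) instead of the 4 kB base page. A small sketch of such a scan, an illustration rather than code from the thread:

```python
# Sketch: find mappings in /proc/<pid>/smaps that are backed by pages
# larger than the 4 KiB base page. HugeTLBFS segments (e.g. a heap started
# with -XX:+UseLargePages) report KernelPageSize: 2048 kB.
import re

_RANGE = re.compile(r"^([0-9a-f]+-[0-9a-f]+)\s")
_KPS = re.compile(r"^KernelPageSize:\s+(\d+) kB")

def huge_mappings(smaps_text, base_kb=4):
    """Return (address_range, kernel_page_size_kb) for large-page mappings."""
    found, current = [], None
    for line in smaps_text.splitlines():
        m = _RANGE.match(line)
        if m:
            current = m.group(1)   # header line of a new mapping
            continue
        m = _KPS.match(line)
        if m and current and int(m.group(1)) > base_kb:
            found.append((current, int(m.group(1))))
    return found
```

Feeding it open(f"/proc/{pid}/smaps").read() for a JVM started with -XX:+UseLargePages should list the heap segment; an empty result means no HugeTLBFS-backed mappings were found.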
On Sat, Apr 4, 2020 at 10:00 AM Thomas Stüfe wrote: > Hi Per, Zoltan, > > sorry for getting in a question sideways, but I was curious. > > I always thought large pages are memory-pinned, so cannot be uncommitted? > Or are you talking about using THPs? > > Cheers, Thomas > > > On Fri, Apr 3, 2020 at 9:38 AM Per Liden wrote: > >> Hi Zoltan, >> >> On 4/3/20 1:27 AM, Zoltán Baranyi wrote: >> > Hi Per, >> > >> > Thank you for confirming the issue and for recommending large pages. I >> > re-ran my benchmarks with large pages and it gave me a 25-30% >> performance >> > boost, which is a bit more than what I expected. My benchmarks run on a >> > 600G heap with 1.5-2GB/s allocation rate on a 40 core machine, so ZGC is >> > busy. Since a significant part of the workload is ZGC itself, I assume - >> > besides the higher TLB hit rate - this gain is from managing the ZPages >> > more effectively on large pages. >> >> A 25-30% improvement is indeed more than I would have expected. ZGC's >> internal handling of ZPages is the same regardless of the underlying >> page size, but as you say, you'll get better TLB hit-rate and the >> mmap/fallocate syscalls become a lot less expensive. >> >> Another reason for the boost might be that ZGC's NUMA-awareness, until >> recently, worked much better when using large pages. But this has now >> been fixed, see https://bugs.openjdk.java.net/browse/JDK-8237649. >> >> Btw, which JDK version are you using? >> >> > >> > I have a good experience overall, nice to see ZGC getting more and more >> > mature. >> >> Good to hear. Thanks for the feedback! >> >> /Per >> >> > >> > Cheers, >> > Zoltan >> > >> > On Wed, Apr 1, 2020 at 9:15 AM Per Liden wrote: >> > >> >> Hi, >> >> >> >> On 3/31/20 9:59 PM, Zoltan Baranyi wrote: >> >>> Hi ZGC Team, >> >>> >> >>> I run benchmarks against our application using ZGC on heaps in few >> >>> hundreds GB scale.
In the beginning everything goes smooth, but >> >>> eventually I experience very long JVM stalls, sometimes longer than >> one >> >>> minute. According to the JVM log, reaching safepoints occasionally >> takes >> >>> very long time, matching to the duration of the stalls I experience. >> >>> >> >>> After a few iterations, I started looking at uncommitting and learned >> >>> that the way ZGC performs uncommitting - flushing the pages, punching >> >>> holes, removing blocks from the backing file - can be expensive [1] >> when >> >>> uncommitting tens or more than a hundred GB of memory. The trace level >> >>> heap logs confirmed that uncommitting blocks in this size takes many >> >>> seconds. After disabled uncommitting my benchmark runs without the >> huge >> >>> stalls and the overall experience with ZGC is quite good. >> >>> >> >>> Since uncommitting is done asynchronously to the mutators, I expected >> it >> >>> not to interfere with them. My understanding is that flushing, >> >>> bookeeping and uncommitting is done under a mutex [2], and contention >> on >> >>> that can be the source of the stalls I see, such as when there is a >> >>> demand to commit memory while uncommitting is taking place. Can you >> >>> confirm if this above is an explanation that makes sense to you? If >> so, >> >>> is there a cure to this that I couldn't find? Like a time bound or a >> cap >> >>> on the amount of the memory that can be uncommitted in one go. >> >> >> >> Yes, uncommitting is relatively expensive. And it's also true that >> there >> >> is a potential for lock contention affecting mutators. That can be >> >> improved in various ways. Like you say, uncommitting in smaller chunks, >> >> or possibly by releasing the lock while doing the actual syscall. >> >> >> >> If you still want uncommit to happen, one thing to try is using large >> >> pages (-XX:+UseLargePages), since committing/uncommitting large pages >> is >> >> typically less expensive. 
>> >> >> >> This issue is on our radar, so we intend to improve this going forward. >> >> >> >> cheers, >> >> Per >> >> >> >> >> > From per.liden at oracle.com Sat Apr 4 18:42:34 2020 From: per.liden at oracle.com (Per Liden) Date: Sat, 4 Apr 2020 20:42:34 +0200 Subject: JVM stalls around uncommitting In-Reply-To: References: <2000f65b-07a1-6b90-f065-ede64d9f9413@gmail.com> <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com> <6a8582aa-a0d7-810c-ee7d-82a1bb6ec171@oracle.com> Message-ID: <8a3c04f3-ba6b-7d09-6ac1-7bc07fa765f4@oracle.com> Hi Thomas, On 4/4/20 11:21 AM, Thomas Stüfe wrote: > Sorry, let me formulate this question in a more precise manner: > > Assuming you use the "traditional" huge pages, HugeTLBFS, we take the > pages from the huge page pool. The content of the huge page pool is not > usable for other applications. So uncommitting would only benefit other > applications using huge pages. But that's okay and would be useful too. It depends a bit on how you've set up the huge page pool. Normally, you set nr_hugepages to configure the huge page pool to have a fixed number of pages, with a guarantee that those pages will actually be there when needed. Applications explicitly using huge pages will allocate from the pool. Applications that uncommit such pages will return them to the pool for other applications (that are also explicitly using huge pages) to use. However, you can instead (or also) choose to configure nr_overcommit_hugepages. When the huge page pool is depleted (e.g. because nr_hugepages was set to 0 from the start) the kernel will try to allocate at most this number of huge pages from the normal page pool. These pages will show up as HugePages_Surp in /proc/meminfo. When uncommitting such pages they will be returned to the normal page pool, for any other process to use (not just those explicitly using huge pages). Of course, you don't have the same guarantee that there are large pages available.
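The pool state described above is directly observable: nr_hugepages and nr_overcommit_hugepages are set under /proc/sys/vm/, and /proc/meminfo exposes the counters, including HugePages_Surp for surplus pages. A small parsing sketch, an illustration rather than code from the thread:

```python
# Sketch: read the huge page pool counters referred to above.
# HugePages_Surp counts "surplus" pages allocated beyond nr_hugepages
# (bounded by nr_overcommit_hugepages); on uncommit they return to the
# normal page pool.
def hugepage_pool(meminfo_text):
    """Map each HugePages_* counter in meminfo-format text to its value."""
    pool = {}
    for line in meminfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.startswith("HugePages_"):
            pool[key] = int(value.strip())
    return pool

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(hugepage_pool(f.read()))
```

Watching these counters while a ZGC process with -XX:+UseLargePages commits and uncommits memory shows the pages moving between the pool and the process.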
> > The question to me would be if reserving but not committing memory > backed by huge pages is any different from committing them right away. > Or, whether uncommitted pages are returned to the pool. It depends on what you mean by reserving. If you're going through ReservedSpace (i.e. os::reserve_memory_special() and friends), then yes, it's the same thing. But ZGC is not using those APIs, it has its own reserve/commit/uncommit infrastructure where reserve only reserves address space, and commit/uncommit actually allocates/deallocates pages. > > I made a simple test with UseLargePages and a VM with a 100M heap, and > see that both heap and code heap are now backed by huge pages > as expected. I ran once with AlwaysPreTouch, once without. I do not see > any difference from the outside with regard to the number of used huge pages. > In /proc/pid/smaps the memory segments look identical in each case. I > may be doing this test wrong though... Maybe you weren't using ZGC? The code heap and all GCs, except ZGC, use ReservedSpace where large pages will be committed and "pinned" upfront, and no uncommit will happen. cheers, Per > > Thanks a lot, and sorry again for hijacking this thread, > > Thomas > > p.s. without doubt using huge pages is hugely beneficial even without > uncommitting. > > > > > On Sat, Apr 4, 2020 at 10:00 AM Thomas Stüfe > wrote: > > Hi Per, Zoltan, > > sorry for getting in a question sideways, but I was curious. > > I always thought large pages are memory-pinned, so cannot be > uncommitted? Or are you talking about using THPs? > > Cheers, Thomas > > > On Fri, Apr 3, 2020 at 9:38 AM Per Liden > wrote: > > Hi Zoltan, > > On 4/3/20 1:27 AM, Zoltán Baranyi wrote: > > Hi Per, > > > > Thank you for confirming the issue and for recommending large > pages. I > > re-ran my benchmarks with large pages and it gave me a 25-30% > performance > > boost, which is a bit more than what I expected.
My > benchmarks run on a > > 600G heap with 1.5-2GB/s allocation rate on a 40 core > machine, so ZGC is > > busy. Since a significant part of the workload is ZGC itself, > I assume - > > besides the higher TLB hit rate - this gain is from managing > the ZPages > > more effectively on large pages. > > A 25-30% improvement is indeed more than I would have expected. > ZGC's > internal handling of ZPages is the same regardless of the > underlying > page size, but as you say, you'll get better TLB hit-rate and the > mmap/fallocate syscalls become a lot less expensive. > > Another reason for the boost might be that ZGC's NUMA-awareness, > until > recently, worked much better when using large pages. But this > has now > been fixed, see https://bugs.openjdk.java.net/browse/JDK-8237649. > > Btw, which JDK version are you using? > > > > > I have a good experience overall, nice to see ZGC getting > more and more > > mature. > > Good to hear. Thanks for the feedback! > > /Per > > > > > Cheers, > > Zoltan > > > > On Wed, Apr 1, 2020 at 9:15 AM Per Liden > > wrote: > > > >> Hi, > >> > >> On 3/31/20 9:59 PM, Zoltan Baranyi wrote: > >>> Hi ZGC Team, > >>> > >>> I run benchmarks against our application using ZGC on heaps > in few > >>> hundreds GB scale. In the beginning everything goes smooth, but > >>> eventually I experience very long JVM stalls, sometimes > longer than one > >>> minute. According to the JVM log, reaching safepoints > occasionally takes > >>> very long time, matching to the duration of the stalls I > experience. > >>> > >>> After a few iterations, I started looking at uncommitting > and learned > >>> that the way ZGC performs uncommitting - flushing the > pages, punching > >>> holes, removing blocks from the backing file - can be > expensive [1] when > >>> uncommitting tens or more than a hundred GB of memory. The > trace level > >>> heap logs confirmed that uncommitting blocks in this size > takes many > >>> seconds. 
After disabled uncommitting my benchmark runs > without the huge > >>> stalls and the overall experience with ZGC is quite good. > >>> > >>> Since uncommitting is done asynchronously to the mutators, > I expected it > >>> not to interfere with them. My understanding is that flushing, > >>> bookeeping and uncommitting is done under a mutex [2], and > contention on > >>> that can be the source of the stalls I see, such as when > there is a > >>> demand to commit memory while uncommitting is taking place. > Can you > >>> confirm if this above is an explanation that makes sense to > you? If so, > >>> is there a cure to this that I couldn't find? Like a time > bound or a cap > >>> on the amount of the memory that can be uncommitted in one go. > >> > >> Yes, uncommitting is relatively expensive. And it's also > true that there > >> is a potential for lock contention affecting mutators. That > can be > >> improved in various ways. Like you say, uncommitting in > smaller chunks, > >> or possibly by releasing the lock while doing the actual > syscall. > >> > >> If you still want uncommit to happen, one thing to try is > using large > >> pages (-XX:+UseLargePages), since committing/uncommitting > large pages is > >> typically less expensive. > >> > >> This issue is on our radar, so we intend to improve this > going forward. 
> >> > >> cheers, > >> Per > >> > >> > From thomas.stuefe at gmail.com Mon Apr 6 06:33:35 2020 From: thomas.stuefe at gmail.com (=?UTF-8?Q?Thomas_St=C3=BCfe?=) Date: Mon, 6 Apr 2020 08:33:35 +0200 Subject: JVM stalls around uncommitting In-Reply-To: <8a3c04f3-ba6b-7d09-6ac1-7bc07fa765f4@oracle.com> References: <2000f65b-07a1-6b90-f065-ede64d9f9413@gmail.com> <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com> <6a8582aa-a0d7-810c-ee7d-82a1bb6ec171@oracle.com> <8a3c04f3-ba6b-7d09-6ac1-7bc07fa765f4@oracle.com> Message-ID: Hi Per, On Sat, Apr 4, 2020 at 8:42 PM Per Liden wrote: > Hi Thomas, > > On 4/4/20 11:21 AM, Thomas Stüfe wrote: > > Sorry, let me formulate this question in a more precise manner: > > > > Assuming you use the "traditional" huge pages, HugeTLBFS, we take the > > pages from the huge page pool. The content of the huge page pool is not > > usable for other applications. So uncommitting would only benefit other > > applications using huge pages. But that's okay and would be useful too. > > It depends a bit on how you've set up the huge page pool. Normally, you > set nr_hugepages to configure the huge page pool to have a fixed number > of pages, with a guarantee that those pages will actually be there when > needed. Applications explicitly using huge pages will allocate from the > pool. Applications that uncommit such pages will return them to the pool > for other applications (that are also explicitly using huge pages) to use. > > Good to know. > However, you can instead (or also) choose to configure > nr_overcommit_hugepages. When the huge page pool is depleted (e.g. > because nr_hugepages was set to 0 from the start) the kernel will try to > allocate at most this number of huge pages from the normal page pool. > These pages will show up as HugePages_Surp in /proc/meminfo. When > uncommitting such pages they will be returned to the normal page pool, > for any other process to use (not just those explicitly using huge > pages).
Of course, you don't have the same guarantee that there are > large pages available. > > Oh this is nice. I did not know you could do this. It takes the sting out of preallocating a huge page pool, especially on development machines. > > > > The question to me would be if reserving but not committing memory > > backed by huge pages is any different from committing them right away. > > Or, whether uncommitted pages are returned to the pool. > > It depends on what you mean by reserving. If you're going through > ReservedSpace (i.e. os::reserve_memory_special() and friends), then yes, > it's the same thing. But ZGC is not using those APIs, it has its own > reserve/commit/uncommit infrastructure where reserve only reserves > address space, and commit/uncommit actually allocates/deallocates pages. > > > > I made a simple test with UseLargePages and a VM with a 100M heap, and > > see that both heap and code heap are now backed by huge pages > > as expected. I ran once with AlwaysPreTouch, once without. I do not see > > any difference from the outside with regard to the number of used huge pages. > > In /proc/pid/smaps the memory segments look identical in each case. I > > may be doing this test wrong though... > > Maybe you weren't using ZGC? The code heap and all GCs, except ZGC, use > ReservedSpace where large pages will be committed and "pinned" upfront, > and no uncommit will happen. > > That, and I also got confused with AIX where huge pages are pinned by the OS :) > cheers, > Per > > Thank you for that extensive answer! Cheers, Thomas > > > > Thanks a lot, and sorry again for hijacking this thread, > > > > Thomas > > > > p.s. without doubt using huge pages is hugely beneficial even without > > uncommitting. > > > > > > > > > > On Sat, Apr 4, 2020 at 10:00 AM Thomas Stüfe > > wrote: > > > > Hi Per, Zoltan, > > > > sorry for getting in a question sideways, but I was curious. > > > > I always thought large pages are memory-pinned, so cannot be > > uncommitted?
Or are you talking about using THPs? > > > > Cheers, Thomas > > > > > > On Fri, Apr 3, 2020 at 9:38 AM Per Liden > > wrote: > > > > Hi Zoltan, > > > > On 4/3/20 1:27 AM, Zoltán Baranyi wrote: > > > Hi Per, > > > > > > Thank you for confirming the issue and for recommending large > > pages. I > > > re-ran my benchmarks with large pages and it gave me a 25-30% > > performance > > > boost, which is a bit more than what I expected. My > > benchmarks run on a > > > 600G heap with 1.5-2GB/s allocation rate on a 40 core > > machine, so ZGC is > > > busy. Since a significant part of the workload is ZGC itself, > > I assume - > > > besides the higher TLB hit rate - this gain is from managing > > the ZPages > > > more effectively on large pages. > > > > A 25-30% improvement is indeed more than I would have expected. > > ZGC's > > internal handling of ZPages is the same regardless of the > > underlying > > page size, but as you say, you'll get better TLB hit-rate and the > > mmap/fallocate syscalls become a lot less expensive. > > > > Another reason for the boost might be that ZGC's NUMA-awareness, > > until > > recently, worked much better when using large pages. But this > > has now > > been fixed, see https://bugs.openjdk.java.net/browse/JDK-8237649 > . > > > > Btw, which JDK version are you using? > > > > > > > > I have a good experience overall, nice to see ZGC getting > > more and more > > > mature. > > > > Good to hear. Thanks for the feedback! > > > > /Per > > > > > > > > Cheers, > > > Zoltan > > > > > > On Wed, Apr 1, 2020 at 9:15 AM Per Liden > > > wrote: > > > > > >> Hi, > > >> > > >> On 3/31/20 9:59 PM, Zoltan Baranyi wrote: > > >>> Hi ZGC Team, > > >>> > > >>> I run benchmarks against our application using ZGC on heaps > > in few > > >>> hundreds GB scale. In the beginning everything goes smooth, > but > > >>> eventually I experience very long JVM stalls, sometimes > > longer than one > > >>> minute.
According to the JVM log, reaching safepoints > > occasionally takes > > >>> very long time, matching to the duration of the stalls I > > experience. > > >>> > > >>> After a few iterations, I started looking at uncommitting > > and learned > > >>> that the way ZGC performs uncommitting - flushing the > > pages, punching > > >>> holes, removing blocks from the backing file - can be > > expensive [1] when > > >>> uncommitting tens or more than a hundred GB of memory. The > > trace level > > >>> heap logs confirmed that uncommitting blocks in this size > > takes many > > >>> seconds. After disabled uncommitting my benchmark runs > > without the huge > > >>> stalls and the overall experience with ZGC is quite good. > > >>> > > >>> Since uncommitting is done asynchronously to the mutators, > > I expected it > > >>> not to interfere with them. My understanding is that > flushing, > > >>> bookeeping and uncommitting is done under a mutex [2], and > > contention on > > >>> that can be the source of the stalls I see, such as when > > there is a > > >>> demand to commit memory while uncommitting is taking place. > > Can you > > >>> confirm if this above is an explanation that makes sense to > > you? If so, > > >>> is there a cure to this that I couldn't find? Like a time > > bound or a cap > > >>> on the amount of the memory that can be uncommitted in one > go. > > >> > > >> Yes, uncommitting is relatively expensive. And it's also > > true that there > > >> is a potential for lock contention affecting mutators. That > > can be > > >> improved in various ways. Like you say, uncommitting in > > smaller chunks, > > >> or possibly by releasing the lock while doing the actual > > syscall. > > >> > > >> If you still want uncommit to happen, one thing to try is > > using large > > >> pages (-XX:+UseLargePages), since committing/uncommitting > > large pages is > > >> typically less expensive. > > >> > > >> This issue is on our radar, so we intend to improve this > > going forward. 
> > >> > > >> cheers, > > >> Per > > >> > > >> > > >
From blazember at gmail.com Mon Apr 6 13:23:06 2020 From: blazember at gmail.com (=?UTF-8?Q?Zolt=C3=A1n_Baranyi?=) Date: Mon, 6 Apr 2020 15:23:06 +0200 Subject: JVM stalls around uncommitting In-Reply-To: <6a8582aa-a0d7-810c-ee7d-82a1bb6ec171@oracle.com> References: <2000f65b-07a1-6b90-f065-ede64d9f9413@gmail.com> <3deda55f-cba9-6299-c0d9-7e1b4ca9c411@oracle.com> <6a8582aa-a0d7-810c-ee7d-82a1bb6ec171@oracle.com> Message-ID: Hi Per, Thanks for the link to the NUMA issue, it could be part of the difference indeed. My benchmarks use the fresh OpenJDK 14+36-1461 GA build, so I don't have this improvement yet. Btw, it turned out that running the same benchmark multiple times with 4K pages and exactly the same parameters produces throughput results with low and high values, where low can be ~75% of the high. This can very well be behind the unexpectedly high gain I saw with large pages, since I had results only from the low range for 4K pages when I replied. I need to figure this out, but currently I think this is unrelated to ZGC, even if NUMA can play some role in it. Thanks for the useful hints and explanations, I will come back if I find anything interesting related to ZGC. Cheers, Zoltan
On Fri, Apr 3, 2020 at 9:36 AM Per Liden wrote: > Hi Zoltan, > > On 4/3/20 1:27 AM, Zoltán Baranyi wrote: > > Hi Per, > > > > Thank you for confirming the issue and for recommending large pages. I > > re-ran my benchmarks with large pages and it gave me a 25-30% performance > > boost, which is a bit more than what I expected. My benchmarks run on a > > 600G heap with 1.5-2GB/s allocation rate on a 40 core machine, so ZGC is > > busy. Since a significant part of the workload is ZGC itself, I assume - > > besides the higher TLB hit rate - this gain is from managing the ZPages > > more effectively on large pages. > > A 25-30% improvement is indeed more than I would have expected.
ZGC's > internal handling of ZPages is the same regardless of the underlying > page size, but as you say, you'll get better TLB hit-rate and the > mmap/fallocate syscalls become a lot less expensive. > > Another reason for the boost might be that ZGC's NUMA-awareness, until > recently, worked much better when using large pages. But this has now > been fixed, see https://bugs.openjdk.java.net/browse/JDK-8237649. > > Btw, which JDK version are you using? > > > > > I have a good experience overall, nice to see ZGC getting more and more > > mature. > > Good to hear. Thanks for the feedback! > > /Per > > > > > Cheers, > > Zoltan > > > > On Wed, Apr 1, 2020 at 9:15 AM Per Liden wrote: > > > >> Hi, > >> > >> On 3/31/20 9:59 PM, Zoltan Baranyi wrote: > >>> Hi ZGC Team, > >>> > >>> I run benchmarks against our application using ZGC on heaps in few > >>> hundreds GB scale. In the beginning everything goes smooth, but > >>> eventually I experience very long JVM stalls, sometimes longer than one > >>> minute. According to the JVM log, reaching safepoints occasionally > takes > >>> very long time, matching to the duration of the stalls I experience. > >>> > >>> After a few iterations, I started looking at uncommitting and learned > >>> that the way ZGC performs uncommitting - flushing the pages, punching > >>> holes, removing blocks from the backing file - can be expensive [1] > when > >>> uncommitting tens or more than a hundred GB of memory. The trace level > >>> heap logs confirmed that uncommitting blocks in this size takes many > >>> seconds. After disabled uncommitting my benchmark runs without the huge > >>> stalls and the overall experience with ZGC is quite good. > >>> > >>> Since uncommitting is done asynchronously to the mutators, I expected > it > >>> not to interfere with them. 
My understanding is that flushing, > >>> bookeeping and uncommitting is done under a mutex [2], and contention > on > >>> that can be the source of the stalls I see, such as when there is a > >>> demand to commit memory while uncommitting is taking place. Can you > >>> confirm if this above is an explanation that makes sense to you? If so, > >>> is there a cure to this that I couldn't find? Like a time bound or a > cap > >>> on the amount of the memory that can be uncommitted in one go. > >> > >> Yes, uncommitting is relatively expensive. And it's also true that there > >> is a potential for lock contention affecting mutators. That can be > >> improved in various ways. Like you say, uncommitting in smaller chunks, > >> or possibly by releasing the lock while doing the actual syscall. > >> > >> If you still want uncommit to happen, one thing to try is using large > >> pages (-XX:+UseLargePages), since committing/uncommitting large pages is > >> typically less expensive. > >> > >> This issue is on our radar, so we intend to improve this going forward. > >> > >> cheers, > >> Per > >> > >> > From aph at redhat.com Tue Apr 7 11:52:38 2020 From: aph at redhat.com (Andrew Haley) Date: Tue, 7 Apr 2020 12:52:38 +0100 Subject: [aarch64-port-dev ] RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: <10e5adb8-3170-253b-17c4-ed70a708e404@redhat.com> References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <105c4a4a-59c9-8095-6d45-642595f65539@redhat.com> <10e5adb8-3170-253b-17c4-ed70a708e404@redhat.com> Message-ID: <8af3b484-e9d6-5571-42d8-a42a66ebdd42@redhat.com> I notice that even after applying your patch we are still using embedded OOPs in two places. Here in aarch64.ad: if (rtype == relocInfo::oop_type) { __ movoop(dst_reg, (jobject)con, /*immediate*/true); } and here in sharedRuntime_aarch64.cpp: // load oop into a register __ movoop(c_rarg1, JNIHandles::make_local(method->method_holder()->java_mirror()), /*immediate*/true); Why is this? 
-- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
From aph at redhat.com Tue Apr 7 12:25:14 2020 From: aph at redhat.com (Andrew Haley) Date: Tue, 7 Apr 2020 13:25:14 +0100 Subject: [aarch64-port-dev ] RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: <8af3b484-e9d6-5571-42d8-a42a66ebdd42@redhat.com> References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <105c4a4a-59c9-8095-6d45-642595f65539@redhat.com> <10e5adb8-3170-253b-17c4-ed70a708e404@redhat.com> <8af3b484-e9d6-5571-42d8-a42a66ebdd42@redhat.com> Message-ID: On 4/7/20 12:52 PM, Andrew Haley wrote: > I notice that even after applying your patch we are still using embedded > OOPs in two places. > > Here in aarch64.ad: > > if (rtype == relocInfo::oop_type) { > __ movoop(dst_reg, (jobject)con, /*immediate*/true); > } > > and here in sharedRuntime_aarch64.cpp: > > // load oop into a register > __ movoop(c_rarg1, > JNIHandles::make_local(method->method_holder()->java_mirror()), > /*immediate*/true); > > Why is this? Ah, the second one is a handle, of course, and AFAIK handles don't move. Having said that, the use of movoop on something that is the address of an oop rather than an oop is odd, but it's done on other targets. The C2 one is still suspect. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd.
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From aph at redhat.com Wed Apr 8 13:08:56 2020 From: aph at redhat.com (Andrew Haley) Date: Wed, 8 Apr 2020 14:08:56 +0100 Subject: [aarch64-port-dev ] RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <105c4a4a-59c9-8095-6d45-642595f65539@redhat.com> <10e5adb8-3170-253b-17c4-ed70a708e404@redhat.com> <8af3b484-e9d6-5571-42d8-a42a66ebdd42@redhat.com> Message-ID: On 4/7/20 1:25 PM, Andrew Haley wrote: > On 4/7/20 12:52 PM, Andrew Haley wrote: >> I notice that even after applying your patch we are still using embedded >> OOPs in two places. >> >> Here in aarch64.ad: >> >> if (rtype == relocInfo::oop_type) { >> __ movoop(dst_reg, (jobject)con, /*immediate*/true); >> } >> >> and here in sharedRuntime_aarch64.cpp: >> >> // load oop into a register >> __ movoop(c_rarg1, >> JNIHandles::make_local(method->method_holder()->java_mirror()), >> /*immediate*/true); >> >> Why is this? > > Ah, the second one is a handle, of course, and AFAIK handles don't move. > Having said that, the use of movoop on something that is the address of an > oop rather than an oop is odd,but it's done on other targets. > > The C2 one is still suspect. 
I made the following changes, bootstrap still works: diff -r cd06d732d5f0 src/hotspot/cpu/aarch64/aarch64.ad --- a/src/hotspot/cpu/aarch64/aarch64.ad Wed Apr 08 08:57:07 2020 -0400 +++ b/src/hotspot/cpu/aarch64/aarch64.ad Wed Apr 08 09:03:18 2020 -0400 @@ -3160,7 +3160,7 @@ } else { relocInfo::relocType rtype = $src->constant_reloc(); if (rtype == relocInfo::oop_type) { - __ movoop(dst_reg, (jobject)con, /*immediate*/true); + __ movoop(dst_reg, (jobject)con, /*immediate*/false); } else if (rtype == relocInfo::metadata_type) { __ mov_metadata(dst_reg, (Metadata*)con); } else { diff -r cd06d732d5f0 src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp --- a/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp Wed Apr 08 08:57:07 2020 -0400 +++ b/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp Wed Apr 08 09:03:18 2020 -0400 @@ -4145,7 +4145,7 @@ if (! immediate) { // nmethod barriers need to be ordered with respected to oop accesses, so // we can't use immediate literals as that would necessitate ISBs. - if (BarrierSet::barrier_set()->barrier_set_nmethod() != NULL) { + if (0 && BarrierSet::barrier_set()->barrier_set_nmethod() != NULL) { adr(dst, InternalAddress(address_constant((address)obj, rspec))); ldr(dst, Address(dst)); } else { diff -r cd06d732d5f0 src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp --- a/src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp Wed Apr 08 08:57:07 2020 -0400 +++ b/src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp Wed Apr 08 09:03:18 2020 -0400 @@ -1676,11 +1676,10 @@ // Pre-load a static method's oop into c_rarg1. if (method->is_static() && !is_critical_native) { - // load oop into a register __ movoop(c_rarg1, JNIHandles::make_local(method->method_holder()->java_mirror()), - /*immediate*/true); + /*immediate*/false); // Now handlize the static class mirror it's known not-null. 
__ str(c_rarg1, Address(sp, klass_offset)); diff -r cd06d732d5f0 src/hotspot/share/runtime/sharedRuntime.cpp --- a/src/hotspot/share/runtime/sharedRuntime.cpp Wed Apr 08 08:57:07 2020 -0400 +++ b/src/hotspot/share/runtime/sharedRuntime.cpp Wed Apr 08 09:03:18 2020 -0400 @@ -2873,6 +2873,7 @@ CodeBuffer buffer(buf); double locs_buf[20]; buffer.insts()->initialize_shared_locs((relocInfo*)locs_buf, sizeof(locs_buf) / sizeof(relocInfo)); + buffer.initialize_consts_size(8); MacroAssembler _masm(&buffer); // Fill in the signature array, for the calling-convention call. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
From stumon01 at arm.com Wed Apr 8 15:33:20 2020 From: stumon01 at arm.com (Stuart Monteith) Date: Wed, 8 Apr 2020 16:33:20 +0100 Subject: [aarch64-port-dev ] RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <105c4a4a-59c9-8095-6d45-642595f65539@redhat.com> <10e5adb8-3170-253b-17c4-ed70a708e404@redhat.com> <8af3b484-e9d6-5571-42d8-a42a66ebdd42@redhat.com> Message-ID: I see what you did there. This comes back to our previous discussion about the value of having immediate oops at all. Isn't that what you are effectively suggesting? That would simplify the code somewhat. On 08/04/2020 14:08, Andrew Haley wrote: > On 4/7/20 1:25 PM, Andrew Haley wrote: >> On 4/7/20 12:52 PM, Andrew Haley wrote: >>> I notice that even after applying your patch we are still using embedded >>> OOPs in two places. >>> >>> Here in aarch64.ad: >>> >>> if (rtype == relocInfo::oop_type) { >>> __ movoop(dst_reg, (jobject)con, /*immediate*/true); >>> } >>> >>> and here in sharedRuntime_aarch64.cpp: >>> >>> // load oop into a register >>> __ movoop(c_rarg1, >>> JNIHandles::make_local(method->method_holder()->java_mirror()), >>> /*immediate*/true); >>> >>> Why is this?
>> >> Ah, the second one is a handle, of course, and AFAIK handles don't move. >> Having said that, the use of movoop on something that is the address of an >> oop rather than an oop is odd,but it's done on other targets. >> >> The C2 one is still suspect. > > I made the following changes, bootstrap still works: > > diff -r cd06d732d5f0 src/hotspot/cpu/aarch64/aarch64.ad > --- a/src/hotspot/cpu/aarch64/aarch64.ad Wed Apr 08 08:57:07 2020 -0400 > +++ b/src/hotspot/cpu/aarch64/aarch64.ad Wed Apr 08 09:03:18 2020 -0400 > @@ -3160,7 +3160,7 @@ > } else { > relocInfo::relocType rtype = $src->constant_reloc(); > if (rtype == relocInfo::oop_type) { > - __ movoop(dst_reg, (jobject)con, /*immediate*/true); > + __ movoop(dst_reg, (jobject)con, /*immediate*/false); > } else if (rtype == relocInfo::metadata_type) { > __ mov_metadata(dst_reg, (Metadata*)con); > } else { > diff -r cd06d732d5f0 src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp > --- a/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp Wed Apr 08 08:57:07 2020 -0400 > +++ b/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp Wed Apr 08 09:03:18 2020 -0400 > @@ -4145,7 +4145,7 @@ > if (! immediate) { > // nmethod barriers need to be ordered with respected to oop accesses, so > // we can't use immediate literals as that would necessitate ISBs. > - if (BarrierSet::barrier_set()->barrier_set_nmethod() != NULL) { > + if (0 && BarrierSet::barrier_set()->barrier_set_nmethod() != NULL) { > adr(dst, InternalAddress(address_constant((address)obj, rspec))); > ldr(dst, Address(dst)); > } else { > diff -r cd06d732d5f0 src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp > --- a/src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp Wed Apr 08 08:57:07 2020 -0400 > +++ b/src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp Wed Apr 08 09:03:18 2020 -0400 > @@ -1676,11 +1676,10 @@ > > // Pre-load a static method's oop into c_rarg1. 
> if (method->is_static() && !is_critical_native) { > - > // load oop into a register > __ movoop(c_rarg1, > JNIHandles::make_local(method->method_holder()->java_mirror()), > - /*immediate*/true); > + /*immediate*/false); > > // Now handlize the static class mirror it's known not-null. > __ str(c_rarg1, Address(sp, klass_offset)); > diff -r cd06d732d5f0 src/hotspot/share/runtime/sharedRuntime.cpp > --- a/src/hotspot/share/runtime/sharedRuntime.cpp Wed Apr 08 08:57:07 2020 -0400 > +++ b/src/hotspot/share/runtime/sharedRuntime.cpp Wed Apr 08 09:03:18 2020 -0400 > @@ -2873,6 +2873,7 @@ > CodeBuffer buffer(buf); > double locs_buf[20]; > buffer.insts()->initialize_shared_locs((relocInfo*)locs_buf, sizeof(locs_buf) / sizeof(relocInfo)); > + buffer.initialize_consts_size(8); > MacroAssembler _masm(&buffer); > > // Fill in the signature array, for the calling-convention call. > IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
From aph at redhat.com Wed Apr 8 16:15:58 2020 From: aph at redhat.com (Andrew Haley) Date: Wed, 8 Apr 2020 17:15:58 +0100 Subject: [aarch64-port-dev ] RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <105c4a4a-59c9-8095-6d45-642595f65539@redhat.com> <10e5adb8-3170-253b-17c4-ed70a708e404@redhat.com> <8af3b484-e9d6-5571-42d8-a42a66ebdd42@redhat.com> Message-ID: <5018c8e8-73f9-ad71-1e0b-7874e98dea3c@redhat.com> On 4/8/20 4:33 PM, Stuart Monteith wrote: > I see what you did there. This comes back to our previous discussion > about the value of having immediate oops at all. Isn't that what you are > effectively suggesting? That would simplify the code somewhat. Not entirely.
Immediate oops are good for most GCs. But according to what Erik said, immediate oops are verboten when we're using ZGC with concurrent method unloading, and it seems to be very easy to do. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671
From stumon01 at arm.com Thu Apr 16 11:29:19 2020 From: stumon01 at arm.com (Stuart Monteith) Date: Thu, 16 Apr 2020 12:29:19 +0100 Subject: [aarch64-port-dev ] RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: <5018c8e8-73f9-ad71-1e0b-7874e98dea3c@redhat.com> References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <105c4a4a-59c9-8095-6d45-642595f65539@redhat.com> <10e5adb8-3170-253b-17c4-ed70a708e404@redhat.com> <8af3b484-e9d6-5571-42d8-a42a66ebdd42@redhat.com> <5018c8e8-73f9-ad71-1e0b-7874e98dea3c@redhat.com> Message-ID: <85312260-c65f-cb86-5a44-ee77e8d04b4d@arm.com> On 08/04/2020 17:15, Andrew Haley wrote: > On 4/8/20 4:33 PM, Stuart Monteith wrote: >> I see what you did there. This comes back to our previous discussion >> about the value of having immediate oops at all. Isn't that what you are >> effectively suggesting? That would simplify the code somewhat. > > Not entirely. Immediate oops are good for most GCs. But according to what > Erik said, immediate oops are verboten when we're using ZGC with concurrent > method unloading, and it seems to be very easy to do. > I've incorporated everyone's comments into the latest: http://cr.openjdk.java.net/~smonteith/8216557/webrev.1/ I've cleaned up the logic somewhat - instead movoop will only make an oop an immediate if there aren't nmethod entry guards - hopefully the comments make that clear. This tested OK with a full run of JTREG. I've made adding constants to the wrapper only for AARCH64, with a comment explaining why. I presume we don't want architectures that don't need it to take the (albeit small) overhead.
I've not made it conditional on aarch64 with ZGC or class unloading. BR, Stuart
From stuart.monteith at arm.com Mon Apr 20 14:19:28 2020 From: stuart.monteith at arm.com (Stuart Monteith) Date: Mon, 20 Apr 2020 15:19:28 +0100 Subject: RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: <8f317840-a2b2-3ccb-fbb2-a38b2ebcbf4b@oracle.com> References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <8f317840-a2b2-3ccb-fbb2-a38b2ebcbf4b@oracle.com> Message-ID: <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com> Hi, If anyone has bandwidth, would they be able to review this patch? It addresses Andrew, Per and Erik's comments: http://cr.openjdk.java.net/~smonteith/8216557/webrev.1/ Thanks, Stuart
On 27/03/2020 09:47, Erik Österlund wrote: > Hi Stuart, > > Thanks for sorting this out on AArch64. It is nice to see that you can > implement these > barriers on platforms that do not have instruction cache coherency. > > One small change request: > It looks like in C1 you inject the entry barrier right after build_frame > is done: > > 629       build_frame(); > 630       { > 631         // Insert nmethod entry barrier into frame. > 632         BarrierSetAssembler* bs = > BarrierSet::barrier_set()->barrier_set_assembler(); > 633         bs->nmethod_entry_barrier(_masm); > 634       } > > Unfortunately, this is in the platform independent part of the LIR > assembler. In the x86 version > we inject it at the very end of build_frame() instead, which is a > platform-specific function. > The platform-specific function is in the C1 macro assembler file for > that platform.
> > We intentionally put it in the platform-specific path as it is a > platform-specific feature. > Now on x86, the barrier code will be emitted once in build_frame() and > once after returning > from build_frame, resulting in two nmethod entry barriers, and only the > first one will get > patched, causing the second one to mostly take slow paths, which isn't > necessarily wrong, > but will cause regressions. > > I would propose you just move those lines into the very end of the > AArch64-specific part of > build_frame(). > > I don't need to see another webrev for that trivial code motion. This > looks good to me. > Again, thanks a lot for fixing this! It will allow me to go forward with > concurrent stack > scanning on AArch64 as well. > > Thanks, > /Erik > > > On 2020-03-26 23:42, Stuart Monteith wrote: >> Hello, >> Please review this change to implement nmethod entry barriers on >> aarch64, and hence concurrent class unloading with ZGC. Shenandoah will >> need to be separately tested and enabled - there are problems with this >> on Shenandoah. >> >> It has been tested with JTreg, runs with SPECjbb, gcbench, and Lucene as >> well as Netbeans. >> >> In terms of interesting features: >> With nmethod entry barriers, immediate oops are removed by: >> LIR_Assembler::jobject2reg and MacroAssembler::movoop >> This is to ensure consistency with the entry barrier, as with >> an immediate we'd otherwise need an ISB. >> >> I've added "-XX:DeoptNMethodBarrierALot". I found this functionality >> useful in testing as deoptimisation is very infrequent. I've written it >> as an atomic to avoid it happening too frequently. As it is a new >> option, I'm not sure whether any more is needed than this review. A new >> test has been added >> "test/hotspot/jtreg/gc/stress/gcbasher/TestGCBasherDeoptWithZ.java" to >> test GC with that option enabled. >> >> BarrierSetAssembler::nmethod_entry_barrier >> This method emits the barrier code. In internal review it was suggested >> the "dmb( ISHLD )" should be replaced by "membar(LoadLoad)". I've not >> done this as the BarrierSetNMethod code checks the exact instruction >> sequence, and I prefer to be explicit. >> >> Benchmarking method entry shows an increase of around 6ns with the >> nmethod entry barrier. >> >> >> The deoptimisation code was contributed by Andrew Haley. >> >> The bug: >> https://bugs.openjdk.java.net/browse/JDK-8216557 >> >> The webrev: >> http://cr.openjdk.java.net/~smonteith/8216557/webrev.0/ >> >> >> BR, >> Stuart
From aph at redhat.com Mon Apr 20 16:35:35 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 17:35:35 +0100 Subject: RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com> References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <8f317840-a2b2-3ccb-fbb2-a38b2ebcbf4b@oracle.com> <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com> Message-ID: <7cd269af-c621-8d33-c7d6-1baa6729fc31@redhat.com> On 4/20/20 3:19 PM, Stuart Monteith wrote: > If anyone has bandwidth, would they be able to review this patch? It > addresses Andrew, Per and Erik's comments: > http://cr.openjdk.java.net/~smonteith/8216557/webrev.1/ Yes, yes. I'm pedalling as quickly as I can. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd.
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From aph at redhat.com Mon Apr 20 17:22:12 2020 From: aph at redhat.com (Andrew Haley) Date: Mon, 20 Apr 2020 18:22:12 +0100 Subject: RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com> References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <8f317840-a2b2-3ccb-fbb2-a38b2ebcbf4b@oracle.com> <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com> Message-ID: On 4/20/20 3:19 PM, Stuart Monteith wrote: > If anyone has bandwidth, would the be able to review this patch? It > addresses Andrew, Per and Erik's comments: > http://cr.openjdk.java.net/~smonteith/8216557/webrev.1/ It looks right. For clarity I wonder if perhaps we should have a method bool BarrierSet::use_nmethod_barriers() or somesuch. It would be much easier to read. In future we should perhaps not inline the guard value, and move the infrequently-executed code out of line. Then this: 0x0000ffffa97fff14: ldr w8, 0x0000ffffa97fff3c 0x0000ffffa97fff18: dmb ishld 0x0000ffffa97fff1c: ldr w9, [x28, #36] 0x0000ffffa97fff20: cmp w8, w9 0x0000ffffa97fff24: b.eq 0x0000ffffa97fff40 // b.none ;; 0xFFFFA9118B00 0x0000ffffa97fff28: mov x8, #0x8b00 // #35584 0x0000ffffa97fff2c: movk x8, #0xa911, lsl #16 0x0000ffffa97fff30: movk x8, #0xffff, lsl #32 0x0000ffffa97fff34: blr x8 0x0000ffffa97fff38: b 0x0000ffffa97fff40 0x0000ffffa97fff3c would turn into this: : ldr w8, 0x0000ffffa97fff3c : dmb ishld : ldr w9, [x28, #36] : cmp w8, w9 : b.ne 0x0000ffffa97fff40 // b.none but we don't have to do that right now. OK. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. 
https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From stuart.monteith at arm.com Mon Apr 20 19:35:19 2020 From: stuart.monteith at arm.com (Stuart Monteith) Date: Mon, 20 Apr 2020 20:35:19 +0100 Subject: RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <8f317840-a2b2-3ccb-fbb2-a38b2ebcbf4b@oracle.com> <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com> Message-ID: On 20/04/2020 18:22, Andrew Haley wrote: > On 4/20/20 3:19 PM, Stuart Monteith wrote: >> If anyone has bandwidth, would the be able to review this patch? It >> addresses Andrew, Per and Erik's comments: >> http://cr.openjdk.java.net/~smonteith/8216557/webrev.1/ > > It looks right. For clarity I wonder if perhaps we should have a method > bool BarrierSet::use_nmethod_barriers() or somesuch. It would be much > easier to read. > That would be good. How about I apply that as a separate patch, as it would necessarily be cross-platform. > In future we should perhaps not inline the guard value, and move the > infrequently-executed code out of line. > > Then this: > > 0x0000ffffa97fff14: ldr w8, 0x0000ffffa97fff3c > 0x0000ffffa97fff18: dmb ishld > 0x0000ffffa97fff1c: ldr w9, [x28, #36] > 0x0000ffffa97fff20: cmp w8, w9 > 0x0000ffffa97fff24: b.eq 0x0000ffffa97fff40 // b.none > ;; 0xFFFFA9118B00 > 0x0000ffffa97fff28: mov x8, #0x8b00 // #35584 > 0x0000ffffa97fff2c: movk x8, #0xa911, lsl #16 > 0x0000ffffa97fff30: movk x8, #0xffff, lsl #32 > 0x0000ffffa97fff34: blr x8 > 0x0000ffffa97fff38: b 0x0000ffffa97fff40 > 0x0000ffffa97fff3c > > would turn into this: > > : ldr w8, 0x0000ffffa97fff3c > : dmb ishld > : ldr w9, [x28, #36] > : cmp w8, w9 > : b.ne 0x0000ffffa97fff40 // b.none > > but we don't have to do that right now. OK. 
> That would be ideal - a CodeStub to handle the slow path would then take the responsibility for recording the relative location of the guard value - currently that happens to be fixed. I think I'd prefer to do that as a separate patch while I work out the details. Thanks for the review, Stuart From per.liden at oracle.com Tue Apr 21 08:20:02 2020 From: per.liden at oracle.com (Per Liden) Date: Tue, 21 Apr 2020 10:20:02 +0200 Subject: RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com> References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <8f317840-a2b2-3ccb-fbb2-a38b2ebcbf4b@oracle.com> <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com> Message-ID: <90c015a8-3db9-b7bd-f8d3-f05f5e6458d3@oracle.com> Looks good to me. One minor thing, you no longer need -XX:+UnlockExperimentalVMOptions in test/hotspot/jtreg/gc/stress/gcbasher/TestGCBasherWithZ.java. I don't need to see another webrev for that. cheers, Per On 4/20/20 4:19 PM, Stuart Monteith wrote: > Hi, > If anyone has bandwidth, would the be able to review this patch? It > addresses Andrew, Per and Erik's comments: > http://cr.openjdk.java.net/~smonteith/8216557/webrev.1/ > > Thanks, > Stuart > > > On 27/03/2020 09:47, Erik ?sterlund wrote: >> Hi Stuart, >> >> Thanks for sorting this out on AArch64. It is nice to see thatyou can >> implement these >> barriers on platforms that do not have instruction cache coherency. >> >> One small change request: >> It looks like in C1 you inject the entry barrier right after build_frame >> is done: >> >> ?629?????? build_frame(); >> ?630?????? { >> ?631???????? // Insert nmethod entry barrier into frame. >> ?632???????? BarrierSetAssembler* bs = >> BarrierSet::barrier_set()->barrier_set_assembler(); >> ?633???????? bs->nmethod_entry_barrier(_masm); >> ?634?????? } >> >> Unfortunately, this is in the platform independent part of the LIR >> assembler. 
In the x86 version >> we inject it at the very end of build_frame() instead, which is a >> platform-specific function. >> The platform-specific function is in the C1 macro assembler file for >> that platform. >> >> We intentionally put it in the platform-specific path as it is a >> platform-specific feature. >> Now on x86, the barrier code will be emitted once in build_frame() and >> once after returning >> from build_frame, resulting in two nmethod entry barriers, and only the >> first one will get >> patched, causing the second one to mostly take slow paths, which isn't >> necessarily wrong, >> but will cause regressions. >> >> I would propose you just move those lines into the very end of the >> AArch64-specific part of >> build_frame(). >> >> I don't need to see another webrev for that trivial code motion. This >> looks good to me. >> Agan, thanks a lot for fixing this! It will allow me to go forward with >> concurrent stack >> scanning on AArch64 as well. >> >> Thanks, >> /Erik >> >> >> On 2020-03-26 23:42, Stuart Monteith wrote: >>> Hello, >>> ???????? Please review this change to implement nmethod entry barriers on >>> aarch64, and hence concurrent class unloading with ZGC. Shenandoah will >>> need to be separately tested and enabled - there are problems with this >>> on Shenandoah. >>> >>> It has been tested with JTreg, runs with SPECjbb, gcbench, and Lucene as >>> well as Netbeans. >>> >>> In terms of interesting features: >>> ????????? With nmethod entry barriers,? immediate oops are removed by: >>> ???????????????? LIR_Assembler::jobject2reg? and? MacroAssembler::movoop >>> ???????? This is to ensure consistency with the entry barrier, as >>> otherwise with >>> an immediate we'd otherwise need an ISB. >>> >>> ???????? I've added "-XX:DeoptNMethodBarrierALot". I found this >>> functionality >>> useful in testing as deoptimisation is very infrequent. I've written it >>> as an atomic to avoid it happening too frequently. 
>>> As it is a new option, I'm not sure whether any more is needed than
>>> this review. A new test has been added,
>>> "test/hotspot/jtreg/gc/stress/gcbasher/TestGCBasherDeoptWithZ.java",
>>> to test GC with that option enabled.
>>>
>>> BarrierSetAssembler::nmethod_entry_barrier
>>> This method emits the barrier code. In internal review it was suggested
>>> the "dmb( ISHLD )" should be replaced by "membar(LoadLoad)". I've not
>>> done this as the BarrierSetNMethod code checks the exact instruction
>>> sequence, and I prefer to be explicit.
>>>
>>> Benchmarking method entry shows an increase of around 6ns with the
>>> nmethod entry barrier.
>>>
>>> The deoptimisation code was contributed by Andrew Haley.
>>>
>>> The bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8216557
>>>
>>> The webrev:
>>> http://cr.openjdk.java.net/~smonteith/8216557/webrev.0/
>>>
>>> BR,
>>> Stuart
>

From kirk at kodewerk.com Tue Apr 21 21:23:38 2020
From: kirk at kodewerk.com (Kirk Pepperdine)
Date: Tue, 21 Apr 2020 14:23:38 -0700
Subject: ZGC logs
Message-ID: 

Hi,

I've been looking at GC log data trying to reconcile some of the numbers.
[4.719s][info ][gc,heap ] GC(3)             Mark Start     Mark End   Relocate Start  Relocate End      High          Low
[4.719s][info ][gc,heap ] GC(3)  Capacity:  1032M (25%)  1176M (29%)   1176M (29%)    1176M (29%)   1176M (29%)  1032M (25%)
[4.719s][info ][gc,heap ] GC(3)   Reserve:    42M (1%)     42M (1%)      42M (1%)       42M (1%)      42M (1%)     42M (1%)
[4.719s][info ][gc,heap ] GC(3)      Free:  3064M (75%)  2954M (72%)   3818M (93%)    3704M (90%)   3842M (94%)  2920M (71%)
[4.719s][info ][gc,heap ] GC(3)      Used:   990M (24%)  1100M (27%)    236M (6%)      350M (9%)    1134M (28%)   212M (5%)
[4.719s][info ][gc,heap ] GC(3)      Live:       -         10M (0%)      10M (0%)       10M (0%)        -            -
[4.719s][info ][gc,heap ] GC(3) Allocated:       -        208M (5%)     240M (6%)      764M (19%)       -            -
[4.719s][info ][gc,heap ] GC(3)   Garbage:       -        979M (24%)     83M (2%)        5M (0%)        -            -
[4.719s][info ][gc,heap ] GC(3) Reclaimed:       -            -         896M (22%)     974M (24%)       -            -

If I understand what this log is telling me, there is initially 990M of data of which 10M is marked Live, leaving 979M of garbage (-1 rounding error). During the concurrent mark an additional 208M of data was allocated, requiring an additional 144M from committed memory. By Mark End, used increased by 110M, not 208M. That leaves 98M unaccounted for. I have some ideas, but I'd feel more comfortable if someone could offer a comment on how to reconcile the live, allocated, garbage and reclaimed numbers with used. For example, I would assume that the 350M remaining at Relocate End is a mix of live and floating garbage. That said, I cannot seem to reconcile the Used, Allocated, and Reclaimed in a way that yields 350M. The calculation that seems to work is used at Mark End + allocated - reclaimed = used @ Relocate Start. That is 1100 + (240-208) - 896 = 236. However, applying that logic shifted over fails: 236 + (764-240) - (974-896) = 682 != 350. So I seem to be missing bits that my staring at zStat hasn't cleared up.

All comments appreciated.
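The arithmetic above can be replayed directly from the table. A small Python sketch; the phase-to-phase identity (used at next phase = used + newly allocated - newly reclaimed) is a hypothesis about how the columns relate, not documented ZGC semantics:

```python
# Figures from the GC(3) heap table, in MB.
used      = {"mark_end": 1100, "reloc_start": 236, "reloc_end": 350}
allocated = {"mark_end": 208,  "reloc_start": 240, "reloc_end": 764}
reclaimed = {"reloc_start": 896, "reloc_end": 974}

# The identity holds from Mark End to Relocate Start ...
lhs = used["mark_end"] \
    + (allocated["reloc_start"] - allocated["mark_end"]) \
    - reclaimed["reloc_start"]
assert lhs == used["reloc_start"]   # 1100 + 32 - 896 == 236

# ... but applying the same identity one phase later does not:
lhs = used["reloc_start"] \
    + (allocated["reloc_end"] - allocated["reloc_start"]) \
    - (reclaimed["reloc_end"] - reclaimed["reloc_start"])
assert lhs == 682                   # 236 + 524 - 78
assert lhs != used["reloc_end"]     # 682 != 350, the unexplained gap
```

The 332M discrepancy in the second step is the quantity being asked about here.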
Kind regards, Kirk From per.liden at oracle.com Wed Apr 22 10:46:28 2020 From: per.liden at oracle.com (Per Liden) Date: Wed, 22 Apr 2020 12:46:28 +0200 Subject: ZGC logs In-Reply-To: References: Message-ID: Hi Kirk, On 4/21/20 11:23 PM, Kirk Pepperdine wrote: > Hi, > > I?ve been looking at GC log data trying to reconcile some of the numbers. > > [4.719s][info ][gc,heap ] GC(3) Mark Start Mark End Relocate Start Relocate End High Low > [4.719s][info ][gc,heap ] GC(3) Capacity: 1032M (25%) 1176M (29%) 1176M (29%) 1176M (29%) 1176M (29%) 1032M (25%) > [4.719s][info ][gc,heap ] GC(3) Reserve: 42M (1%) 42M (1%) 42M (1%) 42M (1%) 42M (1%) 42M (1%) > [4.719s][info ][gc,heap ] GC(3) Free: 3064M (75%) 2954M (72%) 3818M (93%) 3704M (90%) 3842M (94%) 2920M (71%) > [4.719s][info ][gc,heap ] GC(3) Used: 990M (24%) 1100M (27%) 236M (6%) 350M (9%) 1134M (28%) 212M (5%) > [4.719s][info ][gc,heap ] GC(3) Live: - 10M (0%) 10M (0%) 10M (0%) - - > [4.719s][info ][gc,heap ] GC(3) Allocated: - 208M (5%) 240M (6%) 764M (19%) - - > [4.719s][info ][gc,heap ] GC(3) Garbage: - 979M (24%) 83M (2%) 5M (0%) - - > [4.719s][info ][gc,heap ] GC(3) Reclaimed: - - 896M (22%) 974M (24%) - - > > > If I understand what this log is telling me, there is initially 990M of data of which 10M is marked Live leaving 979M of garbage (-1 rounding error). During the concurrent mark an additional 208M of data was allocated requiring an additional 144M from committed memory. By Mark End, used increase by 110M, not 208M. That leaves 98M unaccounted for. I have some ideas but I?d feel more comfortable if someone could offer a comment on how to reconcile the live, allocated, garbage and reclaimed numbers with used. For example, I would assume that of the 350M remaining at Relocate End is a mix of live and floating garbage. That said, I cannot seem to reconcile the Used, allocated, and reclaimed in a way that yields 350M. 
> The calculation that seems to work is used at Mark End + allocated - reclaimed = used @ Relocate Start. That is 1100 + (240-208) - 896 = 236. However, applying that logic shifted over fails: 236 + (764-240) - (974-896) = 682 != 350.
>
> So I seem to be missing bits that my staring at zStat hasn't cleared up.
>
> All comments appreciated.

What's confusing, as it doesn't show in the log, is that the "allocated" number can be inflated because some page allocations were "undone". This can happen in situations where, for example, two Java threads compete to allocate a new shared medium page: both threads allocate one medium page each, but only one thread will win the race to install that page. The thread that lost the race will then immediately free its newly allocated page (undo the allocation). However, we only increase the "allocated" number when pages are allocated, but we don't decrease it when we undo an allocation.

You can confirm that it's caused by "undo" by looking in the statistics table for the "Memory: Undo Page Allocation" line, which should show non-zero numbers. The need to undo allocations is usually a rare event, at least on most workloads.

One could argue that the "allocated" number is correct (in the sense that these pages were in fact allocated), but I think it would be more helpful to log the "allocated - undone" number, as the undo part is more of an artifact of how things work internally, and not what a user expects to see.

Here's a patch to adjust "allocated" so that it takes undone allocations into account. Feel free to try it out and see if the numbers make more sense.
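The undo scenario described above can be modelled in a few lines of Python (a toy sketch with invented names, not ZGC code) to show why the logged "allocated" number ends up inflated:

```python
# Toy model of the page accounting described above. Names are
# hypothetical; this is an illustration, not ZGC's implementation.

class PageAccounting:
    def __init__(self):
        self.allocated = 0  # bumped on every page allocation
        self.reclaimed = 0  # bumped only for pages freed after relocation
        self.used = 0

    def allocate_page(self, size):
        self.allocated += size
        self.used += size

    def free_page(self, size, reclaimed):
        # 'reclaimed' is True when a worker frees a page after relocation,
        # False when an allocation is undone (e.g. a lost race to install
        # a shared medium page). The undo path does NOT adjust 'allocated'.
        if reclaimed:
            self.reclaimed += size
        self.used -= size

acct = PageAccounting()
acct.allocate_page(2)                # winner of the race, in MB
acct.allocate_page(2)                # loser of the race
acct.free_page(2, reclaimed=False)   # loser undoes its allocation

assert acct.used == 2       # only one 2M page is really in use
assert acct.allocated == 4  # but "allocated" reports 4M: inflated
assert acct.reclaimed == 0  # the undo did not count as reclaimed
```

Decrementing the allocated counter on the non-reclaimed path, as the patch below does, removes that inflation.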
diff --git a/src/hotspot/share/gc/z/zPageAllocator.cpp b/src/hotspot/share/gc/z/zPageAllocator.cpp
--- a/src/hotspot/share/gc/z/zPageAllocator.cpp
+++ b/src/hotspot/share/gc/z/zPageAllocator.cpp
@@ -287,12 +287,14 @@
 }

 void ZPageAllocator::decrease_used(size_t size, bool reclaimed) {
+  // Only pages explicitly released with the reclaimed flag set
+  // counts as reclaimed bytes. This flag is true when a worker
+  // releases a page after relocation, and is false when we
+  // release a page to undo an allocation.
   if (reclaimed) {
-    // Only pages explicitly released with the reclaimed flag set
-    // counts as reclaimed bytes. This flag is typically true when
-    // a worker releases a page after relocation, and is typically
-    // false when we release a page to undo an allocation.
     _reclaimed += size;
+  } else {
+    _allocated -= size;
   }
   _used -= size;
   if (_used < _used_low) {

cheers, Per

>
> Kind regards,
> Kirk
>

From erik.osterlund at oracle.com Thu Apr 23 10:51:46 2020
From: erik.osterlund at oracle.com (=?UTF-8?Q?Erik_=c3=96sterlund?=)
Date: Thu, 23 Apr 2020 12:51:46 +0200
Subject: RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading
In-Reply-To: <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com>
References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <8f317840-a2b2-3ccb-fbb2-a38b2ebcbf4b@oracle.com> <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com>
Message-ID: 

Hi Stuart,

This looks good to me.

Thanks,
/Erik

On 2020-04-20 16:19, Stuart Monteith wrote:
> Hi,
> If anyone has bandwidth, would they be able to review this patch? It
> addresses Andrew, Per and Erik's comments:
> http://cr.openjdk.java.net/~smonteith/8216557/webrev.1/
>
> Thanks,
> Stuart
>
>
> On 27/03/2020 09:47, Erik Österlund wrote:
>> Hi Stuart,
>>
>> Thanks for sorting this out on AArch64. It is nice to see that you can
>> implement these barriers on platforms that do not have instruction
>> cache coherency.
>> >> One small change request: >> It looks like in C1 you inject the entry barrier right after build_frame >> is done: >> >> ?629?????? build_frame(); >> ?630?????? { >> ?631???????? // Insert nmethod entry barrier into frame. >> ?632???????? BarrierSetAssembler* bs = >> BarrierSet::barrier_set()->barrier_set_assembler(); >> ?633???????? bs->nmethod_entry_barrier(_masm); >> ?634?????? } >> >> Unfortunately, this is in the platform independent part of the LIR >> assembler. In the x86 version >> we inject it at the very end of build_frame() instead, which is a >> platform-specific function. >> The platform-specific function is in the C1 macro assembler file for >> that platform. >> >> We intentionally put it in the platform-specific path as it is a >> platform-specific feature. >> Now on x86, the barrier code will be emitted once in build_frame() and >> once after returning >> from build_frame, resulting in two nmethod entry barriers, and only the >> first one will get >> patched, causing the second one to mostly take slow paths, which isn't >> necessarily wrong, >> but will cause regressions. >> >> I would propose you just move those lines into the very end of the >> AArch64-specific part of >> build_frame(). >> >> I don't need to see another webrev for that trivial code motion. This >> looks good to me. >> Agan, thanks a lot for fixing this! It will allow me to go forward with >> concurrent stack >> scanning on AArch64 as well. >> >> Thanks, >> /Erik >> >> >> On 2020-03-26 23:42, Stuart Monteith wrote: >>> Hello, >>> ???????? Please review this change to implement nmethod entry barriers on >>> aarch64, and hence concurrent class unloading with ZGC. Shenandoah will >>> need to be separately tested and enabled - there are problems with this >>> on Shenandoah. >>> >>> It has been tested with JTreg, runs with SPECjbb, gcbench, and Lucene as >>> well as Netbeans. >>> >>> In terms of interesting features: >>> ????????? With nmethod entry barriers,? 
immediate oops are removed by:
>>> LIR_Assembler::jobject2reg and MacroAssembler::movoop
>>> This is to ensure consistency with the entry barrier, as with an
>>> immediate we'd otherwise need an ISB.
>>>
>>> I've added "-XX:DeoptNMethodBarrierALot". I found this functionality
>>> useful in testing as deoptimisation is very infrequent. I've written it
>>> as an atomic to avoid it happening too frequently. As it is a new
>>> option, I'm not sure whether any more is needed than this review. A new
>>> test has been added,
>>> "test/hotspot/jtreg/gc/stress/gcbasher/TestGCBasherDeoptWithZ.java",
>>> to test GC with that option enabled.
>>>
>>> BarrierSetAssembler::nmethod_entry_barrier
>>> This method emits the barrier code. In internal review it was suggested
>>> the "dmb( ISHLD )" should be replaced by "membar(LoadLoad)". I've not
>>> done this as the BarrierSetNMethod code checks the exact instruction
>>> sequence, and I prefer to be explicit.
>>>
>>> Benchmarking method entry shows an increase of around 6ns with the
>>> nmethod entry barrier.
>>>
>>> The deoptimisation code was contributed by Andrew Haley.
>>>
>>> The bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8216557
>>>
>>> The webrev:
>>> http://cr.openjdk.java.net/~smonteith/8216557/webrev.0/
>>>
>>> BR,
>>> Stuart

From per.liden at oracle.com Thu Apr 23 17:31:32 2020
From: per.liden at oracle.com (Per Liden)
Date: Thu, 23 Apr 2020 19:31:32 +0200
Subject: ZGC logs
In-Reply-To: 
References: 
Message-ID: <22dcaedb-26f7-e78f-081a-fc04f9661fe8@oracle.com>

On 4/22/20 12:46 PM, Per Liden wrote:
> Hi Kirk,
>
> On 4/21/20 11:23 PM, Kirk Pepperdine wrote:
>> Hi,
>>
>> I've been looking at GC log data trying to reconcile some of the numbers.
>>
>> [4.719s][info ][gc,heap ] GC(3)             Mark Start     Mark End   Relocate Start  Relocate End      High          Low
>> [4.719s][info ][gc,heap ] GC(3)  Capacity:  1032M (25%)  1176M (29%)   1176M (29%)    1176M (29%)   1176M (29%)  1032M (25%)
>> [4.719s][info ][gc,heap ] GC(3)   Reserve:    42M (1%)     42M (1%)      42M (1%)       42M (1%)      42M (1%)     42M (1%)
>> [4.719s][info ][gc,heap ] GC(3)      Free:  3064M (75%)  2954M (72%)   3818M (93%)    3704M (90%)   3842M (94%)  2920M (71%)
>> [4.719s][info ][gc,heap ] GC(3)      Used:   990M (24%)  1100M (27%)    236M (6%)      350M (9%)    1134M (28%)   212M (5%)
>> [4.719s][info ][gc,heap ] GC(3)      Live:       -         10M (0%)      10M (0%)       10M (0%)        -            -
>> [4.719s][info ][gc,heap ] GC(3) Allocated:       -        208M (5%)     240M (6%)      764M (19%)       -            -
>> [4.719s][info ][gc,heap ] GC(3)   Garbage:       -        979M (24%)     83M (2%)        5M (0%)        -            -
>> [4.719s][info ][gc,heap ] GC(3) Reclaimed:       -            -         896M (22%)     974M (24%)       -            -
>>
>>
>> If I understand what this log is telling me, there is initially 990M
>> of data of which 10M is marked Live leaving 979M of garbage (-1
>> rounding error). During the concurrent mark an additional 208M of data
>> was allocated requiring an additional 144M from committed memory. By
>> Mark End, used increased by 110M, not 208M. That leaves 98M unaccounted
>> for.
I have some ideas but I?d feel more comfortable if someone could >> offer a comment on how to reconcile the live, allocated, garbage and >> reclaimed numbers with used. For example, I would assume that of the >> 350M remaining at Relocate End is a mix of live and floating garbage. >> That said, I cannot seem to reconcile the Used, allocated, and >> reclaimed in a way that yields 350M. The calculation that seems to >> work is used at Mark End + allocated - recovered = used @ Relocate >> Start. That is 1100 + (240-208)-896 = 336. However applying that logic >> shifted over fails. 236 + (764-240) - (974-896) = 681 != 350. >> >> So I?m seem to be missing bits that my starting at zStat hasn?t >> cleared up. >> >> All comments appreciated. > > What's confusing, as it doesn't show in the log, is that the "allocated" > number can be inflated because some page allocations where "undone". > This can happen in situations where, for example, two Java threads > competed to allocate a new shared medium page, both threads allocate one > medium page each, but only one thread will win the race to install that > page. The thread who lost the race will then immediately free its newly > allocated page (undo the allocation). However, we only increase the > "allocated" number when pages are allocated, but we don't decrease it > when we undo an allocation. > > You can confirm that it's cause by "undo" by looking at the statistics > table and look for the "Memory: Undo Page Allocation" line, which should > show non-zero numbers. The need to undo allocations is usually a rare > event, at least on most workloads. > > One could argue that the "allocated" number is correct (in the sense > that these pages where in fact allocated), but I think it would be more > helpful to log the "allocated - undone" number, as the undo part is more > of an artifact of how things work internally, and not what a user > expects to see. 
> > Here's a patch to adjust "allocated" so that it takes undone allocations > into account. Feel free to try it out and see if the numbers make more > sense. > > diff --git a/src/hotspot/share/gc/z/zPageAllocator.cpp > b/src/hotspot/share/gc/z/zPageAllocator.cpp > --- a/src/hotspot/share/gc/z/zPageAllocator.cpp > +++ b/src/hotspot/share/gc/z/zPageAllocator.cpp > @@ -287,12 +287,14 @@ > ?} > > ?void ZPageAllocator::decrease_used(size_t size, bool reclaimed) { > +? // Only pages explicitly released with the reclaimed flag set > +? // counts as reclaimed bytes. This flag is true when a worker > +? // releases a page after relocation, and is false when we > +? // release a page to undo an allocation. > ?? if (reclaimed) { > -??? // Only pages explicitly released with the reclaimed flag set > -??? // counts as reclaimed bytes. This flag is typically true when > -??? // a worker releases a page after relocation, and is typically > -??? // false when we release a page to undo an allocation. > ???? _reclaimed += size; > +? } else { > +??? _allocated -= size; > ?? } > ?? _used -= size; > ?? if (_used < _used_low) { > Just a follow up, to say that he above fix has now been integrated. http://hg.openjdk.java.net/jdk/jdk/rev/5736a55d1389 cheers, Per From stuart.monteith at arm.com Mon Apr 27 16:34:49 2020 From: stuart.monteith at arm.com (Stuart Monteith) Date: Mon, 27 Apr 2020 17:34:49 +0100 Subject: RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <8f317840-a2b2-3ccb-fbb2-a38b2ebcbf4b@oracle.com> <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com> Message-ID: <4a01252e-e7b2-d2bc-2858-ca1785b8b2a2@arm.com> Thanks Erik, Per, Andrew, I've fixed up the testcase and retested. Uploaded here: http://cr.openjdk.java.net/~smonteith/8216557/webrev.2/ Would someone be able to submit this for me? Thanks, Stuart On 23/04/2020 11:51, Erik ?sterlund wrote: > Hi Stuart, > > This looks good to me. 
> > Thanks, > /Erik > > On 2020-04-20 16:19, Stuart Monteith wrote: >> Hi, >> ????If anyone has bandwidth, would the be able to review this patch? It >> addresses Andrew, Per and Erik's comments: >> ???????? http://cr.openjdk.java.net/~smonteith/8216557/webrev.1/ >> >> Thanks, >> ????Stuart >> >> >> On 27/03/2020 09:47, Erik ?sterlund wrote: >>> Hi Stuart, >>> >>> Thanks for sorting this out on AArch64. It is nice to see thatyou can >>> implement these >>> barriers on platforms that do not have instruction cache coherency. >>> >>> One small change request: >>> It looks like in C1 you inject the entry barrier right after build_frame >>> is done: >>> >>> ??629?????? build_frame(); >>> ??630?????? { >>> ??631???????? // Insert nmethod entry barrier into frame. >>> ??632???????? BarrierSetAssembler* bs = >>> BarrierSet::barrier_set()->barrier_set_assembler(); >>> ??633???????? bs->nmethod_entry_barrier(_masm); >>> ??634?????? } >>> >>> Unfortunately, this is in the platform independent part of the LIR >>> assembler. In the x86 version >>> we inject it at the very end of build_frame() instead, which is a >>> platform-specific function. >>> The platform-specific function is in the C1 macro assembler file for >>> that platform. >>> >>> We intentionally put it in the platform-specific path as it is a >>> platform-specific feature. >>> Now on x86, the barrier code will be emitted once in build_frame() and >>> once after returning >>> from build_frame, resulting in two nmethod entry barriers, and only the >>> first one will get >>> patched, causing the second one to mostly take slow paths, which isn't >>> necessarily wrong, >>> but will cause regressions. >>> >>> I would propose you just move those lines into the very end of the >>> AArch64-specific part of >>> build_frame(). >>> >>> I don't need to see another webrev for that trivial code motion. This >>> looks good to me. >>> Agan, thanks a lot for fixing this! 
It will allow me to go forward with >>> concurrent stack >>> scanning on AArch64 as well. >>> >>> Thanks, >>> /Erik >>> >>> >>> On 2020-03-26 23:42, Stuart Monteith wrote: >>>> Hello, >>>> ????????? Please review this change to implement nmethod entry barriers on >>>> aarch64, and hence concurrent class unloading with ZGC. Shenandoah will >>>> need to be separately tested and enabled - there are problems with this >>>> on Shenandoah. >>>> >>>> It has been tested with JTreg, runs with SPECjbb, gcbench, and Lucene as >>>> well as Netbeans. >>>> >>>> In terms of interesting features: >>>> ?????????? With nmethod entry barriers,? immediate oops are removed by: >>>> ????????????????? LIR_Assembler::jobject2reg? and? MacroAssembler::movoop >>>> ????????? This is to ensure consistency with the entry barrier, as >>>> otherwise with >>>> an immediate we'd otherwise need an ISB. >>>> >>>> ????????? I've added "-XX:DeoptNMethodBarrierALot". I found this >>>> functionality >>>> useful in testing as deoptimisation is very infrequent. I've written it >>>> as an atomic to avoid it happening too frequently. As it is a new >>>> option, I'm not sure whether any more is needed than this review. A new >>>> test has been added >>>> "test/hotspot/jtreg/gc/stress/gcbasher/TestGCBasherDeoptWithZ.java" to >>>> test GC with that option enabled. >>>> >>>> ????????? BarrierSetAssembler::nmethod_entry_barrier >>>> ????????? This method emits the barrier code. In internal review it was >>>> suggested >>>> the "dmb( ISHLD )" should be replaced by "membar(LoadLoad)". I've not >>>> done this as the BarrierSetNMethod code checks the exact instruction >>>> sequence, and I prefer to be explicit. >>>> >>>> ????????? Benchmarking method entry shows an increase of around 6ns >>>> with the >>>> nmethod entry barrier. >>>> >>>> >>>> The deoptimisation code was contributed by Andrew Haley. >>>> >>>> The bug: >>>> ????????? 
https://bugs.openjdk.java.net/browse/JDK-8216557 >>>> >>>> The webrev: >>>> ????????? http://cr.openjdk.java.net/~smonteith/8216557/webrev.0/ >>>> >>>> >>>> BR, >>>> ????????? Stuart > From stuart.monteith at arm.com Tue Apr 28 11:28:52 2020 From: stuart.monteith at arm.com (Stuart Monteith) Date: Tue, 28 Apr 2020 12:28:52 +0100 Subject: RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: <3f193fdc-b1fb-9f0a-4635-acdb7de29bca@arm.com> References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <8f317840-a2b2-3ccb-fbb2-a38b2ebcbf4b@oracle.com> <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com> <4a01252e-e7b2-d2bc-2858-ca1785b8b2a2@arm.com> <3f193fdc-b1fb-9f0a-4635-acdb7de29bca@arm.com> Message-ID: On 28/04/2020 06:26, Ningsheng Jian wrote: > Hi Stuart, > > On 4/28/20 12:34 AM, Stuart Monteith wrote: >> Thanks Erik, Per, Andrew, >> ????I've fixed up the testcase and retested. >> >> Uploaded here: >> >> ????http://cr.openjdk.java.net/~smonteith/8216557/webrev.2/ >> >> Would someone be able to submit this for me? >> > > I submitted a build job before pushing your code, but it failed to build with minimal variant configure. Here's error > message: > > ./src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp: In static member function 'static AdapterHandlerEntry* > SharedRuntime::generate_i2c2i_adapters(MacroAssembler*, int, int, const BasicType*, const VMRegPair*, > AdapterFingerPrint*)': > > ./src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp:736:5: error: invalid use of incomplete type 'class > BarrierSetAssembler' > > ?? bs->c2i_entry_barrier(masm); > > I think you need to include barrierSetAssembler.hpp in sharedRuntime_aarch64.cpp? > > Thanks, > Ningsheng Thanks for that Ningsheng - I've made some changes, and built with minimal. 
The revised patch: http://cr.openjdk.java.net/~smonteith/8216557/webrev.3/ There were contributions from aph at redhat.com Thanks, Stuart From ningsheng.jian at arm.com Wed Apr 29 06:59:18 2020 From: ningsheng.jian at arm.com (Ningsheng Jian) Date: Wed, 29 Apr 2020 14:59:18 +0800 Subject: RFR: 8216557 Aarch64: Add support for Concurrent Class Unloading In-Reply-To: References: <520f8085-eaa0-46bc-9eb9-c1244fca2531@arm.com> <8f317840-a2b2-3ccb-fbb2-a38b2ebcbf4b@oracle.com> <7e49dc25-da51-50d3-eb3f-4840dab7db47@arm.com> <4a01252e-e7b2-d2bc-2858-ca1785b8b2a2@arm.com> <3f193fdc-b1fb-9f0a-4635-acdb7de29bca@arm.com> Message-ID: <0f602574-a4ea-da3e-46de-d35862e276d6@arm.com> On 4/28/20 7:28 PM, Stuart Monteith wrote: > On 28/04/2020 06:26, Ningsheng Jian wrote: >> Hi Stuart, >> >> On 4/28/20 12:34 AM, Stuart Monteith wrote: >>> Thanks Erik, Per, Andrew, >>> ????I've fixed up the testcase and retested. >>> >>> Uploaded here: >>> >>> ????http://cr.openjdk.java.net/~smonteith/8216557/webrev.2/ >>> >>> Would someone be able to submit this for me? >>> >> >> I submitted a build job before pushing your code, but it failed to build with minimal variant configure. Here's error >> message: >> >> ./src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp: In static member function 'static AdapterHandlerEntry* >> SharedRuntime::generate_i2c2i_adapters(MacroAssembler*, int, int, const BasicType*, const VMRegPair*, >> AdapterFingerPrint*)': >> >> ./src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp:736:5: error: invalid use of incomplete type 'class >> BarrierSetAssembler' >> >> ?? bs->c2i_entry_barrier(masm); >> >> I think you need to include barrierSetAssembler.hpp in sharedRuntime_aarch64.cpp? >> >> Thanks, >> Ningsheng > > Thanks for that Ningsheng - I've made some changes, and built with minimal. > > The revised patch: > > http://cr.openjdk.java.net/~smonteith/8216557/webrev.3/ > Looks good and pushed. Thanks, Ningsheng