From jmu at activeviam.com Wed May 3 17:01:40 2017
From: jmu at activeviam.com (José Muñoz)
Date: Wed, 3 May 2017 19:01:40 +0200
Subject: Long TTSP
Message-ID:

Hi,

I'm analyzing the performance of a VM with a heap of 160g and I see TTSP
(time-to-safepoint) pauses of up to 2 seconds.

I added some flags to get more information:

-XX:+PrintSafepointStatistics
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=1000
-XX:PrintSafepointStatisticsTimeout=1000
-XX:PrintSafepointStatisticsCount=1

but I couldn't find any flag that prints the stack trace of a thread when
it fails to reach the safepoint within the delay. Is there a way to figure
out which code a thread is executing when it exceeds the TTSP threshold?

Thanks,
Jose

From yu.zhang at oracle.com Wed May 3 17:58:15 2017
From: yu.zhang at oracle.com (yu.zhang at oracle.com)
Date: Wed, 3 May 2017 10:58:15 -0700
Subject: Long TTSP
In-Reply-To:
References:
Message-ID: <3e6f8d45-ffe6-78b7-c43c-bc1f741b0bdc@oracle.com>

Hi,

There is a development flag, -XX:+DieOnSafepointTimeout, that will create a
core dump when the time to safepoint exceeds the threshold. I suggest you
change this flag to a product flag and rebuild the JDK, because running a
development build itself may skew the application. Then you can debug the
core dump and see what is going on.
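For example, once you have a build in which DieOnSafepointTimeout is a
product flag, an invocation along these lines (just a sketch; the jar is a
placeholder for your application) should abort the VM and produce the core
dump as soon as a thread takes more than 1000ms to reach the safepoint:

  java -XX:+PrintSafepointStatistics \
       -XX:PrintSafepointStatisticsCount=1 \
       -XX:+SafepointTimeout \
       -XX:SafepointTimeoutDelay=1000 \
       -XX:+DieOnSafepointTimeout \
       -jar yourapp.jar

Also make sure core dumps are enabled in the OS (e.g. ulimit -c unlimited
on Linux), otherwise nothing will be written.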
I have found that sometimes a slow block device can cause the TTSP to
reach seconds. There is a good discussion here:
https://groups.google.com/forum/#!topic/nosql-databases/OSBlUVp0vbw

I am interested to learn what you have found.

Thanks
Jenny

On 05/03/2017 10:01 AM, José Muñoz wrote:
> Hi,
>
> I'm analyzing the performance of a VM with a heap of 160g and I see TTSP
> of up to 2 seconds.
>
> I added some flags to get more information:
>
> -XX:+PrintSafepointStatistics
> -XX:+SafepointTimeout
> -XX:SafepointTimeoutDelay=1000
> -XX:PrintSafepointStatisticsTimeout=1000
> -XX:PrintSafepointStatisticsCount=1
>
> but I couldn't find any flag that prints the stack trace of a thread
> when it fails to reach the safepoint within the delay. Is there a way to
> figure out which code a thread is executing when it exceeds the TTSP
> threshold?
>
> Thanks,
> Jose
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From thomas.schatzl at oracle.com Tue May 16 15:35:45 2017
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Tue, 16 May 2017 17:35:45 +0200
Subject: Benchmark scenario with high G1 performance degradation
In-Reply-To: <1857671.YkCaYY5MNi@tapsy>
References: <1857671.YkCaYY5MNi@tapsy>
Message-ID: <1494948945.2440.15.camel@oracle.com>

Hi Jens,

  sorry for the late reply...

On Thu, 2017-04-20 at 09:49 +0700, Jens Wilke wrote:
> Hello,
>
> I am currently benchmarking different in-process caching libraries.
> In some benchmark scenarios I found very "odd" results when using
> the G1 garbage collector. In a particular benchmark scenario the
> performance (as measured in ops/s of the particular benchmark)
> drops to about 30% when compared to CMS; typically I expect (and
> observe) only a degradation to around 80% of the performance with CMS.
>
> The blog post has a little more background on what I am doing:
> https://cruftex.net/2017/03/28/The-6-Memory-Metrics-You-Should-Track-in-Your-Java-Benchmarks.html

Some quirks about the benchmark setup:

- After every iteration it does not try to get the VM back to some kind
of initial state, but just continues running. This means that depending
on when GCs occur, the performance results can vary a lot (and they do);
e.g. I am noticing a consistent score error (deviation?) of +-50% for
Parallel GC. I would recommend longer iterations to improve
repeatability. Particularly Parallel GC with its regular full GCs suffers
a lot from this: depending on if or when full GCs occur, the results vary
from better than CMS to much worse.

Depending on how many of these full GCs occur (i.e. how much the default
policies of the different collectors expand the heap and when ergonomics
starts them - and the policies are quite different), the impact of the
various policies is larger than the impact of the GC algorithm itself.
E.g. Parallel GC young GCs seem to be faster than CMS's in this case; the
difference is that in this short-running benchmark CMS does not run into
a concurrent mode failure (full GC). It may also be that CMS never runs
into one, as there is no fragmentation. CMS happens to have more stable
behavior because its (still 400-500ms) pauses in this case are young-only
pauses.

- It is also unclear to me what you are measuring: probably the
out-of-the-box experience, which is a reasonable setup (I guess), but I
wonder why -XX:BiasedLockingStartupDelay=0 is among the options?

- The benchmark as given also does not do any warmup iterations. Looking
at the raw data for the various collectors, I would say that at least the
first three iterations give values that are heavily influenced by warmup
(in this case benefitting CMS). (I am aware that in the blog entry you
used two warmup iterations.)

> For the particular scenario I have made the following observations:
>
> - Java 8 VM 25.131-b11: average benchmark throughput is 2098524 ops/s
> - Java 9 VM 9-EA+165: average benchmark throughput is 566699 ops/s
> - Java 8 VM 25.131-b11: throughput is quite steady
> - Java 9 VM 9-EA+165: throughput has big variations and a tendency to
>   decrease
> - Java 8 VM 25.131-b11: VmRSS, as reported by Linux, grows to 4.2 GB
> - Java 9 VM 9-EA+165: VmRSS, as reported by Linux, grows to 6.3 GB

Which collectors do you compare in these results?

> - Java 9 VM 9-EA+165: profiling shows that 44.19% of CPU cycles are
>   spent in OtherRegionsTable::add_reference (for Java 8 G1 it is
>   similar)
>
> And less quantified:
>
> - With Java 8 G1 it seems even worse
> - Scenarios with smaller heap/cache sizes don't show the high
>   performance drop when comparing CMS and G1

In this case the caches seem to be the cause of the many old->young
references, as they are humongous objects (larger than half a region). I
did some tests with a largish young gen, trying to keep stuff in young
gen as long as possible, and at least G1 improved...
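If you want to try that yourself: G1's young gen sizing bounds are
experimental flags, so something like the following should work (the
percentages are just values I picked for the experiment, not a
recommendation):

  java -XX:+UseG1GC -XX:+UnlockExperimentalVMOptions \
       -XX:G1NewSizePercent=40 -XX:G1MaxNewSizePercent=60 \
       ... rest of the benchmark command line as before ...

Note that constraining the young gen like this limits G1's adaptive young
gen sizing, which normally follows the pause time goal, so it is only
meant for experiments.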
> - Java 9 VM 9-EA+165 with the options -XX:+UseParallelGC
>   -XX:+UseParallelOldGC seems to have 50% of the Java 8 performance and
>   higher memory consumption.
>   (What are the correct parameters to restore the old default behavior?)

I can not reproduce a regression for Parallel GC on a similar machine
(spec below). I did not try on larger machines, since my colleague
already indicated that G1 does not show this extreme behavior there.

> - The overall GC activity and time spent on GC is quite low

That may be the case (I am not completely sure about that; I did not
measure), but GC times highly influence the score. The scores of e.g.
Parallel GC drop significantly every time there is a full GC during an
iteration. Other than that, Parallel GC young GC times are better than
CMS's, which is imho substantiated by Parallel GC achieving higher scores
when there is no full GC during an iteration.

> To reproduce the measurements:
>
> Hardware:
> Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz, 1x4GB, 1x16GB @ DDR3 1600MHz,
> 2 cores with hyperthreading enabled
> OS/Kernel:
> Linux version 4.4.0-72-generic (buildd at lcy01-17) (gcc version 5.4.0
> 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #93-Ubuntu SMP
> Fri Mar 31 14:07:41 UTC 2017

Running on an i5-3320M at 2.60GHz with 8GB of RAM - I had to drop -Xmx to
6GB. Using java8-b92 vs. java9-b169. GC-wise (particularly CMS and
Parallel) there should not be much difference between these and yours.

Just checking: did you make sure that external factors that typically
make a mess of benchmarking on a laptop (but not limited to it), like
turbo/THP/etc., were disabled? Note that the four threads on these laptop
machines are not full cores; in the test on the website you seem to
specifically disable hyperthreading.

> git clone https://github.com/headissue/cache2k-benchmark.git
> cd cache2k-benchmark
> git checkout d68d7608f18ed6c5a10671f6dd3c48f76afdf0a8
> mvn -DskipTests package
> java -jar jmh-suite/target/benchmarks.jar \\.RandomSequenceBenchmark \
>   -jvmArgs -server\ -Xmx20G\ -XX:BiasedLockingStartupDelay=0\ -verbose:gc\ -XX:+PrintGCDetails \
>   -f 1 -wi 0 -i 10 -r 20s -t 4 \
>   -prof org.cache2k.benchmark.jmh.LinuxVmProfiler \
>   -p cacheFactory=org.cache2k.benchmark.Cache2kFactory \
>   -rf json -rff result.json
>
> When not on Linux, strip "-prof org.cache2k.benchmark.jmh.LinuxVmProfiler".
>
> I have the feeling this could be worth a closer look.
> If there are any questions or things I can help with, let me know.
>
> I would be interested to know whether there is something that I can
> change in the code to avoid triggering this behavior.

The benchmark creates a lot of references from old to young gen, which
impacts pause times and therefore scores a lot. So anything that reduces
these in the application will help (for all collectors, though G1
benefits the most).

Some tests indicated that what could help (and is specific to G1) is
early reclamation of large object arrays (of some hash table, I guess)
[3], if it were available - by doing so it seems G1 could remove a lot of
stale references (in the old array, apparently left over after expanding)
from the remembered sets that need to be updated all the time. So maybe
if the code did fewer expansions of that hash table, it would improve the
situation? Or it may be worth a try to clear out (null out) references in
the old array, if possible and not done yet.

Other than that, a lot of time is spent managing the remembered sets
concurrently (e.g. the time spent in add_reference() that you noticed).
There are multiple options to improve this: some you could try right
away, e.g. increasing the heap region size (even to 32M) or potentially
moving concurrent remembered set work into the pause (see the tuning
guide [0]); others need more work, e.g. the
rebuild-remembered-sets-on-the-fly idea [1] or the throughput remembered
set proposed earlier [2].
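The region size experiment is just a flag change, e.g. (the -Xmx value is
the one from your benchmark command line):

  java -XX:+UseG1GC -XX:G1HeapRegionSize=32m -Xmx20G ...

A side effect worth noting: with 32m regions, objects only become
humongous above 16m (half a region), so fewer of the caches' backing
arrays should end up allocated as humongous objects in the first place.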
Thanks,

  Thomas

[0] https://docs.oracle.com/javase/9/gctuning/garbage-first-garbage-collector-tuning.htm#JSGCT-GUID-70E3F150-B68E-4787-BBF1-F91315AC9AB9
[1] https://bugs.openjdk.java.net/browse/JDK-8180415
[2] http://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2016-November/019215.html
[3] https://bugs.openjdk.java.net/browse/JDK-8048180