Feedback from experiments on production application
Petr Janouch
janoupe at gmail.com
Tue May 8 22:44:06 UTC 2018
Ok here it goes:
First, briefly about the application:
It is a service that is part of an e-commerce platform and is tuned for high load (of course not Facebook scale ;)) and low latency. Its most latency-sensitive REST endpoint responds in single-digit milliseconds plus GC pauses. We managed to tune G1 for 20ms pauses. It is what I would consider a typical backend service (it might be a little big to call it a microservice) that provides a REST API and does some business logic based on data in its database. So there are relatively few application-scoped objects (services, configuration, …) and a lot of request-scoped objects (HTTP requests/responses with JSON bodies, Hibernate entities, intermediate business objects, …).
Here is the list of libraries it uses, extracted from the Gradle build file (just to show that it is not a monstrosity with many dependencies):
compile 'org.springframework.boot:spring-boot-starter-jetty'
compile 'org.springframework.boot:spring-boot-starter-logging'
compile 'com.paypal.springboot:resteasy-spring-boot-starter'
compile 'org.jboss.resteasy:resteasy-json-p-provider'
compile 'javax.inject:javax.inject'
compile 'org.springframework.boot:spring-boot-starter-data-jpa'
compile 'mysql:mysql-connector-java'
compile 'org.springframework.boot:spring-boot-starter-actuator'
compile 'org.springframework:spring-context-support'
compile 'com.fasterxml.jackson.datatype:jackson-datatype-jsr310'
compile 'net.ttddyy:datasource-proxy'
compile 'org.hdrhistogram:HdrHistogram'
compile 'org.infinispan:infinispan-core'
compile 'com.zaxxer:HikariCP'
compile 'org.glassfish:javax.json'
compile 'org.infinispan:infinispan-hibernate-cache'
We have some performance/stress tests that simulate the typical workload and measure latency percentiles. The observations below were made when running those tests with Shenandoah enabled.
The tests were performed on an AWS c4.4xlarge machine with Fedora and OpenJDK 64-Bit Server VM (build 25.162-b12, mixed mode).
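Roughly, the tests record and report latencies like this (a simplified sketch using the HdrHistogram dependency listed above, not the actual test code; the class name and value range are illustrative):

import org.HdrHistogram.Histogram;

public class LatencyRecorder {
    // Track latencies from 1 us up to 1 minute with 3 significant digits.
    private final Histogram histogram = new Histogram(60_000_000L, 3);

    public void record(long latencyMicros) {
        histogram.recordValue(latencyMicros);
    }

    public void report() {
        System.out.printf("p50=%dus p99=%dus p99.9=%dus max=%dus%n",
                histogram.getValueAtPercentile(50.0),
                histogram.getValueAtPercentile(99.0),
                histogram.getValueAtPercentile(99.9),
                histogram.getMaxValue());
    }
}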
Here is the biggest issue I experienced:
Even though most GC pauses were reasonably small, there was an occasional much larger pause, usually around 20ms. Here is an extract from the logs that shows a GC cycle with such a big pause:
Concurrent marking triggered. Free: 1221M, Free Threshold: 1228M; Allocated: 1221M, Alloc Threshold: 0M
Application time: 8.0970046 seconds
[Pause Init Mark, 2.985 ms]
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
1037.151: ShenandoahInitMark [ 1075 1 0 ] [ 0 0 0 1 3 ] 0
Total time for which application threads were stopped: 0.0054568 seconds, Stopping threads took: 0.0005413 seconds
[Concurrent marking 18G->19G(20G), 120.974 ms]
Application time: 0.1211539 seconds
[Pause Final MarkTotal Garbage: 19176M
Immediate Garbage: 17379M, 2173 regions (90% of total)
Garbage to be collected: 1758M (9% of total), 221 regions
Live objects to be evacuated: 9M
Live/garbage ratio in collected regions: 0%
19G->2366M(20G), 16.095 ms]
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
1037.278: ShenandoahFinalMarkStartEvac [ 1075 2 0 ] [ 0 0 0 1 16 ] 0
Total time for which application threads were stopped: 0.0185424 seconds, Stopping threads took: 0.0005058 seconds
[Concurrent evacuation 2370M->3559M(20G), 71.193 ms]
Application time: 0.0713514 seconds
[Pause Init Update Refs, 0.341 ms]
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
1037.368: ShenandoahInitUpdateRefs [ 1075 0 1 ] [ 0 0 0 1 0 ] 0
Total time for which application threads were stopped: 0.0029125 seconds, Stopping threads took: 0.0005305 seconds
[Concurrent update references 3562M->4722M(20G), 442.210 ms]
Application time: 0.4423174 seconds
[Pause Final Update Refs 4722M->2954M(20G), 4.855 ms]
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
1037.813: ShenandoahFinalUpdateRefs [ 1075 0 2 ] [ 0 0 0 1 4 ] 0
Total time for which application threads were stopped: 0.0070979 seconds, Stopping threads took: 0.0004547 seconds
[Concurrent reset bitmaps 2954M->2954M(20G), 1.096 ms]
Capacity: 20480M, Peak Occupancy: 19746M, Lowest Free: 733M, Free Threshold: 614M
Concurrent marking triggered. Free: 1211M, Free Threshold: 1228M; Allocated: 1211M, Alloc Threshold: 0M
Application time: 7.8430755 seconds
I nearly gave up on Shenandoah, because 20ms GC pauses are G1 territory for this application and workload. Luckily, I found ShenandoahUnloadClassesFrequency = 5 among the VM flags, turned on class unloading logging, and found out that the large GC pause really is associated with Shenandoah trying to unload classes during GC. Unloading classes is pointless for this application, so I just turned it off. I did not figure out how to turn it off completely, but after setting the unloading frequency to something infinity-like, no GC pause exceeded 10ms.
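For reference, the workaround amounts to something like this (the frequency value is just an arbitrarily large placeholder, -XX:+UnlockExperimentalVMOptions may or may not be needed depending on the build, and the class unloading logging I mention is the standard -XX:+TraceClassUnloading):

java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC \
     -XX:ShenandoahUnloadClassesFrequency=1000000000 \
     -XX:+TraceClassUnloading \
     ...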
I would suggest mentioning somewhere in the user guide that the class unloading feature can significantly prolong GC pauses. The GC logging could also mention that class unloading is going on (yes, you can see it with class unloading logging on, but most people don't usually have it on when trying to fix GC issues).
Here goes my second big issue:
The concurrent part is sometimes too short (who would think that could be a performance problem ;)), which makes the GC pauses too tightly spaced. Here is again an extract from the logs to illustrate this:
Concurrent marking triggered. Free: 4081M, Free Threshold: 4096M; Allocated: 4081M, Alloc Threshold: 0M
Application time: 7.1562731 seconds
[Pause Init Mark, 2.733 ms]
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
369.133: ShenandoahInitMark [ 1076 0 1 ] [ 0 0 0 1 2 ] 0
Total time for which application threads were stopped: 0.0052538 seconds, Stopping threads took: 0.0004846 seconds
[Concurrent marking 16G->16G(20G), 30.012 ms]
[Concurrent precleaning 16G->16G(20G), 1.419 ms]
Application time: 0.0328136 seconds
[Pause Final MarkTotal Garbage: 16310M
Immediate Garbage: 14771M, 1847 regions (90% of total)
Garbage to be collected: 1477M (9% of total), 186 regions
Live objects to be evacuated: 10M
Live/garbage ratio in collected regions: 0%
16G->1715M(20G), 4.473 ms]
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
369.172: ShenandoahFinalMarkStartEvac [ 1076 2 0 ] [ 0 0 0 1 4 ] 0
Total time for which application threads were stopped: 0.0072521 seconds, Stopping threads took: 0.0006079 seconds
[Concurrent evacuation 1715M->1871M(20G), 5.272 ms]
Application time: 0.0061683 seconds
[Pause Init Update Refs, 0.214 ms]
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
369.185: ShenandoahInitUpdateRefs [ 1076 0 1 ] [ 0 0 0 2 0 ] 0
Total time for which application threads were stopped: 0.0029071 seconds, Stopping threads took: 0.0004698 seconds
[Concurrent update references 1875M->2125M(20G), 72.610 ms]
Application time: 0.0727956 seconds
[Pause Final Update Refs 2126M->638M(20G), 1.635 ms]
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
369.261: ShenandoahFinalUpdateRefs [ 1076 3 0 ] [ 0 0 0 2 1 ] 0
Total time for which application threads were stopped: 0.0043687 seconds, Stopping threads took: 0.0005341 seconds
[Concurrent reset bitmaps 638M->639M(20G), 0.526 ms]
Capacity: 20480M, Peak Occupancy: 16486M, Lowest Free: 3993M, Free Threshold: 614M
In the sample above, ShenandoahFinalMarkStartEvac and ShenandoahInitUpdateRefs are extremely close together, and some unlucky requests can experience both pauses during their processing. Moreover, if a request took 50ms to process, which I would still consider low latency, it could experience three pauses during its processing. This issue mostly went away after setting ConcGCThreads = 1, but even with only one thread doing the concurrent work, the GC pauses are sometimes too close together. GC cycles are many seconds apart even under the highest load, so it seems like a waste if a request can experience multiple GC pauses, which to an observer has the same effect as one larger pause.
Have you considered providing a configuration option along the lines of a “target min STW distance”?
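To illustrate what I mean, here is a purely hypothetical sketch of such a pacing heuristic; nothing like this exists in Shenandoah, and all names here are made up:

final class PausePacer {
    private final long minPauseDistanceNanos;
    private long lastPauseEndNanos;

    PausePacer(long minPauseDistanceMillis) {
        this.minPauseDistanceNanos = minPauseDistanceMillis * 1_000_000L;
    }

    // When the next STW pause is due, report how long the GC should keep
    // running concurrently (or idle) before actually stopping the world.
    long delayBeforeNextPauseNanos(long nowNanos) {
        long sinceLastPause = nowNanos - lastPauseEndNanos;
        return Math.max(0L, minPauseDistanceNanos - sinceLastPause);
    }

    void recordPauseEnd(long nowNanos) {
        lastPauseEndNanos = nowNanos;
    }
}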
With ShenandoahUnloadClassesFrequency = <something really large> and ConcGCThreads = 1, the performance was really good. I did not notice any significant drop in throughput, and the request tail latency was much better than with G1. As I wrote above, a typical GC pause for G1 was 20ms. With Shenandoah, all pauses were below 10ms, with some very close to the 10ms mark.
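Putting it together, the configuration was roughly the following (the heap size is inferred from the Capacity: 20480M line in the logs, the frequency value is an arbitrarily large placeholder, and the jar name is obviously made up):

java -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC \
     -Xms20g -Xmx20g \
     -XX:ShenandoahUnloadClassesFrequency=1000000000 \
     -XX:ConcGCThreads=1 \
     -jar service.jar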
Here are the statistics, in case you find them helpful:
Cancelling concurrent GC: Stopping VM
Heap
Shenandoah Heap
20971520K total, 20406272K committed, 11820035K used
2560 x 8192K regions
Status: idle cancelled
Reserved region:
- [0x00000002c0000000, 0x00000007c0000000)
GC STATISTICS:
"(G)" (gross) pauses include VM time: time to notify and block threads, do the pre-
and post-safepoint housekeeping. Use -XX:+PrintSafepointStatistics to dissect.
"(N)" (net) pauses are the times spent in the actual GC code.
"a" is average time for each phase, look at levels to see if average makes sense.
"lvls" are quantiles: 0% (minimum), 25%, 50% (median), 75%, 100% (maximum).
Total Pauses (G) = 3.09 s (a = 8148 us) (n = 379) (lvls, us = 2383, 4570, 5566, 7559, 396598)
Total Pauses (N) = 3.42 s (a = 8977 us) (n = 381) (lvls, us = 188, 1465, 2930, 4121, 1492103)
Pause Init Mark (G) = 1.02 s (a = 7423 us) (n = 137) (lvls, us = 5078, 5430, 5605, 5781, 135628)
Pause Init Mark (N) = 0.57 s (a = 4148 us) (n = 137) (lvls, us = 2656, 2949, 3086, 3145, 122749)
Accumulate Stats = 0.03 s (a = 248 us) (n = 137) (lvls, us = 213, 236, 240, 250, 345)
Make Parsable = 0.04 s (a = 311 us) (n = 137) (lvls, us = 115, 303, 309, 328, 426)
Clear Liveness = 0.02 s (a = 116 us) (n = 137) (lvls, us = 67, 113, 115, 117, 133)
Scan Roots = 0.44 s (a = 3212 us) (n = 137) (lvls, us = 1758, 2012, 2148, 2207, 121455)
S: Thread Roots = 0.12 s (a = 862 us) (n = 137) (lvls, us = 297, 350, 367, 391, 41679)
S: String Table Roots = 0.09 s (a = 661 us) (n = 137) (lvls, us = 395, 402, 410, 428, 29442)
S: Universe Roots = 0.00 s (a = 2 us) (n = 137) (lvls, us = 1, 2, 2, 2, 22)
S: JNI Roots = 0.00 s (a = 16 us) (n = 137) (lvls, us = 10, 11, 14, 16, 172)
S: JNI Weak Roots = 0.01 s (a = 60 us) (n = 137) (lvls, us = 0, 35, 54, 60, 1077)
S: Synchronizer Roots = 0.00 s (a = 34 us) (n = 137) (lvls, us = 25, 29, 31, 33, 74)
S: Flat Profiler Roots = 0.01 s (a = 60 us) (n = 137) (lvls, us = 40, 50, 57, 65, 154)
S: Management Roots = 0.00 s (a = 2 us) (n = 137) (lvls, us = 1, 1, 2, 2, 4)
S: System Dict Roots = 0.01 s (a = 69 us) (n = 137) (lvls, us = 16, 20, 25, 29, 5703)
S: CLDG Roots = 0.07 s (a = 483 us) (n = 137) (lvls, us = 195, 223, 238, 248, 30918)
S: JVMTI Roots = 0.00 s (a = 1 us) (n = 137) (lvls, us = 0, 1, 1, 1, 1)
Resize TLABs = 0.02 s (a = 176 us) (n = 137) (lvls, us = 141, 164, 168, 178, 343)
Pause Final Mark (G) = 1.17 s (a = 8560 us) (n = 137) (lvls, us = 3906, 4961, 7559, 9141, 30611)
Pause Final Mark (N) = 0.56 s (a = 4117 us) (n = 137) (lvls, us = 1133, 1406, 4316, 4980, 21240)
Update Roots = 0.04 s (a = 1945 us) (n = 22) (lvls, us = 1348, 1484, 1973, 2305, 2841)
U: Thread Roots = 0.02 s (a = 903 us) (n = 22) (lvls, us = 312, 326, 984, 1328, 1814)
U: String Table Roots = 0.01 s (a = 264 us) (n = 22) (lvls, us = 236, 252, 260, 266, 299)
U: Universe Roots = 0.00 s (a = 1 us) (n = 22) (lvls, us = 1, 1, 1, 1, 1)
U: JNI Roots = 0.00 s (a = 11 us) (n = 22) (lvls, us = 8, 9, 10, 12, 17)
U: JNI Weak Roots = 0.00 s (a = 21 us) (n = 22) (lvls, us = 10, 15, 21, 24, 36)
U: Synchronizer Roots = 0.00 s (a = 53 us) (n = 22) (lvls, us = 38, 46, 56, 59, 74)
U: Flat Profiler Roots = 0.00 s (a = 59 us) (n = 22) (lvls, us = 47, 51, 52, 65, 81)
U: Management Roots = 0.00 s (a = 2 us) (n = 22) (lvls, us = 1, 2, 2, 2, 2)
U: System Dict Roots = 0.00 s (a = 16 us) (n = 22) (lvls, us = 11, 14, 16, 17, 20)
U: CLDG Roots = 0.00 s (a = 161 us) (n = 22) (lvls, us = 152, 154, 156, 164, 177)
U: JVMTI Roots = 0.00 s (a = 1 us) (n = 22) (lvls, us = 1, 1, 1, 1, 1)
Finish Queues = 0.13 s (a = 981 us) (n = 137) (lvls, us = 457, 643, 746, 832, 13097)
Weak References = 0.03 s (a = 1198 us) (n = 27) (lvls, us = 303, 375, 426, 2129, 3010)
Process = 0.02 s (a = 867 us) (n = 27) (lvls, us = 40, 45, 79, 1797, 2590)
Enqueue = 0.01 s (a = 327 us) (n = 27) (lvls, us = 248, 287, 312, 352, 419)
Prepare Evacuation = 0.07 s (a = 531 us) (n = 137) (lvls, us = 236, 514, 547, 566, 777)
Initial Evacuation = 0.26 s (a = 3468 us) (n = 76) (lvls, us = 2793, 3086, 3145, 3379, 8814)
E: Thread Roots = 0.05 s (a = 646 us) (n = 76) (lvls, us = 295, 354, 400, 523, 2960)
E: Code Cache Roots = 0.13 s (a = 1729 us) (n = 76) (lvls, us = 1660, 1680, 1699, 1719, 1908)
Pause Init Update Refs (G) = 0.19 s (a = 3553 us) (n = 53) (lvls, us = 2383, 2812, 3086, 3457, 12040)
Pause Init Update Refs (N) = 0.01 s (a = 222 us) (n = 53) (lvls, us = 188, 201, 215, 227, 334)
Pause Final Update Refs (G) = 0.71 s (a = 13643 us) (n = 52) (lvls, us = 2695, 4355, 4590, 5039, 396594)
Pause Final Update Refs (N) = 0.54 s (a = 10463 us) (n = 52) (lvls, us = 1465, 1582, 1680, 1719, 393407)
Finish Work = 0.45 s (a = 151299 us) (n = 3) (lvls, us = 5195, 5195, 5195, 57031, 391500)
Update Roots = 0.08 s (a = 1543 us) (n = 52) (lvls, us = 1328, 1426, 1504, 1543, 2426)
UR: Thread Roots = 0.02 s (a = 413 us) (n = 52) (lvls, us = 268, 330, 354, 402, 1360)
UR: String Table Roots = 0.01 s (a = 266 us) (n = 52) (lvls, us = 229, 252, 264, 277, 316)
UR: Universe Roots = 0.00 s (a = 1 us) (n = 52) (lvls, us = 1, 1, 1, 1, 2)
UR: JNI Roots = 0.00 s (a = 12 us) (n = 52) (lvls, us = 8, 9, 11, 13, 41)
UR: JNI Weak Roots = 0.00 s (a = 24 us) (n = 52) (lvls, us = 15, 20, 22, 24, 58)
UR: Synchronizer Roots = 0.00 s (a = 39 us) (n = 52) (lvls, us = 27, 33, 34, 36, 115)
UR: Flat Profiler Roots = 0.00 s (a = 59 us) (n = 52) (lvls, us = 43, 52, 61, 64, 87)
UR: Management Roots = 0.00 s (a = 2 us) (n = 52) (lvls, us = 1, 1, 2, 2, 3)
UR: System Dict Roots = 0.00 s (a = 16 us) (n = 52) (lvls, us = 11, 14, 16, 17, 42)
UR: CLDG Roots = 0.01 s (a = 160 us) (n = 52) (lvls, us = 146, 154, 158, 162, 191)
UR: JVMTI Roots = 0.00 s (a = 1 us) (n = 52) (lvls, us = 1, 1, 1, 1, 1)
Recycle = 0.01 s (a = 144 us) (n = 52) (lvls, us = 98, 111, 117, 152, 258)
Pause Full GC = 1.73 s (a = 865870 us) (n = 2) (lvls, us = 238281, 238281, 238281, 238281, 1492101)
Heap Dumps = 0.00 s (a = 1 us) (n = 4) (lvls, us = 1, 1, 1, 1, 1)
Prepare = 0.00 s (a = 497 us) (n = 2) (lvls, us = 475, 475, 475, 475, 517)
Roots = 0.07 s (a = 17314 us) (n = 4) (lvls, us = 3770, 3770, 4785, 6035, 54657)
F: Thread Roots = 0.03 s (a = 5217 us) (n = 6) (lvls, us = 1562, 1562, 2246, 2324, 21039)
F: Code Cache Roots = 0.01 s (a = 1133 us) (n = 6) (lvls, us = 0, 0, 1660, 1680, 1740)
F: String Table Roots = 0.00 s (a = 252 us) (n = 6) (lvls, us = 0, 0, 299, 363, 467)
F: Universe Roots = 0.00 s (a = 2 us) (n = 6) (lvls, us = 1, 1, 2, 2, 3)
F: JNI Roots = 0.00 s (a = 18 us) (n = 6) (lvls, us = 11, 11, 14, 16, 35)
F: JNI Weak Roots = 0.00 s (a = 61 us) (n = 6) (lvls, us = 0, 0, 23, 99, 135)
F: Synchronizer Roots = 0.00 s (a = 64 us) (n = 6) (lvls, us = 51, 51, 63, 70, 72)
F: Flat Profiler Roots = 0.00 s (a = 82 us) (n = 6) (lvls, us = 57, 57, 78, 78, 118)
F: Management Roots = 0.00 s (a = 2 us) (n = 6) (lvls, us = 1, 1, 2, 2, 4)
F: System Dict Roots = 0.00 s (a = 402 us) (n = 6) (lvls, us = 19, 19, 32, 35, 1508)
F: CLDG Roots = 0.00 s (a = 529 us) (n = 6) (lvls, us = 89, 89, 193, 238, 1998)
F: JVMTI Roots = 0.00 s (a = 1 us) (n = 6) (lvls, us = 1, 1, 1, 1, 1)
Mark = 0.30 s (a = 148858 us) (n = 2) (lvls, us = 68359, 68359, 68359, 68359, 229302)
Finish Queues = 0.17 s (a = 85505 us) (n = 2) (lvls, us = 39648, 39648, 39648, 39648, 131294)
Weak References = 0.01 s (a = 6880 us) (n = 2) (lvls, us = 2598, 2598, 2598, 2598, 11155)
Process = 0.01 s (a = 6429 us) (n = 2) (lvls, us = 2227, 2227, 2227, 2227, 10627)
Enqueue = 0.00 s (a = 445 us) (n = 2) (lvls, us = 367, 367, 367, 367, 521)
System Purge = 0.04 s (a = 21831 us) (n = 2) (lvls, us = 17383, 17383, 17383, 17383, 26143)
Unload Classes = 0.02 s (a = 10718 us) (n = 2) (lvls, us = 7988, 7988, 7988, 7988, 13429)
Parallel Cleanup = 0.02 s (a = 9692 us) (n = 2) (lvls, us = 9473, 9473, 9473, 9473, 9900)
Code Cache = 0.01 s (a = 4005 us) (n = 2) (lvls, us = 3867, 3867, 3867, 3867, 4141)
String/Symbol Tables = 0.00 s (a = 2383 us) (n = 2) (lvls, us = 2305, 2305, 2305, 2305, 2450)
Clean Classes = 0.01 s (a = 3125 us) (n = 2) (lvls, us = 3105, 3105, 3105, 3105, 3128)
CLDG = 0.00 s (a = 1413 us) (n = 2) (lvls, us = 19, 19, 19, 19, 2806)
Calculate Addresses = 0.03 s (a = 15912 us) (n = 2) (lvls, us = 15430, 15430, 15430, 15430, 16284)
Adjust Pointers = 0.07 s (a = 33302 us) (n = 2) (lvls, us = 31641, 31641, 31641, 31641, 34817)
Copy Objects = 1.33 s (a = 666542 us) (n = 2) (lvls, us = 121094, 121094, 121094, 121094, 1210432)
Resize TLABs = 0.00 s (a = 481 us) (n = 2) (lvls, us = 467, 467, 467, 467, 495)
Concurrent Marking = 15.49 s (a = 113039 us) (n = 137) (lvls, us = 13672, 52344, 63867, 89062, 661729)
Concurrent Precleaning = 0.05 s (a = 1892 us) (n = 27) (lvls, us = 1367, 1602, 1699, 2031, 3704)
Concurrent Evacuation = 1.51 s (a = 19892 us) (n = 76) (lvls, us = 2578, 4395, 5898, 10352, 315417)
Concurrent Update Refs = 4.93 s (a = 93068 us) (n = 53) (lvls, us = 8594, 68555, 86523, 94922, 675641)
Concurrent Reset Bitmaps = 2.43 s (a = 18021 us) (n = 135) (lvls, us = 439, 672, 1152, 1758, 505395)
2 allocation failure and 0 user requested GCs
135 successful and 2 degenerated concurrent markings
110 successful and 3 degenerated update references
ALLOCATION TRACING
These are the slow-path allocations, including TLAB/GCLAB refills, and out-of-TLAB allocations.
In-TLAB/GCLAB allocations happen orders of magnitude more frequently, and without delays.
Allocation tracing is disabled, use -XX:+ShenandoahAllocationTrace to enable.
Application time: 0.0627666 seconds
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
1072.992: Exit [ 22 0 0 ] [ 0 0 0 0 10 ] 0
Polling page always armed
ForceSafepoint 13
Deoptimize 61
CollectForMetadataAllocation 861
ShenandoahFullGC 2
ShenandoahInitMark 137
ShenandoahFinalMarkStartEvac 137
ShenandoahInitUpdateRefs 53
ShenandoahFinalUpdateRefs 51
Exit 1
2 VM operations coalesced during safepoint
Maximum sync time 14 ms
Maximum vm operation time (except for Exit VM operation) 1492 ms
I hope the report did not come out too negative, because I have a very positive feeling about Shenandoah and I will definitely use it in the future.
If you need more information, don’t hesitate to contact me. I cannot give you the source code of the application, but I can easily rerun the workload with different settings or logging.
Thanks
Petr
> On 8 May 2018, at 22:24, Roman Kennke <rkennke at redhat.com> wrote:
>
> Am 08.05.2018 um 22:01 schrieb Petr Janouch:
>> Hi everyone,
>>
>> I have done some experiments with Shenandoah on our production application (unfortunately not in production). I did not see any explicit bug, but my experience wasn’t smooth either. Before I clog this mailing list with my report ;), I would like to ask if you are interested in my feedback and if this mailing list is the right place to give it.
>>
>
> Yes, and yes. Of course!
>
> Thanks in advance!
> Roman
>
>