Degenerated GC pauses for 5-10 seconds

Fri Jan 11 14:03:08 UTC 2019

*Setup: *
- openjdk-shenandoah-jdk8-latest-linux-x86_64-release from
builds.shipilev.net, downloaded on 2018-12-31.
- -Xmx55g -Xms55g -XX:+UseShenandoahGC -XX:+AlwaysPreTouch
  // Also, not sure if relevant:
  -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

*Problem:*
Shenandoah does an excellent job dealing with ~4GB/sec allocation rate on
a 55GB heap staying under 100ms max pause time and ~500ms max cycle time.
However,  on particularly nasty spikes, it drops into the degenerated mode
which causes a 5-10 second STW pause. So far I observed that three times,
each with a huge pause. There doesn't seem to be other "normal"
degenerated cycles, so I assume every degenerate cycle causes a pause like
that.

This is how it looks in the GC log:

Trigger: Free (5631M) is below minimum threshold (5632M)
Free: 5631M (460 regions), Max regular: 16384K, Max humongous: 425984K,
External frag: 93%, Internal frag: 23%
Evacuation Reserve: 2816M (176 regions), Max regular: 16384K
2019-01-08T16:47:04.358+0000: 340115.832: [Concurrent resetUsing 25 of 25
workers for concurrent reset
Failed to allocate 2048K
Cancelling GC: Allocation Failure
 46G->52G(55G), 11069.097 ms]
2019-01-08T16:47:15.433+0000: 340126.907: [Pause Init Mark (process
weakrefs)Using 25 of 25 workers for init marking
Pacer for Mark. Expected Live: 1917M, Free: 0M, Non-Taxable: 0M, Alloc Tax
Rate: infx
, 4.328 ms]
2019-01-08T16:47:15.437+0000: 340126.911: [Concurrent marking (process
weakrefs)Using 25 of 25 workers for concurrent marking
 52G->52G(55G), 6.329 ms]
Free: 0M (0 regions), Max regular: 0K, Max humongous: 0K, External frag:
0%, Internal frag: 0%
Evacuation Reserve: 2816M (176 regions), Max regular: 16384K
Pacer for Idle. Initial: 1126M, Alloc Tax Rate: 1.0x
Trigger: Handle Allocation Failure
Free: 0M (0 regions), Max regular: 0K, Max humongous: 0K, External frag:
0%, Internal frag: 0%
Evacuation Reserve: 2816M (176 regions), Max regular: 16384K
2019-01-08T16:47:15.449+0000: 340126.923: [Pause Degenerated GC (Mark)Using
25 of 25 workers for stw degenerated gc
Adaptive CSet Selection. Target Free: 5632M, Actual Free: 2864M, Max CSet:
2346M, Min Garbage: 2768M
Collectable Garbage: 50937M (99% of total), 1644M CSet, 3297 CSet regions
Immediate Garbage: 48M (0% of total), 3 regions
Pacer for Evacuation. Used CSet: 52582M, Free: 48M, Non-Taxable: 4M, Alloc
Tax Rate: 2644.4x
Pacer for Update Refs. Used: 54960M, Free: 48M, Non-Taxable: 4M, Alloc Tax
Rate: 1382.0x
Used: before: 53334M, after: 2377M
Internal frag: before: 0.3%, after: 0.2%
External frag: before: 75.4%, after: 14.8%
 52G->2377M(55G), 310.040 ms]
Free: 51124M (3197 regions), Max regular: 16384K, Max humongous: 47071232K,
External frag: 11%, Internal frag: 0%
Evacuation Reserve: 2816M (176 regions), Max regular: 16384K
Pacer for Idle. Initial: 1126M, Alloc Tax Rate: 1.0x

The line in question is *Cancelling GC: Allocation Failure*
* 46G->52G(55G), 11069.097 ms]*. If I understand degenerated GC mode
correctly, it should STW and pick up from where the concurrent cycle left
off. That, presumably, means that the STW pause shouldn't take longer than
the whole cycle time (in my case, ~500ms on average). Where does this huge
pause come from? Is it a bug or an expected behavior I misunderstood? In
either case, the GC log seems a bit light on the details.

*Links:*
- Excerpt from log 1:
https://gist.github.com/alexander-yakushev/8bffd97bac440328f0e5e5a2b33b6793
- Excerpt from log 2:
https://gist.github.com/alexander-yakushev/549a1a600f53ed61249b8ae8184a3971
- 20Mb GC file 1 (for bigger context): https://ufile.io/ka6lr. Search for
"Allocation Failure" to find the spot.

Would be happy to provide more data if necessary. Thanks for your amazing
work!

Best regards,
Alex Yakushev