Crash with Solr on CentOS 7.4

Aleksey Shipilev shade at redhat.com
Thu Nov 30 11:08:44 UTC 2017


On 11/30/2017 11:58 AM, Ere Maijala wrote:
> Aleksey Shipilev wrote on 28.11.2017 at 18.04:
>> On 11/28/2017 04:47 PM, Ere Maijala wrote:
> With Shenandoah, when doing some indexing work with Solr, I'm getting frequent Pause Fulls like this:
> 
> --------------
> 2017-11-29T18:47:08.141+0200: 31981.188: Total time for which application threads were stopped:
> 0.0002862 seconds, Stopping threads took: 0.0001059 seconds
> Concurrent marking triggered. Free: 3485M, Free Threshold: 3491M; Allocated: 19343M, Alloc
> Threshold: 0M
> 2017-11-29T18:47:11.884+0200: 31984.931: [Pause Init Mark, 60.475 ms]
> 2017-11-29T18:47:11.945+0200: 31984.991: Total time for which application threads were stopped:
> 0.0607656 seconds, Stopping threads took: 0.0000775 seconds
> 2017-11-29T18:47:11.945+0200: 31984.991: [Concurrent marking2017-11-29T18:47:14.173+0200: 31987.220:
> Total time for which application threads were stopped: 0.0222083 seconds, Stopping threads took:
> 0.0220003 seconds
> Cancelling concurrent GC: Allocation Failure
>  26G->29G(31G), 3202.487 ms]
> Adjusting free threshold to: 21% (6666M)
> 2017-11-29T18:47:15.147+0200: 31988.194: [Pause Final MarkAdaptive CSet selection: free target =
> 8253M, actual free = 13878M; min cset = 0M, max cset = 10408M
> Total Garbage: 22774M
> Immediate Garbage: 13878M, 1821 regions (61% of total)
> Garbage to be collected: 8336M (36% of total), 1135 regions
> Live objects to be evacuated: 317M
> Live/garbage ratio in collected regions: 3%
> Free: 0M, 0 regions (0% of total)
> Cancelling concurrent GC: Out of memory for evacuation
> , 2107.541 ms]
> 2017-11-29T18:47:17.255+0200: 31990.302: Total time for which application threads were stopped:
> 2.1078017 seconds, Stopping threads took: 0.0000444 seconds
> Concurrent marking triggered. Free: 6M, Free Threshold: 6666M; Allocated: 11M, Alloc Threshold: 0M
> Adjusting free threshold to: 41% (13015M)
> 2017-11-29T18:47:19.158+0200: 31992.205: [Pause Full 29G->3579M(31G), 10726.904 ms]
> 2017-11-29T18:47:29.885+0200: 32002.932: Total time for which application threads were stopped:
> 12.6193783 seconds, Stopping threads took: 0.0000324 seconds
> Periodic GC triggered. Time since last GC: 305058 ms, Guaranteed Interval: 300000 ms
> 2017-11-29T18:47:29.895+0200: 32002.942: [Pause Init Mark, 59.263 ms]
> --------------
> 
> 
> The heap is quite large for this workload, but lowering it caused the Pause Fulls to become even
> more frequent. 

This seems to be caused by an overloaded heap: concurrent mark started at 26G and ran into trouble
at 29G, so Final Mark degraded to STW and cleaned up some space, but then we ran out of memory
during evacuation. The amount of live data is still low, about 3500M (after the Full GC), so this
must be the heuristics misbehaving. You can give the collector more headroom by increasing
-XX:ShenandoahMinFreeThreshold: it defaults to 3% of the heap, and you probably need 10-20% to
absorb allocations while the concurrent cycle is running.
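
For example, with the 31G heap from your log, something along these lines (the 15% value is only a
starting point, not a recommendation; tune it against your allocation rate):

--------------
java -XX:+UnlockExperimentalVMOptions \
     -XX:+UseShenandoahGC \
     -XX:ShenandoahMinFreeThreshold=15 \
     -Xms31g -Xmx31g \
     ...
--------------

A higher minimum free threshold makes the heuristics start the concurrent cycle earlier, trading
slightly more frequent cycles for a lower chance of failing the cycle and falling back to Pause
Full.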

We have plans to mitigate these problems, but right now you might need some tuning.

-Aleksey



More information about the shenandoah-dev mailing list