Crash with Solr on CentOS 7.4

Ere Maijala ere.maijala at helsinki.fi
Thu Nov 30 10:58:14 UTC 2017


Aleksey Shipilev wrote on 28.11.2017 at 18:04:
> On 11/28/2017 04:47 PM, Ere Maijala wrote:
>> Thanks for the information. I'll update and add ShenandoahVerify to the config. Meanwhile here is
>> the hs_err log: <https://gist.github.com/EreMaijala/28c60588e5ffd2998eb8513e9c36ad0f>
> 
>  From a brief look, it seems that native GC thread SEGVed on accessing some native data structure. I
> suspect that is a taskqueue or GCThread itself. fastdebug build would give a telling stack-trace and
> maybe even a good assert message:
>    https://builds.shipilev.net/openjdk-shenandoah-jdk8-fastdebug/

I haven't had this happen again with the newer OpenJDK builds. Then 
again, it only happened once on one of three nodes to begin with, so 
it's probably quite rare.
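
In case it's useful, this is roughly what I'm planning to add to the 
Solr startup options once I switch to the fastdebug build (the install 
path is just an example, and I'm assuming ShenandoahVerify still needs 
the diagnostic options unlocked):

--------------
# fastdebug build path is a placeholder for wherever the build gets unpacked
JAVA_HOME=/opt/openjdk-shenandoah-jdk8-fastdebug
$JAVA_HOME/bin/java \
   -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC \
   -XX:+UnlockDiagnosticVMOptions -XX:+ShenandoahVerify \
   ... (rest of the existing Solr options)
--------------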

My results with Shenandoah are contrary to my expectations, though. 
With G1GC, full GC has been quite rare, but when it happens it's long 
enough to bring the SolrCloud cluster down. With Shenandoah, when doing 
some indexing work with Solr, I'm getting frequent Pause Fulls like this:

--------------
2017-11-29T18:47:08.141+0200: 31981.188: Total time for which application threads were stopped: 0.0002862 seconds, Stopping threads took: 0.0001059 seconds
Concurrent marking triggered. Free: 3485M, Free Threshold: 3491M; Allocated: 19343M, Alloc Threshold: 0M
2017-11-29T18:47:11.884+0200: 31984.931: [Pause Init Mark, 60.475 ms]
2017-11-29T18:47:11.945+0200: 31984.991: Total time for which application threads were stopped: 0.0607656 seconds, Stopping threads took: 0.0000775 seconds
2017-11-29T18:47:11.945+0200: 31984.991: [Concurrent marking2017-11-29T18:47:14.173+0200: 31987.220: Total time for which application threads were stopped: 0.0222083 seconds, Stopping threads took: 0.0220003 seconds
Cancelling concurrent GC: Allocation Failure
 26G->29G(31G), 3202.487 ms]
Adjusting free threshold to: 21% (6666M)
2017-11-29T18:47:15.147+0200: 31988.194: [Pause Final MarkAdaptive CSet selection: free target = 8253M, actual free = 13878M; min cset = 0M, max cset = 10408M
Total Garbage: 22774M
Immediate Garbage: 13878M, 1821 regions (61% of total)
Garbage to be collected: 8336M (36% of total), 1135 regions
Live objects to be evacuated: 317M
Live/garbage ratio in collected regions: 3%
Free: 0M, 0 regions (0% of total)
Cancelling concurrent GC: Out of memory for evacuation
, 2107.541 ms]
2017-11-29T18:47:17.255+0200: 31990.302: Total time for which application threads were stopped: 2.1078017 seconds, Stopping threads took: 0.0000444 seconds
Concurrent marking triggered. Free: 6M, Free Threshold: 6666M; Allocated: 11M, Alloc Threshold: 0M
Adjusting free threshold to: 41% (13015M)
2017-11-29T18:47:19.158+0200: 31992.205: [Pause Full 29G->3579M(31G), 10726.904 ms]
2017-11-29T18:47:29.885+0200: 32002.932: Total time for which application threads were stopped: 12.6193783 seconds, Stopping threads took: 0.0000324 seconds
Periodic GC triggered. Time since last GC: 305058 ms, Guaranteed Interval: 300000 ms
2017-11-29T18:47:29.895+0200: 32002.942: [Pause Init Mark, 59.263 ms]
--------------


The heap is quite large for this workload, but lowering it made the 
Pause Fulls even more frequent. The example above is from the old JVM 
from CentOS 7.4, but it also happens with a nightly build from Nov 29. 
I'd be happy to test with any additional GC or logging options that 
work with Java 8 to see if they affect the results, but I decided not 
to turn the tuning knobs blindly.
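
In case more data would help, this is the kind of logging I could 
enable for the next indexing run (standard HotSpot 8 flags as far as I 
know; some of them are probably already implied by the output above, 
and the log path is just an example):

--------------
-Xloggc:/var/solr/logs/shenandoah_gc.log \
-XX:+PrintGCDetails -XX:+PrintGCDateStamps \
-XX:+PrintGCApplicationStoppedTime \
-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
--------------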

--Ere

