SIGSEGV on PhaseIdealLoop::split_up?

Uwe Schindler uschindler at apache.org
Wed Jan 30 14:21:58 UTC 2019


Hi,

> > A reproducer would be very nice. Did you try to reproduce with Replay
> Compilation?
> 
> I haven't tried to reproduce it, but it's popping up quite a bit
> recently, see here for a backlog:
> 
> https://lucene.markmail.org/search/%22jenkins+server%22+PhaseIdealLoop::split_up+list:org.apache.lucene.java-dev+order:date-backward
> 
> For example this one
> https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/3472/
> 
> is:
> 
>   [junit4] # JRE version: OpenJDK Runtime Environment (11.0+28) (build
> 11+28)
>   [junit4] # Java VM: OpenJDK 64-Bit Server VM (11+28, mixed mode,
> tiered, g1 gc, linux-amd64)
> 
> Some of those builds are still on the server (and contain hs logs).
> What worries me is that this only happens on Uwe's machine -- may be
> related to particular hardware config it happens on.

I don't think there is a hardware fault. The machine is quite stable, and it's also running virtual machines (Oracle VirtualBox) with Windows, MacOSX, and Solaris to test Lucene on those platforms as well. If there were a hardware issue, this would hardly work correctly. But the crashes we see don't happen inside the virtual machines, which hide some advanced CPU features, so maybe that is what is special here: the problem could be triggered by some CPU feature of this machine that is not used in the VirtualBox guests that also run the tests.

Another important point: this machine is the only Lucene test machine that checks recent JDK versions. The other Jenkins machines only run JDK 8 (the minimum requirement of Lucene/Solr). From the statistics: the recent failures don't happen with Java 8 or Java 9; they started with Java 10 (we only see the bug on runs using Java 10 or later).
 
> A repro isn't going to be easy (are they ever? ;) as those tests run
> pretty much at random within a single forked JVM and I bet it's just
> some unusual pattern that triggers the problem. Looking at where the
> problem occurs it seems there is a common core related to compiling
> this method:
> 
> Current CompileTask:
> C2:1534619 50541  s!   4
> org.apache.lucene.index.ConcurrentMergeScheduler::merge (280 bytes)

The environment is also easy to reproduce, because the exact JDK version and test parameters are printed at the beginning of the build log. For the example mentioned before:

https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/3472/consoleText

-print-java-info:
[java-info] java version "11"
[java-info] OpenJDK Runtime Environment (11+28, Oracle Corporation)
[java-info] OpenJDK 64-Bit Server VM (11+28, Oracle Corporation)
[java-info] Test args: [-XX:-UseCompressedOops -XX:+UseG1GC]

I think that's all that is needed to reproduce. The test args are additional JVM options we use when running the test suite (here we disable compressed oops and use G1GC). To run the test suite with them, you can pass them on ant's command line.
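
When trying to reproduce, it may help to confirm that the JVM under test really runs with the version and options from the build log. The following is only a minimal sketch, not part of the Lucene build: the class name ReproEnvCheck and the helper missingFlags are made up for illustration, and the expected flags are simply copied from the "Test args" line above. It only uses standard JDK calls (System.getProperty and RuntimeMXBean.getInputArguments).

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: prints the JVM identity and
// reports which of the expected "Test args" were NOT passed to this JVM.
public class ReproEnvCheck {
    // Copied from the "[java-info] Test args" line of the build log.
    static final List<String> EXPECTED =
        Arrays.asList("-XX:-UseCompressedOops", "-XX:+UseG1GC");

    // Returns the expected flags that are absent from the actual JVM arguments.
    static List<String> missingFlags(List<String> actual, List<String> expected) {
        List<String> missing = new ArrayList<>(expected);
        missing.removeAll(actual);
        return missing;
    }

    public static void main(String[] args) {
        System.out.println("java.runtime.version = "
            + System.getProperty("java.runtime.version"));
        System.out.println("java.vm.name = " + System.getProperty("java.vm.name"));

        // Arguments given on the JVM command line (not defaults chosen by the VM).
        List<String> actual = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM args = " + actual);
        System.out.println("missing expected flags = " + missingFlags(actual, EXPECTED));
    }
}
```

Run it with the same JDK and options as the failing build, for example java -XX:-UseCompressedOops -XX:+UseG1GC ReproEnvCheck. Note that it only sees flags given explicitly on the command line, so a flag that is merely the VM default would still be listed as missing.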

> The path leading to it may differ (when you diff those different
> hs_err logs against each other), but it seems to be caused by merge
> compilation in all cases I looked at.
> 
> I can monitor this and attach new logs to the Jira issue
> (LUCENE-8668). Uwe will be at Fosdem so I'm sure he'll be ready to
> figure it out together with you, should you be there.
> 
> Dawid
> 
> 
> On Wed, Jan 30, 2019 at 11:22 AM Nils Eliasson <nils.eliasson at oracle.com>
> wrote:
> >
> > Sorry, too fast. You had already tested on various builds.
> >
> > Regards,
> >
> > Nils
> >
> > On 2019-01-30 10:57, Nils Eliasson wrote:
> > > Hi Dawid,
> > >
> > > The hs_err-file is from a JDK 10 build. Would you mind testing with
> > > JDK 11 or JDK 12-ea?
> > >
> > > What build of Lucene was this run against? Can point me to the
> > > relevant jar? I will try reproducing with 7.6.0.
> > >
> > > Regards,
> > >
> > > Nils
> > >
> > > On 2019-01-30 10:27, Dawid Weiss wrote:
> > >> Hello,
> > >>
> > >> There's quite a few of those JVM errors that popped up recently on one
> > >> of Lucene's CI machines:
> > >>
> > >> https://issues.apache.org/jira/browse/LUCENE-8668
> > >>
> > >> Happens on various JVMs (see the above issue). Would it be something
> > >> familiar to any of you? A known issue or should we try to keep digging
> > >> (for a repro, for example)?
> > >>
> > >> Dawid


