G1GC/JIT compilation bug hunt.

Mon Aug 19 14:45:53 PDT 2013

On Aug 19, 2013, at 1:29 PM, Dawid Weiss <dawid.weiss at gmail.com> wrote:

> Thanks Vladimir, the tiered compilation hint was really very useful. I
> managed to reproduce this error on a 1.8 fastbuild and I can dump
> pretty much anything I want. But I still cannot figure out what's
> wrong -- it's beyond me. Here are the things I've tried per your
> suggestions
> 
> - passing -XX:-OptimizePtrCompare or -XX:-EliminateAutoBox
> doesn't help (the error is still present),
> - I do have a -Xbatch with a single compiler thread -- it was more
> difficult to hit an error seed (the tests are randomized) but it's
> still possible.
> 
> I know it would be best to provide a stand-alone reproduction package but
> it's not trivial given how complex the Lucene tests and testing framework
> are. Can I provide you with anything that would be helpful besides the
> above? :) Specifically I can:
> 
> 1) dump an opto assembly from a failed and non-failed run
> 2) provide an opto assembly from a g1gc run vs. any other gc (which
> doesn't seem to exhibit the problem),
> 3) provide a -XX:+PrintCompilation -XX:+PrintCompilation2 or a verbose
> hotspot.log.
> 
> Let me know if any of these (or anything else) would be useful. If
> not, I'll try to extract a stand-alone package that would reproduce
> the issue, although this is a real killer to pull off.

How many total compiles are there?  Maybe you can try to limit the methods being compiled with:

-XX:CompileCommand=compileonly,foo::bar

or

-XX:CompileCommand=exclude,foo::bar

One problem I see though is that using compileonly also prevents inlining which might hide the issue.
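
As a rough sketch of the two variants (not a tested recipe): the class and method names below are placeholders, since the actual suspect method is not known yet, and -XX:+UseG1GC and -XX:-TieredCompilation are added only because the failure was seen with G1 and C2. The script prints the command lines instead of running them.

```shell
#!/bin/sh
# Hypothetical names: org.example.Foo::bar stands in for the suspect method,
# org.example.TestMain for whatever class launches the failing test.
SUSPECT='org.example.Foo::bar'
MAIN='org.example.TestMain'

# Flags taken from the thread: C2 only (tiered off), G1 collector.
COMMON='-XX:-TieredCompilation -XX:+UseG1GC'

# Variant A: compile only the suspect method; everything else is interpreted.
ONLY_CMD="java $COMMON -XX:CompileCommand=compileonly,$SUSPECT $MAIN"

# Variant B: compile everything except the suspect method.
EXCLUDE_CMD="java $COMMON -XX:CompileCommand=exclude,$SUSPECT $MAIN"

# Print rather than execute, because the names above are placeholders.
echo "$ONLY_CMD"
echo "$EXCLUDE_CMD"
```

If the failure disappears under variant B but not when some other method is excluded, that method (or something inlined into it) becomes the prime suspect.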

-- Chris

> 
> Dawid
> 
> 
> On Fri, Aug 16, 2013 at 9:50 AM, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>> On 8/15/13 11:45 PM, Dawid Weiss wrote:
>>>> 
>>>> It is, with high probability, a compiler problem.
>>> 
>>> 
>>> I believe so. I've re-run the tests with 1.8b102 and the problem is
>>> still there, although it's more difficult to show -- I ran 100 full
>>> builds yesterday, and five of them tripped on assertions that should be
>>> unreachable.
>> 
>> 
>> We switched on -XX:+TieredCompilation by default in b102. Switch it off to
>> use only the C2 compiler, which has the problem.
>> 
>> 
>>> 
>>>> G1 has larger write-barrier code than other GCs. It can affect inlining
>>>> decisions. You can try changing the -XX:InlineSmallCode=1000 value. It
>>>> controls inlining of methods which were already compiled.
>>>> 
>>>> You can also try -Xbatch -XX:CICompilerCount=1 to get serial
>>>> compilations.
>>> 
>>> 
>>> Thanks for these tips, Vladimir -- very helpful. I hope you don't mind
>>> me asking one more question - we had a discussion with another Lucene
>>> developer yesterday -- is -Xbatch deterministic in the sense that if
>>> you run a single-threaded, deterministic piece of code it will always
>>> trigger compiles at the same time? What happens if there are two
>>> uncoordinated threads that hit a set of the same methods (and thus
>>> when the compiler kicks in the statistics will probably be different
>>> for each independent run)?
>> 
>> 
>> -Xbatch (equivalent to -XX:-BackgroundCompilation) will block only the thread
>> which first put the compilation task on the compile queue. Other threads check
>> that the task is in the queue and resume execution without waiting.
>> You still can't get full determinism with several Java threads, as you
>> noticed. But with -XX:CICompilerCount=1 it can reduce some variation in
>> inlining decisions because compilation will be executed by one compiler
>> thread (instead of 2 by default). So if compilation tasks are put on the
>> queue in the same order in different runs, you will most likely get the same
>> code generation. Of course, usually the order is slightly different
>> (especially during startup when there are a lot of compilation requests), so
>> you can still get different results.
>> 
>> 
>>> 
>>> This question originated from a broader discussion where we were
>>> wondering how you, the compiler-guru guys, approach debugging in
>>> case something like this pops up -- a bug that is very hard to
>>> reproduce, that manifests itself rarely and for which pretty much any
>>> change at the Java level changes the compilation and thus generates
>>> completely different code. This seems to be a tough nut to crack.
>> 
>> 
>> We usually try to reproduce the problem with a debug version of the VM, which
>> has a lot of asserts, and we may hit one which helps identify the problem. You
>> are lucky if you can reproduce a problem in a debug VM under a debugger.
>> We try to get the assembler output of the compiled method during the run when
>> it crashes. The hs_err file has the address and offset in the compiled code
>> and a small code snippet which helps to find the code. After that we "look
>> hard" at the assembler code and try to figure out what is wrong with it and
>> which compiler part can generate such a code pattern.
>> There is a debug flag, -XX:AbortVMOnException=java.lang.NullPointerException,
>> which allows aborting the VM on exceptions. And with the
>> -XX:+ShowMessageBoxOnError flag we can attach a debugger to the VM when that
>> happens.
>> When we get only a core file it is tough. We try to use the Serviceability
>> Agent to extract information, compiled code and other data from it.
>> 
>> Another suggestion for you. Since you can avoid the problem by switching off
>> EA, you can try to switch off only
>> 
>> -XX:-OptimizePtrCompare    "Use escape analysis to optimize pointers
>> compare"
>> -XX:-EliminateAutoBox      "Control optimizations for autobox elimination"
>> 
>> Vladimir
>> 
>>> 
>>> Dawid
>>> 
>>
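
Since the failure in this thread shows up only in a fraction of runs (five out of 100 builds), a small wrapper that repeats a command and counts the failing runs can help compare flag combinations. This is only a sketch: `count_failures` is a helper name made up here, and the launcher class in the commented usage is a placeholder, not the real Lucene test entry point.

```shell
#!/bin/sh
# count_failures N CMD...: run CMD N times and print how many runs
# exited with a non-zero status. Output of CMD itself is discarded.
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# Hypothetical usage with the serial-compilation flags from the thread;
# org.example.TestMain is a placeholder launcher class:
# count_failures 100 java -Xbatch -XX:CICompilerCount=1 \
#     -XX:-TieredCompilation -XX:+UseG1GC org.example.TestMain
```

Because the tests are randomized, the count is only meaningful when compared across the same number of runs for each flag set.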