G1GC/ JIT compilation bug hunt.

Mon Aug 19 16:40:23 PDT 2013

On Aug 19, 2013, at 4:12 PM, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:

> Christian,
> 
> 'exclude' or 'compileonly' will not help because reproducing the failure depends on inlining.

Right.  This brings me back to something we discussed a couple times already:  compileonly should not disable inlining of callees.  Someone should fix this.

-- Chris

> 
> Dawid,
> 
> It is great that you can build a fastdebug VM. I assume that if I give you a patch to try, you can also build with it.
> 
> First, you can run with the following options and then send the zipped output (the related part before the method compilation and the opto assembler output; don't use hsdis for that) and the hs_err file when the test fails:
> 
> -XX:CICompilerCount=1 -XX:+PrintInlining -XX:+TraceLoopOpts -XX:-BlockLayoutByFrequency -XX:-BlockLayoutRotateLoops -XX:CompileCommand=print,org.apache.lucene.index.FreqProxTermsWriterPerField::flush -XX:AbortVMOnException=java.lang.AssertionError
> 
> Also, please pull the latest C2 compiler changes from http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot
> 
> Looking at the Java code in FreqProxTermsWriterPerField::flush() and based on the LUCENE-5168 report, the generated code somehow missed the EOF check. Am I right? Is this why the assert is thrown?
> 
> And the failing code is the following:
> 
>        } else {
>          final int code = freq.readVInt();
>          if (!readTermFreq) {
>            docID += code;
>            termFreq = -1;
>          } else {
>            docID += code >>> 1;
>            if ((code & 1) != 0) {
>              termFreq = 1;
>            } else {
>              termFreq = freq.readVInt();
>            }
>          }
> 
>          assert docID != postings.lastDocIDs[termID];
>        }
> 
> Could you also send the latest version of FreqProxTermsWriterPerField.java you are testing?
> 
> Thanks,
> Vladimir
> 
> On 8/19/13 2:45 PM, Christian Thalinger wrote:
>> 
>> On Aug 19, 2013, at 1:29 PM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
>> 
>>> Thanks Vladimir, the tiered compilation hint was really very useful. I
>>> managed to reproduce this error on a 1.8 fastbuild and I can dump
>>> pretty much anything I want. But I still cannot figure out what's
>>> wrong -- it's beyond me. Here are the things I've tried per your
>>> suggestions
>>> 
>>> - running with -XX:-OptimizePtrCompare or -XX:-EliminateAutoBox
>>> doesn't help (the error is still present),
>>> - I do have a -Xbatch with a single compiler thread -- it was more
>>> difficult to hit an error seed (the tests are randomized) but it's
>>> still possible.
>>> 
>>> I know it would be best to provide a stand-alone reproduction package,
>>> but it's not trivial given how complex the Lucene tests and testing
>>> framework are. Can I provide you with anything that would be helpful
>>> besides the above? :) Specifically I can:
>>> 
>>> 1) dump an opto assembly from a failed and non-failed run
>>> 2) provide an opto assembly from a g1gc run vs. any other gc (which
>>> doesn't seem to exhibit the problem),
>>> 3) provide a -XX:+PrintCompilation -XX:+PrintCompilation2 or a verbose
>>> hotspot.log.
>>> 
>>> Let me know if any of these (or anything else) would be useful. If
>>> not, I'll try to extract a stand-alone package that would reproduce
>>> the issue, although this is a real killer to pull off.
>> 
>> How many total compiles are there?  Maybe you can try to limit the methods being compiled with:
>> 
>> -XX:CompileCommand=compileonly,foo::bar
>> 
>> or
>> 
>> -XX:CompileCommand=exclude,foo::bar
>> 
>> One problem I see though is that using compileonly also prevents inlining which might hide the issue.
>> 
>> -- Chris
>> 
>>> 
>>> Dawid
>>> 
>>> 
>>> On Fri, Aug 16, 2013 at 9:50 AM, Vladimir Kozlov
>>> <vladimir.kozlov at oracle.com> wrote:
>>>> On 8/15/13 11:45 PM, Dawid Weiss wrote:
>>>>>> 
>>>>>> It is with high probability a compiler problem.
>>>>> 
>>>>> 
>>>>> I believe so. I've re-run the tests with 1.8b102 and the problem is
>>>>> still there, although it's more difficult to show -- I ran 100 full
>>>>> builds yesterday, five of them tripped on assertions that should be
>>>>> unreachable.
>>>> 
>>>> 
>>>> We switched on -XX:+TieredCompilation by default in b102. Switch it off to
>>>> use only the C2 compiler, which has the problem.
>>>> 
>>>> 
>>>>> 
>>>>>> G1 has larger write-barrier code than other GCs. It can affect inlining
>>>>>> decisions. You can try to change the -XX:InlineSmallCode=1000 value. It
>>>>>> controls inlining of methods which were already compiled.
>>>>>> 
>>>>>> You can also try -Xbatch -XX:CICompilerCount=1 to get serial
>>>>>> compilations.
>>>>> 
>>>>> 
>>>>> Thanks for these tips, Vladimir -- very helpful. I hope you don't mind
>>>>> me asking one more question - we had a discussion with another Lucene
>>>>> developer yesterday -- is -Xbatch deterministic in the sense that if
>>>>> you run a single-threaded, deterministic piece of code it will always
>>>>> trigger compiles at the same time? What happens if there are two
>>>>> uncoordinated threads that hit a set of the same methods (and thus
>>>>> when the compiler kicks in the statistics will probably be different
>>>>> for each independent run)?
>>>> 
>>>> 
>>>> -Xbatch (equivalent to -XX:-BackgroundCompilation) blocks only the thread
>>>> which first puts a compilation task on the compile queue. Other threads
>>>> check that the task is in the queue and resume execution without waiting.
>>>> You still can't get full determinism with several Java threads, as you
>>>> noticed. But it can reduce some variation in inlining decisions because
>>>> compilation will be executed by one compiler thread (instead of 2 by
>>>> default). So if compilation tasks are put on the queue in the same order
>>>> in different runs, you will most likely get the same code generation. Of
>>>> course, usually the order is slightly different (especially during startup
>>>> when there are a lot of compilation requests), so you can still get
>>>> different results.
>>>> 
>>>> 
>>>>> 
>>>>> This question originated from a broader discussion where we were
>>>>> wondering how you, the compiler-guru guys approach the debugging in
>>>>> case something like this pops up -- a bug that is very hard to
>>>>> reproduce, that manifests itself rarely and for which pretty much any
>>>>> change at the Java level changes the compilation and thus generates
>>>>> completely different code. This seems to be a tough nut to crack.
>>>> 
>>>> 
>>>> We usually try to reproduce the problem with a debug version of the VM,
>>>> which has a lot of asserts, and we may hit one which helps identify the
>>>> problem. You are lucky if you can reproduce a problem in a debug VM in a
>>>> debugger.
>>>> We try to get the assembler output of the compiled method during the run
>>>> in which it crashes. The hs_err file has the address and offset in the
>>>> compiled code and a small code snippet which helps to find the code. After
>>>> that we "look hard" at the assembler code and try to figure out what is
>>>> wrong with it and which compiler part can generate such a code pattern.
>>>> There is a debug flag, -XX:AbortVMOnException=java.lang.NullPointerException,
>>>> which aborts the VM when the given exception is thrown. And with the
>>>> -XX:+ShowMessageBoxOnError flag we can attach a debugger to the VM when
>>>> that happens.
>>>> When we get only a core file it is tough. We try to use the Serviceability
>>>> Agent to extract information, compiled code, and other data from it.
>>>> 
>>>> Another suggestion for you. Since you can avoid the problem with EA
>>>> switched off, you can try to switch off only
>>>> 
>>>> -XX:-OptimizePtrCompare    "Use escape analysis to optimize pointers
>>>> compare"
>>>> -XX:-EliminateAutoBox      "Control optimizations for autobox elimination"
>>>> 
>>>> Vladimir
>>>> 
>>>>> 
>>>>> Dawid
>>>>> 
>>>> 
>>
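
For readers following the thread, here is a minimal, self-contained sketch of the delta decoding that the snippet quoted by Vladimir performs. It is not Lucene's code: the class name PostingsDecodeSketch, the VIntReader helper, and the encode/decode methods are hypothetical stand-ins written for illustration, and the duplicate-docID check is a simplified substitute for the quoted assertion against postings.lastDocIDs[termID]. It assumes the usual variable-length int layout (7 payload bits per byte, high bit as continuation flag).

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

final class PostingsDecodeSketch {

    // Hypothetical stand-in for the `freq` reader in the quoted snippet.
    static final class VIntReader {
        private final byte[] buf;
        private int pos;

        VIntReader(byte[] buf) { this.buf = buf; }

        boolean eof() { return pos >= buf.length; }

        // Reads 7 bits per byte, low-order group first; high bit means "more bytes follow".
        int readVInt() {
            int value = 0;
            int shift = 0;
            while (true) {
                byte b = buf[pos++];
                value |= (b & 0x7F) << shift;
                if ((b & 0x80) == 0) {
                    return value;
                }
                shift += 7;
            }
        }
    }

    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encodes {docID, termFreq} pairs with ascending docIDs (illustrative writer only).
    static byte[] encode(int[][] docs, boolean writeTermFreq) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int last = 0;
        for (int[] d : docs) {
            int delta = d[0] - last;
            last = d[0];
            if (!writeTermFreq) {
                writeVInt(out, delta);
            } else if (d[1] == 1) {
                writeVInt(out, (delta << 1) | 1);   // low bit set: termFreq == 1
            } else {
                writeVInt(out, delta << 1);         // low bit clear: termFreq follows
                writeVInt(out, d[1]);
            }
        }
        return out.toByteArray();
    }

    // Mirrors the branch structure of the quoted snippet, with an explicit EOF check.
    static List<int[]> decode(byte[] bytes, boolean readTermFreq) {
        VIntReader freq = new VIntReader(bytes);
        List<int[]> result = new ArrayList<>();
        int docID = 0;
        while (!freq.eof()) {
            final int previousDocID = docID;
            final int code = freq.readVInt();
            int termFreq;
            if (!readTermFreq) {
                docID += code;
                termFreq = -1;
            } else {
                docID += code >>> 1;
                if ((code & 1) != 0) {
                    termFreq = 1;
                } else {
                    termFreq = freq.readVInt();
                }
            }
            // Simplified sanity check (an assumption of this sketch, not the quoted assert).
            if (!result.isEmpty() && docID == previousDocID) {
                throw new IllegalStateException("duplicate docID " + docID);
            }
            result.add(new int[] {docID, termFreq});
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] bytes = encode(new int[][] {{3, 1}, {7, 4}, {300, 1}}, true);
        for (int[] d : decode(bytes, true)) {
            System.out.println("docID=" + d[0] + " termFreq=" + d[1]);
        }
    }
}
```

If the loop's EOF check were skipped by miscompiled code, the decoder would keep reading past the last encoded document, which is one way an assertion that "should be unreachable" could trip; whether that is what actually happens here is exactly the open question in the thread.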