G1GC/ JIT compilation bug hunt.

Mon Aug 19 16:12:07 PDT 2013

Christian,

'exclude' or 'compileonly' will not help because reproducing the failure
depends on inlining.

Dawid,

It is great that you can build a fastdebug VM. I assume that if I give you
a patch to try, you can also build with it.

First, please run with the following options and send the zipped output
(the relevant part before the method compilation plus the optoassembler
output; don't use hsdis for that) and the hs_err file when the test fails:

-XX:CICompilerCount=1 -XX:+PrintInlining -XX:+TraceLoopOpts 
-XX:-BlockLayoutByFrequency -XX:-BlockLayoutRotateLoops 
-XX:CompileCommand=print,org.apache.lucene.index.FreqProxTermsWriterPerField::flush 
-XX:AbortVMOnException=java.lang.AssertionError
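
When the tests run in a forked JVM, it can be worth confirming that the
options actually reached it. The following is only a sketch, assuming a
HotSpot JVM that exposes com.sun.management.HotSpotDiagnosticMXBean; the
class name VmFlagCheck and the choice of flags to print are illustrative
and not part of Lucene or the test framework. It reads only product flags
mentioned in this thread.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: prints the effective values of a few HotSpot flags.
public class VmFlagCheck {
    // Product flags discussed in this thread (illustrative selection).
    static final String[] FLAGS = {
        "CICompilerCount", "BackgroundCompilation", "TieredCompilation", "InlineSmallCode"
    };

    static Map<String, String> readFlags() {
        Map<String, String> values = new LinkedHashMap<>();
        HotSpotDiagnosticMXBean bean = null;
        try {
            bean = ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        } catch (RuntimeException e) {
            // Bean not available on this VM; every flag is reported as unavailable.
        }
        for (String flag : FLAGS) {
            String value = "<unavailable>";
            if (bean != null) {
                try {
                    value = bean.getVMOption(flag).getValue();
                } catch (IllegalArgumentException e) {
                    // Flag is not known to this VM build.
                }
            }
            values.put(flag, value);
        }
        return values;
    }

    public static void main(String[] args) {
        readFlags().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Running it with the same options as the failing test should show, for
example, the CICompilerCount value that was requested on the command line.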

Also, please pull the latest C2 compiler changes from
http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot

Looking at the Java code in FreqProxTermsWriterPerField::flush(), and
based on the LUCENE-5168 report, the generated code somehow missed the EOF
check. Am I right? Is this why the assert is thrown?

And the failing code is the following:

         } else {
           final int code = freq.readVInt();
           if (!readTermFreq) {
             docID += code;
             termFreq = -1;
           } else {
             docID += code >>> 1;
             if ((code & 1) != 0) {
               termFreq = 1;
             } else {
               termFreq = freq.readVInt();
             }
           }

           assert docID != postings.lastDocIDs[termID];
         }
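
The decoding step above can be tried in isolation with the sketch below.
It is not the Lucene source: VIntReader is a hypothetical stand-in for
whatever sits behind `freq`, it assumes the usual variable-length int
layout (7 data bits per byte, high bit set when more bytes follow), and
the surrounding loop with its eof() test is a simplification that leaves
out everything else flush() does.

```java
import java.util.ArrayList;
import java.util.List;

public class PostingsDecodeSketch {
    /** Hypothetical minimal reader; NOT Lucene's class. */
    static final class VIntReader {
        private final byte[] buf;
        private int pos;

        VIntReader(byte[] buf) { this.buf = buf; }

        boolean eof() { return pos >= buf.length; }

        /** Assumed encoding: low 7 bits first, high bit marks continuation. */
        int readVInt() {
            int value = 0;
            int shift = 0;
            while (true) {
                byte b = buf[pos++];
                value |= (b & 0x7F) << shift;
                if ((b & 0x80) == 0) {
                    return value;
                }
                shift += 7;
            }
        }
    }

    /** Returns {docID, termFreq} pairs decoded with the same branch logic. */
    static List<int[]> decode(byte[] data, boolean readTermFreq) {
        VIntReader freq = new VIntReader(data);
        List<int[]> out = new ArrayList<>();
        int docID = 0;
        while (!freq.eof()) {            // simplified EOF check (assumption)
            final int code = freq.readVInt();
            int termFreq;
            if (!readTermFreq) {
                docID += code;           // code is a plain doc delta
                termFreq = -1;
            } else {
                docID += code >>> 1;     // upper bits: doc delta
                if ((code & 1) != 0) {
                    termFreq = 1;        // low bit set: freq is exactly 1
                } else {
                    termFreq = freq.readVInt();
                }
            }
            out.add(new int[] {docID, termFreq});
        }
        return out;
    }

    public static void main(String[] args) {
        // doc 3, freq 1 -> code (3 << 1) | 1 = 7
        // doc 5 (delta 2), freq 4 -> code 2 << 1 = 4, followed by freq 4
        byte[] data = {7, 4, 4};
        for (int[] p : decode(data, true)) {
            System.out.println("docID=" + p[0] + " termFreq=" + p[1]);
        }
    }
}
```

If the compiled code skipped an EOF test like the one in this loop, one
more entry would be decoded than was written, which fits the suspicion
raised above.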

Could you also send latest version of FreqProxTermsWriterPerField.java 
you are testing?

Thanks,
Vladimir

On 8/19/13 2:45 PM, Christian Thalinger wrote:
>
> On Aug 19, 2013, at 1:29 PM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
>
>> Thanks Vladimir, the tiered compilation hint was really very useful. I
>> managed to reproduce this error on a 1.8 fastbuild and I can dump
>> pretty much anything I want. But I still cannot figure out what's
>> wrong -- it's beyond me. Here are the things I've tried per your
>> suggestions
>>
>> - turning off -XX:-OptimizePtrCompare or -XX:-EliminateAutoBox
>> doesn't help (error still present),
>> - I do have a -Xbatch with a single compiler thread -- it was more
>> difficult to hit an error seed (the tests are randomized) but it's
>> still possible.
>>
>> I know it would be best to provide a stand-alone reproduction package but
>> it's not trivial given how complex the Lucene tests and testing framework
>> are. Can I provide you with anything that would be helpful except the
>> above? :) Specifically I can:
>>
>> 1) dump an opto assembly from a failed and non-failed run
>> 2) provide an opto assembly from a g1gc run vs. any other gc (which
>> doesn't seem to exhibit the problem),
>> 3) provide a -XX:+PrintCompilation -XX:+PrintCompilation2 or a verbose
>> hotspot.log.
>>
>> Let me know if any of these (or anything else) would be useful. If
>> not, I'll try to extract a stand-alone package that would reproduce
>> the issue, although this is a real killer to pull off.
>
> How many total compiles are there?  Maybe you can try to limit the methods being compiled with:
>
> -XX:CompileCommand=compileonly,foo::bar
>
> or
>
> -XX:CompileCommand=exclude,foo::bar
>
> One problem I see though is that using compileonly also prevents inlining which might hide the issue.
>
> -- Chris
>
>>
>> Dawid
>>
>>
>> On Fri, Aug 16, 2013 at 9:50 AM, Vladimir Kozlov
>> <vladimir.kozlov at oracle.com> wrote:
>>> On 8/15/13 11:45 PM, Dawid Weiss wrote:
>>>>>
>>>>> It is with high probability Compiler problem.
>>>>
>>>>
>>>> I believe so. I've re-run the tests with 1.8b102 and the problem is
>>>> still there, although it's more difficult to show -- I ran 100 full
>>>> builds yesterday, five of them tripped on assertions that should be
>>>> unreachable.
>>>
>>>
>>> We switched on -XX:+TieredCompilation by default in b102. Switch it off to
>>> use only the C2 compiler, which has the problem.
>>>
>>>
>>>>
>>>>> G1 has larger write-barrier code than other GCs. It can affect inlining
>>>>> decisions. You can try to change -XX:InlineSmallCode=1000 value. It
>>>>> controls
>>>>> inlining of methods which were already compiled.
>>>>>
>>>>> You can also try -Xbatch -XX:CICompilerCount=1 to get serial
>>>>> compilations.
>>>>
>>>>
>>>> Thanks for these tips, Vladimir -- very helpful. I hope you don't mind
>>>> me asking one more question - we had a discussion with another Lucene
>>>> developer yesterday -- is -Xbatch deterministic in the sense that if
>>>> you run a single thread/ deterministic piece of code it will always
>>>> trigger compiles at the same time? What happens if there are two
>>>> uncoordinated threads that hit a set of the same methods (and thus
>>>> when the compiler kicks in the statistics will probably be different
>>>> for each independent run)?
>>>
>>>
>>> -Xbatch (equivalent to -XX:-BackgroundCompilation) blocks only the thread
>>> that first puts a compilation task on the compile queue. Other threads see
>>> that the task is already in the queue and resume execution without waiting.
>>> You still can't get full determinism with several Java threads, as you
>>> noticed. But combined with -XX:CICompilerCount=1 it can reduce some
>>> variation in inlining decisions, because compilation will be executed by
>>> one compiler thread (instead of 2 by default). So if compilation tasks are
>>> put on the queue in the same order in different runs, you will most likely
>>> get the same generated code. Of course, the order is usually slightly
>>> different (especially during startup, when there are a lot of compilation
>>> requests), so you can still get different results.
>>>
>>>
>>>>
>>>> This question originated from a broader discussion where we were
>>>> wondering how you, the compiler-guru guys approach the debugging in
>>>> case something like this pops up -- a bug that is very hard to
>>>> reproduce, that manifests itself rarely and for which pretty much any
>>>> change at the Java level changes the compilation and thus generates
>>>> completely different code. This seems to be a tough nut to crack.
>>>
>>>
>>> We usually try to reproduce the problem with a debug version of the VM,
>>> which has a lot of asserts, and we may hit one that helps identify the
>>> problem. You are lucky if you can reproduce a problem in a debug VM under a
>>> debugger.
>>> We try to get the assembler output of the compiled method during the run in
>>> which it crashes. The hs_err file has the address and offset in the
>>> compiled code and a small code snippet, which help to find the code. After
>>> that we "look hard" at the assembler code and try to figure out what is
>>> wrong with it and which part of the compiler can generate such a code
>>> pattern.
>>> There is a debug flag, -XX:AbortVMOnException=java.lang.NullPointerException,
>>> which aborts the VM on the given exception. And with the
>>> -XX:+ShowMessageBoxOnError flag we can attach a debugger to the VM when
>>> that happens.
>>> When we get only a core file, it is tough. We try to use the Serviceability
>>> Agent to extract information, compiled code, and other data from it.
>>>
>>> Another suggestion for you: since you can avoid the problem with EA
>>> switched off, you can try switching off only
>>>
>>> -XX:-OptimizePtrCompare    "Use escape analysis to optimize pointers
>>> compare"
>>> -XX:-EliminateAutoBox      "Control optimizations for autobox elimination"
>>>
>>> Vladimir
>>>
>>>>
>>>> Dawid
>>>>
>>>
>