G1GC/ JIT compilation bug hunt.
Christian Thalinger
christian.thalinger at oracle.com
Mon Aug 19 16:40:23 PDT 2013
On Aug 19, 2013, at 4:12 PM, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:
> Christian,
>
> 'exclude' or 'compileonly' will not help because the failure reproduction depends on inlining.
Right. This brings me back to something we discussed a couple times already: compileonly should not disable inlining of callees. Someone should fix this.
-- Chris
>
> Dawid,
>
> It is great that you can build a fastdebug VM. I assume that if I give you a patch to try you can also build with it.
>
> First, you can run with the following options and then send the zipped output (the relevant part before the method compilation plus the opto assembler output; don't use hsdis for that) and the hs_err file when the test fails:
>
> -XX:CICompilerCount=1 -XX:+PrintInlining -XX:+TraceLoopOpts -XX:-BlockLayoutByFrequency -XX:-BlockLayoutRotateLoops -XX:CompileCommand=print,org.apache.lucene.index.FreqProxTermsWriterPerField::flush -XX:AbortVMOnException=java.lang.AssertionError
>
> Also, please pull the latest C2 compiler changes from http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot
>
> Looking at the Java code in FreqProxTermsWriterPerField::flush() and based on the LUCENE-5168 report, the generated code somehow missed the EOF check. Am I right? Is this why the assert is thrown?
>
> And the failing code is the following:
>
> } else {
>   final int code = freq.readVInt();
>   if (!readTermFreq) {
>     docID += code;
>     termFreq = -1;
>   } else {
>     docID += code >>> 1;
>     if ((code & 1) != 0) {
>       termFreq = 1;
>     } else {
>       termFreq = freq.readVInt();
>     }
>   }
>
>   assert docID != postings.lastDocIDs[termID];
> }
>
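To make the decoding in that snippet easier to follow, here is a minimal stand-alone sketch of the same doc-delta / term-frequency scheme. It is not Lucene's code: the class DocDeltaSketch, the writer, the Reader cursor and the explicit eof() guard are all hypothetical names invented for illustration, and the assumption that the branch before the `else` handles end-of-stream comes only from the EOF remark above. The check is written as an exception rather than an `assert` so that it fires without -ea.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names and structure are not taken from Lucene.
class DocDeltaSketch {

    // Variable-length int: 7 data bits per byte, high bit set when more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Low bit of the code flags termFreq == 1; otherwise the frequency follows as its own VInt.
    static void writePosting(ByteArrayOutputStream out, int docDelta, int termFreq) {
        if (termFreq == 1) {
            writeVInt(out, (docDelta << 1) | 1);
        } else {
            writeVInt(out, docDelta << 1);
            writeVInt(out, termFreq);
        }
    }

    // Simple cursor over the encoded bytes.
    static final class Reader {
        private final byte[] buf;
        private int pos;

        Reader(byte[] buf) { this.buf = buf; }

        boolean eof() { return pos >= buf.length; }

        int readVInt() {
            int b = buf[pos++];
            int value = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = buf[pos++];
                value |= (b & 0x7F) << shift;
            }
            return value;
        }
    }

    // Decodes {docID, termFreq} pairs, mirroring the branch structure of the quoted snippet.
    static List<int[]> decode(byte[] encoded, boolean readTermFreq, int lastDocID) {
        Reader freq = new Reader(encoded);
        List<int[]> postings = new ArrayList<>();
        int docID = 0;
        while (!freq.eof()) { // assumed EOF guard; skipping it would read past the stream
            final int code = freq.readVInt();
            int termFreq;
            if (!readTermFreq) {
                docID += code;
                termFreq = -1;
            } else {
                docID += code >>> 1;
                termFreq = ((code & 1) != 0) ? 1 : freq.readVInt();
            }
            // Same condition as the assert in the quoted code.
            if (docID == lastDocID) {
                throw new IllegalStateException("decoded docID equals lastDocID: " + docID);
            }
            postings.add(new int[] {docID, termFreq});
        }
        return postings;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writePosting(out, 3, 1);   // docID 3
        writePosting(out, 4, 5);   // docID 7
        writePosting(out, 200, 1); // docID 207
        for (int[] p : decode(out.toByteArray(), true, 300)) {
            System.out.println("doc=" + p[0] + " freq=" + p[1]);
        }
    }
}
```

In this model, a decoded docID can only collide with lastDocID if the loop keeps reading when it should have stopped or the stream itself is inconsistent, which is why a missing EOF check in the generated code would be a plausible way to hit the assertion.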
> Could you also send the latest version of FreqProxTermsWriterPerField.java that you are testing?
>
> Thanks,
> Vladimir
>
> On 8/19/13 2:45 PM, Christian Thalinger wrote:
>>
>> On Aug 19, 2013, at 1:29 PM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
>>
>>> Thanks Vladimir, the tiered compilation hint was really very useful. I
>>> managed to reproduce this error on a 1.8 fastbuild and I can dump
>>> pretty much anything I want. But I still cannot figure out what's
>>> wrong -- it's beyond me. Here are the things I've tried per your
>>> suggestions
>>>
>>> - turning off -XX:-OptimizePtrCompare or -XX:-EliminateAutoBox
>>> doesn't help (error still present),
>>> - I do have a -Xbatch with a single compiler thread -- it was more
>>> difficult to hit an error seed (the tests are randomized) but it's
>>> still possible.
>>>
>>> I know it would be best to provide a stand-alone reproduction package but
>>> it's not trivial given how complex the Lucene tests and testing framework
>>> are. Can I provide you with anything that would be helpful except the
>>> above? :) Specifically I can:
>>>
>>> 1) dump an opto assembly from a failed and non-failed run
>>> 2) provide an opto assembly from a g1gc run vs. any other gc (which
>>> doesn't seem to exhibit the problem),
>>> 3) provide a -XX:+PrintCompilation -XX:+PrintCompilation2 or a verbose
>>> hotspot.log.
>>>
>>> Let me know if any of these (or anything else) would be useful. If
>>> not, I'll try to extract a stand-alone package that would reproduce
>>> the issue, although this is a real killer to pull off.
>>
>> How many total compiles are there? Maybe you can try to limit the methods being compiled with:
>>
>> -XX:CompileCommand=compileonly,foo::bar
>>
>> or
>>
>> -XX:CompileCommand=exclude,foo::bar
>>
>> One problem I see though is that using compileonly also prevents inlining which might hide the issue.
>>
>> -- Chris
>>
>>>
>>> Dawid
>>>
>>>
>>> On Fri, Aug 16, 2013 at 9:50 AM, Vladimir Kozlov
>>> <vladimir.kozlov at oracle.com> wrote:
>>>> On 8/15/13 11:45 PM, Dawid Weiss wrote:
>>>>>>
>>>>>> It is with high probability a compiler problem.
>>>>>
>>>>>
>>>>> I believe so. I've re-run the tests with 1.8b102 and the problem is
>>>>> still there, although it's more difficult to show -- I ran 100 full
>>>>> builds yesterday, five of them tripped on assertions that should be
>>>>> unreachable.
>>>>
>>>>
>>>> We switched on -XX:+TieredCompilation by default in b102. Switch it off to
>>>> use only the C2 compiler, which has the problem.
>>>>
>>>>
>>>>>
>>>>>> G1 has larger write-barrier code than other GCs. It can affect inlining
>>>>>> decisions. You can try to change the -XX:InlineSmallCode=1000 value. It
>>>>>> controls inlining of methods which were already compiled.
>>>>>>
>>>>>> You can also try -Xbatch -XX:CICompilerCount=1 to get serial
>>>>>> compilations.
>>>>>
>>>>>
>>>>> Thanks for these tips, Vladimir -- very helpful. I hope you don't mind
>>>>> me asking one more question - we had a discussion with another Lucene
>>>>> developer yesterday -- is -Xbatch deterministic in the sense that if
>>>>> you run a single thread/ deterministic piece of code it will always
>>>>> trigger compiles at the same time? What happens if there are two
>>>>> uncoordinated threads that hit a set of the same methods (and thus
>>>>> when the compiler kicks in the statistics will probably be different
>>>>> for each independent run)?
>>>>
>>>>
>>>> -Xbatch (equivalent to -XX:-BackgroundCompilation) will block only the thread
>>>> which first put the compilation task on the compile queue. Other threads check that
>>>> the task is in the queue and resume execution without waiting.
>>>> You still can't get full determinism with several Java threads, as you
>>>> noticed. But combined with -XX:CICompilerCount=1 it can reduce some variation
>>>> in inlining decisions because compilation will be executed by one compiler
>>>> thread (instead of 2 by default). So if compilation tasks are put on the queue
>>>> in the same order in different runs you will most likely get the same code
>>>> generation. Of course, usually the order is slightly different (especially
>>>> during startup when there are a lot of compilation requests) so you can still
>>>> get different results.
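When chasing a run-to-run difference like this, it can help to confirm from inside the forked test JVM that the flags really took effect. The following is a small sketch, assuming a HotSpot JVM that exposes com.sun.management.HotSpotDiagnosticMXBean; the class name CompilerFlagCheck is made up for illustration, and the three flags queried are the ones mentioned in this thread.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper; works only on JVMs that provide HotSpotDiagnosticMXBean.
class CompilerFlagCheck {

    // Returns the current value of a VM option as a string, e.g. "true" or "1".
    static String flag(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        String[] names = {"BackgroundCompilation", "CICompilerCount", "TieredCompilation"};
        for (String name : names) {
            System.out.println(name + " = " + flag(name));
        }
    }
}
```

Run with -Xbatch -XX:CICompilerCount=1 -XX:-TieredCompilation, one would expect BackgroundCompilation and TieredCompilation to be reported as false and CICompilerCount as 1; any other values suggest the options did not reach the JVM that actually executes the tests.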
>>>>
>>>>
>>>>>
>>>>> This question originated from a broader discussion where we were
>>>>> wondering how you, the compiler-guru guys approach the debugging in
>>>>> case something like this pops up -- a bug that is very hard to
>>>>> reproduce, that manifests itself rarely and for which pretty much any
>>>>> change at the Java level changes the compilation and thus generates
>>>>> completely different code. This seems to be a tough nut to crack.
>>>>
>>>>
>>>> We usually try to reproduce the problem with a debug version of the VM, which has
>>>> a lot of asserts, and we may hit one which helps identify the problem. You are
>>>> lucky if you can reproduce a problem in a debug VM under a debugger.
>>>> We try to get the assembler output of the compiled method during the run when it
>>>> crashes. The hs_err file has the address and offset in the compiled code and a small
>>>> code snippet which helps to find the code. After that we "look hard" at the assembler
>>>> code and try to figure out what is wrong with it and which compiler part can
>>>> generate such a code pattern.
>>>> There is a debug flag, -XX:AbortVMOnException=java.lang.NullPointerException,
>>>> which aborts the VM on the given exception. And with the -XX:+ShowMessageBoxOnError
>>>> flag we can attach a debugger to the VM when that happens.
>>>> When we get only a core file it is tough. We try to use the Serviceability Agent
>>>> to extract information, compiled code, and other data from it.
>>>>
>>>> Another suggestion for you. Since you can avoid the problem with EA switched
>>>> off, you can try to switch off only
>>>>
>>>> -XX:-OptimizePtrCompare "Use escape analysis to optimize pointers
>>>> compare"
>>>> -XX:-EliminateAutoBox "Control optimizations for autobox elimination"
>>>>
>>>> Vladimir
>>>>
>>>>>
>>>>> Dawid
>>>>>
>>>>
>>