G1GC/ JIT compilation bug hunt.

Mon Aug 26 18:07:32 PDT 2013

Dawid,

Let it crush with AbortVMOnException (without ShowMessageBoxOnError) and look on call stack in hs_err_pidxxx.log file. 
Jited code usually have uncommon trap for exceptions so the method most likely was deoptimized and you will see 
interpeter frames on call stack instead of compiled/jited. There should be deoptimization event listed down in 
hs_err_pidxxx.log which may say in which address in jited code it hit uncommon trap. You can look on it in optoassembly 
output. Then during method compilation (-XX:+PrintCompilation -Xbatch -XX:CICompilerCount=1 should give you info when 
the method is compiled, you can verify it in Compile() by this->method()->print()) in debugger stop in 
ciEnv::register_method() after call to nmethod::new_nmethod(). nm->insts_begin() shows where instructions start. Add 
instruction offset from optoassembly and set breakepoint at that address. This way you can stop in jited code.

G1 affects inlining since its write barriers code size is larger then other collectors (it has pre and post barriers for 
oop stores).

I still did not have time to look on your logs.

Vladimir

On 8/26/13 12:56 PM, Dawid Weiss wrote:
>> An ideal scenario and what I want to achieve is I'd like to stop on an assertion somewhere in the jitted code and go back to it to follow the execution (so that I can see what code is responsible for the issing
> increment). I'll need to sync this up (manually or otherwise) with the
> optoassembly too [...]
>
> Would somebody be so kind and provide some light on whether the above
> is feasible? I'm getting up to speed with visual studio -- haven't
> used assembly-level debugging in a longer while -- but I'm having
> troubles connecting the dots with verbose optoassembly output (the
> addresses don't seem to match). Thanks and again, apologies if it's
> something trivial but I couldn't find any helpful resources on the
> net.
>
> Dawid
>
> On Sun, Aug 25, 2013 at 9:43 PM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
>> I know why the assertion is hit. I still don't know (and probably
>> won't figure it out) why it is so difficult to reproduce or which
>> optimization leads to it.
>>
>> In short, the following code in ByteSliceReader#readByte:
>>
>>    public byte readByte() {
>>      assert !eof();
>>      assert upto <= limit;
>>      if (upto == limit) {
>>        nextSlice();
>>      }
>>      return buffer[upto++];
>>    }
>>
>> is inlined in a number of places. If I include:
>>
>>    public byte readByte() {
>>      assert !eof();
>>      assert upto <= limit;
>>      if (upto == limit) {
>>        nextSlice();
>>        baz = upto;
>>      }
>>      return buffer[upto++];
>>    }
>>    static int baz;
>>
>> I see a number of assignments to 'baz'. This is because methods like
>> readVInt fork out many conditional branches, as in:
>>
>>    public int readVInt() throws IOException {
>>      byte b = readByte();
>>      if (b >= 0) return b;
>>      int i = b & 0x7F;
>>      b = readByte();
>>      i |= (b & 0x7F) << 7;
>>      if (b >= 0) return i;
>>      b = readByte();
>>      i |= (b & 0x7F) << 14;
>>      if (b >= 0) return i;
>>      b = readByte();
>>      i |= (b & 0x7F) << 21;
>>      if (b >= 0) return i;
>>      b = readByte();
>>      // Warning: the next ands use 0x0F / 0xF0 - beware copy/paste errors:
>>      i |= (b & 0x0F) << 28;
>>      if ((b & 0xF0) == 0) return i;
>>      throw new IOException("Invalid vInt detected (too many bits)");
>>    }
>>
>> One of these executions (when upto == limit, and possibly with some
>> other upper condition) leads to missing increment on buffer[upto++] --
>> upto remains unmodified (but the value at buffer[upto] is returned
>> properly). Because upto is moved to a register in many of those
>> inlined methods I suspect either the register may be incremented and
>> not saved or the increment may be removed due to some local escape
>> analysis magic (that has to be incorrect, obviously). How G1GC
>> interacts with all this (remember, it only happens with G1GC) I have
>> no clue.
>>
>> Anyway, I've tried looking at -XX:+PrintEscapeAnalysis but I couldn't
>> make much sense of it. I also dug into some of the opto assembly
>> branches trying to map them back into the original bytecode, but it's
>> a very long process -- had to map PcDesc manually to the assembly
>> offset, then go to bytecode again. Then I fired visual studio and
>> halted it like Vladimir mentioned:
>> -XX:AbortVMOnException=java.lang.AssertionError
>> -XX:+ShowMessageBoxOnError
>>
>> Once I attach to the halted JVM I "break all" and I see the reported
>> thread's stack trace halted inside ntdll (with the closest JVM frame
>> stopped at get_deopt_original_pc). The frames below it don't seem to
>> make much sense to me -- they don't seem to point to any addresses
>> reported in the compilation log (PcDesc clauses). I'm a complete
>> newbie to this, so I apologize if it's a lame question, but how can I
>> find out which part of the original jitted code this assertion was
>> invoked from?
>>
>> An ideal scenario and what I want to achieve is I'd like to stop on an
>> assertion somewhere in the jitted code and go back to it to follow the
>> execution (so that I can see what code is responsible for the missing
>> increment). I'll need to sync this up (manually or otherwise) with the
>> optoassembly too, otherwise I'll probably lose any remaining hair
>> trying to figure out what's what :)
>>
>> A hint (or any other pointer; RTFM) would be very welcome.
>>
>> Dawid
>>
>>
>> On Tue, Aug 20, 2013 at 11:32 AM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
>>> One follow-up with respect to those  encoded integers -- I swapped the
>>> correct and incorrect sequence in my last e-mail, sorry. I then also
>>> added a sysout here:
>>>
>>> @@ -451,6 +453,7 @@
>>>                 termFreq = 1;
>>>               } else {
>>>                 termFreq = freq.readVInt();
>>> +              System.err.print("# (tf): " + termFreq + " ");
>>>               }
>>>             }
>>>
>>> With this the assertion is no longer hit but it shows where the
>>> difference in numbers comes from:
>>>
>>> Correct sequence is:
>>>
>>> # (eof)
>>> # (code): 0
>>> # (tf): 3
>>> # (code): 4
>>> # (tf): 3
>>> # (code): 2
>>> # (tf): 5
>>> # (code): 2
>>> # (tf): 2
>>> # (code): 3
>>> # (code): 5
>>> # (code): 2
>>>
>>> And the incorrect, assertion-hit run shows:
>>>
>>> # (eof)
>>> # (code): 0
>>> # (code): 3
>>> # (code): 4
>>> # (code): 3
>>> # (code): 2
>>> # (code): 5
>>> # (code): 2
>>> # (code): 2
>>> # (code): 3
>>>
>>> So there's something odd happening after the first readvint which
>>> returns 0 -- the next number that should be read by:
>>>
>>> termFreq = freq.readVInt();
>>>
>>> is re-read as the next code.
>>>
>>> Dawid
>>>
>>>
>>> On Tue, Aug 20, 2013 at 11:11 AM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
>>>> Thanks for comments guys.
>>>>
>>>>> It is great that you can build fastdebug VM. I assume that if I give you a
>>>>> patch to try you can also build with it.
>>>>
>>>> Yes, absolutely.
>>>>
>>>>> First, you can run with next options and then send zipped output (related
>>>>> part before the method compilation and optoassembler output, don't use hsdis
>>>>> for that) and hs_err file when test fails:
>>>>
>>>> I did that. Used the options you requested -- a full log of all I did
>>>> is included in the ZIP file here:
>>>> http://www.cs.put.poznan.pl/dweiss/tmp/debug-files.zip
>>>>
>>>> Cat debug-files\debug-log.txt. There are several runs included in that
>>>> ZIP file, in short:
>>>>
>>>> - jdk18-fastdebug, b102, full run (no abort on assertion, explained in
>>>> debug-log.txt):
>>>>    001-no-abort-on-assertion
>>>>    002-no-abort-on-assertion
>>>>
>>>> - jdk18-fastdebug, b102, abort on assertion (includes mem dumps)
>>>>    003-with-abort-on-assertion
>>>>    004-with-abort-on-assertion
>>>>
>>>> - jdk18-fastdebug, hsx tip, abort on assertion (includes mem dumps)
>>>>    005-hsx-tip
>>>>    006-hsx-tip
>>>>
>>>> - jdk18-fastdebug, hsx tip, excluded readvint method inlining (passed
>>>> build, so only compilation logs available).
>>>>    007-excluded-readvint
>>>>
>>>>> Looking on java code in FreqProxTermsWriterPerField::flush() and based on
>>>>> LUCENE-5168 report generated code somehow missed EOF check. Am I right? This
>>>>> is why the Assert is thrown?
>>>>
>>>> It's not the only method it trips on. Depending on the seed the
>>>> problem shows up in different places. For the dumps I included the
>>>> issue seems to be here:
>>>>
>>>>>          } else {
>>>>>            final int code = freq.readVInt();
>>>>>            if (!readTermFreq) {
>>>>>              docID += code;
>>>>>              termFreq = -1;
>>>>
>>>> If I add sysouts as in:
>>>>
>>>> Index: core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java
>>>> ===================================================================
>>>> --- core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java
>>>>       (revision 1512807)
>>>> +++ core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java
>>>>       (working copy)
>>>> @@ -427,6 +427,7 @@
>>>>           //System.out.println("  cycle");
>>>>           final int termFreq;
>>>>           if (freq.eof()) {
>>>> +          System.err.print("# (eof)");
>>>>             if (postings.lastDocCodes[termID] != -1) {
>>>>               // Return last doc
>>>>               docID = postings.lastDocIDs[termID];
>>>> @@ -442,6 +443,7 @@
>>>>             }
>>>>           } else {
>>>>             final int code = freq.readVInt();
>>>> +          System.err.print("# (code): " + code + " ");
>>>>             if (!readTermFreq) {
>>>>               docID += code;
>>>>               termFreq = -1;
>>>>
>>>> then for a constant seed you'll get an identical sequence of values.
>>>> Once the assertion hits though, the sequence deviates. Comparing a
>>>> passing run (excluding readvint) and a failing run I get a lot of
>>>> initial aligned outputs and then a deviation:
>>>>
>>>> (correct run):
>>>> # (eof)
>>>> # (eof)
>>>> # (eof)
>>>> # (code): 0
>>>> # (code): 3
>>>> # (code): 4
>>>> # (code): 3
>>>>
>>>> (incorrect run):
>>>> # (eof)
>>>> # (eof)
>>>> # (eof)
>>>> # (code): 0
>>>> # (code): 4    <- wtf?
>>>> # (code): 2
>>>> # (code): 2
>>>>
>>>> How can a variable encoded integer be misinterpreted from 3 to 4 is
>>>> beyond me, sorry. But it's not an EOF condition I think, at least the
>>>> deviation happens where the original run had was entering (code)
>>>> branch.
>>>>
>>>>> Could you also send latest version of FreqProxTermsWriterPerField.java you
>>>>> are testing?
>>>>
>>>> I'm testing against a fixed branch (the info is included in debug-log.txt).
>>>>
>>>> If there's anything else I can do, let me know. I'm also curious how
>>>> you approach debugging this thing. It may be time-based so I don't
>>>> think it'll reproduce on your machine out of the box, but who knows.
>>>> All I know is that, for example, if you add one more sysout before the
>>>> while loop it suddenly stops reproducing so probably something gets
>>>> compiled differently... Wild :)
>>>>
>>>> Dawid