G1GC/ JIT compilation bug hunt.

Mon Aug 26 12:56:46 PDT 2013

> An ideal scenario and what I want to achieve is I'd like to stop on an assertion somewhere in the jitted code and go back to it to follow the execution (so that I can see what code is responsible for the issing
increment). I'll need to sync this up (manually or otherwise) with the
optoassembly too [...]

Would somebody be so kind and provide some light on whether the above
is feasible? I'm getting up to speed with visual studio -- haven't
used assembly-level debugging in a longer while -- but I'm having
troubles connecting the dots with verbose optoassembly output (the
addresses don't seem to match). Thanks and again, apologies if it's
something trivial but I couldn't find any helpful resources on the
net.

Dawid

On Sun, Aug 25, 2013 at 9:43 PM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
> I know why the assertion is hit. I still don't know (and probably
> won't figure it out) why it is so difficult to reproduce or which
> optimization leads to it.
>
> In short, the following code in ByteSliceReader#readByte:
>
>   public byte readByte() {
>     assert !eof();
>     assert upto <= limit;
>     if (upto == limit) {
>       nextSlice();
>     }
>     return buffer[upto++];
>   }
>
> is inlined in a number of places. If I include:
>
>   public byte readByte() {
>     assert !eof();
>     assert upto <= limit;
>     if (upto == limit) {
>       nextSlice();
>       baz = upto;
>     }
>     return buffer[upto++];
>   }
>   static int baz;
>
> I see a number of assignments to 'baz'. This is because methods like
> readVInt fork out many conditional branches, as in:
>
>   public int readVInt() throws IOException {
>     byte b = readByte();
>     if (b >= 0) return b;
>     int i = b & 0x7F;
>     b = readByte();
>     i |= (b & 0x7F) << 7;
>     if (b >= 0) return i;
>     b = readByte();
>     i |= (b & 0x7F) << 14;
>     if (b >= 0) return i;
>     b = readByte();
>     i |= (b & 0x7F) << 21;
>     if (b >= 0) return i;
>     b = readByte();
>     // Warning: the next ands use 0x0F / 0xF0 - beware copy/paste errors:
>     i |= (b & 0x0F) << 28;
>     if ((b & 0xF0) == 0) return i;
>     throw new IOException("Invalid vInt detected (too many bits)");
>   }
>
> One of these executions (when upto == limit, and possibly with some
> other upper condition) leads to missing increment on buffer[upto++] --
> upto remains unmodified (but the value at buffer[upto] is returned
> properly). Because upto is moved to a register in many of those
> inlined methods I suspect either the register may be incremented and
> not saved or the increment may be removed due to some local escape
> analysis magic (that has to be incorrect, obviously). How G1GC
> interacts with all this (remember, it only happens with G1GC) I have
> no clue.
>
> Anyway, I've tried looking at -XX:+PrintEscapeAnalysis but I couldn't
> make much sense of it. I also dug into some of the opto assembly
> branches trying to map them back into the original bytecode, but it's
> a very long process -- had to map PcDesc manually to the assembly
> offset, then go to bytecode again. Then I fired visual studio and
> halted it like Vladimir mentioned:
> -XX:AbortVMOnException=java.lang.AssertionError
> -XX:+ShowMessageBoxOnError
>
> Once I attach to the halted JVM I "break all" and I see the reported
> thread's stack trace halted inside ntdll (with the closest JVM frame
> stopped at get_deopt_original_pc). The frames below it don't seem to
> make much sense to me -- they don't seem to point to any addresses
> reported in the compilation log (PcDesc clauses). I'm a complete
> newbie to this, so I apologize if it's a lame question, but how can I
> find out which part of the original jitted code this assertion was
> invoked from?
>
> An ideal scenario and what I want to achieve is I'd like to stop on an
> assertion somewhere in the jitted code and go back to it to follow the
> execution (so that I can see what code is responsible for the missing
> increment). I'll need to sync this up (manually or otherwise) with the
> optoassembly too, otherwise I'll probably lose any remaining hair
> trying to figure out what's what :)
>
> A hint (or any other pointer; RTFM) would be very welcome.
>
> Dawid
>
>
> On Tue, Aug 20, 2013 at 11:32 AM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
>> One follow-up with respect to those  encoded integers -- I swapped the
>> correct and incorrect sequence in my last e-mail, sorry. I then also
>> added a sysout here:
>>
>> @@ -451,6 +453,7 @@
>>                termFreq = 1;
>>              } else {
>>                termFreq = freq.readVInt();
>> +              System.err.print("# (tf): " + termFreq + " ");
>>              }
>>            }
>>
>> With this the assertion is no longer hit but it shows where the
>> difference in numbers comes from:
>>
>> Correct sequence is:
>>
>> # (eof)
>> # (code): 0
>> # (tf): 3
>> # (code): 4
>> # (tf): 3
>> # (code): 2
>> # (tf): 5
>> # (code): 2
>> # (tf): 2
>> # (code): 3
>> # (code): 5
>> # (code): 2
>>
>> And the incorrect, assertion-hit run shows:
>>
>> # (eof)
>> # (code): 0
>> # (code): 3
>> # (code): 4
>> # (code): 3
>> # (code): 2
>> # (code): 5
>> # (code): 2
>> # (code): 2
>> # (code): 3
>>
>> So there's something odd happening after the first readvint which
>> returns 0 -- the next number that should be read by:
>>
>> termFreq = freq.readVInt();
>>
>> is re-read as the next code.
>>
>> Dawid
>>
>>
>> On Tue, Aug 20, 2013 at 11:11 AM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
>>> Thanks for comments guys.
>>>
>>>> It is great that you can build fastdebug VM. I assume that if I give you a
>>>> patch to try you can also build with it.
>>>
>>> Yes, absolutely.
>>>
>>>> First, you can run with next options and then send zipped output (related
>>>> part before the method compilation and optoassembler output, don't use hsdis
>>>> for that) and hs_err file when test fails:
>>>
>>> I did that. Used the options you requested -- a full log of all I did
>>> is included in the ZIP file here:
>>> http://www.cs.put.poznan.pl/dweiss/tmp/debug-files.zip
>>>
>>> Cat debug-files\debug-log.txt. There are several runs included in that
>>> ZIP file, in short:
>>>
>>> - jdk18-fastdebug, b102, full run (no abort on assertion, explained in
>>> debug-log.txt):
>>>   001-no-abort-on-assertion
>>>   002-no-abort-on-assertion
>>>
>>> - jdk18-fastdebug, b102, abort on assertion (includes mem dumps)
>>>   003-with-abort-on-assertion
>>>   004-with-abort-on-assertion
>>>
>>> - jdk18-fastdebug, hsx tip, abort on assertion (includes mem dumps)
>>>   005-hsx-tip
>>>   006-hsx-tip
>>>
>>> - jdk18-fastdebug, hsx tip, excluded readvint method inlining (passed
>>> build, so only compilation logs available).
>>>   007-excluded-readvint
>>>
>>>> Looking on java code in FreqProxTermsWriterPerField::flush() and based on
>>>> LUCENE-5168 report generated code somehow missed EOF check. Am I right? This
>>>> is why the Assert is thrown?
>>>
>>> It's not the only method it trips on. Depending on the seed the
>>> problem shows up in different places. For the dumps I included the
>>> issue seems to be here:
>>>
>>>>         } else {
>>>>           final int code = freq.readVInt();
>>>>           if (!readTermFreq) {
>>>>             docID += code;
>>>>             termFreq = -1;
>>>
>>> If I add sysouts as in:
>>>
>>> Index: core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java
>>> ===================================================================
>>> --- core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java
>>>      (revision 1512807)
>>> +++ core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java
>>>      (working copy)
>>> @@ -427,6 +427,7 @@
>>>          //System.out.println("  cycle");
>>>          final int termFreq;
>>>          if (freq.eof()) {
>>> +          System.err.print("# (eof)");
>>>            if (postings.lastDocCodes[termID] != -1) {
>>>              // Return last doc
>>>              docID = postings.lastDocIDs[termID];
>>> @@ -442,6 +443,7 @@
>>>            }
>>>          } else {
>>>            final int code = freq.readVInt();
>>> +          System.err.print("# (code): " + code + " ");
>>>            if (!readTermFreq) {
>>>              docID += code;
>>>              termFreq = -1;
>>>
>>> then for a constant seed you'll get an identical sequence of values.
>>> Once the assertion hits though, the sequence deviates. Comparing a
>>> passing run (excluding readvint) and a failing run I get a lot of
>>> initial aligned outputs and then a deviation:
>>>
>>> (correct run):
>>> # (eof)
>>> # (eof)
>>> # (eof)
>>> # (code): 0
>>> # (code): 3
>>> # (code): 4
>>> # (code): 3
>>>
>>> (incorrect run):
>>> # (eof)
>>> # (eof)
>>> # (eof)
>>> # (code): 0
>>> # (code): 4    <- wtf?
>>> # (code): 2
>>> # (code): 2
>>>
>>> How can a variable encoded integer be misinterpreted from 3 to 4 is
>>> beyond me, sorry. But it's not an EOF condition I think, at least the
>>> deviation happens where the original run had was entering (code)
>>> branch.
>>>
>>>> Could you also send latest version of FreqProxTermsWriterPerField.java you
>>>> are testing?
>>>
>>> I'm testing against a fixed branch (the info is included in debug-log.txt).
>>>
>>> If there's anything else I can do, let me know. I'm also curious how
>>> you approach debugging this thing. It may be time-based so I don't
>>> think it'll reproduce on your machine out of the box, but who knows.
>>> All I know is that, for example, if you add one more sysout before the
>>> while loop it suddenly stops reproducing so probably something gets
>>> compiled differently... Wild :)
>>>
>>> Dawid