G1GC/ JIT compilation bug hunt.

Sun Aug 25 12:43:30 PDT 2013

I know why the assertion is hit. I still don't know (and probably
won't figure it out) why it is so difficult to reproduce or which
optimization leads to it.

In short, the following code in ByteSliceReader#readByte:

  public byte readByte() {
    assert !eof();
    assert upto <= limit;
    if (upto == limit) {
      nextSlice();
    }
    return buffer[upto++];
  }

is inlined in a number of places. If I include:

  public byte readByte() {
    assert !eof();
    assert upto <= limit;
    if (upto == limit) {
      nextSlice();
      baz = upto;
    }
    return buffer[upto++];
  }
  static int baz;

I see a number of assignments to 'baz'. This is because methods like
readVInt fork out many conditional branches, as in:

  public int readVInt() throws IOException {
    byte b = readByte();
    if (b >= 0) return b;
    int i = b & 0x7F;
    b = readByte();
    i |= (b & 0x7F) << 7;
    if (b >= 0) return i;
    b = readByte();
    i |= (b & 0x7F) << 14;
    if (b >= 0) return i;
    b = readByte();
    i |= (b & 0x7F) << 21;
    if (b >= 0) return i;
    b = readByte();
    // Warning: the next ands use 0x0F / 0xF0 - beware copy/paste errors:
    i |= (b & 0x0F) << 28;
    if ((b & 0xF0) == 0) return i;
    throw new IOException("Invalid vInt detected (too many bits)");
  }

One of these executions (when upto == limit, and possibly with some
other upper condition) leads to missing increment on buffer[upto++] --
upto remains unmodified (but the value at buffer[upto] is returned
properly). Because upto is moved to a register in many of those
inlined methods I suspect either the register may be incremented and
not saved or the increment may be removed due to some local escape
analysis magic (that has to be incorrect, obviously). How G1GC
interacts with all this (remember, it only happens with G1GC) I have
no clue.

Anyway, I've tried looking at -XX:+PrintEscapeAnalysis but I couldn't
make much sense of it. I also dug into some of the opto assembly
branches trying to map them back into the original bytecode, but it's
a very long process -- had to map PcDesc manually to the assembly
offset, then go to bytecode again. Then I fired visual studio and
halted it like Vladimir mentioned:
-XX:AbortVMOnException=java.lang.AssertionError
-XX:+ShowMessageBoxOnError

Once I attach to the halted JVM I "break all" and I see the reported
thread's stack trace halted inside ntdll (with the closest JVM frame
stopped at get_deopt_original_pc). The frames below it don't seem to
make much sense to me -- they don't seem to point to any addresses
reported in the compilation log (PcDesc clauses). I'm a complete
newbie to this, so I apologize if it's a lame question, but how can I
find out which part of the original jitted code this assertion was
invoked from?

An ideal scenario and what I want to achieve is I'd like to stop on an
assertion somewhere in the jitted code and go back to it to follow the
execution (so that I can see what code is responsible for the missing
increment). I'll need to sync this up (manually or otherwise) with the
optoassembly too, otherwise I'll probably lose any remaining hair
trying to figure out what's what :)

A hint (or any other pointer; RTFM) would be very welcome.

Dawid

On Tue, Aug 20, 2013 at 11:32 AM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
> One follow-up with respect to those  encoded integers -- I swapped the
> correct and incorrect sequence in my last e-mail, sorry. I then also
> added a sysout here:
>
> @@ -451,6 +453,7 @@
>                termFreq = 1;
>              } else {
>                termFreq = freq.readVInt();
> +              System.err.print("# (tf): " + termFreq + " ");
>              }
>            }
>
> With this the assertion is no longer hit but it shows where the
> difference in numbers comes from:
>
> Correct sequence is:
>
> # (eof)
> # (code): 0
> # (tf): 3
> # (code): 4
> # (tf): 3
> # (code): 2
> # (tf): 5
> # (code): 2
> # (tf): 2
> # (code): 3
> # (code): 5
> # (code): 2
>
> And the incorrect, assertion-hit run shows:
>
> # (eof)
> # (code): 0
> # (code): 3
> # (code): 4
> # (code): 3
> # (code): 2
> # (code): 5
> # (code): 2
> # (code): 2
> # (code): 3
>
> So there's something odd happening after the first readvint which
> returns 0 -- the next number that should be read by:
>
> termFreq = freq.readVInt();
>
> is re-read as the next code.
>
> Dawid
>
>
> On Tue, Aug 20, 2013 at 11:11 AM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
>> Thanks for comments guys.
>>
>>> It is great that you can build fastdebug VM. I assume that if I give you a
>>> patch to try you can also build with it.
>>
>> Yes, absolutely.
>>
>>> First, you can run with next options and then send zipped output (related
>>> part before the method compilation and optoassembler output, don't use hsdis
>>> for that) and hs_err file when test fails:
>>
>> I did that. Used the options you requested -- a full log of all I did
>> is included in the ZIP file here:
>> http://www.cs.put.poznan.pl/dweiss/tmp/debug-files.zip
>>
>> Cat debug-files\debug-log.txt. There are several runs included in that
>> ZIP file, in short:
>>
>> - jdk18-fastdebug, b102, full run (no abort on assertion, explained in
>> debug-log.txt):
>>   001-no-abort-on-assertion
>>   002-no-abort-on-assertion
>>
>> - jdk18-fastdebug, b102, abort on assertion (includes mem dumps)
>>   003-with-abort-on-assertion
>>   004-with-abort-on-assertion
>>
>> - jdk18-fastdebug, hsx tip, abort on assertion (includes mem dumps)
>>   005-hsx-tip
>>   006-hsx-tip
>>
>> - jdk18-fastdebug, hsx tip, excluded readvint method inlining (passed
>> build, so only compilation logs available).
>>   007-excluded-readvint
>>
>>> Looking on java code in FreqProxTermsWriterPerField::flush() and based on
>>> LUCENE-5168 report generated code somehow missed EOF check. Am I right? This
>>> is why the Assert is thrown?
>>
>> It's not the only method it trips on. Depending on the seed the
>> problem shows up in different places. For the dumps I included the
>> issue seems to be here:
>>
>>>         } else {
>>>           final int code = freq.readVInt();
>>>           if (!readTermFreq) {
>>>             docID += code;
>>>             termFreq = -1;
>>
>> If I add sysouts as in:
>>
>> Index: core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java
>> ===================================================================
>> --- core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java
>>      (revision 1512807)
>> +++ core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java
>>      (working copy)
>> @@ -427,6 +427,7 @@
>>          //System.out.println("  cycle");
>>          final int termFreq;
>>          if (freq.eof()) {
>> +          System.err.print("# (eof)");
>>            if (postings.lastDocCodes[termID] != -1) {
>>              // Return last doc
>>              docID = postings.lastDocIDs[termID];
>> @@ -442,6 +443,7 @@
>>            }
>>          } else {
>>            final int code = freq.readVInt();
>> +          System.err.print("# (code): " + code + " ");
>>            if (!readTermFreq) {
>>              docID += code;
>>              termFreq = -1;
>>
>> then for a constant seed you'll get an identical sequence of values.
>> Once the assertion hits though, the sequence deviates. Comparing a
>> passing run (excluding readvint) and a failing run I get a lot of
>> initial aligned outputs and then a deviation:
>>
>> (correct run):
>> # (eof)
>> # (eof)
>> # (eof)
>> # (code): 0
>> # (code): 3
>> # (code): 4
>> # (code): 3
>>
>> (incorrect run):
>> # (eof)
>> # (eof)
>> # (eof)
>> # (code): 0
>> # (code): 4    <- wtf?
>> # (code): 2
>> # (code): 2
>>
>> How can a variable encoded integer be misinterpreted from 3 to 4 is
>> beyond me, sorry. But it's not an EOF condition I think, at least the
>> deviation happens where the original run had was entering (code)
>> branch.
>>
>>> Could you also send latest version of FreqProxTermsWriterPerField.java you
>>> are testing?
>>
>> I'm testing against a fixed branch (the info is included in debug-log.txt).
>>
>> If there's anything else I can do, let me know. I'm also curious how
>> you approach debugging this thing. It may be time-based so I don't
>> think it'll reproduce on your machine out of the box, but who knows.
>> All I know is that, for example, if you add one more sysout before the
>> while loop it suddenly stops reproducing so probably something gets
>> compiled differently... Wild :)
>>
>> Dawid