G1GC/ JIT compilation bug hunt.

Thu Sep 5 13:09:29 PDT 2013

My quest continues :)

I thought it'd be interesting to see how far back I can trace this
issue. I fetched the official binaries for jdk17 (windows, 32-bit) and
did a binary search with the failing Lucene test command. The results
show that, in short:

...
jdk1.7.0_03: PASSES
jdk1.7.0_04: FAILS
...

and are consistent before and after. jdk1.7.0_04, 64-bit does *NOT*
exhibit the issue (and neither does any version afterwards, it only
happens on 32-bit; perhaps it's because of smaller number of available
registers and the need to spill?).

jdk1.7.0_04 was when G1GC was "officially" made supported but I don't
think this plays a big difference. I'll see if I can bsearch on
mercurial revisions to see which particular revision introduced the
problem. Anyway, the problem has to be a long-standing issue and not a
regression. Which makes it even more interesting I guess.

Dawid

On Mon, Sep 2, 2013 at 11:31 AM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
> Oh, I forgot to mention - that incremented value cannot be saved to
> [esp + #136] immediately because there's a value read based on the
> unincremented value (buffer[upto++]). This is all fine, but the
> problem is that the incremented spill slot [esp + #144] is never
> copied to that other spill slot that is read from the other inline.
>
> Again -- I have no clue what makes this particular code path so
> special but I can see it in the debugger clearly.
>
> Dawid
>
> On Mon, Sep 2, 2013 at 11:17 AM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
>> I've traced the assembly on a running instance. The problem is quite
>> interesting -- the code in question is purely computational, I still
>> don't see how G1GC can be of importance here. In short what's
>> happening is a method gets inlined in a larger loop more than once and
>> a field access is placed on the esp spill. The problem is both of
>> these inlines use a different spill slot and one of the updates is not
>> properly propagated -- it is saved to one spill slot, then the code
>> execution flow returns to the top of the loop and another inline reads
>> a stale value from another spill slot. When you disable escape
>> analysis everything works as expected.
>>
>> The crucial block is:
>>
>> 7e25   B1020: # B405 B1021 <- B1017 B1019  Freq: 0.00811184   // XXX
>> jump from b1017
>> 7e25   MOV    EBX,[ESP + #136]                  // ebx = 80e
>> 7e2c   MOV    ECX,EBX
>> 7e2e   INC    ECX                               // ecx = 80f, increment.
>> 7e2f   MOV    [ESP + #144],ECX                  // this is the
>> incremented spill but it never gets propagated to [esp + #136] or
>> re-read from memory.
>> 7e36   MOV    EBX,[ESP + #132]
>> 7e3d   MOV    [EBX + #12],ECX ! Field:
>> org/apache/lucene/index/ByteSliceReader.upto
>> 7e40   MOV    EBX,[ESP + #152]
>> 7e47   TEST   EBX,EBX
>> 7e49   Jne    B405  P=0.999999 C=-1.000000      // taken, null check
>>
>>
>> The incremented local value of upto gets saved to [ESP + #144] and to
>> memory but upon reiterating the loop it is re-read from [ESP + #136]
>> so it looks like a dependency is missed somewhere in the graph causing
>> escape analysis to fail.
>>
>> Ehm, obviously I have no idea how or where to fix this in the JVM code :)
>>
>> Full optoassembly dump and annotated opto assembly analysis is here:
>> http://www.cs.put.poznan.pl/dweiss/tmp/hotspot-bug-analysis.zip
>>
>> Let me know if I can do anything else.
>>
>> Dawid
>>
>> On Tue, Aug 27, 2013 at 9:15 AM, Dawid Weiss <dawid.weiss at gmail.com> wrote:
>>> Thanks Vladimir. I had the impression I may be looking at the
>>> interpreter. Just for completeness, I also thought/tried the following
>>> ways to stop at the jitted code:
>>>
>>> 1) call Windows debugger API endpoints from kernel32 --
>>>
>>> http://msdn.microsoft.com/en-us/library/windows/desktop/ms679297(v=vs.85).aspx
>>>
>>> via JNI/ JNA. This worked surprisingly well -- you can even wait for
>>> the debugger to attached to the code in a loop, works like a charm. :)
>>> Although once you break into the debugger I again couldn't make sense
>>> of the stack -- probably the infrequent call causes a deopt again and
>>> we're back in the interpreter.
>>>
>>> 2) use Unsafe to force a null pointer dereference :) This should send
>>> a debugger signal, right? I didn't try it though, didn't have the time
>>> yet. I suspect it may be moved to an infrequent trap too and we're
>>> back where we started, eh.
>>>
>>>> I still did not have time to look on your logs.
>>>
>>> No worries, it's been a very very educational experience for me.
>>> Besides, I used to write tons of assembly (not on x86) and it actually
>>> feels very good to be back at the assembly mnemonic level ;)
>>>
>>> Don't spoil my weekend pet project :) And seriously -- if you want to
>>> check out those logs let me know in advance and I'll prepare an update
>>> with hsx-tip if needed.
>>>
>>> Dawid
>>>
>>> On Tue, Aug 27, 2013 at 3:07 AM, Vladimir Kozlov
>>> <vladimir.kozlov at oracle.com> wrote:
>>>> Dawid,
>>>>
>>>> Let it crush with AbortVMOnException (without ShowMessageBoxOnError) and
>>>> look on call stack in hs_err_pidxxx.log file. Jited code usually have
>>>> uncommon trap for exceptions so the method most likely was deoptimized and
>>>> you will see interpeter frames on call stack instead of compiled/jited.
>>>> There should be deoptimization event listed down in hs_err_pidxxx.log which
>>>> may say in which address in jited code it hit uncommon trap. You can look on
>>>> it in optoassembly output. Then during method compilation
>>>> (-XX:+PrintCompilation -Xbatch -XX:CICompilerCount=1 should give you info
>>>> when the method is compiled, you can verify it in Compile() by
>>>> this->method()->print()) in debugger stop in ciEnv::register_method() after
>>>> call to nmethod::new_nmethod(). nm->insts_begin() shows where instructions
>>>> start. Add instruction offset from optoassembly and set breakepoint at that
>>>> address. This way you can stop in jited code.
>>>>
>>>> G1 affects inlining since its write barriers code size is larger then other
>>>> collectors (it has pre and post barriers for oop stores).
>>>>
>>>> I still did not have time to look on your logs.
>>>>
>>>> Vladimir
>>>>
>>>>
>>>> On 8/26/13 12:56 PM, Dawid Weiss wrote:
>>>>>>
>>>>>> An ideal scenario and what I want to achieve is I'd like to stop on an
>>>>>> assertion somewhere in the jitted code and go back to it to follow the
>>>>>> execution (so that I can see what code is responsible for the issing
>>>>>
>>>>> increment). I'll need to sync this up (manually or otherwise) with the
>>>>> optoassembly too [...]
>>>>>
>>>>> Would somebody be so kind and provide some light on whether the above
>>>>> is feasible? I'm getting up to speed with visual studio -- haven't
>>>>> used assembly-level debugging in a longer while -- but I'm having
>>>>> troubles connecting the dots with verbose optoassembly output (the
>>>>> addresses don't seem to match). Thanks and again, apologies if it's
>>>>> something trivial but I couldn't find any helpful resources on the
>>>>> net.
>>>>>
>>>>> Dawid
>>>>>
>>>>> On Sun, Aug 25, 2013 at 9:43 PM, Dawid Weiss <dawid.weiss at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> I know why the assertion is hit. I still don't know (and probably
>>>>>> won't figure it out) why it is so difficult to reproduce or which
>>>>>> optimization leads to it.
>>>>>>
>>>>>> In short, the following code in ByteSliceReader#readByte:
>>>>>>
>>>>>>    public byte readByte() {
>>>>>>      assert !eof();
>>>>>>      assert upto <= limit;
>>>>>>      if (upto == limit) {
>>>>>>        nextSlice();
>>>>>>      }
>>>>>>      return buffer[upto++];
>>>>>>    }
>>>>>>
>>>>>> is inlined in a number of places. If I include:
>>>>>>
>>>>>>    public byte readByte() {
>>>>>>      assert !eof();
>>>>>>      assert upto <= limit;
>>>>>>      if (upto == limit) {
>>>>>>        nextSlice();
>>>>>>        baz = upto;
>>>>>>      }
>>>>>>      return buffer[upto++];
>>>>>>    }
>>>>>>    static int baz;
>>>>>>
>>>>>> I see a number of assignments to 'baz'. This is because methods like
>>>>>> readVInt fork out many conditional branches, as in:
>>>>>>
>>>>>>    public int readVInt() throws IOException {
>>>>>>      byte b = readByte();
>>>>>>      if (b >= 0) return b;
>>>>>>      int i = b & 0x7F;
>>>>>>      b = readByte();
>>>>>>      i |= (b & 0x7F) << 7;
>>>>>>      if (b >= 0) return i;
>>>>>>      b = readByte();
>>>>>>      i |= (b & 0x7F) << 14;
>>>>>>      if (b >= 0) return i;
>>>>>>      b = readByte();
>>>>>>      i |= (b & 0x7F) << 21;
>>>>>>      if (b >= 0) return i;
>>>>>>      b = readByte();
>>>>>>      // Warning: the next ands use 0x0F / 0xF0 - beware copy/paste
>>>>>> errors:
>>>>>>      i |= (b & 0x0F) << 28;
>>>>>>      if ((b & 0xF0) == 0) return i;
>>>>>>      throw new IOException("Invalid vInt detected (too many bits)");
>>>>>>    }
>>>>>>
>>>>>> One of these executions (when upto == limit, and possibly with some
>>>>>> other upper condition) leads to missing increment on buffer[upto++] --
>>>>>> upto remains unmodified (but the value at buffer[upto] is returned
>>>>>> properly). Because upto is moved to a register in many of those
>>>>>> inlined methods I suspect either the register may be incremented and
>>>>>> not saved or the increment may be removed due to some local escape
>>>>>> analysis magic (that has to be incorrect, obviously). How G1GC
>>>>>> interacts with all this (remember, it only happens with G1GC) I have
>>>>>> no clue.
>>>>>>
>>>>>> Anyway, I've tried looking at -XX:+PrintEscapeAnalysis but I couldn't
>>>>>> make much sense of it. I also dug into some of the opto assembly
>>>>>> branches trying to map them back into the original bytecode, but it's
>>>>>> a very long process -- had to map PcDesc manually to the assembly
>>>>>> offset, then go to bytecode again. Then I fired visual studio and
>>>>>> halted it like Vladimir mentioned:
>>>>>> -XX:AbortVMOnException=java.lang.AssertionError
>>>>>> -XX:+ShowMessageBoxOnError
>>>>>>
>>>>>> Once I attach to the halted JVM I "break all" and I see the reported
>>>>>> thread's stack trace halted inside ntdll (with the closest JVM frame
>>>>>> stopped at get_deopt_original_pc). The frames below it don't seem to
>>>>>> make much sense to me -- they don't seem to point to any addresses
>>>>>> reported in the compilation log (PcDesc clauses). I'm a complete
>>>>>> newbie to this, so I apologize if it's a lame question, but how can I
>>>>>> find out which part of the original jitted code this assertion was
>>>>>> invoked from?
>>>>>>
>>>>>> An ideal scenario and what I want to achieve is I'd like to stop on an
>>>>>> assertion somewhere in the jitted code and go back to it to follow the
>>>>>> execution (so that I can see what code is responsible for the missing
>>>>>> increment). I'll need to sync this up (manually or otherwise) with the
>>>>>> optoassembly too, otherwise I'll probably lose any remaining hair
>>>>>> trying to figure out what's what :)
>>>>>>
>>>>>> A hint (or any other pointer; RTFM) would be very welcome.
>>>>>>
>>>>>> Dawid
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 20, 2013 at 11:32 AM, Dawid Weiss <dawid.weiss at gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> One follow-up with respect to those  encoded integers -- I swapped the
>>>>>>> correct and incorrect sequence in my last e-mail, sorry. I then also
>>>>>>> added a sysout here:
>>>>>>>
>>>>>>> @@ -451,6 +453,7 @@
>>>>>>>                 termFreq = 1;
>>>>>>>               } else {
>>>>>>>                 termFreq = freq.readVInt();
>>>>>>> +              System.err.print("# (tf): " + termFreq + " ");
>>>>>>>               }
>>>>>>>             }
>>>>>>>
>>>>>>> With this the assertion is no longer hit but it shows where the
>>>>>>> difference in numbers comes from:
>>>>>>>
>>>>>>> Correct sequence is:
>>>>>>>
>>>>>>> # (eof)
>>>>>>> # (code): 0
>>>>>>> # (tf): 3
>>>>>>> # (code): 4
>>>>>>> # (tf): 3
>>>>>>> # (code): 2
>>>>>>> # (tf): 5
>>>>>>> # (code): 2
>>>>>>> # (tf): 2
>>>>>>> # (code): 3
>>>>>>> # (code): 5
>>>>>>> # (code): 2
>>>>>>>
>>>>>>> And the incorrect, assertion-hit run shows:
>>>>>>>
>>>>>>> # (eof)
>>>>>>> # (code): 0
>>>>>>> # (code): 3
>>>>>>> # (code): 4
>>>>>>> # (code): 3
>>>>>>> # (code): 2
>>>>>>> # (code): 5
>>>>>>> # (code): 2
>>>>>>> # (code): 2
>>>>>>> # (code): 3
>>>>>>>
>>>>>>> So there's something odd happening after the first readvint which
>>>>>>> returns 0 -- the next number that should be read by:
>>>>>>>
>>>>>>> termFreq = freq.readVInt();
>>>>>>>
>>>>>>> is re-read as the next code.
>>>>>>>
>>>>>>> Dawid
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 20, 2013 at 11:11 AM, Dawid Weiss <dawid.weiss at gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Thanks for comments guys.
>>>>>>>>
>>>>>>>>> It is great that you can build fastdebug VM. I assume that if I give
>>>>>>>>> you a
>>>>>>>>> patch to try you can also build with it.
>>>>>>>>
>>>>>>>>
>>>>>>>> Yes, absolutely.
>>>>>>>>
>>>>>>>>> First, you can run with next options and then send zipped output
>>>>>>>>> (related
>>>>>>>>> part before the method compilation and optoassembler output, don't use
>>>>>>>>> hsdis
>>>>>>>>> for that) and hs_err file when test fails:
>>>>>>>>
>>>>>>>>
>>>>>>>> I did that. Used the options you requested -- a full log of all I did
>>>>>>>> is included in the ZIP file here:
>>>>>>>> http://www.cs.put.poznan.pl/dweiss/tmp/debug-files.zip
>>>>>>>>
>>>>>>>> Cat debug-files\debug-log.txt. There are several runs included in that
>>>>>>>> ZIP file, in short:
>>>>>>>>
>>>>>>>> - jdk18-fastdebug, b102, full run (no abort on assertion, explained in
>>>>>>>> debug-log.txt):
>>>>>>>>    001-no-abort-on-assertion
>>>>>>>>    002-no-abort-on-assertion
>>>>>>>>
>>>>>>>> - jdk18-fastdebug, b102, abort on assertion (includes mem dumps)
>>>>>>>>    003-with-abort-on-assertion
>>>>>>>>    004-with-abort-on-assertion
>>>>>>>>
>>>>>>>> - jdk18-fastdebug, hsx tip, abort on assertion (includes mem dumps)
>>>>>>>>    005-hsx-tip
>>>>>>>>    006-hsx-tip
>>>>>>>>
>>>>>>>> - jdk18-fastdebug, hsx tip, excluded readvint method inlining (passed
>>>>>>>> build, so only compilation logs available).
>>>>>>>>    007-excluded-readvint
>>>>>>>>
>>>>>>>>> Looking on java code in FreqProxTermsWriterPerField::flush() and based
>>>>>>>>> on
>>>>>>>>> LUCENE-5168 report generated code somehow missed EOF check. Am I
>>>>>>>>> right? This
>>>>>>>>> is why the Assert is thrown?
>>>>>>>>
>>>>>>>>
>>>>>>>> It's not the only method it trips on. Depending on the seed the
>>>>>>>> problem shows up in different places. For the dumps I included the
>>>>>>>> issue seems to be here:
>>>>>>>>
>>>>>>>>>          } else {
>>>>>>>>>            final int code = freq.readVInt();
>>>>>>>>>            if (!readTermFreq) {
>>>>>>>>>              docID += code;
>>>>>>>>>              termFreq = -1;
>>>>>>>>
>>>>>>>>
>>>>>>>> If I add sysouts as in:
>>>>>>>>
>>>>>>>> Index:
>>>>>>>> core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java
>>>>>>>> ===================================================================
>>>>>>>> ---
>>>>>>>> core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java
>>>>>>>>       (revision 1512807)
>>>>>>>> +++
>>>>>>>> core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java
>>>>>>>>       (working copy)
>>>>>>>> @@ -427,6 +427,7 @@
>>>>>>>>           //System.out.println("  cycle");
>>>>>>>>           final int termFreq;
>>>>>>>>           if (freq.eof()) {
>>>>>>>> +          System.err.print("# (eof)");
>>>>>>>>             if (postings.lastDocCodes[termID] != -1) {
>>>>>>>>               // Return last doc
>>>>>>>>               docID = postings.lastDocIDs[termID];
>>>>>>>> @@ -442,6 +443,7 @@
>>>>>>>>             }
>>>>>>>>           } else {
>>>>>>>>             final int code = freq.readVInt();
>>>>>>>> +          System.err.print("# (code): " + code + " ");
>>>>>>>>             if (!readTermFreq) {
>>>>>>>>               docID += code;
>>>>>>>>               termFreq = -1;
>>>>>>>>
>>>>>>>> then for a constant seed you'll get an identical sequence of values.
>>>>>>>> Once the assertion hits though, the sequence deviates. Comparing a
>>>>>>>> passing run (excluding readvint) and a failing run I get a lot of
>>>>>>>> initial aligned outputs and then a deviation:
>>>>>>>>
>>>>>>>> (correct run):
>>>>>>>> # (eof)
>>>>>>>> # (eof)
>>>>>>>> # (eof)
>>>>>>>> # (code): 0
>>>>>>>> # (code): 3
>>>>>>>> # (code): 4
>>>>>>>> # (code): 3
>>>>>>>>
>>>>>>>> (incorrect run):
>>>>>>>> # (eof)
>>>>>>>> # (eof)
>>>>>>>> # (eof)
>>>>>>>> # (code): 0
>>>>>>>> # (code): 4    <- wtf?
>>>>>>>> # (code): 2
>>>>>>>> # (code): 2
>>>>>>>>
>>>>>>>> How can a variable encoded integer be misinterpreted from 3 to 4 is
>>>>>>>> beyond me, sorry. But it's not an EOF condition I think, at least the
>>>>>>>> deviation happens where the original run had was entering (code)
>>>>>>>> branch.
>>>>>>>>
>>>>>>>>> Could you also send latest version of FreqProxTermsWriterPerField.java
>>>>>>>>> you
>>>>>>>>> are testing?
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm testing against a fixed branch (the info is included in
>>>>>>>> debug-log.txt).
>>>>>>>>
>>>>>>>> If there's anything else I can do, let me know. I'm also curious how
>>>>>>>> you approach debugging this thing. It may be time-based so I don't
>>>>>>>> think it'll reproduce on your machine out of the box, but who knows.
>>>>>>>> All I know is that, for example, if you add one more sysout before the
>>>>>>>> while loop it suddenly stops reproducing so probably something gets
>>>>>>>> compiled differently... Wild :)
>>>>>>>>
>>>>>>>> Dawid