RFR for JDK-8030284 TEST_BUG: intermittent StackOverflow in RMI bench/serial test

Sat Dec 21 00:01:55 UTC 2013

On 12/19/13 8:29 PM, David Holmes wrote:
> If you were always one frame from the end then it is not so surprising that a
> simple change pushes you past the limit :) Try running the shell test with
> additional recursive loads and see when it fails.

David doesn't seem surprised, but I guess I still am. :-)

Tristan, do you think you could do some investigation here, regarding the shell 
script based test's stack consumption? Run the shell-based test with some 
different values for -Xss and see how low you have to set it before it generates 
a stack overflow.
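
As a rough illustration of the kind of probing I mean, here is a minimal sketch. It does not run the bench/serial test itself; it only measures how deep a trivial recursion gets in threads created with different requested stack sizes. The class and method names (StackSizeProbe, measureDepth, descend) are made up for this example, and the stackSize argument to the Thread constructor is only a hint that the platform may round or ignore.

```java
public class StackSizeProbe {
    // Hypothetical helper: recurse forever, counting frames in depth[0].
    static void descend(int[] depth) {
        depth[0]++;
        descend(depth);
    }

    // Run the recursion in a thread with the requested stack size (in bytes)
    // and return how many frames were entered before StackOverflowError.
    static int measureDepth(long stackSizeBytes) {
        final int[] depth = {0};
        Thread t = new Thread(null, () -> {
            try {
                descend(depth);
            } catch (StackOverflowError expected) {
                // Expected: the count so far is the answer.
            }
        }, "stack-probe", stackSizeBytes);
        t.start();
        try {
            t.join(); // join() makes the final count visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return depth[0];
    }

    public static void main(String[] args) {
        long[] sizesKb = {128, 256, 512, 1024};
        for (long kb : sizesKb) {
            System.out.printf("%d KB requested -> depth %d%n",
                              kb, measureDepth(kb * 1024));
        }
    }
}
```

The absolute numbers mean little, since classloading frames are much larger than these. The point is only to see how the depth scales with the requested size on a given machine.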

>> It's also kind of strange that in the two stack traces I've seen (I
>> think I managed to capture only one in the bug report though) the
>> StackOverflowError occurs on loading exactly the 50th class. Since we're
>> observing intermittent behavior (happens sometimes but not others) the
>> stack size is apparently variable. Since it's variable I'd expect to see
>> it failing at different times, possibly the 49th or 48th recursive
>> classload, not just the 50th. And in such circumstances, do we know what
>> the default stack size is?
>
> Classloading consumes a reasonable chunk of stack so if the variance elsewhere
> is quite small it is not that surprising that the test always fails on the 50th
> class. I would not expect run-to-run stack usage variance to be high unless
> there is some random component to the test.

Hm. There should be no variance in stack usage coming from the test itself. I 
believe the test does the same thing every time.

The thing I'm concerned about is whether the Java-based test is doing something 
different from the shell-based test, because of the execution environment (jtreg 
or other). We may end up simply raising the stack limit anyway, but I still find 
it hard to believe that the shell-based test was consistently just a few frames 
shy of a stack overflow.

The failure is intermittent; we've seen it twice in JPRT (our internal 
build-and-test system). One possible source of the intermittency is the 
different machines on which the test executes, so environmental factors could 
be at play. 
How does the JVM determine the default stack size? Could different test runs on 
different machines be running with different stack sizes?
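
One way to check this on each machine might be to ask the running JVM what it thinks its ThreadStackSize is. The sketch below assumes a HotSpot JVM where the com.sun.management diagnostic bean is available. The class name DefaultStackSize is made up, and my understanding (to be verified) is that the value is in KB and that 0 means "use the OS default".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

public class DefaultStackSize {
    // Report the ThreadStackSize VM option and where its value came from
    // (e.g. DEFAULT vs. VM_CREATION when -Xss was given on the command line).
    static String describe() {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return "ThreadStackSize unavailable (no HotSpot diagnostic bean)";
        }
        try {
            VMOption opt = bean.getVMOption("ThreadStackSize");
            return "ThreadStackSize=" + opt.getValue()
                 + " (origin: " + opt.getOrigin() + ")";
        } catch (IllegalArgumentException e) {
            // Thrown if this VM does not have an option by that name.
            return "ThreadStackSize unavailable (" + e.getMessage() + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(System.getProperty("os.name") + " / "
                         + System.getProperty("os.arch") + ": " + describe());
    }
}
```

Running this on the JPRT machines where the failure occurred and on one where it passes would at least tell us whether the configured sizes differ.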

Another source of variance is the JIT. I believe JIT-compiled code consumes 
stack differently from interpreted code. At least, I've seen differences in 
stack usage between -Xint and -Xcomp runs, and in the absence of these options 
(which means -Xmixed, I guess) the results sometimes vary unpredictably. I guess 
this might have to do with when the JIT compiler decides to kick in.

This test does perform a bunch of iterations, so JIT compilation could be a factor.
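
To see whether this effect shows up at all, something like the following sketch could be run under -Xint, -Xcomp, and with neither option. It repeats the same recursion several times in one JVM and reports the smallest and largest depth reached. The names (DepthVariance, measureOnce, minMax) are made up for illustration, and this only approximates what the real test does.

```java
public class DepthVariance {
    static int depth;

    // Trivial recursion; each call adds one frame.
    static void descend() {
        depth++;
        descend();
    }

    // Recurse until StackOverflowError and return the depth reached.
    static int measureOnce() {
        depth = 0;
        try {
            descend();
        } catch (StackOverflowError expected) {
            // Expected: depth now holds the number of frames entered.
        }
        return depth;
    }

    // Repeat the measurement and return {min, max} over all runs.
    static int[] minMax(int runs) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < runs; i++) {
            int d = measureOnce();
            min = Math.min(min, d);
            max = Math.max(max, d);
        }
        return new int[] {min, max};
    }

    public static void main(String[] args) {
        int[] r = minMax(20);
        System.out.println("min depth = " + r[0] + ", max depth = " + r[1]);
    }
}
```

If min and max are far apart in the default mode but close together under -Xint, that would point toward the JIT as a source of the run-to-run variance.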

>> I don't know if you were able to reproduce this issue. If you were, it
>> would be good to understand in more detail exactly what's going on.
>
> FWIW there was a recent change in 7u to bump up the number of stack shadow pages
> in hotspot as "suddenly" StackOverflow tests were crashing instead of triggering
> StackOverflowError. So something started using more stack in a way that caused
> there to not be enough space to process a stackoverflow properly. Finding the
> exact cause can be somewhat tedious.

This seems like a different problem. We're seeing actual StackOverflowErrors, 
not crashes. Good to look out for this, though.

s'marks

>
> Cheers,
> David
>
>> s'marks