RFR for JDK-8031374: java/util/concurrent/ConcurrentQueues/OfferRemoveLoops.java fails Intermittently

Chris Hegarty chris.hegarty at oracle.com
Mon Feb 24 09:05:05 UTC 2014


Tristan,

Can you please file a new bug and bring in the changes that Martin pushed to the 166 CVS.
  http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/test/jtreg/util/concurrent/ConcurrentQueues/OfferRemoveLoops.java?view=co

Create a changeset, and I can do the OpenJDK review.

-Chris.

On 24 Feb 2014, at 03:03, Tristan Yan <tristan.yan at oracle.com> wrote:

> I am ok with this unfix for now. Martin. And thank you for bringing the improvement to jsr166 CVS.
> 
> Regards
> 
> Tristan
> 
>  
> 
> 发件人: Martin Buchholz [mailto:martinrb at google.com] 
> 发送时间: Monday, February 24, 2014 10:59 AM
> 收件人: Tristan Yan
> 抄送: core-libs-dev
> 主题: Re: 答复: 答复: RFR for JDK-8031374: java/util/concurrent/ConcurrentQueues/OfferRemoveLoops.java fails Intermittently
> 
>  
> 
> Hi Tristan,
> 
>  
> 
> I don't think we should change the test without a sufficiently solid reason, which usually means reproducing the failure, even if e.g. only once in 1000 times.  Here we don't know which queue implementation is causing the failure, and we don't know what chunk size was being used, which might cause the failure to be reproduced more reliably.  A single test failure might be due to cosmic rays!  Let's leave unfixed this while it cannot be reproduced.  You can check in my general purpose improvements to the test from jsr166 CVS.
> 
>  
> 
> On Sun, Feb 23, 2014 at 6:41 PM, Tristan Yan <HYPERLINK "mailto:tristan.yan at oracle.com"tristan.yan at oracle.com> wrote:
> 
> Hi Martin
> 
> This test failed once in our nightly test. So this is a real case failure. But unfortunately we couldn’t know more except the log we had.
> 
> 10 extra seconds may need serve 1024 loop totally, also in windows Thread.yield() doesn’t give up CPU most of times, then we can may have the situation that remover is trying to remove object from a empty queue but it doesn’t find anything a couple of times(it doesn’t give up cpu time) then offer thread get cpu time. And offer thread insert a couple of elements and queue is full.  Offer thread tries to give up CPU but Thread.yield() doesn’t really give up. Let’s assume it takes 1024 loop again. And offer thread got timeout.  Then remover thread try to remove elements, and it takes another 1024 loop to get to timeout as well. So 10 seconds may need distribute to 2048 loop at most. Every loop has 5 ms foreach. That’s why I simulate this case with a Thread.sleep(20). I admit I can’t reproduce it without changing code, this is all theoretical analysis. But this is closest possible reason that I can deduce. 
> 
> Thank you.
> 
> Tristan
> 
>  
> 
> 发件人: Martin Buchholz [mailto:HYPERLINK "mailto:martinrb at google.com"martinrb at google.com] 
> 发送时间: Monday, February 24, 2014 10:16 AM
> 
> 
> 收件人: Tristan Yan
> 抄送: core-libs-dev
> 
> 主题: Re: 答复: RFR for JDK-8031374: java/util/concurrent/ConcurrentQueues/OfferRemoveLoops.java fails Intermittently
> 
>  
> 
> I may very well be missing something, but the actual extra timeout for threads to finish is 10 whole seconds, which should be long enough to process a single chunk, even on Windows.  
> 
>  
> 
> If you can repro this consistently, try to find out which queue implementation is affected.  
> 
>  
> 
> We can also shrink max queue size and chunk size to reduce time to traverse the queue elements.
> 
>  
> 
> On Sun, Feb 23, 2014 at 4:23 PM, Tristan Yan <HYPERLINK "mailto:tristan.yan at oracle.com"tristan.yan at oracle.com> wrote:
> 
> Thank you for reviewing this. Martin
> 
>  
> 
> The original bug report is here: https://bugs.openjdk.java.net/browse/JDK-8031374.
> 
>  
> 
> I haven’t reproduced this bug but I can simulate to reproduce this failure by changing  yield() in remover thread to Thread.sleep(20).
> 
> You have commented “You've completely broken the intent of this code, which was to poll ***occasionally*** for quitting time”. I tried to not changed the logic for the test. This failure comes when only 1st rounds(1024 loop) wasn’t finished in given quitting time(before timeout). Which was 300ms. One solution is increasing default timeout as you suggested. But question is how big should we increase. When the test is really slow and could not finish 1st round(1024 loop) before timeout, I couldn’t figure out what’s the reason timeout value. Also this definitely will slow down the tests when it run in fast machine. Which is the case most of time. So I took other step that skip wait if test doesn't finish 1st round(1024 loop) before timeout. I won’t say I completely broke the intent of the code here because it’s rare case.(Only if the machine is slow and slow enough that the test doesn't finish 1st round before timeout).
> 
>  
> 
> The reason that replace Thread.yield to Thread.sleep(0L) because David Holmes point out in the bug “sleep will certainly give up the CPU, whereas yield is not required to.” Also in other mail, David pointed I should use Thread.sleep(10L) instead of Thread.sleep(0L) preferably a bit longer as we don't know how the OS will behave if the sleep time requested is less than the natural resolution of the sleep mechanism. For the experiment I’ve done in Windows. Thread.yeild() or Thread.sleep(10L) won’t guarantee current thread give up the CPU. This is a hint to OS. This makes in windows remover and offer thread could wait to each other more more than other other operating system. This is also the one of the reason that can explain why we've seen this in windows only.
> 
>  
> 
> Regards
> 
> Tristan
> 
>  
> 
>  
> 
>  
> 
> 发件人: Martin Buchholz [mailto:HYPERLINK "mailto:martinrb at google.com"martinrb at google.com] 
> 发送时间: Monday, February 24, 2014 3:47 AM
> 收件人: Tristan Yan
> 抄送: core-libs-dev
> 主题: Re: RFR for JDK-8031374: java/util/concurrent/ConcurrentQueues/OfferRemoveLoops.java fails Intermittently
> 
>  
> 
> Hi Tristan,
> 
>  
> 
> (thanks for working on my old crappy tests, and apologies for always giving you a hard time)
> 
>  
> 
> I couldn't find the bug report.  Can you provide a URL?
> 
>  
> 
> Thread.stop is only called in case the test has already failed (presumed "hung"), as a last resort.
> 
>  
> 
> Perhaps the timeout used in the test (300 ms) is too small on some systems?
> 
>  
> 
> +            protected void giveupCPU(){
> +                try {
> +                    Thread.sleep(0L);
> +                } catch (InterruptedException ignore) {}
>              }
> 
>  
> 
> I know of no reason why Thread.sleep(0) should be any more effective than Thread.yield.  If it was more effective, then why wouldn't Thread.yield on that platform be fixed to be implemented exactly the same way?  IOW if this is a problem, fix the JDK!
> 
>  
> 
>              /** Polls occasionally for quitting time. */
>              protected boolean quittingTime(long i) {
> -                return (i % 1024) == 0 && quittingTime();
> +                return stopRequest || quittingTime() && (i % 1024 == 0 || i < 1024);
> +            }
> 
>  
> 
> You've completely broken the intent of this code, which was to poll ***occasionally*** for quitting time.
> 
>  
> 
>  
> 
> On Sun, Feb 23, 2014 at 1:40 AM, Tristan Yan <HYPERLINK "mailto:tristan.yan at oracle.com"tristan.yan at oracle.com> wrote:
> 
> Hi All
> Could you please help to review fix for JDK-803174.
> 
> http://cr.openjdk.java.net/~tyan/JDK-8031374/webrev.00/
> 
> Description:
> There are a couple of code change for this code.
> 1. Original code has used known problematic Thread.stop. Which could cause ThreadDeath as all we know. I change it to a customized stopThread method.
> 2. Test failed in windows, I analyze the failure by changing Thread.yield() in remover thread to Thread.sleep(50). This is a simulation for slow machine. Which shows exact same failures as we can see in bug description. By adding more debug info, we can see although we tried to give up CPU by using Thread.yield().(I have tried to use Thread.sleep(1L) as well, the result is same), there is no guarantee that os will pick up a new thread to execute. In Windows, this is more obvious. Because the execution is slow. even when the timeout happens, offer thread and remove thread won’t have chance to get the point that i % 1024 == 0. This will cause the failure like we see in the log. My fix here is when the timeout happens, but i is still less than 1024. Stop offer thread and remover thread right away instead letting them continuously wait the point to i == 1024.
> 3. I replace Thread.yield to Thread.sleep(0L). I saw a couple of discussion that Thread.yield is not required to give up CPU.
> 
> Thank you
> Tristan 
> 
>  
> 
>  
> 
>  




More information about the core-libs-dev mailing list