Project Loom VirtualThreads hang

Thu Dec 29 22:23:36 UTC 2022

I realized I had left the list off my replies, which caused some of the discussion with Alex to be dropped.

——
I think the check + exception was not just an assertion. It could actually happen, if Producer B offers N values, and Producer A resumes before consumer makes any progress. This is probably not the case after adding the check.

Alex
——

Thanks for the very thoughtful analysis.

Luckily, I am pretty certain the out-of-order case is a trivial fix as well: simply ensure that “next tail” != head when checking whether there is space in the buffer. The consumer cannot race ahead because it must read a non-null value. I posted the updated code. If Producer A never gets rescheduled the system will hang, but that would be considered a critical failure (which is what prompted the original inquiry).
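
For anyone following along without the project source, here is a rough sketch of what the “next tail” != head check amounts to. This is not the posted code: the class name RingBufferSketch, the split of offer() into claim() and publish(), and the exception type are all made up for illustration, and it assumes a single consumer. The split exists only so that a producer stalling between "increment tail" and "store value" can be replayed deterministically.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical sketch (not the posted code): multi-producer, single-consumer ring buffer.
final class RingBufferSketch<T> {
    private final AtomicReferenceArray<T> slots;
    private final AtomicInteger tail = new AtomicInteger(); // next slot a producer may claim
    private volatile int head;                              // next slot the consumer reads

    public RingBufferSketch(int size) {
        slots = new AtomicReferenceArray<>(size);
    }

    private int next(int i) {
        return (i + 1) % slots.length();
    }

    // Step 1 of offer: claim a slot, or return -1 if the buffer counts as full.
    public int claim() {
        while (true) {
            int t = tail.get();
            if (next(t) == head) {
                return -1; // the added check: tail may never wrap around onto head
            }
            if (tail.compareAndSet(t, next(t))) {
                return t;
            }
        }
    }

    // Step 2 of offer: store the value into the slot claimed earlier.
    public void publish(int slot, T value) {
        if (slots.get(slot) != null) {
            throw new IllegalStateException("slot " + slot + " is still occupied");
        }
        slots.set(slot, value);
    }

    public boolean offer(T value) {
        int slot = claim();
        if (slot < 0) {
            return false;
        }
        publish(slot, value);
        return true;
    }

    // Single consumer: it can only advance past a slot that holds a non-null value.
    public T poll() {
        int h = head;
        T value = slots.get(h);
        if (value == null) {
            return null;
        }
        slots.set(h, null);
        head = next(h);
        return value;
    }

    public static void main(String[] args) {
        RingBufferSketch<String> rb = new RingBufferSketch<>(4);
        int a = rb.claim();                 // Producer A claims slot 0, then stalls
        System.out.println(rb.offer("B1")); // true
        System.out.println(rb.offer("B2")); // true
        System.out.println(rb.offer("B3")); // false: next tail would equal head
        System.out.println(rb.poll());      // null: consumer waits on A's slot
        rb.publish(a, "A");                 // Producer A resumes
        System.out.println(rb.poll() + " " + rb.poll() + " " + rb.poll()); // A B1 B2
    }
}
```

In this sketch one slot always stays empty, so a buffer of size N holds at most N-1 values. That is the price of keeping Producer B from wrapping around into the slot Producer A has claimed but not yet filled.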

To your other points:

1. Yes, it requires careful reasoning to get right - oops.
2. This is why next() is used - to avoid problems with integer math.
3. The condition check is really an assert. All else being correct, it should never occur.
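
To illustrate point 2 with a small example: the sketch below contrasts an index that is advanced with a next()-style helper against one derived from an ever-growing int counter. The names NextIndexSketch, next and fromCounter are made up here, and the body of next() is only my shorthand for the idea, not necessarily the exact code in the project.

```java
// Hypothetical illustration of why a wrapped index avoids integer-overflow surprises.
final class NextIndexSketch {
    // Advance an index that always stays within [0, size).
    public static int next(int i, int size) {
        return (i + 1) % size;
    }

    // Alternative: derive the index from a counter that only ever increments.
    public static int fromCounter(int counter, int size) {
        return counter % size;
    }

    public static void main(String[] args) {
        int size = 10;
        System.out.println(next(9, size));              // 0: wraps inside the array bounds

        int counter = Integer.MAX_VALUE;
        System.out.println(fromCounter(counter, size)); // 7
        counter++;                                      // overflows to Integer.MIN_VALUE
        System.out.println(fromCounter(counter, size)); // -8: not a valid array index
    }
}
```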

Luckily, as you point out, the bug in the ring buffer had nothing to do with the problems being reported: 1) that an “unparked” vthread never runs, and 2) that Thread.yield() does not behave as expected with vthreads. Happily, both seem to be addressed in JDK 20.

——
But doing those tracking steps myself, it looks like I rushed to conclusions. We only observe elements issued by producer B in the wrong order, but that's only a problem if it is meant to be a FIFO queue.

Alex
——
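
The two observations from Alex quoted above (the check firing for real, and Producer B's values arriving out of order) can be replayed single-threaded. The sketch below is a guess at the pre-fix behavior, not the original RingBuffer: it assumes the space check only looked at whether the cell at tail was null, and every name in it (UncheckedRingSketch, claim, publish, the two static helpers) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical pre-fix variant: "is there space?" only asks whether the cell at tail is null.
final class UncheckedRingSketch {
    private final AtomicReferenceArray<String> slots;
    private final AtomicInteger tail = new AtomicInteger();
    private int head; // only touched by the single-threaded replay below

    public UncheckedRingSketch(int size) {
        slots = new AtomicReferenceArray<>(size);
    }

    private int next(int i) {
        return (i + 1) % slots.length();
    }

    // Claim the slot at tail if its cell is empty; head is never consulted.
    public int claim() {
        while (true) {
            int t = tail.get();
            if (slots.get(t) != null) {
                return -1;
            }
            if (tail.compareAndSet(t, next(t))) {
                return t;
            }
        }
    }

    public void publish(int slot, String value) {
        if (slots.get(slot) != null) {
            throw new IllegalStateException("slot " + slot + " is still occupied");
        }
        slots.set(slot, value);
    }

    public boolean offer(String value) {
        int slot = claim();
        if (slot < 0) {
            return false;
        }
        publish(slot, value);
        return true;
    }

    public String poll() {
        String value = slots.get(head);
        if (value == null) {
            return null;
        }
        slots.set(head, null);
        head = next(head);
        return value;
    }

    // Producer A claims a slot and stalls; Producer B fills the buffer; the consumer drains.
    public static List<String> drainOrderWhileAStalls(int size) {
        UncheckedRingSketch rb = new UncheckedRingSketch(size);
        rb.claim(); // Producer A: tail incremented, value not yet stored
        for (int i = 1; rb.offer("B" + i); i++) { }
        List<String> seen = new ArrayList<>();
        for (String v = rb.poll(); v != null; v = rb.poll()) {
            seen.add(v);
        }
        return seen;
    }

    // Same start, but Producer A resumes before the consumer makes any progress.
    public static boolean checkFiresIfAResumesFirst(int size) {
        UncheckedRingSketch rb = new UncheckedRingSketch(size);
        int a = rb.claim();
        for (int i = 1; rb.offer("B" + i); i++) { }
        try {
            rb.publish(a, "A");
            return false;
        } catch (IllegalStateException expected) {
            return true; // B wrapped around and filled the slot A had claimed
        }
    }

    public static void main(String[] args) {
        System.out.println(drainOrderWhileAStalls(4));    // [B4, B1, B2, B3]
        System.out.println(checkFiresIfAResumesFirst(4)); // true
    }
}
```

Under these assumptions Producer B's last value lands in the slot Producer A had claimed, so the consumer sees it first. If Producer A resumes before the consumer drains that slot, the "assert" fires for real.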

> On Dec 28, 2022, at 12:10 PM, Alex Otenko <oleksandr.otenko at gmail.com> wrote:
> 
> Hi Robert, 
> 
> Since you have a reproducer that doesn't use RingBuffer, this is a little unnecessary, but I can explain a bit.
> 
> 1. When you see multiple atomic operations, you wonder if things can happen out of order. Who can mutate tail counter and who can mutate array value, and what precludes more mutations of tail and array values? If you can't prove constructively that it can't happen, you need to assume that it may happen. 
> 
> 2. When you see integer arithmetic, you wonder what happens upon wraparound, and when does that happen?
> 
> 3. When you see a condition check (and an indicator that it is an error - an exception is thrown), you wonder what may be the flip side of that condition (may it go undetected?)
> 
> Those are red flags that drive investigation. Even without a proof of how things go wrong, I would raise questions about its safety.
> 
> However, in this case we can even show how to deadlock.
> 
> Producer A and B, RingBuffer of size N.
> 
> Consumer emptied the buffer, so head == tail.
> 
> Producer A ready to offer, increments tail, and gets suspended before it updates array at position tail. (E.g. interrupt preempts the thread)
> 
> Producer B offers N values. Observe that tail wraparound occurs.
> 
> Consumer consumes all N values. Now array cell at tail is null again.
> 
> Producer B offers N-2 values. Now array cell at tail is not null again. 
> 
> Consumer consumes at least 1 value. Now array cell at tail is null.
> 
> Producer A resumes, stores the value, because the correctness condition is met. (the flip side of exception throw - the inconsistency went unobserved)
> 
> Done. Now Consumer will not be able to access array cell with the value Producer A has just stored, and will wait indefinitely. You may need to track head to see why that is the case.
> 
> 
> Alex
> 
> On Wed, 28 Dec 2022, 12:50 Robert Engels, <rengels at ix.netcom.com> wrote:
> 
> Alex,
> 
> You write:
> 
> > I won't go into detail, but the producers not synchronizing between themselves leads to hangs.
> 
> 
> Can you explain? I'm fairly certain the producers do valid synchronization. As I said, using native threads the code runs to completion fine.
> 
> Now that I understand the issue, I was able to reproduce it very simply without any queues. See SpinTest posted to the project. The “starver” thread fails to make any progress - this should not be possible if Thread.yield() were fair.
> 
> > On Dec 27, 2022, at 11:15 PM, Alex Otenko <oleksandr.otenko at gmail.com> wrote:
> > I won't go into detail, but the producers not synchronizing between themselves leads to hangs.
