Project Loom VirtualThreads hang

Robert Engels rengels at ix.netcom.com
Tue Jan 3 14:03:51 UTC 2023


Hi Ron. 

I think there is another traditional use case for “virtual threads”: increasing efficiency for short-cycle tasks. 

You can write a simple blocking queue and a message-passing system - with as few as 32 threads you will see > 50% system CPU time due to thrashing in the OS scheduler. 

A virtual thread should be far more efficient with respect to user-space locking/blocking/wake-up - my tests show only a few % system CPU time. To do this well, though, the vthread scheduler needs to be super efficient, and the locking constructs need to be rewritten with the knowledge that they are running on green threads, because the spin/block trade-offs are very different (e.g. upwards of 16 microseconds to perform an OS switch to another thread on the CPU vs. a few hundred nanoseconds for a green-thread switch). 
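
To make the trade-off concrete, here is a minimal sketch of the kind of spin-then-park handoff I mean (single producer, single consumer; SPINS and all names are invented for illustration, not taken from any real code):

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.concurrent.locks.LockSupport;

    // Single-slot handoff: the consumer spins briefly before parking, on the
    // theory that a green-thread switch costs a few hundred nanoseconds, so a
    // short spin usually wins. SPINS is a hypothetical tuning knob.
    final class Handoff<T> {
        private static final int SPINS = 64; // illustrative value only
        private final AtomicReference<T> slot = new AtomicReference<>();
        private volatile Thread waiter;

        void put(T value) {
            while (!slot.compareAndSet(null, value))
                Thread.onSpinWait();                 // slot still full
            Thread w = waiter;
            if (w != null)
                LockSupport.unpark(w);               // wake a parked consumer
        }

        T take() {
            for (int i = 0; i < SPINS; i++) {        // spin phase
                T v = slot.getAndSet(null);
                if (v != null)
                    return v;
                Thread.onSpinWait();
            }
            waiter = Thread.currentThread();         // park phase
            try {
                T v;
                while ((v = slot.getAndSet(null)) == null)
                    LockSupport.park();
                return v;
            } finally {
                waiter = null;
            }
        }
    }

On a virtual-thread scheduler, SPINS would be tuned much lower than for OS threads, precisely because the park/unpark round trip is so much cheaper.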

> On Jan 3, 2023, at 3:51 AM, Ron Pressler <ron.pressler at oracle.com> wrote:
> 
>  Hi.
> 
> Let me first address some of the topics that have come up in this thread before I get to the main point.
> 
> The meaning of Thread.yield is “inform the runtime that this thread doesn’t have anything useful to do at the moment, but it might again in the future”. What the runtime does with that information is up to the runtime. As Alan said, in JDK 20 we’ve changed the implementation so the scheduler looks more aggressively for other work to do, but Thread.yield shouldn’t be used for scheduling control. It should be used (and very sparingly) to say “I temporarily have nothing useful to do.”
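> 
> For illustration, the (rare) appropriate shape is roughly the sketch below - the names are invented for this example, and a real server would usually just block on the queue instead:
> 
>     import java.util.Queue;
>     import java.util.concurrent.ConcurrentLinkedQueue;
> 
>     // Hypothetical drain loop that yields only when it momentarily has
>     // nothing to do, leaving the scheduling decision to the runtime.
>     final class Drainer {
>         private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
>         private volatile boolean running = true;
> 
>         void drainLoop() {
>             while (running) {
>                 Runnable task = queue.poll();    // non-blocking
>                 if (task == null) {
>                     Thread.yield();  // “I temporarily have nothing useful to do”
>                 } else {
>                     task.run();
>                 }
>             }
>         }
> 
>         void stop() { running = false; }
>     }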
> 
> Virtual threads are scheduled preemptively, not cooperatively. This means that the runtime decides when to deschedule (preempt) one thread and schedule another, without cooperation from user code. However, the virtual thread scheduler currently does not employ time-sharing, i.e. it does not decide to preempt a thread because it has exceeded some allotted time-slice quota on the CPU. The reason is that we haven’t yet identified a use case where time-sharing helps the workloads virtual threads address (although we’re very interested to hear about such use cases if anyone comes across one). Virtual threads are mostly intended for writing servers; non-realtime kernels primarily employ time-sharing when the CPU is at 100%, but servers don’t usually run at 100% CPU, and when they do, people aren’t generally happy with the result. So servers don’t rely on time-sharing even without virtual threads (at least not the kind that requires special support), but if we identify a use case where time-sharing could help server workloads, we can consider adding it to the preemption considerations.
> 
> Now for the central point. Virtual threads have scalability benefits for a single reason: their high number. This is due to Little’s law. Replacing a set of platform threads with virtual threads — as the JEP explains — should not yield any significant benefits. The benefits come from employing a very high number of virtual threads. For example, the Helidon framework creates 3 million new virtual threads every second under high load. As a rule of thumb, if your application doesn’t create a few thousand or so brand-new virtual threads every second, then you’re not using them in the manner for which they were intended and will not see substantial benefits. Virtual threads replace short individual *tasks* in your application, not platform threads. They are best thought of as business-logic entities representing tasks rather than as “execution resources.” For those who are used to managing threads as resources, this is a big change in how they think about threads, and it will take some getting used to.
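> 
> To make the arithmetic concrete (the 10 ms average task duration below is an illustrative assumption, not a measured figure): Little’s law says L = λ × W, where L is the average number of tasks in flight, λ is the task arrival rate, and W is the average task duration. For the Helidon example:
> 
>     L = λ × W = 3,000,000 tasks/s × 0.010 s = 30,000 concurrent threads
> 
> Tens of thousands of concurrent threads are only practical when threads are cheap, which is the sense in which the benefit comes from their high number.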
> 
> — Ron
> 
>> On 27 Dec 2022, at 22:27, robert engels <rengels at ix.netcom.com> wrote:
>> 
>> Hi devs,
>> 
>> First, 
>> 
>> Thanks for this amazing work!!! It literally solves the only remaining problem Java had.
>> 
>> Sorry for the long email.
>> 
>> I have been very excited to test-drive Project Loom in JDK19. I have extensive experience in highly concurrent systems/HFT/HPC, so I usually :) know what I am doing.
>> 
>> For the easiest test, I took a highly threaded (connection-based) server system (a Java port of Go’s nats.io message broker) and converted the threads to virtual threads. The project (jnatsd) is available here. The ‘master’ branch runs very well with excellent performance, but I thought switching to virtual threads might improve things over using async IO, channels, etc. (I have a branch for this that works as well, but it is much more complex and didn’t provide a huge performance benefit).
>> 
>> There are two branches ’simple_virtual_threads’ and ‘virtual_threads’.
>> 
>> In the former, it is literally a 2-line change to enable virtual threads, but it doesn’t work. I narrowed it down to LockSupport.unpark(thread) not working consistently: at some point, the virtual thread is never scheduled again. I enabled the debug options, and I see that the virtual thread is in:
>> 
>> yield0:365, Continuation (jdk.internal.vm)
>> yield:357, Continuation (jdk.internal.vm)
>> yieldContinuation:370, VirtualThread (java.lang)
>> park:499, VirtualThread (java.lang)
>> parkVirtualThread:2606, System$2 (java.lang)
>> park:54, VirtualThreads (jdk.internal.misc)
>> park:369, LockSupport (java.util.concurrent.locks)
>> run:88, Connection$ConnectionWriter (com.robaho.jnatsd)
>> run:287, VirtualThread (java.lang)
>> lambda$new$0:174, VirtualThread$VThreadContinuation (java.lang)
>> run:-1, VirtualThread$VThreadContinuation$$Lambda$50/0x0000000801065670 (java.lang)
>> enter0:327, Continuation (jdk.internal.vm)
>> enter:320, Continuation (jdk.internal.vm)
>> The instance state is:
>> 
>> this = {VirtualThread$VThreadContinuation at 1775} 
>>  target = {VirtualThread$VThreadContinuation$lambda at 1777} 
>>   arg$1 = {VirtualThread at 1699}
>>    scheduler = {ForkJoinPool at 1781} 
>>    cont = {VirtualThread$VThreadContinuation at 1775} 
>>    runContinuation = {VirtualThread$lambda at 1782} 
>>    state = 2
>>    parkPermit = true
>>    carrierThread = null
>>    termination = null
>>    eetop = 0
>>    tid = 76
>>    name = ""
>>    interrupted = false
>>    contextClassLoader = {ClassLoaders$AppClassLoader at 1784} 
>>    inheritedAccessControlContext = {AccessControlContext at 1785} 
>>    holder = null
>>    threadLocals = null
>>    inheritableThreadLocals = null
>>    extentLocalBindings = null
>>    interruptLock = {Object at 1786} 
>>    parkBlocker = null
>>    nioBlocker = null
>>    Thread.cont = null
>>    uncaughtExceptionHandler = null
>>    threadLocalRandomSeed = 0
>>    threadLocalRandomProbe = 0
>>    threadLocalRandomSecondarySeed = 0
>>    container = {ThreadContainers$RootContainer$CountingRootContainer at 1787} 
>>    headStackableScopes = null
>>   arg$2 = {Connection$ConnectionWriter at 1780} 
>>  scope = {ContinuationScope at 1776} 
>>  parent = null
>>  child = null
>>  tail = {StackChunk at 1778} 
>>  done = false
>>  mounted = false
>>  yieldInfo = null
>>  preempted = false
>>  extentLocalCache = null
>> scope = {ContinuationScope at 1776} 
>> child = null
>> 
>> As you can see above, parkPermit is true, but the thread never runs again.
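>> 
>> For context, the ConnectionWriter pattern implied by the stack trace is roughly the classic park/unpark handoff below (a reconstruction for illustration only, not the actual jnatsd code):
>> 
>>     import java.util.concurrent.ConcurrentLinkedQueue;
>>     import java.util.concurrent.locks.LockSupport;
>> 
>>     // The writer parks when its queue is empty; producers unpark it after
>>     // enqueuing. The symptom reported above is that the virtual thread
>>     // sometimes stays parked even though parkPermit == true.
>>     final class ConnectionWriterSketch implements Runnable {
>>         private final ConcurrentLinkedQueue<byte[]> pending = new ConcurrentLinkedQueue<>();
>>         private volatile Thread writer;
>> 
>>         public void run() {
>>             writer = Thread.currentThread();
>>             while (!Thread.currentThread().isInterrupted()) {
>>                 byte[] msg = pending.poll();
>>                 if (msg == null) {
>>                     LockSupport.park();          // the hang point in the trace
>>                 } else {
>>                     write(msg);
>>                 }
>>             }
>>         }
>> 
>>         void enqueue(byte[] msg) {
>>             pending.offer(msg);
>>             Thread t = writer;
>>             if (t != null)
>>                 LockSupport.unpark(t);           // should set parkPermit
>>         }
>> 
>>         private void write(byte[] msg) { /* socket write elided */ }
>>     }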
>> 
>> In the latter branch, ‘virtual_threads’, I changed the lock-free RingBuffer class to use simple synchronized primitives, under the assumption that with virtual threads lock/wait/notify should be highly efficient. It worked, but it was nearly 2x slower than the original thread-based lock-free implementation. So I added a ’spin loop’ to the RingBuffer methods. This code is completely optional and can be no-op’d, and with it I was able to raise performance above that of the thread-based version.
>> 
>> I dug a little deeper and decided that using Thread.yield() should be even more efficient than LockSupport.parkNanos(1) - the problem is that changing that one line brings back the hangs. I think there is very little semantic difference between LockSupport.parkNanos(1) and Thread.yield(), but the latter should avoid any timer scheduling. The RingBuffer code there is fairly trivial.
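>> 
>> For reference, the spin loop in question has roughly the following shape (a sketch with invented names, not the actual RingBuffer code; single producer, single consumer, power-of-two capacity):
>> 
>>     import java.util.concurrent.atomic.AtomicReferenceArray;
>>     import java.util.concurrent.locks.LockSupport;
>> 
>>     final class SpscRingBuffer<T> {
>>         private static final int SPINS = 256;    // illustrative tuning knob
>>         private final AtomicReferenceArray<T> buf;
>>         private final int mask;
>>         private long head;                       // consumer-only index
>>         private long tail;                       // producer-only index
>> 
>>         SpscRingBuffer(int capacityPowerOfTwo) {
>>             buf = new AtomicReferenceArray<>(capacityPowerOfTwo);
>>             mask = capacityPowerOfTwo - 1;
>>         }
>> 
>>         boolean offer(T v) {                     // producer side
>>             int i = (int) (tail & mask);
>>             if (buf.get(i) != null)
>>                 return false;                    // full
>>             buf.set(i, v);                       // volatile write publishes v
>>             tail++;
>>             return true;
>>         }
>> 
>>         T poll() {                               // consumer, non-blocking
>>             int i = (int) (head & mask);
>>             T v = buf.get(i);
>>             if (v == null)
>>                 return null;                     // empty
>>             buf.set(i, null);
>>             head++;
>>             return v;
>>         }
>> 
>>         T take() {                               // consumer, blocking
>>             T v;
>>             int spins = 0;
>>             while ((v = poll()) == null) {
>>                 if (++spins < SPINS)
>>                     Thread.onSpinWait();         // optional spin phase
>>                 else
>>                     LockSupport.parkNanos(1);    // swapping this one call for
>>                                                  // Thread.yield() reproduces
>>                                                  // the hang
>>             }
>>             return v;
>>         }
>>     }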
>> 
>> So, before I dig deeper: is it a known issue that Thread.yield() does not work as expected? Is it a known issue that LockSupport.unpark() fails to reschedule threads?
>> 
>> Is it possible that the virtual threads do not implement the Java memory model properly?
>> 
>> Any ideas how to further diagnose?
>> 